Many companies invest heavily in hiring talent to create the high-performance library code that underpins modern artificial intelligence systems. NVIDIA, for instance, has developed some of the most advanced high-performance computing (HPC) libraries, creating a competitive moat that has proven difficult for others to breach.
But what if a pair of students, in just a few months, could rival cutting-edge HPC libraries with merely a few hundred lines of code, instead of tens or hundreds of thousands?
This is precisely what researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have demonstrated with a novel programming language known as Exo 2.
Exo 2 belongs to a new category of programming languages that MIT Professor Jonathan Ragan-Kelley calls “user-schedulable languages” (USLs). Rather than relying on an opaque compiler to automatically generate the fastest code, USLs put programmers in control by letting them write “schedules” that explicitly dictate how the compiler produces code. This allows performance engineers to transform straightforward programs that specify what they want to compute into complex programs that accomplish the same thing, only significantly faster.
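To illustrate the idea in plain Python (this is a conceptual sketch, not Exo 2 syntax): a schedule rewrites a simple specification of a computation, such as a matrix multiply, into an equivalent but faster form, here by tiling the loops for cache locality.

```python
# Conceptual sketch only: a "schedule" turns the simple program below into
# the tiled version, without changing what is computed.

def matmul_spec(A, B, n):
    """The straightforward program: states WHAT to compute."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_scheduled(A, B, n, tile=4):
    """The same computation after a tiling "schedule": loops are split into
    blocks to improve cache reuse, but the result is unchanged."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a = A[i][k]  # hoisted load, reused across the j block
                        for j in range(jj, min(jj + tile, n)):
                            C[i][j] += a * B[k][j]
    return C
```

The point of a USL is that the engineer writes only the short specification plus a schedule; the compiler performs the mechanical rewriting and guarantees the two versions remain equivalent.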
One limitation of current USLs (like the original Exo) is their relatively fixed set of scheduling operations, which makes it difficult to reuse scheduling code across different “kernels” (the individual components in a high-performance library).
In contrast, Exo 2 permits users to establish new scheduling operations outside the compiler, promoting the development of reusable scheduling libraries. Lead author Yuka Ikarashi, a PhD candidate at MIT in electrical engineering and computer science and a CSAIL affiliate, asserts that Exo 2 can diminish total scheduling code by a factor of 100 while providing performance comparable to state-of-the-art implementations across various platforms, including Basic Linear Algebra Subprograms (BLAS) that drive many machine learning applications. This makes it a compelling choice for HPC engineers focused on optimizing kernels across diverse operations, data types, and target architectures.
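As a rough sketch of what "defining scheduling operations outside the compiler" means (plain Python over a toy loop-nest representation, not the Exo 2 API), a compiler might expose a primitive rewrite like loop splitting, and a user could then compose primitives into a new, reusable operation such as tiling:

```python
# Toy loop-nest IR: a loop is {'loop': var, 'extent': n, 'body': ...};
# a leaf is a statement string. Names and structure are illustrative only.

def split_loop(node, var, factor):
    """Primitive rewrite (the kind a compiler provides): split
    `for var in range(extent)` into an outer loop over blocks and an
    inner loop within a block (extent assumed divisible by factor)."""
    if isinstance(node, dict) and node.get('loop') == var:
        assert node['extent'] % factor == 0
        return {'loop': var + 'o', 'extent': node['extent'] // factor,
                'body': {'loop': var + 'i', 'extent': factor,
                         'body': node['body']}}
    if isinstance(node, dict):
        return {**node, 'body': split_loop(node['body'], var, factor)}
    return node

def tile(node, var1, var2, factor):
    """A USER-defined scheduling operation, written outside the compiler
    by composing primitives, and reusable across many kernels. (A full
    tiling would also reorder the split loops; omitted for brevity.)"""
    return split_loop(split_loop(node, var1, factor), var2, factor)
```

Because `tile` is an ordinary function over programs, it can be collected into a library alongside other user-defined operations and applied to any kernel with the right loop structure, which is the reuse Exo 2 is designed to enable.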
“It’s a grassroots approach to automation, rather than conducting an ML/AI search over high-performance code,” states Ikarashi. “What this implies is that performance engineers and hardware developers can create their own scheduling library, which constitutes a collection of optimization techniques applicable to their hardware to achieve peak performance.”
An important benefit of Exo 2 is that it reduces the amount of coding effort needed at any one time by reusing scheduling code across applications and hardware targets. The researchers implemented a scheduling library of roughly 2,000 lines of code in Exo 2, encompassing reusable optimizations that are specific to linear algebra and to particular targets (AVX512, AVX2, Neon, and Gemmini hardware accelerators). This library consolidates the scheduling effort for more than 80 high-performance kernels, each requiring up to a dozen lines of scheduling code, and achieves performance that rivals or surpasses MKL, OpenBLAS, BLIS, and Halide.
Exo 2 introduces a new mechanism called “Cursors” that provides what the researchers term a “stable reference” for pointing at the object code throughout the scheduling process. Ikarashi emphasizes that a stable reference is essential for users to encapsulate schedules within a library function, as it keeps the scheduling code independent of object-code transformations.
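The intuition behind a stable reference can be sketched in a few lines of plain Python (again illustrative, not Exo 2's actual Cursors API): a cursor refers to a statement's identity rather than its position, so transformations that move code around do not invalidate it.

```python
# Conceptual sketch: an index-based reference breaks when code is rewritten,
# but an identity-based cursor keeps resolving to the same statement.

class Stmt:
    """One statement of object code (illustrative placeholder)."""
    def __init__(self, text):
        self.text = text

class Cursor:
    """A stable reference: holds the Stmt object itself, not an index,
    so rewrites elsewhere in the program do not invalidate it."""
    def __init__(self, stmt):
        self._stmt = stmt

    def resolve(self, program):
        # Find the statement's current position in the (possibly
        # transformed) program; identity comparison locates the object.
        return program.index(self._stmt)

program = [Stmt("load A"), Stmt("multiply"), Stmt("store C")]
cur = Cursor(program[1])              # point at "multiply"

# A scheduling transformation inserts a statement at the front. A raw
# index (1) would now point at the wrong statement, but the cursor
# still resolves to "multiply" at its new position.
program.insert(0, Stmt("prefetch B"))
```

This independence from any particular position in the code is what lets a schedule written against cursors be packaged into a library function and reapplied after other transformations have reshaped the program.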
“We believe that USLs should be designed to be user-extensible, rather than constrained by a fixed set of operations,” remarks Ikarashi. “In this manner, a language can evolve to accommodate extensive projects through the creation of libraries that meet diverse optimization needs and application areas.”
Exo 2’s architecture enables performance engineers to concentrate on high-level optimization strategies while ensuring that the foundational object code remains functionally equivalent through the use of reliable primitives. Moving forward, the team intends to broaden Exo 2’s compatibility with various types of hardware accelerators, such as GPUs. Several ongoing initiatives are aimed at enhancing compiler analysis itself, focusing on correctness, compilation duration, and expressiveness.
Ikarashi and Ragan-Kelley co-authored the paper alongside graduate students Kevin Qian and Samir Droubi, Alex Reinking from Adobe, and former CSAIL postdoc Gilbert Bernstein, now a professor at the University of Washington. This research was partially funded by the U.S. Defense Advanced Research Projects Agency (DARPA) and the U.S. National Science Foundation, while the lead author received additional support through Masason, Funai, and Quad Fellowships.