Managing complex interacting systems, whether it’s the different modes of transportation in a city or the many components that must work together to make a capable and efficient robot, is an increasingly important subject for software designers to tackle. Now, researchers at MIT have developed an entirely new way of approaching these complex problems, using simple diagrams as a tool to reveal better approaches to software optimization in deep-learning models.
They say the new method makes addressing these complex tasks so simple that it can be reduced to a drawing that would fit on the back of a napkin.
The new approach is described in the journal Transactions on Machine Learning Research, in a paper by incoming doctoral student Vincent Abbott and Professor Gioele Zardini of MIT’s Laboratory for Information and Decision Systems (LIDS).
“We developed a new language to talk about these new systems,” Zardini says. This new diagram-based “language” is heavily based on something called category theory, he explains.
It all has to do with designing the underlying architecture of computer algorithms, the software that will ultimately sense and control the various parts of the system that’s being optimized. “The components are different pieces of an algorithm, and they have to talk to each other and exchange information, but also account for energy usage, memory consumption, and so on.” Such optimizations are notoriously difficult because each change in one part of the system can in turn cause changes in other parts, which can further affect other parts, and so on.
The researchers decided to focus on a particular class of deep-learning algorithms, which are currently a hot topic of research. Deep learning is the basis of large artificial intelligence models, including large language models such as ChatGPT and image-generation models such as Midjourney. These models manipulate data through a “deep” series of matrix multiplications interspersed with other operations. The numbers within the matrices are parameters that are updated during long training runs, allowing complex patterns to be found. Models consist of billions of parameters, which makes computation expensive, and which makes improved resource usage and optimization invaluable.
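To make that structure concrete, here is a minimal sketch, not code from the paper, of a model expressed as a chain of matrix multiplications with a simple nonlinearity between them; the layer sizes, random initialization, and choice of ReLU are arbitrary assumptions made for the example.

```python
import numpy as np

def relu(x):
    # Elementwise nonlinearity: one of the "other operations" between matrix multiplications.
    return np.maximum(x, 0.0)

def init_params(layer_sizes, rng):
    # Each weight matrix holds parameters that would be updated during training.
    return [rng.standard_normal((m, n)) * 0.1
            for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(params, x):
    # A "deep" series of matrix multiplications interspersed with nonlinearities.
    for w in params[:-1]:
        x = relu(x @ w)
    return x @ params[-1]

rng = np.random.default_rng(0)
params = init_params([16, 64, 64, 8], rng)  # toy sizes; real models have billions of parameters
batch = rng.standard_normal((4, 16))
print(forward(params, batch).shape)         # (4, 8)
```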
Diagrams can represent the details of the parallelized operations that deep-learning models consist of, revealing the relationships between the algorithms and the parallelized graphics processing unit (GPU) hardware they run on, supplied by companies such as NVIDIA. “I’m very excited about this,” says Zardini, because “we seem to have found a language that very nicely describes deep learning algorithms, explicitly representing all the important things, which includes the operators you use,” such as energy consumption, memory allocation, and any other parameters being optimized for.
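As a loose illustration of that kind of bookkeeping, and not the paper’s actual notation, the sketch below attaches rough arithmetic and memory-traffic estimates to each operation; the `Op` structure, the cost model, and the example sizes are assumptions made purely for this example.

```python
from dataclasses import dataclass

@dataclass
class Op:
    # A hypothetical annotated operation: the kind of resource bookkeeping the diagrams make explicit.
    name: str
    flops: int        # arithmetic work
    bytes_moved: int  # data transferred to and from GPU global memory

def matmul_op(m, k, n, dtype_bytes=4):
    # Rough cost model for an (m x k) @ (k x n) matrix multiplication.
    flops = 2 * m * k * n
    bytes_moved = dtype_bytes * (m * k + k * n + m * n)
    return Op(f"matmul_{m}x{k}x{n}", flops, bytes_moved)

pipeline = [matmul_op(1024, 512, 512), matmul_op(1024, 512, 2048)]
print(sum(op.flops for op in pipeline), "FLOPs,",
      sum(op.bytes_moved for op in pipeline), "bytes moved")
```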
Much of the progress in deep learning has stemmed from improvements in resource efficiency. The recent DeepSeek model showed that a small team can compete with top models from OpenAI and other major labs by focusing on resource efficiency and the relationship between software and hardware. Typically, in deriving these optimizations, he says, “people need a lot of trial and error to discover new architectures.” For example, a widely used optimization program called FlashAttention took more than four years to develop, he says. But with the new framework they developed, “we can really approach this problem in a more formal way.” And all of this is represented visually in a precisely defined graphical language.
But the methods traditionally used to find these improvements “are very limited,” he says. “I think this shows that there’s a major gap, in that we don’t have a formal, systematic method of relating an algorithm to either its optimal execution, or even really understanding how many resources it will take to run.” Now, with the new diagram-based method they devised, such a system exists.
Category theory, which underlies this approach, is a way of mathematically describing the different components of a system and how they interact, in a generalized, abstract manner. Different perspectives can be related: for example, mathematical formulas can be related to the algorithms that implement them and use resources, or descriptions of systems can be related to robust “monoidal string diagrams.” These visualizations allow you to directly play around and experiment with how the different parts connect and interact. What they developed, he says, amounts to “string diagrams on steroids,” which incorporate many more graphical conventions and many more properties.
“Category theory can be thought of as the mathematics of abstraction and composition,” Abbott says. “Any compositional system can be described using category theory, and the relationship between compositional systems can then also be studied.” Algebraic rules that are typically associated with functions can also be represented as diagrams, he says. “Then, a lot of the visual manipulations we can do with diagrams can be related to algebraic manipulations of functions. So, it creates this correspondence between these different systems.”
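A toy sketch of what “compositional” means here, under assumptions invented for this example rather than drawn from the paper’s formalism: each diagram “box” pairs a function with an abstract resource cost, and boxes can be chained one after another or placed side by side, with the costs composing along with the functions.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Box:
    # A hypothetical diagram "box": a function on its wires plus an abstract resource cost.
    fn: Callable
    cost: float

def then(f: Box, g: Box) -> Box:
    # Sequential composition: the output wires of f feed the input wires of g; costs add.
    return Box(lambda x: g.fn(f.fn(x)), f.cost + g.cost)

def beside(f: Box, g: Box) -> Box:
    # Parallel (side-by-side) composition: f and g act on separate wires at the same time.
    return Box(lambda xy: (f.fn(xy[0]), g.fn(xy[1])), f.cost + g.cost)

double = Box(lambda x: 2 * x, cost=1.0)
inc = Box(lambda x: x + 1, cost=1.0)

print(then(double, inc).fn(3))         # 7: chaining boxes mirrors ordinary function composition
print(beside(double, inc).fn((3, 4)))  # (6, 5): two wires handled in parallel
```

Chaining boxes this way behaves exactly like composing the underlying functions, which is the kind of correspondence between diagrams and algebra that Abbott describes.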
As a result, he says, “this solves a very important problem, which is that we have these deep-learning algorithms, but they’re not clearly understood as mathematical models.” By representing them as diagrams, it becomes possible to approach them formally and systematically, he says.
One thing this enables is a clear visual understanding of the way parallel real-world processes can be represented by parallel processing in multicore computer GPUs. “In this way,” Abbott says, “diagrams can both represent a function, and then reveal how to optimally execute it on a GPU.”
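One simple way to picture that mapping, offered here as a toy sketch rather than anything from the paper, is a matrix multiplication whose output is partitioned into independent tiles, each of which could in principle be handed to its own GPU thread block; the tile size and dimensions below are arbitrary.

```python
import numpy as np

def tiled_matmul(a, b, tile=64):
    # Toy illustration: partition the output into independent tiles, each of which
    # could be assigned to its own GPU thread block and computed in parallel.
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    out = np.zeros((m, n), dtype=np.result_type(a, b))
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            out[i:i + tile, j:j + tile] = a[i:i + tile, :] @ b[:, j:j + tile]
    return out

rng = np.random.default_rng(0)
a = rng.standard_normal((128, 96))
b = rng.standard_normal((96, 160))
print(np.allclose(tiled_matmul(a, b), a @ b))  # True
```

On a real GPU the tiles would be computed concurrently rather than in a Python loop; the point is only that this partitioning is what the parallel hardware exploits.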
The “attention” algorithm is used by deep-learning algorithms that require general, contextual information, and is a key component of the serialized blocks that make up large language models such as ChatGPT. FlashAttention is an optimization that took years to develop, but resulted in a sixfold improvement in the speed of attention algorithms.
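For reference, a minimal, unoptimized version of scaled dot-product attention looks roughly like the sketch below; this is not the FlashAttention kernel, and the shapes and helper functions are assumptions for the example. FlashAttention’s gains come from reorganizing this same computation so that the large intermediate score matrix never has to be written out to slow GPU memory.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Reference (unoptimized) scaled dot-product attention: the full score matrix is
    # materialized here, which is exactly what memory-aware implementations avoid.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 16))
k = rng.standard_normal((8, 16))
v = rng.standard_normal((8, 32))
print(attention(q, k, v).shape)  # (8, 32)
```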
This approach, Abbott explains, “enables optimization to be rapidly achieved, in contrast to existing methods.” While they initially applied the approach to the already-existing FlashAttention algorithm, thereby verifying its effectiveness, “we hope to now use this language to automate the detection of improvements,” says Zardini, who in addition to being a principal investigator at LIDS is the Rudge and Nancy Allen Assistant Professor of Civil and Environmental Engineering and an affiliate faculty member of the Institute for Data, Systems, and Society.
The ultimate goal, he says, is to develop the software to the point where “the researcher can upload their code, and with the new algorithm it will automatically detect what can be improved, what can be optimized, and return an optimized version of the algorithm to the user.”
In addition to automating algorithm optimization, Zardini emphasizes that a thorough analysis of how deep-learning algorithms relate to hardware resource consumption permits systematic co-design of hardware and software. This line of inquiry aligns with Zardini’s emphasis on categorical co-design, which utilizes the tools of category theory to concurrently enhance various elements of engineered systems.
Abbott says that “this whole field of optimized deep learning models, I believe, is critically underexplored, and that’s why these diagrams are so exciting. They open the door to a systematic approach to this problem.”
“I’m really impressed by the quality of this research. … The new method of diagramming deep-learning algorithms used in this paper could be a very significant step,” says Jeremy Howard, founder and CEO of Answers.ai, who was not associated with this research. “This paper is the first time I’ve seen such notation used to deeply analyze the performance of a deep-learning algorithm on real-world hardware. … The next step will be to see whether real-world performance improvements can be achieved.”
“This is a beautifully executed piece of theoretical research, which also aims for high accessibility to uninitiated readers, a quality rarely seen in papers of this kind,” says Petar Velickovic, a senior research scientist at Google DeepMind and a lecturer at Cambridge University, who was not involved in the work. These researchers, he says, “are clearly excellent communicators, and I eagerly anticipate what they come up with next!”
The new diagram-based language, having been posted online, has already attracted significant attention and interest from software developers. A reviewer of Abbott’s earlier paper introducing the diagrams noted that “The proposed neural circuit diagrams look great from an artistic standpoint (as far as I am able to judge this).” “It’s technical research, but it’s also flashy!” Zardini says.