The ability to generate high-quality images quickly is crucial for producing realistic simulated environments that can be used to train self-driving cars to avoid unpredictable hazards, making them safer on real streets.
But the generative artificial intelligence techniques increasingly being used to produce such images have drawbacks. One popular type, called a diffusion model, can create strikingly realistic images but is too slow and computationally intensive for many applications. On the other hand, the autoregressive models that power large language models (LLMs) like ChatGPT are much faster, but they produce poorer-quality images that are often riddled with errors.
Researchers from MIT and NVIDIA developed a new approach that brings together the best of both methods. Their hybrid image-generation tool uses an autoregressive model to quickly capture the overall composition, then a small diffusion model to refine the finer details of the image.
The tool, known as HART (short for hybrid autoregressive transformer), can generate images that match or exceed the quality of state-of-the-art diffusion models, but do so about nine times faster.
The generation process uses fewer computational resources than typical diffusion models, enabling HART to run locally on a consumer laptop or smartphone. A user only needs to type one natural language prompt into the HART interface to generate an image.
HART could have a wide range of applications, such as helping researchers train robots to complete complex real-world tasks and aiding designers in producing striking scenes for video games.
“If you are creating a landscape, and you fill in the entire canvas at once, it may not turn out very well. However, if you paint the larger picture first and then refine it with smaller brush strokes, your artwork could become significantly better. That encapsulates the fundamental concept behind HART,” explains Haotian Tang SM ’22, PhD ’25, co-lead author of a new paper on HART.
He is joined by co-lead author Yecheng Wu, an undergraduate at Tsinghua University; senior author Song Han, an associate professor in the MIT Department of Electrical Engineering and Computer Science (EECS), a member of the MIT-IBM Watson AI Lab, and a distinguished scientist at NVIDIA; as well as other collaborators from MIT, Tsinghua University, and NVIDIA. The research will be presented at the International Conference on Learning Representations.
The best of both worlds
Popular diffusion models, such as Stable Diffusion and DALL-E, are known for producing highly detailed images. These models generate images through an iterative process: they predict some amount of random noise on each pixel, subtract the noise, then repeat the prediction and “de-noising” process many times until a new image emerges that is completely free of noise.
Because the diffusion model de-noises all the pixels in an image at every step, and there may be 30 or more steps, the process is slow and computationally expensive. But since the model gets multiple chances to correct details it got wrong, the images are high-quality.
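To make that loop concrete, here is a minimal sketch of diffusion-style sampling in Python. It is illustrative only: predict_noise is a hypothetical stand-in for a trained noise-prediction network, and the image size and step count are arbitrary choices, not values from HART or any real model.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_STEPS = 30  # typical samplers run 30 or more de-noising iterations

def predict_noise(image, step):
    """Hypothetical stand-in for a trained noise-prediction network."""
    # A real model is a large neural network conditioned on the step
    # and a text prompt; this toy version just damps the current image.
    return image * (step / NUM_STEPS)

# Start from pure random noise with the shape of the target image.
image = rng.standard_normal((64, 64, 3))

# Each iteration predicts noise for every pixel and removes part of it,
# so the cost scales with pixels x steps -- the slowness described above.
for step in range(NUM_STEPS, 0, -1):
    image = image - predict_noise(image, step) / step
```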
Autoregressive models, the kind typically used to predict text, can generate images by predicting patches of an image sequentially, a few pixels at a time. They can’t go back and correct their mistakes, but the sequential prediction process is much faster than diffusion.
These models use representations called tokens to make predictions. An autoregressive model uses an autoencoder to compress raw image pixels into discrete tokens, and to reconstruct the image from the predicted tokens. While this boosts the model’s speed, the information lost during compression can cause errors in the newly generated image.
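As a rough illustration of that sequential process, the sketch below samples discrete tokens one at a time. Here, next_token_logits is a hypothetical placeholder for a trained autoregressive transformer, and the codebook size and grid dimensions are assumed for the example, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
CODEBOOK_SIZE = 1024  # size of the discrete token vocabulary (assumed)
NUM_TOKENS = 16 * 16  # one token per cell of a 16x16 latent grid (assumed)

def next_token_logits(prefix):
    """Hypothetical placeholder for an autoregressive transformer."""
    # A real model would attend over the prefix and a text prompt.
    return rng.standard_normal(CODEBOOK_SIZE)

def softmax(logits):
    exps = np.exp(logits - logits.max())
    return exps / exps.sum()

# Tokens are sampled one at a time, left to right. Earlier choices are
# frozen: the model cannot go back and fix them, as noted above.
tokens = []
for _ in range(NUM_TOKENS):
    probs = softmax(next_token_logits(tokens))
    tokens.append(int(rng.choice(CODEBOOK_SIZE, p=probs)))

# The autoencoder's decoder would then map this token grid back to
# pixels, losing some fine detail in the process.
```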
With HART, the researchers developed a hybrid approach that uses an autoregressive model to predict compressed, discrete image tokens, then a small diffusion model to predict residual tokens. Residual tokens compensate for the information lost in compression by capturing details the discrete tokens leave out.
“We can achieve a substantial improvement in reconstruction quality. Our residual tokens capture high-frequency details, such as the edges of objects or features like a person’s hair, eyes, or mouth. These are areas where discrete tokens may falter,” states Tang.
Because the diffusion model only predicts the remaining details after the autoregressive model has done its job, it can accomplish the task in eight steps, instead of the 30 or more a standard diffusion model requires to generate an entire image. This minimal overhead from the extra diffusion model allows HART to retain the speed advantage of the autoregressive model while significantly enhancing its ability to generate intricate image details.
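Putting the two stages together, a HART-style pipeline can be sketched as follows. Both autoregressive_stage and predict_residual_noise are hypothetical placeholders; only the division of labor and the eight-step residual loop reflect the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_SHAPE = (16, 16, 8)  # continuous latent grid for one image (assumed)
RESIDUAL_STEPS = 8          # eight steps, versus 30+ for full diffusion

def autoregressive_stage(prompt):
    """Hypothetical fast AR transformer: lays out the global composition."""
    return rng.standard_normal(LATENT_SHAPE)  # stands in for decoded tokens

def predict_residual_noise(residual, step):
    """Hypothetical lightweight diffusion network for residual detail."""
    return residual * (step / RESIDUAL_STEPS)

# Stage 1: the autoregressive model quickly produces the coarse image.
coarse = autoregressive_stage("a red sports car on a mountain road")

# Stage 2: the small diffusion model only recovers what compression lost,
# so a handful of de-noising steps is enough.
residual = rng.standard_normal(LATENT_SHAPE)
for step in range(RESIDUAL_STEPS, 0, -1):
    residual = residual - predict_residual_noise(residual, step) / step

latents = coarse + residual  # global structure plus high-frequency detail
# A decoder would then map `latents` back to pixels.
```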
“The diffusion model has a simpler task, leading to greater efficiency,” he adds.
Outperforming larger models
During the development of HART, the researchers encountered challenges in effectively integrating the diffusion model to enhance the autoregressive model. They found that incorporating the diffusion model in the early stages of the autoregressive process caused errors to accumulate. Their final design, which applies the diffusion model to predict only the residual tokens as the last step, significantly improved the quality of the generated images.
Their method, which uses a combination of an autoregressive transformer model with 700 million parameters and a lightweight diffusion model with 37 million parameters, can generate images of the same quality as those created by a diffusion model with 2 billion parameters, but about nine times faster. It uses about 31 percent less computation than state-of-the-art models.
Additionally, because HART uses an autoregressive model to do the bulk of the work, the same type of model that underpins LLMs, it is better suited for integration with the new class of unified vision-language generative models. In the future, one might interact with a unified vision-language generative model, perhaps asking it to show the intermediate steps required to assemble a piece of furniture.
“LLMs provide an excellent interface for various models, such as multimodal models and those capable of reasoning. This represents a way to elevate intelligence to a new level. An efficient image-generation model would unlock numerous possibilities,” he remarks.
Looking ahead, the researchers aim to pursue this direction and develop vision-language models based on the HART framework. Since HART is adaptable and applicable to multiple modalities, they also aspire to extend its use to video generation and audio prediction tasks.
This study was partially supported by the MIT-IBM Watson AI Lab, the MIT and Amazon Science Hub, the MIT AI Hardware Program, and the U.S. National Science Foundation. The GPU infrastructure for training this model was generously provided by NVIDIA.