Teaching AI models the broad strokes to sketch more like humans do

Words don’t always suffice when you’re trying to convey or understand an idea. Sometimes a quick sketch is the more effective strategy; drawing a circuit, for instance, can help make sense of how the system works.

But what if artificial intelligence could help us explore these visual representations? While such systems are generally adept at producing realistic images and cartoon-like illustrations, many models fail to capture the essence of sketching: its progressive, stroke-by-stroke process, which helps humans brainstorm and refine how they depict their ideas.

A new drawing system from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and Stanford University can sketch more like humans do. The technique, dubbed “SketchAgent,” uses a multimodal language model (an AI system that learns from both text and images, such as Anthropic’s Claude 3.5 Sonnet) to turn natural language prompts into sketches within seconds. For example, it can doodle a house on its own, or it can collaborate with a human, drawing alongside them or following text-based instructions to sketch each part separately.

The researchers demonstrated that SketchAgent can produce abstract drawings of various concepts, including a robot, butterfly, DNA helix, flowchart, and even the Sydney Opera House. Down the line, the tool could grow into an interactive art game that helps educators and researchers diagram complex concepts or gives users a quick drawing lesson.

CSAIL postdoc Yael Vinker, who is one of the primary authors of a paper introducing SketchAgent, observes that the system offers a more intuitive way for humans to interact with AI.

“Not everyone realizes how frequently they sketch in their day-to-day life. We often illustrate our thoughts or brainstorm ideas using sketches,” she remarks. “Our tool seeks to replicate that experience, enhancing multimodal language models’ efficacy in visually expressing ideas.”

SketchAgent teaches these models to draw stroke by stroke without training on any sketching data. Instead, the researchers devised a “sketching language” in which a sketch is translated into a numbered sequence of strokes on a grid. The system was given examples of how objects such as a house would be drawn, with each stroke annotated to indicate what it represented (the seventh stroke could be a rectangle labeled “front door,” for example), which helps the model generalize to unfamiliar concepts.
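To make that representation concrete, here is a minimal Python sketch of what such a numbered, labeled stroke sequence could look like. The grid size, field names, and the toy “house” strokes below are illustrative assumptions, not SketchAgent’s actual format.

```python
# Hypothetical illustration of a stroke-based "sketching language": a sketch is
# a numbered sequence of strokes on a coordinate grid, and each stroke carries
# a label for the part of the object it depicts. Grid size, field names, and
# the example strokes are assumptions, not SketchAgent's actual format.
from dataclasses import dataclass

GRID_SIZE = 50  # assumed resolution of the drawing grid


@dataclass
class Stroke:
    index: int                       # position in the drawing order
    points: list[tuple[int, int]]    # grid cells the stroke passes through
    label: str                       # what the stroke represents


# A toy "house": labeling strokes (e.g., "front door") is what lets a language
# model connect its conceptual knowledge to the act of drawing.
house = [
    Stroke(1, [(10, 40), (10, 15), (40, 15), (40, 40), (10, 40)], "walls"),
    Stroke(2, [(10, 15), (25, 5), (40, 15)], "roof"),
    Stroke(3, [(22, 40), (22, 28), (28, 28), (28, 40)], "front door"),
]

for stroke in house:
    print(f"stroke {stroke.index}: {stroke.label} -> {stroke.points}")
```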

Vinker authored the paper with three CSAIL collaborators — postdoc Tamar Rott Shaham, undergraduate researcher Alex Zhao, and MIT Professor Antonio Torralba — alongside Stanford University Research Fellow Kristine Zheng and Assistant Professor Judith Ellen Fan. They will present their findings at the 2025 Conference on Computer Vision and Pattern Recognition (CVPR) this month.

Evaluating AI’s sketching capabilities

While text-to-image models like DALL-E 3 can generate captivating illustrations, they lack a vital element of sketching: the spontaneous, imaginative process where every stroke can influence the overall design. In contrast, SketchAgent builds each drawing as a sequence of strokes, which makes it appear more fluid and natural, closer to how humans sketch.

Previous efforts have imitated this process, but they trained their models on human-drawn datasets, which are often limited in size and diversity. SketchAgent instead uses pre-trained language models, which are knowledgeable about many concepts but don’t know how to sketch. Once the researchers taught the models this sketching language, SketchAgent began to draw diverse concepts it wasn’t explicitly trained on.

Still, Vinker and her colleagues wanted to know whether SketchAgent was actively working with humans during the sketching process, or simply drawing independently of its partner. The team tested the system in collaboration mode, where a human and a language model work together on a drawing of a particular concept. Removing SketchAgent’s contributions revealed that its strokes were essential to the final drawing: in a sketch of a sailboat, for instance, removing the AI-generated strokes depicting the mast made the overall sketch unrecognizable.

In another experiment, the CSAIL and Stanford researchers plugged different multimodal language models into SketchAgent to see which could produce the most recognizable sketches. Their default backbone model, Claude 3.5 Sonnet, generated the most human-like vector graphics (essentially text-based files that can be rendered as high-resolution images), outperforming models such as GPT-4o and Claude 3 Opus.
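As a rough picture of why vector output is useful, the sketch below shows how a list of labeled grid strokes could be serialized as an SVG file, a plain-text vector format that renderers can rasterize at any resolution. The stroke data, scaling factor, and styling here are illustrative assumptions, not the system’s actual output pipeline.

```python
# Minimal, self-contained sketch of serializing grid strokes as an SVG vector
# graphic (a text-based format that scales to any resolution). The stroke data
# and styling are illustrative assumptions, not SketchAgent's actual output.

def strokes_to_svg(strokes, grid_size=50, scale=10):
    """Render (label, points) strokes as SVG polylines."""
    size = grid_size * scale
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">']
    for label, points in strokes:
        coords = " ".join(f"{x * scale},{y * scale}" for x, y in points)
        parts.append(
            f'  <polyline points="{coords}" fill="none" stroke="black" '
            f'stroke-width="3"><title>{label}</title></polyline>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


# A toy sailboat: hull, mast, and sail, each as a labeled polyline.
sailboat = [
    ("hull", [(10, 35), (40, 35), (35, 42), (15, 42), (10, 35)]),
    ("mast", [(25, 35), (25, 10)]),
    ("sail", [(25, 10), (38, 30), (25, 30)]),
]

with open("sailboat.svg", "w") as f:
    f.write(strokes_to_svg(sailboat))
```

Because the file is plain text, rendering it larger only requires changing the scale, not regenerating the drawing.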

“The fact that Claude 3.5 Sonnet surpassed other models like GPT-4o and Claude 3 Opus implies that this model processes and generates visual-related information differently,” comments co-author Tamar Rott Shaham.

She adds that SketchAgent could evolve into a valuable interface for collaborating with AI models beyond conventional text-based communication. “As models improve in understanding and generating other modalities, such as sketches, they offer new avenues for users to articulate ideas and receive responses that feel more instinctive and human-like,” asserts Shaham. “This could significantly enhance interactions, making AI more approachable and versatile.”

Although SketchAgent’s drawing skills are promising, it can’t yet produce professional sketches. It renders basic representations of concepts using stick figures and simple doodles, but it struggles with logos, written sentences, complex creatures such as unicorns and cows, and specific human figures.

At times, the model misread users’ intentions in collaborative sketches, as when SketchAgent once drew a bunny with two heads. According to Vinker, this may be because the model breaks each task into smaller steps (also called “Chain of Thought” reasoning): when collaborating with a human, it forms a plan for the drawing and can misjudge which part of that plan the human is contributing to. Future work could sharpen these drawing skills by training on synthetic data from diffusion models.

SketchAgent also often requires several rounds of prompting to produce human-like doodles. Going forward, the team aims to make it easier to interact and sketch with multimodal language models, including by refining the system’s interface.

Still, the tool suggests that AI could draw a wide range of concepts the way humans do, through step-by-step collaboration that yields more coherent final designs.

This project was partially funded by the U.S. National Science Foundation, a Hoffman-Yee Grant from the Stanford Institute for Human-Centered AI, Hyundai Motor Co., the U.S. Army Research Laboratory, the Zuckerman STEM Leadership Program, and a Viterbi Fellowship.

