Introduction to SketchAgent
When you’re trying to communicate or understand ideas, words don’t always do the trick. Sometimes the more efficient approach is a simple sketch — diagramming a circuit, for example, might help make sense of how the system works.
The Power of Sketching
But what if artificial intelligence could help us explore these visualizations? While AI image generators are typically proficient at creating realistic paintings and cartoonish drawings, many models fail to capture the essence of sketching: its stroke-by-stroke, iterative process, which helps humans brainstorm and edit how they want to represent their ideas.
What is SketchAgent?
A new drawing system from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and Stanford University can sketch more like we do. Their method, called “SketchAgent,” uses a multimodal language model — AI systems that train on text and images, like Anthropic’s Claude 3.5 Sonnet — to turn natural language prompts into sketches in a few seconds. For example, it can doodle a house either on its own or through collaboration, drawing with a human or incorporating text-based input to sketch each part separately.
Capabilities of SketchAgent
The researchers showed that SketchAgent can create abstract drawings of diverse concepts, like a robot, butterfly, DNA helix, flowchart, and even the Sydney Opera House. One day, the tool could be expanded into an interactive art game that helps teachers and researchers diagram complex concepts or give users a quick drawing lesson.
How SketchAgent Works
CSAIL postdoc Yael Vinker, who is the lead author of a paper introducing SketchAgent, notes that the system introduces a more natural way for humans to communicate with AI. SketchAgent teaches these models to draw stroke-by-stroke without training on any data — instead, the researchers developed a “sketching language” in which a sketch is translated into a numbered sequence of strokes on a grid.
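To make the idea of a "sketching language" concrete, here is a minimal sketch of what such an encoding could look like — the exact format below (grid size, cell labels, stroke syntax) is our illustrative assumption, not the paper's actual specification. The key point is that each stroke becomes a numbered sequence of grid cells, so a language model can emit a drawing as ordinary text, one stroke at a time.

```python
GRID_SIZE = 50  # coarse canvas resolution (an assumption for illustration)

# A tiny "house": stroke 1 is the square base, stroke 2 the roof.
# Each stroke is an ordered list of (column, row) cells on the grid.
house = {
    1: [(10, 40), (10, 20), (40, 20), (40, 40), (10, 40)],  # base
    2: [(10, 20), (25, 5), (40, 20)],                        # roof
}

def encode(strokes):
    """Serialize numbered strokes into a text form a model could read or write."""
    lines = []
    for num, points in sorted(strokes.items()):
        cells = " ".join(f"x{c}y{r}" for c, r in points)
        lines.append(f"stroke {num}: {cells}")
    return "\n".join(lines)

print(encode(house))
```

Because the drawing is just text, adding, reordering, or deleting a stroke is a simple edit to the sequence — which is what makes the stroke-by-stroke, collaborative workflow possible.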
Assessing AI’s Sketching Abilities
While text-to-image models such as DALL-E 3 can create intriguing drawings, they lack a crucial component of sketching: the spontaneous, creative process where each stroke can impact the overall design. On the other hand, SketchAgent’s drawings are modeled as a sequence of strokes, appearing more natural and fluid, like human sketches.
Collaboration Mode
The team tested their system in collaboration mode, where a human and a language model work toward drawing a particular concept in tandem. Removing SketchAgent’s contributions revealed that their tool’s strokes were essential to the final drawing. In a drawing of a sailboat, for instance, removing the artificial strokes representing a mast made the overall sketch unrecognizable.
Experimentation and Results
In another experiment, CSAIL and Stanford researchers plugged different multimodal language models into SketchAgent to see which could create the most recognizable sketches. Their default backbone model, Claude 3.5 Sonnet, generated the most human-like vector graphics (essentially text-based files that can be converted into high-resolution images). It outperformed models like GPT-4o and Claude 3 Opus.
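As an illustration of why vector graphics are "text-based files that can be converted into high-resolution images," here is a hypothetical conversion from grid strokes to SVG — the stroke format, scale factor, and styling are our assumptions, not SketchAgent's output format.

```python
def strokes_to_svg(strokes, grid=50, scale=10):
    """Render strokes (each an ordered list of (x, y) grid cells) as an SVG string."""
    size = grid * scale
    polys = []
    for pts in strokes:
        coords = " ".join(f"{x * scale},{y * scale}" for x, y in pts)
        polys.append(
            f'<polyline points="{coords}" fill="none" '
            f'stroke="black" stroke-width="3"/>'
        )
    body = "\n  ".join(polys)
    return (f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'width="{size}" height="{size}">\n  {body}\n</svg>')

# A sailboat reduced to two strokes: a hull and a mast (toy example).
hull = [(5, 40), (10, 45), (40, 45), (45, 40), (5, 40)]
mast = [(25, 45), (25, 10)]
print(strokes_to_svg([hull, mast]))
```

Since every stroke is stored as coordinates rather than pixels, the same file renders crisply at any resolution — and removing a stroke (say, the mast) is just deleting one line of text.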
Future Possibilities
While SketchAgent’s drawing prowess is promising, it can’t make professional sketches yet. It renders simple representations of concepts using stick figures and doodles, but struggles to doodle logos, sentences, complex creatures such as unicorns and cows, and specific human figures. The researchers could possibly refine these drawing skills by training on synthetic data from diffusion models.
Conclusion
SketchAgent suggests AI could draw diverse concepts the way humans do, with step-by-step human-AI collaboration that results in more aligned final designs. This work was supported, in part, by the U.S. National Science Foundation, a Hoffman-Yee Grant from the Stanford Institute for Human-Centered AI, the Hyundai Motor Co., the U.S. Army Research Laboratory, the Zuckerman STEM Leadership Program, and a Viterbi Fellowship.
FAQs
- What is SketchAgent?
SketchAgent is a drawing system that uses a multimodal language model to turn natural language prompts into sketches in a few seconds.
- How does SketchAgent work?
SketchAgent teaches models to draw stroke-by-stroke without training on any data, using a “sketching language” to translate sketches into numbered sequences of strokes on a grid.
- What are the capabilities of SketchAgent?
SketchAgent can create abstract drawings of diverse concepts, like robots, butterflies, and DNA helices, and can collaborate with humans to create sketches.
- What are the limitations of SketchAgent?
SketchAgent can’t make professional sketches yet and struggles to doodle complex creatures, logos, and specific human figures.
- What are the future possibilities of SketchAgent?
SketchAgent could be expanded into an interactive art game, help teachers and researchers diagram complex concepts, or give users a quick drawing lesson.