The Art of Imitation: How AI Can Mimic Sounds Like a Pro
Vocal Imitation: A New Frontier in Communication
Whether you’re describing the sound of your faulty car engine or meowing like your neighbor’s cat, imitating sounds with your voice can be a helpful way to relay a concept when words don’t do the trick.
A New AI System That Can Mimic Sounds
Inspired by the cognitive science of how we communicate, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) researchers have developed an AI system that can produce human-like vocal imitations with no training, and without ever having “heard” a human vocal impression before.
How It Works
To achieve this, the researchers engineered their system to produce and interpret sounds much like we do. They started by building a model of the human vocal tract that simulates how vibrations from the voice box are shaped by the throat, tongue, and lips. Then, they used a cognitively inspired AI algorithm to control this vocal tract model and make it produce imitations, taking into consideration the context-specific ways that humans choose to communicate sound.
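The team’s actual simulator and control algorithm aren’t reproduced here, but the underlying idea resembles a classic source-filter model of speech: a periodic source stands in for the voice box, and resonant filters stand in for the shaping done by the throat, tongue, and lips. The sketch below is a minimal illustration under those assumptions; every name, formant frequency, and parameter is hypothetical rather than taken from the paper.

```python
import numpy as np

SR = 16_000  # sample rate in Hz

def glottal_source(f0, duration, sr=SR):
    """Crude voice-box source: a sawtooth pulse train at pitch f0 (Hz)."""
    t = np.arange(int(duration * sr)) / sr
    return 2.0 * (t * f0 - np.floor(0.5 + t * f0))  # sawtooth in [-1, 1)

def resonate(signal, center_hz, bandwidth_hz, sr=SR):
    """Shape the source with one resonance (a two-pole filter), standing in
    for how the throat, tongue, and lips filter the voice-box vibration."""
    r = np.exp(-np.pi * bandwidth_hz / sr)
    c = 2.0 * r * np.cos(2.0 * np.pi * center_hz / sr)
    out = np.zeros_like(signal)
    for n in range(len(signal)):
        y1 = out[n - 1] if n >= 1 else 0.0
        y2 = out[n - 2] if n >= 2 else 0.0
        out[n] = signal[n] + c * y1 - (r * r) * y2
    return out / np.max(np.abs(out))

# An "ah"-like vowel: 110 Hz pitch, first two formants near 700 and 1200 Hz.
vowel = resonate(resonate(glottal_source(110, 0.5), 700, 130), 1200, 170)
```

In a setup like this, a control algorithm would adjust parameters such as the pitch and resonance settings until the model’s output imitates a target sound.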
The Art of Imitation, in Three Parts
The team developed three increasingly nuanced versions of the model and compared their output with human vocal imitations. First, they created a baseline model that simply aimed to generate imitations as similar to the real-world sounds as possible; this model, however, didn’t match human behavior very well.
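As a rough sketch of what such a baseline objective could look like, the snippet below scores an imitation purely by its acoustic distance to the target. The acoustic_features helper is a hypothetical stand-in for a perceptual feature extractor, not the paper’s actual measure.

```python
import numpy as np

def acoustic_features(sound):
    """Hypothetical stand-in for a perceptual feature extractor,
    here just a log-magnitude spectrum of the waveform."""
    return np.log1p(np.abs(np.fft.rfft(sound)))

def baseline_loss(imitation, target):
    """Baseline objective: make the imitation as acoustically close to the
    real-world target sound as possible, with no notion of a listener.
    Assumes both waveforms have the same length."""
    diff = acoustic_features(imitation) - acoustic_features(target)
    return float(np.mean(diff ** 2))
```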
The researchers then designed a second, “communicative” model. According to Caren, one of the researchers, this model considers what’s distinctive about a sound to a listener. For instance, you’d likely imitate the sound of a motorboat by mimicking the rumble of its engine, since that’s its most distinctive auditory feature, even if it isn’t the loudest aspect of the sound (compared to, say, the water splashing). This second model produced imitations that were better than the baseline’s, but the team wanted to improve it further.
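One way to picture the communicative idea, reusing the hypothetical acoustic_features helper from the baseline sketch: weight the acoustic error by how strongly each feature distinguishes the target from other sounds a listener might have in mind. This is an illustrative formulation, not the paper’s exact model.

```python
import numpy as np

# Reuses the hypothetical acoustic_features() helper from the baseline sketch.

def communicative_loss(imitation, target, distractors):
    """Communicative objective: match the target most strongly on the
    features that set it apart from other sounds a listener might confuse
    it with (e.g., a motorboat's engine rumble rather than the splashing)."""
    f_target = acoustic_features(target)
    f_background = np.mean([acoustic_features(d) for d in distractors], axis=0)
    distinctiveness = np.abs(f_target - f_background)  # what makes this sound "it"
    weights = distinctiveness / distinctiveness.sum()
    error = (acoustic_features(imitation) - f_target) ** 2
    return float(np.sum(weights * error))
```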
To take their method a step further, the researchers added a final layer of reasoning to the model. “Vocal imitations can sound different based on the amount of effort you put into them. It costs time and energy to produce sounds that are perfectly accurate,” says Chandra, another of the researchers. The full model accounts for this by avoiding utterances that are very rapid, loud, or high- or low-pitched, which people are less likely to use in conversation. The result: more human-like imitations that closely match many of the decisions humans make when imitating the same sounds.
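Continuing the same hypothetical sketch, the effort term can be pictured as a penalty on control settings a speaker would find costly, folded into the communicative objective. Every threshold and weight below is invented for illustration, not taken from the paper.

```python
# Builds on communicative_loss() from the previous sketch; all thresholds
# and weights here are invented for illustration.

def effort_cost(pitch_hz, loudness, rate):
    """Penalty on utterances speakers rarely produce in conversation:
    extreme pitches, high loudness, or very rapid delivery."""
    pitch_penalty = max(0.0, 80.0 - pitch_hz) + max(0.0, pitch_hz - 400.0)
    return pitch_penalty + max(0.0, loudness - 1.0) + max(0.0, rate - 6.0)

def full_loss(imitation, target, distractors, pitch_hz, loudness, rate,
              effort_weight=0.1):
    """Full model: communicate the sound's distinctive features, but trade
    accuracy off against the effort of producing the utterance."""
    return (communicative_loss(imitation, target, distractors)
            + effort_weight * effort_cost(pitch_hz, loudness, rate))
```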
Conclusion
The team’s work presents an exciting step toward formalizing and testing theories of the intricate interplay between physiology, social reasoning, and communication in the evolution of language. As Caren notes, their method captures the abstract, non-phono-realistic ways humans express the sounds they hear, teaching us about the process of auditory abstraction.
FAQs
Q: What is vocal imitation?
A: Vocal imitation is the sonic equivalent of doodling a quick picture to communicate something you saw — except instead of using a pencil to illustrate an image, you use your vocal tract to express a sound.
Q: How does the AI system work?
A: The AI system produces and interprets sounds much like we do, using a cognitively inspired AI algorithm to control a model of the human vocal tract.
Q: What are the potential applications of this technology?
A: The technology could lead to more intuitive “imitation-based” interfaces for sound designers, more human-like AI characters in virtual reality, and even methods to help students learn new languages.
Q: What are the limitations of the current model?
A: The model struggles with some consonants, like “z,” and can’t yet replicate how humans imitate speech, music, or sounds that are imitated differently across languages.