Artificial Intelligence Meets Human-Like Speech: Sesame’s CSM
Gavin Purcell, co-host of the AI for Humans podcast, shared an example video on Reddit in which a human pretends to be an embezzler and argues with a boss using Sesame’s new voice AI. The result is so convincing that it’s difficult to tell who’s the human and who’s the AI model. In our own demo session, the model proved entirely capable of the kind of exchange shown in the video.
“Near-human quality”
Sesame’s CSM achieves its realism by using two AI models working together, a backbone and a decoder, based on Meta’s Llama architecture and processing interleaved text and audio. The largest version uses 8.3 billion parameters (an 8-billion-parameter backbone plus a 300-million-parameter decoder) and was trained on approximately 1 million hours of primarily English audio.
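To make those reported numbers concrete, here is a minimal sketch (not Sesame’s actual code) of how the parameter budget breaks down; the `CSMConfig` class and its field names are purely illustrative assumptions:

```python
# Illustrative sketch of the reported parameter split: an ~8B Llama-style
# backbone paired with a much smaller ~300M audio decoder. All names and
# fields are hypothetical, not from Sesame's implementation.
from dataclasses import dataclass

@dataclass
class CSMConfig:
    backbone_params: int = 8_000_000_000   # Llama-style transformer backbone
    decoder_params: int = 300_000_000      # lightweight audio decoder
    training_audio_hours: int = 1_000_000  # approx. hours of (mostly English) audio

    @property
    def total_params(self) -> int:
        return self.backbone_params + self.decoder_params

config = CSMConfig()
print(f"Total parameters: {config.total_params / 1e9:.1f}B")  # -> 8.3B
```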
A Single-Stage Approach
Unlike traditional text-to-speech systems, which split generation into two stages, Sesame’s CSM handles everything in a single-stage, multimodal transformer-based model that jointly processes interleaved text and audio tokens to produce speech. OpenAI’s voice model uses a similar multimodal approach.
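As a rough illustration of what “interleaved” means here, the sketch below flattens per-turn text tokens and audio codec tokens into one tagged sequence that a single transformer could attend over. The tagging scheme and token IDs are invented for the example and are not Sesame’s actual format:

```python
# Hypothetical illustration of interleaving: text tokens and audio codec
# tokens alternate turn by turn in one sequence, rather than text being
# converted to audio in a separate second stage.
from typing import List, Tuple

def interleave_turns(turns: List[Tuple[List[int], List[int]]]) -> List[Tuple[str, int]]:
    """Flatten (text_tokens, audio_tokens) pairs into one tagged sequence."""
    sequence: List[Tuple[str, int]] = []
    for text_tokens, audio_tokens in turns:
        sequence.extend(("text", t) for t in text_tokens)    # what was said
        sequence.extend(("audio", a) for a in audio_tokens)  # how it sounded
    return sequence

# Two conversational turns, each with toy text and audio codec token IDs.
conversation = [([101, 102, 103], [9001, 9002]), ([104, 105], [9003, 9004, 9005])]
print(interleave_turns(conversation))
```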
Evaluation Results
In blind tests without conversational context, human evaluators showed no clear preference between CSM-generated speech and real human recordings, suggesting the model achieves near-human quality for isolated speech samples. However, when provided with conversational context, evaluators still consistently preferred real human speech, indicating a gap remains in fully contextual speech generation.
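For readers curious how such a blind preference test is tallied, the sketch below simulates the bookkeeping: listeners vote between a CSM sample and a human recording without knowing which is which, and a preference rate is computed per condition. The votes are randomly generated purely to illustrate the calculation and do not reflect Sesame’s actual evaluation code or results:

```python
# Toy simulation of a blind A/B preference tally. The data is invented;
# only the bookkeeping is meant to be informative.
import random

def preference_rate(votes: list) -> float:
    """Fraction of trials in which the human recording was preferred."""
    return sum(v == "human" for v in votes) / len(votes)

random.seed(0)
# No context: roughly 50/50, i.e. no clear preference (simulated).
no_context = [random.choice(["human", "csm"]) for _ in range(200)]
# With conversational context: votes skewed toward the human recording (simulated).
with_context = [("human" if random.random() < 0.7 else "csm") for _ in range(200)]

print(f"No context:   {preference_rate(no_context):.0%} prefer human")
print(f"With context: {preference_rate(with_context):.0%} prefer human")
```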
Limitations Acknowledged
Sesame co-founder Brendan Iribe acknowledged current limitations in a comment on Hacker News, noting that the system is “still too eager and often inappropriate in its tone, prosody, and pacing” and has issues with interruptions, timing, and conversation flow. “Today, we’re firmly in the valley, but we’re optimistic we can climb out,” he wrote.
Conclusion
Sesame’s CSM is a significant advancement in human-like speech generation, but it’s not without its limitations. While it achieves near-human quality in isolated speech samples, a gap remains in fully contextual speech generation. As the technology continues to evolve, it’s exciting to consider the potential applications, from customer service chatbots to language translation systems.
Frequently Asked Questions
* Q: How does Sesame’s CSM generate speech?
A: Sesame’s CSM uses a single-stage, multimodal transformer-based model that jointly processes interleaved text and audio tokens to produce speech.
* Q: How does the model process audio and text?
A: The model processes interleaved text and audio tokens with a transformer based on Meta’s Llama architecture, trained on approximately 1 million hours of primarily English audio.
* Q: How does the model achieve near-human quality?
A: The model achieves near-human quality by using a large-scale multimodal transformer-based model with 8.3 billion parameters, trained on roughly 1 million hours of primarily English audio.