March 24, 2025
Deploying Sesame CSM: The Most Realistic Voice Model as an API

Kyle Gani
Senior Technical Product Manager
If you've been following the AI voice space, you might have noticed something exciting: we're finally reaching a point where AI-generated speech doesn't sound like, well... AI. Sesame AI Labs recently dropped their Conversational Speech Model (CSM), and it's a game-changer.
In their research paper, Sesame demonstrates how CSM produces speech so natural that even audio experts struggle to distinguish it from human recordings. It's not just about clarity—it's the hesitations, the "umms," the natural rhythm that makes it feel genuinely human.
In this article, we’ll walk you through how to deploy CSM on a serverless cloud platform like Cerebrium. By the end, you'll have your very own hyper-realistic voice API up and running.
Check out the final code here.
What is Sesame CSM?
Sesame's Conversational Speech Model (CSM) represents a significant advancement in text-to-speech technology. Unlike traditional TTS systems that often sound robotic or unnatural, CSM generates speech with remarkably human-like qualities, including natural pauses, "umms," "uhhs," expressive mouth sounds, and subtle intonation changes that characterize human conversation.
The model works by combining a large language model (LLM) architecture (based on Llama 3.2) with specialized audio tokenization. When generating speech, CSM takes into account not just the text to be spoken, but also the conversational context, allowing it to maintain a coherent speaking style throughout an exchange.
Prerequisites
Before getting started, you’ll need access to the following:
A Cerebrium account, sign up here
A Huggingface account (Sign up here) and API Key
Access to the CSM-1B model, which you can request here
Access to the Llama 3.2 1B model, which you can request here
Getting Started with the CSM Repository
First things first—let's grab the code we need. Sesame has kindly provided the CSM repository on GitHub, which gives us everything we need to get started. This repo is our springboard, containing the model architecture and generation code.
Here's how to get set up:
Clone the repository:
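```bash
# Clone Sesame's CSM repo (grab the exact clone URL from the GitHub page if this has moved)
git clone https://github.com/SesameAILabs/csm.git
```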
For our Cerebrium deployment, we'll need four key files from this repo:
generator.py: Contains the speech generation logic
models.py: Defines the model architecture
requirements.txt: Lists all the dependencies
watermarking.py: Adds a watermark to the generated audio for ethical reasons
Let's create a new project directory and copy these files over:
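```bash
# Create a fresh project directory (the name is our choice) and copy over the files we need
mkdir sesame-api
cp csm/generator.py csm/models.py csm/requirements.txt csm/watermarking.py sesame-api/
cd sesame-api
```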
Now we'll need to add our own cerebrium.toml and main.py files. When we're done, our project structure should look something like this (the project directory name is up to you):
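```
sesame-api/
├── cerebrium.toml
├── main.py
├── generator.py
├── models.py
├── requirements.txt
└── watermarking.py
```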
Let's create our cerebrium.toml. This is where we'll configure our deployment settings. Don't worry if you're new to this—the file is pretty straightforward.
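Here's a sketch of what it can look like (the section and key names, especially the GPU identifier and the requirements-file reference, may differ between Cerebrium CLI versions, so double-check them against the Cerebrium docs):

```toml
[cerebrium.deployment]
name = "sesame"
python_version = "3.12"

[cerebrium.hardware]
cpu = 3
memory = 12.0
compute = "AMPERE_A10"  # NVIDIA A10
gpu_count = 1

[cerebrium.scaling]
min_replicas = 0   # scale to zero when idle to save money
max_replicas = 5   # scale up to handle traffic spikes

[cerebrium.dependencies.apt]
ffmpeg = "latest"  # needed for audio processing

[cerebrium.dependencies.paths]
pip = "requirements.txt"  # install the repo's Python dependencies
```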
Let’s break down what's happening here:
We're naming our deployment "sesame" and using Python 3.12
For hardware, we're requesting 3 CPU cores, 12GB memory, and an NVIDIA A10 GPU (the model needs some decent GPU horsepower)
The scaling settings ensure our deployment can handle traffic spikes by automatically scaling up to 5 replicas when needed and scaling down to 0 when idle (saving us money)
We're also making sure ffmpeg is installed, which we need for audio processing
Pro tip: these settings provide a good balance between performance and cost. The model runs comfortably on an A10, and you can adjust the replicas based on your expected traffic.
Setting our Environment variables
For our application to cache and download our gated models, we need to set the following environment variables in our Cerebrium dashboard (Under the secrets section):
HF_TOKEN: You can generate a new token from your Huggingface dashboard (our access to the gated CSM and Llama repositories is linked to this)
HF_HUB_ENABLE_HF_TRANSFER=1: This enables faster downloads of our models from Huggingface
HF_HOME=/persistent-storage/.cache/huggingface/hub: This sets our caching directory to our Cerebrium persistent volume so that caches persist between app cold starts
Note: Don't forget to add hf_transfer to your requirements.txt file, otherwise you will receive an error.
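```bash
# Append hf_transfer to the project's requirements.txt (run from the project directory)
echo "hf_transfer" >> requirements.txt
```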
Creating the main.py File
Now for the fun part—our main.py file. This is where all the moving parts come together. If you look at the CSM repo, you'll find some examples, but they're more research-oriented. We're going to create something tailored for our Cerebrium deployment.
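Here's a sketch of what main.py can look like. The load_csm_1b and Segment imports come from the repo's generator.py; the prompt audio paths and transcripts are placeholders for your own short reference recordings, and the call signatures are worth checking against the version of the repo you cloned:

```python
import base64
import io

import torch
import torchaudio

from generator import Segment, load_csm_1b

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load CSM-1B once at startup so warm requests skip the (slow) model load
generator = load_csm_1b(device=device)


def _load_prompt(path: str) -> torch.Tensor:
    """Load a short prompt clip, downmix to mono, and resample to the model's rate."""
    audio, sample_rate = torchaudio.load(path)
    audio = audio.mean(dim=0)  # downmix to mono
    return torchaudio.functional.resample(
        audio, orig_freq=sample_rate, new_freq=generator.sample_rate
    )


# Two example utterances that define the speaking style for speaker 1.
# The transcripts and audio paths are placeholders - swap in your own recordings.
context_segments = [
    Segment(
        speaker=1,
        text="Hey, so, um, I was thinking about what you said earlier.",
        audio=_load_prompt("prompts/utterance_1.wav"),
    ),
    Segment(
        speaker=1,
        text="Yeah, that makes sense. Let's, uh, let's give it a try.",
        audio=_load_prompt("prompts/utterance_2.wav"),
    ),
]


def generate_audio(text: str, max_audio_length_ms: int = 10_000) -> dict:
    """Generate speech for `text` in the style of speaker 1 and return base64 WAV data."""
    audio = generator.generate(
        text=text,
        speaker=1,
        context=context_segments,
        max_audio_length_ms=max_audio_length_ms,  # cap clips at ~10 seconds
    )

    # Encode the waveform as an in-memory WAV file, then base64 it for the JSON response
    buffer = io.BytesIO()
    torchaudio.save(buffer, audio.unsqueeze(0).cpu(), generator.sample_rate, format="wav")
    return {"audio": base64.b64encode(buffer.getvalue()).decode("utf-8")}
```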
Here's what's happening in this code:
We're loading the CSM-1B model when our service starts
We set up two conversation examples that define the speaking style (what are known as utterances)
The generate_audio function:
Takes text input from the API request
Creates context segments from our examples
Generates speech in the style of speaker 1
Returns the audio as base64-encoded data
The Segment class from generator.py is crucial here - it packages up text, audio, and speaker information that helps the model understand how to generate natural speech.
Notice how we're limiting output to 10 seconds with max_audio_length_ms. You can adjust this if you need longer audio clips, though keep in mind that longer clips will take more time to generate.
Deploying to Cerebrium
For the moment of truth, we’ll deploy our new app to Cerebrium:
Login to your Cerebrium account:
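```bash
# Authenticate the Cerebrium CLI (install it first with `pip install cerebrium` if you haven't)
cerebrium login
```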
This will open your browser and authenticate your CLI.
Add your Huggingface token (HF_TOKEN) in your Cerebrium dashboard under your secrets section (This allows you to authenticate against Huggingface and provides access to the gated model repositories).
Deploy your model:
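```bash
# Run from the project directory containing cerebrium.toml and main.py
cerebrium deploy
```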
Cerebrium takes over and will:
Upload your files
Build a custom container with all your dependencies
Provision the A10 GPU
Deploy your app, download the model weights, and test that it runs successfully
Set up the API endpoint
The first deployment might take a few minutes as it installs all the dependencies and tests that your application works as expected.
Once your app is deployed, you should see your endpoint URL printed in your terminal. You can also find your URL in your dashboard, along with the API key you'll need to make requests against your endpoint.
Testing Your Voice API
Next, let's create a simple test script to see our model in action. Create a new file called test.py.
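Here's a minimal sketch; the endpoint URL format and the exact shape of the JSON response depend on your Cerebrium version, so adjust both based on what your dashboard shows:

```python
import base64

import requests

# Replace these placeholders with the endpoint URL and API key from your Cerebrium dashboard
ENDPOINT_URL = "https://api.cortex.cerebrium.ai/v4/<project-id>/sesame/generate_audio"
API_KEY = "<your-api-key>"

payload = {
    "text": "So, uh, I was thinking... maybe we could grab coffee later? If you're free, that is."
}
headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

response = requests.post(ENDPOINT_URL, json=payload, headers=headers, timeout=300)
response.raise_for_status()

# Some Cerebrium versions nest the function's return value under a "result" key
data = response.json()
result = data.get("result", data)

# Decode the base64 WAV data and write it to disk
with open("output.wav", "wb") as f:
    f.write(base64.b64decode(result["audio"]))

print("Audio saved to output.wav")
```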
To run this test, replace the placeholder values and run the following in your terminal:
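```bash
pip install requests  # only needed if it isn't already installed
python test.py
```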
If all goes well, you should see something like:
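```
Audio saved to output.wav
```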
Now play the output.wav file! The first thing you'll notice is how eerily natural it sounds. There are hesitations, subtle intonation changes, and even that "uh" we included gets rendered in a way that sounds completely human.
Try experimenting with different texts - add more filler words like "um" and "like" to see how the model handles them. It's impressive how it keeps the same speaking style consistent throughout.
Taking It Further: Improvements and Next Steps
Our basic implementation is already a good start, but there's so much more we can do with this technology. Here are some ways to supercharge our implementation:
Streaming Audio for Real-Time Applications
The current implementation has one limitation: you need to wait for the entire audio clip to generate before receiving anything. For more interactive applications, we can implement streaming: generate the audio in smaller chunks and send each chunk to the client as soon as it's ready, as in the sketch below. On the client side, you'd use a WebSocket connection or server-sent events to receive and play these chunks in real time. This approach could reduce perceived latency from ~6 seconds down to just 1-2 seconds for the first audio to start playing.
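Here's a rough sketch of the server-side idea, assuming your platform can stream results from a Python generator function and reusing the generator and context_segments objects from main.py; splitting on sentences is just one simple chunking strategy:

```python
import base64
import io
import re

import torchaudio


# Assumes `generator` and `context_segments` are defined as in main.py above
def stream_audio(text: str):
    """Yield base64-encoded WAV chunks, one per sentence, as soon as each is generated."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    for sentence in sentences:
        audio = generator.generate(
            text=sentence,
            speaker=1,
            context=context_segments,
            max_audio_length_ms=10_000,
        )
        buffer = io.BytesIO()
        torchaudio.save(buffer, audio.unsqueeze(0).cpu(), generator.sample_rate, format="wav")
        # The client can start playback as soon as the first chunk arrives
        yield base64.b64encode(buffer.getvalue()).decode("utf-8")
```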
Conclusion
We’re genuinely excited to see what you'll build with this. Maybe it's a more accessible reading app for those with visual impairments. Perhaps it's a language learning tool with perfectly natural pronunciation. Or maybe you're creating the next generation of voice assistants that finally sound human.
Whatever you build, remember to use this power responsibly. Sesame has thoughtfully included audio watermarking to help identify AI-generated speech, and it's important to be transparent with users when they're interacting with AI-generated voices.
Happy coding, and enjoy crossing the uncanny valley of voice with Sesame and Cerebrium!