March 24, 2025

Deploying Sesame CSM: The Most Realistic Voice Model as an API

Kyle Gani - Senior Technical Product Manager

If you've been following the AI voice space, you might have noticed something exciting: we're finally reaching a point where AI-generated speech doesn't sound like, well... AI. Sesame AI Labs recently dropped their Conversational Speech Model (CSM), and it's a game-changer.

In their research paper, Sesame demonstrates how CSM produces speech so natural that even audio experts struggle to distinguish it from human recordings. It's not just about clarity—it's the hesitations, the "umms," the natural rhythm that makes it feel genuinely human.

In this article, we’ll walk you through how to deploy CSM on a serverless cloud platform like Cerebrium. By the end, you'll have your very own hyper-realistic voice API up and running.

Check out the final code here.

What is Sesame CSM?

Sesame's Conversational Speech Model (CSM) represents a significant advancement in text-to-speech technology. Unlike traditional TTS systems that often sound robotic or unnatural, CSM generates speech with remarkably human-like qualities, including natural pauses, "umms," "uhhs," expressive mouth sounds, and the subtle intonation changes that characterize human conversation.

The model works by combining a large language model (LLM) architecture (based on Llama 3.2) with specialized audio tokenization. When generating speech, CSM takes into account not just the text to be spoken, but also the conversational context, allowing it to maintain a coherent speaking style throughout an exchange.
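
To make "conversational context" concrete, here's a simplified preview of the API we'll wire up later in main.py. This sketch uses a placeholder tensor of silence as the prompt audio; the real deployment below passes actual recorded prompts:

from generator import load_csm_1b, Segment  # from the CSM repo we clone below
import torch

# Load the model once; "cuda" assumes a GPU is available
generator = load_csm_1b(device="cuda")

# Placeholder prompt: one second of silence standing in for a real recording
placeholder_audio = torch.zeros(generator.sample_rate)
context = [Segment(text="a transcript of the prompt audio", speaker=0, audio=placeholder_audio)]

# The context segments steer the speaking style of the generated speech
audio = generator.generate(
    text="Hey, how's it going?",
    speaker=1,
    context=context,
    max_audio_length_ms=10_000,
)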

Prerequisites

Before getting started, you’ll need access to the following:

  • A Cerebrium account (sign up here)

  • A Huggingface account (Sign up here) and API Key

  • Access to the CSM-1B model, which you can request here

  • Access to the Llama 3.2 1B model, which you can request here

Getting Started with the CSM Repository

First things first—let's grab the code we need. Sesame has kindly provided the CSM repository on GitHub, which gives us everything we need to get started. This repo is our springboard, containing the model architecture and generation code.

Here's how to get set up:

  1. Clone the repository:

    git clone https://github.com/SesameAILabs/csm.git
  2. For our Cerebrium deployment, we'll need four key files from this repo:

    • generator.py: Contains the speech generation logic

    • models.py: Defines the model architecture

    • requirements.txt: Lists all the dependencies

    • watermarking.py: Adds a watermark to the generated audio for ethical reasons

Let's create a new project directory and copy these files over:

pip install cerebrium --upgrade
cerebrium init 10-sesame-voice-api
cp csm/generator.py csm/models.py csm/requirements.txt csm/watermarking.py 10-sesame-voice-api/
cd 10-sesame-voice-api

Now we'll need to add our own cerebrium.toml and main.py files. When we're done, our project structure should look like this:

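(This is the layout implied by the files we just copied plus the two files we're about to create.)

10-sesame-voice-api/
├── cerebrium.toml
├── generator.py
├── main.py
├── models.py
├── requirements.txt
└── watermarking.py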

Let's create our cerebrium.toml. This is where we'll configure our deployment settings. Don't worry if you're new to this—the file is pretty straightforward:

[cerebrium.deployment]
name = "10-sesame-voice-api"
python_version = "3.12"
docker_base_image_url = "debian:bookworm-slim"
disable_auth = false
include = ['./*', 'main.py', 'cerebrium.toml']
exclude = ['.*', 'test.py']

[cerebrium.hardware]
cpu = 3.0
memory = 12.0
compute = "AMPERE_A10"

[cerebrium.scaling]
min_replicas = 1 #Don't forget to set this back to 0 when you're done
max_replicas = 5
cooldown = 30
replica_concurrency = 1
response_grace_period = 900
scaling_metric = "concurrency_utilization"
scaling_target = 100
scaling_buffer = 0
roll_out_duration_seconds = 0

[cerebrium.dependencies.paths]
pip = "requirements.txt"

[cerebrium.dependencies.apt]
ffmpeg = "latest" # needed for audio processing

Let’s break down what's happening here:

  • We're naming our deployment "10-sesame-voice-api" and using Python 3.12

  • For hardware, we're requesting 3 CPU cores, 12GB memory, and an NVIDIA A10 GPU (the model needs some decent GPU horsepower)

  • The scaling settings let the deployment handle traffic spikes by automatically scaling up to 5 replicas when needed; we've set min_replicas to 1 while testing to avoid cold starts, but set it back to 0 when you're done so an idle replica isn't costing you money

  • We're also making sure ffmpeg is installed, which we need for audio processing

Pro tip: these settings provide a good balance between performance and cost. The model runs comfortably on an A10, and you can adjust the replicas based on your expected traffic.

Setting our Environment variables

For our application to cache and download our gated models, we need to set the following environment variables in our Cerebrium dashboard (Under the secrets section):

  • HF_TOKEN: You can generate a new token from your Huggingface dashboard (our access to the gated CSM and Llama repositories is linked to this)

  • HF_HUB_ENABLE_HF_TRANSFER=1: This enables faster downloads of our models from Huggingface

  • HF_HOME=/persistent-storage/.cache/huggingface/hub: This sets our caching directory to our Cerebrium persistent volume so that caches are persisted between app cold starts

Note: Don’t forget to add hf_transfer to your requirements.txt file, otherwise you will receive an error.
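
For reference, that change is a one-line addition to the repo's requirements.txt (left unpinned here; pin a version if you want reproducible builds):

hf_transfer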

Creating the main.py File

Now for the fun part—our main.py file. This is where all the moving parts come together. If you look at the CSM repo, you'll find some examples, but they're more research-oriented. We're going to create something tailored for our Cerebrium deployment:

from generator import load_csm_1b, Segment
import torchaudio
import torch
from huggingface_hub import hf_hub_download
import os
import base64

# This device selection lets our code work on any Cerebrium hardware
if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

# Load the model - this happens once when the service starts
# The model will stay loaded in memory for faster inference
generator = load_csm_1b(device=device)

# These example conversations give the model context for how to speak
# The model will mimic the speaking style in these samples
speakers = [0, 1]  # Speaker 0 and Speaker 1
transcripts = [
    (
        "like revising for an exam I'd have to try and like keep up the momentum because I'd "
        "start really early I'd be like okay I'm gonna start revising now and then like "
        "you're revising for ages and then I just like start losing steam I didn't do that "
        "for the exam we had recently to be fair that was a more of a last minute scenario "
        "but like yeah I'm trying to like yeah I noticed this yesterday that like Mondays I "
        "sort of start the day with this not like a panic but like a"
    ),
    (
        "like a super Mario level. Like it's very like high detail. And like, once you get "
        "into the park, it just like, everything looks like a computer game and they have all "
        "these, like, you know, if, if there's like a, you know, like in a Mario game, they "
        "will have like a question block. And if you like, you know, punch it, a coin will "
        "come out. So like everyone, when they come into the park, they get like this little "
        "bracelet and then you can go punching question blocks around."
    )
]

# Download the audio samples that accompany the transcripts
# With Cerebrium, we use persistent storage to cache these files
audio_paths = [
    hf_hub_download(
        repo_id="sesame/csm-1b",
        filename="prompts/conversational_a.wav",
    ),
    hf_hub_download(
        repo_id="sesame/csm-1b",
        filename="prompts/conversational_b.wav",
    )
]

def _load_prompt_audio(audio_path: str, target_sample_rate: int) -> torch.Tensor:
    """Helper function to load and resample audio files"""
    audio_tensor, sample_rate = torchaudio.load(audio_path)
    audio_tensor = audio_tensor.squeeze(0)  # Remove channel dimension
    audio_tensor = torchaudio.functional.resample(
        audio_tensor, orig_freq=sample_rate, new_freq=target_sample_rate
    )
    return audio_tensor

# This is the function Cerebrium will call when we hit our endpoint
def generate_audio(text: str):
    """
    Generate conversational speech from text, using the CSM-1B model.

    Args:
        text (str): The text to convert to speech

    Returns:
        dict: Contains base64-encoded audio data and format information
    """
    # Create context segments from our example conversations
    segments = [
        Segment(text=transcript, speaker=speaker, audio=_load_prompt_audio(audio_path, generator.sample_rate))
        for transcript, speaker, audio_path in zip(transcripts, speakers, audio_paths)
    ]

    # Generate audio with speaker 1's voice (you can change to 0 if preferred)
    audio = generator.generate(
        text=text,
        speaker=1,
        context=segments,
        max_audio_length_ms=10_000,  # Limit to 10 seconds
        temperature=0.9,  # Controls randomness - higher = more variation
    )

    # Save to temporary WAV file, read it, and convert to base64
    torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
    with open("audio.wav", "rb") as f:
        wav_data = f.read()
    os.remove("audio.wav")  # Clean up the temporary file
    encoded_data = base64.b64encode(wav_data).decode('utf-8')

    return {"audio_data": encoded_data, "format": "wav", "encoding": "base64"}

Here's what's happening in this code:

  1. We're loading the CSM-1B model when our service starts

  2. We set up two conversation examples that define the speaking style (what are known as utterances)

  3. The generate_audio function:

    • Takes text input from the API request

    • Creates context segments from our examples

    • Generates speech in the style of speaker 1

    • Returns the audio as base64-encoded data

The Segment class from generator.py is crucial here - it packages up text, audio, and speaker information that helps the model understand how to generate natural speech.
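
Because a Segment is just text, a speaker ID, and audio, you aren't limited to the bundled prompts. As a rough sketch, here's how you might add a context segment built from your own recording inside main.py (my_prompt.wav and its transcript are hypothetical; this reuses _load_prompt_audio and generator defined above):

# Hypothetical: steer the voice with your own prompt recording
my_prompt_audio = _load_prompt_audio("my_prompt.wav", generator.sample_rate)
my_segment = Segment(
    text="an accurate transcript of my_prompt.wav",
    speaker=0,
    audio=my_prompt_audio,
)

# Generate new speech in the style of that recording
custom_audio = generator.generate(
    text="Now say something new in that voice.",
    speaker=0,
    context=[my_segment],
    max_audio_length_ms=10_000,
)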

Notice how we're limiting to 10 seconds with max_audio_length_ms. You can adjust this if you need longer audio clips, though keep in mind that longer clips will take more time to generate.

Deploying to Cerebrium

For the moment of truth, we’ll deploy our new app to Cerebrium:

  1. Login to your Cerebrium account:

    cerebrium login

    This will open your browser and authenticate your CLI.

  2. Add your Huggingface token (HF_TOKEN) in your Cerebrium dashboard under your secrets section (This allows you to authenticate against Huggingface and provides access to the gated model repositories).

  3. Deploy your model:

    cerebrium deploy

    Cerebrium takes over and will:

    • Upload your files

    • Build a custom container with all your dependencies

    • Provision the A10 GPU

    • Deploy your app, check that it runs successfully, and download the model weights

    • Set up the API endpoint

The first deployment might take a few minutes as it installs all the dependencies and tests that your application works as expected.

Once your app is deployed, you should see your endpoint URL printed in your terminal. You can also find your URL in your dashboard, along with the API key you'll need to make requests against your endpoint.

Testing Your Voice API

Next, let's create a simple test script to see our model in action. Create a new file called test.py:

import requests
import json
import base64
import soundfile as sf
import io
import time

# Replace with your actual endpoint and API key
url = "https://api.cortex.cerebrium.ai/v4/[YOUR_PROJECT_ID]/10-sesame-voice-api/generate_audio"  # App name matches cerebrium.toml
api_key = "[YOUR_API_KEY]"  # Replace with your Cerebrium API key

# The text we want to convert to speech
test_text = "Cerebrium is a, uh, really great cloud platform for deploying your voice models. It's easy to use and the team is very helpful."

# Prepare the request
payload = json.dumps({"text": test_text})
headers = {
  'Authorization': f'Bearer {api_key}',
  'Content-Type': 'application/json'
}

# Time the request
print(f"Sending text to be converted: \\"{test_text}\\"")
start_time = time.time()
response = requests.request("POST", url, headers=headers, data=payload)
end_time = time.time()

# Check if the request was successful
if response.status_code == 200:
    result = response.json()
    print(f"Generated audio in {end_time - start_time:.2f} seconds!")

    # Convert base64 to audio file
    audio_data = base64.b64decode(result['result']["audio_data"])
    audio_buffer = io.BytesIO(audio_data)
    audio, rate = sf.read(audio_buffer)

    # Save to file
    output_file = "output.wav"
    sf.write(output_file, audio, rate)
    print(f"Audio saved to {output_file}")
    print(f"Audio length: {len(audio) / rate:.2f} seconds")
else:
    print(f"Error: {response.status_code}")
    print(response.text)

To run this test, install the requests and soundfile packages (pip install requests soundfile), replace the placeholder values, and run the script from your terminal:

python test.py

If all goes well, the script prints how long generation took, saves the audio to output.wav, and reports the clip's length.

Now play the output.wav file! The first thing you'll notice is how eerily natural it sounds. There are hesitations, subtle intonation changes, and even that "uh" we included gets rendered in a way that sounds completely human.

Try experimenting with different texts - add more filler words like "um" and "like" to see how the model handles them. It's impressive how it keeps the same speaking style consistent throughout.

Taking It Further: Improvements and Next Steps

Our basic implementation is already a good start, but there's so much more we can do with this technology. Here are some ways to supercharge our implementation:

Streaming Audio for Real-Time Applications

The current implementation has one limitation: you need to wait for the entire audio clip to generate before receiving anything. For more interactive applications, we can implement streaming. On the client side, you'd use a WebSocket connection or server-sent events to receive and play these chunks in real-time. This approach could reduce perceived latency from ~6 seconds down to just 1-2 seconds for the first audio to start playing.
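
To make that concrete, here's a rough client-side sketch assuming a streaming variant of our endpoint. The stream_audio path and the newline-delimited, base64-encoded raw PCM chunk format are assumptions for illustration; the deployment we just built doesn't expose them:

import base64
import requests

# Hypothetical streaming endpoint; the current app only exposes generate_audio
url = "https://api.cortex.cerebrium.ai/v4/[YOUR_PROJECT_ID]/10-sesame-voice-api/stream_audio"
headers = {
    "Authorization": "Bearer [YOUR_API_KEY]",
    "Content-Type": "application/json",
}

with requests.post(url, json={"text": "Hello there!"}, headers=headers, stream=True) as response:
    response.raise_for_status()
    with open("streamed_output.pcm", "wb") as f:
        # Assumes each line is one base64-encoded chunk of raw 16-bit PCM audio
        for line in response.iter_lines():
            if line:
                # A real-time app would hand each decoded chunk straight to an audio player
                f.write(base64.b64decode(line))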

Conclusion

We’re genuinely excited to see what you'll build with this. Maybe it's a more accessible reading app for those with visual impairments. Perhaps it's a language learning tool with perfectly natural pronunciation. Or maybe you're creating the next generation of voice assistants that finally sound human.

Whatever you build, remember to use this power responsibly. Sesame has thoughtfully included audio watermarking to help identify AI-generated speech, and it's important to be transparent with users when they're interacting with AI-generated voices.

Happy coding, and enjoy crossing the uncanny valley of voice with Sesame and Cerebrium!
