This tutorial creates a real-time voice agent that responds to queries via speech in ~500ms. The implementation supports swapping in any Large Language Model (LLM) or Text-to-Speech (TTS) model, making it ideal for voice-based use cases like customer support bots and receptionists. The app uses PipeCat, a framework that handles component integration, user interruptions, and audio data processing. The example joins a meeting room with a voice agent using Daily (PipeCat's creators) and deploys on Cerebrium for scaling. The application has three parts:
- A Pipecat agent that acts as the orchestrator
- A Deepgram TTS/STT service (requires a Deepgram Enterprise account)
- A self-hosted LLM using the vLLM framework

Deepgram deployment
See the Partner Services page to deploy a Deepgram service on Cerebrium. Deploying Deepgram on Cerebrium requires a Deepgram Enterprise License; without one, use Deepgram's hosted API endpoint instead.
LLM Deployment
The LLM is an OpenAI-compatible Llama-3 endpoint using the vLLM framework. For a low time to first token (TTFT), a quantized model is used (RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8). Run cerebrium init llama-llm and add the following to cerebrium.toml:
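The exact configuration was not reproduced here; the sketch below illustrates the kind of cerebrium.toml this step produces. The section names follow Cerebrium's config format, but the specific GPU type, resource sizes, and concurrency values are illustrative assumptions to tune for your workload:

```toml
[cerebrium.deployment]
name = "llama-llm"
python_version = "3.11"

[cerebrium.hardware]
# Assumption: a GPU with enough VRAM for the quantized 8B model.
compute = "AMPERE_A10"
cpu = 4
memory = 32.0

[cerebrium.scaling]
# Concurrent requests each replica accepts before scaling out.
replica_concurrency = 10

[cerebrium.dependencies.pip]
vllm = "latest"
```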
main.py — this uses the vLLM framework and makes it OpenAI compatible:
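The original main.py is not shown here; a minimal sketch of the idea follows, assuming vLLM's bundled OpenAI-compatible server module (vllm.entrypoints.openai.api_server) is launched as a subprocess. vLLM detects the w8a8 (compressed-tensors) quantization from the model's config, so no extra quantization flag is passed; exact flags depend on your vLLM version:

```python
import subprocess
import sys

# Quantized model named in the tutorial.
MODEL_ID = "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8"

def build_server_command(port: int = 8000) -> list[str]:
    # Launch vLLM's bundled OpenAI-compatible API server.
    # Serving it this way exposes /v1/chat/completions, so any
    # OpenAI client can talk to the deployment.
    return [
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",
        "--model", MODEL_ID,
        "--port", str(port),
    ]

if __name__ == "__main__":
    # Blocks for the lifetime of the server; Cerebrium routes
    # HTTP traffic to the exposed port.
    subprocess.run(build_server_command(), check=True)
```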
Add your Hugging Face access token to your Cerebrium secrets as HF_TOKEN so the model weights can be downloaded.
Run cerebrium deploy to make it live. The deployment URL appears in the dashboard and is used in the next step.
Adjust the GPU hardware and replica_concurrency in cerebrium.toml to control how many concurrent calls the LLM handles.
Pipecat setup
Run the following command to create the pipecat-agent: cerebrium init pipecat-agent. The Pipecat framework orchestrates the services to create a voice agent.
Add the following pip packages to cerebrium.toml:
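The package list was not reproduced here; a plausible set, assuming Pipecat's optional extras for the Daily transport and the Deepgram, Cartesia, and OpenAI-compatible services, looks like this in cerebrium.toml:

```toml
[cerebrium.dependencies.pip]
"pipecat-ai[daily,deepgram,cartesia,openai]" = "latest"
python-dotenv = "latest"
aiohttp = "latest"
```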
main.py:
- WebRTC functionality from Daily creates the room (swappable for Twilio/Telnyx). Two functions handle room creation and authentication: create_room() and create_token().
- The Deepgram and LLM services use a local URL to connect within the Cerebrium cluster. Edit the project key in the URL as needed.
- TTS uses the Cartesia service to demonstrate Pipecat's versatility, but the Deepgram TTS service works as well.
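A minimal sketch of those two helpers against Daily's REST API (https://api.daily.co/v1): the exp and room_name payload fields are Daily room and meeting-token properties, and DAILY_API_KEY is assumed to come from the environment:

```python
import json
import os
import time
import urllib.request

DAILY_API_URL = "https://api.daily.co/v1"

def room_payload(expiry_s: int = 3600) -> dict:
    # Rooms expire so abandoned calls are cleaned up automatically.
    return {"properties": {"exp": int(time.time()) + expiry_s}}

def token_payload(room_name: str) -> dict:
    # A meeting token scoped to one room authenticates the bot participant.
    return {"properties": {"room_name": room_name}}

def _post(path: str, payload: dict) -> dict:
    # Thin authenticated POST helper for Daily's REST API.
    req = urllib.request.Request(
        f"{DAILY_API_URL}{path}",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['DAILY_API_KEY']}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def create_room() -> dict:
    return _post("/rooms", room_payload())

def create_token(room_name: str) -> str:
    return _post("/meeting-tokens", token_payload(room_name))["token"]
```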
main() function:
This code handles these events:
- First participant joins: Bot introduces itself via a conversation message
- Additional participants join: Bot listens and responds to all participants
- Participant leaves or call ends: Bot terminates itself
Adjust replica_concurrency in cerebrium.toml to control how many concurrent calls the Pipecat agent handles.
Create a .env file in the pipecat-agent folder with the following:
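The variable names were not reproduced here; given the services this tutorial wires together (Daily for WebRTC, Deepgram for STT, Cartesia for TTS), the file plausibly looks like this, with each value taken from the respective provider's dashboard:

```text
DAILY_API_KEY=...
DEEPGRAM_API_KEY=...
CARTESIA_API_KEY=...
```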

Test the agent locally by running python main.py.
The result is a fully functioning AI bot that interacts with users through speech in ~500ms. The next section creates a user interface for it.
Deploy to Cerebrium
Deploy to Cerebrium by running cerebrium deploy.
The endpoints are used in the frontend interface below.