Tutorial
Oct 14, 2024
An Alternative to OpenAI Realtime API for Voice Capabilities
Michael Louis
Founder & CEO
The recent launch of OpenAI’s Realtime API has sparked excitement within the developer community, offering a unified solution that combines large language models (LLMs) and voice capabilities. While this is a significant advancement, we at Cerebrium, in collaboration with Rime, have developed an alternative solution that not only matches but surpasses the OpenAI offering in several crucial ways.
In this post, we’ll explore how Cerebrium and Rime provide an advanced, cost-effective, and flexible platform for implementing LLM + voice solutions. We’ll dive into the benefits of our approach over OpenAI’s Realtime API and how this partnership leverages the Pipecat framework from Daily to deliver an efficient and customizable solution for enterprises and developers.
You can play with a final demo of the implementation here and see the final code here.
Cerebrium Setup
First let’s create a Cerebrium project. If you don’t have a Cerebrium account, you can create one by signing up here and following the documentation here to get set up.
In your IDE, run the following command to create the Cerebrium starter project: cerebrium init openai-realtime-comparison. This creates two files:
main.py - Our entrypoint file where our code lives
cerebrium.toml - A configuration file that contains all our build and environment settings
Cerebrium is the underlying platform on which our entire application will run. As we go through the tutorial, we will edit the two files above as well as add additional files.
Daily Setup
Daily provides developers with powerful APIs and SDKs to easily integrate realtime video and audio capabilities into their applications. Daily allows us to pick up the frames from a live stream.
If you don’t have a Daily account you can sign up for one here (they have a generous free tier). Next, get your Daily developer token from your profile to authenticate calls. Navigate to the “developers” tab to fetch your API key and add this to your Cerebrium Secrets.
Next, we need to create the endpoints that create the Daily room that both we and our users can join - this is the live stream that we will monitor. You can look at the Daily Docs to see the options available to you. Add the following code to your main.py:
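A minimal sketch of what that could look like, calling Daily's REST API directly (the create_room function name and the one-hour expiry are illustrative choices, and we assume the Daily token from your Cerebrium Secrets is exposed as the DAILY_TOKEN environment variable):

```python
import os
import time

import requests

DAILY_API_URL = "https://api.daily.co/v1"


def create_room():
    """Create a short-lived Daily room plus a meeting token to join it."""
    headers = {"Authorization": f"Bearer {os.environ['DAILY_TOKEN']}"}

    # Rooms expire after an hour so unused rooms don't pile up.
    room = requests.post(
        f"{DAILY_API_URL}/rooms",
        headers=headers,
        json={"properties": {"exp": time.time() + 3600}},
    ).json()

    # A meeting token authorizes a participant (our bot or a user) to join.
    token = requests.post(
        f"{DAILY_API_URL}/meeting-tokens",
        headers=headers,
        json={"properties": {"room_name": room["name"]}},
    ).json()

    return {"room_url": room["url"], "token": token["token"]}
```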
Here we use the Daily Token to authenticate these requests. Later in the tutorial we will show you how to connect these API endpoints to our frontend repository.
OpenAI Realtime
First we will show you how you can implement the OpenAI Realtime API using the Pipecat framework to easily get an end-to-end voice bot created. At the time of writing, the OpenAI realtime implementation is still in a draft PR here but should be merged soon.
To start, add the following to your cerebrium.toml
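An illustrative sketch of the relevant sections (the hardware values are placeholders, and the git-reference syntax for pip dependencies should be checked against Cerebrium's docs - the final repo has the exact file):

```toml
[cerebrium.deployment]
name = "openai-realtime-comparison"
python_version = "3.11"

[cerebrium.hardware]
cpu = 4
memory = 16.0

[cerebrium.dependencies.pip]
# Pipecat installed straight from GitHub (the draft-PR branch) rather than
# from PyPI - replace <draft-pr-branch> with the branch you need.
pipecat-ai = "git+https://github.com/pipecat-ai/pipecat.git@<draft-pr-branch>"
requests = "latest"
openai = "latest"
```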
You will see we install the Pipecat framework from GitHub directly. Usually you would install a pip package under [cerebrium.dependencies.pip]; however, at the time of writing we were using a draft PR implementation to achieve the desired functionality. The rest of the packages are what we require to implement our functionality.
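With the dependencies in place, we can set up the transport in main.py. A sketch, assuming the Pipecat import paths from the version used at the time (room_url and token come from the create_room endpoint above):

```python
from pipecat.transports.services.daily import DailyParams, DailyTransport
from pipecat.vad.silero import SileroVADAnalyzer

transport = DailyTransport(
    room_url,      # the Meeting URL of the Daily room the bot should join
    token,         # the meeting token that authorizes the bot to join
    "Voice bot",
    DailyParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
        transcription_enabled=True,
        vad_enabled=True,
        vad_analyzer=SileroVADAnalyzer(),  # drives Pipecat-side endpointing
    ),
)
```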
In this code snippet, we initialize a Daily transport layer to handle audio and transcription within a WebRTC call. This setup involves configuring the Daily room using a designated Meeting URL and token, which grants the bot permission to join the room as an active participant.
We leverage Pipecat for phrase endpointing, which detects when a speaker has completed their turn. The implementation here supports both OpenAI’s built-in turn detection and Pipecat’s custom solution. Based on our tests, Pipecat’s approach currently offers superior performance, although this may change over time.
Next, we configure the OpenAI realtime language model service with specific parameters:
Transcription is enabled for input audio, allowing the system to process and interpret spoken language in real time.
Turn detection is set with a 1000-millisecond silence threshold, which helps determine when a speaker has finished talking and when to process the next response.
We provide detailed instructions for the AI assistant, setting its behavior to be helpful and concise. Additionally, the assistant is given the capability to switch between two services, “openai_realtime” and “custom,” which it can do upon request.
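Assembled, the session configuration might look like the sketch below (the class names follow the draft-PR implementation of the OpenAI Realtime service and may change once it is merged):

```python
from pipecat.services.openai_realtime_beta import (
    InputAudioTranscription,
    SessionProperties,
    TurnDetection,
)

session_properties = SessionProperties(
    # Transcribe the caller's audio so we can process it in real time.
    input_audio_transcription=InputAudioTranscription(),
    # OpenAI's server-side turn detection with a 1000 ms silence threshold.
    # Set this to False to use Pipecat's own endpointing instead.
    turn_detection=TurnDetection(silence_duration_ms=1000),
    instructions="You are a helpful and concise assistant. You can switch "
    "between the 'openai_realtime' and 'custom' services when asked.",
)
```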
The OpenAI realtime service is initialized with the following key settings:
An API key is included for authentication purposes. In production environments, sensitive data like this should be securely retrieved using a function such as get_secret.
We pass the session properties configured earlier, which include parameters for transcription and turn detection.
The start_audio_paused parameter is set to False, meaning the audio stream begins in an active state.
Lastly, we initialize a list of messages with a system message that defines the AI assistant’s role and abilities. This message establishes the context in which the assistant operates, including its capacity to switch between services and offer optimized responses based on performance and cost considerations.
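A sketch of that initialization, assuming the draft-PR class name and that get_secret resolves values from your Cerebrium Secrets:

```python
from pipecat.services.openai_realtime_beta import OpenAIRealtimeBetaLLMService

llm_realtime = OpenAIRealtimeBetaLLMService(
    api_key=get_secret("OPENAI_API_KEY"),   # never hard-code this
    session_properties=session_properties,
    start_audio_paused=False,               # audio stream starts active
)

messages = [
    {
        "role": "system",
        "content": (
            "You are a helpful, concise assistant. You can switch between "
            "the 'openai_realtime' and 'custom' services when the user asks, "
            "choosing whichever best balances performance and cost."
        ),
    }
]
```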
To show an example of tool calling, we implement the following function - it allows us to switch to a custom implementation of the OpenAI Realtime API, which we get to in the next section. It's called tools_realtime because the OpenAI Realtime API expects a slightly different function format.
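Here is a sketch of both the definition and a matching handler (switch_service and its argument schema are our illustrative choices; note how the Realtime format keeps name and parameters at the top level rather than nesting them under a function key as the Chat Completions API does):

```python
tools_realtime = [
    {
        "type": "function",
        "name": "switch_service",
        "description": "Switch between the available voice services.",
        "parameters": {
            "type": "object",
            "properties": {
                "service": {
                    "type": "string",
                    "enum": ["openai_realtime", "custom"],
                },
            },
            "required": ["service"],
        },
    }
]

# Tracks which branch of the pipeline is live; used by the filters later on.
current_service = "openai_realtime"


async def switch_service(function_name, tool_call_id, args, llm, context, result_callback):
    global current_service
    current_service = args["service"]
    await result_callback({"service": current_service})
```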
Above we are creating the function definition. We pass this to the context manager for our OpenAI realtime implementation.
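Roughly, assuming Pipecat's OpenAILLMContext serves as that context manager:

```python
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext

context = OpenAILLMContext(messages, tools_realtime)
context_aggregator = llm_realtime.create_context_aggregator(context)

# Wire the tool definition to its Python handler.
llm_realtime.register_function("switch_service", switch_service)
```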
We bring it all together with the Pipecat framework:
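A minimal sketch of the wiring, following Pipecat's standard voice-bot pattern:

```python
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask

pipeline = Pipeline([
    transport.input(),               # audio frames in from the Daily room
    context_aggregator.user(),       # aggregate user turns into the context
    llm_realtime,                    # speech in -> speech out
    transport.output(),              # synthesized audio back to the room
    context_aggregator.assistant(),  # record the assistant's responses
])

task = PipelineTask(pipeline, PipelineParams(allow_interruptions=True))
runner = PipelineRunner()
await runner.run(task)
```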
Lastly, there is some functionality we would like to trigger when someone joins/leaves a meeting which can be done using Pipecat events.
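For example (the event names are Pipecat's Daily transport events; the handler bodies are illustrative):

```python
from pipecat.frames.frames import EndFrame


@transport.event_handler("on_first_participant_joined")
async def on_first_participant_joined(transport, participant):
    # Start capturing the caller's speech as soon as they join.
    await transport.capture_participant_transcription(participant["id"])


@transport.event_handler("on_participant_left")
async def on_participant_left(transport, participant, reason):
    # Shut the pipeline down cleanly when the caller leaves.
    await task.queue_frame(EndFrame())
```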
Custom Implementation: OpenAI + Rime
We will now implement a custom service to show how similar latency can be achieved using OpenAI's GPT-4 as the LLM and Rime for the text-to-speech model. This ultimately results in a 3x cost saving. If you were to host your LLM on Cerebrium's serverless infrastructure (please look at the articles here), latencies would be roughly 50% lower while remaining 3x cheaper.
Rime trains and deploys models specifically for enterprise Voice AI: lightning fast, reliable, and highly customizable. Rime's bespoke architectures are trained entirely on a proprietary set of conversational data, so the voices they offer sound like real people rather than trained voiceover artists. They power millions of customer contacts over the phone and at the drive-thru, and have maintained 100% API uptime throughout 2024.
To continue, sign up for a Rime account here to get an API key. Then activate the free trial and store your API key in your Cerebrium Secrets.
In the GitHub repository we created a rime.py file - this combines Rime with the Pipecat framework. Please add the following to your main.py:
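What follows is a sketch rather than the repo's exact code: RimeTTSService is the class defined in rime.py, so its constructor arguments here are assumptions, and the voice name is a placeholder.

```python
from pipecat.services.openai import OpenAILLMService

from rime import RimeTTSService  # the helper class from the repo's rime.py

llm = OpenAILLMService(
    api_key=get_secret("OPENAI_API_KEY"),
    model="gpt-4",
)

tts = RimeTTSService(
    api_key=get_secret("RIME_API_KEY"),
    voice="<rime-voice>",  # pick a voice from Rime's catalog
)
```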
In the above code, we initialize the OpenAI LLM service as well as the Rime TTS service. The rest is similar to the previous steps.
Below we slightly alter the pipeline so that frames are routed end-to-end based on the selected service.
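One way to express this is with Pipecat's ParallelPipeline and FunctionFilter, gating each branch on the current_service flag that our switch_service tool updates - a sketch:

```python
from pipecat.pipeline.parallel_pipeline import ParallelPipeline
from pipecat.processors.filters.function_filter import FunctionFilter


async def openai_realtime_filter(frame) -> bool:
    return current_service == "openai_realtime"


async def custom_filter(frame) -> bool:
    return current_service == "custom"


pipeline = Pipeline([
    transport.input(),
    ParallelPipeline(
        # Branch 1: the end-to-end OpenAI Realtime service.
        [FunctionFilter(openai_realtime_filter), llm_realtime],
        # Branch 2: our custom GPT-4 + Rime stack.
        [
            FunctionFilter(custom_filter),
            context_aggregator.user(),
            llm,
            tts,
            context_aggregator.assistant(),
        ],
    ),
    transport.output(),
])
```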
Deploy to Cerebrium
To deploy this application to Cerebrium, simply run the command cerebrium deploy in your terminal.
If it deployed successfully, you should see a build success confirmation in your terminal.
Connect the frontend
We created a public fork of the frontend to show you a demo of this application. You can clone the repo here.
Follow the instructions in the README.md and then populate the following variables in your .env.development.local:
VITE_SERVER_URL=https://api.cortex.cerebrium.ai/v4/p-xxxxx/<APP_NAME> # This is the base URL. Do not include the function names.
VITE_SERVER_AUTH= # This is the JWT token you can get from the API Keys section of your Cerebrium Dashboard.
You can now run yarn dev and go to http://localhost:5173/ to test your application!
Conclusion
While OpenAI’s Realtime API has made significant strides in integrating LLMs with voice capabilities, Cerebrium’s collaboration with Rime offers a powerful alternative that delivers enhanced performance, cost savings, and flexibility. By leveraging the Pipecat framework, this solution seamlessly integrates realtime voice processing with advanced AI capabilities, providing a robust platform for enterprises and developers alike.
With this setup, you can efficiently handle transcription and turn detection, enabling responsive, conversational AI assistants that can easily switch between services for a tailored experience. Our approach not only surpasses the OpenAI Realtime API in terms of speed and affordability but also opens up opportunities for further customization and scalability.