Tutorial

Oct 14, 2024

An Alternative to OpenAI Realtime API for Voice Capabilities

Michael Louis

Founder & CEO

The recent launch of OpenAI’s Realtime API has sparked excitement within the developer community, offering a unified solution that combines large language models (LLMs) and voice capabilities. While this is a significant advancement, we at Cerebrium, in collaboration with Rime, have developed an alternative solution that not only matches but surpasses the OpenAI offering in several crucial ways.

In this post, we’ll explore how Cerebrium and Rime provide an advanced, cost-effective, and flexible platform for implementing LLM + voice solutions. We’ll dive into the benefits of our approach over OpenAI’s Realtime API and how this partnership leverages the Pipecat framework from Daily to deliver an efficient and customizable solution for enterprises and developers.

You can play with a final demo of the implementation here and see the final code here.

Cerebrium Setup

First let’s create a Cerebrium project. If you don’t have a Cerebrium account, you can create one by signing up here and following the documentation here to get set up.

In your IDE, run the following command to create the Cerebrium starter project: cerebrium init openai-realtime-comparison. This creates two files:

  • main.py - Our entrypoint file where our code lives

  • cerebrium.toml - A configuration file that contains all our build and environment settings

Cerebrium is the underlying platform on which our entire application will run. As we go through the tutorial, we will edit the two files above as well as add additional files.

Daily Setup

Daily provides developers with powerful APIs and SDKs to easily integrate realtime video and audio capabilities into their applications. We use Daily to stream audio to and from the live call that our bot joins.

If you don’t have a Daily account you can sign up for one here (they have a generous free tier). Next, get your Daily API key to authenticate calls: navigate to the “developers” tab in your Daily dashboard, copy the key, and add it to your Cerebrium Secrets as DAILY_TOKEN.

Next, we need to create the endpoints that create a Daily room that both we and our users can join - this is the live call that our bot will monitor. You can look at the Daily docs to see the options available to you. Add the following code to your main.py:

import time

import requests
from cerebrium import get_secret

def create_room():
    url = "<https://api.daily.co/v1/rooms/>"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {get_secret('DAILY_TOKEN')}"
    }
    data = {
        "properties": {
            "exp": int(time.time()) + 60*5, ##5 mins
            "eject_at_room_exp" : True
        }
    }

    response = requests.post(url, headers=headers, json=data)
    if response.status_code == 200:
        room_info = response.json()
        token = create_token(room_info['name'])
        if token and 'token' in token:
            room_info['token'] = token['token']
        else:
            print("Failed to create token")
            return {"message": 'There was an error creating your room', "status_code": 500}
        return room_info
    else:
        return {"message": 'There was an error creating your room', "status_code": 500}

def create_token(room_name: str):

    url = "<https://api.daily.co/v1/meeting-tokens>"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {get_secret('DAILY_TOKEN')}"
    }
    data = {
        "properties": {
            "room_name": room_name,
            "is_owner": True,
        }
    }

    response = requests.post(url, headers=headers, json=data)
    if response.status_code == 200:
        token_info = response.json()
        return token_info
    else:
        print(f"Failed to create token: {response.status_code}")
        return None

Here we use the Daily Token to authenticate these requests. Later in the tutorial we will show you how to connect these API endpoints to our frontend repository.
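If you want to sanity-check these functions before wiring them up to the frontend, a quick local run might look like the sketch below (the __main__ guard is purely illustrative; on Cerebrium, create_room is exposed as an HTTP endpoint and get_secret reads from your Cerebrium Secrets):

# Illustrative local test only (not part of the deployed app).
if __name__ == "__main__":
    room = create_room()
    if "url" in room:
        print(f"Room URL:    {room['url']}")
        print(f"Owner token: {room['token']}")
    else:
        print(f"Error: {room['message']}")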

OpenAI Realtime

First, we will show you how to implement the OpenAI Realtime API using the Pipecat framework to quickly create an end-to-end voice bot. At the time of writing, Pipecat’s OpenAI Realtime implementation is still in a draft PR here but should be merged soon.

To start, add the following to your cerebrium.toml:

[cerebrium.deployment]
# ... existing deployment settings ...
shell_commands = ["pip install git+https://github.com/pipecat-ai/pipecat.git@khk/openai-realtime-beta#egg=pipecat-ai[silero,daily,openai,deepgram]"]

[cerebrium.dependencies.pip]
torch = ">=2.0.0"
aiohttp = ">=3.9.4"
torchaudio = ">=2.3.0"
channels = ">=4.0.0"
requests = "==2.32.2"
vllm = "latest"
huggingface_hub = "latest"

[cerebrium.dependencies.apt]
git = "latest"

You will see that we install the Pipecat framework directly from GitHub. Usually you would install a pip package under [cerebrium.dependencies.pip]; however, at the time of writing we were using a draft PR implementation to get the desired functionality. The rest of the packages are what we require to implement our functionality.
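Once the realtime implementation is merged upstream, you could presumably drop the shell_commands line and install the published package like any other pip dependency. The exact syntax below is an assumption (check the Pipecat release notes and Cerebrium docs for the package name and extras):

[cerebrium.dependencies.pip]
"pipecat-ai[silero,daily,openai,deepgram]" = "latest"

With the dependencies in place, add the following imports and setup code to your main.py: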

import asyncio
import aiohttp

from openai.types.chat import ChatCompletionToolParam
# Pipeline, context, and frame imports below are used in the snippets that follow
from pipecat.frames.frames import EndFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.transports.services.daily import DailyParams, DailyTransport
from pipecat.vad.silero import SileroVADAnalyzer
from pipecat.vad.vad_analyzer import VADParams
from pipecat.services.openai_realtime_beta import OpenAILLMServiceRealtimeBeta, SessionProperties, InputAudioTranscription, TurnDetection

async def combined_main(room_url: str, token: str):
    async with aiohttp.ClientSession() as session:
        transport = DailyTransport(
            room_url,
            token,
            "Respond bot",
            DailyParams(
                audio_in_enabled=True,
                audio_in_sample_rate=24000,
                audio_out_enabled=True,
                audio_out_sample_rate=24000,
                transcription_enabled=True,
                vad_enabled=True,
                vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.8)),
                vad_audio_passthrough=True,
            ),
        )
        session_properties = SessionProperties(
            input_audio_transcription=InputAudioTranscription(),
            # turn_detection=False,
            turn_detection=TurnDetection(silence_duration_ms=1000),
            instructions="""
                You are a helpful and friendly AI assistant. Keep your responses concise and to the point.
                You can switch between two services: 'openai_realtime' and 'custom'. Use the switch_service
                function when asked to change services.
                """,
        )
        openai_realtime = OpenAILLMServiceRealtimeBeta(
            api_key=get_secret("OPENAI_API_KEY"),
            session_properties=session_properties,
            start_audio_paused=False,
        )

        messages = [
            {
                "role": "system",
                "content": "You are a helpful AI assistant that can switch between two services to showcase the difference in performance and cost: 'openai_realtime' and 'custom'. Respond to user queries and switch services when asked.",
            },
        ]

In this code snippet, we initialize a Daily transport layer to handle audio and transcription within a WebRTC call. This setup involves configuring the Daily room using a designated Meeting URL and token, which grants the bot permission to join the room as an active participant.

We leverage Pipecat for phrase endpointing, which detects when a speaker has completed their turn. The implementation here supports both OpenAI’s built-in turn detection and Pipecat’s custom solution. Based on our tests, Pipecat’s approach currently offers superior performance, although this may change over time.
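To make the toggle concrete, here is a minimal sketch of the two configurations, mirroring the commented-out turn_detection=False line in the snippet above (the instructions string is elided):

# Option A: let OpenAI's server-side turn detection end the turn after 1s of silence.
session_properties = SessionProperties(
    input_audio_transcription=InputAudioTranscription(),
    turn_detection=TurnDetection(silence_duration_ms=1000),
    instructions="...",
)

# Option B: disable server-side turn detection and rely on Pipecat's
# SileroVADAnalyzer (configured on the DailyTransport) for endpointing.
session_properties = SessionProperties(
    input_audio_transcription=InputAudioTranscription(),
    turn_detection=False,
    instructions="...",
)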

Next, we configure the OpenAI realtime language model service with specific parameters:

  • Transcription is enabled for input audio, allowing the system to process and interpret spoken language in real time.

  • Turn detection is set with a 1000-millisecond silence threshold, which helps determine when a speaker has finished talking and when to process the next response.

  • We provide detailed instructions for the AI assistant, setting its behavior to be helpful and concise. Additionally, the assistant is given the capability to switch between two services, “openai_realtime” and “custom,” which it can do upon request.

The OpenAI realtime service is initialized with the following key settings:

  • An API key is included for authentication purposes. Sensitive data like this is retrieved securely from Cerebrium Secrets via get_secret rather than hard-coded.

  • We pass the session properties configured earlier, which include parameters for transcription and turn detection.

  • The start_audio_paused parameter is set to False, meaning the audio stream begins in an active state.

Lastly, we initialize a list of messages with a system message that defines the AI assistant’s role and abilities. This message establishes the context in which the assistant operates, including its capacity to switch between services and offer optimized responses based on performance and cost considerations.

To show an example of tool calling, we define the following function schema - it allows the assistant to switch to the custom implementation that we build in the next section. We call the list tools_realtime because the OpenAI Realtime API expects a slightly different function format from the standard Chat Completions API.

tools_realtime = [
    {
        "type": "function",
        "name": "switch_service",
        "description": "Switch to the service when the user asks you to",
        "parameters": {
            "type": "object",
            "properties": {
                "service": {
                    "type": "string",
                    "description": "The service the user wants you to switch to",
                },
            },
            "required": ["service"],
        },
    }
]

openai_realtime_context = OpenAILLMContext(
    messages=messages,
    tools=tools_realtime
)
context_aggregator = openai_realtime.create_context_aggregator(openai_realtime_context)

Above, we create the function definition and pass it to the context aggregator for our OpenAI Realtime implementation.

We bring it all together with the Pipecat framework:

pipeline = Pipeline(
    [
        transport.input(),  # Transport user input
        context_aggregator.user(),
        openai_realtime,  # LLM + voice
        context_aggregator.assistant(),
        transport.output(),  # Transport bot output
    ]
)

task = PipelineTask(
    pipeline,
    PipelineParams(
        allow_interruptions=True,
        enable_metrics=True,
        enable_usage_metrics=True,
    ),
)

runner = PipelineRunner()
await runner.run(task)
await session.close()

Lastly, there is some functionality we would like to trigger when someone joins or leaves a meeting, which we can do with Pipecat’s transport event handlers.

@transport.event_handler("on_first_participant_joined")
async def on_first_participant_joined(transport, participant):
  transport.capture_participant_transcription(participant["id"])
  time.sleep(1.5)
  messages.append(
      {
        "role": "system",
        "content": "Introduce yourself and explain that you can switch between 'openai_realtime' and 'openai_rime' services.",
      }
  )
  await task.queue_frames([context_aggregator.user().get_context_frame()])
  # await task.queue_frame(LLMMessagesFrame(messages))

@transport.event_handler("on_participant_left")
async def on_participant_left(transport, participant, reason):
  await task.queue_frame(EndFrame())

@transport.event_handler("on_call_state_updated")
async def on_call_state_updated(transport, state):
  if state == "left":
      await task.queue_frame(EndFrame())

Custom Implementation: OpenAI+Rime

We will now implement a custom service to show how similar latency can be achieved using OpenAI’s GPT-4 as the LLM and Rime for text-to-speech, ultimately resulting in roughly a 3x cost saving. If you host your LLM on Cerebrium’s serverless infrastructure (see our other articles here), latencies are 50% faster while remaining roughly 3x cheaper.

Rime trains and deploys models built specifically for enterprise voice AI: lightning fast, reliable, and highly customizable. Rime trains bespoke architectures entirely on a proprietary set of conversational data, so its voices sound like real people rather than trained voiceover artists. Rime powers millions of customer contacts over the phone and at the drive-thru, and has maintained 100% API uptime over 2024.

To continue, sign up for a Rime account here to get an API key. Then activate the free trial and store your API key in your Cerebrium Secrets as RIME_API_KEY.

In the GitHub repository we created a rime.py file that wraps Rime’s TTS API as a Pipecat service. Add the following to your main.py:

from pipecat.pipeline.parallel_pipeline import ParallelPipeline
from pipecat.processors.filters.function_filter import FunctionFilter
from pipecat.services.openai import OpenAILLMService
from rime import RimeTTSService

async def combined_main(room_url: str, token: str):
    # ... previous implementation from the OpenAI Realtime section ...

        # OpenAI LLM + Rime TTS service
        openai_llm = OpenAILLMService(
            name="LLM",
            api_key=get_secret("OPENAI_API_KEY"),
            model="gpt-4",
        )
        openai_llm.register_function("switch_service", switch_service)

        rime_tts = RimeTTSService(
            name="Voice",
            api_key=get_secret("RIME_API_KEY"),
            voice="grove",
            modelId="mist",
            sample_rate=24000,
            encoding="linear16"
        )

        tools_custom = [
            ChatCompletionToolParam(
                type="function",
                function={
                    "name": "switch_service",
                    "description": "Switch to the service when the user asks you to",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "service": {
                                "type": "string",
                                "description": "The service the user wants you to switch to",
                            },
                        },
                        "required": ["service"],
                    },
                },
            )
        ]

        custom_context = OpenAILLMContext(
            messages=messages,
            tools=tools_custom
        )
        context_aggregator_custom = openai_llm.create_context_aggregator(custom_context)

In the above code, we initialize the OpenAI LLM service and the Rime TTS service, and register the switch_service function on the LLM so it can be invoked through tool calling. The rest is similar to the previous steps.

Below, we slightly alter the pipeline, using ParallelPipeline with FunctionFilter processors to route frames through whichever service is currently selected (the filter coroutines and the switch_service handler are sketched after the pipeline).

pipeline = Pipeline(
    [
        transport.input(),  # Transport user input
        ParallelPipeline(
            # Branch 1: OpenAI Realtime (speech-to-speech)
            [
                FunctionFilter(openai_realtime_filter),
                Pipeline([
                    context_aggregator.user(),
                    openai_realtime,  # LLM + voice
                    context_aggregator.assistant(),
                ]),
            ],
            # Branch 2: GPT-4 + Rime TTS (relies on Daily's transcription for STT)
            [
                FunctionFilter(custom_filter),
                Pipeline([
                    context_aggregator_custom.user(),
                    openai_llm,
                    rime_tts,
                    context_aggregator_custom.assistant(),
                ]),
            ],
        ),
        transport.output(),  # Transport bot output
    ]
)
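The openai_realtime_filter, custom_filter, and switch_service referenced above live in the demo repository. As a rough sketch, assuming they simply toggle a shared flag (and with signatures following Pipecat’s ParallelPipeline and function-calling examples at the time of writing), they could look like this:

# Shared flag tracking which branch of the ParallelPipeline is active.
current_service = "openai_realtime"

async def openai_realtime_filter(frame) -> bool:
    # Pass frames to the Realtime branch only when it is the active service.
    return current_service == "openai_realtime"

async def custom_filter(frame) -> bool:
    # Pass frames to the GPT-4 + Rime branch only when it is the active service.
    return current_service == "custom"

async def switch_service(function_name, tool_call_id, args, llm, context, result_callback):
    # Invoked by Pipecat when the LLM calls the switch_service tool.
    global current_service
    current_service = args.get("service", current_service)
    await result_callback({"service": current_service})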

Deploy to Cerebrium

To deploy this application to Cerebrium you can simply run the command: cerebrium deploy in your terminal.

If it deployed successfully, you will see a confirmation in your terminal.
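Before connecting the frontend, you can also smoke-test the deployed create_room endpoint directly. A rough sketch (the URL pattern follows the VITE_SERVER_URL format shown in the next section, and the JWT token comes from the API Keys section of your Cerebrium dashboard):

# Hypothetical smoke test for the deployed endpoint - replace the placeholders.
import requests

BASE_URL = "https://api.cortex.cerebrium.ai/v4/p-xxxxx/<APP_NAME>"
headers = {"Authorization": "Bearer <YOUR_JWT_TOKEN>"}

response = requests.post(f"{BASE_URL}/create_room", headers=headers)
print(response.status_code)
print(response.json())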

Connect frontend

We created a public fork of the frontend to show you a demo of this application. You can clone the repo here.

Follow the instructions in the README.md and then populate the following variables in your .env.development.local file:

VITE_SERVER_URL=https://api.cortex.cerebrium.ai/v4/p-xxxxx/<APP_NAME> #This is the base url. Do not include the function names

VITE_SERVER_AUTH= #This is the JWT token you can get from the API Keys section of your Cerebrium Dashboard.

You can now run yarn dev and go to http://localhost:5173/ to test your application!

Conclusion

While OpenAI’s Realtime API has made significant strides in integrating LLMs with voice capabilities, Cerebrium’s collaboration with Rime offers a powerful alternative that delivers enhanced performance, cost savings, and flexibility. By leveraging the Pipecat framework, this solution seamlessly integrates realtime voice processing with advanced AI capabilities, providing a robust platform for enterprises and developers alike.

With this setup, you can efficiently handle transcription and turn detection, enabling responsive, conversational AI assistants that can easily switch between services for a tailored experience. Our approach not only surpasses the OpenAI Realtime API in terms of speed and affordability but also opens up opportunities for further customization and scalability.

© 2024 Cerebrium, Inc.
