Tutorial
Jul 21, 2024
Creating a realtime RAG voice agent
Michael Louis
Founder
Educational resources are highly accessible today. Anyone, anywhere can watch online content from some of the most renowned institutions and individuals in the world, as long as they have an internet connection. However, something has been missing from this loop. What if you don't understand something in the lecture? How do you ask a video questions? Well, what if you could do exactly that?
In this tutorial, we build our very own personalised tutor embodying Andrej Karpathy, the former Director of AI at Tesla and a founding member of OpenAI. We selected a portion of his YouTube videos to build the tutor's knowledge base. While you watch, you can ask him about various concepts in the lecture and have him explain them to you in Andrej's original voice!
You can find the final code example here and a demo here.
Tutorial:
Cerebrium will be hosting the entire application, so if you don’t have a Cerebrium account you can create one by signing up here (we give you $30 in free credits). Follow the documentation here to get set up.
Data processing
In order to get started, we need to create our Cerebrium project. It consists of two files:
main.py - The entrypoint file where our code lives.
cerebrium.toml - The configuration file that contains all our build and environment settings.
We first need to start with our data processing task - downloading Andrej’s YouTube videos, then chunking and embedding the transcripts. Since this is a one-off task and not part of our live application, create a new Python file called data-processing.py.
Following good practice, let us create a Python virtual environment so that our application dependencies stay consistent: run python -m venv venv and activate it with source venv/bin/activate (or venv\Scripts\activate on Windows). We will also be using environment variables, so install the python-dotenv pip package to enable this: pip install python-dotenv. You should now be in your new environment!
In order to create the RAG element of our application, we need to download Andrej’s YouTube videos, transcribe them and finally upload them to our vector database. This will allow the application to retrieve the transcriptions it needs in order to give the LLM the correct context.
To download the videos locally, we can use the pytube library. In this tutorial we only need to download the audio of each video. Run pip install pytube, and create the following script:
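A minimal sketch of such a script, assuming pytube's YouTube/streams API and a hypothetical list of video URLs, could look like this:

```python
# data-processing.py (download step)
from pytube import YouTube


def download_audio(url: str, output_dir: str = "audio_files") -> str:
    """Download only the audio stream of a YouTube video and return the file path."""
    yt = YouTube(url)
    audio_stream = yt.streams.filter(only_audio=True).first()
    return audio_stream.download(output_path=output_dir)


# Hypothetical list of lecture URLs - replace with the videos you want to index.
VIDEO_URLS = [
    "https://www.youtube.com/watch?v=<video-id>",
]

if __name__ == "__main__":
    for url in VIDEO_URLS:
        print(f"Downloaded {download_audio(url)}")
```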
Next, we need to transcribe the audio files into text. We do this using the Deepgram API. You can sign up for a Deepgram account here (they have a generous free tier) and create an API key right from the initial screen. Run pip install deepgram-sdk httpx. Create a .env file in your project root and add your API key to it - we named ours DEEPGRAM_API_KEY. We can then create the following code to transcribe the audio file and return the text:
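A minimal sketch of that transcription step, assuming Deepgram's v3 Python SDK (the DeepgramClient pre-recorded listen API) and the nova-2 model, could look like this:

```python
import os

from dotenv import load_dotenv
from deepgram import DeepgramClient, PrerecordedOptions

load_dotenv()  # reads DEEPGRAM_API_KEY from the .env file


def transcribe_audio(file_path: str) -> str:
    """Send a local audio file to Deepgram and return the transcript text."""
    deepgram = DeepgramClient(os.environ["DEEPGRAM_API_KEY"])

    with open(file_path, "rb") as audio:
        payload = {"buffer": audio.read()}

    options = PrerecordedOptions(model="nova-2", smart_format=True)
    response = deepgram.listen.prerecorded.v("1").transcribe_file(payload, options)

    # The transcript sits on the first channel's first alternative.
    return response.results.channels[0].alternatives[0].transcript
```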
Next, we need to embed the text in a vector database so that our application can easily retrieve the context our LLM needs to respond effectively. There are many articles on the different models and strategies to choose for this task. We recommend looking at the tool here to see what might suit your use case.
For simplicity, we will use OpenAI for embeddings and Pinecone as our vector store. You can sign up for an OpenAI account here and a Pinecone account here. We will use the Langchain framework to create our RAG application, and so we will also use it to chunk, embed, upload and store our transcribed text.
You will need to create an index in Pinecone that we will upload our embeddings to. Since we are using the OpenAI embeddings model, you must set the dimension to 1536, and we will use the cosine metric to measure similarity.
Run pip install -qU langchain-text-splitters langchain_openai langchain_pinecone. You will then need to fetch API keys from OpenAI and Pinecone and add them to your .env file - we called ours OPENAI_API_KEY and PINECONE_API_KEY respectively. You can then implement the code below:
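A minimal sketch of that step, assuming a hypothetical Pinecone index called andrej-tutor and illustrative chunk-size/overlap values, could look like this:

```python
import os

from dotenv import load_dotenv
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

load_dotenv()  # reads OPENAI_API_KEY and PINECONE_API_KEY from the .env file


def embed_and_store(transcript: str, index_name: str = "andrej-tutor") -> None:
    """Chunk a transcript, embed each chunk with OpenAI and upsert it into Pinecone."""
    # Illustrative chunk size and overlap - tune these for your own content.
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    documents = splitter.create_documents([transcript])

    embeddings = OpenAIEmbeddings()  # defaults to a 1536-dimension embedding model
    PineconeVectorStore.from_documents(documents, embeddings, index_name=index_name)
```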
The above code takes our transcribed text, chunks it based on the chunk size and overlap we set, and then uploads the embeddings to our index in Pinecone.
Lastly, let us bring this all together:
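Reusing the hypothetical download_audio, transcribe_audio and embed_and_store helpers sketched above, the final part of data-processing.py could look like this:

```python
if __name__ == "__main__":
    for url in VIDEO_URLS:
        audio_path = download_audio(url)
        print(f"Downloaded {audio_path}")

        transcript = transcribe_audio(audio_path)
        print(f"Transcribed {len(transcript)} characters")

        embed_and_store(transcript)
        print("Uploaded embeddings to Pinecone")
```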
You can then run the script with: python data-processing.py. You should see some logs and it will take about 5 minutes to execute.
You should then be able to navigate to your index in Pinecone and see some records.
Voice Agent
We previously did a tutorial on building a voice agent on Cerebrium using Deepgram, Daily and the Pipecat framework. If you haven’t read through it, we recommend you do as here we are only going to talk about the changes we made as well as how we implemented RAG with Pipecat.
What is different about this implementation is that we will be using an external API for our LLM instead of a local model. We did this for two reasons:
To showcase how you can utilize external LLMs if you require more performant models, such as those from OpenAI or Anthropic. This does come with a latency trade-off (~800ms vs ~100ms locally).
To show how you can make your implementation run on a CPU instead of an H100 with similarly performant models, so you don’t hit capacity/cost constraints.
I recommend you clone the repository at this point - below I explain the code changes and only show snippets of what changed.
To start, let us upload our secrets from our .env file to our Cerebrium account so we can use them in our application. Navigate to “Secrets” in your Cerebrium Dashboard and upload your .env
file - you should see your values populate. We will reference these values later in the tutorial.
In your cerebrium.toml
file, make sure you have the following set:
Here we are:
Setting our base Docker image to our Daily image that contains the Deepgram model locally. This makes the STT conversion extremely quick, since it happens locally rather than over the network.
We set our compute type to CPU since we are calling an API for our LLM and do not need a GPU.
We listed the pip packages we need for our application.
Voice Cloning
In order to make our demo more realistic, we thought we would clone Andrej’s voice using ElevenLabs so that the voice speaking on the video sounds less robotic. You can sign up for an ElevenLabs account here. It comes with a generous free tier; however, if you would like to do the voice cloning you need to upgrade to their Starter plan, which costs $5.
In order to clone a voice on ElevenLabs, you need to upload a voice recording that is less than 4MB in size. Since we already downloaded audio files in our data-processing step, we just need to cut them down using a platform like Clideo (it's free). Once you have cut down one of the files, you can upload it to ElevenLabs - you should then get back a voiceID that we will use later in our application.
You need to upload your ElevenLabs API key to your Secrets! We called ours ELEVENLABS_API_KEY - we will use this in the next step.
Pipecat
Below is our base Pipecat implementation with a few changes:
We use ElevenLabs for our TTS element, using the voice we cloned in the previous step - we updated the voiceID to the one ElevenLabs returned (see the snippet below).
We implement the Pipecat Langchain integration to create a conversation agent that can remember the history of our conversation. We will edit this section of the code in the next step.
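As a rough illustration of the first change, Pipecat ships an ElevenLabs TTS service; the exact constructor arguments vary between Pipecat releases, but wiring in the cloned voice looks roughly like this (the voice ID below is a placeholder for the one you received when cloning):

```python
import os

import aiohttp
from pipecat.services.elevenlabs import ElevenLabsTTSService


async def create_tts(session: aiohttp.ClientSession) -> ElevenLabsTTSService:
    # Use the cloned Andrej voice; the API key comes from the ELEVENLABS_API_KEY secret.
    return ElevenLabsTTSService(
        aiohttp_session=session,
        api_key=os.environ["ELEVENLABS_API_KEY"],
        voice_id="<your-cloned-voice-id>",
    )
```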
Langchain RAG Pipeline
In order to create a RAG pipeline with Langchain, we can just create a retrieval chain. This requires:
An LLM which in this case will be OpenAI’s new GPT-4o-mini model.
We will use OpenAI for our embeddings and Pinecone for our vector store - just like in the data processing step.
We will then use RunnableWithMessageHistory from Langchain in order to use our message history along with the retrieval chain for our LLM context - see the sketch below.
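A sketch of such a chain, assuming the hypothetical andrej-tutor index from the data-processing step and an illustrative system prompt, could look like this:

```python
import os

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory

llm = ChatOpenAI(model="gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])

# Reuse the index we populated in the data-processing step.
vector_store = PineconeVectorStore.from_existing_index("andrej-tutor", OpenAIEmbeddings())
retriever = vector_store.as_retriever()

prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are Andrej Karpathy, a helpful AI tutor. Answer the student's question "
     "using the lecture context below.\n\n{context}"),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])

# Stuff the retrieved lecture chunks into the prompt, then wrap it in a retrieval chain.
answer_chain = create_stuff_documents_chain(llm, prompt)
retrieval_chain = create_retrieval_chain(retriever, answer_chain)

# Keep one message history per session so the agent remembers the conversation.
message_store: dict[str, ChatMessageHistory] = {}


def get_session_history(session_id: str) -> ChatMessageHistory:
    return message_store.setdefault(session_id, ChatMessageHistory())


history_chain = RunnableWithMessageHistory(
    retrieval_chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="chat_history",
    output_messages_key="answer",  # the retrieval chain returns its reply under "answer"
)
```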
You could also implement a history_aware_retriever from Langchain, which generates a new retrieval query based on your chat history, vector store and original question. We found this added too much latency and didn’t make a big enough impact on the results.
Langchain’s retrieval chain returns a dict that provides its response under the “answer” key, as indicated by the output_messages_key parameter above. We therefore need to extend the Pipecat Langchain processor to cater for this. In helpers.py, add the following code:
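A sketch of that subclass, assuming Pipecat's LangchainProcessor internals at the time of writing and using the standard-library difflib for the fuzzy match (the original may use a different matcher, phrase and threshold), could look like this:

```python
from difflib import SequenceMatcher

from langchain_core.messages import AIMessageChunk
from pipecat.processors.frameworks.langchain import LangchainProcessor
from pipecat.frames.frames import (
    LLMFullResponseStartFrame,
    LLMFullResponseEndFrame,
    TextFrame,
)


class LangchainRAGProcessor(LangchainProcessor):
    """LangchainProcessor variant that understands retrieval-chain dict outputs."""

    @staticmethod
    def __get_token_value(text) -> str:
        # Plain strings and AIMessageChunks behave as in the parent class;
        # our retrieval chain streams dicts whose generated text sits under "answer".
        if isinstance(text, str):
            return text
        elif isinstance(text, AIMessageChunk):
            return text.content
        elif isinstance(text, dict) and "answer" in text:
            return text["answer"]
        return ""

    async def _ainvoke(self, text: str):
        # If the user is just telling the tutor to resume the video,
        # skip the LLM entirely. Phrase and threshold are illustrative.
        resume_phrase = "we can continue with the video"
        if SequenceMatcher(None, text.lower(), resume_phrase).ratio() > 0.7:
            return

        await self.push_frame(LLMFullResponseStartFrame())
        async for token in self._chain.astream(
            {self._transcript_key: text},
            config={"configurable": {"session_id": self._participant_id}},
        ):
            await self.push_frame(TextFrame(self.__get_token_value(token)))
        await self.push_frame(LLMFullResponseEndFrame())
```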
There are three things to notice here:
We extend the LangchainProcessor from Pipecat since it already contains a lot of the functionality we need - I simply edited some of the functions.
In __get_token_value we look for when the AIMessageChunk contains the dict key “answer”, since that is what our retrieval chain returns - we return this value in that case.
In our _ainvoke method (essentially calling Langchain invoke) we do fuzzy matching on what the user said in order to pick up when they say we can continue with playing the video. We do this to stop the message going to the LLM and getting a response. You could do this with function calling, but for simplicity of the demo I did this with fuzzy matching.
Now, in your main.py, you can add the following under your history_chain variable:
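For example, assuming the LangchainRAGProcessor sketched above lives in helpers.py:

```python
from helpers import LangchainRAGProcessor

# Wrap the history-aware retrieval chain so Pipecat can stream its "answer" tokens.
lc = LangchainRAGProcessor(chain=history_chain, transcript_key="input")
```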
Deploy to Cerebrium
To deploy this application to Cerebrium you can simply run the command: cerebrium deploy
in your terminal.
If it deployed successfully, you should see something like this:
We will add these endpoints to our frontend interface.
Connect frontend
We created a public fork of the Pipecat frontend in order to show you a demo of this application. You can clone the repo here.
Follow the instructions in the README.md
and then populate the following variables in your .env.development.local
VITE_SERVER_URL=https://api.cortex.cerebrium.ai/v4/p-xxxxx/<APP_NAME> # This is the base URL. Do not include the function names.
VITE_SERVER_AUTH= # This is the JWT token you can get from the API Keys section of your Cerebrium Dashboard.
You can now run yarn dev and go to http://localhost:5173/ to test your application!
Conclusion
This tutorial shows how you can build a scalable personalized tutor using Cerebrium and a variety of complementary services (Daily, Deepgram, ElevenLabs, OpenAI etc). Combining RAG and voice unlocks a myriad of applications, and since it's fully customizable you can make your own trade-offs in terms of latency, cost and accuracy.
In a space that is moving as quickly as AI, it is our job at Cerebrium to constantly innovate and think about what the future of industries might look like, so we can make sure we are in the best position to support them. With our entire team being from South Africa, education has always been an important topic, and so we thought about what impact we could have on an industry as important as this. As Nelson Mandela once famously said: “Education is the most powerful tool we can use to change the world.”
Tag us as @cerebriumai so we can see what you build, and please feel free to ask questions or send feedback in our Slack or Discord communities.