January 27, 2025
Deploying DeepSeek-R1: A Guide to a Serverless, High-Performance, OpenAI-Compatible Endpoint

Michael Louis
CEO & Founder
DeepSeek, a Chinese artificial intelligence startup, has recently unveiled their first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero is a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, and it demonstrated remarkable reasoning performance. However, it encountered challenges such as endless repetition, poor readability, and language mixing. To address these issues and further enhance reasoning performance, DeepSeek introduced DeepSeek-R1, which incorporates cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1 across math, code, and reasoning tasks.
To support the research community, DeepSeek has open-sourced DeepSeek-R1-Zero, DeepSeek-R1, and six dense models distilled from DeepSeek-R1 based on Llama and Qwen. DeepSeek-R1-Distill-Qwen-32B outperforms OpenAI-o1-mini across various benchmarks, achieving new state-of-the-art results for dense models.
In this tutorial, we will show you how to deploy DeepSeek on Cerebrium’s serverless architecture, enabling you to create a scalable, OpenAI-compatible endpoint using vLLM.
Why Build with DeepSeek-R1 on Cerebrium
Cost efficiency: Since Cerebrium is serverless, you can quickly spin up your model, execute workloads, and then scale back down, paying only for the compute you used. You can further reduce costs by implementing batching.
Security & Compliance: Cerebrium is SOC 2 and HIPAA compliant and provides various other safeguards to keep your customers’ data safe and secure.
Ease of Deployment: Cerebrium’s streamlined serverless infrastructure simplifies the process of deploying and managing AI models. With pre-integrated tools like vLLM, you can quickly configure DeepSeek-R1 into an OpenAI-compatible endpoint without complex setup.
Scalability: Cerebrium’s architecture allows you to seamlessly scale workloads based on demand, whether you’re running small inference tasks or large-scale batch jobs, ensuring performance consistency without over-provisioning.
Below is a table summarizing the hardware configurations required for the different model types.

Tutorial
In this tutorial, we will show you how to deploy deepseek-ai/DeepSeek-R1-Distill-Qwen-32B as an OpenAI-compatible endpoint using the popular vLLM inference framework.
Cerebrium setup
If you don’t have a Cerebrium account, you can create one by signing up here and following the documentation here to get set up.
In your IDE, run the following command to create our Cerebrium starter project: cerebrium init deepseek. This creates two files:
main.py - Our entrypoint file where our code lives
cerebrium.toml - A configuration file that contains all our build and environment settings
Add the following pip packages and hardware requirements near the bottom of your cerebrium.toml. These will be used to create our deployment environment.
Add or change the following in your cerebrium.toml file:
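The original configuration listing is not reproduced here, but a minimal sketch is shown below. The section names follow Cerebrium’s usual cerebrium.toml layout, and the GPU type, counts, and memory figures are illustrative assumptions for the 32B distilled model rather than prescribed values; check the Cerebrium documentation for the exact schema your project needs.

```toml
[cerebrium.deployment]
name = "deepseek"
python_version = "3.11"

[cerebrium.hardware]
# Illustrative values: a 32B model in bf16 needs roughly 65+ GB of GPU memory,
# so multiple A100/H100-class GPUs (or one large-memory GPU) are assumed here.
cpu = 4
memory = 60.0
compute = "AMPERE_A100"
gpu_count = 2

[cerebrium.dependencies.pip]
vllm = "latest"
huggingface_hub = "latest"
```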
To start, let us define our imports and initialize our model. Add the following to your main.py:
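The tutorial’s exact listing is not reproduced here; the sketch below shows one way this could look, assuming vLLM’s AsyncLLMEngine API. The function name run, its parameters (messages, temperature, max_tokens, stream, run_id), the prompt handling, and the engine settings are illustrative assumptions, not the tutorial’s verbatim code.

```python
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Initialize the engine once at import time so it is reused across requests.
engine_args = AsyncEngineArgs(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    gpu_memory_utilization=0.9,  # illustrative setting
    max_model_len=8192,          # illustrative setting
)
engine = AsyncLLMEngine.from_engine_args(engine_args)


async def run(
    messages: list,             # chat messages in OpenAI format
    temperature: float = 0.8,   # optional parameters with default values
    max_tokens: int = 1024,
    stream: bool = True,
    run_id: str = "",           # added automatically by Cerebrium for every request
):
    sampling_params = SamplingParams(temperature=temperature, max_tokens=max_tokens)

    # Flatten the chat messages into a single prompt string (simplified; a real
    # deployment would apply the model's chat template via its tokenizer).
    prompt = "\n".join(f"{m['role']}: {m['content']}" for m in messages)

    previous_text = ""
    async for output in engine.generate(prompt, sampling_params, request_id=run_id):
        full_text = output.outputs[0].text
        new_text = full_text[len(previous_text):]
        previous_text = full_text
        if stream:
            # Yielding from an async function is how streaming works on Cerebrium.
            yield new_text
    if not stream:
        # When not streaming, emit the entire result once at the end.
        yield previous_text
```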
In the code above, the following is happening:
We specify all the parameters we send in our function signature. You can set optional or default values. The run_id parameter is automatically added to your function and contains a unique identifier for every request.
We put the entire prompt through the model and loop through the generated results.
If stream=True, we yield each result. Since we are using an async function with yield, this is how we achieve streaming functionality on Cerebrium; otherwise, we return the entire result at the end.
Deploy & Inference
To deploy the model, use the following command:
cerebrium deploy
Once deployed, you should see that the POST /run endpoint is live.
In Cerebrium, every function name is an endpoint, so to call this function we would end the URL with /run. However, OpenAI-compatible endpoints need to end with /chat/completions. We have made all endpoints OpenAI compatible, so to call the endpoint you can do the following in another file called test.py:
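The original test.py listing is not shown here; below is a minimal sketch using the openai Python client. The base URL shape and the JWT token are placeholders you would replace with the values returned by cerebrium deploy, and the prompt is just an example.

```python
from openai import OpenAI

client = OpenAI(
    # Placeholder: use the endpoint URL returned by `cerebrium deploy`, ending in /run.
    base_url="https://api.cortex.cerebrium.ai/v4/<PROJECT_ID>/deepseek/run",
    # Placeholder: your Cerebrium JWT token (shown in the deploy output or under API Keys).
    api_key="<YOUR_JWT_TOKEN>",
)

chat_completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    stream=True,
)

# Print streamed tokens as they arrive.
for chunk in chat_completion:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```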
Above, we set our base URL to the one returned by our deploy command; it ends in /run since that is the function we are calling. Lastly, we use our JWT token, which is returned in the cURL command when you deploy, or can be found in your Cerebrium dashboard under the API Keys section.
Voilà! You now have an OpenAI-compatible endpoint that you can customize to your liking!