Cerebrium articles | Deploying DeepSeek-R1: A Guide to a Serverless, High-Performaning OpenAI-Compatible Endpoint

January 27, 2025

Deploying DeepSeek-R1: A Guide to a Serverless, High-Performaning OpenAI-Compatible Endpoint

Michael Louis

CEO & Founder

DeepSeek, a Chinese artificial intelligence startup, has recently unveiled their first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, is a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, which demonstrated remarkable performance on reasoning. However, it encounted challenges such as endless repetition, poor readability, and language mixing. To address these issues and further enhance reasoning performance, Deepseek introduced DeepSeek-R1, which incorporates cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1 across math, code, and reasoning tasks.

To support the research community, DeepSeek have open-sourced DeepSeek-R1-Zero, DeepSeek-R1, and six dense models distilled from DeepSeek-R1 based on Llama and Qwen. DeepSeek-R1-Distill-Qwen-32B outperforms OpenAI-o1-mini across various benchmarks, achieving new state-of-the-art results for dense models.

In this tutorial we will show you how to deploy Deepseek on Cerebrium’s servlerss architecture enabling you to create a scalable OpenAI compatible endpoint using vLLM.

Why Build with Deepseek-R1 on Cerebrium

Cost efficiencies: Since Cerebrium is serverless you can quickly spin up your model, execute workloads and then scale back down only paying for the compute you used. You can further reduce costs by implementing batching.
Security & Compliance: Cerebrium is SOC 2 and HIPAA complaint and has various other methods in order for you to keep your customers data safe and secure.
Ease of Deployment: Cerebrium’s streamlined serverless infrastructure simplifies the process of deploying and managing AI models. With pre-integrated tools like vLLM, you can quickly configure DeepSeek-R1 into an OpenAI-compatible endpoint without complex setup.
Scalability: Cerebrium’s architecture allows you to seamlessly scale workloads based on demand, whether you’re running small inference tasks or large-scale batch jobs, ensuring performance consistency without over-provisioning.

Below is a table summarizing the hardware configurations required for the different model types.

Tutorial

In this tutorial, we will show you how to deploy deepseek-ai/DeepSeek-R1-Distill-Qwen-32B as a OpenAI compatible endpoint using the popular vLLM inference framework.

Cerebrium setup

If you don’t have a Cerebrium account, you can create one by signing up here and following the documentation here to get setup

In your IDE, run the following command to create our Cerebrium starter project: cerebrium init deepseek. This creates two files:

Main.py - Our entrypoint file where our code lives
cerebrium.toml - A configuration file that contains all our build and environment settings ‍ Add the following pip packages and hardware requirements near the bottom of your cerebrium.toml. This will be used in creating our deployment environment.

Add/change to the following in your cerebrium.toml file:

[cerebrium.hardware]
cpu = 4
memory = 16.0
compute = "ADA_L40"
gpu_count=2

[cerebrium.dependencies.pip]
vllm = "latest"
pydantic = "latest"

To start, let us define our imports and initialize our model - add the following to your main.py

from vllm import  SamplingParams. AsyncLLMEngine
from vllm.engine.arg_utils import AsyncEngineArgs
from pydantic import BaseModel
from typing import Any
import time
import json

engine_args = AsyncEngineArgs(
  model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
  gpu_memory_utilization=0.9,
  max_model_len=8192
  tensor_parallel_size=2 #we require 2 gpus
)

engine = AsyncLLMEngine.from_engine_args(engine_args)

class Message(BaseModel):
  role: str
  content: str
    
class ChatCompletionResponse(BaseModel):
  id: str
  object: str
  created: int
  model: str
  choices: List[Any]
    
async def run(messages: list, model: str, run_id: str, stream: bool = True, temperature: float = 0.8, top_p: float = 0.95):
  
  prompt = " ".join([f"{Message(**msg).role}: {Message(**msg).content}" for msg in messages])
  sampling_params = SamplingParams(temperature=temperature, top_p=top_p)
  results_generator = engine.generate(prompt, sampling_params, run_id)
  previous_text = ""
  first_chunk = True

  async for output in results_generator:
      prompt_output = output.outputs
      new_text = prompt_output[0].text[len(previous_text) :]
      previous_text = prompt_output[0].text

      # Construct OpenAI-compatible chunk
      chunk = {
          "id": run_id,
          "object": "chat.completion.chunk",
          "created": int(time.time()),
          "model": model,
          "choices": [
              {
                  "index": 0,
                  "delta": {},
                  "finish_reason": None,
              }
          ],
      }

      # Include the role in the first chunk
      if first_chunk:
          chunk["choices"][0]["delta"]["role"] = "assistant"
          first_chunk = False

      # Add new text to the delta if any
      if new_text:
          chunk["choices"][0]["delta"]["content"] = new_text

      # Capture a finish reason if it's provided
      finish_reason = prompt_output[0].finish_reason or None
      if finish_reason and finish_reason != "none":
          chunk["choices"][0]["finish_reason"] = finish_reason

      yield f"data: {json.dumps(chunk)}\n\n"

  # Send the final [DONE] message
  yield "data: [DONE]\n\n"

Above the following is happening:

We specify all the parameters we send in our function signature. You can set optional or default values. The run_id parameter we automatically add to your function with a unique identifier for every request.
We put the entire prompt through the model and loop through the generated results.
If stream=True, we yield a result. Since we are using a async function and yield, this is how we achieve streaming functionality on Cerebrium else we return the entire result at the end.

Deploy & Inference

To deploy the model use the following command:

cerebrium deploy

Once deployed, you will see that it should show that the POST /run endpoint it live.

In Cerebrium, every function name is now and endpoint so to call this endpoint we would end the URL with /run. However, OpenAI compatible endpoints need to end with /chat/completions. We have made all endpoints OpenAI compatible so to call the endpoint you can do the following in another file called test.py:

import os
from openai import OpenAI
  
client = OpenAI( 
  base_url="https://api.cortex.cerebrium.ai/v4/p-xxxxxxx/deepseek/run", 
  api_key="<CEREBRIUM_JWT_TOKEN>",
)
  
chat_completion = client.chat.completions.create(
  messages=[ 
    {"role": "user", "content": "What is a mistral?"},
    {"role": "assistant", "content": "A mistral is a type of cold, dry wind that blows across the southern slopes of the Alps from the Valais region of Switzerland into the Ligurian Sea near Genoa. It is known for its strong and steady gusts, sometimes reaching up to 60 miles per hour."},
    {"role": "user", "content": "How does the mistral wind form?"} ], 
  model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B", 
  stream=True
)
  
for chunk in chat_completion: 
  print(chunk)
print("Finished receiving chunks.")

Above we set our base url to the one returned by our deploy command - it ends in /run since that’s the function we are calling. Lastly, we use our JWT token, which is returned in the CURL command when you deploy or can be found in your Cerebrium dashboard under the section API Keys.

Voilà! You now have a OpenAI compatible endpoint that you can customize to your liking!

Product

Pricing

Developers

Docs

Status

Company

Blog

Use cases

Large language models

Voice

Image & Video

Resources

Examples

Articles