
# OpenAI-compatible vLLM endpoint

> Create an OpenAI-compatible endpoint using the vLLM framework

This tutorial creates an OpenAI-compatible endpoint that works with any open-source model. Use existing OpenAI code with Cerebrium serverless functions by changing just two lines of code.

The final code is available [here](https://github.com/CerebriumAI/examples/tree/master/5-large-language-models/1-openai-compatible-endpoint).

### Cerebrium setup

Create a Cerebrium account by signing up [here](https://dashboard.cerebrium.ai/register) and follow the [installation docs](https://docs.cerebrium.ai/getting-started/installation).

Run the following command to create the Cerebrium starter project: `cerebrium init 1-openai-compatible-endpoint`. This creates two files:

* `main.py`: The entrypoint file where application code lives
* `cerebrium.toml`: A configuration file that contains all build and environment settings

Add the following pip packages and hardware requirements to your `cerebrium.toml` to create your deployment environment:

```toml theme={null}
[cerebrium.hardware]
cpu = 2
memory = 12.0
compute = "AMPERE_A10"

[cerebrium.dependencies.pip]
vllm = "latest"
pydantic = "latest"
```

Define the imports and initialize the model. This example uses Meta's Llama 3.1 model, which requires Hugging Face authorization. Add the HF token to secrets in the Cerebrium dashboard, then add this code to `main.py`:

```python theme={null}
from vllm import SamplingParams, AsyncLLMEngine
from vllm.engine.arg_utils import AsyncEngineArgs
from pydantic import BaseModel
from typing import Any, List
import time
import json
import os
from huggingface_hub import login

# Your huggingface token (HF_AUTH_TOKEN) should be stored in your project secrets on your dashboard
login(token=os.environ.get("HF_AUTH_TOKEN"))

engine_args = AsyncEngineArgs(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.9,  # Increase GPU memory utilization
    max_model_len=8192  # Decrease max model length
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```
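If the secret is missing, `os.environ.get` returns `None` and the login call fails with a less obvious error. A minimal sketch of a fail-fast check, assuming the secret is exposed as an environment variable as in the code above; `require_env` is a hypothetical helper, not part of Cerebrium or `huggingface_hub`:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Missing environment variable {name}; add it to your project secrets."
        )
    return value
```

With this in place, the login line could read `login(token=require_env("HF_AUTH_TOKEN"))`.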

Next, define the required output format for OpenAI endpoints using Pydantic:

```python theme={null}
class Message(BaseModel):
    role: str
    content: str

class ChatCompletionResponse(BaseModel):
    id: str
    object: str
    created: int
    model: str
    choices: List[Any]

async def run(messages: list, model: str, run_id: str, stream: bool = True, temperature: float = 0.8, top_p: float = 0.95):
    prompt = " ".join([f"{Message(**msg).role}: {Message(**msg).content}" for msg in messages])
    sampling_params = SamplingParams(temperature=temperature, top_p=top_p)
    results_generator = engine.generate(prompt, sampling_params, run_id)
    previous_text = ""
    full_text = ""  # Collect all generated text here

    async for output in results_generator:
        outputs = output.outputs  # Separate name so the prompt string is not shadowed
        # Each output carries the cumulative text, so slice off what was already sent
        new_text = outputs[0].text[len(previous_text):]
        previous_text = outputs[0].text
        full_text += new_text  # Append new text to full_text

        response = ChatCompletionResponse(
            id=run_id,
            object="chat.completion",
            created=int(time.time()),
            model=model,
            choices=[{
                "text": new_text,
                "index": 0,
                "logprobs": None,
                "finish_reason": outputs[0].finish_reason or "stop"
            }]
            }]
        )
        print(response.model_dump())
        yield f"data: {json.dumps(response.model_dump())}\n\n"

    # Send the final [DONE] message
    yield "data: [DONE]\n\n"
```

The function:

* Takes parameters through its signature, with optional and default values available
* Automatically receives a unique `run_id` for each request
* Processes the entire prompt through the model
* Streams results when `stream=True` using async functionality
* Returns the complete result at the end if streaming is disabled
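The two core steps in `run`, flattening the messages into a prompt and turning cumulative output into incremental chunks, can be tried without a GPU. The sketch below is an illustration only: `fake_engine` is a hypothetical stand-in for `engine.generate`, plain dicts replace the Pydantic models, and the payload is reduced to a single `text` field.

```python
import asyncio
import json
from typing import AsyncIterator, Dict, List


def build_prompt(messages: List[Dict[str, str]]) -> str:
    """Flatten chat messages into one "role: content" string, as `run` does."""
    return " ".join(f"{m['role']}: {m['content']}" for m in messages)


async def fake_engine(snapshots: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for engine.generate: yields the cumulative text so far."""
    for text in snapshots:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield text


async def stream_deltas(snapshots: List[str]) -> AsyncIterator[str]:
    """Turn cumulative snapshots into SSE events that carry only the new text."""
    previous_text = ""
    async for text in fake_engine(snapshots):
        new_text = text[len(previous_text):]  # slice off what was already sent
        previous_text = text
        yield f"data: {json.dumps({'text': new_text})}\n\n"
    yield "data: [DONE]\n\n"  # same terminator that `run` sends


async def collect(snapshots: List[str]) -> List[str]:
    return [event async for event in stream_deltas(snapshots)]


prompt = build_prompt([
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello"},
])
events = asyncio.run(collect(["The", "The mistral", "The mistral is"]))
print(prompt)
for event in events:
    print(event, end="")
```

Each event contains only the text generated since the previous one, which is why the client can concatenate chunks without deduplicating.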

## Deploy & Inference

To deploy the model, run the following command:

```bash theme={null}
cerebrium deploy
```

After deployment, a curl command like this appears:

```curl theme={null}
curl --location 'https://api.aws.us-east-1.cerebrium.ai/v4/p-<YOUR PROJECT ID>/5-openai-compatible-endpoint/{function}' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <YOUR TOKEN HERE>' \
--data '{"..."}'
```

In Cerebrium, each function name becomes an endpoint (ending with `/run`). While OpenAI-compatible endpoints typically end with `/chat/completions`, all Cerebrium endpoints are OpenAI-compatible. Call the endpoint as follows:

```python theme={null}
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.aws.us-east-1.cerebrium.ai/v4/p-xxxxxxx/5-openai-compatible-endpoint/run",
    api_key="<CEREBRIUM_JWT_TOKEN>",
)

chat_completion = client.chat.completions.create(
    messages=[
        {"role": "user", "content": "What is a mistral?"},
        {"role": "assistant", "content": "A mistral is a type of cold, dry wind that blows across the southern slopes of the Alps from the Valais region of Switzerland into the Ligurian Sea near Genoa. It is known for its strong and steady gusts, sometimes reaching up to 60 miles per hour."},
        {"role": "user", "content": "How does the mistral wind form?"}
    ],
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    stream=True
)
for chunk in chat_completion:
    print(chunk)
print("Finished receiving chunks.")
```

Set the base URL to the one from the deploy command (ending in `/run`). Use the JWT token from either the curl command or the Cerebrium dashboard's API Keys section.
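If you call the endpoint without the OpenAI client (for example with curl), the stream can be decoded by hand. The sketch below assumes the response reaches the client as the `data: ...` lines that `run` yields, ending with `data: [DONE]`; whether the platform forwards them unchanged is an assumption, `parse_sse_lines` is a hypothetical helper, and the sample lines are hand-written rather than captured from a real deployment.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def parse_sse_lines(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield the JSON payload of each `data:` line until `[DONE]` arrives."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything else
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)


# Hand-written sample in the shape `run` emits.
sample = [
    'data: {"choices": [{"text": "Hello", "index": 0}]}',
    "",
    'data: {"choices": [{"text": " world", "index": 0}]}',
    "",
    "data: [DONE]",
]
full_text = "".join(
    chunk["choices"][0]["text"] for chunk in parse_sse_lines(sample)
)
print(full_text)
```

Any iterable of decoded lines works as input, so the same function could be fed from an HTTP client's line iterator.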
