# Serving GPT-OSS with vLLM

> Deploy OpenAI's Latest Open Source Model

OpenAI recently released GPT-OSS ([gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) and [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b)), two state-of-the-art open-weight language models that deliver strong real-world performance at low cost. Available under the flexible Apache 2.0 license, these models outperform similarly sized open models on reasoning tasks, demonstrate strong tool use capabilities, and are optimized for efficient deployment on consumer hardware.

## What Makes GPT-OSS Special?

GPT-OSS introduces capabilities that set it apart from other open-source LLMs:

* **Mixture of Experts (MoE) Architecture**: The model comes in 20B and 120B parameter variants, but uses MoE to keep active parameters low while maintaining strong capabilities
* **MXFP4 Quantization**: A novel 4-bit floating point format specifically designed for MoE layers, enabling efficient serving
* **Attention Sinks**: Special attention mechanism that allows for longer context lengths without degrading output quality
* **Harmony Response Format**: Built-in support for structured outputs like chain-of-thought reasoning and tool use. See examples from OpenAI [here](https://cookbook.openai.com/articles/openai-harmony)

<Note>
  As of 6th August 2025, vLLM can only run GPT-OSS on NVIDIA H100, H200, and
  B200 GPUs, as well as AMD MI300x, MI325x, MI355x, and Radeon AI PRO R9700.
</Note>

This tutorial covers the simplest variation of deploying this model using `vllm serve`. For more control, see the [OpenAI compatible endpoint with vLLM guide](/v4/examples/openai-compatible-endpoint-vllm).

### Project Setup

1. Run `cerebrium init gpt-oss`.
2. Edit your `cerebrium.toml` file with the following settings:

```
[cerebrium.deployment]
name = "7-openai-gpt-oss"
python_version = "3.12"
docker_base_image_url = "nvidia/cuda:12.8.1-devel-ubuntu22.04"
disable_auth = true
include = ['./*', 'main.py', 'cerebrium.toml']
exclude = ['.*']
pre_build_commands = [
    "apt-get update",
    "apt-get install -y curl",
    "curl -LsSf https://astral.sh/uv/install.sh | sh",
    "export PATH=\"$HOME/.local/bin:$PATH\" && uv pip install --pre vllm==0.10.1+gptoss --extra-index-url https://wheels.vllm.ai/gpt-oss/ --extra-index-url https://download.pytorch.org/whl/nightly/cu128 --index-strategy unsafe-best-match",
    "uv pip install huggingface_hub[hf_transfer]==0.34"
]

[cerebrium.hardware]
cpu = 8.0
memory = 18.0
compute = "HOPPER_H100"

[cerebrium.scaling]
min_replicas = 0
max_replicas = 5
cooldown = 30
replica_concurrency = 32
scaling_metric = "concurrency_utilization"

[cerebrium.runtime.custom]
port = 8000
entrypoint = ["sh", "-c", "export HF_HUB_ENABLE_HF_TRANSFER=1 && export VLLM_USE_V1=1 && vllm serve openai/gpt-oss-20b --enforce-eager"]
```

Key configuration details:

* The `docker_base_image_url` is set to the <b>cuda:12.8.1-devel image</b>, which is large but necessary to include all required packages/libraries.
* Pre-build commands install uv (a faster Python package installer) and the required vLLM packages. These commands execute at the start of the build process, before dependency installation begins, making them essential for setting up the build environment. Read more [here](https://docs.cerebrium.ai/container-images/defining-container-images#pre-build-commands).
* Hardware is set to an <b>H100</b> in the <b>us-east-1 region</b>.
* <b>Replica concurrency is 32</b>, meaning a single H100 container handles up
  to 32 concurrent requests.
* `vllm serve` turns the container into an <b>OpenAI compatible server</b> running on <b>port 8000</b>.
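
To make the scaling numbers concrete, here is a rough capacity sketch in Python. It assumes each replica accepts at most `replica_concurrency` requests and that a new replica is added once the existing ones are full. The real autoscaler's behavior depends on `scaling_metric` and is likely less abrupt, so treat this as back-of-the-envelope arithmetic rather than a description of how Cerebrium schedules containers. The function names are illustrative only.

```python
import math

# Values copied from the [cerebrium.scaling] section of the config.
REPLICA_CONCURRENCY = 32
MAX_REPLICAS = 5


def replicas_needed(in_flight_requests: int) -> int:
    """Replicas required if each one serves at most REPLICA_CONCURRENCY requests.

    Simplifying assumption: replicas fill up completely before another starts.
    """
    if in_flight_requests <= 0:
        return 0  # min_replicas = 0 lets the app scale to zero when idle
    return min(MAX_REPLICAS, math.ceil(in_flight_requests / REPLICA_CONCURRENCY))


def max_in_flight() -> int:
    """Upper bound on concurrent requests across all replicas."""
    return REPLICA_CONCURRENCY * MAX_REPLICAS
```

Under these assumptions the deployment tops out at 160 concurrent requests (5 replicas × 32 each).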

### Deploy & Test

Deploy by running `cerebrium deploy`. The environment is created and the model is downloaded.

Test the endpoint with the following request:

```
curl --location 'https://api.aws.us-east-1.cerebrium.ai/v4/p-xxxxxx/7-openai-gpt-oss/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Accept: text/event-stream' \
--data '{"messages": [{"role": "user", "content": "hello how are you"}], "model": "openai/gpt-oss-20b", "stream": true}'
```
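
The same request can be made from Python. The sketch below uses only the standard library and assumes the endpoint streams OpenAI-style server-sent events (`data: {...}` lines ending with `data: [DONE]`), which is what vLLM's OpenAI-compatible server is expected to emit. The URL keeps the `p-xxxxxx` placeholder from the curl example, the model name is assumed to match the one passed to `vllm serve`, and the helper names are hypothetical.

```python
import json
import urllib.request

# Placeholder URL from the curl example: replace p-xxxxxx with your project ID.
ENDPOINT = (
    "https://api.aws.us-east-1.cerebrium.ai/v4/p-xxxxxx/"
    "7-openai-gpt-oss/v1/chat/completions"
)


def build_payload(prompt: str) -> bytes:
    """Encode a streaming chat request for the served model."""
    body = {
        "model": "openai/gpt-oss-20b",  # assumed to match the `vllm serve` argument
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }
    return json.dumps(body).encode("utf-8")


def parse_sse_line(line: str):
    """Return the text delta carried by one SSE line, or None if it has none."""
    line = line.strip()
    if not line.startswith("data:"):
        return None  # blank keep-alive lines and comments
    data = line[len("data:"):].strip()
    if data == "[DONE]":
        return None  # end-of-stream sentinel
    choices = json.loads(data).get("choices") or []
    if not choices:
        return None
    return choices[0].get("delta", {}).get("content")


def stream_chat(prompt: str) -> str:
    """POST the prompt, print tokens as they arrive, and return the full text."""
    request = urllib.request.Request(
        ENDPOINT,
        data=build_payload(prompt),
        headers={"Content-Type": "application/json", "Accept": "text/event-stream"},
        method="POST",
    )
    pieces = []
    with urllib.request.urlopen(request) as response:
        for raw in response:  # the response object yields one line at a time
            text = parse_sse_line(raw.decode("utf-8"))
            if text:
                pieces.append(text)
                print(text, end="", flush=True)
    return "".join(pieces)
```

Calling `stream_chat("hello how are you")` against a deployed endpoint should print the reply incrementally. If you re-enable authentication, you would also need to add an authorization header to the request.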

On the first request, a container spins up, loads the model, and streams the output. As of 6th August 2025, this generates roughly 30 tokens per second.
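
To check throughput on your own deployment, one option is to time a non-streaming request and divide the generated token count by the elapsed time. The sketch below assumes the response body follows the OpenAI chat completions schema and includes a `usage.completion_tokens` field. The function name is hypothetical, and the timing should be taken on a warm container so that cold-start and model-loading time are excluded.

```python
def throughput_from_response(response: dict, elapsed_seconds: float) -> float:
    """Estimate generated tokens per second for one completed request.

    `response` is the parsed JSON body of a non-streaming chat completion;
    `elapsed_seconds` is the wall-clock time you measured around the call.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    completion_tokens = response["usage"]["completion_tokens"]
    return completion_tokens / elapsed_seconds
```

For example, a response reporting 300 completion tokens that took 10 seconds works out to 30 tokens per second.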
