OpenAI recently released GPT-OSS (gpt-oss-20b and gpt-oss-120b), two state-of-the-art open-weight language models that deliver strong real-world performance at low cost. Available under the flexible Apache 2.0 license, these models outperform similarly sized open models on reasoning tasks, demonstrate strong tool-use capabilities, and are optimized for efficient deployment on consumer hardware.
What Makes GPT-OSS Special?
GPT-OSS introduces capabilities that set it apart from other open-weight LLMs:
- Mixture of Experts (MoE) Architecture: The model comes in 20B and 120B parameter variants, but uses MoE to keep active parameters low while maintaining strong capabilities
- MXFP4 Quantization: A 4-bit floating-point format applied to the MoE layers, enabling efficient serving
- Attention Sinks: Special attention mechanism that allows for longer context lengths without degrading output quality
- Harmony Response Format: Built-in support for structured outputs such as chain-of-thought reasoning and tool calls. See OpenAI's Harmony examples for details
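To give a feel for the Harmony format, here is a simplified sketch of how a conversation is laid out as text using Harmony's special tokens. This is illustrative only: the real format also includes channels and other metadata, and in practice vLLM applies the model's chat template for you, so you never build this string by hand.

```python
# Simplified illustration of Harmony-style conversation rendering.
# Real Harmony rendering (via OpenAI's openai-harmony library / the model's
# chat template) is richer than this; the token names follow OpenAI's spec.

def render_harmony(messages):
    """Render a list of {role, content} dicts into Harmony-style text."""
    parts = []
    for m in messages:
        parts.append(f"<|start|>{m['role']}<|message|>{m['content']}<|end|>")
    # The final assistant turn is left open for the model to complete.
    parts.append("<|start|>assistant")
    return "".join(parts)

prompt = render_harmony([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "hello how are you"},
])
print(prompt)
```

The served model already understands this layout, which is why chain-of-thought and tool calls come back in a structured, parseable form.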
Please note that as of 6 August 2025, vLLM supports these models only on NVIDIA H100, H200, and B200 GPUs, and on AMD MI300X, MI325X, MI355X, and Radeon AI PRO R9700.
This tutorial covers the simplest way to deploy this model using vllm serve. For more control, see the OpenAI-compatible endpoint with vLLM guide.
Project Setup
- Run the following command:

```shell
cerebrium init gpt-oss
```
- Edit your cerebrium.toml file with the following settings:
```toml
[cerebrium.deployment]
name = "7-openai-gpt-oss"
python_version = "3.12"
docker_base_image_url = "nvidia/cuda:12.8.1-devel-ubuntu22.04"
disable_auth = true
include = ['./*', 'main.py', 'cerebrium.toml']
exclude = ['.*']
pre_build_commands = [
    "apt-get update",
    "apt-get install -y curl",
    "curl -LsSf https://astral.sh/uv/install.sh | sh",
    "export PATH=\"$HOME/.local/bin:$PATH\" && uv pip install --pre vllm==0.10.1+gptoss --extra-index-url https://wheels.vllm.ai/gpt-oss/ --extra-index-url https://download.pytorch.org/whl/nightly/cu128 --index-strategy unsafe-best-match",
    "uv pip install huggingface_hub[hf_transfer]==0.34"
]

[cerebrium.hardware]
cpu = 8.0
memory = 18.0
compute = "HOPPER_H100"
provider = "aws"
region = "us-east-1"

[cerebrium.scaling]
min_replicas = 0
max_replicas = 5
cooldown = 30
replica_concurrency = 32
scaling_metric = "concurrency_utilization"

[cerebrium.runtime.custom]
port = 8000
entrypoint = ["sh", "-c", "export HF_HUB_ENABLE_HF_TRANSFER=1 && export VLLM_USE_V1=1 && vllm serve openai/gpt-oss-20b --enforce-eager"]
```
Key configuration details:
- The docker_base_image_url is set to the cuda:12.8.1-devel image, which is large but necessary to include all the required packages and libraries.
- Pre-build commands install uv (a faster Python package installer) and the required vLLM packages. These commands execute at the start of the build process, before dependency installation begins, making them essential for setting up the build environment. Read more here.
- Hardware is set to an H100 in the us-east-1 region.
- Replica concurrency is 32, meaning a single H100 container handles up to 32 concurrent requests.
- vllm serve turns the container into an OpenAI-compatible server running on port 8000.
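Because the server speaks the standard OpenAI chat-completions API, any OpenAI-compatible client can talk to it once deployed. As a sketch, the request body looks like this (the `build_chat_request` helper and the `p-xxxxxx` project ID in the base URL are illustrative placeholders, not part of the Cerebrium or vLLM APIs):

```python
import json

# Placeholder base URL; substitute your real Cerebrium project ID for p-xxxxxx.
BASE_URL = "https://api.aws.us-east-1.cerebrium.ai/v4/p-xxxxxx/7-openai-gpt-oss/v1"

def build_chat_request(model, user_message, stream=False):
    """Build an OpenAI-style chat-completions request body (hypothetical helper)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": stream,
    }

body = build_chat_request("openai/gpt-oss-20b", "hello how are you", stream=True)
print(json.dumps(body))
# POST this JSON to f"{BASE_URL}/chat/completions" with
# Content-Type: application/json (e.g. via curl, requests, or the openai SDK).
```

Equivalently, the official openai Python SDK works unchanged if you point its `base_url` at the same endpoint.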
Deploy & Test
Deploy by running cerebrium deploy. The environment is created and the model is downloaded.
Test the endpoint with the following request:
```shell
curl --location 'https://api.aws.us-east-1.cerebrium.ai/v4/p-xxxxxx/7-openai-gpt-oss/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Accept: text/event-stream' \
--data '{"messages": [{"role": "user", "content": "hello how are you"}], "model": "openai/gpt-oss-20b", "stream": true}'
```
On the first request, a container spins up, loads the model, and streams the output. As of 6 August 2025, this generates roughly 30 tokens per second.
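With "stream": true, the response arrives as Server-Sent Events in the standard OpenAI streaming shape: each `data:` line carries a JSON chunk with a `choices[0].delta`, and the stream ends with `data: [DONE]`. A small sketch of pulling the token text out of each line (the sample chunk below is illustrative, not captured from a real response):

```python
import json

def token_from_sse_line(line):
    """Extract the delta text from one OpenAI-style SSE line, or None."""
    if not line.startswith("data: "):
        return None                      # comments, keep-alives, blank lines
    payload = line[len("data: "):].strip()
    if payload == "[DONE]":              # end-of-stream sentinel
        return None
    chunk = json.loads(payload)
    return chunk["choices"][0]["delta"].get("content")

sample = 'data: {"choices": [{"delta": {"content": "Hello"}}]}'
print(token_from_sse_line(sample))  # → Hello
```

Iterating this helper over the response lines and concatenating the non-None results reassembles the full completion.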