TOML Reference - Cerebrium

The configuration is organized into the following main sections:

[cerebrium.deployment]: Core settings like app name, Python version, and file inclusion rules
[cerebrium.runtime.custom]: Custom web server settings and app startup behavior
[cerebrium.hardware]: Compute resources including CPU, memory, and GPU specifications
[cerebrium.scaling]: Auto-scaling behavior and replica management
[cerebrium.dependencies]: Package management for Python (pip), system (apt), and Conda dependencies

Deployment Configuration

The [cerebrium.deployment] section defines core deployment settings.

Option	Type	Default	Description
name	string	required	Desired app name
python_version	string	"3.12"	Python version to use (3.10, 3.11, 3.12)
disable_auth	boolean	false	Disable default token-based authentication on app endpoints
include	string[]	["*"]	Files/patterns to include in deployment
exclude	string[]	[".*"]	Files/patterns to exclude from deployment
shell_commands	string[]	[]	Commands to run at the end of the build
pre_build_commands	string[]	[]	Commands to run before dependencies install
docker_base_image_url	string	"debian:bookworm-slim"	Base Docker image
use_uv	boolean	false	Use UV for faster Python package installation
deployment_initialization_timeout	integer	600 (10 minutes)	Maximum time, in seconds, to wait for app initialisation during build before timing out. Must be between 60 and 830.

Changes to python_version or docker_base_image_url trigger full rebuilds since they affect the base environment.
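
As a rough illustration of how these options combine, the fragment below sets a few non-default values. The app name, the extra "tests/*" exclude pattern, and the echo commands are placeholders chosen for this sketch, not values the platform requires, and the exact pattern syntax is an assumption.

```toml
[cerebrium.deployment]
name = "my-app"                 # hypothetical app name
python_version = "3.11"         # changing this triggers a full rebuild
include = ["*"]
exclude = [".*", "tests/*"]     # "tests/*" is an illustrative pattern only
# Runs before dependencies are installed
pre_build_commands = ["echo 'before dependency install'"]
# Runs at the end of the build
shell_commands = ["echo 'build finished'"]
# Seconds; must be between 60 and 830
deployment_initialization_timeout = 300
```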

UV Package Manager

UV is a fast Python package installer written in Rust that significantly speeds up deployment times. When enabled, UV replaces pip for installing Python dependencies.

UV typically installs packages 10-100x faster than pip, which is especially beneficial for:

Large dependency trees
Multiple packages
Clean builds without cache

Example with UV enabled:

[cerebrium.deployment]
use_uv = true

Monitoring UV Usage

Check your build logs for these indicators:

UV_PIP_INSTALL_STARTED - UV is successfully being used
PIP_INSTALL_STARTED - Standard pip installation (when use_uv is false)

While UV is compatible with most packages, some edge cases may cause build failures, such as legacy packages with non-standard metadata.

Deploying with UV Lock Files

Read this section only if you're using pyproject.toml and uv.lock.

Generate your lock file locally. This creates a uv.lock file with exact dependency versions.

# In your project directory with pyproject.toml
uv sync

Export your locked dependencies to requirements.txt:

# Resolve dependencies directly from pyproject.toml:
uv pip compile pyproject.toml -o requirements.txt
# Or export the exact versions pinned in uv.lock:
uv export --format requirements-txt -o requirements.txt

Include in your deployment:

Ensure requirements.txt is in your project directory
Deploy with UV enabled
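
A minimal sketch of the matching configuration, assuming the exported requirements.txt sits in the project root and is referenced through the [cerebrium.dependencies.paths] section described later in this reference. The app name is a placeholder.

```toml
[cerebrium.deployment]
name = "my-app"          # hypothetical app name
use_uv = true            # install Python dependencies with UV instead of pip

[cerebrium.dependencies.paths]
pip = "requirements.txt" # the file exported from your lock file
```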

Runtime Configuration

The [cerebrium.runtime.custom] section configures custom web servers and runtime behavior.

Option	Type	Default	Description
port	integer	required	Port the application listens on
entrypoint	string[]	required	Command to start the application
healthcheck_endpoint	string	""	HTTP path for health checks (empty uses TCP). Failure causes the instance to restart
readycheck_endpoint	string	""	HTTP path for readiness checks (empty uses TCP). Failure ensures the load balancer does not route to the instance

The port specified in entrypoint must match the port parameter. All endpoints will be available at https://api.aws.us-east-1.cerebrium.ai/v4/{project-id}/{app-name}/your/endpoint
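
The fragment below sketches a runtime section where the port in the start command and the port option agree. The uvicorn command, the main:app module path, and the /health and /ready routes are assumptions about a hypothetical app; substitute whatever your server actually exposes.

```toml
[cerebrium.runtime.custom]
port = 5000
# The --port value must match the port option above
entrypoint = ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "5000"]
healthcheck_endpoint = "/health"  # failure restarts the instance
readycheck_endpoint = "/ready"    # failure stops traffic being routed to the instance
```

Leaving either endpoint as an empty string falls back to a TCP check.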

Hardware Configuration

The [cerebrium.hardware] section defines compute resources.

Option	Type	Default	Description
cpu	float	required	Number of CPU cores
memory	float	required	Memory allocation in GB
compute	string	"CPU"	Compute type (CPU, AMPERE_A10, etc.)
gpu_count	integer	0	Number of GPUs
provider	string	"aws"	Cloud provider
region	string	"us-east-1"	Deployment region

Memory refers to RAM, not GPU VRAM. Ensure sufficient memory for your workload.
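
For comparison with the GPU setup in the complete example, here is a minimal CPU-only sketch. The core and memory figures are arbitrary illustrative values, not sizing recommendations.

```toml
[cerebrium.hardware]
cpu = 2.0        # number of CPU cores
memory = 8.0     # GB of RAM (not GPU VRAM)
compute = "CPU"  # default compute type
gpu_count = 0
```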

Scaling Configuration

The [cerebrium.scaling] section controls auto-scaling behavior.

Option	Type	Default	CLI Requirement	Description
min_replicas	integer	0	2.1.2+	Minimum running instances
max_replicas	integer	2	2.1.2+	Maximum running instances
replica_concurrency	integer	10	2.1.2+	Concurrent requests per replica
response_grace_period	integer	3600	2.1.2+	Grace period in seconds
cooldown	integer	1800	2.1.2+	Time window (seconds) that must pass at reduced concurrency before scaling down. Helps avoid cold starts from brief traffic dips.
scaling_metric	string	"concurrency_utilization"	2.1.2+	Metric for scaling decisions (concurrency_utilization, requests_per_second, cpu_utilization, memory_utilization)
scaling_target	integer	100	2.1.2+	Target value for scaling metric (percentage for utilization metrics, absolute value for requests_per_second)
scaling_buffer	integer	optional	2.1.2+	Additional replica capacity above what scaling metric suggests
evaluation_interval_seconds	integer	30	2.1.5+	Time window in seconds over which metrics are evaluated before scaling decisions (6-300s)
load_balancing_algorithm	string	""	2.1.5+	Algorithm for distributing traffic across replicas. Default: round-robin if replica_concurrency > 3, first-available otherwise. Options: round-robin, first-available, min-connections, random-choice-2
compute_tier	string	"interruptible"	2.1.6+	Controls pod scheduling on spot vs on-demand instances. Options: interruptible (spot, lower cost), protected (on-demand, higher availability)
roll_out_duration_seconds	integer	0	2.1.2+	Gradually send traffic to new revision after successful build. Max 600s. Keep at 0 during development.

Setting min_replicas > 0 maintains warm instances for immediate response but increases costs.
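
As one possible availability-oriented setup, the sketch below keeps a warm instance, schedules on on-demand capacity, and rolls traffic over gradually. The specific numbers are illustrative assumptions. Per the table above, compute_tier needs CLI 2.1.6+ and load_balancing_algorithm needs CLI 2.1.5+.

```toml
[cerebrium.scaling]
min_replicas = 1                              # keep one warm instance (adds cost)
max_replicas = 5
compute_tier = "protected"                    # on-demand instead of spot
load_balancing_algorithm = "min-connections"  # override the automatic default
roll_out_duration_seconds = 120               # max 600; keep at 0 during development
```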

The scaling_metric options are:

concurrency_utilization: Maintains a percentage of your replica_concurrency across instances. For example, with replica_concurrency=200 and scaling_target=80, maintains 160 requests per instance.
requests_per_second: Maintains a specific request rate across all instances. For example, scaling_target=5 maintains 5 requests/s average across instances.
cpu_utilization: Maintains CPU usage as a percentage of cerebrium.hardware.cpu. For example, with cpu=2 and scaling_target=80, maintains 80% CPU utilization (1.6 CPUs) per instance.
memory_utilization: Maintains RAM usage as a percentage of cerebrium.hardware.memory. For example, with memory=10 and scaling_target=80, maintains 80% memory utilization (8GB) per instance.
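
The concurrency_utilization example above corresponds to a fragment like the following, with the arithmetic spelled out in the comment.

```toml
[cerebrium.scaling]
replica_concurrency = 200
scaling_metric = "concurrency_utilization"
scaling_target = 80   # 80% of 200 = 160 concurrent requests per instance
```

For requests_per_second, scaling_target is an absolute rate rather than a percentage, so the same key would hold a value such as 5.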

The scaling_buffer option is only available with the concurrency_utilization and requests_per_second metrics. It ensures extra capacity is maintained above what the scaling metric suggests. For example, with min_replicas=0 and scaling_buffer=3, the system will maintain 3 replicas as baseline capacity.
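
A short sketch of the buffer scenario just described, assuming the requests_per_second metric. The target of 5 is an illustrative value.

```toml
[cerebrium.scaling]
min_replicas = 0
scaling_metric = "requests_per_second"  # buffer works with this or concurrency_utilization
scaling_target = 5                      # average requests/s across instances
scaling_buffer = 3                      # 3 replicas kept as baseline capacity
```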

Dependencies

Pip Dependencies

The [cerebrium.dependencies.pip] section lists Python package requirements.

[cerebrium.dependencies.pip]
torch = "==2.0.0"      # Exact version
numpy = "latest"       # Latest version
pandas = ">=1.5.0"     # Minimum version

APT Dependencies

The [cerebrium.dependencies.apt] section specifies system packages.

[cerebrium.dependencies.apt]
ffmpeg = "latest"
libopenblas-base = "latest"

Conda Dependencies

The [cerebrium.dependencies.conda] section manages Conda packages.

[cerebrium.dependencies.conda]
cuda = ">=11.7"
cudatoolkit = "11.7"

Dependency Files

The [cerebrium.dependencies.paths] section allows using requirement files.

[cerebrium.dependencies.paths]
pip = "requirements.txt"
apt = "pkglist.txt"
conda = "conda_pkglist.txt"

Complete Example

[cerebrium.deployment]
name = "llm-inference"
python_version = "3.12"
disable_auth = false
include = ["*"]
exclude = [".*"]
shell_commands = []
pre_build_commands = []
docker_base_image_url = "debian:bookworm-slim"
# Enable fast package installation with UV (omit or set to false to use pip)
use_uv = true

[cerebrium.runtime.custom]
port = 8000
entrypoint = ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
healthcheck_endpoint = "/health"
readycheck_endpoint = "/ready"

[cerebrium.hardware]
cpu = 4
memory = 16.0
compute = "AMPERE_A10"
gpu_count = 1
provider = "aws"
region = "us-east-1"

[cerebrium.scaling]
min_replicas = 0
max_replicas = 2
replica_concurrency = 10
response_grace_period = 3600
cooldown = 1800
scaling_metric = "concurrency_utilization"
scaling_target = 100
evaluation_interval_seconds = 30
# load_balancing_algorithm = ""  # Auto-selects based on replica_concurrency
# compute_tier = "interruptible"  # Use "protected" for on-demand instances
roll_out_duration_seconds = 0

[cerebrium.dependencies.pip]
torch = "latest"
transformers = "latest"
uvicorn = "latest"

[cerebrium.dependencies.apt]
ffmpeg = "latest"

[cerebrium.dependencies.conda]
# Optional conda dependencies

[cerebrium.dependencies.paths]
# Optional paths to dependency files
# pip = "requirements.txt"
# apt = "pkglist.txt"
# conda = "conda_pkglist.txt"
