# TOML Reference

> Complete reference for all parameters available in Cerebrium's default `cerebrium.toml` configuration file.

The configuration is organized into the following main sections:

* **\[cerebrium.deployment]**: Core settings like app name, Python version, and file inclusion rules
* **\[cerebrium.runtime.custom]**: Custom web server settings and app startup behavior
* **\[cerebrium.hardware]**: Compute resources including CPU, memory, and GPU specifications
* **\[cerebrium.scaling]**: Auto-scaling behavior and replica management
* **\[cerebrium.dependencies]**: Package management for Python (pip), system (apt), and Conda dependencies

## Deployment Configuration

The `[cerebrium.deployment]` section defines core deployment settings.

| Option                              | Type      | Default                | Description                                                                                                  |
| ----------------------------------- | --------- | ---------------------- | ------------------------------------------------------------------------------------------------------------ |
| name                                | string    | required               | Desired app name                                                                                             |
| python\_version                     | string    | "3.12"                 | Python version to use (3.10, 3.11, 3.12)                                                                     |
| disable\_auth                       | boolean   | false                  | Disable default token-based authentication on app endpoints                                                  |
| include                             | string\[] | \["\*"]                | Files/patterns to include in deployment                                                                      |
| exclude                             | string\[] | \[".\*"]               | Files/patterns to exclude from deployment                                                                    |
| shell\_commands                     | string\[] | \[]                    | Commands to run at the end of the build                                                                      |
| pre\_build\_commands                | string\[] | \[]                    | Commands to run before dependencies install                                                                  |
| docker\_base\_image\_url            | string    | "debian:bookworm-slim" | Base Docker image                                                                                            |
| use\_uv                             | boolean   | false                  | Use UV for faster Python package installation                                                                |
| deployment\_initialization\_timeout | integer   | 600 (10 minutes)       | Max time in seconds to wait for app initialisation during build before timing out. Must be 60 to 830.        |

<Info>
  Changes to python\_version or docker\_base\_image\_url trigger full rebuilds since
  they affect the base environment.
</Info>
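
As a rough illustration, a deployment section that overrides a few of the defaults above might look like the sketch below. The app name, the extra exclude pattern, and the commands are hypothetical placeholders, and the exact pattern-matching rules for `include` and `exclude` are assumed rather than taken from this reference.

```toml
[cerebrium.deployment]
name = "my-app"                          # hypothetical app name
python_version = "3.11"                  # one of 3.10, 3.11, 3.12
include = ["*"]                          # default: include everything
exclude = [".*", "tests/*"]              # "tests/*" is a hypothetical extra pattern
pre_build_commands = ["echo 'before dependency install'"]  # hypothetical command
shell_commands = ["echo 'end of build'"]                    # hypothetical command
deployment_initialization_timeout = 300  # seconds; must be between 60 and 830
```

Options left out of the section fall back to the defaults listed in the table.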

### UV Package Manager

UV is a fast Python package installer written in Rust that significantly speeds up deployment times. When enabled, UV replaces pip for installing Python dependencies.

<Info>
  UV typically installs packages 10-100x faster than pip. This is especially beneficial for:

  * Large dependency trees
  * Multiple packages
  * Clean builds without cache
</Info>

**Example with UV enabled:**

```toml theme={null}
[cerebrium.deployment]
use_uv = true
```

### Monitoring UV Usage

Check your build logs for these indicators:

* **UV\_PIP\_INSTALL\_STARTED** - UV is successfully being used
* **PIP\_INSTALL\_STARTED** - Standard pip installation (when `use_uv` is `false`)

<Warning>
  While UV is compatible with most packages, some edge cases may cause build
  failures, such as legacy packages with non-standard metadata.
</Warning>

### Deploying with UV Lock Files

<Info>Read this section only if you're using `pyproject.toml` and `uv.lock`.</Info>

Generate your lock file locally. This creates a `uv.lock` file with exact dependency versions.

```bash theme={null}
# In your project directory with pyproject.toml
uv sync
```

Export your locked dependencies to `requirements.txt`:

```bash theme={null}
# Resolve dependencies directly from pyproject.toml:
uv pip compile pyproject.toml -o requirements.txt
# Or export the exact versions pinned in uv.lock:
uv export --format requirements-txt -o requirements.txt
```

Include in your deployment:

* Ensure requirements.txt is in your project directory
* Deploy with UV enabled
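
Put together, the two steps above could be expressed in `cerebrium.toml` roughly as follows. This is a minimal sketch that assumes the exported `requirements.txt` sits next to `cerebrium.toml` in the project directory.

```toml
[cerebrium.deployment]
use_uv = true                 # install Python packages with UV instead of pip

[cerebrium.dependencies.paths]
pip = "requirements.txt"      # file exported from pyproject.toml / uv.lock
```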

## Runtime Configuration

The `[cerebrium.runtime.custom]` section configures custom web servers and runtime behavior.

| Option                | Type      | Default  | Description                                                                                                       |
| --------------------- | --------- | -------- | ----------------------------------------------------------------------------------------------------------------- |
| port                  | integer   | required | Port the application listens on                                                                                   |
| entrypoint            | string\[] | required | Command to start the application                                                                                  |
| healthcheck\_endpoint | string    | ""       | HTTP path for health checks (empty uses TCP). Failure causes the instance to restart                              |
| readycheck\_endpoint  | string    | ""       | HTTP path for readiness checks (empty uses TCP). Failure ensures the load balancer does not route to the instance |

<Info>
  The port specified in entrypoint must match the port parameter. All endpoints
  will be available at
  `https://api.aws.us-east-1.cerebrium.ai/v4/{project-id}/{app-name}/your/endpoint`.
</Info>
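
The port-matching rule can be sketched as below. The example assumes a web app exposed as `app` in a `main.py` file and served by uvicorn, and it assumes the app itself implements the `/health` and `/ready` paths; adjust these to your own server.

```toml
[cerebrium.runtime.custom]
port = 8000                                   # must match the port in entrypoint
entrypoint = ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
healthcheck_endpoint = "/health"              # failure restarts the instance
readycheck_endpoint = "/ready"                # failure stops traffic routing to the instance
```

Leaving either check as an empty string falls back to a TCP check.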

## Hardware Configuration

The `[cerebrium.hardware]` section defines compute resources.

| Option     | Type    | Default     | Description                           |
| ---------- | ------- | ----------- | ------------------------------------- |
| cpu        | float   | required    | Number of CPU cores                   |
| memory     | float   | required    | Memory allocation in GB               |
| compute    | string  | "CPU"       | Compute type (CPU, AMPERE\_A10, etc.) |
| gpu\_count | integer | 0           | Number of GPUs                        |
| provider   | string  | "aws"       | Cloud provider                        |
| region     | string  | "us-east-1" | Deployment region                     |

<Warning>
  Memory refers to RAM, not GPU VRAM. Ensure sufficient memory for your
  workload.
</Warning>
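
For illustration, a CPU-only hardware section using the defaults from the table might look like this sketch. The `cpu` and `memory` values are arbitrary examples, not recommendations.

```toml
[cerebrium.hardware]
cpu = 2.0             # number of CPU cores (example value)
memory = 8.0          # GB of RAM, not GPU VRAM (example value)
compute = "CPU"       # for a GPU app, e.g. "AMPERE_A10" with gpu_count = 1
gpu_count = 0
provider = "aws"
region = "us-east-1"
```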

## Scaling Configuration

The `[cerebrium.scaling]` section controls auto-scaling behavior.

| Option                        | Type    | Default                    | CLI Requirement | Description                                                                                                                                                                                              |
| ----------------------------- | ------- | -------------------------- | --------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| min\_replicas                 | integer | 0                          | 2.1.2+          | Minimum running instances                                                                                                                                                                                |
| max\_replicas                 | integer | 2                          | 2.1.2+          | Maximum running instances                                                                                                                                                                                |
| replica\_concurrency          | integer | 10                         | 2.1.2+          | Concurrent requests per replica                                                                                                                                                                          |
| response\_grace\_period       | integer | 3600                       | 2.1.2+          | Grace period in seconds                                                                                                                                                                                  |
| cooldown                      | integer | 1800                       | 2.1.2+          | Time window (seconds) that must pass at reduced concurrency before scaling down. Helps avoid cold starts from brief traffic dips.                                                                        |
| scaling\_metric               | string  | "concurrency\_utilization" | 2.1.2+          | Metric for scaling decisions (concurrency\_utilization, requests\_per\_second, cpu\_utilization, memory\_utilization)                                                                                    |
| scaling\_target               | integer | 100                        | 2.1.2+          | Target value for scaling metric (percentage for utilization metrics, absolute value for requests\_per\_second)                                                                                           |
| scaling\_buffer               | integer | optional                   | 2.1.2+          | Additional replica capacity above what scaling metric suggests                                                                                                                                           |
| evaluation\_interval\_seconds | integer | 30                         | 2.1.5+          | Time window in seconds over which metrics are evaluated before scaling decisions (6-300s)                                                                                                                |
| load\_balancing\_algorithm    | string  | ""                         | 2.1.5+          | Algorithm for distributing traffic across replicas. Default: round-robin if replica\_concurrency > 3, first-available otherwise. Options: round-robin, first-available, min-connections, random-choice-2 |
| compute\_tier                 | string  | "interruptible"            | 2.1.6+          | Controls pod scheduling on spot vs on-demand instances. Options: interruptible (spot, lower cost), protected (on-demand, higher availability)                                                            |
| roll\_out\_duration\_seconds  | integer | 0                          | 2.1.2+          | Gradually send traffic to new revision after successful build. Max 600s. Keep at 0 during development.                                                                                                   |

<Warning>
  Setting min\_replicas > 0 maintains warm instances for immediate response but
  increases costs.
</Warning>
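
As a sketch of how several of these options combine, the fragment below keeps one warm replica on on-demand capacity. The specific values are illustrative assumptions chosen from the ranges and options in the table, not tuned recommendations.

```toml
[cerebrium.scaling]
min_replicas = 1                              # keep one warm instance (increases cost)
max_replicas = 5                              # example upper bound
replica_concurrency = 10
cooldown = 1800                               # seconds at reduced concurrency before scaling down
evaluation_interval_seconds = 60              # allowed range: 6-300 (CLI 2.1.5+)
load_balancing_algorithm = "min-connections"  # CLI 2.1.5+
compute_tier = "protected"                    # on-demand instances (CLI 2.1.6+)
```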

The `scaling_metric` options are:

* **concurrency\_utilization**: Maintains a percentage of your replica\_concurrency across instances. For example, with `replica_concurrency=200` and `scaling_target=80`, the system maintains 160 concurrent requests per instance (80% of 200).
* **requests\_per\_second**: Maintains a specific request rate across all instances. For example, `scaling_target=5` maintains an average of 5 requests/s per instance.
* **cpu\_utilization**: Maintains CPU usage as a percentage of cerebrium.hardware.cpu. For example, with `cpu=2` and `scaling_target=80`, the system maintains 80% CPU utilization (1.6 CPUs) per instance.
* **memory\_utilization**: Maintains RAM usage as a percentage of cerebrium.hardware.memory. For example, with `memory=10` and `scaling_target=80`, the system maintains 80% memory utilization (8GB) per instance.
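
The first example in the list above, written out as configuration, would look roughly like this. The commented lines show the `requests_per_second` variant as an alternative.

```toml
[cerebrium.scaling]
replica_concurrency = 200
scaling_metric = "concurrency_utilization"
scaling_target = 80        # percent: 80% of 200 = 160 concurrent requests per instance

# Alternative: scale on request rate instead of concurrency.
# scaling_metric = "requests_per_second"
# scaling_target = 5       # absolute value: average of 5 requests/s
```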

<Info>
  The scaling\_buffer option is only available with concurrency\_utilization and requests\_per\_second metrics.
  It ensures extra capacity is maintained above what the scaling metric suggests.

  For example, with `min_replicas=0` and `scaling_buffer=3`, the system will maintain 3 replicas as baseline capacity.
</Info>
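
The buffer example from the note above could be written as the following fragment. It is a sketch that assumes the default `concurrency_utilization` metric.

```toml
[cerebrium.scaling]
min_replicas = 0
scaling_metric = "concurrency_utilization"  # scaling_buffer also works with requests_per_second
scaling_buffer = 3                          # keeps 3 replicas as baseline capacity
```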

## Dependencies

### Pip Dependencies

The `[cerebrium.dependencies.pip]` section lists Python package requirements.

```toml theme={null}
[cerebrium.dependencies.pip]
torch = "==2.0.0"      # Exact version
numpy = "latest"       # Latest version
pandas = ">=1.5.0"     # Minimum version
```

### APT Dependencies

The `[cerebrium.dependencies.apt]` section specifies system packages.

```toml theme={null}
[cerebrium.dependencies.apt]
ffmpeg = "latest"
libopenblas-base = "latest"
```

### Conda Dependencies

The `[cerebrium.dependencies.conda]` section manages Conda packages.

```toml theme={null}
[cerebrium.dependencies.conda]
cuda = ">=11.7"
cudatoolkit = "11.7"
```

### Dependency Files

The `[cerebrium.dependencies.paths]` section allows using requirement files.

```toml theme={null}
[cerebrium.dependencies.paths]
pip = "requirements.txt"
apt = "pkglist.txt"
conda = "conda_pkglist.txt"
```

## Complete Example

```toml theme={null}
[cerebrium.deployment]
name = "llm-inference"
python_version = "3.12"
disable_auth = false
include = ["*"]
exclude = [".*"]
shell_commands = []
pre_build_commands = []
docker_base_image_url = "debian:bookworm-slim"
# Enable fast package installation with UV (omit or set to false to use pip)
use_uv = true

[cerebrium.runtime.custom]
port = 8000
entrypoint = ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
healthcheck_endpoint = "/health"
readycheck_endpoint = "/ready"

[cerebrium.hardware]
cpu = 4
memory = 16.0
compute = "AMPERE_A10"
gpu_count = 1
provider = "aws"
region = "us-east-1"

[cerebrium.scaling]
min_replicas = 0
max_replicas = 2
replica_concurrency = 10
response_grace_period = 3600
cooldown = 1800
scaling_metric = "concurrency_utilization"
scaling_target = 100
evaluation_interval_seconds = 30
# load_balancing_algorithm = ""  # Auto-selects based on replica_concurrency
# compute_tier = "interruptible"  # Use "protected" for on-demand instances
roll_out_duration_seconds = 0

[cerebrium.dependencies.pip]
torch = "latest"
transformers = "latest"
uvicorn = "latest"

[cerebrium.dependencies.apt]
ffmpeg = "latest"

[cerebrium.dependencies.conda]
# Optional conda dependencies

[cerebrium.dependencies.paths]
# Optional paths to dependency files
# pip = "requirements.txt"
# apt = "pkglist.txt"
# conda = "conda_pkglist.txt"
```
