# Scaling Apps

> Learn to optimize cost and performance by scaling out apps

Cerebrium's scaling system automatically manages computing resources to match app demand, from single requests to multiple simultaneous requests, optimizing for both performance and cost.

## How Autoscaling Works

The scaling system monitors a configurable metric - such as concurrency utilization, requests per second, CPU usage, or memory usage - and compares it against a target threshold. When the metric exceeds the target, new instances start within seconds. See [Scaling Metrics](#using-scaling-metrics) for details on each option.

## Scaling Configuration

The `cerebrium.toml` file controls scaling behavior through several key parameters:

```toml theme={null}
[cerebrium.scaling]
min_replicas = 0           # Minimum running instances
max_replicas = 3           # Maximum concurrent instances
cooldown = 60              # Cooldown period in seconds
replica_concurrency = 1    # The maximum number of requests each replica of an app can accept
```

### Minimum Instances

The `min_replicas` parameter defines how many instances remain active at all times. Setting this to 1 or higher maintains warm instances for immediate response, eliminating cold starts but increasing costs. Use this for apps requiring consistent response times or specific SLA guarantees.

### Maximum Instances

The `max_replicas` parameter sets an upper limit on concurrent instances, controlling costs and protecting backend systems. When traffic increases, new instances start automatically up to this configured maximum.

### Cooldown Period

The `cooldown` parameter specifies the time window (in seconds) that must pass at reduced concurrency before an instance scales down. This prevents premature scale-down during brief traffic dips that might be followed by more requests. A longer cooldown period helps handle bursty traffic patterns but increases instance running time and cost.
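
One way to picture the cooldown is as a timer that restarts whenever traffic returns. The sketch below is a simplified mental model under that assumption, not Cerebrium's actual implementation; `CooldownTracker` and its method names are hypothetical.

```python
from typing import Optional


class CooldownTracker:
    """Toy model: allow scale-down only after `cooldown` quiet seconds."""

    def __init__(self, cooldown: float) -> None:
        self.cooldown = cooldown
        self.quiet_since: Optional[float] = None  # when load last dropped

    def observe(self, now: float, load_is_low: bool) -> bool:
        """Record one observation; return True if scale-down is allowed."""
        if not load_is_low:
            self.quiet_since = None  # traffic returned: restart the timer
            return False
        if self.quiet_since is None:
            self.quiet_since = now   # first quiet observation starts the timer
        return now - self.quiet_since >= self.cooldown
```

With `cooldown=60`, a brief burst at second 40 resets the timer, so an instance that went quiet at second 0 is not eligible for scale-down until 60 uninterrupted quiet seconds have passed.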

### Replica Concurrency

The `replica_concurrency` parameter sets how many requests an app instance can handle concurrently. This is a hard limit: an individual replica never
accepts more than this number of requests at a time. By default, once an instance reaches this limit and requests are still waiting to be processed,
the system scales out by the number of new instances required to serve those requests. For example, if `replica_concurrency=1` and there are
*3* requests in flight with no replicas currently available, Cerebrium scales out 3 instances of the application to meet that demand.

<Info>
  Most GPU applications require `replica_concurrency` to be set to **1**. If a
  GPU workload needs higher throughput, `replica_concurrency` may be increased
  as long as the application itself controls access to GPU resources through
  batching.
</Info>

## Processing Multiple Requests

Apps can process multiple requests simultaneously through batching and concurrency. Cerebrium supports frameworks with built-in batching and enables custom implementations through the [custom runtime](container-images/defining-container-images#custom-runtimes) feature. See the [Batching & Concurrency Guide](/scaling/batching-concurrency) for details.

## Instance Management

Cerebrium automatically restarts failed instances, starts new instances to maintain capacity, and monitors instance health continuously.

Apps requiring maximum reliability combine several scaling features:

```toml theme={null}
[cerebrium.scaling]
min_replicas = 2              # Maintain redundant instances
cooldown = 600                # Extended warm period
max_replicas = 10             # Room for traffic spikes
response_grace_period = 1200  # Maximum request lifespan ensuring graceful exit
```

The `response_grace_period` parameter specifies the maximum time, in seconds, that a request may take to finish. It gives instances time to complete active requests during both normal operation and shutdown.
During normal replica operation, it acts as a request timeout. During replica shutdown, Cerebrium sends a SIGTERM signal to the replica and waits for the grace period to elapse.
If the instance has not stopped by then, Cerebrium sends SIGKILL, and any requests that are still active fail with a GatewayTimeout error.

<Note>
  When using the Cortex runtime (default), SIGTERM signals are automatically
  handled to allow graceful termination of requests. For custom runtimes, you'll
  need to implement SIGTERM handling yourself to ensure requests complete
  gracefully before termination. See our [Graceful Termination
  guide](/scaling/graceful-termination) for detailed implementation examples,
  including FastAPI patterns for tracking and completing in-flight requests
  during shutdown.
</Note>
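
For custom runtimes, the general shape of SIGTERM handling can be sketched with Python's standard `signal` and `threading` modules. This is a minimal illustration, not the Cortex runtime's implementation: `InFlightTracker`, `install`, and the method names are hypothetical, and a real server would call `start_request` and `finish_request` from its own request lifecycle hooks.

```python
import signal
import threading
import time


class InFlightTracker:
    """Counts active requests so a server can drain before exiting."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._active = 0
        self.shutting_down = threading.Event()

    def handle_sigterm(self, signum, frame) -> None:
        # Only set a flag here: taking self._lock inside a signal handler
        # could deadlock if the main thread already holds it.
        self.shutting_down.set()

    def start_request(self) -> bool:
        """Return False once draining has begun, so new work is refused."""
        if self.shutting_down.is_set():
            return False
        with self._lock:
            self._active += 1
        return True

    def finish_request(self) -> None:
        with self._lock:
            self._active -= 1

    def wait_until_drained(self, timeout: float, poll: float = 0.05) -> bool:
        """Poll until no requests are active; False if `timeout` expires."""
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            with self._lock:
                if self._active == 0:
                    return True
            time.sleep(poll)
        return False


def install(tracker: InFlightTracker) -> None:
    """Register the handler; call this from the main thread at startup."""
    signal.signal(signal.SIGTERM, tracker.handle_sigterm)
```

In this sketch, the drain timeout would be chosen to fit within `response_grace_period`, so the process exits on its own before SIGKILL arrives.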

Performance metrics available through the dashboard help monitor scaling behavior:

* Request processing times
* Active instance count
* Cold start frequency
* Resource usage patterns

## Using Scaling Metrics

Cerebrium supports multiple scaling metrics. Four are available, with `concurrency_utilization` as the default:

* `concurrency_utilization`
* `requests_per_second`
* `cpu_utilization`
* `memory_utilization`

Specify a metric and target in the `cerebrium.scaling` section:

```toml theme={null}
[cerebrium.scaling]
min_replicas = 0
cooldown = 600
max_replicas = 10
response_grace_period = 120
replica_concurrency = 1
scaling_metric = "concurrency_utilization"
scaling_target = 100
```

### Concurrency Utilization

`concurrency_utilization` is the default scaling metric, with a default target of *100%*.
This metric maintains a maximum percentage of `replica_concurrency` averaged across all instances.
For example, with `replica_concurrency=1` and `scaling_target=70`, Cerebrium maintains *0.7* requests per instance, ensuring 30% excess capacity.

With `replica_concurrency=200` and `scaling_target=80`, Cerebrium maintains *160* requests per instance and scales out once that target is exceeded.
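
The arithmetic in these examples can be checked with a short sketch. It assumes the autoscaler aims for the fewest replicas that keep average in-flight requests per replica at or below the target; the function names are hypothetical, and the real autoscaler also applies `min_replicas`, `max_replicas`, and cooldown.

```python
def target_requests_per_replica(replica_concurrency: int, scaling_target: int) -> float:
    """Average in-flight requests per replica that the target allows."""
    return replica_concurrency * scaling_target / 100


def desired_replicas(in_flight: int, replica_concurrency: int, scaling_target: int = 100) -> int:
    """Fewest replicas keeping the per-replica average at or below the target."""
    capacity_x100 = replica_concurrency * scaling_target
    # Integer ceiling division avoids floating-point rounding surprises.
    return (in_flight * 100 + capacity_x100 - 1) // capacity_x100
```

Under this model, 7 in-flight requests with `replica_concurrency=1` and `scaling_target=70` call for 10 replicas, while 161 in-flight requests with `replica_concurrency=200` and `scaling_target=80` call for 2.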

### Requests per Second

`requests_per_second` maintains a maximum application throughput measured in requests per second, averaged over all instances. This metric is more effective than `concurrency_utilization` when application throughput has been benchmarked. It does not enforce concurrency limits, so it is not recommended for most GPU applications. For example, `scaling_target=5` maintains a 5 requests/s average across all instances.
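
As a rough illustration of the throughput target, the sketch below assumes the autoscaler divides total measured throughput by `scaling_target` and rounds up. `replicas_for_throughput` is a hypothetical helper, not part of any Cerebrium API.

```python
import math


def replicas_for_throughput(total_rps: float, scaling_target: float) -> int:
    """Fewest replicas keeping average requests/s per replica at or below target."""
    return math.ceil(total_rps / scaling_target)
```

With `scaling_target=5`, a sustained 23 requests/s would need 5 replicas under this model, averaging 4.6 requests/s each.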

### CPU Utilization

`cpu_utilization` scales based on maximum CPU percentage utilization averaged over all instances, relative to the `cerebrium.hardware.cpu` value. For example, with `cpu=2` and `scaling_target=80`, Cerebrium maintains *80%* CPU utilization (1.6 CPUs) per instance. This metric requires `min_replicas=1` since scaling relative to 0 CPU units is undefined.

### Memory Utilization

`memory_utilization` scales based on maximum RAM percentage utilization averaged over all instances, relative to `cerebrium.hardware.memory`. This refers to RAM, **not** GPU VRAM. For example, with `memory=10` and `scaling_target=80`, Cerebrium maintains *80%* memory utilization (8GB) per instance. This metric requires `min_replicas=1` since scaling relative to 0GB of memory is undefined.
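
Both resource metrics follow the same pattern: average usage across instances, expressed as a percentage of the allocated hardware value, is compared with `scaling_target`. The sketch below illustrates that comparison under those assumptions; the function names are hypothetical.

```python
def utilization_pct(per_replica_usage: list[float], allocated: float) -> float:
    """Average usage across replicas as a percentage of the allocation."""
    average = sum(per_replica_usage) / len(per_replica_usage)
    return 100 * average / allocated


def should_scale_out(per_replica_usage: list[float], allocated: float, scaling_target: float) -> bool:
    """True when average utilization exceeds the target percentage."""
    return utilization_pct(per_replica_usage, allocated) > scaling_target
```

For example, two replicas using 1.8 and 1.7 CPUs with `cpu=2` average above the 80% target (1.6 CPUs), so this model would scale out. The same check applies to RAM in GB against `memory`.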

## Keeping a Scaling Buffer

For apps with long startup times or predictable traffic, a replica buffer maintains consistent excess capacity above what the scaling metric suggests. The `scaling_buffer` option adds a fixed number of extra replicas to the autoscaler's recommendation. This is available with the following scaling metrics:

* `concurrency_utilization`
* `requests_per_second`

Add `scaling_buffer` to the `cerebrium.scaling` section:

```toml theme={null}
[cerebrium.scaling]
min_replicas = 1
cooldown = 600
max_replicas = 10
response_grace_period = 120
replica_concurrency = 1
scaling_metric = "concurrency_utilization"
scaling_target = 100
scaling_buffer = 3
```

With the above config, the app runs **1 replica** as a baseline when it receives no traffic, because the buffer is only added once requests arrive.
With `concurrency_utilization` at `100` and `replica_concurrency=1`, receiving 1 request causes the autoscaler to suggest 1 replica. With `scaling_buffer=3`, the app scales to **(1+3)=4** replicas.
The buffer adds a static number of replicas on top of the autoscaler's recommendation.
After the request completes, the `cooldown` period applies and the replica count scales back to the **1 replica** baseline.
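
The behavior in this walkthrough can be expressed as a small function. This is a sketch inferred from the example above, not a confirmed implementation: skipping the buffer when the autoscaler recommends zero replicas and capping at `max_replicas` are both assumptions, and `replicas_with_buffer` is a hypothetical name.

```python
def replicas_with_buffer(recommended: int, scaling_buffer: int,
                         min_replicas: int, max_replicas: int) -> int:
    """Replica count after applying a static buffer to the recommendation."""
    if recommended == 0:
        # Idle: fall back to the baseline; the buffer is not applied (assumed).
        return min_replicas
    buffered = recommended + scaling_buffer
    # Keep the result within the configured bounds (assumed).
    return max(min_replicas, min(buffered, max_replicas))
```

With the config above, a recommendation of 1 replica becomes 4, and a recommendation of 9 is capped at the `max_replicas` value of 10.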

## Evaluation Interval

<Warning>Requires CLI version 2.1.5 or higher.</Warning>

The `evaluation_interval` parameter controls the time window (in seconds) over which the autoscaler evaluates metrics before making scaling decisions. The default is 30 seconds, with a valid range of 6-300 seconds.

```toml theme={null}
[cerebrium.scaling]
evaluation_interval = 30  # Evaluate metrics over 30-second windows
```

A shorter interval makes the autoscaler more responsive to traffic spikes but may cause more frequent scaling events. A longer interval smooths out transient spikes but may delay scaling responses.

<Info>
  For bursty workloads, a shorter `evaluation_interval` (e.g., 10-15 seconds)
  helps the system respond quickly to demand. For steady workloads, a longer
  interval (e.g., 60 seconds) reduces unnecessary scaling churn.
</Info>
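
To see why window length trades responsiveness for stability, the sketch below averages a metric over a sliding window. It assumes one sample per second and a plain mean, which may differ from how the autoscaler actually aggregates metrics; `windowed_average` is a hypothetical helper.

```python
def windowed_average(samples: list[float], window: int) -> list[float]:
    """Mean of the most recent `window` samples at each time step."""
    averages = []
    for i in range(len(samples)):
        recent = samples[max(0, i - window + 1): i + 1]
        averages.append(sum(recent) / len(recent))
    return averages


# A 5-second spike of 10 in-flight requests surrounded by idle periods.
spike = [0.0] * 30 + [10.0] * 5 + [0.0] * 30
short_window = windowed_average(spike, window=6)
long_window = windowed_average(spike, window=30)
```

The 6-second window tracks the spike closely, so its peak approaches the true value of 10, while the 30-second window spreads the same spike over many samples and peaks far lower.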

## Load Balancing

<Warning>Requires CLI version 2.1.5 or higher.</Warning>

The `load_balancing_algorithm` parameter controls how incoming requests are distributed across your replicas. When not specified, the system automatically selects the best algorithm based on your `replica_concurrency` setting.

```toml theme={null}
[cerebrium.scaling]
load_balancing_algorithm = "min-connections"  # Explicitly set load balancing algorithm
```

**Default behavior**: When `load_balancing_algorithm` is not set, the system uses `first-available` for `replica_concurrency <= 3` (typical for GPU workloads) and `round-robin` for higher concurrency.

### Available Algorithms

#### round-robin

Cycles through replicas starting from the last successful target. Each replica's concurrency limit is respected - if a replica is at capacity, the algorithm proceeds to the next one in rotation.

| Characteristic       | Value                                                                   |
| -------------------- | ----------------------------------------------------------------------- |
| Selection complexity | O(1) typical, O(N) worst case when scanning for available capacity      |
| Latency profile      | Consistent p50, good p90 under uniform load                             |
| Strategy             | Stateful index rotation with mutex synchronization; skips full replicas |

**Best for**: Workloads with predictable request times where you want even distribution across replicas over time.
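
The rotation described above can be sketched as follows. This is an illustrative model, not Cerebrium's load balancer: `RoundRobinBalancer` is a hypothetical class, and per-replica load is tracked as a simple list of in-flight counts.

```python
import threading
from typing import Optional


class RoundRobinBalancer:
    """Toy round-robin picker that skips replicas at their concurrency limit."""

    def __init__(self, num_replicas: int, replica_concurrency: int) -> None:
        self.in_flight = [0] * num_replicas  # active requests per replica
        self.limit = replica_concurrency
        self._last = -1                      # index of the last successful target
        self._lock = threading.Lock()        # guards the rotation state

    def pick(self) -> Optional[int]:
        """Reserve a slot on the next replica with capacity, or return None."""
        with self._lock:
            n = len(self.in_flight)
            for step in range(1, n + 1):
                i = (self._last + step) % n
                if self.in_flight[i] < self.limit:
                    self.in_flight[i] += 1
                    self._last = i
                    return i
            return None  # every replica is at capacity

    def release(self, i: int) -> None:
        """Free a slot when a request on replica `i` completes."""
        with self._lock:
            self.in_flight[i] -= 1
```

The common case finds capacity on the first step, and the full loop over all replicas is the O(N) worst case from the table.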

#### first-available

Scans replicas from the start of the list and selects the first one with available capacity.

| Characteristic       | Value                                                                         |
| -------------------- | ----------------------------------------------------------------------------- |
| Selection complexity | O(1) typical, O(N) worst case                                                 |
| Latency profile      | Optimal p50 when load is light, may degrade p90 under high load               |
| Strategy             | Linear scan from list start; returns first replica that accepts via Reserve() |

**Best for**: GPU workloads with low concurrency (`replica_concurrency <= 3`). Maximizes utilization of warm replicas before spreading load, reducing cold starts and keeping models in VRAM.

**Tradeoff**: Earlier replicas in the list handle more traffic. This is desirable for GPU workloads but may cause uneven distribution for CPU workloads.
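
A minimal sketch of the scan, assuming per-replica load is tracked as a list of in-flight counts. `first_available` is a hypothetical function standing in for the scan-and-reserve step.

```python
from typing import Optional


def first_available(in_flight: list[int], replica_concurrency: int) -> Optional[int]:
    """Reserve a slot on the first replica with spare capacity."""
    for i, active in enumerate(in_flight):
        if active < replica_concurrency:
            in_flight[i] += 1  # reserve the slot
            return i
    return None  # every replica is at capacity
```

With three idle replicas and `replica_concurrency=2`, successive requests fill replica 0 completely before replica 1 receives any traffic, which is the packing behavior that the tradeoff describes.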

#### min-connections

Performs a linear scan to find the replica with the fewest in-flight requests, then attempts to reserve it. If that replica cannot accept the request because it is at capacity, the algorithm falls back to trying the other replicas in iteration order.

| Characteristic       | Value                                                              |
| -------------------- | ------------------------------------------------------------------ |
| Selection complexity | Θ(N) - always scans all replicas to find minimum                   |
| Latency profile      | Best p90/p99 tail latency                                          |
| Strategy             | Single pass to find minimum in-flight; fallback in iteration order |

**Best for**: Workloads with variable request times (e.g., LLM inference where output length varies). Routes new requests to the least busy replica, preventing fast requests from queuing behind slow ones.
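
The selection step can be sketched as below, again modeling replicas as a list of in-flight counts. `min_connections` is a hypothetical function. In this single-threaded sketch the fallback loop never finds a different replica, since a full least-loaded replica means all are full; in a concurrent system it matters because counts can change between the scan and the reservation.

```python
from typing import Optional


def min_connections(in_flight: list[int], replica_concurrency: int) -> Optional[int]:
    """Reserve a slot on the least-loaded replica, with ordered fallback."""
    if not in_flight:
        return None
    # Single pass over all replicas to find the minimum (first one wins ties).
    best = min(range(len(in_flight)), key=in_flight.__getitem__)
    # Try the least-loaded replica first, then the rest in iteration order.
    order = [best] + [i for i in range(len(in_flight)) if i != best]
    for i in order:
        if in_flight[i] < replica_concurrency:
            in_flight[i] += 1  # reserve the slot
            return i
    return None  # every replica is at capacity
```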

#### random-choice-2

Implements the "Power of Two Choices" algorithm: randomly samples two replicas and routes to the one with lower weight (based on active request tracking). Ties are broken randomly.

| Characteristic       | Value                                                       |
| -------------------- | ----------------------------------------------------------- |
| Selection complexity | Θ(1) - constant time regardless of replica count            |
| Latency profile      | Good balance of p50 and p90                                 |
| Strategy             | Sample 2 random replicas, compare weights, pick lighter one |

**Best for**: High-throughput scenarios with many replicas where selection overhead matters. Research on the "Power of Two Choices" shows that comparing two random candidates gives an exponential reduction in maximum load compared with purely random selection.

**Note**: Uses weight-based tracking rather than reservation-based concurrency limiting, making it suitable for unlimited concurrency scenarios.
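
A compact sketch of the two-choice selection, using Python's standard `random` module. `random_choice_2` is a hypothetical function, and treating each replica's weight as its active request count is an assumption made for illustration.

```python
import random


def random_choice_2(weights: list[int], rng: random.Random) -> int:
    """Sample two distinct replicas and route to the lighter one."""
    if len(weights) == 1:
        return 0  # only one candidate available
    a, b = rng.sample(range(len(weights)), 2)
    if weights[a] == weights[b]:
        chosen = rng.choice((a, b))  # break ties randomly
    else:
        chosen = a if weights[a] < weights[b] else b
    weights[chosen] += 1  # track the new active request
    return chosen
```

Each call does a fixed amount of work regardless of how many replicas exist, which is where the constant-time selection cost comes from.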

### Choosing an Algorithm

| Scenario                                 | Recommended                 | Reason                                             |
| ---------------------------------------- | --------------------------- | -------------------------------------------------- |
| GPU inference, `replica_concurrency=1`   | `first-available` (default) | Maximizes GPU utilization, keeps models warm       |
| LLMs with variable output lengths        | `min-connections`           | Prevents head-of-line blocking, best tail latency  |
| High-throughput, many replicas           | `random-choice-2`           | Θ(1) selection with near-optimal distribution      |
| Uniform request times, even distribution | `round-robin`               | Predictable rotation, no hot spots over time       |
| Latency-sensitive with variable load     | `min-connections`           | Minimizes p90/p99 by routing to least busy replica |

## Compute Tier

<Warning>Requires CLI version 2.1.6 or higher.</Warning>

The `compute_tier` parameter controls whether your replicas are scheduled on spot (interruptible) or on-demand (protected) instances. This directly affects cost and availability.

```toml theme={null}
[cerebrium.scaling]
compute_tier = "protected"  # Use on-demand instances for higher availability
```

### Available Tiers

| Tier            | Description                                                                                                                               |
| --------------- | ----------------------------------------------------------------------------------------------------------------------------------------- |
| `interruptible` | **(Default)** Scheduled on preemptible instances. Lower cost, but may be interrupted by the cloud provider during periods of high demand. |
| `protected`     | Scheduled on instances with higher availability with no interruptions, at increased cost.                                                 |

**Choosing a tier:**

* Use `interruptible` (default) for batch workloads, development, or cost-sensitive applications that can tolerate occasional interruptions.
* Use `protected` for production services with strict availability requirements or long-running requests where interruption would be costly.
