Introduction

Memory checkpointing takes a snapshot of a container’s CPU and GPU memory and uses it to speed up the startup of future containers. Applications that do a large amount of work at container start time benefit the most, whether they are CPU-only or GPU workloads.

For CPU applications, checkpointing preserves expensive initialization work such as imports, dependency loading, configuration setup, and in-memory state. For GPU applications, it also preserves model weights, CUDA state, and compiled kernels. For example, ML and LLM frameworks often load large model weights and compile CUDA kernels at container start time, which can take many seconds or minutes. Restoring from a checkpoint that already contains this initialized state skips most of that delay.

This feature is still in beta. Please report any issues to the team via our Discord Community or by email.

How to use

Checkpointing is available in early beta to our customer base.

1. Enable checkpointing in cerebrium.toml

[cerebrium.experimental]
checkpointing = true

2. Send the checkpoint trigger

You control when the checkpoint is created. Ideally, trigger it after most of your application’s initialization work is complete and the desired state is reached, because the state of your application at that exact moment is what future container launches restore. Once your startup logic reaches this point, send a POST request from inside the container to instruct Cerebrium to capture the checkpoint:
import json
import urllib.request

# Trigger the checkpoint only after finishing all major initialization tasks.
req = urllib.request.Request(
    "http://169.254.169.253:8234/checkpoint",
    method="POST",
)
with urllib.request.urlopen(req, timeout=300) as response:
    print(json.loads(response.read()))
The endpoint uses 169.254.169.253, a link-local address that routes to the Cerebrium runtime sidecar inside the container. The address is reachable only from inside the container, not from external networks. Set the HTTP client timeout to at least 300 seconds. Checkpoint duration scales with the amount of memory captured - from a few seconds for small CPU workloads to several minutes for large GPU snapshots.

3. When the runtime creates a checkpoint

When the runtime receives the POST, it checks whether a new checkpoint is required. To save resources, the system skips checkpoint creation if:
  1. A checkpoint already exists for the current build version.
  2. Another container instance is already undergoing the checkpointing process.
If a checkpoint is required, the container is frozen for the duration of the process. GPU memory is copied to CPU memory, and then all container memory is written to storage. The saved checkpoint is then distributed throughout the region.

Checkpoint size is roughly the CPU memory in use at trigger time plus the GPU memory copied during the freeze. For GPU workloads, expect the snapshot to be on the order of the model weights plus runtime overhead unless caches are dropped first. Allocate enough container memory to hold the GPU dump in addition to normal usage (see Memory overhead under Limitations).
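
The sizing rule can be turned into a rough back-of-the-envelope check. This is only a sketch of the arithmetic described above; the function names and the example figures are hypothetical, and real usage should be measured rather than assumed.

```python
def estimate_checkpoint_gib(cpu_in_use_gib: float, gpu_in_use_gib: float) -> float:
    """Rough snapshot size: CPU memory in use plus GPU memory copied to CPU."""
    return cpu_in_use_gib + gpu_in_use_gib


def fits_in_container(container_memory_gib: float,
                      cpu_in_use_gib: float,
                      gpu_in_use_gib: float) -> bool:
    """True if the container can hold normal usage plus the GPU dump."""
    return estimate_checkpoint_gib(cpu_in_use_gib, gpu_in_use_gib) <= container_memory_gib


if __name__ == "__main__":
    # Hypothetical workload: 6 GiB of CPU memory and 14 GiB of GPU memory in use.
    print(estimate_checkpoint_gib(6, 14))
    print(fits_in_container(16, 6, 14))
    print(fits_in_container(24, 6, 14))
```

With these example figures, a 16 GiB container would be too small for the freeze, while a 24 GiB container would leave some headroom.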

4. Verify restoration

If checkpoint creation succeeds, subsequent containers restore from that snapshot. A restored container logs CEREBRIUM_RESTORED: container restored from checkpoint as its first log line.
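
If you collect container logs programmatically, the marker can be checked in code. The sketch below assumes you already have the log lines as strings; how you fetch them is outside its scope, and `was_restored` is a hypothetical helper name. It matches the marker anywhere in the line in case the log viewer adds a timestamp prefix, which is an assumption.

```python
RESTORE_MARKER = "CEREBRIUM_RESTORED"  # marker text from the log line above


def was_restored(log_lines) -> bool:
    """Return True if the first non-empty log line contains the restore marker."""
    for line in log_lines:
        if line.strip():
            return RESTORE_MARKER in line
    return False
```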

5. Disable checkpointing

A checkpoint is tightly coupled to a single deployment. To stop restoring from checkpoints, remove the POST request and redeploy the application.

You can find several implementations in our Examples repository on GitHub.

vLLM example

import http.client
import json
import urllib.request

from vllm import AsyncLLMEngine
from vllm.engine.arg_utils import AsyncEngineArgs

# Init vLLM engine
engine_args = AsyncEngineArgs(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    async_scheduling=False,  # Async scheduling is incompatible with checkpoint restore
    enable_sleep_mode=True,  # Required for engine.sleep() / wake_up() to drop KV cache before checkpoint
)
engine = AsyncLLMEngine.from_engine_args(engine_args)

# Drop KV cache for reduced GPU memory footprint.
# If sleep() and wake_up() are coroutines in your vLLM version, await them from async code.
engine.sleep(level=1)

# Trigger checkpoint
try:
    req = urllib.request.Request("http://169.254.169.253:8234/checkpoint", method="POST")
    with urllib.request.urlopen(req, timeout=300) as response:
        print(json.loads(response.read()))
except http.client.RemoteDisconnected:
    # The TCP connection is dropped on restore, which raises RemoteDisconnected.
    pass

# Restore KV cache
engine.wake_up()

Limitations

Memory overhead: The container memory allocation must be large enough to contain the GPU memory dump in addition to your regular memory use.

Execution lifecycle: When a container is restored from a checkpoint, execution continues from the point where the HTTP request was sent. Any environment variables read before this point remain the same as they were at the time of the checkpoint.

Network connections: Any TCP connections made before the checkpoint will have disconnected. For example, if you connected to a database before the checkpoint, you must reestablish that connection after restore.

Ephemeral filesystem: Any files written to disk before the checkpoint are not copied to the restored container. Only memory is checkpointed.

Provider availability: Checkpointing is only available on the AWS provider. More coming soon.
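
One way to work within the lifecycle and network limitations is to order startup so that only memory-resident work happens before the checkpoint, while connections and environment reads happen after it. The sketch below illustrates that ordering under stated assumptions: `start_container`, the `APP_REGION` variable, and the use of an in-memory SQLite database as a stand-in for a real connection are all hypothetical.

```python
import os
import sqlite3


def start_container(trigger_checkpoint, connect):
    """Illustrative startup ordering around a checkpoint."""
    # 1. Expensive, memory-only initialization: this is what the snapshot preserves.
    state = {"lookup_table": {n: n * n for n in range(1000)}}

    # 2. Checkpoint. A restored container resumes execution right after this call.
    trigger_checkpoint()

    # 3. Runs on the original container and on every restored one, so
    #    connections are fresh and environment variables are read at restore time.
    state["db"] = connect()
    state["region"] = os.environ.get("APP_REGION", "unknown")  # hypothetical variable
    return state


if __name__ == "__main__":
    # Stand-ins: a no-op trigger and an in-memory database instead of a real one.
    state = start_container(lambda: None, lambda: sqlite3.connect(":memory:"))
    print(sorted(state))
```

The same ordering applies to files: anything a restored container needs on disk should be written after the checkpoint call returns.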

Platform-specific recommendations

vLLM

vLLM support for checkpointing is incomplete, but checkpointing vLLM is still possible. See vllm-project/vllm#34303 and related issues.

The larger the memory checkpoint, the slower the restore. You can substantially reduce the snapshot size and improve startup times by dropping the KV cache before the checkpoint and recreating it after restore. vLLM provides this functionality through vLLM Sleep Mode.