Introduction
Memory checkpointing takes a snapshot of a container’s CPU memory and GPU memory, and uses it to speed up the startup of future containers. Applications that perform a large amount of work at container start time benefit the most from this process. This is useful for both CPU-only and GPU workloads. For CPU applications, checkpointing can preserve expensive initialization work such as imports, dependency loading, configuration setup, and in-memory state. For GPU applications, it can also preserve model weights, CUDA state, and compiled kernels.
For example, ML and LLM frameworks often load large model weights and compile CUDA kernels at container start time, which can take many seconds or minutes. Loading from a checkpoint that already contains this initialized state can skip most of that delay. Since this feature is still in beta, please report all issues to the team via our Discord Community or by email.
How to use
Checkpointing is available in early beta to our customer base.
1. Enable checkpointing in cerebrium.toml
2. Send the checkpoint trigger
You control when the checkpoint is created. Ideally, trigger it after most of your application’s initialization work is complete and the desired state is reached, because the state of your application at that exact moment is what future container launches restore. Once your startup logic reaches this point, send a POST request from inside the container to instruct Cerebrium to capture the checkpoint. The request goes to 169.254.169.253, a link-local address that routes to the Cerebrium runtime sidecar inside the container. The address is reachable only from inside the container, not from external networks.
Set the HTTP client timeout to at least 300 seconds. Checkpoint duration scales with the amount of memory captured - from a few seconds for small CPU workloads to several minutes for large GPU snapshots.
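As a rough illustration of the trigger and the timeout recommendation, the request could be sent with Python's standard library as sketched here. The address 169.254.169.253 and the 300-second minimum come from this page; the `/checkpoint` path, the plain `http` scheme, and the empty request body are placeholders assumed for the example, so substitute the real endpoint details for them.

```python
import urllib.request

# Link-local address of the runtime sidecar, as stated in this guide.
SIDECAR_HOST = "169.254.169.253"
# HYPOTHETICAL path and scheme: placeholders, not confirmed by this guide.
CHECKPOINT_PATH = "/checkpoint"
# Minimum recommended client timeout, in seconds.
CHECKPOINT_TIMEOUT_S = 300


def build_checkpoint_request(path: str = CHECKPOINT_PATH) -> urllib.request.Request:
    """Build (but do not send) the POST request for the checkpoint trigger."""
    url = f"http://{SIDECAR_HOST}{path}"
    # An empty body is an assumption; no payload is described in this guide.
    return urllib.request.Request(url, data=b"", method="POST")


def trigger_checkpoint(timeout_s: float = CHECKPOINT_TIMEOUT_S) -> int:
    """Send the trigger and return the HTTP status code.

    This only works from inside the container, where the sidecar is reachable.
    The long timeout allows for large GPU snapshots that take minutes.
    """
    request = build_checkpoint_request()
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.status
```

A call to `trigger_checkpoint()` would typically sit at the end of your startup logic, after models are loaded and warmed up. Note that `urlopen` raises an exception on connection failures and HTTP error statuses, so you may want to catch `urllib.error.URLError` if a failed checkpoint should not stop the application.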
3. When the runtime creates a checkpoint
When the runtime receives the POST, it checks whether a new checkpoint is required. To save resources, the system skips checkpoint creation if:
- A checkpoint already exists for the current build version.
- Another container instance is already undergoing the checkpointing process.
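The two skip conditions above amount to a simple decision rule. The sketch below restates that rule in Python purely for illustration; it is not the runtime's implementation, and the function name and the set-based bookkeeping are hypothetical.

```python
def should_create_checkpoint(
    build_version: str,
    checkpointed_versions: set[str],
    in_progress_versions: set[str],
) -> bool:
    """Return True only if neither skip condition applies (illustrative only)."""
    # Skip: a checkpoint already exists for the current build version.
    if build_version in checkpointed_versions:
        return False
    # Skip: another container instance is already creating the checkpoint.
    if build_version in in_progress_versions:
        return False
    return True
```

In practice this means the trigger is safe to send from every container at startup: repeated or concurrent requests for the same build do not produce duplicate checkpoints.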
4. Verify restoration
If checkpoint creation succeeds, subsequent containers restore from that snapshot. A restored container logs "CEREBRIUM_RESTORED: container restored from checkpoint" as its first log line.
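If you collect container logs programmatically, the check can be automated. The helper below is a minimal sketch that assumes you already have the log lines as a list of strings; the function name is hypothetical, and the substring match is an assumption made so that a timestamp or other prefix added by a log pipeline does not break the check.

```python
RESTORED_MARKER = "CEREBRIUM_RESTORED"


def restored_from_checkpoint(log_lines: list[str]) -> bool:
    """Return True if the first log line carries the restore marker."""
    if not log_lines:
        # No logs yet, so restoration cannot be confirmed.
        return False
    # Substring match tolerates prefixes such as timestamps (an assumption).
    return RESTORED_MARKER in log_lines[0]
```

For example, `restored_from_checkpoint(["CEREBRIUM_RESTORED: container restored from checkpoint", "server ready"])` returns `True`, while a cold-started container whose first line is ordinary startup output returns `False`.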