This tutorial deploys Llama 3.2 3B using TensorRT-LLM’s PyTorch backend, served through NVIDIA Triton Inference Server.
The TensorRT + Triton setup delivers 15x higher throughput with 100% reliability compared to the baseline (vanilla deployment), while reducing latency by 7-9x across all percentiles. See the Performance Analysis section for detailed test methodology and results.
You can view the final implementation here.
Why TensorRT + Triton?
Why TensorRT?
NVIDIA TensorRT is a software development kit for high-performance deep learning inference. It compiles model weights into optimized engines that run more efficiently on specific GPU hardware through CUDA-level optimizations, custom kernels, and optional quantization.
TensorRT requires you to specify optimization parameters upfront: GPU architecture, batch size, precision (FP8, INT8, etc.), and input/output shapes. This specialization lets TensorRT generate inference engines tuned to that exact configuration, which increases GPU utilization, reduces latency, and lowers inference costs compared to serving raw model weights.
Why Triton?
NVIDIA Triton Inference Server streamlines production AI deployment by handling operational concerns that are critical for serving models at scale. It provides automatic request batching, health checks, metrics collection, and standardized HTTP/gRPC APIs out of the box.
Triton supports multiple frameworks (TensorRT, PyTorch, TensorFlow, ONNX, etc.), offers built-in Prometheus metrics for observability, and integrates seamlessly with Kubernetes for auto-scaling. It also supports model versioning, A/B testing, and can chain multiple models into pipelines.
Here is a diagram of how Triton works.
Together, the two handle a request as follows:
- Client sends text via HTTP/gRPC to Triton
- Triton queues the request in the scheduler
- Triton batches incoming requests (it waits for more requests until the batch fills or the queue delay expires)
- When a batch is ready, Triton calls your Python backend
- TensorRT-LLM generates tokens for the entire batch in parallel on GPU
- Triton returns responses to clients
This setup allows multiple concurrent requests to be processed together on the GPU for maximum throughput.
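The queue-then-batch step above can be illustrated with a toy model. This is not Triton’s scheduler; it is a simplified, single-threaded sketch in which `form_batches` (a hypothetical helper, not part of any library) groups request arrival times into batches that close when either the batch is full or the queue delay measured from the first queued request has elapsed.

```python
from typing import List


def form_batches(arrivals_us: List[int], max_batch_size: int, max_delay_us: int) -> List[List[int]]:
    """Group arrival timestamps (microseconds) into batches.

    A batch closes when it already holds max_batch_size requests, or when the
    next request arrives more than max_delay_us after the first request in it.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    for t in sorted(arrivals_us):
        batch_full = len(current) == max_batch_size
        delay_expired = bool(current) and t - current[0] > max_delay_us
        if current and (batch_full or delay_expired):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


# Five requests: three arrive within 100 us of each other, two arrive later.
batches = form_batches([0, 40, 90, 500, 530], max_batch_size=128, max_delay_us=100)
print([len(b) for b in batches])
```

In this toy model, requests that arrive close together share a batch, while a short queue delay keeps isolated requests from waiting long for company.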
The following sections combine Triton and TensorRT-LLM into a working deployment.
Basic Setup
Install the Cerebrium CLI:
pip install cerebrium
cerebrium login
Create your project:
cerebrium init tensorrt-triton-demo
cd tensorrt-triton-demo
To download the model, first request access to it on Hugging Face. Then add your Hugging Face token to your Cerebrium project secrets as HF_AUTH_TOKEN through the dashboard, so the download can authenticate.
Implementation
All files should be placed in the same project directory.
Triton Model Configuration
Create config.pbtxt to define Triton’s model interface. See the full configuration reference for all available options.
name: "llama3_2"
backend: "python"
max_batch_size: 128

dynamic_batching {
  max_queue_delay_microseconds: 100
}

instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]

input [
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ 1 ]
  },
  {
    name: "max_tokens"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "top_p"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  }
]

output [
  {
    name: "text_output"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
This configuration tells Triton:
- Use Python backend (runs our model.py)
- Automatically batch up to 128 requests together for efficient GPU utilization
- Use dynamic batching with a 100 microsecond queue delay to maximize batch sizes
- Accept text input with optional sampling parameters
- Run on a single GPU instance
- Return generated text as output
Python Backend Implementation
Triton’s Python backend loads a TritonPythonModel class from model.py. This tutorial implements three of its methods (the backend requires only execute; initialize and finalize are optional):
- initialize(args): Called once when Triton loads the model. This is where you load the tokenizer and initialize TensorRT-LLM with your build configuration.
- execute(requests): Called every time Triton has a batch ready. Triton automatically batches incoming requests (up to your configured max_batch_size) and passes them here. This method extracts prompts from each request, runs batch inference with TensorRT-LLM, and returns one response per request.
- finalize(): Called when the model is being unloaded. Use this to clean up GPU memory and shut down the TensorRT-LLM engine.
Create model.py implementing Triton’s Python backend interface:
"""
Triton Python Backend for TensorRT-LLM.
"""
import numpy as np
import triton_python_backend_utils as pb_utils
import torch
from tensorrt_llm import LLM, SamplingParams, BuildConfig
from tensorrt_llm.plugin.plugin import PluginConfig
from transformers import AutoTokenizer
MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"
MODEL_DIR = f"/persistent-storage/models/{MODEL_ID}"


class TritonPythonModel:
    def initialize(self, args):
        """Initialize TensorRT-LLM with PyTorch backend."""
        print("Loading tokenizer...")
        self.tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)

        print("Initializing TensorRT-LLM...")
        plugin_config = PluginConfig.from_dict({
            "paged_kv_cache": True,
        })
        build_config = BuildConfig(
            plugin_config=plugin_config,
            max_input_len=4096,
            max_batch_size=128,  # Matches Triton max_batch_size in config.pbtxt
        )
        self.llm = LLM(
            model=MODEL_DIR,
            build_config=build_config,
            tensor_parallel_size=torch.cuda.device_count(),
        )
        print("✓ Model ready")

    def execute(self, requests):
        """
        Execute inference on batched requests.

        Triton automatically batches requests (up to max_batch_size: 128).
        This function processes the batch that Triton provides.
        """
        try:
            prompts = []
            sampling_params_list = []
            original_prompts = []

            # Extract data from each request in the batch. We need to loop over the requests:
            # https://github.com/triton-inference-server/python_backend?tab=readme-ov-file#execute
            for request in requests:
                try:
                    # Get input text - handle batched tensor structures
                    input_tensor = pb_utils.get_input_tensor_by_name(request, "text_input")
                    text_array = input_tensor.as_numpy()

                    # Extract text: 0-d arrays hold a single item, otherwise take the first element
                    if text_array.ndim == 0:
                        text = text_array.item()
                    else:
                        text = text_array.flat[0] if text_array.size > 0 else text_array.item()

                    # Decode if bytes
                    if isinstance(text, bytes):
                        text = text.decode('utf-8')
                    elif isinstance(text, np.str_):
                        text = str(text)

                    # Get optional parameters with defaults
                    max_tokens = 1024
                    if pb_utils.get_input_tensor_by_name(request, "max_tokens") is not None:
                        max_tokens_array = pb_utils.get_input_tensor_by_name(request, "max_tokens").as_numpy()
                        max_tokens = int(max_tokens_array.item() if max_tokens_array.ndim == 0 else max_tokens_array.flat[0])

                    temperature = 0.8
                    if pb_utils.get_input_tensor_by_name(request, "temperature") is not None:
                        temp_array = pb_utils.get_input_tensor_by_name(request, "temperature").as_numpy()
                        temperature = float(temp_array.item() if temp_array.ndim == 0 else temp_array.flat[0])

                    top_p = 0.95
                    if pb_utils.get_input_tensor_by_name(request, "top_p") is not None:
                        top_p_array = pb_utils.get_input_tensor_by_name(request, "top_p").as_numpy()
                        top_p = float(top_p_array.item() if top_p_array.ndim == 0 else top_p_array.flat[0])

                    # Format prompt using chat template
                    prompt = self.tokenizer.apply_chat_template(
                        [{"role": "user", "content": text}],
                        tokenize=False,
                        add_generation_prompt=True
                    )
                    prompts.append(prompt)
                    original_prompts.append(prompt)
                    sampling_params_list.append(SamplingParams(
                        temperature=temperature,
                        top_p=top_p,
                        max_tokens=max_tokens,
                    ))
                except Exception as e:
                    # Keep list positions aligned with requests so every request still gets a response
                    print(f"Error processing request: {e}", flush=True)
                    prompts.append("")
                    original_prompts.append("")
                    sampling_params_list.append(SamplingParams(max_tokens=1024))

            # Batch inference
            if not prompts:
                return []
            outputs = self.llm.generate(prompts, sampling_params_list)

            # Create responses
            responses = []
            for i, output in enumerate(outputs):
                try:
                    generated_text = output.outputs[0].text
                    # Strip prompt from output if included
                    if original_prompts[i] and original_prompts[i] in generated_text:
                        generated_text = generated_text.replace(original_prompts[i], "").strip()
                    responses.append(pb_utils.InferenceResponse(
                        output_tensors=[pb_utils.Tensor(
                            "text_output",
                            np.array([generated_text.encode('utf-8')], dtype=object)
                        )]
                    ))
                except Exception as e:
                    print(f"Error creating response {i}: {e}", flush=True)
                    responses.append(pb_utils.InferenceResponse(
                        output_tensors=[pb_utils.Tensor(
                            "text_output",
                            np.array([f"Error: {str(e)}".encode('utf-8')], dtype=object)
                        )]
                    ))
            return responses
        except Exception as e:
            # Return one error response per request so Triton can answer every client
            print(f"Error in execute: {e}", flush=True)
            return [
                pb_utils.InferenceResponse(
                    output_tensors=[pb_utils.Tensor(
                        "text_output",
                        np.array([f"Batch error: {str(e)}".encode('utf-8')], dtype=object)
                    )]
                )
                for _ in requests
            ]

    def finalize(self):
        """Cleanup on shutdown."""
        if hasattr(self, 'llm'):
            self.llm.shutdown()
            torch.cuda.empty_cache()
Model Download Script
Create download_model.py to download the model:
#!/usr/bin/env python3
"""Download HuggingFace model to persistent storage."""
import os
from pathlib import Path
from huggingface_hub import snapshot_download, login
MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"
MODEL_DIR = Path("/persistent-storage/models") / MODEL_ID


def download_model():
    """Download model if not already present."""
    hf_token = os.environ.get("HF_AUTH_TOKEN")
    if not hf_token:
        print("WARNING: HF_AUTH_TOKEN not set")
        return

    if MODEL_DIR.exists() and any(MODEL_DIR.iterdir()):
        print("✓ Model already exists")
        return

    print("Downloading model...")
    login(token=hf_token)
    snapshot_download(
        MODEL_ID,
        local_dir=str(MODEL_DIR),
        token=hf_token
    )
    print("✓ Model downloaded")


if __name__ == "__main__":
    download_model()
This script checks if the model exists in persistent storage before downloading to avoid redundant downloads on subsequent deployments.
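The existence check can be tried in isolation. The sketch below uses only the standard library and a temporary directory in place of /persistent-storage; `needs_download` is a hypothetical helper name that mirrors the condition used in download_model.py rather than anything taken from the script itself.

```python
import tempfile
from pathlib import Path


def needs_download(model_dir: Path) -> bool:
    """Return True when model_dir is missing or contains no entries."""
    return not (model_dir.is_dir() and any(model_dir.iterdir()))


with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "models" / "example-model"
    print(needs_download(model_dir))   # directory does not exist yet
    model_dir.mkdir(parents=True)
    print(needs_download(model_dir))   # exists but is empty
    (model_dir / "config.json").write_text("{}")
    print(needs_download(model_dir))   # now contains a file
```

A check like this treats any non-empty directory as a complete download, so an interrupted download would also be skipped on the next run. Clearing the directory before retrying is one way to recover from that case.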
Container Setup
Create a Dockerfile that extends NVIDIA’s Triton container:
FROM nvcr.io/nvidia/tritonserver:25.10-trtllm-python-py3
ENV PYTHONPATH=/usr/local/lib/python3.12/dist-packages:$PYTHONPATH
ENV PYTHONDONTWRITEBYTECODE=1
ENV DEBIAN_FRONTEND=noninteractive
ENV HF_HOME=/persistent-storage/models
ENV TORCH_CUDA_ARCH_LIST=8.6
# Install dependencies
RUN apt-get update && apt-get install -y \
    git \
    git-lfs \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /app
RUN pip install --break-system-packages \
    huggingface_hub \
    transformers \
    || true
# Create directories
RUN mkdir -p \
    /app/model_repository/llama3_2/1 \
    /persistent-storage/models \
    /persistent-storage/engines
# Copy files
COPY model.py /app/model_repository/llama3_2/1/
COPY config.pbtxt /app/model_repository/llama3_2/
EXPOSE 8000 8001 8002
CMD ["tritonserver", "--model-repository=/app/model_repository", "--http-port=8000", "--grpc-port=8001", "--metrics-port=8002"]
The Dockerfile uses NVIDIA’s official Triton container with TensorRT-LLM pre-installed, creates the model repository structure that Triton expects (config.pbtxt in the model directory, model.py in the version directory 1), and copies our application files to those locations.
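A quick local check of that layout can catch a misplaced file before a deploy. The sketch below is an illustration only: `missing_repository_files` is a hypothetical helper, and it builds a throwaway directory tree rather than inspecting the real container.

```python
import tempfile
from pathlib import Path
from typing import List


def missing_repository_files(repo: Path, model_name: str, version: str = "1") -> List[str]:
    """List the expected Python-backend files that are absent, relative to repo."""
    expected = [
        repo / model_name / "config.pbtxt",        # model configuration
        repo / model_name / version / "model.py",  # Python backend implementation
    ]
    return [p.relative_to(repo).as_posix() for p in expected if not p.is_file()]


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp) / "model_repository"
    (repo / "llama3_2" / "1").mkdir(parents=True)
    (repo / "llama3_2" / "config.pbtxt").write_text('name: "llama3_2"\n')
    print(missing_repository_files(repo, "llama3_2"))  # model.py not copied yet
    (repo / "llama3_2" / "1" / "model.py").write_text("# placeholder\n")
    print(missing_repository_files(repo, "llama3_2"))  # nothing missing
```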
Deployment Configuration
Configure the container and autoscaling environment in cerebrium.toml:
[cerebrium.deployment]
name = "tensorrt-triton-demo"
python_version = "3.12"
disable_auth = true
include = ['./*', 'cerebrium.toml']
exclude = ['.*']
deployment_initialization_timeout = 830
[cerebrium.hardware]
cpu = 4.0
memory = 40.0
compute = "AMPERE_A10"
gpu_count = 1
[cerebrium.scaling]
min_replicas = 0
max_replicas = 5
cooldown = 300
replica_concurrency = 128
scaling_metric = "concurrency_utilization"
[cerebrium.runtime.custom]
port = 8000
healthcheck_endpoint = "/v2/health/live"
readycheck_endpoint = "/v2/health/ready"
dockerfile_path = "./Dockerfile"
Key configuration details:
- replica_concurrency = 128: Each replica can handle up to 128 concurrent requests, matching our Triton batch size
- max_replicas = 5: Scale up to 5 replicas for peak load
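The two scaling settings bound how much traffic the deployment can absorb. The arithmetic below is a back-of-the-envelope sketch: `replicas_needed` is a hypothetical helper, and it assumes replicas are added only once existing ones are full, which may not match how the concurrency_utilization metric triggers scaling in practice.

```python
import math


def replicas_needed(concurrent_requests: int, replica_concurrency: int = 128, max_replicas: int = 5) -> int:
    """Replicas required if each serves replica_concurrency requests at once, capped at max_replicas."""
    if concurrent_requests <= 0:
        return 0  # min_replicas = 0 allows scaling to zero when idle
    return min(max_replicas, math.ceil(concurrent_requests / replica_concurrency))


print(replicas_needed(100))    # fits in one replica
print(replicas_needed(300))    # ceil(300 / 128) = 3
print(replicas_needed(1000))   # capped at max_replicas = 5
print(5 * 128)                 # upper bound on in-flight requests: 640
```

Under these assumptions, requests beyond 640 in flight would have to wait for capacity rather than trigger another replica.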
Deploy
Download Model to Persistent Storage
Before deploying, download the model to Cerebrium’s persistent storage. This ensures the model is available across all deployments and avoids redundant downloads during container startup.
The cerebrium run command executes a Python script in a temporary container with the same environment and hardware configuration as the deployment. It has access to persistent storage at /persistent-storage, so any files written there are available to deployed containers.
Run the download script:
cerebrium run download_model.py
The logs confirm whether the model already exists or has been downloaded successfully.
Deploy the Model
Deploy the model:
cerebrium deploy
After successful deployment, the base endpoint URL appears in the output. Use this URL in the next section.
Test
Send a request to your deployed endpoint:
curl -X POST https://api.aws.us-east-1.cerebrium.ai/v4/<project-id>/<name>/v2/models/llama3_2/infer \
-H "Content-Type: application/json" \
-d '{
"inputs": [
{
"name": "text_input",
"shape": [1, 1],
"datatype": "BYTES",
"data": ["What is machine learning?"]
}
],
"outputs": [{"name": "text_output"}]
}'
The endpoint returns results in this format:
{
"outputs": [
{
"name": "text_output",
"datatype": "BYTES",
"shape": [1],
"data": [
"Machine learning is a subset of artificial intelligence (AI) that involves training algorithms..."
]
}
]
}
The response follows Triton’s standard inference protocol format with the generated text in the data field of the output tensor.
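The same request can be built from Python, including the optional sampling parameters declared in config.pbtxt. This is a minimal sketch using only the standard library: `build_infer_payload` and `extract_text` are hypothetical helper names, the shape [1, 1] for the optional inputs is assumed to mirror text_input in the curl example, and the sketch builds and parses JSON without sending anything over the network.

```python
import json
from typing import Any, Dict, Optional


def build_infer_payload(
    prompt: str,
    max_tokens: Optional[int] = None,
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
) -> Dict[str, Any]:
    """Build an inference request body; optional inputs are included only when set."""
    inputs = [{"name": "text_input", "shape": [1, 1], "datatype": "BYTES", "data": [prompt]}]
    if max_tokens is not None:
        inputs.append({"name": "max_tokens", "shape": [1, 1], "datatype": "INT32", "data": [max_tokens]})
    if temperature is not None:
        inputs.append({"name": "temperature", "shape": [1, 1], "datatype": "FP32", "data": [temperature]})
    if top_p is not None:
        inputs.append({"name": "top_p", "shape": [1, 1], "datatype": "FP32", "data": [top_p]})
    return {"inputs": inputs, "outputs": [{"name": "text_output"}]}


def extract_text(response_body: str) -> str:
    """Return the generated text from a response in the format shown above."""
    for output in json.loads(response_body)["outputs"]:
        if output["name"] == "text_output":
            return output["data"][0]
    raise KeyError("text_output")


payload = build_infer_payload("What is machine learning?", max_tokens=256, temperature=0.2)
print(json.dumps(payload, indent=2))
```

Inputs that are omitted fall back to the defaults in model.py (max_tokens 1024, temperature 0.8, top_p 0.95). The resulting JSON can be posted to the same /v2/models/llama3_2/infer URL with any HTTP client.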
Test Setup
To validate performance improvements, TensorRT + Triton was compared against a vanilla HuggingFace baseline serving the same Llama 3.2 3B Instruct model. Both deployments used identical hardware (NVIDIA A10 GPU) and were tested under the same load conditions.
Vanilla Baseline Setup:
- Model served directly using HuggingFace Transformers with PyTorch
- Single request processing (no batching)
- Standard FastAPI endpoint
- Same hardware configuration (A10 GPU, 4 CPU cores, 40GB memory)
TensorRT + Triton Setup:
- TensorRT-LLM with PyTorch backend
- Triton Inference Server with dynamic batching (max batch size: 128)
- Automatic request queuing and batching
- Same hardware configuration (A10 GPU, 4 CPU cores, 40GB memory)
Both deployments were tested with the same load testing parameters to ensure fair comparison.
Results
| Metric | Vanilla Baseline | TensorRT + Triton | Improvement |
|---|---|---|---|
| Requests Per Second (RPS) | 0.83 | 12.46 | 15x higher |
| Success Rate | 61.6% | 100.0% | +38.4 percentage points |
| P50 Latency | 297.7s | 41.7s | 7.1x faster |
| P99 Latency | 593.2s | 79.3s | 7.5x faster |
| Average Latency | 376.2s | 42.4s | 8.9x faster |
The TensorRT + Triton setup delivers 15x higher throughput with 100% reliability compared to the baseline, while reducing latency by 7-9x across all percentiles. The baseline’s 61.6% success rate and high latency come from processing requests sequentially without batching, leading to GPU underutilization and request timeouts. TensorRT + Triton eliminates these issues by keeping the GPU fully utilized with batched, optimized inference, resulting in 100% success rate and consistent, predictable latency.
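The improvement figures follow directly from the measured values. The short calculation below recomputes them from the numbers in the results table; `speedup` is a helper introduced here purely for illustration.

```python
# (baseline, TensorRT + Triton) pairs copied from the results table.
throughput_rps = (0.83, 12.46)
latency_s = {"P50": (297.7, 41.7), "P99": (593.2, 79.3), "Average": (376.2, 42.4)}
success_rate_pct = (61.6, 100.0)


def speedup(baseline: float, optimized: float, higher_is_better: bool) -> float:
    """Ratio between the two measurements, oriented so that larger means better."""
    ratio = optimized / baseline if higher_is_better else baseline / optimized
    return round(ratio, 1)


throughput_gain = speedup(*throughput_rps, higher_is_better=True)
latency_gains = {name: speedup(b, o, higher_is_better=False) for name, (b, o) in latency_s.items()}
success_gain_points = round(success_rate_pct[1] - success_rate_pct[0], 1)

print(throughput_gain)      # 15.0
print(latency_gains)        # {'P50': 7.1, 'P99': 7.5, 'Average': 8.9}
print(success_gain_points)  # 38.4
```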
In this test, TensorRT + Triton was not just faster but also more reliable, and serving more requests per GPU makes it the more cost-effective option for production LLM serving at scale.
Get Started
The complete implementation, including all configuration files and deployment scripts, is available in our GitHub repository.
Clone the repository and follow this tutorial to deploy Llama 3.2 3B (or adapt it for other models) with TensorRT-LLM and Triton Inference Server.