# Deploy Triton Inference Server and TensorRT-LLM

> Achieve high throughput with Triton Inference Server and the TensorRT-LLM framework

This tutorial deploys Llama 3.2 3B Instruct using TensorRT-LLM's PyTorch backend, served through NVIDIA Triton Inference Server.

In our load test, the TensorRT + Triton setup delivered **15x higher throughput** and a **100% request success rate** compared to the baseline (vanilla deployment), while reducing P50, P99, and average latency by **7-9x**. See the [Performance Analysis](#performance-analysis) section for detailed test methodology and results.

You can view the final implementation [here](https://github.com/CerebriumAI/examples/tree/master/5-large-language-models/8-faster-inference-with-triton-tensorrt).

## Why TensorRT + Triton?

### Why TensorRT?

NVIDIA TensorRT is a software development kit for high-performance deep learning inference. It compiles model weights into optimized engines that run more efficiently on specific GPU hardware through CUDA-level optimizations, custom kernels, and optional quantization.

TensorRT requires you to specify optimization parameters upfront: GPU architecture, batch size, precision (FP8, INT8, etc.), and input/output shapes. This specialization allows TensorRT to generate highly optimized inference engines that maximize GPU utilization, reduce latency, and lower inference costs compared to serving raw model weights.

### Why Triton?

NVIDIA Triton Inference Server streamlines production AI deployment by handling operational concerns that are critical for serving models at scale. It provides automatic request batching, health checks, metrics collection, and standardized HTTP/gRPC APIs out of the box.

Triton supports multiple frameworks (TensorRT, PyTorch, TensorFlow, ONNX, etc.), offers built-in Prometheus metrics for observability, and integrates seamlessly with Kubernetes for auto-scaling. It also supports model versioning, A/B testing, and can chain multiple models into pipelines.

[This diagram](https://substackcdn.com/image/fetch/\$s_!FEPb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d4460ad-0e7e-4545-aee6-274b93dd5959_2300x2304.gif) shows how Triton works.

Together, Triton and TensorRT-LLM handle a request as follows:

1. The client sends text via HTTP/gRPC to Triton
2. Triton queues the request in the scheduler
3. Triton batches incoming requests, waiting until the batch fills or the queue delay expires
4. When a batch is ready, Triton calls your Python backend
5. TensorRT-LLM generates tokens for the entire batch in parallel on the GPU
6. Triton returns the responses to the clients

This setup allows multiple concurrent requests to be processed together on the GPU for maximum throughput.
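
The batching step (step 3) can be illustrated with a small simulation. This is a deliberately simplified sketch and not Triton's actual scheduler, which also accounts for instance availability and preferred batch sizes. The `Request` class, the `form_batches` function, and the arrival times are hypothetical names and values chosen for illustration; only the two limits (batch size and queue delay) correspond to settings used later in this tutorial.

```python
from dataclasses import dataclass


@dataclass
class Request:
    request_id: int
    arrival_us: int  # arrival time in microseconds (made-up values)


def form_batches(requests, max_batch_size, max_queue_delay_us):
    """Group requests like a simplified dynamic batcher.

    A batch closes when it is full, or when the next request arrives more
    than max_queue_delay_us after the first request in the batch.
    """
    batches = []
    current = []
    for req in sorted(requests, key=lambda r: r.arrival_us):
        batch_full = len(current) >= max_batch_size
        waited_too_long = bool(current) and (
            req.arrival_us - current[0].arrival_us > max_queue_delay_us
        )
        if current and (batch_full or waited_too_long):
            batches.append(current)
            current = []
        current.append(req)
    if current:
        batches.append(current)
    return batches


# Six requests arriving at different times (microseconds).
requests = [Request(i, t) for i, t in enumerate([0, 20, 90, 150, 160, 400])]
batches = form_batches(requests, max_batch_size=128, max_queue_delay_us=100)
batch_ids = [[r.request_id for r in batch] for batch in batches]
print(batch_ids)
```

Requests that arrive close together share one GPU pass, while an isolated request is dispatched alone once the delay window passes. A longer delay trades a little latency for larger batches.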

The following sections combine Triton and TensorRT-LLM into a working deployment.

## Basic Setup

Install the Cerebrium CLI:

```bash theme={null}
pip install cerebrium
cerebrium login
```

Create your project:

```bash theme={null}
cerebrium init tensorrt-triton-demo
cd tensorrt-triton-demo
```

To download the model, first [request access](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) on Hugging Face. Then add your [Hugging Face token](https://huggingface.co/settings/tokens) to your Cerebrium project secrets as `HF_AUTH_TOKEN` through the dashboard. The download script uses this token to authenticate.

## Implementation

All files should be placed in the same project directory.

### Triton Model Configuration

Create `config.pbtxt` to define Triton's model interface. See the [full configuration reference](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html#dynamic-batcher) for all available options.

```protobuf theme={null}
name: "llama3_2"
backend: "python"
max_batch_size: 128

dynamic_batching {
  max_queue_delay_microseconds: 100
}

instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]

input [
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ 1 ]
  },
  {
    name: "max_tokens"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "top_p"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  }
]

output [
  {
    name: "text_output"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
```

This configuration tells Triton to:

* Use the Python backend, which runs our `model.py`
* Batch up to 128 requests together for efficient GPU utilization
* Use dynamic batching, holding a request in the queue for up to 100 microseconds so that more requests can join its batch
* Accept text input with optional sampling parameters (`max_tokens`, `temperature`, `top_p`)
* Run a single model instance on the GPU
* Return generated text as output
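
One detail worth spelling out: when `max_batch_size` is greater than 0, Triton treats `dims` as the shape of a single item and expects a leading batch dimension on every tensor. That is why requests to this model use shape `[1, 1]` rather than `[1]`. The helper below is a hypothetical illustration of that rule, not part of Triton's API.

```python
def full_tensor_shape(dims, max_batch_size, batch=1):
    """Return the shape a client should send for a given config.

    dims and max_batch_size mirror the fields in config.pbtxt; batch is the
    number of items the client packs into one request.
    """
    if max_batch_size <= 0:
        # Batching disabled: dims already describe the full tensor.
        return list(dims)
    if not 1 <= batch <= max_batch_size:
        raise ValueError(f"batch must be between 1 and {max_batch_size}")
    return [batch, *dims]


single_prompt_shape = full_tensor_shape([1], max_batch_size=128)
print(single_prompt_shape)
```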

### Python Backend Implementation

Triton's Python backend requires implementing a `TritonPythonModel` class with three key methods:

* **`initialize(args)`**: Called once when Triton loads the model. This is where you load the tokenizer and initialize TensorRT-LLM with your build configuration.

* **`execute(requests)`**: Called every time Triton has a batch ready. Triton automatically batches incoming requests (up to your configured `max_batch_size`) and passes them here. This method extracts prompts from each request, runs batch inference with TensorRT-LLM, and returns responses.

* **`finalize()`**: Called when the model is being unloaded. Use this to clean up GPU memory and shut down the TensorRT-LLM engine.

Create `model.py` implementing Triton's Python backend interface:

```python theme={null}
"""
Triton Python Backend for TensorRT-LLM.
"""

import numpy as np
import triton_python_backend_utils as pb_utils
import torch
from tensorrt_llm import LLM, SamplingParams, BuildConfig
from tensorrt_llm.plugin.plugin import PluginConfig
from transformers import AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"
MODEL_DIR = f"/persistent-storage/models/{MODEL_ID}"


class TritonPythonModel:
    def initialize(self, args):
        """Initialize TensorRT-LLM with PyTorch backend."""
        print("Loading tokenizer...")
        self.tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)

        print("Initializing TensorRT-LLM...")
        plugin_config = PluginConfig.from_dict({
            "paged_kv_cache": True,
        })

        build_config = BuildConfig(
            plugin_config=plugin_config,
            max_input_len=4096,
            max_batch_size=128,  # Matches Triton max_batch_size in config.pbtxt
        )

        self.llm = LLM(
            model=MODEL_DIR,
            build_config=build_config,
            tensor_parallel_size=torch.cuda.device_count(),
        )
        print("✓ Model ready")

    def execute(self, requests):
        """
        Execute inference on batched requests.

        Triton automatically batches requests (up to max_batch_size: 128).
        This function processes the batch that Triton provides.
        """
        try:
            prompts = []
            sampling_params_list = []
            original_prompts = []

            # Extract data from each request in the batch. We need to loop through
            # the requests individually: https://github.com/triton-inference-server/python_backend?tab=readme-ov-file#execute
            for request in requests:
                try:
                    # Get input text - handle batched tensor structures
                    input_tensor = pb_utils.get_input_tensor_by_name(request, "text_input")
                    text_array = input_tensor.as_numpy()

                    # Take the first element whether the tensor is a scalar (0-d),
                    # [1], or [1, 1]. An empty tensor raises and is caught below.
                    if text_array.ndim == 0:
                        text = text_array.item()
                    else:
                        text = text_array.flat[0]

                    # Decode if bytes
                    if isinstance(text, bytes):
                        text = text.decode('utf-8')
                    elif isinstance(text, np.str_):
                        text = str(text)

                    # Get optional parameters with defaults
                    max_tokens = 1024
                    if pb_utils.get_input_tensor_by_name(request, "max_tokens") is not None:
                        max_tokens_array = pb_utils.get_input_tensor_by_name(request, "max_tokens").as_numpy()
                        max_tokens = int(max_tokens_array.item() if max_tokens_array.ndim == 0 else max_tokens_array.flat[0])

                    temperature = 0.8
                    if pb_utils.get_input_tensor_by_name(request, "temperature") is not None:
                        temp_array = pb_utils.get_input_tensor_by_name(request, "temperature").as_numpy()
                        temperature = float(temp_array.item() if temp_array.ndim == 0 else temp_array.flat[0])

                    top_p = 0.95
                    if pb_utils.get_input_tensor_by_name(request, "top_p") is not None:
                        top_p_array = pb_utils.get_input_tensor_by_name(request, "top_p").as_numpy()
                        top_p = float(top_p_array.item() if top_p_array.ndim == 0 else top_p_array.flat[0])

                    # Format prompt using chat template
                    prompt = self.tokenizer.apply_chat_template(
                        [{"role": "user", "content": text}],
                        tokenize=False,
                        add_generation_prompt=True
                    )

                    prompts.append(prompt)
                    original_prompts.append(prompt)
                    sampling_params_list.append(SamplingParams(
                        temperature=temperature,
                        top_p=top_p,
                        max_tokens=max_tokens,
                    ))
                except Exception as e:
                    print(f"Error processing request: {e}", flush=True)
                    prompts.append("")
                    original_prompts.append("")
                    sampling_params_list.append(SamplingParams(max_tokens=1024))

            # Batch inference
            if not prompts:
                return []

            outputs = self.llm.generate(prompts, sampling_params_list)

            # Create responses
            responses = []
            for i, output in enumerate(outputs):
                try:
                    generated_text = output.outputs[0].text

                    # Strip prompt from output if included
                    if original_prompts[i] and original_prompts[i] in generated_text:
                        generated_text = generated_text.replace(original_prompts[i], "").strip()

                    responses.append(pb_utils.InferenceResponse(
                        output_tensors=[pb_utils.Tensor(
                            "text_output",
                            np.array([generated_text.encode('utf-8')], dtype=object)
                        )]
                    ))
                except Exception as e:
                    print(f"Error creating response {i}: {e}", flush=True)
                    responses.append(pb_utils.InferenceResponse(
                        output_tensors=[pb_utils.Tensor(
                            "text_output",
                            np.array([f"Error: {str(e)}".encode('utf-8')], dtype=object)
                        )]
                    ))

            return responses

        except Exception as e:
            print(f"Error in execute: {e}", flush=True)
            return [
                pb_utils.InferenceResponse(
                    output_tensors=[pb_utils.Tensor(
                        "text_output",
                        np.array([f"Batch error: {str(e)}".encode('utf-8')], dtype=object)
                    )]
                )
                for _ in requests
            ]

    def finalize(self):
        """Cleanup on shutdown."""
        if hasattr(self, 'llm'):
            self.llm.shutdown()
            torch.cuda.empty_cache()
```

### Model Download Script

Create `download_model.py` to download the model:

```python theme={null}
#!/usr/bin/env python3
"""Download HuggingFace model to persistent storage."""

import os
from pathlib import Path
from huggingface_hub import snapshot_download, login

MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"
MODEL_DIR = Path("/persistent-storage/models") / MODEL_ID


def download_model():
    """Download model if not already present."""
    hf_token = os.environ.get("HF_AUTH_TOKEN")

    if not hf_token:
        print("WARNING: HF_AUTH_TOKEN not set")
        return

    if MODEL_DIR.exists() and any(MODEL_DIR.iterdir()):
        print("✓ Model already exists")
        return

    print("Downloading model...")
    login(token=hf_token)
    snapshot_download(
        MODEL_ID,
        local_dir=str(MODEL_DIR),
        token=hf_token
    )
    print("✓ Model downloaded")


if __name__ == "__main__":
    download_model()
```

This script checks if the model exists in persistent storage before downloading to avoid redundant downloads on subsequent deployments.
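
Note that the `any(MODEL_DIR.iterdir())` check treats any non-empty directory as a finished download, so an interrupted download would be skipped on the next run. If that matters for your workflow, a stricter check could look for specific files. The sketch below is one possible approach under stated assumptions: `model_is_complete` is a hypothetical helper, and the expectation that the snapshot contains `config.json` and at least one `*.safetensors` file is an assumption about the repository layout that you should verify.

```python
import tempfile
from pathlib import Path

# Assumption: files we expect in a complete snapshot. Adjust for your model.
REQUIRED_FILES = ("config.json",)
WEIGHTS_PATTERN = "*.safetensors"


def model_is_complete(model_dir: Path) -> bool:
    """Return True only if the expected files are present in model_dir."""
    if not model_dir.is_dir():
        return False
    has_required = all((model_dir / name).is_file() for name in REQUIRED_FILES)
    has_weights = any(model_dir.glob(WEIGHTS_PATTERN))
    return has_required and has_weights


# Demonstration against a temporary directory instead of persistent storage.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp)
    (model_dir / "config.json").write_text("{}")
    partial = model_is_complete(model_dir)  # weights are still missing
    (model_dir / "model.safetensors").write_bytes(b"")
    complete = model_is_complete(model_dir)

print(partial, complete)
```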

### Container Setup

Create `Dockerfile` extending NVIDIA's Triton container:

```dockerfile theme={null}
FROM nvcr.io/nvidia/tritonserver:25.10-trtllm-python-py3

ENV PYTHONPATH=/usr/local/lib/python3.12/dist-packages:$PYTHONPATH
ENV PYTHONDONTWRITEBYTECODE=1
ENV DEBIAN_FRONTEND=noninteractive
ENV HF_HOME=/persistent-storage/models
ENV TORCH_CUDA_ARCH_LIST=8.6

# Install dependencies
RUN apt-get update && apt-get install -y \
    git \
    git-lfs \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

RUN pip install --break-system-packages \
    huggingface_hub \
    transformers \
    || true

# Create directories
RUN mkdir -p \
    /app/model_repository/llama3_2/1 \
    /persistent-storage/models \
    /persistent-storage/engines

# Copy files
COPY model.py /app/model_repository/llama3_2/1/
COPY config.pbtxt /app/model_repository/llama3_2/

EXPOSE 8000 8001 8002

CMD ["tritonserver", "--model-repository=/app/model_repository", "--http-port=8000", "--grpc-port=8001", "--metrics-port=8002"]
```

The Dockerfile uses NVIDIA's official Triton container with TensorRT-LLM pre-installed, creates the model repository structure that Triton expects, and copies our application files to the correct locations.
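
Triton's Python backend expects `config.pbtxt` in the model directory and `model.py` inside a numeric version subdirectory, which is what the `COPY` lines above produce. If you want to sanity-check that layout locally before building, a small script along these lines may help. `check_model_repository` is a hypothetical helper written for this tutorial, and the demonstration uses a temporary directory rather than `/app/model_repository`.

```python
import tempfile
from pathlib import Path


def check_model_repository(repo: Path, model_name: str, version: str = "1") -> list[str]:
    """Return a list of layout problems; an empty list means the layout looks right."""
    problems = []
    model_dir = repo / model_name
    if not (model_dir / "config.pbtxt").is_file():
        problems.append(f"missing {model_name}/config.pbtxt")
    if not (model_dir / version / "model.py").is_file():
        problems.append(f"missing {model_name}/{version}/model.py")
    return problems


# Build the same structure the Dockerfile creates, one file at a time.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp) / "model_repository"
    (repo / "llama3_2" / "1").mkdir(parents=True)
    (repo / "llama3_2" / "config.pbtxt").write_text('name: "llama3_2"\n')
    before = check_model_repository(repo, "llama3_2")  # model.py not copied yet
    (repo / "llama3_2" / "1" / "model.py").write_text("# backend code\n")
    after = check_model_repository(repo, "llama3_2")

print(before, after)
```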

### Deployment Configuration

Configure the container and autoscaling environment in `cerebrium.toml`:

```toml theme={null}
[cerebrium.deployment]
name = "tensorrt-triton-demo"
python_version = "3.12"
disable_auth = true
include = ['./*', 'cerebrium.toml']
exclude = ['.*']
deployment_initialization_timeout = 830

[cerebrium.hardware]
cpu = 4.0
memory = 40.0
compute = "AMPERE_A10"
gpu_count = 1

[cerebrium.scaling]
min_replicas = 0
max_replicas = 5
cooldown = 300
replica_concurrency = 128
scaling_metric = "concurrency_utilization"

[cerebrium.runtime.custom]
port = 8000
healthcheck_endpoint = "/v2/health/live"
readycheck_endpoint = "/v2/health/ready"
dockerfile_path = "./Dockerfile"
```

Key configuration details:

* `replica_concurrency = 128`: Each replica can handle up to 128 concurrent requests, matching our Triton batch size
* `max_replicas = 5`: Scale up to 5 replicas for peak load
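
As a rough back-of-the-envelope check, these two values bound how many requests the deployment can hold in flight at once. The sketch below is plain arithmetic on the numbers in `cerebrium.toml`, not a model of Cerebrium's autoscaler, whose actual scaling decisions depend on `scaling_metric` and platform behavior. `replicas_needed` is a hypothetical helper.

```python
import math

REPLICA_CONCURRENCY = 128  # from cerebrium.toml
MAX_REPLICAS = 5           # from cerebrium.toml


def replicas_needed(in_flight_requests: int) -> int:
    """Estimate replicas required for a given number of concurrent requests."""
    if in_flight_requests <= 0:
        return 0  # min_replicas = 0 allows scaling to zero when idle
    return min(MAX_REPLICAS, math.ceil(in_flight_requests / REPLICA_CONCURRENCY))


max_in_flight = REPLICA_CONCURRENCY * MAX_REPLICAS
print(max_in_flight)
print([replicas_needed(n) for n in (0, 1, 128, 129, 1000)])
```

Beyond `max_in_flight` concurrent requests, additional traffic has to wait, so raise `max_replicas` if you expect sustained load above that level.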

## Deploy

### Download Model to Persistent Storage

Before deploying, download the model to Cerebrium's persistent storage. This ensures the model is available across all deployments and avoids redundant downloads during container startup.

The `cerebrium run` command executes a Python script in a temporary container with the same environment and hardware configuration as the deployment. It has access to persistent storage at `/persistent-storage`, so any files written there are available to deployed containers.

Run the download script:

```bash theme={null}
cerebrium run download_model.py
```

The logs confirm whether the model already exists or has been downloaded successfully.

### Deploy the Model

Run the deploy command from the project directory:

```bash theme={null}
cerebrium deploy
```

After successful deployment, the base endpoint URL appears in the output. Use this URL in the next section.

## Test

Send a request to your deployed endpoint:

```bash theme={null}
curl -X POST https://api.aws.us-east-1.cerebrium.ai/v4/<project-id>/<name>/v2/models/llama3_2/infer \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": [
      {
        "name": "text_input",
        "shape": [1, 1],
        "datatype": "BYTES",
        "data": ["What is machine learning?"]
      }
    ],
    "outputs": [{"name": "text_output"}]
  }'
```

The endpoint returns results in this format:

```json theme={null}
{
  "outputs": [
    {
      "name": "text_output",
      "datatype": "BYTES",
      "shape": [1],
      "data": [
        "Machine learning is a subset of artificial intelligence (AI) that involves training algorithms..."
      ]
    }
  ]
}
```

The response follows Triton's standard inference protocol format with the generated text in the `data` field of the output tensor.
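
The same request can be sent from Python, including the optional sampling parameters declared in `config.pbtxt`. The following is a minimal sketch using only the standard library. `build_infer_payload` and `infer` are hypothetical helper names, the datatypes `INT32` and `FP32` are chosen to match `TYPE_INT32` and `TYPE_FP32` in the config, and the timeout value is an arbitrary assumption.

```python
import json
import urllib.request


def build_infer_payload(prompt, max_tokens=None, temperature=None, top_p=None):
    """Build a Triton inference request body for the llama3_2 model."""
    inputs = [
        {"name": "text_input", "shape": [1, 1], "datatype": "BYTES", "data": [prompt]}
    ]
    optional = [
        ("max_tokens", "INT32", max_tokens),
        ("temperature", "FP32", temperature),
        ("top_p", "FP32", top_p),
    ]
    # Only include the optional inputs the caller actually set.
    for name, datatype, value in optional:
        if value is not None:
            inputs.append(
                {"name": name, "shape": [1, 1], "datatype": datatype, "data": [value]}
            )
    return {"inputs": inputs, "outputs": [{"name": "text_output"}]}


def infer(endpoint_url, prompt, **sampling):
    """POST the payload to the endpoint and return the generated text."""
    body = json.dumps(build_infer_payload(prompt, **sampling)).encode("utf-8")
    request = urllib.request.Request(
        endpoint_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        result = json.load(response)
    return result["outputs"][0]["data"][0]


payload = build_infer_payload("What is machine learning?", max_tokens=256, temperature=0.2)
print(json.dumps(payload, indent=2))
```

To call the deployed model, pass the full inference URL from the curl example above as `endpoint_url`, for example `infer(url, "What is machine learning?", max_tokens=256)`.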

## Performance Analysis

### Test Setup

To validate performance improvements, TensorRT + Triton was compared against a vanilla HuggingFace baseline serving the same Llama 3.2 3B Instruct model. Both deployments used identical hardware (NVIDIA A10 GPU) and were tested under the same load conditions.

**Vanilla Baseline Setup:**

* Model served directly using HuggingFace Transformers with PyTorch
* Single request processing (no batching)
* Standard FastAPI endpoint
* Same hardware configuration (A10 GPU, 4 CPU cores, 40GB memory)

**TensorRT + Triton Setup:**

* TensorRT-LLM with PyTorch backend
* Triton Inference Server with dynamic batching (max batch size: 128)
* Automatic request queuing and batching
* Same hardware configuration (A10 GPU, 4 CPU cores, 40GB memory)

Both deployments were tested with the same load testing parameters to ensure fair comparison.

### Results

| Metric                        | Vanilla Baseline | TensorRT + Triton | Improvement                 |
| ----------------------------- | ---------------- | ----------------- | --------------------------- |
| **Requests Per Second (RPS)** | 0.83             | 12.46             | **15x higher**              |
| **Success Rate**              | 61.6%            | 100.0%            | **+38.4 percentage points** |
| **P50 Latency**               | 297.7s           | 41.7s             | **7.1x lower**              |
| **P99 Latency**               | 593.2s           | 79.3s             | **7.5x lower**              |
| **Average Latency**           | 376.2s           | 42.4s             | **8.9x lower**              |

The TensorRT + Triton setup delivers **15x higher throughput** and a **100% success rate**, while reducing P50, P99, and average latency by **7-9x**. The baseline's 61.6% success rate and high latency come from processing requests sequentially without batching, which leaves the GPU underutilized and causes queued requests to time out. TensorRT + Triton avoids both problems by keeping the GPU busy with batched, optimized inference, which yields more consistent and predictable latency.
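
The improvement figures follow directly from the measured values. The short script below recomputes them from the table so the arithmetic is easy to verify. The dictionary keys are names chosen here for illustration.

```python
baseline = {"rps": 0.83, "success_pct": 61.6, "p50_s": 297.7, "p99_s": 593.2, "avg_s": 376.2}
optimized = {"rps": 12.46, "success_pct": 100.0, "p50_s": 41.7, "p99_s": 79.3, "avg_s": 42.4}

# Throughput: higher is better, so divide optimized by baseline.
throughput_gain = round(optimized["rps"] / baseline["rps"], 1)

# Latency: lower is better, so divide baseline by optimized.
latency_speedup = {
    key: round(baseline[key] / optimized[key], 1)
    for key in ("p50_s", "p99_s", "avg_s")
}

# Success rate: report the difference in percentage points.
success_gain_points = round(optimized["success_pct"] - baseline["success_pct"], 1)

print(throughput_gain, latency_speedup, success_gain_points)
```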

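These results indicate that, for this workload, TensorRT + Triton is not just faster but also more reliable than the baseline. Serving roughly 15x more requests per second on the same hardware also makes it more cost-effective for production LLM serving at scale.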

## Get Started

The complete implementation, including all configuration files and deployment scripts, is available in our [GitHub repository](https://github.com/CerebriumAI/examples/tree/master/5-large-language-models/8-faster-inference-with-triton-tensorrt).

Clone the repository and follow this tutorial to deploy Llama 3.2 3B (or adapt it for other models) with TensorRT-LLM and Triton Inference Server.
