Deploy Triton Inference Server and TensorRT-LLM

This tutorial deploys Llama 3.2 3B using TensorRT-LLM’s PyTorch backend, served through NVIDIA Triton Inference Server. In our tests, the TensorRT + Triton setup delivered 15x higher throughput and a 100% request success rate compared to the baseline (vanilla deployment), while reducing P50, P99, and average latency by 7-9x. See the Performance Analysis section for detailed test methodology and results. You can view the final implementation here.

Why TensorRT + Triton?

Why TensorRT?

NVIDIA TensorRT is a software development kit for high-performance deep learning inference. It compiles model weights into optimized engines that run more efficiently on specific GPU hardware through CUDA-level optimizations, custom kernels, and optional quantization. TensorRT requires you to specify optimization parameters upfront: GPU architecture, batch size, precision (FP8, INT8, etc.), and input/output shapes. This specialization allows TensorRT to generate inference engines tuned to that exact configuration, which improves GPU utilization, reduces latency, and lowers inference costs compared to serving raw model weights.

Why Triton?

NVIDIA Triton Inference Server streamlines production AI deployment by handling the operational concerns of serving models at scale. It provides automatic request batching, health checks, metrics collection, and standardized HTTP/gRPC APIs out of the box. Triton supports multiple frameworks (TensorRT, PyTorch, TensorFlow, ONNX, etc.), offers built-in Prometheus metrics for observability, and integrates with Kubernetes for auto-scaling. It also supports model versioning and A/B testing, and can chain multiple models into pipelines. Here is a diagram of how Triton works. Together, Triton and TensorRT-LLM handle a request as follows:

1. The client sends text via HTTP/gRPC to Triton
2. Triton queues the request in the scheduler
3. Triton batches incoming requests (waiting for more requests until the batch is full or the queue delay expires)
4. When the batch is ready, Triton calls your Python backend
5. TensorRT-LLM generates tokens for the entire batch in parallel on the GPU
6. Triton returns the responses to the clients

This setup allows multiple concurrent requests to be processed together on the GPU for maximum throughput. The following sections combine Triton and TensorRT-LLM into a working deployment.
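The batching step above can be illustrated with a toy simulation. This is a simplified model written for this tutorial, not Triton's actual scheduler (which has additional policies); the `Request` class and `form_batches` function are hypothetical names. A batch closes when it reaches the maximum size or when a new request arrives after the oldest queued request has already waited longer than the queue delay.

```python
from dataclasses import dataclass


@dataclass
class Request:
    arrival_us: int  # arrival time in microseconds
    prompt: str


def form_batches(requests, max_batch_size, max_queue_delay_us):
    """Group requests into batches (toy model of dynamic batching)."""
    batches = []
    current = []
    for req in sorted(requests, key=lambda r: r.arrival_us):
        # Close the open batch if its oldest request has waited too long.
        if current and req.arrival_us - current[0].arrival_us > max_queue_delay_us:
            batches.append(current)
            current = []
        current.append(req)
        # Close the batch as soon as it is full.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


requests = [Request(0, "a"), Request(40, "b"), Request(90, "c"), Request(500, "d")]
batches = form_batches(requests, max_batch_size=128, max_queue_delay_us=100)
batch_prompts = [[r.prompt for r in b] for b in batches]
print(batch_prompts)  # [['a', 'b', 'c'], ['d']]
```

The first three requests arrive within 100 microseconds of each other and share one GPU pass, while the late request is processed in its own batch.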

Basic Setup

Install the Cerebrium CLI:

pip install cerebrium
cerebrium login

Create your project:

cerebrium init tensorrt-triton-demo
cd tensorrt-triton-demo

To download the model, first request access to it on Hugging Face. Then add your Hugging Face token to your Cerebrium project secrets as HF_AUTH_TOKEN through the dashboard. The download script uses this token to authenticate.

Implementation

All files should be placed in the same project directory.

Triton Model Configuration

Create config.pbtxt to define Triton’s model interface. See the full configuration reference for all available options.

name: "llama3_2"
backend: "python"
max_batch_size: 128

dynamic_batching {
  max_queue_delay_microseconds: 100
}

instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]

input [
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ 1 ]
  },
  {
    name: "max_tokens"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "top_p"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  }
]

output [
  {
    name: "text_output"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]

This configuration tells Triton:

Use Python backend (runs our model.py)
Batch up to 128 requests together for efficient GPU utilization
Use dynamic batching, holding requests in the queue for up to 100 microseconds so that requests arriving close together are grouped into one batch
Accept text input with optional sampling parameters
Run on a single GPU instance
Return generated text as output
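One detail of this configuration is easy to miss: when max_batch_size is greater than 0, the dims in config.pbtxt describe a single item and exclude the batch dimension, so the shape a client sends is the batch dimension followed by dims. The sketch below shows that arithmetic; the `request_shape` helper is a hypothetical name used only for illustration.

```python
MAX_BATCH_SIZE = 128   # max_batch_size from config.pbtxt
TEXT_INPUT_DIMS = [1]  # dims of "text_input" from config.pbtxt


def request_shape(batch_size, dims=TEXT_INPUT_DIMS, max_batch_size=MAX_BATCH_SIZE):
    """Return the full tensor shape a client sends: [batch] + per-item dims."""
    if not 1 <= batch_size <= max_batch_size:
        raise ValueError(f"batch_size must be between 1 and {max_batch_size}")
    return [batch_size, *dims]


single_prompt_shape = request_shape(1)
print(single_prompt_shape)  # [1, 1]
```

A request carrying one prompt therefore uses the shape [1, 1] for text_input.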

Python Backend Implementation

Triton’s Python backend expects a TritonPythonModel class. Only execute is required (initialize and finalize are optional), but this tutorial implements all three methods:

initialize(args): Called once when Triton loads the model. This is where you load the tokenizer and initialize TensorRT-LLM with your build configuration.
execute(requests): Called every time Triton has a batch ready. Triton automatically batches incoming requests (up to your configured max_batch_size) and passes them here. This method extracts prompts from each request, runs batch inference with TensorRT-LLM, and returns responses.
finalize(): Called when the model is being unloaded. Use this to clean up GPU memory and shut down the TensorRT-LLM engine.
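The lifecycle described above can be sketched without Triton installed. In the example below, `EchoModel`, `run_lifecycle`, and the `prefix` argument are hypothetical stand-ins: real requests are `pb_utils.InferenceRequest` objects rather than strings, and Triton itself drives the calls. The sketch only illustrates the call order and the rule that execute returns one response per request, in the same order.

```python
class EchoModel:
    """Stand-in with the same three methods as a TritonPythonModel."""

    def initialize(self, args):
        # Called once at load time; expensive setup belongs here.
        self.prefix = args.get("prefix", "echo: ")
        self.batches_seen = 0

    def execute(self, requests):
        # Called once per batch; must return one response per request.
        self.batches_seen += 1
        return [self.prefix + request for request in requests]

    def finalize(self):
        # Called once at unload time; release resources here.
        self.prefix = None


def run_lifecycle(model, batches, args):
    """Mimic the order in which the server calls the three methods."""
    model.initialize(args)
    results = []
    try:
        for batch in batches:
            responses = model.execute(batch)
            if len(responses) != len(batch):
                raise RuntimeError("execute must return one response per request")
            results.append(responses)
    finally:
        model.finalize()
    return results


model = EchoModel()
results = run_lifecycle(model, [["a", "b"], ["c"]], {"prefix": "> "})
print(results)  # [['> a', '> b'], ['> c']]
```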

Create model.py implementing Triton’s Python backend interface:

"""
Triton Python Backend for TensorRT-LLM.
"""

import numpy as np
import triton_python_backend_utils as pb_utils
import torch
from tensorrt_llm import LLM, SamplingParams, BuildConfig
from tensorrt_llm.plugin.plugin import PluginConfig
from transformers import AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"
MODEL_DIR = f"/persistent-storage/models/{MODEL_ID}"


class TritonPythonModel:
    def initialize(self, args):
        """Initialize TensorRT-LLM with PyTorch backend."""
        print("Loading tokenizer...")
        self.tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)

        print("Initializing TensorRT-LLM...")
        plugin_config = PluginConfig.from_dict({
            "paged_kv_cache": True,
        })

        build_config = BuildConfig(
            plugin_config=plugin_config,
            max_input_len=4096,
            max_batch_size=128,  # Matches Triton max_batch_size in config.pbtxt
        )

        self.llm = LLM(
            model=MODEL_DIR,
            build_config=build_config,
            tensor_parallel_size=torch.cuda.device_count(),
        )
        print("✓ Model ready")

    def execute(self, requests):
        """
        Execute inference on batched requests.

        Triton automatically batches requests (up to max_batch_size: 128).
        This function processes the batch that Triton provides.
        """
        try:
            prompts = []
            sampling_params_list = []
            original_prompts = []

            # Extract data from each request in the batch. We need to loop through requests: https://github.com/triton-inference-server/python_backend?tab=readme-ov-file#execute
            for request in requests:
                try:
                    # Get input text - handle batched tensor structures
                    input_tensor = pb_utils.get_input_tensor_by_name(request, "text_input")
                    text_array = input_tensor.as_numpy()

                    # Extract the first element, whether the array is 0-d or batched
                    if text_array.ndim == 0:
                        text = text_array.item()
                    else:
                        text = text_array.flat[0]

                    # Decode if bytes
                    if isinstance(text, bytes):
                        text = text.decode('utf-8')
                    elif isinstance(text, np.str_):
                        text = str(text)

                    # Get optional parameters with defaults
                    max_tokens = 1024
                    if pb_utils.get_input_tensor_by_name(request, "max_tokens") is not None:
                        max_tokens_array = pb_utils.get_input_tensor_by_name(request, "max_tokens").as_numpy()
                        max_tokens = int(max_tokens_array.item() if max_tokens_array.ndim == 0 else max_tokens_array.flat[0])

                    temperature = 0.8
                    if pb_utils.get_input_tensor_by_name(request, "temperature") is not None:
                        temp_array = pb_utils.get_input_tensor_by_name(request, "temperature").as_numpy()
                        temperature = float(temp_array.item() if temp_array.ndim == 0 else temp_array.flat[0])

                    top_p = 0.95
                    if pb_utils.get_input_tensor_by_name(request, "top_p") is not None:
                        top_p_array = pb_utils.get_input_tensor_by_name(request, "top_p").as_numpy()
                        top_p = float(top_p_array.item() if top_p_array.ndim == 0 else top_p_array.flat[0])

                    # Format prompt using chat template
                    prompt = self.tokenizer.apply_chat_template(
                        [{"role": "user", "content": text}],
                        tokenize=False,
                        add_generation_prompt=True
                    )

                    prompts.append(prompt)
                    original_prompts.append(prompt)
                    sampling_params_list.append(SamplingParams(
                        temperature=temperature,
                        top_p=top_p,
                        max_tokens=max_tokens,
                    ))
                except Exception as e:
                    print(f"Error processing request: {e}", flush=True)
                    prompts.append("")
                    original_prompts.append("")
                    sampling_params_list.append(SamplingParams(max_tokens=1024))

            # Batch inference
            if not prompts:
                return []

            outputs = self.llm.generate(prompts, sampling_params_list)

            # Create responses
            responses = []
            for i, output in enumerate(outputs):
                try:
                    generated_text = output.outputs[0].text

                    # Strip prompt from output if included
                    if original_prompts[i] and original_prompts[i] in generated_text:
                        generated_text = generated_text.replace(original_prompts[i], "").strip()

                    responses.append(pb_utils.InferenceResponse(
                        output_tensors=[pb_utils.Tensor(
                            "text_output",
                            np.array([generated_text.encode('utf-8')], dtype=object)
                        )]
                    ))
                except Exception as e:
                    print(f"Error creating response {i}: {e}", flush=True)
                    responses.append(pb_utils.InferenceResponse(
                        output_tensors=[pb_utils.Tensor(
                            "text_output",
                            np.array([f"Error: {str(e)}".encode('utf-8')], dtype=object)
                        )]
                    ))

            return responses

        except Exception as e:
            print(f"Error in execute: {e}", flush=True)
            return [
                pb_utils.InferenceResponse(
                    output_tensors=[pb_utils.Tensor(
                        "text_output",
                        np.array([f"Batch error: {str(e)}".encode('utf-8')], dtype=object)
                    )]
                )
                for _ in requests
            ]

    def finalize(self):
        """Cleanup on shutdown."""
        if hasattr(self, 'llm'):
            self.llm.shutdown()
            torch.cuda.empty_cache()

Model Download Script

Create download_model.py to download the model:

#!/usr/bin/env python3
"""Download HuggingFace model to persistent storage."""

import os
from pathlib import Path
from huggingface_hub import snapshot_download, login

MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"
MODEL_DIR = Path("/persistent-storage/models") / MODEL_ID


def download_model():
    """Download model if not already present."""
    hf_token = os.environ.get("HF_AUTH_TOKEN")

    if not hf_token:
        print("WARNING: HF_AUTH_TOKEN not set")
        return

    if MODEL_DIR.exists() and any(MODEL_DIR.iterdir()):
        print("✓ Model already exists")
        return

    print("Downloading model...")
    login(token=hf_token)
    snapshot_download(
        MODEL_ID,
        local_dir=str(MODEL_DIR),
        token=hf_token
    )
    print("✓ Model downloaded")


if __name__ == "__main__":
    download_model()

This script checks if the model exists in persistent storage before downloading to avoid redundant downloads on subsequent deployments.
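Note that the check above treats any non-empty model directory as a finished download, so an interrupted download could be mistaken for a complete one. One possible hardening, shown here as a sketch and not part of the original script, is to write a marker file only after the download succeeds. The marker name and both helper functions are hypothetical.

```python
from pathlib import Path
import tempfile

MARKER_NAME = ".download_complete"  # hypothetical marker file name


def is_download_complete(model_dir: Path) -> bool:
    """True only if a previous run finished and left the marker behind."""
    return (model_dir / MARKER_NAME).is_file()


def mark_download_complete(model_dir: Path) -> None:
    """Call this after the download returns without raising."""
    model_dir.mkdir(parents=True, exist_ok=True)
    (model_dir / MARKER_NAME).touch()


# Demonstration in a temporary directory instead of /persistent-storage.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "models" / "demo"
    model_dir.mkdir(parents=True)
    (model_dir / "partial.safetensors").touch()  # simulate a partial download
    before = is_download_complete(model_dir)
    mark_download_complete(model_dir)
    after = is_download_complete(model_dir)

print(before, after)  # False True
```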

Container Setup

Create Dockerfile extending NVIDIA’s Triton container:

FROM nvcr.io/nvidia/tritonserver:25.10-trtllm-python-py3

ENV PYTHONPATH=/usr/local/lib/python3.12/dist-packages:$PYTHONPATH
ENV PYTHONDONTWRITEBYTECODE=1
ENV DEBIAN_FRONTEND=noninteractive
ENV HF_HOME=/persistent-storage/models
ENV TORCH_CUDA_ARCH_LIST=8.6

# Install dependencies
RUN apt-get update && apt-get install -y \
    git \
    git-lfs \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

RUN pip install --break-system-packages \
    huggingface_hub \
    transformers \
    || true

# Create directories
RUN mkdir -p \
    /app/model_repository/llama3_2/1 \
    /persistent-storage/models \
    /persistent-storage/engines

# Copy files
COPY model.py /app/model_repository/llama3_2/1/
COPY config.pbtxt /app/model_repository/llama3_2/

EXPOSE 8000 8001 8002

CMD ["tritonserver", "--model-repository=/app/model_repository", "--http-port=8000", "--grpc-port=8001", "--metrics-port=8002"]

The Dockerfile uses NVIDIA’s official Triton container with TensorRT-LLM pre-installed, creates the model repository structure that Triton expects, and copies our application files to the correct locations.
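The repository layout that the Dockerfile builds is config.pbtxt in the model directory and model.py in a numbered version directory beneath it. As a sketch, the hypothetical helper below checks a directory for those two files. It is demonstrated on a temporary directory here, but could be pointed at /app/model_repository inside the container.

```python
from pathlib import Path
import tempfile


def missing_repository_files(repo: Path, model_name: str, version: str = "1"):
    """Return the expected files (relative to repo) that are not present."""
    expected = [
        repo / model_name / "config.pbtxt",
        repo / model_name / version / "model.py",
    ]
    return [p.relative_to(repo).as_posix() for p in expected if not p.is_file()]


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp) / "model_repository"
    (repo / "llama3_2" / "1").mkdir(parents=True)
    (repo / "llama3_2" / "config.pbtxt").touch()
    missing_before = missing_repository_files(repo, "llama3_2")
    (repo / "llama3_2" / "1" / "model.py").touch()
    missing_after = missing_repository_files(repo, "llama3_2")

print(missing_before)  # ['llama3_2/1/model.py']
print(missing_after)   # []
```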

Deployment Configuration

Configure the container and autoscaling environment in cerebrium.toml:

[cerebrium.deployment]
name = "tensorrt-triton-demo"
python_version = "3.12"
disable_auth = true
include = ['./*', 'cerebrium.toml']
exclude = ['.*']
deployment_initialization_timeout = 830

[cerebrium.hardware]
cpu = 4.0
memory = 40.0
compute = "AMPERE_A10"
gpu_count = 1

[cerebrium.scaling]
min_replicas = 0
max_replicas = 5
cooldown = 300
replica_concurrency = 128
scaling_metric = "concurrency_utilization"

[cerebrium.runtime.custom]
port = 8000
healthcheck_endpoint = "/v2/health/live"
readycheck_endpoint = "/v2/health/ready"
dockerfile_path = "./Dockerfile"

Key configuration details:

replica_concurrency = 128: Each replica can handle up to 128 concurrent requests, matching our Triton batch size
max_replicas = 5: Scale up to 5 replicas for peak load
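Taken together, these two settings bound how many requests the deployment can have in flight. The arithmetic below is a rough sketch that assumes every replica is filled to its concurrency limit before another is added. The platform's actual autoscaling policy may differ, and both function names are hypothetical.

```python
import math

REPLICA_CONCURRENCY = 128  # from cerebrium.toml
MAX_REPLICAS = 5           # from cerebrium.toml


def peak_concurrency(replica_concurrency=REPLICA_CONCURRENCY, max_replicas=MAX_REPLICAS):
    """Upper bound on simultaneous in-flight requests across all replicas."""
    return replica_concurrency * max_replicas


def replicas_needed(concurrent_requests, replica_concurrency=REPLICA_CONCURRENCY,
                    max_replicas=MAX_REPLICAS):
    """Replicas required for a given load, capped at max_replicas."""
    needed = math.ceil(concurrent_requests / replica_concurrency)
    return min(needed, max_replicas)


print(peak_concurrency())     # 640
print(replicas_needed(300))   # 3
print(replicas_needed(1000))  # 5
```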

Deploy

Download Model to Persistent Storage

Before deploying, download the model to Cerebrium’s persistent storage. This ensures the model is available across all deployments and avoids redundant downloads during container startup. The cerebrium run command executes a Python script in a temporary container with the same environment and hardware configuration as the deployment. It has access to persistent storage at /persistent-storage, so any files written there are available to deployed containers. Run the download script:

cerebrium run download_model.py

The logs confirm whether the model already exists or has been downloaded successfully.

Deploy the Model

Deploy the model:

cerebrium deploy

After successful deployment, the base endpoint URL appears in the output. Use this URL in the next section.

Test

Send a request to your deployed endpoint:

curl -X POST https://api.aws.us-east-1.cerebrium.ai/v4/<project-id>/<name>/v2/models/llama3_2/infer \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": [
      {
        "name": "text_input",
        "shape": [1, 1],
        "datatype": "BYTES",
        "data": ["What is machine learning?"]
      }
    ],
    "outputs": [{"name": "text_output"}]
  }'

The endpoint returns results in this format:

{
  "outputs": [
    {
      "name": "text_output",
      "datatype": "BYTES",
      "shape": [1],
      "data": [
        "Machine learning is a subset of artificial intelligence (AI) that involves training algorithms..."
      ]
    }
  ]
}

The response follows Triton’s standard inference protocol format with the generated text in the data field of the output tensor.
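The same request can be built programmatically, including the optional sampling parameters defined in config.pbtxt. The sketch below only constructs the JSON body and parses a response of the shape shown above; it does not send anything over the network. The helper names `build_infer_payload` and `extract_text` are hypothetical.

```python
import json


def build_infer_payload(prompt, max_tokens=None, temperature=None, top_p=None):
    """Build the JSON body for the infer endpoint of the llama3_2 model."""
    inputs = [
        {"name": "text_input", "shape": [1, 1], "datatype": "BYTES", "data": [prompt]}
    ]
    optional = [
        ("max_tokens", "INT32", max_tokens),
        ("temperature", "FP32", temperature),
        ("top_p", "FP32", top_p),
    ]
    for name, datatype, value in optional:
        if value is not None:  # omitted inputs fall back to the defaults in model.py
            inputs.append(
                {"name": name, "shape": [1, 1], "datatype": datatype, "data": [value]}
            )
    return {"inputs": inputs, "outputs": [{"name": "text_output"}]}


def extract_text(response):
    """Pull the generated text out of a parsed inference response."""
    for output in response["outputs"]:
        if output["name"] == "text_output":
            return output["data"][0]
    raise KeyError("text_output not found in response")


payload = build_infer_payload("What is machine learning?", max_tokens=256)
body = json.dumps(payload)

sample_response = {
    "outputs": [
        {"name": "text_output", "datatype": "BYTES", "shape": [1],
         "data": ["Machine learning is a subset of artificial intelligence..."]}
    ]
}
text = extract_text(sample_response)
```

The string in `body` can be sent as the POST body with any HTTP client, using the same URL and Content-Type header as the curl command.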

Performance Analysis

Test Setup

To validate performance improvements, TensorRT + Triton was compared against a vanilla Hugging Face baseline serving the same Llama 3.2 3B Instruct model. Both deployments used identical hardware (NVIDIA A10 GPU) and were tested under the same load conditions.

Vanilla Baseline Setup:

Model served directly using Hugging Face Transformers with PyTorch
Single request processing (no batching)
Standard FastAPI endpoint
Same hardware configuration (A10 GPU, 4 CPU cores, 40GB memory)

TensorRT + Triton Setup:

TensorRT-LLM with PyTorch backend
Triton Inference Server with dynamic batching (max batch size: 128)
Automatic request queuing and batching
Same hardware configuration (A10 GPU, 4 CPU cores, 40GB memory)

Both deployments were tested with the same load testing parameters to ensure fair comparison.
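The load-testing tool and its parameters are not specified here. As a sketch of how metrics of this kind are commonly derived from raw samples, the example below computes throughput, success rate, and latency percentiles using the nearest-rank method. The sample values are made up for illustration, and counting only successful requests toward RPS is an assumption.

```python
import math


def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending list (p between 0 and 100)."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(latencies_s, failures, duration_s):
    """Summarize a load test from successful-request latencies in seconds."""
    ok = sorted(latencies_s)
    total = len(ok) + failures
    return {
        "rps": len(ok) / duration_s,  # assumption: successful requests only
        "success_rate": 100 * len(ok) / total,
        "p50": percentile(ok, 50),
        "p99": percentile(ok, 99),
        "avg": sum(ok) / len(ok),
    }


# Illustrative samples only, not data from the tests in this tutorial.
summary = summarize([1.0, 2.0, 3.0, 4.0], failures=1, duration_s=2.0)
print(summary)
```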

Results

Metric	Vanilla Baseline	TensorRT + Triton	Improvement
Requests Per Second (RPS)	0.83	12.46	15x higher
Success Rate	61.6%	100.0%	+38.4 percentage points
P50 Latency	297.7s	41.7s	7.1x faster
P99 Latency	593.2s	79.3s	7.5x faster
Average Latency	376.2s	42.4s	8.9x faster

The TensorRT + Triton setup delivered 15x higher throughput and a 100% success rate compared to the baseline, while reducing P50, P99, and average latency by 7-9x. The baseline’s 61.6% success rate and high latency come from processing requests sequentially without batching, which underutilizes the GPU and causes queued requests to time out. TensorRT + Triton avoids these issues by batching requests and running optimized inference, which kept the success rate at 100% and lowered latency at every measured point. In this test, TensorRT + Triton was both faster and more reliable than the baseline, and serving more requests per GPU should also lower the cost per request for production LLM serving at scale.
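The improvement factors can be recomputed directly from the figures in the results table. The short check below divides each pair of measurements and expresses the success-rate change in percentage points.

```python
# (vanilla baseline, TensorRT + Triton) pairs copied from the results table.
results = {
    "rps": (0.83, 12.46),
    "p50_s": (297.7, 41.7),
    "p99_s": (593.2, 79.3),
    "avg_s": (376.2, 42.4),
}

# Throughput gain: optimized RPS divided by baseline RPS.
throughput_gain = round(results["rps"][1] / results["rps"][0], 1)

# Latency gain: baseline latency divided by optimized latency.
latency_gains = {
    name: round(baseline / optimized, 1)
    for name, (baseline, optimized) in results.items()
    if name != "rps"
}

# Success rate change in percentage points.
success_gain_points = round(100.0 - 61.6, 1)

print(throughput_gain)      # 15.0
print(latency_gains)        # {'p50_s': 7.1, 'p99_s': 7.5, 'avg_s': 8.9}
print(success_gain_points)  # 38.4
```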

Get Started

The complete implementation, including all configuration files and deployment scripts, is available in our GitHub repository. Clone the repository and follow this tutorial to deploy Llama 3.2 3B (or adapt it for other models) with TensorRT-LLM and Triton Inference Server.