Tutorial

May 20, 2024

Getting better price-performance, latency, and availability on AWS Trn1/Inf2 instances

Michael Louis - Founder

At Cerebrium, the feedback we get from customers daily is that they want their applications to run faster and cheaper. While we have been described as wizards by some, there are several avenues through which we have seen customers achieve this, including:

  • Using specialized frameworks such as vLLM or TensorRT-LLM to improve inference speed and/or throughput.

  • Using batching to increase GPU utilization.

  • Using semantic caching to avoid making an LLM call at all. We will cover this in a future tutorial.

  • Using MIG instances to fractionalize GPUs. We will cover this in a future tutorial.

  • Lastly, using more specialized or powerful hardware.

At Cerebrium, we want to offer engineers the flexibility to build the applications they envision using the tools they require - essentially, we want to meet engineers where they work. That means giving them the flexibility not only to build applications the way they want, but also to run them on the hardware that best suits their use case. With that being said, Cerebrium now officially supports AWS Trainium- and Inferentia 2-based Amazon Trn1 and Inf2 instances.

Both Amazon EC2 Trn1 and Inf2 instances offer a great price-performance trade-off that makes them suitable for many companies running inference at scale. We have seen latency and throughput that are similar to, if not better than, traditional NVIDIA chips such as the A10, L4, and A100. Additionally, they are a great alternative for avoiding the capacity shortages that affect A100s/H100s when stability is top of mind for enterprise use cases.

In this tutorial, we will show you how you can deploy the Llama 3 8B model on Inf2 and/or Trn1 instances on Cerebrium. We will compare throughput, latency, and cost against the results in our previous blog post on running TensorRT-LLM on NVIDIA instances. By the end of this tutorial, you will have deployed an auto-scaling Llama 3 endpoint on AWS Inf2 where you only pay for the compute you use.

You can view the final code here

Inferentia & Trainium Overview

AWS Inferentia 2 chips are specialized machine learning accelerators designed to optimize deep learning inference workloads, offering a cost-effective and efficient option for applications requiring high-throughput, low-latency processing. The Inferentia2 accelerator delivers up to 4x higher throughput and up to 10x lower latency than the original Inferentia chips. For many ML use cases, we have seen Inferentia 2 be up to 50% cheaper than comparable AI chips while delivering the same performance.

AWS Trainium is a specialized machine learning chip designed by AWS to optimize deep learning training workloads, offering a cost-effective and efficient alternative for high-throughput model training.

The AWS Neuron SDK is used to run deep learning workloads on AWS Inferentia- and AWS Trainium-based instances. It includes a deep learning compiler, runtime, and tools that are natively integrated with TensorFlow, PyTorch, and Apache MXNet.
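
As a small illustration of that PyTorch integration, here is a minimal sketch that compiles a toy module ahead of time with torch-neuronx. This snippet assumes you are on an Inf2/Trn1 host with the Neuron drivers and the torch-neuronx package installed; it is not part of the tutorial's deployment code.

import torch
import torch_neuronx

class TinyNet(torch.nn.Module):
    def forward(self, x):
        # A toy computation, just to have something to compile.
        return torch.nn.functional.relu(x @ x.T)

model = TinyNet().eval()
example_input = torch.rand(4, 4)

# torch_neuronx.trace compiles the module for NeuronCores ahead of time;
# the returned object runs like a regular torch.nn.Module.
neuron_model = torch_neuronx.trace(model, example_input)
print(neuron_model(example_input))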

Setup

If you don’t have a Cerebrium account, you can create one by signing up here and following the documentation here to get set up.

In your IDE, run the following command to create our Cerebrium starter project: cerebrium init inferentia. This creates two files:

  • main.py - Our entrypoint file where our code lives

  • cerebrium.toml - A configuration file that contains all our build and environment settings

Deploying on Inferentia 2

You can see many examples and tutorials from the Neuron documentation here.

To deploy to Inf2 nodes through Cerebrium, you can change the “gpu” type to “INF2” under the [cerebrium.hardware] heading in your cerebrium.toml file. We will automatically use a base image that has the various Neuron drivers installed.
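
For reference, a minimal sketch of the hardware section in cerebrium.toml might look something like the following. Only the gpu value comes from the step above; the remaining field names and numbers are illustrative and should be checked against the file that cerebrium init generated for you.

[cerebrium.hardware]
gpu = "INF2"      # selects Inferentia 2 (Neuron) instances
cpu = 2           # illustrative value; see the compile-time note later in the tutorial
memory = 14.0     # illustrative value, in GB
gpu_count = 1     # hypothetical field name; 1 "GPU" maps to 2 NeuronCores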

In this tutorial, we will be using vLLM, an easy-to-use framework that optimizes model inference and serving. vLLM recently added support for Neuron, so we will use it to improve the performance of our model. You can follow the instructions here on how to get started with vLLM on Neuron. Since our base image contains the necessary libraries, you can skip to step 2. We show the steps below:

You can add the following shell commands to your cerebrium.toml under [cerebrium.build]:

shell_commands = [
"pip install torch==2.1.* --index-url ",
"pip install wget",
"pip install awscli",
"pip install neuronx-cc==2.* torch-neuronx==2.1.* torchvision transformers-neuronx --extra-index-url=https://pip.repos.neuron.amazonaws.com",
"pip install ray huggingface_hub pydantic triton",
"git clone ",
"cd vllm && git checkout 6ef09b08f88b675f84b7140238286e5d4c5304c8 && pip install -U -r requirements-neuron.txt --extra-index-url=https://pip.repos.neuron.amazonaws.com && pip install . --extra-index-url=https://pip.repos.neuron.amazonaws.com && cd ..",
]

This code installs the various Python packages and builds vLLM from source. You will notice we check out a specific commit hash in the code above; at the time of writing, the latest version has a problem on Neuron.

To start, we need to go to Hugging Face and accept the model permissions for Llama 3 8B if we haven’t already. It takes about 30 minutes or less for the request to be approved. Since Hugging Face requires you to be authenticated to download the model weights, we need to authenticate ourselves in Cerebrium before downloading the model.

In your Cerebrium dashboard, you can add your Hugging Face token as a secret by navigating to “Secrets” in the sidebar. For the sake of this tutorial, I called mine “HF_AUTH_TOKEN”. We can then access this value at runtime without exposing it in our code.

We can then add the following code to our main.py in order to run Llama 3 using vLLM.

from typing import Optional

from vllm import LLM, SamplingParams
from pydantic import BaseModel
from transformers import AutoTokenizer
from huggingface_hub import login
from cerebrium import get_secret
import time

class Item(BaseModel):
    prompts: list[str] = []
    temperature: Optional[float] = 0.8
    top_p: Optional[float] = 0.95

login(token=get_secret("HF_AUTH_TOKEN"))

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
llm = LLM(
    model=model_id,
    max_num_seqs=1,
    max_model_len=128,
    block_size=128,
    # The device can be automatically detected when the AWS Neuron SDK is installed.
    # The device argument can either be left unspecified for automatic detection,
    # or explicitly assigned.
    device="neuron",
    tensor_parallel_size=8)
tokenizer = AutoTokenizer.from_pretrained(model_id)

def predict(prompts, temperature=0.8, top_p=0.95):
    item = Item(prompts=prompts, temperature=temperature, top_p=top_p)

    # Start timing
    start_time = time.time()

    sampling_params = SamplingParams(temperature=item.temperature, top_p=item.top_p)
    outputs = llm.generate(item.prompts, sampling_params)

    total_tokens = 0
    generated_texts = []  # List to store all generated texts
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
        generated_texts.append(generated_text)

        # Count both prompt and generated tokens with the model tokenizer
        total_tokens += len(tokenizer.tokenize(prompt)) + len(tokenizer.tokenize(generated_text))

    # Calculate duration and tokens per second
    end_time = time.time()
    duration = end_time - start_time
    tokens_per_second = total_tokens / duration if duration > 0 else 0

    print(f"Processed {total_tokens} tokens in {duration:.2f} seconds ({tokens_per_second:.2f} tokens/second)")

    return {"generated_texts": generated_texts, "tokens_per_second": tokens_per_second}

To break down the code above:

  • First, we set up our Pydantic model, which ensures incoming requests conform to a specific format. Some parameters are optional, so we set default values for them.

  • We then use our Hugging Face secret to log in and authenticate the Llama 3 download.

  • Next, we create the LLM object. Neuron has some very specific settings here:

    1. max_num_seqs: Denotes the maximum number of sequences or inputs that can be processed simultaneously by the model during inference. Here, it's set to 1, meaning the model handles a single input sequence at a time. The larger this value (i.e., the bigger the batch), the more accelerator memory is required.

    2. max_model_len: Represents the maximum sequence length, in tokens, that the model can process. In this case, it's set to 128, meaning the combined length of the prompt and generated tokens cannot exceed 128.

    3. block_size: Refers to the block size used for processing input data. This needs to be the same as the max_model_len.

    4. device: Specifies the target device for model execution. In this scenario, it's set to "neuron", indicating that the model is intended to run on AWS Neuron instances.

    5. tensor_parallel_size: This parameter controls how the model's tensors are sharded for parallel execution. It's set to 8, meaning the model is split across eight NeuronCores. You set the number of “GPUs” in your cerebrium.toml file; just note that 1 “GPU” in your cerebrium.toml corresponds to 2 NeuronCores, so a tensor_parallel_size of 8 requires 4 “GPUs”.

Lastly, we pass our input prompts to our model and time how long it takes to execute.

Deploy on Cerebrium

In order to deploy your code to Cerebrium, run cerebrium deploy. The first build will take roughly 20 minutes since all the dependencies will be installed and the model has to be compiled based on the settings you set above. Subsequent changes will be much faster since dependencies and the model will be cached. If you change any model settings like the batch size or the tensor_parallel_size, the model will need to be recompiled.

The model needs to be compiled with 6 CPUs and 60GB of memory, which you set in your cerebrium.toml - it requires a lot of power! Once it's working, we recommend redeploying with 2 CPUs and 14 GB memory in order to get the same performance for a much lower cost.
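
As a rough sketch, and assuming the same field names as in the generated cerebrium.toml, the compile-time resources described above would look something like this:

[cerebrium.hardware]
gpu = "INF2"
cpu = 6          # needed while the Neuron compiler runs
memory = 60.0    # in GB; drop to 2 CPUs / 14GB after the first successful build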

Once completed, it should output a curl which you can copy and paste to test your inference endpoint.

curl --location '<YOUR_ENDPOINT_URL>' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer XXXXX' \
--data '{"prompts": ["Hello World!"]}'

We get the following results on Inf2 nodes with different settings:

(Charts: throughput and latency results on Inf2.)

If you compare this to the performance from our previous article, an Ampere A10 had a throughput of ~600 tokens per second (FP16) at a cost of $0.000555 per second, while Inferentia generates 1,014 tokens per second at a cost of $0.000479 per second. Inferentia therefore provides almost double the throughput at a lower price. Additionally, Inferentia has a per-token latency of 28.57ms, whereas TensorRT was 41.77ms per token, making it better for latency-sensitive applications.

Comparatively, you can run a Llama 3 70B model at 223 tokens per second at a cost of $0.00271 per second, versus TensorRT, which has a throughput of 663 tokens per second across two 80GB GPUs at a cost of $0.00358 per second. You are therefore getting roughly a third of the throughput for 25% less cost. One big consideration is that you do not face the availability issues that affect A100s.

Conclusion

In conclusion, deploying the Llama 3 model on Inf2 instances offers a significant improvement in throughput and latency, at a potentially lower cost compared to traditional methods. Additionally, as advancements are released in the vLLM framework, you will be able to take advantage of them. This tutorial demonstrates how Cerebrium's flexibility allows engineers to run applications on the hardware that best suits their use case, and presents a step-by-step process for deploying on Inf2 nodes. The results show that both setups offer efficient processing capabilities, with specific advantages in different contexts. As technology continues to evolve, we can expect further improvements and increased efficiency in deep learning inference workloads.

© 2024 Cerebrium, Inc.