# Deploy a High Throughput Server for Embeddings and Reranking

> Deploy a high-throughput, low-latency REST API for serving text embeddings, reranking models, CLIP, CLAP, and ColPali

This tutorial covers deploying a high-throughput, low-latency REST API for serving text embeddings, reranking models, CLIP, CLAP, and ColPali using the open-source framework
[infinity](https://github.com/michaelfeil/infinity/tree/main).

Infinity supports multiple GPUs/CPUs and frameworks.
The inference server is built on PyTorch, Optimum (ONNX/TensorRT), and CTranslate2, using FlashAttention for NVIDIA CUDA, AMD ROCm, CPU, AWS Inf2, and Apple MPS accelerators.
It uses dynamic batching and dedicated tokenization worker threads.

Find the final working version [here](https://github.com/CerebriumAI/examples/tree/master/14-embeddings/1-high-throughput) on GitHub.

### Project Setup

Complete the [quickstart]() to install the CLI and create an account.

1. Run the command: `cerebrium init infinity-throughput`

This creates two files:

* <b>main.py</b>: The entrypoint code
* <b>cerebrium.toml</b>: Container image and auto-scaling parameters

Start by defining the container environment. Infinity has a public Docker image on [Docker Hub](https://hub.docker.com/r/michaelf34/infinity).
Cerebrium requires Docker Hub authentication to pull images (even public ones). Sign in with the following command:

```
docker login -u your-dockerhub-username
# Enter your password or access token when prompted
```

Add the following to <b>cerebrium.toml</b>:

```
[cerebrium.deployment]
name = "1-high-throughput"
python_version = "3.11"
docker_base_image_url = "michaelf34/infinity:0.0.77"
disable_auth = true
include = ['./*', 'main.py', 'cerebrium.toml']
exclude = ['.*']
```

Autoscaling criteria vary by hardware type and model selection. Define the hardware, scaling, and dependency settings in the following `cerebrium.toml` sections:

```
[cerebrium.hardware]
cpu = 6.0
memory = 12.0
compute = "AMPERE_A10"

[cerebrium.scaling]
min_replicas = 0
max_replicas = 2
cooldown = 30
replica_concurrency = 500
scaling_metric = "concurrency_utilization"

[cerebrium.dependencies.pip]
numpy = "latest"
"infinity-emb[all]" = "0.0.77"
optimum = ">=1.24.0,<2.0.0"
transformers = "<4.49"
click = "==8.1.8"
fastapi = "latest"
uvicorn = "latest"
pandas = "latest"
```

The model runs on an Ampere A10 GPU, and `replica_concurrency = 500` lets each replica handle up to 500 concurrent requests.

In <b>main.py</b>, create a class that handles embedding model functionality using the Infinity framework. This example uses multiple models to demonstrate the range of supported functionality.

```python
from infinity_emb import AsyncEngineArray, EngineArgs

class InfinityModel:
    def __init__(self):
        self.model_ids = [
            "jinaai/jina-clip-v1",
            "michaelfeil/bge-small-en-v1.5",
            "mixedbread-ai/mxbai-rerank-xsmall-v1",
            "philschmid/tiny-bert-sst2-distilled"
        ]
        self.engine_array = None

    def _get_array(self):
        return AsyncEngineArray.from_args([
            EngineArgs(model_name_or_path=model, model_warmup=False)
            for model in self.model_ids
        ])

    async def setup(self):
        print(f"Setting up models: {self.model_ids}")
        self.engine_array = self._get_array()
        await self.engine_array.astart()
        print("All models loaded successfully!")


model = InfinityModel()
```

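Loading several models can take a while, so the container should not receive traffic until they are ready. Cerebrium supports custom ASGI servers, and running FastAPI gives explicit control over health and readiness reporting. Add the following to <b>main.py</b>: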

```python
from fastapi import FastAPI, Body

app = FastAPI(title="High-Throughput Embedding Service")

@app.on_event("startup")
async def startup_event():
    """Initialize models on container startup"""
    await model.setup()


@app.get("/health")
async def health():
    return {"status": "healthy"}

@app.get("/ready")
async def ready():
    """Readiness endpoint to report model initialization state."""
    is_ready = model.engine_array is not None
    return {"ready": is_ready}
```
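Because `/ready` returns a JSON flag, a client or deploy script can wait on it before sending traffic. The following is a minimal sketch and not part of the example repository: `fetch_ready`, `wait_until_ready`, and their parameters are hypothetical names, and the check is passed in as a callable so the polling loop can be exercised without a live deployment.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_ready(base_url: str) -> bool:
    """Call GET {base_url}/ready and return the `ready` flag (False on any error)."""
    try:
        with urllib.request.urlopen(f"{base_url}/ready", timeout=5) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # Connection errors, HTTP errors, and malformed JSON all count as "not ready".
        return False
    return isinstance(payload, dict) and bool(payload.get("ready"))


def wait_until_ready(check: Callable[[], bool], timeout_s: float = 300.0,
                     interval_s: float = 2.0) -> bool:
    """Poll `check` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A caller would wrap the deployed base URL, for example `wait_until_ready(lambda: fetch_ready(base_url))`, where `base_url` is the app URL without a trailing slash.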

Infinity supports text embeddings, image embeddings, reranking, and classification. Create separate endpoints for each:

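```python
import dataclasses

import numpy as np


def to_json(obj):
    """Recursively convert numpy arrays/scalars and dataclasses into JSON-serializable values."""
    # Result types differ per endpoint (arrays, ints, result objects, dicts),
    # so this helper handles each case generically.
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    if isinstance(obj, np.generic):
        return obj.item()
    if dataclasses.is_dataclass(obj) and not isinstance(obj, type):
        return to_json(dataclasses.asdict(obj))
    if isinstance(obj, dict):
        return {key: to_json(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_json(item) for item in obj]
    return obj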

@app.post("/embed")
async def embed(sentences: list[str] = Body(...), model_index: int = Body(1)):
    """Generate embeddings using the specified model."""
    engine = model.engine_array[model_index]
    embeddings, usage = await engine.embed(sentences=sentences)

    return {
        "embeddings": to_json(embeddings),
        "usage": to_json(usage),
        "model": model.model_ids[model_index]
    }


@app.post("/image_embed")
async def image_embed(image_urls: list[str] = Body(...), model_index: int = Body(0)):
    """Generate embeddings for images using CLIP model."""
    engine = model.engine_array[model_index]
    embeddings, usage = await engine.image_embed(images=image_urls)

    return {
        "embeddings": to_json(embeddings),
        "usage": to_json(usage),
        "model": model.model_ids[model_index]
    }


@app.post("/rerank")
async def rerank(query: str = Body(...), docs: list[str] = Body(...), model_index: int = Body(2)):
    """Rerank documents based on query relevance."""
    engine = model.engine_array[model_index]
    rankings, usage = await engine.rerank(query=query, docs=docs)

    return {
        "rankings": to_json(rankings),
        "usage": to_json(usage),
        "model": model.model_ids[model_index]
    }


@app.post("/classify")
async def classify(sentences: list[str] = Body(...), model_index: int = Body(3)):
    """Classify text sentiment."""
    engine = model.engine_array[model_index]
    classes, usage = await engine.classify(sentences=sentences)

    return {
        "classifications": to_json(classes),
        "usage": to_json(usage),
        "model": model.model_ids[model_index]
    }

```
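Each handler above declares more than one `Body(...)` parameter, so FastAPI expects a single JSON object whose keys match the parameter names. The sketch below builds such request bodies on the client side. `ENDPOINT_FIELDS` and `build_payload` are hypothetical helpers introduced for illustration; the field names and default indices simply mirror the endpoint signatures and the order of `model_ids`.

```python
import json
from typing import Optional

# Input field name and default model index per endpoint,
# mirroring the handler signatures above.
ENDPOINT_FIELDS = {
    "embed": ("sentences", 1),
    "image_embed": ("image_urls", 0),
    "rerank": ("docs", 2),
    "classify": ("sentences", 3),
}


def build_payload(endpoint: str, inputs: list, query: Optional[str] = None,
                  model_index: Optional[int] = None) -> str:
    """Return the JSON request body for one of the four endpoints."""
    field, default_index = ENDPOINT_FIELDS[endpoint]
    body = {
        field: list(inputs),
        "model_index": default_index if model_index is None else model_index,
    }
    if endpoint == "rerank":
        # /rerank is the only endpoint that also needs a query string.
        if query is None:
            raise ValueError("rerank requires a query")
        body["query"] = query
    return json.dumps(body)
```

For instance, `build_payload("rerank", ["doc one", "doc two"], query="which doc?")` produces a body with `docs`, `query`, and `model_index` keys. Omitting `model_index` from a request is also valid, since every handler defines a default.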

This creates a multi-purpose embedding server. Update <b>cerebrium.toml</b> to point to the FastAPI server by adding the following section:

```
[cerebrium.runtime.custom]
port = 5000
entrypoint = ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "5000"]
healthcheck_endpoint = "/health"
readycheck_endpoint = "/ready"
```

Deploy with `cerebrium deploy`.

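After deployment, run inference with a command like the one below. Replace `p-xxxx` with your project ID, and make sure the app segment of the URL matches the `name` set in <b>cerebrium.toml</b>: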

```
curl --location 'https://api.aws.us-east-1.cerebrium.ai/v4/p-xxxx/infinity-throughput/image_embed' \
--header 'Content-Type: application/json' \
--data '{"image_urls": ["https://www.borrowmydoggy.com/_next/image?url=https%3A%2F%2Fcdn.sanity.io%2Fimages%2F4ij0poqn%2Fproduction%2Fe24bfbd855cda99e303975f2bd2a1bf43079b320-800x600.jpg&w=1080&q=80"]}'
```
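The same request can be sent from Python. This is a minimal standard-library sketch rather than an official client: `build_request` and `post_json` are hypothetical helper names, and no authorization header is set because the configuration above uses `disable_auth = true`.

```python
import json
import urllib.request


def build_request(url: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request equivalent to the curl command above."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(url: str, payload: dict, timeout_s: float = 60.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_request(url, payload), timeout=timeout_s) as resp:
        return json.load(resp)


# Example call (requires a live deployment; the URL is a placeholder):
# result = post_json(
#     "https://api.aws.us-east-1.cerebrium.ai/v4/p-xxxx/infinity-throughput/image_embed",
#     {"image_urls": ["https://example.com/dog.jpg"]},
# )
```

Splitting request construction from sending keeps the first half easy to inspect or log before any network traffic occurs.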

The response looks like this (truncated to the `embeddings` field):

```
{
    "embeddings": [
        [
            -0.05284368246793747,
            0.0011637501884251833,
            -0.029046623036265373,
            ....
        ]
    ]
}
```
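Each entry in `embeddings` is a plain list of floats, so vectors can be compared directly on the client. As an illustration that goes beyond the original example, the sketch below computes cosine similarity using only the standard library; `cosine_similarity` is a hypothetical helper, and for large batches a numpy implementation would likely be preferable.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Cosine similarity is undefined for a zero vector.
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)
```

Given a response with two vectors, `cosine_similarity(result["embeddings"][0], result["embeddings"][1])` yields a score between -1 and 1, where higher values indicate more similar inputs.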

The result is a scalable, multi-purpose embedding/reranking server.
