Have you ever wondered how v0.dev, bolt.new, Cursor, or Lovable work? Have you ever thought about building your own AI coding assistant to help you with your development tasks? It's easier than you think, and we're going to help you do just that. In this tutorial, we'll build a powerful AI assistant that can understand your requirements, generate code, and deploy it instantly to a preview environment.
By the end of this guide, you'll have your own AI coding assistant that can:
Take natural language prompts and turn them into working code
Stream generations in real-time, just like ChatGPT
Deploy and run code automatically in a sandboxed environment
Scale seamlessly in production
The best part? You'll understand exactly how it works under the hood.
Check out the final example here, which you can deploy and run easily (or check out this demo to play around with a running application).
What We're Building
Our AI coding agent will:
Plan out component structures using an LLM of your choosing (we're using Qwen 2.5 7B in this example)
Generate complete Next.js components
Stream code and commentary back to the frontend in real-time
Run code in a sandboxed environment
Deploy to preview URLs
The end result? A development assistant that feels like magic—type a prompt, watch as it plans and generates your application in real-time.
Before We Start:
You’ll need to set up accounts on the following platforms:
Cerebrium: Register for an account here and check out the documentation here to set up your development environment
Huggingface: If you’re using gated models, you’ll need to sign up for a Huggingface account and generate an access token.
E2B: To deploy your app to a sandboxed environment, you’ll need to sign up for an E2B account
Setting Up Your Project
Let's start by initializing our project on Cerebrium:
cerebrium init 4-ai-coding-agent
This creates the skeleton of our project: a main.py entrypoint and a cerebrium.toml deployment configuration file.
Let's also set up our dependencies and FastAPI server in our main.py file. We'll need several key libraries:
import os
from fastapi import FastAPI, WebSocket
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, Field, validator
from typing import List, Optional
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
import json
import logging
from threading import Thread
import asyncio
from e2b import Sandbox
from typing import Callable
from cuid2 import cuid_wrapper
from huggingface_hub import login
login(token=os.environ.get("HF_AUTH_TOKEN"))
cuid_generator: Callable[[], str] = cuid_wrapper()
SANDBOX_TIMEOUT = 300
E2B_API_KEY = os.environ.get("E2B_API_KEY")
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
app = FastAPI()
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
A few things are happening here:
Set up our FastAPI server
Logged in to the Hugging Face Hub for downloading and caching our model
Used Transformers and PyTorch to load our model and manage our hardware
Imported Pydantic to validate our inputs and what the LLM generates
Set up E2B, the sandbox we’ll deploy and run our generated components on
Created a CUID generator so each generated component gets a unique ID
Configured our logging
Set Your Environment Variables:
You can manage environment variables for your app through your Cerebrium dashboard. Check out the documentation on how to do that here. Set the following variables:
E2B_API_KEY = [YOUR_KEY]: You can generate a new key from your e2b.dev dashboard
HF_AUTH_TOKEN = [YOUR_TOKEN]: You can generate a new token from your huggingface.co dashboard
HF_HUB_ENABLE_HF_TRANSFER = 1: This allows us to download large model files from Huggingface more quickly
User Interface:
While you could build your own frontend for this AI coding assistant, or use a tool like Postman to send and receive socket messages, it's much easier to have one ready to go. For this reason, we’ve created a Next.js application that you can run locally to test your Cerebrium app. It does the following:
Handles conversations and stores them in your browser’s local storage
Connects to your Cerebrium endpoint (which we set up below)
Sends and receives WebSocket messages
Has a nifty UI for previewing your code and your final preview app

Check out this repository to find the application code as well as how to run the app.
Part 1: Data Models - The Blueprint
Before our AI assistant can generate any code, we need to define how we'll structure our components. We use two main models:
class FragmentBase(BaseModel):
    id: str = Field(description="Unique identifier for the fragment")
    title: str = Field(description="Short title of the fragment")
    description: str = Field(description="Brief description of what this fragment does")
    file_path: str = Field(description="Path to the file in Next.js app structure")
    dependencies: List[str] = Field(default_factory=list)
    port: Optional[int] = 3000

class Fragment(FragmentBase):
    code: str = Field(description="Code for the fragment")
    commentary: str = Field(description="Implementation details for the fragment")
    status: str = "pending"

    @validator('status')
    def validate_status(cls, v):
        if v not in ['pending', 'in_progress', 'completed', 'error']:
            raise ValueError('Invalid status')
        return v
These models define the component structure of our plan. When a user submits a prompt like "Create a dashboard with a sidebar", the planning step might produce:
[
  {
    "id": "cuid1",
    "title": "Dashboard Header",
    "description": "Responsive header with navigation",
    "file_path": "components/Header.tsx",
    "dependencies": []
  },
  {
    "id": "cuid2",
    "title": "Dashboard Sidebar",
    "description": "Collapsible sidebar with navigation links",
    "file_path": "components/Sidebar.tsx",
    "dependencies": []
  }
]
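As a quick illustration (this snippet is not part of main.py), each planned entry maps straight onto FragmentBase, with dependencies and port falling back to their defaults:
planned = FragmentBase(
    id="cuid1",
    title="Dashboard Header",
    description="Responsive header with navigation",
    file_path="components/Header.tsx",
)
print(planned.dict())  # dependencies defaults to [], port to 3000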
Part 2: The AI Engine
Now let's set up our language model. We're using Qwen 2.5, a powerful open-source model that's great for code generation:
logger.info("Initializing model and tokenizer...")
model_path = "Qwen/Qwen2.5-7B-Instruct-1M"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
You’re free to choose your own model; however, we chose Qwen 2.5 because:
It's optimized for code generation
It has a good balance of speed, quality, and resource requirements (we’re running this app on minimal GPU hardware to keep costs down)
It can handle complex prompts and context
Part 3: The Generation Pipeline
The magic happens in three main functions:
1. Generating our plan
Below is the function we use to instruct our LLM on how to generate our blueprint. It also streams the final list of fragments to the frontend, which can then be used to give the user feedback on what will be implemented:
async def get_fragments_structure(prompt: str, websocket: WebSocket) -> List[FragmentBase]:
    """Generate fragment structure"""
    system_prompt = """Return a valid JSON array of Next.js component fragments.
Each fragment should contain only the structural information (no code, descriptions or implementation details).
RULES FOR FRAGMENT STRUCTURE:
1. Format must be an array of objects:
[{
    "id": "", // Leave empty, will be generated
    "title": "ComponentName",
    "description": "Brief description",
    "file_path": "components/component-name.tsx",
    "dependencies": [] // Only NPM packages, NOT component names
}]
2. File Paths:
- Fragment paths should be in: components/[name].tsx
- Main page goes in: app/page.tsx
3. Dependencies:
- Include ONLY npm package names
- DO NOT include other component names as dependencies
- DO NOT include react, next.js, or shadcn UI (they're pre-installed)
4. Component Structure:
- Create modular, reusable components
- ALWAYS include an app/page.tsx file in the list of fragments that imports the other generated fragments
- Keep components in components directory
Remember:
- Dependencies are ONLY for npm packages
- File paths should be clean and correct
- No implementation details, only structure
"""
    chat = f"""<|im_start|>system\n{system_prompt}<|im_end|>
<|im_start|>user\n{prompt}<|im_end|>
<|im_start|>assistant\n"""
    try:
        json_str = await stream_tokens(chat, websocket, "structure")
        json_str = json_str[json_str.find('['):json_str.rfind(']') + 1]
        logger.info(f"Raw fragments: {json_str}")
        raw_fragments = json.loads(json_str)
        for fragment in raw_fragments:
            fragment["id"] = cuid_generator()
        fragments = [FragmentBase(**f) for f in raw_fragments]
        await websocket.send_json({
            "type": "fragment_structure",
            "content": [f.dict() for f in fragments]
        })
        return fragments
    except Exception as e:
        logger.error(f"Structure generation error: {e}")
        await websocket.send_json({"type": "error", "content": str(e)})
        raise
This function tells our LLM how to generate the initial fragments array, which serves as our plan for generating each of our components.
2. Commentary Generation
Next, before generating each component, we’ll provide context (in the form of commentary) to the user about what we’ll be building and how.
async def generate_commentary(fragment: FragmentBase, fragments: List[FragmentBase], prompt: str, websocket: WebSocket) -> str:
    """Generate implementation commentary for a fragment"""
    other_fragments = "\n".join([
        f"- {f.title}: {f.description} (in {f.file_path})"
        for f in fragments
        if f.id != fragment.id
    ])
    context_prompt = f"""You are a senior frontend developer explaining the implementation approach for a Next.js component.
Component to implement:
- Title: {fragment.title}
- Description: {fragment.description}
- Path: {fragment.file_path}
Other components in the project:
{other_fragments}
Project technical stack:
- Next.js 14.2.24 with app router
- TypeScript
- Tailwind CSS for styling
- shadcn UI components (in /components/ui/)
- React Server Components by default
Your task:
Write a BRIEF technical explanation of how we'll implement this component. Focus on:
1. Component's role in the larger application
2. Key UI elements and their arrangement
3. Any state management needs
4. Integration with other components
5. Notable technical considerations
Rules for your response:
1. Be concise (3-4 sentences maximum)
2. Focus on implementation approach, not generic descriptions
3. Mention specific shadcn UI components or Tailwind classes when relevant
4. Reference other components from the project where appropriate
5. No code snippets or markdown
6. No generic platitudes or obvious statements
7. Get straight to the technical details"""
    context_chat = f"""<|im_start|>system
You are a senior frontend developer known for clear, concise technical explanations.
Keep responses brief and focused on specific implementation details.
<|im_end|>
<|im_start|>user\n{prompt}<|im_end|>
<|im_start|>user
{context_prompt}
<|im_end|>
<|im_start|>assistant
"""
    return await stream_tokens(context_chat, websocket, f"context_{fragment.id}")
3. Code Generation
Now for the meat of our operation: based on the initial plan we’ve built, we generate each component. Below, we provide rules for the LLM to follow, along with context about the components that have already been generated, and we stream the code back to the frontend. This results in the following function, which we add to our main.py file:
async def generate_code(fragment: FragmentBase, fragments: List[FragmentBase], prompt: str,
                        websocket: WebSocket) -> str:
    """Generate code for a fragment with strict import validation"""
    valid_shadcn_components = [
        "accordion", "alert", "alert-dialog", "aspect-ratio", "avatar", "badge",
        "button", "calendar", "card", "carousel", "checkbox", "collapsible",
        "command", "context-menu", "dialog", "dropdown-menu", "form", "hover-card",
        "input", "label", "menubar", "navigation-menu", "popover", "progress",
        "radio-group", "scroll-area", "select", "separator", "sheet", "skeleton",
        "slider", "switch", "table", "tabs", "textarea", "toast", "toggle",
        "tooltip"
    ]
    other_components = "\n".join([
        f"{f.title} ({f.description}) - {f.file_path}"
        for f in fragments
        if f.id != fragment.id
    ])
    code_prompt = f"""You are an expert Next.js developer. Generate code for this component:
Title: {fragment.title}
Description: {fragment.description}
Path: {fragment.file_path}
STRICT IMPORT RULES:
1. ONLY import shadcn UI components from '@/components/ui/[component]'
2. ONLY import shadcn UI components from list of available components: {', '.join(valid_shadcn_components)}
3. Other components that exist in the project that you can import from: {other_components}
4. DO NOT import any other components unless they are in our list of available components or other components in the project
Response Rules:
1. Output ONLY the TypeScript/JavaScript code
2. NO descriptions, comments, or explanations
3. NO markdown code blocks or backticks
4. NO "Here's the code" or similar phrases
5. Start directly with the import statements
6. End with the last line of component code
7. Follow Next.js 14 app router patterns
8. Use Tailwind for styling
Begin with the first import statement."""
    logger.info(f"Fragment prompt: {code_prompt}")
    code_chat = f"""<|im_start|>system
You are an expert Next.js developer who writes clean, self-contained components.
Your responses must contain ONLY valid TypeScript code with correct imports.
<|im_end|>
<|im_start|>user
{prompt}
<|im_end|>
<|im_start|>user
{code_prompt}
<|im_end|>
<|im_start|>assistant
"""
    return await stream_tokens(code_chat, websocket, f"code_{fragment.id}")
Part 4: Real-time Communication
Next, we need a way to stream our plan, the commentary for each component, and the code itself to the frontend. For this, we use WebSockets. Note that stream_tokens sends the accumulated text so far on each update (rather than individual token deltas), which keeps the frontend rendering logic simple:
async def stream_tokens(prompt: str, websocket: WebSocket, msg_type: str = "token") -> str:
    """Generate and stream tokens"""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    # Run generation in a background thread so we can stream tokens as they arrive
    Thread(target=model.generate, kwargs={
        "inputs": inputs["input_ids"],
        "attention_mask": inputs["attention_mask"],
        "max_new_tokens": 2048,
        "temperature": 0.7,
        "do_sample": True,
        "streamer": streamer,
        "pad_token_id": tokenizer.eos_token_id
    }).start()
    text = ""
    try:
        for token in streamer:
            if token.strip():
                text += token
                if msg_type != "structure":
                    await websocket.send_json({"type": msg_type, "content": text})
                await asyncio.sleep(0)
    except Exception as e:
        logger.error(f"Streaming error: {e}")
        await websocket.send_json({"type": "error", "content": str(e)})
    return text.strip()
Part 5: The Sandbox Environment
Once we’ve generated the code, we execute it in a secure sandboxed environment on E2B and return a URL that the frontend can use as a live preview for the user (the sandbox stays alive for SANDBOX_TIMEOUT seconds, i.e. five minutes in our setup):
def deploy_to_e2b(fragments: List[Fragment]):
    # The first argument is the ID of our pre-built E2B sandbox template
    sandbox = Sandbox("22wede53y0614elkgps2", timeout=SANDBOX_TIMEOUT, api_key=E2B_API_KEY)
    for fragment in fragments:
        if fragment.status == 'completed' and fragment.file_path and fragment.code:
            sandbox.files.write(fragment.file_path, fragment.code)
    if any(fragment.dependencies for fragment in fragments):
        dependencies = set()
        for fragment in fragments:
            dependencies.update(fragment.dependencies)
        dependencies_str = " ".join(dependencies)
        sandbox.commands.run(f"npm install {dependencies_str}")
    sandbox.commands.run("npm run dev", background=True)
    return sandbox.get_host(3000)
Part 6: Putting It All Together
With all of our initial code in place, we can now stitch everything together into our main endpoint. It handles the WebSocket connection, processes the user's prompt, creates a plan of action, generates commentary and code for each fragment, and finally deploys our app to a secure sandbox:
@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
await websocket.accept()
try:
data = await websocket.receive_json()
prompt = data.get("prompt")
if not prompt:
raise ValueError("No prompt provided")
await websocket.send_json({
"type": "status",
"content": "Generating component structure..."
})
fragments = []
fragment_bases = await get_fragments_structure(prompt, websocket)
await websocket.send_json({
"type": "status",
"content": "Structure generated. Creating components..."
})
total_fragments = len(fragment_bases)
for idx, base in enumerate(fragment_bases, 1):
await websocket.send_json({
"type": "status",
"content": f"Generating component {idx}/{total_fragments}: {base.title}"
})
commentary = await generate_commentary(base, fragments, prompt, websocket)
await websocket.send_json({
"type": "status",
"content": f"Writing code for {base.title}..."
})
code = await generate_code(base, fragments, prompt, websocket)
fragment_dict = base.dict()
fragment = Fragment(
**fragment_dict,
code=code,
commentary=commentary,
status="completed"
)
fragments.append(fragment)
await websocket.send_json({
"type": "fragment_update",
"content": fragment.dict()
})
await websocket.send_json({
"type": "status",
"content": "All components generated. Starting deployment..."
})
preview_url = deploy_to_e2b(fragments)
await websocket.send_json({
"type": "preview_url",
"content": preview_url
})
except Exception as e:
logger.error(f"Error: {e}")
await websocket.send_json({"type": "error", "content": str(e)})
finally:
await websocket.close()
@app.get("/health")
async def health_check():
return {"status": "healthy"}
Deployment Configuration
Finally, let's set up our cerebrium.toml, which handles the configuration for our deployment and all of our dependencies:
[cerebrium.deployment]
name = "4-ai-coding-agent"
python_version = "3.12"
docker_base_image_url = "debian:bookworm-slim"
disable_auth = true
include = ['./*', 'main.py', 'cerebrium.toml']
exclude = ['.*']
[cerebrium.runtime.custom]
port = 8765
entrypoint = ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8765"]
healthcheck_endpoint = "/health"
[cerebrium.hardware]
cpu = 4
memory = 16.0
compute = "AMPERE_A10"
[cerebrium.scaling]
# This parameter keeps 1 instance ready to handle requests. Don't forget to disable this when you're not using your application (if costs are a concern)
min_replicas = 1
max_replicas = 5
cooldown = 30
replica_concurrency = 1
response_grace_period = 900
scaling_metric = "concurrency_utilization"
scaling_target = 100
scaling_buffer = 0
[cerebrium.dependencies.pip]
# These mirror the imports in main.py; pin exact versions as needed for your deployment
fastapi = "latest"
uvicorn = "latest"
torch = "latest"
transformers = "latest"
accelerate = "latest"
pydantic = "latest"
e2b = "latest"
cuid2 = "latest"
huggingface_hub = "latest"
hf_transfer = "latest"
Not sure what the above configurations mean? Check out our TOML reference, which describes in detail how to configure the deployment of your Cerebrium apps.
Lastly, deploy your final app to Cerebrium with the following command:
cerebrium deploy
… and don’t forget to update your frontend consumer's .env file so that it makes use of your new WebSocket endpoint:
CEREBRIUM_SOCKET_URL="wss://api.cortex.cerebrium.ai/v4/[YOUR_PROJECT_ID]/4-ai-coding-agent/ws"
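If you'd rather exercise the endpoint without the Next.js frontend, here is a minimal Python client sketch (it assumes the websockets package is installed; the prompt is just an example). It sends a prompt and prints the message types emitted by main.py until the preview URL arrives:
import asyncio
import json
import websockets

SOCKET_URL = "wss://api.cortex.cerebrium.ai/v4/[YOUR_PROJECT_ID]/4-ai-coding-agent/ws"

async def main():
    async with websockets.connect(SOCKET_URL) as ws:
        # The endpoint expects a single JSON message containing the prompt
        await ws.send(json.dumps({"prompt": "Create a dashboard with a sidebar"}))
        async for raw in ws:
            msg = json.loads(raw)
            if msg["type"] == "preview_url":
                print(f"Preview ready: {msg['content']}")
                break
            if msg["type"] == "error":
                print(f"Error: {msg['content']}")
                break
            # Other types: status, fragment_structure, context_*, code_*, fragment_update
            print(msg["type"])

asyncio.run(main())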
Next Steps
Now that you have a working AI coding agent, consider:
Adding support for more frameworks
Serving your LLM with an optimized inference framework such as vLLM, TensorRT-LLM, or SGLang for faster generation (a rough sketch follows this list)
Implementing custom model fine-tuning or RAG for up-to-date use of framework documentation
Integrating with GitHub for PR creation
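As a rough illustration of the optimized-serving idea, here is a minimal, non-streaming sketch using vLLM's offline API (this assumes the vllm package; it is not a drop-in replacement for stream_tokens, since streaming tokens over the WebSocket would require vLLM's async engine):
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct-1M", dtype="float16")
sampling = SamplingParams(temperature=0.7, max_tokens=2048)

def generate_once(prompt: str) -> str:
    # generate() takes a list of prompts and returns one RequestOutput per prompt
    outputs = llm.generate([prompt], sampling)
    return outputs[0].outputs[0].text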
The possibilities are endless! Congratulations, you've successfully built a foundation that can evolve into your perfect coding assistant.
Limitations
This application depends on a large language model that may hallucinate or generate incorrect code, so the deployed application may not always function as intended. Double-check the generated code before shipping it to a production environment.
PS: If you ever get stuck on any of the above implementation details, check out our examples repository for guidance or chat with us in the Cerebrium Community Discord Server.
PPS: We’d love it if you could share this post with your friends and colleagues… It may just help them build the next AI unicorn startup 😛