Tutorial

Feb 20, 2025

Building a Real-time Coding Assistant

Kyle Gani

Senior Technical Product Manager

Have you ever wondered how v0.dev, bolt.new, cursor, or lovable work? Have you ever wondered about building out your own AI coding assistant to help you with your development tasks? It's easier than you think and we're going to help you do just that. In this tutorial, we'll build a powerful AI assistant that can understand your requirements, generate code, and deploy it instantly to a preview environment.

By the end of this guide, you'll have your own AI coding assistant that can:

  • Take natural language prompts and turn them into working code

  • Stream generations in real-time, just like ChatGPT

  • Deploy and run code automatically in a sandboxed environment

  • Scale seamlessly in production

The best part? You'll understand exactly how it works under the hood.

Check out the final example here, which you can deploy and run easily (or check out this demo to play around with a running application).

What We're Building

Our AI coding agent will:

  • Plan out component structures using an LLM of your choosing (we're using Qwen 2.5 7B in this example)

  • Generate complete Next.js components

  • Stream code and commentary back to the frontend in real-time

  • Run code in a sandboxed environment

  • Deploy to preview URLs

The end result? A development assistant that feels like magic—type a prompt, watch as it plans and generates your application in real-time.

Before We Start:

You’ll need to set up accounts on the following platforms:

  • Cerebrium: Register for an account here and check out the documentation here to set up your development environment

  • Huggingface: If you’re using gated models, you’ll need to sign up for a Huggingface account and generate an access token.

  • E2B: To deploy your app to a sandboxed environment, you’ll need to sign up for an E2B account

Setting Up Your Project

Let's start by initializing our project on Cerebrium:

cerebrium init 4-ai-coding-agent

This creates a starter project containing two files: main.py (our application entrypoint) and cerebrium.toml (our deployment configuration).

Let's also set up our dependencies and FastAPI server in our main.py file. We'll need several key libraries:

import os
from fastapi import FastAPI, WebSocket
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, Field, validator
from typing import List, Optional
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
import json
import logging
from threading import Thread
import asyncio
from e2b import Sandbox
from typing import Callable
from cuid2 import cuid_wrapper
from huggingface_hub import login

# Authenticate against Huggingface if you're using any gated models
login(token=os.environ.get("HF_AUTH_TOKEN"))

cuid_generator: Callable[[], str] = cuid_wrapper()

#E2B configuration (The below timeout keeps sandboxes alive for 5 minutes)
SANDBOX_TIMEOUT = 300
E2B_API_KEY = os.environ.get("E2B_API_KEY")

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = FastAPI()

# Allow all origins for testing, NB: change this in production!
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

A few things are happening here:

  • Set up our FastAPI server

  • Authenticated with the Hugging Face Hub for downloading and caching our model

  • Used Transformers and torch to load and manage our model and hardware

  • Imported Pydantic to validate our inputs and what the LLM generates

  • Set up E2B, the sandboxed environment we'll deploy and run our generated components on

  • Implemented a unique CUID generator for each of our generated components

  • Configured our logging

Set Your Environment Variables:

You can manage environment variables for your app through your Cerebrium dashboard. Check out the documentation on how to do that here. Set the following variables:

  • E2B_API_KEY = [YOUR_KEY]: You can generate a new key from your e2b.dev dashboard

  • HF_AUTH_TOKEN = [YOUR_TOKEN]: You can generate a new token from your huggingface.co dashboard

  • HF_HUB_ENABLE_HF_TRANSFER = 1: This allows us to download large model files from Huggingface more quickly
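
If you want the app to fail fast when one of these variables is missing, a small optional check (not part of the original code) can be added near the top of main.py, after the imports:

REQUIRED_VARS = ["E2B_API_KEY", "HF_AUTH_TOKEN"]
missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
if missing:
    # Fail at startup rather than midway through a generation
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")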

User Interface:

While you could build your own frontend for this AI coding assistant, or use a platform like Postman to send and receive socket messages, it's much easier to have one at your fingertips. For this reason, we've created a Next.js application that you can run locally to test your Cerebrium app. It does the following:

  • Handles conversations and stores them in your browser's local storage

  • Connects to your Cerebrium endpoint (which we set up below)

  • Sends and receives Websocket messages

  • Has a nifty UI for previewing your code and your final preview app

Check out this repository to find the application code as well as how to run the app.

Part 1: Data Models - The Blueprint

Before our AI assistant can generate any code, we need to define how we'll structure our components. We use two main models:

class FragmentBase(BaseModel):
    id: str = Field(description="Unique identifier for the fragment")
    title: str = Field(description="Short title of the fragment")
    description: str = Field(description="Brief description of what this fragment does")
    file_path: str = Field(description="Path to the file in Next.js app structure")
    dependencies: List[str] = Field(default_factory=list)
    port: Optional[int] = 3000

class Fragment(FragmentBase):
    code: str = Field(description="Code for the fragment")
    commentary: str = Field(description="Implementation details for the fragment")
    status: str = "pending"

    @validator('status')
    def validate_status(cls, v):
        if v not in ['pending', 'in_progress', 'completed', 'error']:
            raise ValueError('Invalid status')
        return v

These models define our plan's component structure. When a user submits the prompt "Create a dashboard with a sidebar", the generated plan might look like this:

[
  {
    "id": "cuid1",
    "title": "Dashboard Header",
    "description": "Responsive header with navigation",
    "file_path": "components/Header.tsx",
    "dependencies": []
  },
  {
    "id": "cuid2",
    "title": "Dashboard Sidebar",
    "description": "Collapsible sidebar with navigation links",
    "file_path": "components/Sidebar.tsx",
    "dependencies": []
  }
]
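
To see these models in action, here's a quick illustration (not part of main.py) of promoting a planned fragment into a completed Fragment, and of the status validator rejecting an unknown value:

plan_entry = {
    "id": "cuid1",
    "title": "Dashboard Header",
    "description": "Responsive header with navigation",
    "file_path": "components/Header.tsx",
    "dependencies": [],
}

base = FragmentBase(**plan_entry)

# Promote the planned fragment once its code and commentary have been generated
fragment = Fragment(
    **base.dict(),
    code="export default function Header() { return null }",
    commentary="Simple placeholder header",
    status="completed",
)

# Anything outside pending/in_progress/completed/error raises a validation error
try:
    Fragment(**base.dict(), code="", commentary="", status="deployed")
except ValueError as err:
    print(err)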

Part 2: The AI Engine

Now let's set up our language model. We're using Qwen 2.5, a powerful open-source model that's great for code generation:

logger.info("Initializing model and tokenizer...")
model_path = "Qwen/Qwen2.5-7B-Instruct-1M"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,  # Use half-precision for better memory efficiency
    trust_remote_code=True,     # Required for custom model architectures
    device_map="auto",          # Automatically handle model placement on GPU/CPU
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

You're free to choose your own model; however, we chose Qwen 2.5 because:

  • It's optimized for code generation

  • Has a good balance of speed, quality and resource requirements (We’re running this app on minimal GPU hardware for the sake of cost)

  • Can handle complex prompts and context
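
Before wiring the model into the generation pipeline, it's worth sanity-checking that it loads and generates as expected. The following is just a quick local smoke test using the same chat format we use later, not part of main.py:

test_prompt = (
    "<|im_start|>system\nYou are a helpful coding assistant.<|im_end|>\n"
    "<|im_start|>user\nWrite a TypeScript hello world.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

inputs = tokenizer(test_prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128, temperature=0.7, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt
completion = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(completion)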

Part 3: The Generation Pipeline

The magic happens in three main functions:

1. Generating our plan

Below is the function we will use to provide instructions to our LLM about how it should generate our blueprint. It also streams the final list of fragments to the frontend, which can then be used to give the user feedback on what will be implemented:

async def get_fragments_structure(prompt: str, websocket: WebSocket) -> List[FragmentBase]:
    """Generate fragment structure"""
    system_prompt = """Return a valid JSON array of Next.js component fragments.
    Each fragment should contain only the structural information (no code, descriptions or implementation details).
    
    RULES FOR FRAGMENT STRUCTURE:
    1. Format must be an array of objects:
       [{
         "id": "",  // Leave empty, will be generated
         "title": "ComponentName",
         "description": "Brief description",
         "file_path": "components/component-name.tsx",
         "dependencies": []  // Only NPM packages, NOT component names
       }]
    2. File Paths:
       - Fragment paths should be in: components/[name].tsx
       - Main page goes in: app/page.tsx
    3. Dependencies:
       - Include ONLY npm package names
       - DO NOT include other component names as dependencies
       - DO NOT include react, next.js, or shadcn UI (they're pre-installed)
    4. Component Structure:
       - Create modular, reusable components
       - ALWAYS include an app/page.tsx file in the list of fragments that import the other generated fragments
       - Keep components in components directory
    
    Remember:
    - Dependencies are ONLY for npm packages
    - File paths should be clean and correct
    - No implementation details, only structure
    """

    chat = f"""<|im_start|>system\\n{system_prompt}<|im_end|>
    <|im_start|>user\\n{prompt}<|im_end|>
    <|im_start|>assistant\\n"""

    try:
        json_str = await stream_tokens(chat, websocket, "structure")
        json_str = json_str[json_str.find('['):json_str.rfind(']') + 1]
        logger.log(logging.INFO, f"Raw fragments: {json_str}")

        raw_fragments = json.loads(json_str)

        for i, fragment in enumerate(raw_fragments):
            fragment["id"] = cuid_generator()

        fragments = [FragmentBase(**f) for f in raw_fragments]

        await websocket.send_json({
            "type": "fragment_structure",
            "content": [f.dict() for f in fragments]
        })
        return fragments
    except Exception as e:
        logger.error(f"Structure generation error: {e}")
        await websocket.send_json({"type": "error", "content": str(e)})
        raise

This code tells our LLM how to generate the initial fragments array, which serves as our plan for generating each of our components.
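
One detail worth calling out is the slicing before json.loads: the model sometimes wraps the JSON array in prose or markdown fences, so we keep only the text between the first '[' and the last ']'. A tiny illustration with hypothetical model output:

raw_output = 'Here is the plan:\n```json\n[{"title": "Header"}]\n```'
json_str = raw_output[raw_output.find('['):raw_output.rfind(']') + 1]
print(json.loads(json_str))  # [{'title': 'Header'}]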

2. Commentary Generation

Next, before generating each component, we’ll provide context (in the form of commentary) to the user about what we’ll be building and how.

async def generate_commentary(fragment: FragmentBase, fragments: List[FragmentBase], prompt: str, websocket: WebSocket) -> str:
    """Generate implementation commentary for a fragment"""

    other_fragments = "\\n".join([
        f"- {f.title}: {f.description} (in {f.file_path})"
        for f in fragments
        if f.id != fragment.id
    ])

    context_prompt = f"""You are a senior frontend developer explaining the implementation approach for a Next.js component.

    Component to implement:
    - Title: {fragment.title}
    - Description: {fragment.description}
    - Path: {fragment.file_path}

    Other components in the project:
    {other_fragments}

    Project technical stack:
    - Next.js 14.2.24 with app router
    - TypeScript
    - Tailwind CSS for styling
    - shadcn UI components (in /components/ui/)
    - React Server Components by default

    Your task:
    Write a BRIEF technical explanation of how we'll implement this component. Focus on:
    1. Component's role in the larger application
    2. Key UI elements and their arrangement
    3. Any state management needs
    4. Integration with other components
    5. Notable technical considerations

    Rules for your response:
    1. Be concise (3-4 sentences maximum)
    2. Focus on implementation approach, not generic descriptions
    3. Mention specific shadcn UI components or Tailwind classes when relevant
    4. Reference other components from the project where appropriate
    5. No code snippets or markdown
    6. No generic platitudes or obvious statements
    7. Get straight to the technical details"""

    context_chat = f"""<|im_start|>system
    You are a senior frontend developer known for clear, concise technical explanations.
    Keep responses brief and focused on specific implementation details.
    <|im_end|>
    <|im_start|>user\n{prompt}<|im_end|>
    <|im_start|>user
    {context_prompt}
    <|im_end|>
    <|im_start|>assistant
    """

    return await stream_tokens(context_chat, websocket, f"context_{fragment.id}")

3. Code Generation

Now for the meat of our operation: based on the initial plan we've built, each component is generated. Below, we provide rules for the LLM to follow, as well as context about the other components that have already been generated. We also stream the code back to the frontend. This results in the following function, which we add to our main.py file:

async def generate_code(fragment: FragmentBase, fragments: List[FragmentBase], prompt: str,
                        websocket: WebSocket) -> str:
    """Generate code for a fragment with strict import validation"""

    valid_shadcn_components = [
        "accordion", "alert", "alert-dialog", "aspect-ratio", "avatar", "badge",
        "button", "calendar", "card", "carousel", "checkbox", "collapsible",
        "command", "context-menu", "dialog", "dropdown-menu", "form", "hover-card",
        "input", "label", "menubar", "navigation-menu", "popover", "progress",
        "radio-group", "scroll-area", "select", "separator", "sheet", "skeleton",
        "slider", "switch", "table", "tabs", "textarea", "toast", "toggle",
        "tooltip", "carousel"
    ]

    other_components = "\\n".join([
        f"{f.title} ({f.description}) - {f.file_path}"
        for f in fragments
        if f.id != fragment.id
    ])

    code_prompt = f"""You are an expert Next.js developer. Generate code for this component:
    Title: {fragment.title}
    Description: {fragment.description}
    Path: {fragment.file_path}
    
    STRICT IMPORT RULES:
    1. ONLY import shadcn UI components from '@/components/ui/[component]'
    2. ONLY import shadcn UI components from list of available components: {', '.join(valid_shadcn_components)}
    3. Other components that exist in the project that you can import from:{other_components}
    4. DO NOT import any other components unless they are in our list of available components or other components in the project
    
    Response Rules:
    1. Output ONLY the TypeScript/JavaScript code
    2. NO descriptions, comments, or explanations
    3. NO markdown code blocks or backticks
    4. NO "Here's the code" or similar phrases
    5. Start directly with the import statements
    6. End with the last line of component code
    7. Follow Next.js 14 app router patterns
    8. Use Tailwind for styling
    
    Begin with the first import statement."""

    logger.info(f"Fragment prompt: {code_prompt}")

    code_chat = f"""<|im_start|>system
    You are an expert Next.js developer who writes clean, self-contained components.
    Your responses must contain ONLY valid TypeScript code with correct imports.
    <|im_end|>
    <|im_start|>user
    {prompt}
    <|im_end|>
    <|im_start|>user
    {code_prompt}
    <|im_end|>
    <|im_start|>assistant
    """

    return await stream_tokens(code_chat, websocket, f"code_{fragment.id}")

Part 4: Real-time Communication

Next, we need a way to stream our plan, the commentary for each component and our actual code to the frontend. For this, we’ll use Websockets to stream our results:

async def stream_tokens(prompt: str, websocket: WebSocket, msg_type: str = "token") -> str:
    """Generate and stream tokens"""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    # Uses a separate thread so that we don't block the event loop
    Thread(target=model.generate, kwargs={
        "inputs": inputs["input_ids"],
        "attention_mask": inputs["attention_mask"],
        "max_new_tokens": 2048,
        "temperature": 0.7,
        "do_sample": True,
        "streamer": streamer,
        "pad_token_id": tokenizer.eos_token_id
    }).start()

    text = ""
    try:
        for token in streamer:
            if token.strip():
                text += token
                if msg_type != "structure":
                    await websocket.send_json({"type": msg_type, "content": text})
            await asyncio.sleep(0)
    except Exception as e:
        logger.error(f"Streaming error: {e}")
        await websocket.send_json({"type": "error", "content": str(e)})

    return text.strip()

Part 5: The Sandbox Environment

Once we generate code, we execute it in a secure sandboxed environment on E2B and return a URL, which can be used in the frontend as a preview for the user:

def deploy_to_e2b(fragments: List[Fragment]):
    # Here we create a preconfigured Next.js sandbox
    sandbox = Sandbox("22wede53y0614elkgps2", timeout=SANDBOX_TIMEOUT, api_key=E2B_API_KEY)

    # First we copy over each fragment to our sandbox environment
    for fragment in fragments:
        if fragment.status == 'completed' and fragment.file_path and fragment.code:
            sandbox.files.write(fragment.file_path, fragment.code)

    # Next we install dependencies required by each of our generated components
    if any(fragment.dependencies for fragment in fragments):
        dependencies = set()
        for fragment in fragments:
            dependencies.update(fragment.dependencies)
        dependencies_str = " ".join(dependencies)

        sandbox.commands.run(f"npm install {dependencies_str}")

    # Then, start the dev server in the background to serve our code
    sandbox.commands.run("npm run dev", background=True)

    # Lastly, we return a URL to our calling function
    return sandbox.get_host(3000)
    
# The sandbox provides:
# 1. Isolated environment for safe code execution
# 2. Automatic cleanup after timeout
# 3. Preview URL for testing the generated application

Part 6: Putting It All Together

With all of our initial code in place, we can now stitch everything together, forming our main endpoint. It handles the socket connection, processes the prompt from the user, creates a plan of action, commentary, and code for each fragment, and deploys our app to a secure sandbox:

@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()

    try:
        data = await websocket.receive_json()
        prompt = data.get("prompt")
        if not prompt:
            raise ValueError("No prompt provided")

        await websocket.send_json({
            "type": "status",
            "content": "Generating component structure..."
        })

        # Generate and process fragments
        fragments = []
        fragment_bases = await get_fragments_structure(prompt, websocket)

        await websocket.send_json({
            "type": "status",
            "content": "Structure generated. Creating components..."
        })

        total_fragments = len(fragment_bases)
        for idx, base in enumerate(fragment_bases, 1):
            # Generate commentary and code separately
            await websocket.send_json({
                "type": "status",
                "content": f"Generating component {idx}/{total_fragments}: {base.title}"
            })
            commentary = await generate_commentary(base, fragments, prompt, websocket)

            await websocket.send_json({
                "type": "status",
                "content": f"Writing code for {base.title}..."
            })
            code = await generate_code(base, fragments, prompt, websocket)

            # Create complete fragment
            fragment_dict = base.dict()
            fragment = Fragment(
                **fragment_dict,
                code=code,
                commentary=commentary,
                status="completed"
            )

            fragments.append(fragment)
            await websocket.send_json({
                "type": "fragment_update",
                "content": fragment.dict()
            })

        # Deploy our code to E2B, execute it and return a preview URL
        await websocket.send_json({
            "type": "status",
            "content": "All components generated. Starting deployment..."
        })
        preview_url = deploy_to_e2b(fragments)

        await websocket.send_json({
            "type": "preview_url",
            "content": preview_url
        })

    except Exception as e:
        logger.error(f"Error: {e}")
        await websocket.send_json({"type": "error", "content": str(e)})
    finally:
        await websocket.close()

# This is a health endpoint used by Cerebrium so that it knows our app is currently running
@app.get("/health")
async def health_check():
    return {"status": "healthy"}

Deployment Configuration

Finally, let's set up our cerebrium.toml, which handles the configuration for our deployment and all of our dependencies:

[cerebrium.deployment]
name = "4-ai-coding-agent"
python_version = "3.12"
docker_base_image_url = "debian:bookworm-slim"
disable_auth = true
include = ['./*', 'main.py', 'cerebrium.toml']
exclude = ['.*']

[cerebrium.runtime.custom]
port = 8765
entrypoint = ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8765"]
healthcheck_endpoint = "/health"

[cerebrium.hardware]
cpu = 4
memory = 16.0
compute = "AMPERE_A10"

[cerebrium.scaling]
# This parameter keeps 1 instance ready to handle requests. Don't forget to disable this when you're not using your application (if costs are a concern)
min_replicas = 1
max_replicas = 5
cooldown = 30
replica_concurrency = 1
response_grace_period = 900
scaling_metric = "concurrency_utilization"
scaling_target = 100
scaling_buffer = 0

[cerebrium.dependencies.pip]

Not sure what the above configuration means? Check out our TOML reference, which describes in detail how to configure the deployment of your Cerebrium apps.

Lastly, deploy your final app to Cerebrium with the following command:

cerebrium deploy

… and don’t forget to update your frontend consumer .env file so that it makes use of your new Websocket endpoint:

CEREBRIUM_SOCKET_URL="wss://api.cortex.cerebrium.ai/v4/[YOUR_PROJECT_ID]/4-ai-coding-agent/ws"

Next Steps

Now that you have a working AI coding agent, consider:

  • Adding support for more frameworks

  • Serving your LLM with an optimization framework like vLLM, TensorRT, or SGLang for faster generation

  • Implementing custom model fine-tuning or RAG for up-to-date use of framework documentation

  • Integrating with GitHub for PR creation

The possibilities are endless! Congratulations, you've successfully built a foundation that can evolve into your perfect coding assistant.

Limitations

This application depends on a large language model that may hallucinate or generate incorrect code, which can result in deployed applications that don't function as intended. Please double-check your code before deploying to a production environment.

PS: If you ever get stuck on any of the above implementation details, check out our examples repository for guidance or chat with us in the Cerebrium Community Discord Server.

PPS: We’d love it if you could share this post with your friends and colleagues… It may just help them build the next AI unicorn startup 😛
