Have you ever wondered how v0.dev, bolt.new, Cursor, or Lovable work? Have you ever thought about building your own AI coding assistant to help you with your development tasks? It's easier than you think, and we're going to help you do just that. In this tutorial, we'll build a powerful AI assistant that can understand your requirements, generate code, and deploy it instantly to a preview environment.
By the end of this guide, you'll have your own AI coding assistant that can:
Take natural language prompts and turn them into working code
Stream generations in real-time, just like ChatGPT
Deploy and run code automatically in a sandboxed environment
Scale seamlessly in production
The best part? You'll understand exactly how it works under the hood.
Check out the final example here, which you can deploy and run easily (or check out this demo to play around with a running application).
What We're Building
Our AI coding agent will:
Plan out component structures using an LLM of your choosing (we're using Qwen 2.5 7B in this example)
Generate complete Next.js components
Stream code and commentary back to the frontend in real-time
Run code in a sandboxed environment
Deploy to preview URLs
The end result? A development assistant that feels like magic—type a prompt, watch as it plans and generates your application in real-time.
Before We Start:
You’ll need to set up accounts on the following platforms:
Cerebrium: Register for an account here and check out the documentation here to set up your development environment
Huggingface: If you’re using gated models, you’ll need to sign up for a Huggingface account and generate an access token.
E2B: To deploy your app to a sandboxed environment, you’ll need to sign up for an E2B account
Setting Up Your Project
Let's start by initializing our project on Cerebrium:
cerebrium init 4-ai-coding-agent
This creates the skeleton of our project: a main.py entrypoint and a cerebrium.toml deployment configuration file.
Let's also set up our dependencies and FastAPI server in our main.py file. We'll need several key libraries:
import os
from fastapi import FastAPI, WebSocket
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, Field, validator
from typing import List, Optional
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
import json
import logging
from threading import Thread
import asyncio
from e2b import Sandbox
from typing import Callable
from cuid2 import cuid_wrapper
from huggingface_hub import login
login(token=os.environ.get("HF_AUTH_TOKEN"))
cuid_generator: Callable[[], str] = cuid_wrapper()
SANDBOX_TIMEOUT = 300
E2B_API_KEY = os.environ.get("E2B_API_KEY")
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
app = FastAPI()
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
A few things are happening here:
Set up our FastAPI server
Logged in to the Hugging Face Hub for downloading and caching our model
Used Transformers and PyTorch to load our model and manage our hardware
Imported Pydantic to validate our inputs and what the LLM generates
Set up E2B, the sandbox we’ll deploy and run our generated components on
Created a CUID generator so each generated component gets a unique ID
Configured our logging
Set Your Environment Variables:
You can manage environment variables for your app through your Cerebrium dashboard. Check out the documentation on how to do that here. Set the following variables:
E2B_API_KEY = [YOUR_KEY]: You can generate a new key from your e2b.dev dashboard
HF_AUTH_TOKEN = [YOUR_TOKEN]: You can generate a new token from your huggingface.co dashboard
HF_HUB_ENABLE_HF_TRANSFER = 1: This allows us to download large model files from Huggingface more quickly
User Interface:
While you could build your own frontend for this AI coding assistant, or use a tool like Postman to send and receive socket messages, it's much easier to have one ready to go. For this reason, we’ve created a Next.js application that you can run locally to test your Cerebrium app. It does the following:
Handles conversations and stores them in your browser’s local storage
Connects to your Cerebrium endpoint (which we set up below)
Sends and receives WebSocket messages
Has a nifty UI for previewing your code and your final preview app

Check out this repository to find the application code as well as how to run the app.
Part 1: Data Models - The Blueprint
Before our AI assistant can generate any code, we need to define how we'll structure our components. We use two main models:
class FragmentBase(BaseModel):
    id: str = Field(description="Unique identifier for the fragment")
    title: str = Field(description="Short title of the fragment")
    description: str = Field(description="Brief description of what this fragment does")
    file_path: str = Field(description="Path to the file in Next.js app structure")
    dependencies: List[str] = Field(default_factory=list)
    port: Optional[int] = 3000

class Fragment(FragmentBase):
    code: str = Field(description="Code for the fragment")
    commentary: str = Field(description="Implementation details for the fragment")
    status: str = "pending"

    @validator('status')
    def validate_status(cls, v):
        if v not in ['pending', 'in_progress', 'completed', 'error']:
            raise ValueError('Invalid status')
        return v
These models define the component structure of our plan. When a user submits a prompt like "Create a dashboard with a sidebar", the planning step might produce:
[
  {
    "id": "cuid1",
    "title": "Dashboard Header",
    "description": "Responsive header with navigation",
    "file_path": "components/Header.tsx",
    "dependencies": []
  },
  {
    "id": "cuid2",
    "title": "Dashboard Sidebar",
    "description": "Collapsible sidebar with navigation links",
    "file_path": "components/Sidebar.tsx",
    "dependencies": []
  }
]
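As a quick illustration (this snippet is not part of main.py), each planned entry maps straight onto FragmentBase, with dependencies and port falling back to their defaults:
planned = FragmentBase(
    id="cuid1",
    title="Dashboard Header",
    description="Responsive header with navigation",
    file_path="components/Header.tsx",
)
print(planned.dict())  # dependencies defaults to [], port to 3000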
Part 2: The AI Engine
Now let's set up our language model. We're using Qwen 2.5, a powerful open-source model that's great for code generation:
logger.info("Initializing model and tokenizer...")
model_path = "Qwen/Qwen2.5-7B-Instruct-1M"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
You’re free to choose your own model; however, we chose Qwen 2.5 because:
It's optimized for code generation
It has a good balance of speed, quality, and resource requirements (we’re running this app on minimal GPU hardware to keep costs down)
It can handle complex prompts and context
Part 3: The Generation Pipeline
The magic happens in three main functions:
1. Generating our plan
Below is the function we use to instruct our LLM on how to generate our blueprint. It also streams the final list of fragments to the frontend, which can then be used to give the user feedback on what will be implemented:
async def get_fragments_structure(prompt: str, websocket: WebSocket) -> List[FragmentBase]:
    """Generate fragment structure"""
    system_prompt = """Return a valid JSON array of Next.js component fragments.
Each fragment should contain only the structural information (no code, descriptions or implementation details).
RULES FOR FRAGMENT STRUCTURE:
1. Format must be an array of objects:
[{
    "id": "", // Leave empty, will be generated
    "title": "ComponentName",
    "description": "Brief description",
    "file_path": "components/component-name.tsx",
    "dependencies": [] // Only NPM packages, NOT component names
}]
2. File Paths:
- Fragment paths should be in: components/[name].tsx
- Main page goes in: app/page.tsx
3. Dependencies:
- Include ONLY npm package names
- DO NOT include other component names as dependencies
- DO NOT include react, next.js, or shadcn UI (they're pre-installed)
4. Component Structure:
- Create modular, reusable components
- ALWAYS include an app/page.tsx file in the list of fragments that imports the other generated fragments
- Keep components in components directory
Remember:
- Dependencies are ONLY for npm packages
- File paths should be clean and correct
- No implementation details, only structure
"""
    chat = f"""<|im_start|>system\n{system_prompt}<|im_end|>
<|im_start|>user\n{prompt}<|im_end|>
<|im_start|>assistant\n"""
    try:
        json_str = await stream_tokens(chat, websocket, "structure")
        json_str = json_str[json_str.find('['):json_str.rfind(']') + 1]
        logger.info(f"Raw fragments: {json_str}")
        raw_fragments = json.loads(json_str)
        for fragment in raw_fragments:
            fragment["id"] = cuid_generator()
        fragments = [FragmentBase(**f) for f in raw_fragments]
        await websocket.send_json({
            "type": "fragment_structure",
            "content": [f.dict() for f in fragments]
        })
        return fragments
    except Exception as e:
        logger.error(f"Structure generation error: {e}")
        await websocket.send_json({"type": "error", "content": str(e)})
        raise
This function tells our LLM how to generate the initial fragments array, which serves as our plan for generating each of our components.
2. Commentary Generation
Next, before generating each component, we’ll provide context (in the form of commentary) to the user about what we’ll be building and how.
async def generate_commentary(fragment: FragmentBase, fragments: List[FragmentBase], prompt: str, websocket: WebSocket) -> str:
    """Generate implementation commentary for a fragment"""
    other_fragments = "\n".join([
        f"- {f.title}: {f.description} (in {f.file_path})"
        for f in fragments
        if f.id != fragment.id
    ])
    context_prompt = f"""You are a senior frontend developer explaining the implementation approach for a Next.js component.
Component to implement:
- Title: {fragment.title}
- Description: {fragment.description}
- Path: {fragment.file_path}
Other components in the project:
{other_fragments}
Project technical stack:
- Next.js 14.2.24 with app router
- TypeScript
- Tailwind CSS for styling
- shadcn UI components (in /components/ui/)
- React Server Components by default
Your task:
Write a BRIEF technical explanation of how we'll implement this component. Focus on:
1. Component's role in the larger application
2. Key UI elements and their arrangement
3. Any state management needs
4. Integration with other components
5. Notable technical considerations
Rules for your response:
1. Be concise (3-4 sentences maximum)
2. Focus on implementation approach, not generic descriptions
3. Mention specific shadcn UI components or Tailwind classes when relevant
4. Reference other components from the project where appropriate
5. No code snippets or markdown
6. No generic platitudes or obvious statements
7. Get straight to the technical details"""
    context_chat = f"""<|im_start|>system
You are a senior frontend developer known for clear, concise technical explanations.
Keep responses brief and focused on specific implementation details.
<|im_end|>
<|im_start|>user\n{prompt}<|im_end|>
<|im_start|>user
{context_prompt}
<|im_end|>
<|im_start|>assistant
"""
    return await stream_tokens(context_chat, websocket, f"context_{fragment.id}")
3. Code Generation
Now for the meat of our operation: based on the initial plan we’ve built, we generate each component. Below, we provide rules for the LLM to follow, along with context about the components that have already been generated, and we stream the code back to the frontend. This results in the following function, which we add to our main.py file:
async def generate_code(fragment: FragmentBase, fragments: List[FragmentBase], prompt: str,
                        websocket: WebSocket) -> str:
    """Generate code for a fragment with strict import validation"""
    valid_shadcn_components = [
        "accordion", "alert", "alert-dialog", "aspect-ratio", "avatar", "badge",
        "button", "calendar", "card", "carousel", "checkbox", "collapsible",
        "command", "context-menu", "dialog", "dropdown-menu", "form", "hover-card",
        "input", "label", "menubar", "navigation-menu", "popover", "progress",
        "radio-group", "scroll-area", "select", "separator", "sheet", "skeleton",
        "slider", "switch", "table", "tabs", "textarea", "toast", "toggle",
        "tooltip"
    ]
    other_components = "\n".join([
        f"{f.title} ({f.description}) - {f.file_path}"
        for f in fragments
        if f.id != fragment.id
    ])
    code_prompt = f"""You are an expert Next.js developer. Generate code for this component:
Title: {fragment.title}
Description: {fragment.description}
Path: {fragment.file_path}
STRICT IMPORT RULES:
1. ONLY import shadcn UI components from '@/components/ui/[component]'
2. ONLY import shadcn UI components from list of available components: {', '.join(valid_shadcn_components)}
3. Other components that exist in the project that you can import from: {other_components}
4. DO NOT import any other components unless they are in our list of available components or other components in the project
Response Rules:
1. Output ONLY the TypeScript/JavaScript code
2. NO descriptions, comments, or explanations
3. NO markdown code blocks or backticks
4. NO "Here's the code" or similar phrases
5. Start directly with the import statements
6. End with the last line of component code
7. Follow Next.js 14 app router patterns
8. Use Tailwind for styling
Begin with the first import statement."""
    logger.info(f"Fragment prompt: {code_prompt}")
    code_chat = f"""<|im_start|>system
You are an expert Next.js developer who writes clean, self-contained components.
Your responses must contain ONLY valid TypeScript code with correct imports.
<|im_end|>
<|im_start|>user
{prompt}
<|im_end|>
<|im_start|>user
{code_prompt}
<|im_end|>
<|im_start|>assistant
"""
    return await stream_tokens(code_chat, websocket, f"code_{fragment.id}")
Part 4: Real-time Communication
Next, we need a way to stream our plan, the commentary for each component, and the code itself to the frontend. For this, we use WebSockets. Note that stream_tokens sends the accumulated text so far on each update (rather than individual token deltas), which keeps the frontend rendering logic simple:
async def stream_tokens(prompt: str, websocket: WebSocket, msg_type: str = "token") -> str:
    """Generate and stream tokens"""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    # Run generation in a background thread so we can stream tokens as they arrive
    Thread(target=model.generate, kwargs={
        "inputs": inputs["input_ids"],
        "attention_mask": inputs["attention_mask"],
        "max_new_tokens": 2048,
        "temperature": 0.7,
        "do_sample": True,
        "streamer": streamer,
        "pad_token_id": tokenizer.eos_token_id
    }).start()
    text = ""
    try:
        for token in streamer:
            if token.strip():
                text += token
                if msg_type != "structure":
                    await websocket.send_json({"type": msg_type, "content": text})
                await asyncio.sleep(0)
    except Exception as e:
        logger.error(f"Streaming error: {e}")
        await websocket.send_json({"type": "error", "content": str(e)})
    return text.strip()
Part 5: The Sandbox Environment
Once we’ve generated the code, we execute it in a secure sandboxed environment on E2B and return a URL that the frontend can use as a live preview for the user (the sandbox stays alive for SANDBOX_TIMEOUT seconds, i.e. five minutes in our setup):
def deploy_to_e2b(fragments: List[Fragment]):
    # The first argument is the ID of our pre-built E2B sandbox template
    sandbox = Sandbox("22wede53y0614elkgps2", timeout=SANDBOX_TIMEOUT, api_key=E2B_API_KEY)
    for fragment in fragments:
        if fragment.status == 'completed' and fragment.file_path and fragment.code:
            sandbox.files.write(fragment.file_path, fragment.code)
    if any(fragment.dependencies for fragment in fragments):
        dependencies = set()
        for fragment in fragments:
            dependencies.update(fragment.dependencies)
        dependencies_str = " ".join(dependencies)
        sandbox.commands.run(f"npm install {dependencies_str}")
    sandbox.commands.run("npm run dev", background=True)
    return sandbox.get_host(3000)
Part 6: Putting It All Together
With all of our initial code in place, we can now stitch everything together into our main endpoint. It handles the WebSocket connection, processes the user's prompt, creates a plan of action, generates commentary and code for each fragment, and finally deploys our app to a secure sandbox:
@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
await websocket.accept()
try:
data = await websocket.receive_json()
prompt = data.get("prompt")
if not prompt:
raise ValueError("No prompt provided")
await websocket.send_json({
"type": "status",
"content": "Generating component structure..."
})
fragments = []
fragment_bases = await get_fragments_structure(prompt, websocket)
await websocket.send_json({
"type": "status",
"content": "Structure generated. Creating components..."
})
total_fragments = len(fragment_bases)
for idx, base in enumerate(fragment_bases, 1):
await websocket.send_json({
"type": "status",
"content": f"Generating component {idx}/{total_fragments}: {base.title}"
})
commentary = await generate_commentary(base, fragments, prompt, websocket)
await websocket.send_json({
"type": "status",
"content": f"Writing code for {base.title}..."
})
code = await generate_code(base, fragments, prompt, websocket)
fragment_dict = base.dict()
fragment = Fragment(
**fragment_dict,
code=code,
commentary=commentary,
status="completed"
)
fragments.append(fragment)
await websocket.send_json({
"type": "fragment_update",
"content": fragment.dict()
})
await websocket.send_json({
"type": "status",
"content": "All components generated. Starting deployment..."
})
preview_url = deploy_to_e2b(fragments)
await websocket.send_json({
"type": "preview_url",
"content": preview_url
})
except Exception as e:
logger.error(f"Error: {e}")
await websocket.send_json({"type": "error", "content": str(e)})
finally:
await websocket.close()
@app.get("/health")
async def health_check():
return {"status": "healthy"}
Deployment Configuration
Finally, let's set up our cerebrium.toml, which handles the configuration for our deployment and all of our dependencies:
[cerebrium.deployment]
name = "4-ai-coding-agent"
python_version = "3.12"
docker_base_image_url = "debian:bookworm-slim"
disable_auth = true
include = ['./*', 'main.py', 'cerebrium.toml']
exclude = ['.*']
[cerebrium.runtime.custom]
port = 8765
entrypoint = ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8765"]
healthcheck_endpoint = "/health"
[cerebrium.hardware]
cpu = 4
memory = 16.0
compute = "AMPERE_A10"
[cerebrium.scaling]
# This parameter keeps 1 instance ready to handle requests. Don't forget to disable this when you're not using your application (if costs are a concern)
min_replicas = 1
max_replicas = 5
cooldown = 30
replica_concurrency = 1
response_grace_period = 900
scaling_metric = "concurrency_utilization"
scaling_target = 100
scaling_buffer = 0
[cerebrium.dependencies.pip]
# These mirror the imports in main.py; pin exact versions as needed for your deployment
fastapi = "latest"
uvicorn = "latest"
torch = "latest"
transformers = "latest"
accelerate = "latest"
pydantic = "latest"
e2b = "latest"
cuid2 = "latest"
huggingface_hub = "latest"
hf_transfer = "latest"
Not sure what the above configurations mean? Check out our TOML reference, which describes in detail how to configure the deployment of your Cerebrium apps.
Lastly, deploy your final app to Cerebrium with the following command:
cerebrium deploy
… and don’t forget to update your frontend consumer's .env file so that it makes use of your new WebSocket endpoint:
CEREBRIUM_SOCKET_URL="wss://api.cortex.cerebrium.ai/v4/[YOUR_PROJECT_ID]/4-ai-coding-agent/ws"
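If you'd rather exercise the endpoint without the Next.js frontend, here is a minimal Python client sketch (it assumes the websockets package is installed; the prompt is just an example). It sends a prompt and prints the message types emitted by main.py until the preview URL arrives:
import asyncio
import json
import websockets

SOCKET_URL = "wss://api.cortex.cerebrium.ai/v4/[YOUR_PROJECT_ID]/4-ai-coding-agent/ws"

async def main():
    async with websockets.connect(SOCKET_URL) as ws:
        # The endpoint expects a single JSON message containing the prompt
        await ws.send(json.dumps({"prompt": "Create a dashboard with a sidebar"}))
        async for raw in ws:
            msg = json.loads(raw)
            if msg["type"] == "preview_url":
                print(f"Preview ready: {msg['content']}")
                break
            if msg["type"] == "error":
                print(f"Error: {msg['content']}")
                break
            # Other types: status, fragment_structure, context_*, code_*, fragment_update
            print(msg["type"])

asyncio.run(main())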
Next Steps
Now that you have a working AI coding agent, consider:
Adding support for more frameworks
Serving your LLM with an optimized inference framework such as vLLM, TensorRT-LLM, or SGLang for faster generation (a rough sketch follows this list)
Implementing custom model fine-tuning or RAG for up-to-date use of framework documentation
Integrating with GitHub for PR creation
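As a rough illustration of the optimized-serving idea, here is a minimal, non-streaming sketch using vLLM's offline API (this assumes the vllm package; it is not a drop-in replacement for stream_tokens, since streaming tokens over the WebSocket would require vLLM's async engine):
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct-1M", dtype="float16")
sampling = SamplingParams(temperature=0.7, max_tokens=2048)

def generate_once(prompt: str) -> str:
    # generate() takes a list of prompts and returns one RequestOutput per prompt
    outputs = llm.generate([prompt], sampling)
    return outputs[0].outputs[0].text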
The possibilities are endless! Congratulations, you've successfully built a foundation that can evolve into your perfect coding assistant.
Limitations
This application depends on a large language model that may hallucinate or generate incorrect code, so the deployed application may not always function as intended. Double-check the generated code before shipping it to a production environment.
PS: If you ever get stuck on any of the above implementation details, check out our examples repository for guidance or chat with us in the Cerebrium Community Discord Server.
PPS: We’d love it if you could share this post with your friends and colleagues… It may just help them build the next AI unicorn startup 😛