The Naive Architecture
Most LLM app prototypes look like this:
User Input → Call LLM API → Return Response
This works at zero scale. At production scale, it breaks in predictable ways:
- LLM APIs are slow (1-10 seconds per call)
- LLM APIs are expensive at volume
- LLM outputs are non-deterministic (same input ≠ same output)
- LLM APIs have rate limits
- Token limits constrain what you can send
Here's how to build for production.
Decision 1: Caching
LLM calls are expensive and slow. Cache aggressively.
Semantic Caching
Cache based on similarity, not exact match:
from sentence_transformers import SentenceTransformer
import numpy as np

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.cache = []  # (embedding, query, response)
        self.threshold = threshold

    def get(self, query: str) -> str | None:
        embedding = self.model.encode([query])[0]
        for cached_embedding, cached_query, response in self.cache:
            # Cosine similarity between the new query and each cached one
            similarity = np.dot(embedding, cached_embedding) / (
                np.linalg.norm(embedding) * np.linalg.norm(cached_embedding)
            )
            if similarity > self.threshold:
                return response
        return None

    def set(self, query: str, response: str):
        embedding = self.model.encode([query])[0]
        self.cache.append((embedding, query, response))
"How do I reset my password?" and "What's the process to reset my password?" return the same cached answer.
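The matching step above is plain cosine similarity over embeddings. A dependency-free sketch with toy three-dimensional vectors (real all-MiniLM-L6-v2 embeddings have 384 dimensions) shows why a 0.95 threshold separates paraphrases from unrelated queries:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": near-duplicates score close to 1.0, unrelated near 0.
query = [0.9, 0.1, 0.0]
paraphrase = [0.85, 0.15, 0.05]
unrelated = [0.0, 0.1, 0.9]

assert cosine_similarity(query, paraphrase) > 0.95  # cache hit
assert cosine_similarity(query, unrelated) < 0.5    # cache miss
```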
Exact Caching
For identical inputs (common in batch processing):
import hashlib
import redis

cache = redis.Redis()

def cached_llm_call(prompt: str, model: str, ttl: int = 3600) -> str:
    key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    cached = cache.get(key)
    if cached:
        return cached.decode()
    response = call_llm(prompt, model)
    cache.setex(key, ttl, response)
    return response
Decision 2: Streaming
A full LLM response can take several seconds to finish. Streaming shows the response as it's generated:
import anthropic

client = anthropic.AsyncAnthropic()

async def stream_response(prompt: str):
    async with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    ) as stream:
        async for text in stream.text_stream:
            yield text  # SSE or WebSocket to frontend
# FastAPI SSE endpoint:
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    message: str

@app.post("/chat")
async def chat(request: ChatRequest):
    async def generate():
        async for token in stream_response(request.message):
            yield f"data: {json.dumps({'token': token})}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(generate(), media_type="text/event-stream")
Perceived latency drops dramatically — users start reading while the model is still generating.
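On the client side, the `text/event-stream` body is just lines of `data: <json>` terminated by a `[DONE]` sentinel, which is the wire format the endpoint above emits. A parsing sketch:

```python
import json

def parse_sse_lines(lines) -> str:
    """Reassemble tokens from 'data: ...' SSE lines until [DONE]."""
    tokens = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        tokens.append(json.loads(payload)["token"])
    return "".join(tokens)
```

A browser client would do the same with `EventSource` or a streaming `fetch` reader.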
Decision 3: Model Selection Strategy
Don't use your most expensive model for everything:
def route_to_model(task_type: str, context_length: int) -> str:
    if task_type == "simple_qa" and context_length < 2000:
        return "claude-haiku-4-5-20251001"  # Fast, cheap
    elif task_type == "code_generation":
        return "claude-sonnet-4-6"  # Balanced
    elif task_type == "complex_reasoning" or context_length > 100_000:
        return "claude-opus-4-6"  # Most capable
    else:
        return "claude-sonnet-4-6"
Cost ratios vary, but Haiku is roughly 20x cheaper than Opus. Route simple tasks to cheaper models.
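The arithmetic behind routing is worth making concrete. Using illustrative prices in arbitrary units (the rough 20x ratio from above, not real per-token prices), shifting 80% of traffic to the cheap model cuts the blended cost by about 4x:

```python
def blended_cost(cheap_fraction: float,
                 cheap_price: float = 1.0,
                 expensive_price: float = 20.0) -> float:
    """Average cost per call for a traffic mix (arbitrary units)."""
    return cheap_fraction * cheap_price + (1 - cheap_fraction) * expensive_price

baseline = blended_cost(0.0)  # everything on the expensive model: 20.0
mixed = blended_cost(0.8)     # 0.8 * 1 + 0.2 * 20 = 4.8, ~4x cheaper
```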
Decision 4: Structured Output
Non-deterministic text output is hard to parse reliably. Use structured output:
from anthropic import Anthropic
import json

client = Anthropic()

def extract_structured(text: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""Extract the following from this customer review.
Return valid JSON only, no other text:
{{
  "sentiment": "positive" | "negative" | "neutral",
  "score": 1-5,
  "key_issues": ["list", "of", "issues"],
  "recommended_action": "escalate" | "resolve" | "thank"
}}

Review: {text}"""
        }]
    )
    # Parse and validate
    try:
        return json.loads(response.content[0].text)
    except json.JSONDecodeError:
        # Retry or fallback
        raise ValueError("Model returned invalid JSON")
Alternatively, use models with native function/tool calling for guaranteed structured output.
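"Parse and validate" deserves more than a `json.loads`: the JSON can be well-formed yet violate the schema. A sketch of field-level validation against the schema in the prompt above (production code might use Pydantic for this instead):

```python
import json

ALLOWED_SENTIMENT = {"positive", "negative", "neutral"}
ALLOWED_ACTION = {"escalate", "resolve", "thank"}

def validate_extraction(raw: str) -> dict:
    """Parse model output and check every field against the prompt's schema."""
    data = json.loads(raw)  # raises JSONDecodeError on malformed output
    if data.get("sentiment") not in ALLOWED_SENTIMENT:
        raise ValueError(f"bad sentiment: {data.get('sentiment')}")
    if not isinstance(data.get("score"), int) or not 1 <= data["score"] <= 5:
        raise ValueError(f"score out of range: {data.get('score')}")
    if not isinstance(data.get("key_issues"), list):
        raise ValueError("key_issues must be a list")
    if data.get("recommended_action") not in ALLOWED_ACTION:
        raise ValueError(f"bad action: {data.get('recommended_action')}")
    return data
```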
Decision 5: Rate Limiting and Retry
LLM APIs have rate limits. Handle them gracefully:
import asyncio
import anthropic
from anthropic import RateLimitError, APIStatusError

async def call_with_retry(
    client: anthropic.AsyncAnthropic,
    messages: list,
    max_retries: int = 3
) -> str:
    for attempt in range(max_retries):
        try:
            response = await client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                messages=messages
            )
            return response.content[0].text
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            wait = 2 ** attempt  # Exponential backoff: 1s, 2s, 4s
            await asyncio.sleep(wait)
        except APIStatusError as e:
            if e.status_code >= 500 and attempt < max_retries - 1:
                await asyncio.sleep(2 ** attempt)
            else:
                raise
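One refinement the snippet above omits: when many workers hit the rate limit at once and back off on the same fixed schedule, they all retry in lockstep and collide again. Adding jitter (here, the AWS-style "full jitter" variant) spreads the retries out:

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter.

    Returns a delay drawn uniformly from [0, min(cap, base * 2^attempt)],
    so concurrent clients desynchronize instead of retrying together.
    """
    return random.uniform(0.0, min(cap, base * 2 ** attempt))

# Drop-in for the fixed delay above:
#     await asyncio.sleep(backoff_delay(attempt))
```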
Decision 6: Observability
You can't optimize what you can't measure:
import time
import logging

logger = logging.getLogger(__name__)

def traced_llm_call(prompt: str, metadata: dict | None = None) -> str:
    metadata = metadata or {}  # avoid a mutable default argument
    start = time.time()
    try:
        response = call_llm(prompt)
        latency = time.time() - start
        # Log for monitoring
        logger.info("llm_call", extra={
            "latency_ms": latency * 1000,
            "prompt_tokens": count_tokens(prompt),
            "response_tokens": count_tokens(response),
            "model": "claude-sonnet-4-6",
            "cache_hit": False,
            **metadata
        })
        return response
    except Exception as e:
        logger.error("llm_call_failed", extra={"error": str(e), **metadata})
        raise
Track: latency, token counts (→ cost), cache hit rate, error rate. These metrics tell you where to optimize.
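Averages hide the tail, and the tail is what users notice. A nearest-rank percentile sketch for turning logged latencies into p50/p95 (a real deployment would let the metrics backend compute this, but the arithmetic is worth seeing once):

```python
import math

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: good enough for dashboard sanity checks."""
    ordered = sorted(values)
    k = math.ceil(p / 100 * len(ordered)) - 1
    return ordered[max(0, k)]

# One slow call barely moves the mean but dominates p95.
latencies_ms = [120, 130, 140, 150, 155, 160, 165, 170, 180, 2400]
p50 = percentile(latencies_ms, 50)  # 155: the typical call
p95 = percentile(latencies_ms, 95)  # 2400: the tail users complain about
```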
The Production Architecture
User Request
↓
Rate Limiter / Auth
↓
Semantic Cache ← Cache Hit? Return immediately
↓ (cache miss)
Model Router (select model by task type)
↓
LLM API (with retry + streaming)
↓
Response Parser / Validator
↓
Cache Update
↓
Response (streamed to user)
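Glued together, the request path above is a short function. A sketch with every component stubbed out (the cache, router, LLM call, and validator stand in for the pieces built in Decisions 1-5):

```python
def handle_request(query, cache, route, call_llm, validate):
    """One pass through the pipeline: cache -> route -> call -> validate -> cache."""
    if query in cache:
        return cache[query]            # cache hit: skip the LLM entirely
    model = route(query)               # cheapest adequate model for the task
    response = call_llm(model, query)  # retries and streaming live inside here
    validate(response)                 # reject malformed output before caching
    cache[query] = response            # populate the cache for the next caller
    return response

# Stub wiring, for illustration only:
cache: dict = {}
route = lambda q: "cheap-model" if len(q) < 100 else "big-model"
call_llm = lambda model, q: f"[{model}] answer"
validate = lambda r: None
```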
Key Takeaways
- Cache aggressively: semantic caching handles similar-but-not-identical queries
- Stream responses: dramatically improves perceived latency
- Route by model: use cheap models for simple tasks, reserve expensive models for hard ones
- Structured output: specify JSON format in prompts to make parsing reliable
- Retry with backoff: LLM APIs have transient failures and rate limits
- Instrument everything: track latency, token usage, and error rates from day one