
Building LLM Apps: The Architecture Decisions That Matter

Beyond prompting, the infrastructure decisions that determine whether your LLM app is reliable, fast, and affordable in production.

📅 February 7, 2026 · TechTwitter.io · Tags: llm, ai-apps, architecture, prompt-engineering

The Naive Architecture

Most LLM app prototypes look like this:

User Input → Call LLM API → Return Response

This works at zero scale. At production scale, it breaks in predictable ways:

  • LLM APIs are slow (1-10 seconds per call)
  • LLM APIs are expensive at volume
  • LLM outputs are non-deterministic (same input ≠ same output)
  • LLM APIs have rate limits
  • Token limits constrain what you can send

Here's how to build for production.


Decision 1: Caching

LLM calls are expensive and slow. Cache aggressively.

Semantic Caching

Cache based on similarity, not exact match:

from sentence_transformers import SentenceTransformer
import numpy as np
import json

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.cache = []  # list of (embedding, query, response); linear scan, so use a vector index at scale
        self.threshold = threshold

    def get(self, query: str) -> str | None:
        embedding = self.model.encode([query])[0]
        for cached_embedding, cached_query, response in self.cache:
            similarity = np.dot(embedding, cached_embedding) / (
                np.linalg.norm(embedding) * np.linalg.norm(cached_embedding)
            )
            if similarity > self.threshold:
                return response
        return None

    def set(self, query: str, response: str):
        embedding = self.model.encode([query])[0]
        self.cache.append((embedding, query, response))

"How do I reset my password?" and "What's the process to reset my password?" return the same cached answer.

Exact Caching

For identical inputs (common in batch processing):

import hashlib
import redis

cache = redis.Redis()

def cached_llm_call(prompt: str, model: str, ttl: int = 3600) -> str:
    key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    cached = cache.get(key)
    if cached:
        return cached.decode()
    response = call_llm(prompt, model)
    cache.setex(key, ttl, response)
    return response

Decision 2: Streaming

LLM responses take 2-10 seconds. Streaming shows the response as it's generated:

import anthropic

client = anthropic.AsyncAnthropic()

async def stream_response(prompt: str):
    # Use the async client so streaming doesn't block the event loop.
    async with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    ) as stream:
        async for text in stream.text_stream:
            yield text  # SSE or WebSocket to frontend

# FastAPI SSE endpoint:
import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    message: str

@app.post("/chat")
async def chat(request: ChatRequest):
    async def generate():
        async for token in stream_response(request.message):
            yield f"data: {json.dumps({'token': token})}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")

Perceived latency drops dramatically — users start reading while the model is still generating.
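On the client side, each SSE frame is a `data:` line terminated by a blank line. A minimal parser for the frames this endpoint emits (a hypothetical helper, not part of any SSE library):

```python
import json

def parse_sse_tokens(raw: str) -> list[str]:
    """Extract token payloads from raw SSE text, stopping at the [DONE] sentinel."""
    tokens = []
    for frame in raw.split("\n\n"):
        if not frame.startswith("data: "):
            continue
        payload = frame[len("data: "):]
        if payload == "[DONE]":
            break
        tokens.append(json.loads(payload)["token"])
    return tokens

raw = 'data: {"token": "Hel"}\n\ndata: {"token": "lo"}\n\ndata: [DONE]\n\n'
```

In a browser, the built-in `EventSource` API does this framing for you; the manual version is mostly useful in tests and non-browser clients.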


Decision 3: Model Selection Strategy

Don't use your most expensive model for everything:

def route_to_model(task_type: str, context_length: int) -> str:
    if task_type == "simple_qa" and context_length < 2000:
        return "claude-haiku-4-5-20251001"  # Fast, cheap
    elif task_type == "code_generation":
        return "claude-sonnet-4-6"  # Balanced
    elif task_type == "complex_reasoning" or context_length > 100_000:
        return "claude-opus-4-6"  # Most capable
    else:
        return "claude-sonnet-4-6"

Cost differences are steep: Haiku is roughly 20x cheaper than Opus. Route simple tasks to cheaper models and save the expensive ones for work that needs them.
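Routing decisions become concrete once you put numbers on them. A cost estimator sketch; the prices below are illustrative placeholders, not current rates, so look up real pricing before relying on this:

```python
# Illustrative per-million-token prices (input, output). Placeholders only.
PRICES = {
    "claude-haiku-4-5-20251001": (1.00, 5.00),
    "claude-sonnet-4-6": (3.00, 15.00),
    "claude-opus-4-6": (15.00, 75.00),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call under the illustrative price table."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000
```

Run this over a day of logged token counts and the savings from routing become a number you can put in front of whoever pays the bill.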


Decision 4: Structured Output

Non-deterministic text output is hard to parse reliably. Use structured output:

from anthropic import Anthropic
import json

client = Anthropic()

def extract_structured(text: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""Extract the following from this customer review.
Return valid JSON only, no other text:
{{
  "sentiment": "positive" | "negative" | "neutral",
  "score": 1-5,
  "key_issues": ["list", "of", "issues"],
  "recommended_action": "escalate" | "resolve" | "thank"
}}

Review: {text}"""
        }]
    )

    # Parse and validate
    try:
        return json.loads(response.content[0].text)
    except json.JSONDecodeError:
        # Retry or fallback
        raise ValueError("Model returned invalid JSON")

Alternatively, use models with native function/tool calling for guaranteed structured output.
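Parsing alone isn't validation: the JSON can decode cleanly and still violate the schema. A light hand-rolled check (hypothetical; a `pydantic` model would do the same job) rejects structurally wrong responses before they reach downstream code:

```python
ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}
ALLOWED_ACTIONS = {"escalate", "resolve", "thank"}

def validate_extraction(data: dict) -> dict:
    """Raise ValueError if the extracted dict violates the expected schema."""
    if data.get("sentiment") not in ALLOWED_SENTIMENTS:
        raise ValueError(f"bad sentiment: {data.get('sentiment')!r}")
    score = data.get("score")
    if not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"bad score: {score!r}")
    if not isinstance(data.get("key_issues"), list):
        raise ValueError("key_issues must be a list")
    if data.get("recommended_action") not in ALLOWED_ACTIONS:
        raise ValueError(f"bad action: {data.get('recommended_action')!r}")
    return data
```

A failed validation is a good trigger for the retry path: re-prompt once with the error message included, then fall back.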


Decision 5: Rate Limiting and Retry

LLM APIs have rate limits. Handle them gracefully:

import asyncio
import anthropic
from anthropic import RateLimitError, APIStatusError

async def call_with_retry(
    client: anthropic.AsyncAnthropic,
    messages: list,
    max_retries: int = 3
) -> str:
    for attempt in range(max_retries):
        try:
            response = await client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                messages=messages
            )
            return response.content[0].text
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            wait = 2 ** attempt  # Exponential backoff: 1s, 2s, 4s
            await asyncio.sleep(wait)
        except APIStatusError as e:
            if e.status_code >= 500 and attempt < max_retries - 1:
                await asyncio.sleep(2 ** attempt)
            else:
                raise
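Backoff is reactive; a client-side limiter is proactive. A token-bucket sketch (an assumption about how you want to shape traffic, not an SDK feature) that spaces requests so you rarely hit the limit at all:

```python
import time

class TokenBucket:
    """Allow up to `rate` requests per second, with bursts of `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def acquire(self) -> float:
        """Consume one token; return how many seconds the caller should sleep first."""
        now = time.monotonic()
        # Refill tokens accrued since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return 0.0
        # Not enough tokens: wait until one accrues, and account for spending it.
        wait = (1 - self.tokens) / self.rate
        self.tokens = 0.0
        self.last = now + wait  # the token accruing during the wait is the one spent
        return wait
```

Call `time.sleep(bucket.acquire())` (or `await asyncio.sleep(...)`) before each request; the retry-with-backoff code then only handles the rare limit you hit anyway.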

Decision 6: Observability

You can't optimize what you can't measure:

import time
import logging

logger = logging.getLogger(__name__)

def traced_llm_call(prompt: str, metadata: dict | None = None) -> str:
    metadata = metadata or {}  # avoid the mutable-default-argument pitfall
    start = time.time()
    try:
        response = call_llm(prompt)
        latency = time.time() - start

        # Log for monitoring
        logger.info("llm_call", extra={
            "latency_ms": latency * 1000,
            "prompt_tokens": count_tokens(prompt),
            "response_tokens": count_tokens(response),
            "model": "claude-sonnet-4-6",
            "cache_hit": False,
            **metadata
        })
        return response
    except Exception as e:
        logger.error("llm_call_failed", extra={"error": str(e), **metadata})
        raise

Track: latency, token counts (→ cost), cache hit rate, error rate. These metrics tell you where to optimize.
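The logged fields roll up directly into those metrics. A toy aggregator over logged events (the event shape here mirrors the `extra` dict and is otherwise a hypothetical stand-in for your real log pipeline):

```python
from statistics import median

def summarize(events: list[dict]) -> dict:
    """Roll logged LLM-call events up into headline metrics."""
    latencies = [e["latency_ms"] for e in events if "latency_ms" in e]
    total = len(events)
    return {
        "p50_latency_ms": median(latencies),
        "cache_hit_rate": sum(1 for e in events if e.get("cache_hit")) / total,
        "error_rate": sum(1 for e in events if e.get("error")) / total,
    }

events = [
    {"latency_ms": 1200, "cache_hit": False},
    {"latency_ms": 40, "cache_hit": True},
    {"latency_ms": 2100, "cache_hit": False},
    {"error": "rate_limited"},
]
```

In production this job belongs to your metrics stack (Prometheus, Datadog, a warehouse query), but the arithmetic is the same.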


The Production Architecture

User Request
    ↓
Rate Limiter / Auth
    ↓
Semantic Cache  ← Cache Hit? Return immediately
    ↓ (cache miss)
Model Router (select model by task type)
    ↓
LLM API (with retry + streaming)
    ↓
Response Parser / Validator
    ↓
Cache Update
    ↓
Response (streamed to user)
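The flow above glues together in a few lines. Everything here is a stub (hypothetical `cache`, `router`, and `llm` callables); the point is the ordering, not the implementations:

```python
def handle_request(query: str, task_type: str, cache, router, llm) -> str:
    # 1. Semantic cache first: a hit skips the LLM entirely.
    cached = cache.get(query)
    if cached is not None:
        return cached
    # 2. Pick a model by task type and input size.
    model = router(task_type, len(query))
    # 3. Call the LLM (retry + streaming would wrap this step in production).
    response = llm(query, model)
    # 4. Write back to the cache so the next similar query is free.
    cache.set(query, response)
    return response
```

Keeping each layer behind a callable or small interface like this also makes the pipeline testable with fakes, so you can verify caching and routing logic without spending tokens.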

Key Takeaways

  • Cache aggressively: semantic caching handles similar-but-not-identical queries
  • Stream responses: dramatically improves perceived latency
  • Route by model: use cheap models for simple tasks, reserve expensive models for hard ones
  • Structured output: specify JSON format in prompts to make parsing reliable
  • Retry with backoff: LLM APIs have transient failures and rate limits
  • Instrument everything: track latency, token usage, and error rates from day one