The Problem RAG Solves
LLMs know what they were trained on. They don't know:
- Your internal documentation
- Your product's knowledge base
- Data from after their training cutoff
- Private, proprietary information
You can't fine-tune a model for every new piece of content. RAG is the practical alternative: retrieve relevant documents at query time and include them in the prompt.
How RAG Works
1. At index time:
Documents → Chunks → Embeddings → Vector Database
2. At query time:
User Query → Query Embedding → Similarity Search → Top-K Chunks
Top-K Chunks + Query → LLM Prompt → Answer
That's it. The "magic" is embedding similarity: finding documents that are semantically similar to the question.
Step 1: Embedding
An embedding is a vector (an array of numbers) that represents the semantic meaning of text. Similar meanings → similar vectors.
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding
# These will be very similar (high cosine similarity)
embed("How do I reset my password?")
embed("Password recovery process")
# These will be dissimilar
embed("How do I reset my password?")
embed("Quarterly revenue report")
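"Very similar" here means high cosine similarity: the cosine of the angle between the two vectors, where 1.0 means they point in the same direction. A minimal pure-Python sketch, using small toy vectors in place of real 1,536-dimensional embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings of the three texts above
password_reset = [0.9, 0.1, 0.0]      # "How do I reset my password?"
password_recovery = [0.85, 0.15, 0.05]  # "Password recovery process"
revenue_report = [0.05, 0.1, 0.9]     # "Quarterly revenue report"

print(cosine_similarity(password_reset, password_recovery))  # close to 1.0
print(cosine_similarity(password_reset, revenue_report))     # much lower
```

The vectors and their values are made up for illustration; real embedding models produce the geometry, you only measure it.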
Step 2: Chunking
Documents must be split into chunks: small enough to fit in context, large enough to contain useful information.
def chunk_document(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks."""
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunk = ' '.join(words[start:end])
        chunks.append(chunk)
        start += chunk_size - overlap  # overlap prevents cutting context
    return chunks
Chunking strategies:
- Fixed size: simple, works reasonably well
- By sentence/paragraph: better for maintaining context
- Semantic chunking: split on topic boundaries (more complex, often better)
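A sketch of the paragraph-based strategy: split on blank lines, then pack whole paragraphs into chunks up to a word budget. The boundary rule and budget here are illustrative choices, not a standard:

```python
def chunk_by_paragraph(text: str, max_words: int = 200) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_words words.

    A paragraph longer than max_words is kept intact as its own chunk
    (splitting it would defeat the purpose of paragraph chunking).
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        words = len(para.split())
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks

doc = (
    "First paragraph about passwords.\n\n"
    "Second paragraph about billing.\n\n"
    + ("word " * 190).strip()
)
print(len(chunk_by_paragraph(doc, max_words=100)))  # 2
```

Unlike the fixed-size version above, no paragraph is ever cut in half, at the cost of uneven chunk sizes.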
Step 3: Store in a Vector Database
import chromadb

# Named chroma_client so it doesn't shadow the OpenAI client used by embed()
chroma_client = chromadb.Client()
collection = chroma_client.create_collection("docs")

# Index your documents
for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
    collection.add(
        documents=[chunk],
        embeddings=[embedding],
        ids=[f"chunk_{i}"],
        metadatas=[{"source": "user-manual.pdf", "page": i // 10}]
    )
Popular vector databases:
- ChromaDB: open source, easy to start with
- Pinecone: managed, production-ready
- Weaviate: open source with hybrid search
- pgvector: if you're already on PostgreSQL
Step 4: Query and Retrieve
def search(query: str, n_results: int = 5) -> list[str]:
    query_embedding = embed(query)
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
        include=['documents', 'distances', 'metadatas']
    )
    return results['documents'][0]
Step 5: Generate with Context
from anthropic import Anthropic

claude = Anthropic()

def answer(question: str) -> str:
    # Retrieve relevant chunks
    relevant_docs = search(question)
    context = "\n\n---\n\n".join(relevant_docs)

    # Construct the prompt
    prompt = f"""Answer the question based on the provided context.
If the answer isn't in the context, say "I don't have information about that."

Context:
{context}

Question: {question}

Answer:"""

    response = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
Common RAG Problems and Fixes
Problem: Retrieves irrelevant chunks
Fix: Better chunking (semantic), hybrid search (keyword + semantic), or re-ranking retrieved results before sending to LLM.
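A minimal re-ranking sketch: score each retrieved chunk by how many query words it contains and reorder before building the prompt. A production system would typically use a cross-encoder re-ranker; this keyword-overlap version is only illustrative:

```python
def rerank(query: str, chunks: list[str]) -> list[str]:
    """Reorder chunks by the number of query words each contains."""
    query_words = set(query.lower().split())

    def score(chunk: str) -> int:
        return len(query_words & set(chunk.lower().split()))

    # Stable sort: ties keep their original retrieval order
    return sorted(chunks, key=score, reverse=True)

chunks = [
    "Quarterly revenue grew 12 percent.",
    "To reset your password, open account settings.",
    "Shipping takes 3-5 business days.",
]
print(rerank("how do i reset my password", chunks)[0])
```

Re-ranking is cheap relative to the LLM call, so it is usually worth applying even when retrieval is already decent.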
Problem: Answer ignores retrieved context
Fix: Make the instruction more explicit: "You MUST only use the provided context. If the answer is not there, say so."
Problem: Slow retrieval
Fix: Pre-filter by metadata before vector search (e.g., only search chunks from the relevant product category).
Problem: Context window overflow
Fix: Limit retrieved chunks, use a summarization step, or use a model with larger context.
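One sketch of the "limit retrieved chunks" fix: keep chunks in retrieval order until an approximate budget is exhausted. A real implementation would count tokens with the model's tokenizer; word count is a rough stand-in here:

```python
def fit_to_budget(chunks: list[str], max_words: int = 100) -> list[str]:
    """Keep chunks in retrieval order until the word budget runs out."""
    kept, used = [], 0
    for chunk in chunks:
        n = len(chunk.split())
        if used + n > max_words:
            break  # dropping lower-ranked chunks beats truncating mid-chunk
        kept.append(chunk)
        used += n
    return kept

chunks = [("alpha " * 40).strip(), ("beta " * 40).strip(), ("gamma " * 40).strip()]
print(len(fit_to_budget(chunks, max_words=100)))  # 2
```

Because retrieval returns chunks ranked by similarity, cutting from the tail discards the least relevant material first.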
Minimal RAG Stack (Python)
pip install anthropic chromadb sentence-transformers
from sentence_transformers import SentenceTransformer
import chromadb
from anthropic import Anthropic

# Local embeddings (free, no API needed)
model = SentenceTransformer('all-MiniLM-L6-v2')
chroma = chromadb.Client()
collection = chroma.create_collection("docs")
claude = Anthropic()

def index(documents: list[str]):
    embeddings = model.encode(documents).tolist()
    ids = [str(i) for i in range(len(documents))]
    collection.add(documents=documents, embeddings=embeddings, ids=ids)

def query(question: str) -> str:
    q_embedding = model.encode([question]).tolist()
    results = collection.query(query_embeddings=q_embedding, n_results=3)
    context = "\n".join(results['documents'][0])
    response = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}"
        }]
    )
    return response.content[0].text
Key Takeaways
- RAG = retrieve relevant documents → include in prompt → generate answer
- The pipeline: chunk → embed → store → retrieve → generate
- Embeddings convert text to vectors; similar meaning → similar vectors
- Start with ChromaDB and sentence-transformers for a local proof of concept
- The most common failure: bad chunking loses context across chunk boundaries
- Hybrid search (keyword + vector) often outperforms pure vector search