The Problem RAG Solves
LLMs know what they were trained on. They don't know:
- Your internal documentation
- Your product's knowledge base
- Data from after their training cutoff
- Private, proprietary information
You can't fine-tune a model for every new piece of content. RAG is the practical alternative: retrieve relevant documents at query time and include them in the prompt.
How RAG Works
1. At index time:
Documents → Chunks → Embeddings → Vector Database
2. At query time:
User Query → Query Embedding → Similarity Search → Top-K Chunks
Top-K Chunks + Query → LLM Prompt → Answer
That's it. The "magic" is embedding similarity: finding documents that are semantically similar to the question.
Step 1: Embedding
An embedding is a vector (an array of numbers) that represents the semantic meaning of text. Similar meanings → similar vectors.
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding
# These will be very similar (high cosine similarity)
embed("How do I reset my password?")
embed("Password recovery process")
# These will be dissimilar
embed("How do I reset my password?")
embed("Quarterly revenue report")
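"Very similar" here means high cosine similarity: the cosine of the angle between the two vectors, where 1.0 means they point in the same direction. A minimal pure-Python sketch, using small toy vectors in place of real 1,536-dimensional embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings of the three texts above
password_reset = [0.9, 0.1, 0.0]      # "How do I reset my password?"
password_recovery = [0.85, 0.15, 0.05]  # "Password recovery process"
revenue_report = [0.05, 0.1, 0.9]     # "Quarterly revenue report"

print(cosine_similarity(password_reset, password_recovery))  # close to 1.0
print(cosine_similarity(password_reset, revenue_report))     # much lower
```

The vectors and their values are made up for illustration; real embedding models produce the geometry, you only measure it.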
Step 2: Chunking
Documents must be split into chunks: small enough to fit in context, large enough to contain useful information.
def chunk_document(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks."""
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunk = ' '.join(words[start:end])
        chunks.append(chunk)
        start += chunk_size - overlap  # overlap prevents cutting context
    return chunks
Chunking strategies:
- Fixed size: simple, works reasonably well
- By sentence/paragraph: better for maintaining context
- Semantic chunking: split on topic boundaries (more complex, often better)
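A sketch of the paragraph-based strategy: split on blank lines, then pack whole paragraphs into chunks up to a word budget. The boundary rule and budget here are illustrative choices, not a standard:

```python
def chunk_by_paragraph(text: str, max_words: int = 200) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_words words.

    A paragraph longer than max_words is kept intact as its own chunk
    (splitting it would defeat the purpose of paragraph chunking).
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        words = len(para.split())
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks

doc = (
    "First paragraph about passwords.\n\n"
    "Second paragraph about billing.\n\n"
    + ("word " * 190).strip()
)
print(len(chunk_by_paragraph(doc, max_words=100)))  # 2
```

Unlike the fixed-size version above, no paragraph is ever cut in half, at the cost of uneven chunk sizes.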
Step 3: Store in a Vector Database
import chromadb

# Named chroma_client so it doesn't shadow the OpenAI client used by embed()
chroma_client = chromadb.Client()
collection = chroma_client.create_collection("docs")

# Index your documents
for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
    collection.add(
        documents=[chunk],
        embeddings=[embedding],
        ids=[f"chunk_{i}"],
        metadatas=[{"source": "user-manual.pdf", "page": i // 10}]
    )
Popular vector databases:
- ChromaDB: open source, easy to start with
- Pinecone: managed, production-ready
- Weaviate: open source with hybrid search
- pgvector: if you're already on PostgreSQL
Step 4: Query and Retrieve
def search(query: str, n_results: int = 5) -> list[str]:
    query_embedding = embed(query)
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
        include=['documents', 'distances', 'metadatas']
    )
    return results['documents'][0]
Step 5: Generate with Context
from anthropic import Anthropic

claude = Anthropic()

def answer(question: str) -> str:
    # Retrieve relevant chunks
    relevant_docs = search(question)
    context = "\n\n---\n\n".join(relevant_docs)

    # Construct the prompt
    prompt = f"""Answer the question based on the provided context.
If the answer isn't in the context, say "I don't have information about that."

Context:
{context}

Question: {question}

Answer:"""

    response = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
Common RAG Problems and Fixes
Problem: Retrieves irrelevant chunks
Fix: Better chunking (semantic), hybrid search (keyword + semantic), or re-ranking retrieved results before sending to LLM.
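A minimal re-ranking sketch: score each retrieved chunk by how many query words it contains and reorder before building the prompt. A production system would typically use a cross-encoder re-ranker; this keyword-overlap version is only illustrative:

```python
def rerank(query: str, chunks: list[str]) -> list[str]:
    """Reorder chunks by the number of query words each contains."""
    query_words = set(query.lower().split())

    def score(chunk: str) -> int:
        return len(query_words & set(chunk.lower().split()))

    # Stable sort: ties keep their original retrieval order
    return sorted(chunks, key=score, reverse=True)

chunks = [
    "Quarterly revenue grew 12 percent.",
    "To reset your password, open account settings.",
    "Shipping takes 3-5 business days.",
]
print(rerank("how do i reset my password", chunks)[0])
```

Re-ranking is cheap relative to the LLM call, so it is usually worth applying even when retrieval is already decent.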
Problem: Answer ignores retrieved context
Fix: Make the instruction more explicit: "You MUST only use the provided context. If the answer is not there, say so."
Problem: Slow retrieval
Fix: Pre-filter by metadata before vector search (e.g., only search chunks from the relevant product category).
Problem: Context window overflow
Fix: Limit retrieved chunks, use a summarization step, or use a model with larger context.
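One sketch of the "limit retrieved chunks" fix: keep chunks in retrieval order until an approximate budget is exhausted. A real implementation would count tokens with the model's tokenizer; word count is a rough stand-in here:

```python
def fit_to_budget(chunks: list[str], max_words: int = 100) -> list[str]:
    """Keep chunks in retrieval order until the word budget runs out."""
    kept, used = [], 0
    for chunk in chunks:
        n = len(chunk.split())
        if used + n > max_words:
            break  # dropping lower-ranked chunks beats truncating mid-chunk
        kept.append(chunk)
        used += n
    return kept

chunks = [("alpha " * 40).strip(), ("beta " * 40).strip(), ("gamma " * 40).strip()]
print(len(fit_to_budget(chunks, max_words=100)))  # 2
```

Because retrieval returns chunks ranked by similarity, cutting from the tail discards the least relevant material first.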
Minimal RAG Stack (Python)
pip install anthropic chromadb sentence-transformers
from sentence_transformers import SentenceTransformer
import chromadb
from anthropic import Anthropic

# Local embeddings (free, no API needed)
model = SentenceTransformer('all-MiniLM-L6-v2')
chroma = chromadb.Client()
collection = chroma.create_collection("docs")
claude = Anthropic()

def index(documents: list[str]):
    embeddings = model.encode(documents).tolist()
    ids = [str(i) for i in range(len(documents))]
    collection.add(documents=documents, embeddings=embeddings, ids=ids)

def query(question: str) -> str:
    q_embedding = model.encode([question]).tolist()
    results = collection.query(query_embeddings=q_embedding, n_results=3)
    context = "\n".join(results['documents'][0])
    response = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}"
        }]
    )
    return response.content[0].text
Key Takeaways
- RAG = retrieve relevant documents → include in prompt → generate answer
- The pipeline: chunk → embed → store → retrieve → generate
- Embeddings convert text to vectors; similar meaning → similar vectors
- Start with ChromaDB and sentence-transformers for a local proof of concept
- The most common failure: bad chunking loses context across chunk boundaries
- Hybrid search (keyword + vector) often outperforms pure vector search