RAG Implementation Checklist#
Architecture overview#
Documents → Chunking → Embedding → Vector DB
                                        ↓
User Query → Embed Query → Retrieve Top-K → Rerank → Context Assembly → LLM → Answer
Document ingestion checklist#
- [ ] Split documents into chunks (300–600 tokens is typical for dense text)
- [ ] Preserve metadata per chunk: source URL, page number, section heading, date, author
- [ ] Handle multiple formats: PDF, HTML, Markdown, DOCX, plain text
- [ ] Strip boilerplate from web sources (nav, headers, footers, cookie banners)
- [ ] Deduplicate chunks with a content hash before embedding (see the sketch below)
- [ ] Test chunking on your actual data: verify no splits mid-sentence or mid-table
- [ ] Store chunk text alongside its vector; never rely on ID-only lookups
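Metadata, hashing, and dedup can be handled with a single ingestion record. A minimal sketch, assuming one record per chunk (ChunkRecord and dedupe are illustrative names, not a library API):

import hashlib
from dataclasses import dataclass, field

@dataclass
class ChunkRecord:
    text: str
    metadata: dict          # source URL, page number, section heading, date, author
    content_hash: str = field(init=False)

    def __post_init__(self):
        # Hash normalized text so identical chunks collapse to one record
        self.content_hash = hashlib.sha256(self.text.strip().lower().encode()).hexdigest()

def dedupe(records: list[ChunkRecord]) -> list[ChunkRecord]:
    seen, unique = set(), []
    for rec in records:
        if rec.content_hash not in seen:
            seen.add(rec.content_hash)
            unique.append(rec)
    return unique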
Chunking strategies#
Fixed-size (baseline)#
def fixed_chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Splits on whitespace, so size/overlap count words (a rough proxy for tokens)
    words = text.split()
    chunks = []
    for i in range(0, len(words), size - overlap):
        chunk = " ".join(words[i : i + size])
        if chunk:
            chunks.append(chunk)
    return chunks
Recursive character (recommended for prose)#
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,  # measured in characters by default; pass length_function for tokens
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(document_text)
Semantic chunking (best quality, higher cost)#
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings  # or any embed model

chunker = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,
)
chunks = chunker.split_text(document_text)
> [!TIP]
> For code, chunk by function/class boundaries, not by token count (see the sketch below). For tables, keep table rows together. For long lists, chunk entire lists rather than splitting mid-list.
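A minimal sketch of function/class-boundary chunking for Python source, using the standard-library ast module (chunk_python_code is an illustrative helper, not part of any splitter library):

import ast

def chunk_python_code(source: str) -> list[str]:
    """Split Python source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # end_lineno is available on Python 3.8+
            chunks.append("\n".join(lines[node.lineno - 1 : node.end_lineno]))
    return chunks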
Embedding#
Checklist#
- [ ] Choose model appropriate for your domain (general vs code vs legal vs medical)
- [ ] Use the exact same model at query time as at ingestion time
- [ ] Normalize embeddings (most models expect cosine similarity on unit vectors)
- [ ] Batch embed during ingestion; avoid per-chunk API calls
- [ ] Store raw text + metadata alongside each vector
Embedding with sentence-transformers (local, free)#
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer("all-MiniLM-L6-v2") # 384 dims, fast
chunks = ["The capital of France is Paris.", "Python 3.12 released in 2023."]
embeddings = model.encode(chunks, normalize_embeddings=True, batch_size=32)
print(embeddings.shape) # (2, 384)
Output:
(2, 384)
Embedding model comparison#
| Model | Dims | Size | Best for |
|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 80 MB | Fast general-purpose |
| all-mpnet-base-v2 | 768 | 420 MB | Higher quality general |
| text-embedding-3-small (OpenAI) | 1536 | API | Good quality, cost-effective |
| text-embedding-3-large (OpenAI) | 3072 | API | Best OpenAI quality |
| voyage-3 (Voyage AI) | 1024 | API | Strong RAG retrieval benchmarks |
| nomic-embed-text (Nomic) | 768 | API/local | Open, competitive quality |
Vector database options#
| DB | Type | Best for | Free tier |
|---|---|---|---|
| Chroma | Embedded/server | Local dev, prototypes | ✅ self-hosted |
| pgvector | Postgres extension | Existing Postgres stack | ✅ self-hosted |
| Qdrant | Dedicated vector DB | Production, filtering | ✅ self-hosted |
| Weaviate | Dedicated vector DB | Multi-modal, GraphQL | ✅ self-hosted |
| Pinecone | Managed SaaS | Fully managed, scale | Free tier (1 index) |
| Milvus | Distributed | High-scale production | ✅ self-hosted |
| LanceDB | Embedded (files) | Serverless, embedded | ✅ self-hosted |
Chroma (local dev)#
import chromadb
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.Client() # in-memory; use PersistentClient for disk
collection = client.create_collection("docs")
# Ingest
texts = ["Paris is the capital of France.", "Berlin is the capital of Germany."]
embeddings = model.encode(texts, normalize_embeddings=True).tolist()
collection.add(
    documents=texts,
    embeddings=embeddings,
    ids=["doc-0", "doc-1"],
    metadatas=[{"source": "geography"}, {"source": "geography"}],
)

# Query
query_vec = model.encode(["What is the capital of France?"],
                         normalize_embeddings=True).tolist()
results = collection.query(query_embeddings=query_vec, n_results=2)
print(results["documents"][0])
Output:
['Paris is the capital of France.', 'Berlin is the capital of Germany.']
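For anything beyond a throwaway script, persist the collection and filter on the metadata you stored. A brief sketch continuing the example above (the ./chroma_db path is arbitrary):

client = chromadb.PersistentClient(path="./chroma_db")  # survives restarts
collection = client.get_or_create_collection("docs")

# Restrict the search to chunks whose metadata matches a filter
results = collection.query(
    query_embeddings=query_vec,
    n_results=2,
    where={"source": "geography"},
)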
pgvector (production)#
-- Enable extension
CREATE EXTENSION vector;
-- Table with embedding column
CREATE TABLE doc_chunks (
    id SERIAL PRIMARY KEY,
    source TEXT,
    chunk TEXT,
    embedding VECTOR(384)
);
-- Approximate nearest-neighbor index (HNSW, fast)
CREATE INDEX ON doc_chunks USING hnsw (embedding vector_cosine_ops);
import psycopg2
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # same model used at ingestion
conn = psycopg2.connect("postgresql://user:pass@localhost/mydb")
cur = conn.cursor()

query_vec = model.encode(["What is the capital of France?"],
                         normalize_embeddings=True)[0]
vec_literal = str(query_vec.tolist())  # pgvector accepts '[x, y, ...]' text literals

cur.execute(
    """
    SELECT source, chunk, 1 - (embedding <=> %s::vector) AS similarity
    FROM doc_chunks
    ORDER BY embedding <=> %s::vector
    LIMIT 5
    """,
    (vec_literal, vec_literal),
)
rows = cur.fetchall()
for source, chunk, sim in rows:
    print(f"{sim:.3f} [{source}] {chunk[:80]}")
Output:
0.932 [geography] Paris is the capital of France.
0.801 [geography] France is a country in Western Europe.
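The query assumes doc_chunks is already populated; a minimal ingestion sketch under the same assumptions (same schema, same SentenceTransformer model, vectors passed as text literals):

chunk_records = [
    ("geography", "Paris is the capital of France."),
    ("geography", "France is a country in Western Europe."),
]
vectors = model.encode([text for _, text in chunk_records],
                       normalize_embeddings=True)

for (source, text), vec in zip(chunk_records, vectors):
    cur.execute(
        "INSERT INTO doc_chunks (source, chunk, embedding) VALUES (%s, %s, %s::vector)",
        (source, text, str(vec.tolist())),
    )
conn.commit()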
Retrieval#
def retrieve(query: str, k: int = 5) -> list[dict]:
    query_vec = model.encode([query], normalize_embeddings=True)[0]
    # Vector similarity search (fetch 2*k candidates to leave room for reranking)
    vector_results = collection.query(
        query_embeddings=[query_vec.tolist()],
        n_results=k * 2,
    )
    candidates = [
        {"text": doc, "metadata": meta, "score": None}
        for doc, meta in zip(
            vector_results["documents"][0],
            vector_results["metadatas"][0],
        )
    ]
    # Optional: cross-encoder reranking (high-value, ~100 ms)
    # from sentence_transformers import CrossEncoder
    # reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    # pairs = [(query, c["text"]) for c in candidates]
    # scores = reranker.predict(pairs)
    # order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    # candidates = [candidates[i] for i in order]
    return candidates[:k]
Context assembly#
def build_context(chunks: list[dict], max_tokens: int = 6000) -> str:
    context_parts = []
    token_count = 0
    for chunk in chunks:
        # Rough token estimate: 1 token ≈ 4 chars
        chunk_tokens = len(chunk["text"]) // 4
        if token_count + chunk_tokens > max_tokens:
            break
        source = chunk["metadata"].get("source", "unknown")
        context_parts.append(f"[Source: {source}]\n{chunk['text']}")
        token_count += chunk_tokens
    return "\n\n---\n\n".join(context_parts)
Prompt template for RAG#
Answer the question using ONLY the sources provided below.
If the answer is not in the sources, say "I don't have enough information."
Do not speculate or draw on outside knowledge.
Cite sources by their [Source: ...] label.
Sources:
{context}
Question: {question}
Answer:
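A brief sketch of wiring this template to the helpers above, assuming it is stored in a Python string named RAG_PROMPT:

RAG_PROMPT = """Answer the question using ONLY the sources provided below.
If the answer is not in the sources, say "I don't have enough information."
Do not speculate or draw on outside knowledge.
Cite sources by their [Source: ...] label.
Sources:
{context}
Question: {question}
Answer:"""

question = "What is the capital of France?"
prompt = RAG_PROMPT.format(
    context=build_context(retrieve(question, k=5)),
    question=question,
)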
Full RAG pipeline#
import anthropic

anthropic_client = anthropic.Anthropic()

def answer(question: str) -> str:
    chunks = retrieve(question, k=5)
    context = build_context(chunks)
    response = anthropic_client.messages.create(
        model="claude-opus-4-1",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                f"Answer using ONLY the sources below. "
                f"Cite sources. If not in sources, say so.\n\n"
                f"Sources:\n{context}\n\n"
                f"Question: {question}"
            ),
        }],
    )
    return response.content[0].text
print(answer("What is the capital of France?"))
Output:
According to [Source: geography], Paris is the capital of France.
Agentic RAG#
For multi-hop questions (where the answer depends on multiple retrieval steps), give Claude a search tool and let it decide what to retrieve.
search_tool = {
    "name": "search_docs",
    "description": (
        "Search the documentation for relevant information. "
        "Call this when you need specific facts to answer the question. "
        "You may call it multiple times with different queries."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search query"},
            "max_results": {"type": "integer", "default": 5},
        },
        "required": ["query"],
    },
}

def handle_search(inputs: dict) -> str:
    chunks = retrieve(inputs["query"], k=inputs.get("max_results", 5))
    return build_context(chunks)
# Let Claude drive the retrieval loop (run_agent is sketched below)
result = run_agent(
    user_message="What are the differences between Chroma and pgvector?",
    tools=[search_tool],
    max_turns=8,
)
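run_agent is not defined in this checklist; a minimal sketch of such a loop using the Anthropic Messages API tool-use flow (the signature matches the call above; the body is illustrative and routes every tool call to handle_search, since search_docs is the only tool):

def run_agent(user_message: str, tools: list[dict], max_turns: int = 8) -> str:
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):
        response = anthropic_client.messages.create(
            model="claude-opus-4-1",
            max_tokens=1024,
            tools=tools,
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            # No more tool calls: return the final text answer
            return "".join(b.text for b in response.content if b.type == "text")
        # Execute each requested tool call and feed the results back
        messages.append({"role": "assistant", "content": response.content})
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": handle_search(block.input),
                })
        messages.append({"role": "user", "content": tool_results})
    return "Stopped after max_turns without a final answer."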
Evaluation checklist#
- [ ] Faithfulness: does the answer use only retrieved chunks? (no hallucination)
- [ ] Answer relevance: does the answer address the actual question?
- [ ] Context recall: does the top-k contain the chunk needed to answer? (see the sketch below)
- [ ] Context precision: are retrieved chunks on-topic, or noisy?
- [ ] Latency: p50/p95 retrieval + generation time within SLA
- [ ] Hallucination rate: spot-check a sample against source documents
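Most of these are spot-checked by hand, but context recall is easy to measure automatically. A small sketch, assuming a hand-labeled set of (question, expected snippet) pairs and the retrieve() helper above:

eval_set = [
    ("What is the capital of France?", "Paris is the capital"),
    ("What is the capital of Germany?", "Berlin is the capital"),
]

def context_recall_at_k(k: int = 5) -> float:
    # Fraction of questions whose supporting chunk appears in the top-k results
    hits = 0
    for question, expected_snippet in eval_set:
        chunks = retrieve(question, k=k)
        if any(expected_snippet in c["text"] for c in chunks):
            hits += 1
    return hits / len(eval_set)

print(f"recall@5: {context_recall_at_k(5):.2f}")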
Common failure modes#
| Symptom | Likely cause | Fix |
|---|---|---|
| Wrong answer despite correct chunk retrieved | Prompt doesn't constrain to sources | Add explicit "ONLY use sources" instruction |
| Correct answer but wrong source cited | Chunk metadata lost at storage | Persist source field alongside vector |
| Good on short docs, bad on long | Fixed chunk too large (diluted) | Use smaller chunks or semantic chunking |
| Misses recent information | Stale index | Add incremental ingestion + reindex trigger |
| Slow retrieval | Full scan without index | Add HNSW/IVF index; shard by date |
| Hallucinations despite good retrieval | Context too long, key chunk buried | Use reranker; put most relevant chunk first |
| Poor performance on tables/lists | Character-level chunking splits structure | Keep tables and lists whole as single chunks |