RAG Implementation Checklist#

End-to-end checklist and code for building reliable Retrieval-Augmented Generation pipelines: chunking, embedding, vector DBs, retrieval, and evaluation.

Architecture overview#

Documents → Chunking → Embedding → Vector DB
                                       ↓
User Query → Embed Query → Retrieve Top-K → Rerank → Context Assembly → LLM → Answer

Document ingestion checklist#

☐ Split documents into chunks (300–600 tokens is typical for dense text)
☐ Preserve metadata per chunk: source URL, page number, section heading, date, author
☐ Handle multiple formats: PDF, HTML, Markdown, DOCX, plain text
☐ Strip boilerplate from web sources (nav, headers, footers, cookie banners)
☐ Deduplicate chunks with a content hash before embedding (see the sketch after this list)
☐ Test chunking on your actual data; verify no splits mid-sentence or mid-table
☐ Store chunk text alongside its vector; never rely on ID-only lookups
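
A minimal sketch of the hash-based dedup step (dedup_chunks is an illustrative helper, not a library function; SHA-256 over whitespace-normalized text as the key):

import hashlib

def dedup_chunks(chunks: list[str]) -> list[str]:
    seen: set[str] = set()
    unique = []
    for chunk in chunks:
        # Normalize whitespace so trivially reformatted copies hash identically
        key = hashlib.sha256(" ".join(chunk.split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique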

Chunking strategies#

Fixed-size (baseline)#

def fixed_chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # size and overlap are in words, a rough stand-in for tokens
    words = text.split()
    chunks = []
    for i in range(0, len(words), size - overlap):
        chunk = " ".join(words[i : i + size])
        if chunk:
            chunks.append(chunk)
    return chunks

Recursive character splitting (LangChain)#

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(document_text)

Semantic chunking (best quality, higher cost)#

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings  # or any embed model

chunker = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,
)
chunks = chunker.split_text(document_text)

[!TIP] For code, chunk by function/class boundaries, not by token count. For tables, keep table rows together. For long lists, chunk entire lists rather than splitting mid-list.
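
For the code case, here is a minimal sketch of function/class-boundary chunking using Python's standard-library ast module (chunk_python_source is illustrative; module-level statements between definitions are skipped for brevity):

import ast

def chunk_python_source(source: str) -> list[str]:
    # One chunk per top-level function or class definition
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node)
            if segment:
                chunks.append(segment)
    return chunks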

Embedding#

Checklist#

☐ Choose model appropriate for your domain (general vs code vs legal vs medical)
☐ Use the exact same model at query time as at ingestion time
☐ Normalize embeddings (most models expect cosine similarity on unit vectors; see the sketch after this list)
☐ Batch embed during ingestion; avoid per-chunk API calls
☐ Store raw text + metadata alongside each vector
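
If your model or API does not normalize for you (sentence-transformers can, via normalize_embeddings=True), a one-line NumPy sketch:

import numpy as np

def normalize(vectors: np.ndarray) -> np.ndarray:
    # Unit-length rows, so dot product equals cosine similarity
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)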

Embedding with sentence-transformers (local, free)#

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")   # 384 dims, fast

chunks = ["The capital of France is Paris.", "Python 3.12 released in 2023."]
embeddings = model.encode(chunks, normalize_embeddings=True, batch_size=32)

print(embeddings.shape)   # (2, 384)

Output:

(2, 384)

Embedding model comparison#

| Model | Dims | Size | Best for |
|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 80 MB | Fast general-purpose |
| all-mpnet-base-v2 | 768 | 420 MB | Higher quality general |
| text-embedding-3-small (OpenAI) | 1536 | API | Good quality, cost-effective |
| text-embedding-3-large (OpenAI) | 3072 | API | Best OpenAI quality |
| voyage-3 (Voyage AI) | 1024 | API | Best for RAG (benchmarks) |
| nomic-embed-text (Nomic) | 768 | API/local | Open, competitive quality |

Vector database options#

| DB | Type | Best for | Free tier |
|---|---|---|---|
| Chroma | Embedded/server | Local dev, prototypes | ✅ self-hosted |
| pgvector | Postgres extension | Existing Postgres stack | ✅ self-hosted |
| Qdrant | Dedicated vector DB | Production, filtering | ✅ self-hosted |
| Weaviate | Dedicated vector DB | Multi-modal, GraphQL | ✅ self-hosted |
| Pinecone | Managed SaaS | Fully managed, scale | Free tier (1 index) |
| Milvus | Distributed | High-scale production | ✅ self-hosted |
| LanceDB | Embedded (files) | Serverless, embedded | ✅ self-hosted |

Chroma (local dev)#

import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.Client()   # in-memory; use PersistentClient for disk
collection = client.create_collection("docs")

# Ingest
texts = ["Paris is the capital of France.", "Berlin is the capital of Germany."]
embeddings = model.encode(texts, normalize_embeddings=True).tolist()
collection.add(
    documents=texts,
    embeddings=embeddings,
    ids=["doc-0", "doc-1"],
    metadatas=[{"source": "geography"}, {"source": "geography"}],
)

# Query
query_vec = model.encode(["What is the capital of France?"],
                          normalize_embeddings=True).tolist()
results = collection.query(query_embeddings=query_vec, n_results=2)
print(results["documents"][0])

Output:

['Paris is the capital of France.', 'Berlin is the capital of Germany.']

pgvector (production)#

-- Enable extension
CREATE EXTENSION vector;

-- Table with embedding column
CREATE TABLE doc_chunks (
    id       SERIAL PRIMARY KEY,
    source   TEXT,
    chunk    TEXT,
    embedding VECTOR(384)
);

-- Approximate nearest-neighbor index (HNSW β€” fast)
CREATE INDEX ON doc_chunks USING hnsw (embedding vector_cosine_ops);

Querying from Python (model is the sentence-transformers model from earlier):

import psycopg2

conn = psycopg2.connect("postgresql://user:pass@localhost/mydb")
cur = conn.cursor()

query_vec = model.encode(["What is the capital of France?"],
                          normalize_embeddings=True)[0]

cur.execute(
    """
    SELECT source, chunk, 1 - (embedding <=> %s::vector) AS similarity
    FROM doc_chunks
    ORDER BY embedding <=> %s::vector
    LIMIT 5
    """,
    # pgvector parses the string form '[0.1, 0.2, ...]' in the ::vector cast
    (str(query_vec.tolist()), str(query_vec.tolist())),
)
rows = cur.fetchall()
for source, chunk, sim in rows:
    print(f"{sim:.3f}  [{source}]  {chunk[:80]}")

Output:

0.932  [geography]  Paris is the capital of France.
0.801  [geography]  France is a country in Western Europe.
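
Ingestion is not shown above; here is a hedged sketch of batch insert with psycopg2.extras.execute_values, reusing the doc_chunks table from the SQL and the sentence-transformers model from earlier:

from psycopg2.extras import execute_values

texts = ["Paris is the capital of France.", "France is a country in Western Europe."]
embeddings = model.encode(texts, normalize_embeddings=True)

execute_values(
    cur,
    "INSERT INTO doc_chunks (source, chunk, embedding) VALUES %s",
    # str([...]) produces the '[0.1, 0.2, ...]' literal pgvector parses
    [("geography", t, str(e.tolist())) for t, e in zip(texts, embeddings)],
)
conn.commit()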

Retrieval#

def retrieve(query: str, k: int = 5) -> list[dict]:
    query_vec = model.encode([query], normalize_embeddings=True)[0]

    # Vector similarity search (fetch 2*k candidates to leave headroom for reranking)
    vector_results = collection.query(
        query_embeddings=[query_vec.tolist()],
        n_results=k * 2
    )
    candidates = [
        {"text": doc, "metadata": meta, "score": None}
        for doc, meta in zip(
            vector_results["documents"][0],
            vector_results["metadatas"][0]
        )
    ]

    # Optional: cross-encoder reranking (high-value, ~100ms)
    # from sentence_transformers import CrossEncoder
    # reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    # pairs = [(query, c["text"]) for c in candidates]
    # scores = reranker.predict(pairs)
    # candidates = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    # candidates = [c for _, c in candidates]

    return candidates[:k]

Context assembly#

def build_context(chunks: list[dict], max_tokens: int = 6000) -> str:
    context_parts = []
    token_count = 0

    for chunk in chunks:
        # Rough token estimate: 1 token ≈ 4 chars
        chunk_tokens = len(chunk["text"]) // 4
        if token_count + chunk_tokens > max_tokens:
            break
        source = chunk["metadata"].get("source", "unknown")
        context_parts.append(f"[Source: {source}]\n{chunk['text']}")
        token_count += chunk_tokens

    return "\n\n---\n\n".join(context_parts)
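
If the 4-characters-per-token heuristic is too loose, a sketch with tiktoken for exact counts (cl100k_base is an assumption; use the encoding that matches your LLM):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))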

Prompt template for RAG#

Answer the question using ONLY the sources provided below.
If the answer is not in the sources, say "I don't have enough information."
Do not speculate or draw on outside knowledge.
Cite sources by their [Source: ...] label.

Sources:
{context}

Question: {question}

Answer:

Full RAG pipeline#

import anthropic

anthropic_client = anthropic.Anthropic()

def answer(question: str) -> str:
    chunks = retrieve(question, k=5)
    context = build_context(chunks)

    response = anthropic_client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                f"Answer using ONLY the sources below. "
                f"Cite sources. If not in sources, say so.\n\n"
                f"Sources:\n{context}\n\n"
                f"Question: {question}"
            )
        }]
    )
    return response.content[0].text

print(answer("What is the capital of France?"))

Output:

According to [Source: geography], Paris is the capital of France.

Agentic RAG#

For multi-hop questions, where the answer depends on multiple retrieval steps, give Claude a search tool and let it decide what to retrieve.

search_tool = {
    "name": "search_docs",
    "description": (
        "Search the documentation for relevant information. "
        "Call this when you need specific facts to answer the question. "
        "You may call it multiple times with different queries."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search query"},
            "max_results": {"type": "integer", "default": 5}
        },
        "required": ["query"]
    }
}

def handle_search(inputs: dict) -> str:
    chunks = retrieve(inputs["query"], k=inputs.get("max_results", 5))
    return build_context(chunks)

# Let Claude drive the retrieval loop (run_agent is sketched below)
answer = run_agent(
    user_message="What are the differences between Chroma and pgvector?",
    tools=[search_tool],
    max_turns=8,
)
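
run_agent is not defined in this post; below is a minimal sketch of the tool-use loop it stands for, built on the Anthropic Messages API (model name and fallback message are assumptions):

def run_agent(user_message: str, tools: list[dict], max_turns: int = 8) -> str:
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):
        response = anthropic_client.messages.create(
            model="claude-opus-4-5",
            max_tokens=1024,
            tools=tools,
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            return response.content[0].text
        # Echo the assistant turn back, then answer each tool call
        messages.append({"role": "assistant", "content": response.content})
        tool_results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": handle_search(block.input),
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": tool_results})
    return "Gave up: no final answer within max_turns."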

Evaluation checklist#

☐ Faithfulness: does the answer use only retrieved chunks? (no hallucination)
☐ Answer relevance: does the answer address the actual question?
☐ Context recall: does the top-k contain the chunk needed to answer? (see the sketch after this list)
☐ Context precision: are retrieved chunks on-topic, or noisy?
☐ Latency: p50/p95 retrieval + generation time within SLA
☐ Hallucination rate: spot-check a sample against source documents
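
A minimal context-recall sketch, assuming a hand-labeled eval set of (question, gold_passage) pairs, where gold_passage is the text of the chunk that answers the question, and the retrieve() function from above:

def context_recall(eval_set: list[tuple[str, str]], k: int = 5) -> float:
    hits = 0
    for question, gold_passage in eval_set:
        retrieved = retrieve(question, k=k)
        # Hit if the gold passage appears in any retrieved chunk
        if any(gold_passage in c["text"] for c in retrieved):
            hits += 1
    return hits / len(eval_set)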

Common failure modes#

| Symptom | Likely cause | Fix |
|---|---|---|
| Wrong answer despite correct chunk retrieved | Prompt doesn't constrain to sources | Add explicit "ONLY use sources" instruction |
| Correct answer but wrong source cited | Chunk metadata lost at storage | Persist source field alongside vector |
| Good on short docs, bad on long | Fixed chunk too large (diluted) | Use smaller chunks or semantic chunking |
| Misses recent information | Stale index | Add incremental ingestion + reindex trigger |
| Slow retrieval | Full scan without index | Add HNSW/IVF index; shard by date |
| Hallucinations despite good retrieval | Context too long, key chunk buried | Use reranker; put most relevant chunk first |
| Poor performance on tables/lists | Character-level chunking splits structure | Keep tables and lists whole as single chunks |