RAG Implementation Checklist#
Architecture overview#
Documents → Chunking → Embedding → Vector DB
                                        ↓
User Query → Embed Query → Retrieve Top-K → Rerank → Context Assembly → LLM → Answer
Document ingestion checklist#
- [ ] Split documents into chunks (300–600 tokens is typical for dense text)
- [ ] Preserve metadata per chunk: source URL, page number, section heading, date, author
- [ ] Handle multiple formats: PDF, HTML, Markdown, DOCX, plain text
- [ ] Strip boilerplate from web sources (nav, headers, footers, cookie banners)
- [ ] Deduplicate chunks with a content hash before embedding (see the sketch below)
- [ ] Test chunking on your actual data: verify no splits mid-sentence or mid-table
- [ ] Store chunk text alongside its vector; never rely on ID-only lookups
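Metadata, hashing, and dedup can be handled with a single ingestion record. A minimal sketch, assuming one record per chunk (ChunkRecord and dedupe are illustrative names, not a library API):

import hashlib
from dataclasses import dataclass, field

@dataclass
class ChunkRecord:
    text: str
    metadata: dict          # source URL, page number, section heading, date, author
    content_hash: str = field(init=False)

    def __post_init__(self):
        # Hash normalized text so identical chunks collapse to one record
        self.content_hash = hashlib.sha256(self.text.strip().lower().encode()).hexdigest()

def dedupe(records: list[ChunkRecord]) -> list[ChunkRecord]:
    seen, unique = set(), []
    for rec in records:
        if rec.content_hash not in seen:
            seen.add(rec.content_hash)
            unique.append(rec)
    return unique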
Chunking strategies#
Fixed-size (baseline)#
def fixed_chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Splits on whitespace, so size/overlap count words (a rough proxy for tokens)
    words = text.split()
    chunks = []
    for i in range(0, len(words), size - overlap):
        chunk = " ".join(words[i : i + size])
        if chunk:
            chunks.append(chunk)
    return chunks
Recursive character (recommended for prose)#
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,  # measured in characters by default; pass length_function for tokens
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(document_text)
Semantic chunking (best quality, higher cost)#
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings  # or any embed model

chunker = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,
)
chunks = chunker.split_text(document_text)
> [!TIP]
> For code, chunk by function/class boundaries, not by token count (see the sketch below). For tables, keep table rows together. For long lists, chunk entire lists rather than splitting mid-list.
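A minimal sketch of function/class-boundary chunking for Python source, using the standard-library ast module (chunk_python_code is an illustrative helper, not part of any splitter library):

import ast

def chunk_python_code(source: str) -> list[str]:
    """Split Python source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # end_lineno is available on Python 3.8+
            chunks.append("\n".join(lines[node.lineno - 1 : node.end_lineno]))
    return chunks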
Embedding#
Checklist#
- [ ] Choose model appropriate for your domain (general vs code vs legal vs medical)
- [ ] Use the exact same model at query time as at ingestion time
- [ ] Normalize embeddings (most models expect cosine similarity on unit vectors)
- [ ] Batch embed during ingestion; avoid per-chunk API calls
- [ ] Store raw text + metadata alongside each vector
Embedding with sentence-transformers (local, free)#
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer("all-MiniLM-L6-v2") # 384 dims, fast
chunks = ["The capital of France is Paris.", "Python 3.12 released in 2023."]
embeddings = model.encode(chunks, normalize_embeddings=True, batch_size=32)
print(embeddings.shape) # (2, 384)
Output:
(2, 384)
Embedding model comparison#
| Model | Dims | Size | Best for |
|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 80 MB | Fast general-purpose |
| all-mpnet-base-v2 | 768 | 420 MB | Higher quality general |
| text-embedding-3-small (OpenAI) | 1536 | API | Good quality, cost-effective |
| text-embedding-3-large (OpenAI) | 3072 | API | Best OpenAI quality |
| voyage-3 (Voyage AI) | 1024 | API | Strong RAG retrieval benchmarks |
| nomic-embed-text (Nomic) | 768 | API/local | Open, competitive quality |
Vector database options#
| DB | Type | Best for | Free tier |
|---|---|---|---|
| Chroma | Embedded/server | Local dev, prototypes | ✅ self-hosted |
| pgvector | Postgres extension | Existing Postgres stack | ✅ self-hosted |
| Qdrant | Dedicated vector DB | Production, filtering | ✅ self-hosted |
| Weaviate | Dedicated vector DB | Multi-modal, GraphQL | ✅ self-hosted |
| Pinecone | Managed SaaS | Fully managed, scale | Free tier (1 index) |
| Milvus | Distributed | High-scale production | ✅ self-hosted |
| LanceDB | Embedded (files) | Serverless, embedded | ✅ self-hosted |
Chroma (local dev)#
import chromadb
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.Client() # in-memory; use PersistentClient for disk
collection = client.create_collection("docs")
# Ingest
texts = ["Paris is the capital of France.", "Berlin is the capital of Germany."]
embeddings = model.encode(texts, normalize_embeddings=True).tolist()
collection.add(
    documents=texts,
    embeddings=embeddings,
    ids=["doc-0", "doc-1"],
    metadatas=[{"source": "geography"}, {"source": "geography"}],
)

# Query
query_vec = model.encode(["What is the capital of France?"],
                         normalize_embeddings=True).tolist()
results = collection.query(query_embeddings=query_vec, n_results=2)
print(results["documents"][0])
Output:
['Paris is the capital of France.', 'Berlin is the capital of Germany.']
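For anything beyond a throwaway script, persist the collection and filter on the metadata you stored. A brief sketch continuing the example above (the ./chroma_db path is arbitrary):

client = chromadb.PersistentClient(path="./chroma_db")  # survives restarts
collection = client.get_or_create_collection("docs")

# Restrict the search to chunks whose metadata matches a filter
results = collection.query(
    query_embeddings=query_vec,
    n_results=2,
    where={"source": "geography"},
)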
pgvector (production)#
-- Enable extension
CREATE EXTENSION vector;
-- Table with embedding column
CREATE TABLE doc_chunks (
    id SERIAL PRIMARY KEY,
    source TEXT,
    chunk TEXT,
    embedding VECTOR(384)
);
-- Approximate nearest-neighbor index (HNSW, fast)
CREATE INDEX ON doc_chunks USING hnsw (embedding vector_cosine_ops);
import psycopg2
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # same model used at ingestion
conn = psycopg2.connect("postgresql://user:pass@localhost/mydb")
cur = conn.cursor()

query_vec = model.encode(["What is the capital of France?"],
                         normalize_embeddings=True)[0]
vec_literal = str(query_vec.tolist())  # pgvector accepts '[x, y, ...]' text literals

cur.execute(
    """
    SELECT source, chunk, 1 - (embedding <=> %s::vector) AS similarity
    FROM doc_chunks
    ORDER BY embedding <=> %s::vector
    LIMIT 5
    """,
    (vec_literal, vec_literal),
)
rows = cur.fetchall()
for source, chunk, sim in rows:
    print(f"{sim:.3f} [{source}] {chunk[:80]}")
Output:
0.932 [geography] Paris is the capital of France.
0.801 [geography] France is a country in Western Europe.
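The query assumes doc_chunks is already populated; a minimal ingestion sketch under the same assumptions (same schema, same SentenceTransformer model, vectors passed as text literals):

chunk_records = [
    ("geography", "Paris is the capital of France."),
    ("geography", "France is a country in Western Europe."),
]
vectors = model.encode([text for _, text in chunk_records],
                       normalize_embeddings=True)

for (source, text), vec in zip(chunk_records, vectors):
    cur.execute(
        "INSERT INTO doc_chunks (source, chunk, embedding) VALUES (%s, %s, %s::vector)",
        (source, text, str(vec.tolist())),
    )
conn.commit()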
Retrieval#
def retrieve(query: str, k: int = 5) -> list[dict]:
    query_vec = model.encode([query], normalize_embeddings=True)[0]
    # Vector similarity search (fetch 2*k candidates to leave room for reranking)
    vector_results = collection.query(
        query_embeddings=[query_vec.tolist()],
        n_results=k * 2,
    )
    candidates = [
        {"text": doc, "metadata": meta, "score": None}
        for doc, meta in zip(
            vector_results["documents"][0],
            vector_results["metadatas"][0],
        )
    ]
    # Optional: cross-encoder reranking (high-value, ~100 ms)
    # from sentence_transformers import CrossEncoder
    # reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    # pairs = [(query, c["text"]) for c in candidates]
    # scores = reranker.predict(pairs)
    # order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    # candidates = [candidates[i] for i in order]
    return candidates[:k]
Context assembly#
def build_context(chunks: list[dict], max_tokens: int = 6000) -> str:
    context_parts = []
    token_count = 0
    for chunk in chunks:
        # Rough token estimate: 1 token ≈ 4 chars
        chunk_tokens = len(chunk["text"]) // 4
        if token_count + chunk_tokens > max_tokens:
            break
        source = chunk["metadata"].get("source", "unknown")
        context_parts.append(f"[Source: {source}]\n{chunk['text']}")
        token_count += chunk_tokens
    return "\n\n---\n\n".join(context_parts)
Prompt template for RAG#
Answer the question using ONLY the sources provided below.
If the answer is not in the sources, say "I don't have enough information."
Do not speculate or draw on outside knowledge.
Cite sources by their [Source: ...] label.
Sources:
{context}
Question: {question}
Answer:
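A brief sketch of wiring this template to the helpers above, assuming it is stored in a Python string named RAG_PROMPT:

RAG_PROMPT = """Answer the question using ONLY the sources provided below.
If the answer is not in the sources, say "I don't have enough information."
Do not speculate or draw on outside knowledge.
Cite sources by their [Source: ...] label.
Sources:
{context}
Question: {question}
Answer:"""

question = "What is the capital of France?"
prompt = RAG_PROMPT.format(
    context=build_context(retrieve(question, k=5)),
    question=question,
)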
Full RAG pipeline#
import anthropic

anthropic_client = anthropic.Anthropic()

def answer(question: str) -> str:
    chunks = retrieve(question, k=5)
    context = build_context(chunks)
    response = anthropic_client.messages.create(
        model="claude-opus-4-1",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                f"Answer using ONLY the sources below. "
                f"Cite sources. If not in sources, say so.\n\n"
                f"Sources:\n{context}\n\n"
                f"Question: {question}"
            ),
        }],
    )
    return response.content[0].text
print(answer("What is the capital of France?"))
Output:
According to [Source: geography], Paris is the capital of France.
Agentic RAG#
For multi-hop questions (where the answer depends on multiple retrieval steps), give Claude a search tool and let it decide what to retrieve.
search_tool = {
    "name": "search_docs",
    "description": (
        "Search the documentation for relevant information. "
        "Call this when you need specific facts to answer the question. "
        "You may call it multiple times with different queries."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search query"},
            "max_results": {"type": "integer", "default": 5},
        },
        "required": ["query"],
    },
}

def handle_search(inputs: dict) -> str:
    chunks = retrieve(inputs["query"], k=inputs.get("max_results", 5))
    return build_context(chunks)
# Let Claude drive the retrieval loop (run_agent is sketched below)
result = run_agent(
    user_message="What are the differences between Chroma and pgvector?",
    tools=[search_tool],
    max_turns=8,
)
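run_agent is not defined in this checklist; a minimal sketch of such a loop using the Anthropic Messages API tool-use flow (the signature matches the call above; the body is illustrative and routes every tool call to handle_search, since search_docs is the only tool):

def run_agent(user_message: str, tools: list[dict], max_turns: int = 8) -> str:
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):
        response = anthropic_client.messages.create(
            model="claude-opus-4-1",
            max_tokens=1024,
            tools=tools,
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            # No more tool calls: return the final text answer
            return "".join(b.text for b in response.content if b.type == "text")
        # Execute each requested tool call and feed the results back
        messages.append({"role": "assistant", "content": response.content})
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": handle_search(block.input),
                })
        messages.append({"role": "user", "content": tool_results})
    return "Stopped after max_turns without a final answer."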
Evaluation checklist#
- [ ] Faithfulness: does the answer use only retrieved chunks? (no hallucination)
- [ ] Answer relevance: does the answer address the actual question?
- [ ] Context recall: does the top-k contain the chunk needed to answer? (see the sketch below)
- [ ] Context precision: are retrieved chunks on-topic, or noisy?
- [ ] Latency: p50/p95 retrieval + generation time within SLA
- [ ] Hallucination rate: spot-check a sample against source documents
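Most of these are spot-checked by hand, but context recall is easy to measure automatically. A small sketch, assuming a hand-labeled set of (question, expected snippet) pairs and the retrieve() helper above:

eval_set = [
    ("What is the capital of France?", "Paris is the capital"),
    ("What is the capital of Germany?", "Berlin is the capital"),
]

def context_recall_at_k(k: int = 5) -> float:
    # Fraction of questions whose supporting chunk appears in the top-k results
    hits = 0
    for question, expected_snippet in eval_set:
        chunks = retrieve(question, k=k)
        if any(expected_snippet in c["text"] for c in chunks):
            hits += 1
    return hits / len(eval_set)

print(f"recall@5: {context_recall_at_k(5):.2f}")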
Common failure modes#
| Symptom | Likely cause | Fix |
|---|---|---|
| Wrong answer despite correct chunk retrieved | Prompt doesn't constrain to sources | Add explicit "ONLY use sources" instruction |
| Correct answer but wrong source cited | Chunk metadata lost at storage | Persist source field alongside vector |
| Good on short docs, bad on long | Fixed chunk too large (diluted) | Use smaller chunks or semantic chunking |
| Misses recent information | Stale index | Add incremental ingestion + reindex trigger |
| Slow retrieval | Full scan without index | Add HNSW/IVF index; shard by date |
| Hallucinations despite good retrieval | Context too long, key chunk buried | Use reranker; put most relevant chunk first |
| Poor performance on tables/lists | Character-level chunking splits structure | Keep tables and lists whole as single chunks |