DSPy — Programmatic Prompting and Optimisation#
What it is#
DSPy (Declarative Self-improving Python, from Stanford NLP) is a framework that replaces hand-tuned prompt strings with programs: typed Signatures describe the input → output contract; Modules compose them; and optimisers (also called teleprompters) compile the program by searching over few-shot examples, instructions, and demonstrations to maximise a developer-supplied metric. The slogan is “programming, not prompting”: you write the logic and let DSPy figure out the prompt.
The novel contribution is inference compilation: given a labelled (or self-labelled) dataset and a metric function, DSPy uses a teacher model to bootstrap traces, then selects/refines few-shot examples and instructions that move the metric. The compiled program is portable across LLMs — swap GPT-4o-mini for Claude Sonnet at runtime without rewriting prompts.
Install#
pip install dspy
pip install dspy chromadb sentence-transformers
Output:
Successfully installed dspy-2.x.x ...
[!TIP] The package was renamed from dspy-ai to dspy in 2024. Older tutorials use import dspy with pip install dspy-ai; new installs use pip install dspy and the same import.
Quick example — a Predict module#
A signature is "inputs -> outputs". dspy.Predict(signature) turns it into a callable. The LM is configured globally with dspy.configure(lm=...).
import dspy
import os
lm = dspy.LM("openai/gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])
dspy.configure(lm=lm)
summarise = dspy.Predict("text -> summary")
out = summarise(text="DSPy compiles LLM programs by optimising prompts against a metric.")
print(out.summary)
Output:
DSPy is a framework that compiles language-model programs by tuning prompts to maximise a developer-defined metric.
When / why to use it#
- You have a metric (accuracy, F1, BLEU, judge-LM score) and want the prompt tuned to it instead of guessing.
- Multi-stage LLM pipelines (decompose → retrieve → reason → answer) where hand-tuning every stage is painful.
- You expect to swap models (cheap dev model, premium prod model) without rewriting prompts.
- You want few-shot examples chosen automatically from a training set instead of cherry-picked by hand.
- Reasoning-heavy tasks (math, multi-hop QA, code) where ChainOfThought and ProgramOfThought often outperform plain Predict.
Common pitfalls#
[!WARNING] No metric, no optimisation: metric-driven optimisers need a metric(example, pred, trace=None) -> float | bool. Without one there is nothing to optimise against; BootstrapFewShot, for example, then accepts every bootstrapped trace as a demo without filtering.
[!WARNING] Train/dev contamination — DSPy’s optimisers select demos from the trainset and evaluate against a separate devset. Reusing the same examples in both inflates scores. Hold out at least 30% as devset.
[!WARNING] Field name → prompt key: signature field names become prompt keys (text, summary, reasoning). Renaming a field invalidates compiled prompts. Pick stable names up front.
[!WARNING] Tracing leaks memory: DSPy stores every LM call when dspy.settings.trace = [] is set. Reset traces between batches in long-running services.
[!TIP] Inspect the compiled prompt with dspy.inspect_history(n=1) after running. The exact text sent to the LM (system + few-shots + user) is printed, which is invaluable for debugging metric regressions.
[!TIP] Keep LM caching on during optimisation. dspy.LM caches responses by default (cache=True), so repeated calls with identical inputs are served from the cache. Optimisers make hundreds of calls; caching cuts wall-clock time and cost dramatically when iterating on metrics.
Signatures — the input/output contract#
A Signature declares what goes in and what comes out. The simplest form is a string "inputs -> outputs"; the explicit form is a class subclassing dspy.Signature with InputField and OutputField, optionally annotated with descriptions.
import dspy
class GenerateAnswer(dspy.Signature):
    """Answer the question concisely using the context."""
    context: str = dspy.InputField(desc="Relevant background facts.")
    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc="One or two short sentences.")
predictor = dspy.Predict(GenerateAnswer)
result = predictor(
    context="Paris is the capital of France.",
    question="What is the capital of France?",
)
print(result.answer)
Output:
The capital of France is Paris.
The docstring becomes the system instruction. Field descriptions become prompt hints. Output types (str, int, float, bool, list[str], Pydantic models) drive parsing — DSPy validates and retries on parse failure.
Modules#
Modules are reusable LM programs. The built-in modules wrap a signature with a particular reasoning strategy.
| Module | Strategy |
|---|---|
| dspy.Predict | Direct prediction (no intermediate reasoning). |
| dspy.ChainOfThought | Adds a reasoning field before the final output. |
| dspy.ChainOfThoughtWithHint | Same, with a hint field for evaluation-time guidance. |
| dspy.ProgramOfThought | Generates and executes Python code to produce the answer. |
| dspy.ReAct | Tool-using reasoning loop (Thought → Action → Observation). |
| dspy.MultiChainComparison | Samples N reasoning chains and picks the best. |
| dspy.Retrieve | Calls the configured retriever (RM). |
import dspy
cot = dspy.ChainOfThought("question -> answer")
result = cot(question="A train leaves at 3pm and travels 60 km/h. How far in 2.5 hours?")
print("Reasoning:", result.reasoning)
print("Answer: ", result.answer)
Output:
Reasoning: distance = speed * time = 60 * 2.5 = 150 km.
Answer: 150 km
Custom modules — composing signatures#
Subclass dspy.Module and define forward(self, ...). Sub-modules become tunable as a whole.
import dspy
class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate = dspy.ChainOfThought("context, question -> answer")
    def forward(self, question: str):
        passages = self.retrieve(question).passages
        context = "\n\n".join(passages)
        return self.generate(context=context, question=question)
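Of the two sub-modules, only self.generate carries prompt parameters (instructions and demos) that optimisers tune; self.retrieve simply calls the configured retriever. Optimisers traverse the module tree and compile every predictor they find.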
Configuring LMs and RMs#
DSPy talks to LMs through a single dspy.LM(...) interface backed by LiteLLM, so any provider supported by LiteLLM works without extra adapters.
import dspy
import os
# OpenAI
lm = dspy.LM("openai/gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])
# Anthropic
lm = dspy.LM("anthropic/claude-sonnet-4-6", api_key=os.environ["ANTHROPIC_API_KEY"])
# Local via Ollama
lm = dspy.LM("ollama_chat/llama3.1", api_base="http://localhost:11434")
dspy.configure(lm=lm)
Retrieval models (RMs) are configured similarly:
from dspy.retrieve.chromadb_rm import ChromadbRM
import chromadb
client = chromadb.PersistentClient(path="./chroma_db")
rm = ChromadbRM(collection_name="docs", persist_directory="./chroma_db", k=5)
dspy.configure(lm=lm, rm=rm)
Built-in RMs include ChromadbRM, QdrantRM, WeaviateRM, PineconeRM, ColBERTv2, and MarqoRM. Any callable returning passages can also be wrapped.
Datasets and dspy.Example#
DSPy expects training data as dspy.Example objects. Mark which fields are inputs with .with_inputs(...); everything else is treated as a label.
import dspy
trainset = [
    dspy.Example(question="2+2", answer="4").with_inputs("question"),
    dspy.Example(question="Capital of Spain?", answer="Madrid").with_inputs("question"),
    dspy.Example(question="Square root of 81?", answer="9").with_inputs("question"),
]
devset = [
    dspy.Example(question="Capital of Italy?", answer="Rome").with_inputs("question"),
]
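The hold-out advice under Common pitfalls can be applied with a small deterministic split before any optimiser runs. The sketch below is illustrative only: `split_examples`, its `dev_fraction` default, and the use of plain dicts in place of dspy.Example objects are assumptions, not part of DSPy.

```python
import random


def split_examples(examples, dev_fraction=0.3, seed=0):
    """Shuffle a copy of `examples` and split it into (trainset, devset)."""
    if not 0.0 < dev_fraction < 1.0:
        raise ValueError("dev_fraction must be strictly between 0 and 1")
    shuffled = list(examples)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_dev = max(1, round(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


# Plain dicts stand in for dspy.Example objects in this sketch.
examples = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(10)]
trainset, devset = split_examples(examples)
print(len(trainset), len(devset))  # 7 3
```

Fixing the seed matters here: if the split changes between runs, scores from different optimiser runs are no longer comparable.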
Metrics#
A metric is metric(example, prediction, trace=None) -> float | bool. Optimisers maximise the metric.
def exact_match(example, pred, trace=None):
    return example.answer.strip().lower() == pred.answer.strip().lower()
For semantic matching, use an LLM judge:
import dspy
class Judge(dspy.Signature):
    """Given a gold answer and a predicted answer, decide if they are equivalent."""
    gold: str = dspy.InputField()
    predicted: str = dspy.InputField()
    correct: bool = dspy.OutputField()
judge = dspy.Predict(Judge)
def semantic_match(example, pred, trace=None) -> bool:
    return judge(gold=example.answer, predicted=pred.answer).correct
[!TIP] Metrics can use trace to score intermediate steps. For multi-hop QA, reward both final answer correctness and good intermediate retrieval.
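A graded metric that sits between exact match and an LLM judge can be sketched as token-level F1. This example also shows a common pattern in which the metric returns a float during evaluation but a strict boolean when `trace` is set, which is the case while an optimiser is bootstrapping demos. The 0.8 threshold and the SimpleNamespace stand-ins for DSPy examples and predictions are assumptions made for illustration.

```python
from collections import Counter
from types import SimpleNamespace


def token_f1(gold: str, predicted: str) -> float:
    """Token-overlap F1 between two whitespace-tokenised, lower-cased strings."""
    gold_tokens = gold.lower().split()
    pred_tokens = predicted.lower().split()
    if not gold_tokens or not pred_tokens:
        return float(gold_tokens == pred_tokens)
    overlap = sum((Counter(gold_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def f1_metric(example, pred, trace=None):
    score = token_f1(example.answer, pred.answer)
    if trace is not None:
        # Bootstrapping: only accept near-perfect demos (threshold is an assumption).
        return score >= 0.8
    return score  # Evaluation: return the graded score.


# Stand-ins for a dspy.Example and a prediction.
example = SimpleNamespace(answer="The capital is Paris")
pred = SimpleNamespace(answer="Paris is the capital")
print(f1_metric(example, pred))  # 1.0
```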
Optimisers — compiling a program#
Optimisers (formerly “teleprompters”) take an unoptimised program, a trainset, and a metric, and return a compiled program with selected demos and (sometimes) refined instructions.
BootstrapFewShot — the default#
Picks few-shot examples from the trainset by running the (uncompiled) program and keeping examples where the metric passes. Cheap, fast, and the default choice for getting a baseline.
import dspy
from dspy.teleprompt import BootstrapFewShot
rag = RAG()
optimiser = BootstrapFewShot(metric=exact_match, max_bootstrapped_demos=4, max_labeled_demos=4)
compiled_rag = optimiser.compile(rag, trainset=trainset)
max_bootstrapped_demos = demos generated by running the teacher; max_labeled_demos = demos used as-is from the trainset.
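The selection idea behind bootstrapping can be illustrated without an LM. The following is a conceptual sketch, not DSPy's implementation: `bootstrap_demos`, `toy_program`, and the two-argument metric are hypothetical stand-ins, and plain dicts replace dspy.Example objects.

```python
def bootstrap_demos(program, trainset, metric, max_demos=4):
    """Run `program` on each training example and keep outputs the metric accepts."""
    demos = []
    for example in trainset:
        if len(demos) >= max_demos:
            break
        prediction = program(example["question"])
        if metric(example, prediction):
            # The program's own successful output becomes a few-shot demo.
            demos.append({"question": example["question"], "answer": prediction})
    return demos


def toy_program(question):
    # Stand-in for an LM call: a lookup table that only knows two answers.
    return {"2+2": "4", "Capital of Spain?": "Madrid"}.get(question, "unknown")


def exact_match(example, prediction):
    return example["answer"].strip().lower() == prediction.strip().lower()


trainset = [
    {"question": "2+2", "answer": "4"},
    {"question": "Capital of Spain?", "answer": "Madrid"},
    {"question": "Square root of 81?", "answer": "9"},
]
demos = bootstrap_demos(toy_program, trainset, exact_match)
print([d["question"] for d in demos])  # ['2+2', 'Capital of Spain?']
```

The third example is dropped because the toy program answers it incorrectly, which is the filtering role the metric plays during compilation.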
BootstrapFewShotWithRandomSearch — random search over demos#
Runs BootstrapFewShot with multiple random seeds and picks the candidate with the best devset score. Better than BootstrapFewShot for non-trivial budgets.
from dspy.teleprompt import BootstrapFewShotWithRandomSearch
optimiser = BootstrapFewShotWithRandomSearch(
    metric=exact_match,
    max_bootstrapped_demos=4,
    num_candidate_programs=10,
)
compiled = optimiser.compile(rag, trainset=trainset, valset=devset)
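The search loop can be pictured as "sample several candidate demo sets, score each on held-out data, keep the best". The sketch below illustrates that loop only; `random_search`, the demo pool, and the `quality` field that stands in for a devset score are all hypothetical and do not mirror DSPy's internals.

```python
import random


def random_search(pool, score, demos_per_candidate=2, num_candidates=10, seed=0):
    """Sample demo subsets from `pool`; return (best_subset, best_score, all_scores)."""
    rng = random.Random(seed)
    best, best_score, all_scores = None, float("-inf"), []
    for _ in range(num_candidates):
        candidate = rng.sample(pool, demos_per_candidate)
        candidate_score = score(candidate)
        all_scores.append(candidate_score)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score, all_scores


# Hypothetical demo pool; "quality" stands in for each demo's effect on the devset score.
pool = [{"question": f"q{i}", "quality": i % 4} for i in range(8)]


def toy_devset_score(demos):
    return sum(d["quality"] for d in demos)


best, best_score, all_scores = random_search(pool, toy_devset_score)
print(best_score == max(all_scores))  # True
```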
COPRO — instruction optimisation#
COPRO (Coordinate Prompt Optimisation) refines the natural-language instructions in each signature. It does not change demos; it rewrites the docstring/prompts to be more effective for the task. Use when the system prompt matters more than the few-shots.
from dspy.teleprompt import COPRO
optimiser = COPRO(metric=exact_match, breadth=10, depth=3, init_temperature=1.4)
compiled = optimiser.compile(rag, trainset=trainset, eval_kwargs={"num_threads": 4})
breadth = candidate instructions per step; depth = optimisation steps.
MIPRO and MIPROv2 — joint instruction + demo optimisation#
MIPRO (Multi-prompt Instruction Proposal Optimiser) jointly searches over instructions and few-shot demos using Bayesian optimisation. MIPROv2 is the current recommended large-budget optimiser — significantly better than COPRO for most tasks.
from dspy.teleprompt import MIPROv2
optimiser = MIPROv2(
    metric=exact_match,
    auto="medium",
)
compiled = optimiser.compile(
    rag,
    trainset=trainset,
    valset=devset,
    requires_permission_to_run=False,
)
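The auto presets scale the search budget: roughly ~6 trials for "light", ~12 for "medium", and ~25 for "heavy", though the exact counts differ between DSPy releases. Each trial can cost on the order of 100 LM calls, so budget accordingly.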
[!WARNING] MIPROv2 cost: auto="heavy" can spend hundreds of dollars on GPT-4-class teachers for a multi-stage RAG. Always cache, start with auto="light", and confirm the metric trends upward before scaling.
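A back-of-the-envelope estimate helps before launching a run. The sketch below multiplies out trials, calls per trial, and tokens per call. Every number in it is an assumed input (the token count and price in particular are hypothetical), so substitute figures from your own provider and program.

```python
def estimate_cost(trials, calls_per_trial, tokens_per_call, usd_per_million_tokens):
    """Return (total_calls, estimated_usd) for an optimisation run."""
    total_calls = trials * calls_per_trial
    total_tokens = total_calls * tokens_per_call
    return total_calls, total_tokens * usd_per_million_tokens / 1_000_000


# Assumed inputs: 25 trials, ~100 calls each, 2,000 tokens per call, $5 per million tokens.
calls, usd = estimate_cost(
    trials=25,
    calls_per_trial=100,
    tokens_per_call=2_000,
    usd_per_million_tokens=5.0,
)
print(calls, usd)  # 2500 25.0
```

Multi-stage programs multiply the per-call token count by the number of LM stages, which is how totals climb quickly.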
KNN-FewShot — demo retrieval by similarity#
For each new query, retrieves the K most similar trainset examples and uses them as few-shot demos. Useful when the task distribution is wide and a fixed demo set generalises poorly.
from dspy.teleprompt import KNNFewShot
optimiser = KNNFewShot(k=4, trainset=trainset, vectorizer=dspy.Embedder("openai/text-embedding-3-small"))
compiled = optimiser.compile(rag, trainset=trainset)
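The retrieval step can be sketched with a toy similarity measure. This example uses bag-of-words cosine similarity in place of a real embedding model, purely to show per-query demo selection; `bow`, `cosine`, and `nearest_demos` are hypothetical helpers, not DSPy functions.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector: token -> count."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def nearest_demos(query, trainset, k=2):
    """Return the k training examples whose questions are most similar to `query`."""
    q = bow(query)
    ranked = sorted(trainset, key=lambda ex: cosine(q, bow(ex["question"])), reverse=True)
    return ranked[:k]


trainset = [
    {"question": "What is the capital of Spain?", "answer": "Madrid"},
    {"question": "What is the square root of 81?", "answer": "9"},
    {"question": "What is the capital of Italy?", "answer": "Rome"},
]
demos = nearest_demos("What is the capital of France?", trainset, k=2)
print(sorted(d["answer"] for d in demos))  # ['Madrid', 'Rome']
```

The arithmetic example is excluded for a geography query, which is the behaviour that helps when a single fixed demo set would mix unrelated tasks.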
Saving and loading compiled programs#
compiled.save("./compiled_rag.json")
import dspy
fresh = RAG()
fresh.load("./compiled_rag.json")
The JSON contains the chosen demos and (for COPRO/MIPRO) the refined instructions. Commit it to git as a model artefact.
Evaluation harness#
dspy.evaluate.Evaluate runs a program over a devset and reports the metric.
from dspy.evaluate import Evaluate
evaluator = Evaluate(devset=devset, metric=exact_match, num_threads=4, display_progress=True)
score = evaluator(compiled_rag)
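# In DSPy 2.x, Evaluate returns the score already scaled to 0-100, so do not use the "%" format spec.
print(f"Devset score: {score:.2f}%")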
Output:
Devset score: 87.50%
ReAct agents#
dspy.ReAct implements the Thought → Action → Observation loop. Tools are Python callables with type-annotated signatures; DSPy generates the JSON-schema description automatically.
import dspy
def calculator(expression: str) -> float:
    """Evaluate a simple arithmetic expression."""
    return float(eval(expression, {"__builtins__": {}}, {}))
def get_population(country: str) -> int:
    """Return the population of a country (rough)."""
    return {"France": 68_000_000, "Spain": 47_000_000}.get(country, -1)
react = dspy.ReAct("question -> answer", tools=[calculator, get_population])
out = react(question="What is the combined population of France and Spain, plus 100?")
print(out.answer)
Output:
The combined population of France and Spain plus 100 is 115,000,100.
ReAct is itself a Module — pass it to an optimiser to tune the tool-selection prompt.
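Because the calculator tool receives model-generated text, `eval` with emptied builtins is not a real sandbox. One safer alternative is to walk the parsed expression and allow only arithmetic nodes. The sketch below is a minimal version of that idea; `safe_calculator` is a hypothetical drop-in, and it deliberately omits exponentiation to avoid very expensive computations.

```python
import ast
import operator

# Whitelisted operators; anything else is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_calculator(expression: str) -> float:
    """Evaluate +, -, *, / over numeric literals only; raise ValueError otherwise."""

    def _eval(node):
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"Unsupported expression: {expression!r}")

    return float(_eval(ast.parse(expression, mode="eval").body))


print(safe_calculator("68_000_000 + 47_000_000 + 100"))  # 115000100.0
```

Function calls, attribute access, and names all fall through to the ValueError branch, so an input such as `__import__('os')` is rejected before anything runs.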
ProgramOfThought — code-generated answers#
For numeric, list, or table tasks, generating Python and executing it beats free-text reasoning.
pot = dspy.ProgramOfThought("question -> answer")
out = pot(question="What is the standard deviation of [10, 12, 23, 23, 16, 23, 21, 16]?")
print(out.answer)
Output:
4.898979485566356
[!WARNING] ProgramOfThought executes generated code: use a sandbox (docker, nsjail, or a Restricted Python interpreter) for untrusted inputs.
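The output shown for the standard-deviation question is the population standard deviation, not the sample one. A quick check with the standard library makes the distinction concrete, which is worth knowing when writing a metric that compares numeric answers.

```python
import math
import statistics

data = [10, 12, 23, 23, 16, 23, 21, 16]

population_sd = statistics.pstdev(data)  # divides by n
sample_sd = statistics.stdev(data)       # divides by n - 1

print(round(population_sd, 4))  # 4.899
print(round(sample_sd, 4))      # 5.2372
```

A numeric metric that uses a tolerance (for example `math.isclose`) and states which definition it expects avoids penalising a correct answer computed under the other convention.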
Assertions — runtime guarantees#
dspy.Assert enforces post-conditions. If the assertion fails, DSPy retries with the assertion’s message included as feedback.
import dspy
class WriteSummary(dspy.Module):
    def __init__(self):
        super().__init__()
        self.cot = dspy.ChainOfThought("text -> summary")
    def forward(self, text):
        out = self.cot(text=text)
        dspy.Assert(
            len(out.summary.split()) <= 30,
            "Summary must be 30 words or fewer."
        )
        return out
summariser = dspy.assert_transform_module(WriteSummary())
print(summariser(text="DSPy programs are compiled by optimising prompts against a metric.").summary)
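Use dspy.Suggest for soft constraints (logged but non-blocking) and dspy.Assert for hard ones. Both belong to the DSPy 2.x assertions API; more recent releases replace them with dspy.Refine and dspy.BestOfN, so check which your installed version provides.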
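The retry-with-feedback behaviour can be sketched in plain Python. This is a conceptual illustration rather than DSPy's implementation: `retry_with_feedback` and `toy_summariser` are hypothetical, and the toy generator stands in for an LM that shortens its output once it sees the constraint message.

```python
def retry_with_feedback(generate, check, message, max_retries=2):
    """Call `generate(feedback)` until `check` passes or retries run out."""
    feedback = None
    for _ in range(max_retries + 1):
        output = generate(feedback)
        if check(output):
            return output
        feedback = message  # the next attempt sees why the last one failed
    raise ValueError(f"Constraint still failing after {max_retries} retries: {message}")


def toy_summariser(feedback):
    # Stand-in for an LM: verbose at first, concise once it has received feedback.
    if feedback is None:
        return " ".join(["word"] * 40)
    return "DSPy compiles prompts against a metric."


summary = retry_with_feedback(
    toy_summariser,
    check=lambda s: len(s.split()) <= 30,
    message="Summary must be 30 words or fewer.",
)
print(summary)  # DSPy compiles prompts against a metric.
```

A hard constraint corresponds to the final `raise`; a soft one would log the failure and return the last output instead.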
Real-world recipes#
Recipe — multi-hop QA over a knowledge base#
Decompose the question, retrieve per sub-question, and synthesise.
import dspy
class MultiHopRAG(dspy.Module):
    def __init__(self, num_hops=2, num_passages=3):
        super().__init__()
        self.num_hops = num_hops
        self.gen_query = dspy.ChainOfThought("context, question -> next_search_query")
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.gen_ans = dspy.ChainOfThought("context, question -> answer")
    def forward(self, question):
        context = []
        for _ in range(self.num_hops):
            q = self.gen_query(context="\n".join(context), question=question).next_search_query
            context.extend(self.retrieve(q).passages)
        return self.gen_ans(context="\n".join(context), question=question)
Compile with MIPROv2 and a hop-aware metric that rewards good intermediate queries.
Recipe — judge-driven metric#
When the task is open-ended, use an LM-as-judge metric. Cache the judge to control cost.
class Faithfulness(dspy.Signature):
    """Is the answer faithful to the context? Output yes or no."""
    context: str = dspy.InputField()
    answer: str = dspy.InputField()
    faithful: bool = dspy.OutputField()
judge = dspy.Predict(Faithfulness)
def faithful_metric(example, pred, trace=None):
    return judge(context=example.context, answer=pred.answer).faithful
Recipe — A/B between two compiled programs#
import dspy
from dspy.evaluate import Evaluate
eval_run = Evaluate(devset=devset, metric=exact_match, num_threads=4)
print("Bootstrap:", eval_run(compiled_bootstrap))
print("MIPROv2: ", eval_run(compiled_miprov2))
Promote whichever wins; keep both JSON artefacts in models/ for rollback.
Recipe — swap teacher / student LMs#
Optimise once with an expensive teacher, deploy with a cheap student.
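from dspy.teleprompt import MIPROv2
teacher = dspy.LM("openai/gpt-4o")
student = dspy.LM("openai/gpt-4o-mini")
# Everything inside this block, including the program's own calls, runs on the teacher LM.
with dspy.context(lm=teacher):
    compiled = MIPROv2(metric=exact_match, auto="medium").compile(rag, trainset=trainset, valset=devset)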
dspy.configure(lm=student)
print(compiled(question="...").answer)
dspy.context(...) temporarily overrides the LM for the optimisation block.
Recipe — diff two compiled prompts#
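import json
with open("./old.json") as f:
    old = json.load(f)
with open("./new.json") as f:
    new = json.load(f)
def instructions(state):
    # Assumption: older saves use a flat "signature_instructions" key; newer ones nest it under "signature".
    if "signature_instructions" in state:
        return state["signature_instructions"]
    return state.get("signature", {}).get("instructions")
for mod in old:
    if mod in new and instructions(old[mod]) != instructions(new[mod]):
        print(f"{mod} instructions changed:")
        print("- ", instructions(old[mod]))
        print("+ ", instructions(new[mod]))
Commit the JSON to git so that the diff makes prompt regressions visible in PRs. The key layout of the saved file varies between DSPy releases, so inspect one of your own artefacts before relying on a specific key.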
Recipe — streaming responses#
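import asyncio
import dspy
lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)
# A StreamListener selects the output field whose tokens are streamed as StreamResponse chunks.
stream_predict = dspy.streamify(
    dspy.Predict("topic -> explanation"),
    stream_listeners=[dspy.streaming.StreamListener(signature_field_name="explanation")],
)
async def main():
    async for chunk in stream_predict(topic="Why is the sky blue?"):
        if isinstance(chunk, dspy.streaming.StreamResponse):
            print(chunk.chunk, end="", flush=True)
asyncio.run(main())
dspy.streamify(module) wraps any module for async streaming; stream listeners choose which output fields are emitted incrementally, and the final item yielded is the complete prediction.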
Quick reference#
| Task | Code |
|---|---|
| Install | pip install dspy |
| Configure LM | dspy.configure(lm=dspy.LM("openai/gpt-4o-mini")) |
| Configure RM | dspy.configure(rm=ChromadbRM(...)) |
| Inline signature | dspy.Predict("inputs -> outputs") |
| Class signature | class S(dspy.Signature): ... InputField/OutputField |
| Chain of thought | dspy.ChainOfThought("q -> a") |
| ReAct agent | dspy.ReAct(sig, tools=[f1, f2]) |
| Program of thought | dspy.ProgramOfThought("q -> a") |
| Retrieve | dspy.Retrieve(k=5) |
| Example | dspy.Example(...).with_inputs("q") |
| Bootstrap demos | BootstrapFewShot(metric=m, max_bootstrapped_demos=4) |
| Random search | BootstrapFewShotWithRandomSearch(num_candidate_programs=10) |
| Instruction optim | COPRO(metric=m, breadth=10, depth=3) |
| Joint optim | MIPROv2(metric=m, auto="medium") |
| KNN demos | KNNFewShot(k=4, trainset=ts) |
| Evaluate | Evaluate(devset=ds, metric=m)(program) |
| Inspect history | dspy.inspect_history(n=1) |
| Save compiled | program.save("file.json") |
| Load compiled | program.load("file.json") |
| Stream | dspy.streamify(module) |
| Hard assertion | dspy.Assert(cond, "message") |
| Soft suggestion | dspy.Suggest(cond, "message") |
| Temporary LM | with dspy.context(lm=teacher): ... |