DSPy — Programmatic Prompting and Optimisation#

What it is#

DSPy (Declarative Self-improving Python, from Stanford NLP) is a framework that replaces hand-tuned prompt strings with programs: typed Signatures describe the input → output contract; Modules compose them; and optimisers (formerly called teleprompters) compile the program by searching over instructions and few-shot demonstrations to maximise a developer-supplied metric. The slogan is “programming, not prompting”: you write the logic and let DSPy figure out the prompt.

The novel contribution is inference compilation: given a labelled (or self-labelled) dataset and a metric function, DSPy uses a teacher model to bootstrap traces, then selects and refines the few-shot examples and instructions that move the metric. The compiled program is portable across LLMs: you can swap GPT-4o-mini for Claude Sonnet at runtime without rewriting prompts, although recompiling against the new model is often worth the cost.

Install#

pip install dspy

pip install dspy chromadb sentence-transformers

Output:

Successfully installed dspy-2.x.x ...

[!TIP] The package was renamed from dspy-ai to dspy in 2024. Older tutorials use import dspy with pip install dspy-ai; new installs use pip install dspy and the same import.

Quick example — a Predict module#

A signature is "inputs -> outputs". dspy.Predict(signature) turns it into a callable. The LM is configured globally with dspy.configure(lm=...).

import dspy
import os

lm = dspy.LM("openai/gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])
dspy.configure(lm=lm)

summarise = dspy.Predict("text -> summary")
out = summarise(text="DSPy compiles LLM programs by optimising prompts against a metric.")
print(out.summary)

Output:

DSPy is a framework that compiles language-model programs by tuning prompts to maximise a developer-defined metric.

When / why to use it#

You have a metric (accuracy, F1, BLEU, judge-LM score) and want the prompt tuned to it instead of guessing.
Multi-stage LLM pipelines (decompose → retrieve → reason → answer) where hand-tuning every stage is painful.
You expect to swap models (cheap dev model, premium prod model) without rewriting prompts.
You want few-shot examples chosen automatically from a training set instead of cherry-picked by hand.
Reasoning-heavy tasks (math, multi-hop QA, code) where ChainOfThought and ProgramOfThought reliably outperform plain Predict.

Common pitfalls#

[!WARNING] No metric, no optimisation — supply a metric(example, pred, trace=None) -> float | bool to any optimiser that scores candidates (BootstrapFewShotWithRandomSearch, COPRO, MIPROv2). BootstrapFewShot accepts metric=None, but it then keeps every bootstrapped trace as a demo without checking it, so nothing is actually being optimised.

[!WARNING] Train/dev contamination — DSPy’s optimisers select demos from the trainset and evaluate against a separate devset. Reusing the same examples in both inflates scores. Hold out at least 30% as devset.

[!WARNING] Field name → prompt key — signature field names become prompt keys (text, summary, reasoning). Renaming a field invalidates compiled prompts. Pick stable names up front.

[!WARNING] Tracing leaks memory — DSPy stores every LM call when dspy.settings.trace = [] is set. Reset traces between batches in long-running services.

[!TIP] Inspect the compiled prompt with dspy.inspect_history(n=1) after running. The exact text sent to the LM (system + few-shots + user) is printed, which is invaluable for debugging metric regressions.

[!TIP] Cache LM calls during optimisation. dspy.LM caches by default (cache=True), so repeated identical calls are served from the cache; avoid passing cache=False while iterating on metrics. Optimisers make hundreds of calls, and caching cuts wall-clock time and cost dramatically.
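
The hold-out advice in the contamination warning above can be applied before any DSPy object is built. The following is a minimal sketch in plain Python, not a DSPy utility: `split_train_dev`, its `dev_fraction` default of 0.3, and the toy records are all illustrative assumptions.

```python
import random

def split_train_dev(examples, dev_fraction=0.3, seed=0):
    """Shuffle a copy of `examples` and hold out `dev_fraction` of them as a devset."""
    if not 0.0 < dev_fraction < 1.0:
        raise ValueError("dev_fraction must be between 0 and 1")
    shuffled = list(examples)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, round(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]

# Hypothetical records standing in for real training data.
records = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(10)]
trainset, devset = split_train_dev(records)
print(len(trainset), len(devset))  # 7 3
```

Fixing the seed matters here: if the split changes between runs, an example can drift from devset to trainset and quietly contaminate later comparisons.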

Signatures — the input/output contract#

A Signature declares what goes in and what comes out. The simplest form is a string "inputs -> outputs"; the explicit form is a class subclassing dspy.Signature with InputField and OutputField, optionally annotated with descriptions.

import dspy

class GenerateAnswer(dspy.Signature):
    """Answer the question concisely using the context."""

    context:  str = dspy.InputField(desc="Relevant background facts.")
    question: str = dspy.InputField()
    answer:   str = dspy.OutputField(desc="One or two short sentences.")

predictor = dspy.Predict(GenerateAnswer)
result = predictor(
    context="Paris is the capital of France.",
    question="What is the capital of France?",
)
print(result.answer)

Output:

The capital of France is Paris.

The docstring becomes the system instruction. Field descriptions become prompt hints. Output types (str, int, float, bool, list[str], Pydantic models) drive parsing — DSPy validates and retries on parse failure.
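
To make the string form concrete, here is a rough sketch of how an "inputs -> outputs" string maps to two lists of field names. It is not DSPy's parser: `parse_signature` is a hypothetical helper, and it only handles simple names with an optional single-word type after a colon.

```python
def parse_signature(signature: str):
    """Split 'a, b -> c' into input and output field names (types after ':' are dropped)."""
    if signature.count("->") != 1:
        raise ValueError("expected exactly one '->' in the signature")
    left, right = signature.split("->")

    def names(side: str):
        # Keep only the name part of each comma-separated field.
        fields = [part.split(":")[0].strip() for part in side.split(",")]
        if any(not name.isidentifier() for name in fields):
            raise ValueError(f"invalid field list: {side!r}")
        return fields

    return names(left), names(right)

print(parse_signature("context, question -> answer"))
# (['context', 'question'], ['answer'])
```

Because the names on each side become prompt keys, this is also a convenient place to see why renaming a field changes the prompt a compiled program was tuned for.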

Modules#

Modules are reusable LM programs. The built-in modules wrap a signature with a particular reasoning strategy.

Module	Strategy
`dspy.Predict`	Direct prediction (no intermediate reasoning).
`dspy.ChainOfThought`	Adds a `reasoning` field before the final output.
`dspy.ChainOfThoughtWithHint`	Same, with a hint field for evaluation-time guidance (found in older releases; check that your installed version still ships it).
`dspy.ProgramOfThought`	Generates and executes Python code to produce the answer.
`dspy.ReAct`	Tool-using reasoning loop (Thought → Action → Observation).
`dspy.MultiChainComparison`	Samples N reasoning chains and picks the best.
`dspy.Retrieve`	Calls the configured retriever (RM).

import dspy

cot = dspy.ChainOfThought("question -> answer")
result = cot(question="A train leaves at 3pm and travels 60 km/h. How far in 2.5 hours?")
print("Reasoning:", result.reasoning)
print("Answer:   ", result.answer)

Output:

Reasoning: distance = speed * time = 60 * 2.5 = 150 km.
Answer:    150 km

Custom modules — composing signatures#

Subclass dspy.Module and define forward(self, ...). Sub-modules become tunable as a whole.

import dspy

class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question: str):
        passages = self.retrieve(question).passages
        context  = "\n\n".join(passages)
        return self.generate(context=context, question=question)

Optimisers traverse the module tree and compile every predictor they find. Here self.generate carries the tunable state (instructions and demos); self.retrieve has no prompt of its own, so it is called as-is.

Configuring LMs and RMs#

DSPy talks to LMs through a single dspy.LM(...) interface backed by LiteLLM, so any provider supported by LiteLLM works without extra adapters.

import dspy
import os

# OpenAI
lm = dspy.LM("openai/gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])

# Anthropic
lm = dspy.LM("anthropic/claude-sonnet-4-6", api_key=os.environ["ANTHROPIC_API_KEY"])

# Local via Ollama
lm = dspy.LM("ollama_chat/llama3.1", api_base="http://localhost:11434")

dspy.configure(lm=lm)

Retrieval models (RMs) are configured similarly:

from dspy.retrieve.chromadb_rm import ChromadbRM
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
rm = ChromadbRM(collection_name="docs", persist_directory="./chroma_db", k=5)
dspy.configure(lm=lm, rm=rm)

Built-in RMs include ChromadbRM, QdrantRM, WeaviateRM, PineconeRM, ColBERTv2, and MarqoRM. Any callable returning passages can also be wrapped.

Datasets and `dspy.Example`#

DSPy expects training data as dspy.Example objects. Mark which fields are inputs with .with_inputs(...); everything else is treated as a label.

import dspy

trainset = [
    dspy.Example(question="2+2", answer="4").with_inputs("question"),
    dspy.Example(question="Capital of Spain?", answer="Madrid").with_inputs("question"),
    dspy.Example(question="Square root of 81?", answer="9").with_inputs("question"),
]

devset = [
    dspy.Example(question="Capital of Italy?", answer="Rome").with_inputs("question"),
]
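
The input/label distinction is simple enough to sketch without the library. The helper below is a plain-Python illustration, not part of DSPy (`split_fields` is a hypothetical name); DSPy's own Example objects expose the same split through their inputs() and labels() methods.

```python
def split_fields(record: dict, input_keys: set):
    """Separate a record into (inputs, labels) given the names marked as inputs."""
    missing = input_keys - set(record)
    if missing:
        raise KeyError(f"input fields not in record: {sorted(missing)}")
    inputs = {k: v for k, v in record.items() if k in input_keys}
    # Everything not marked as an input is treated as a label.
    labels = {k: v for k, v in record.items() if k not in input_keys}
    return inputs, labels

inputs, labels = split_fields({"question": "Capital of Spain?", "answer": "Madrid"}, {"question"})
print(inputs)  # {'question': 'Capital of Spain?'}
print(labels)  # {'answer': 'Madrid'}
```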

Metrics#

A metric is metric(example, prediction, trace=None) -> float | bool. Optimisers maximise the metric.

def exact_match(example, pred, trace=None):
    return example.answer.strip().lower() == pred.answer.strip().lower()

For semantic matching, use an LLM judge:

import dspy

class Judge(dspy.Signature):
    """Given a gold answer and a predicted answer, decide if they are equivalent."""

    gold:      str = dspy.InputField()
    predicted: str = dspy.InputField()
    correct:   bool = dspy.OutputField()

judge = dspy.Predict(Judge)

def semantic_match(example, pred, trace=None) -> bool:
    return judge(gold=example.answer, predicted=pred.answer).correct

[!TIP] Metrics can use trace to score intermediate steps. For multi-hop QA, reward both final answer correctness and good intermediate retrieval.
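
One common way to use trace is to be strict while demos are being bootstrapped and graded during evaluation. The sketch below assumes the convention that optimisers pass a non-None trace when bootstrapping and that evaluation leaves it as None; `token_f1`, the 0.8 threshold, and the SimpleNamespace stand-ins are illustrative choices, not DSPy APIs.

```python
from types import SimpleNamespace

def token_f1(gold: str, predicted: str) -> float:
    """Token-overlap F1 between two strings, ignoring case."""
    gold_tokens, pred_tokens = gold.lower().split(), predicted.lower().split()
    if not gold_tokens or not pred_tokens:
        return 0.0
    remaining = list(gold_tokens)
    overlap = 0
    for tok in pred_tokens:
        if tok in remaining:
            remaining.remove(tok)  # each gold token can be matched only once
            overlap += 1
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def f1_metric(example, pred, trace=None):
    score = token_f1(example.answer, pred.answer)
    if trace is not None:
        # Bootstrapping: return a strict pass/fail so only strong demos are kept.
        return score >= 0.8
    return score  # Evaluation: return the graded score.

# Stand-ins for a dspy.Example and a prediction.
example = SimpleNamespace(answer="Madrid")
pred = SimpleNamespace(answer="madrid")
print(f1_metric(example, pred))            # 1.0
print(f1_metric(example, pred, trace=[]))  # True
```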

Optimisers — compiling a program#

Optimisers (formerly “teleprompters”) take an unoptimised program, a trainset, and a metric, and return a compiled program with selected demos and (sometimes) refined instructions.

BootstrapFewShot — the default#

Picks few-shot examples from the trainset by running the (uncompiled) program and keeping examples where the metric passes. Cheap, fast, and the default choice for getting a baseline.

import dspy
from dspy.teleprompt import BootstrapFewShot

rag = RAG()
optimiser = BootstrapFewShot(metric=exact_match, max_bootstrapped_demos=4, max_labeled_demos=4)
compiled_rag = optimiser.compile(rag, trainset=trainset)

max_bootstrapped_demos = demos generated by running the teacher; max_labeled_demos = demos used as-is from the trainset.
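
The bootstrapping idea itself fits in a few lines. This is a simplified sketch of the selection loop described above, written against plain dicts and a toy program; it is not the library's implementation, and `bootstrap_demos`, `toy_program`, and `toy_metric` are hypothetical names.

```python
def bootstrap_demos(program, trainset, metric, max_bootstrapped_demos=4):
    """Run `program` on each training example and keep the runs the metric accepts."""
    demos = []
    for example in trainset:
        if len(demos) >= max_bootstrapped_demos:
            break
        prediction = program(example["question"])
        if metric(example, prediction):
            # Store the program's own output, not the gold label, as the demo.
            demos.append({"question": example["question"], "answer": prediction})
    return demos

# Hypothetical stand-ins: a toy "program" and an exact-match metric on plain strings.
def toy_program(question):
    return {"2+2": "4", "Capital of Spain?": "Madrid"}.get(question, "unknown")

def toy_metric(example, prediction):
    return example["answer"] == prediction

trainset = [
    {"question": "2+2", "answer": "4"},
    {"question": "Capital of Spain?", "answer": "Madrid"},
    {"question": "Square root of 81?", "answer": "9"},
]
print(bootstrap_demos(toy_program, trainset, toy_metric))
```

The third example is dropped because the toy program answers it incorrectly, which is the filtering behaviour a metric buys you.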

BootstrapFewShotWithRandomSearch — random search over demos#

Runs BootstrapFewShot with multiple random seeds and picks the candidate with the best devset score. Better than BootstrapFewShot for non-trivial budgets.

from dspy.teleprompt import BootstrapFewShotWithRandomSearch

optimiser = BootstrapFewShotWithRandomSearch(
    metric=exact_match,
    max_bootstrapped_demos=4,
    num_candidate_programs=10,
)
compiled = optimiser.compile(rag, trainset=trainset, valset=devset)
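
Conceptually, the random-search step is a loop over candidate demo sets scored on held-out data. The following sketch shows that loop under simplifying assumptions: `random_search` and `score_with_demos` are hypothetical, and the scorer is a toy rule rather than an LM call.

```python
import random

def random_search(trainset, valset, score_with_demos, num_candidates=10, demos_per_candidate=2, seed=0):
    """Try several random demo subsets and return (best_score, best_demos)."""
    rng = random.Random(seed)
    best_score, best_demos = float("-inf"), []
    for _ in range(num_candidates):
        demos = rng.sample(trainset, k=min(demos_per_candidate, len(trainset)))
        # Score each candidate on the validation set, never on the examples it was drawn from.
        score = sum(score_with_demos(demos, ex) for ex in valset) / len(valset)
        if score > best_score:
            best_score, best_demos = score, demos
    return best_score, best_demos

# Hypothetical scorer: a validation example counts as solved when some demo shares its topic.
def score_with_demos(demos, example):
    return float(any(d["topic"] == example["topic"] for d in demos))

trainset = [{"topic": t} for t in ["maths", "geography", "history", "physics"]]
valset = [{"topic": "maths"}, {"topic": "geography"}]
best_score, best_demos = random_search(trainset, valset, score_with_demos)
print(best_score, best_demos)
```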

COPRO — instruction optimisation#

COPRO (Coordinate Prompt Optimisation) refines the natural-language instructions in each signature. It does not change demos; it rewrites the docstring/prompts to be more effective for the task. Use when the system prompt matters more than the few-shots.

from dspy.teleprompt import COPRO

optimiser = COPRO(metric=exact_match, breadth=10, depth=3, init_temperature=1.4)
compiled = optimiser.compile(rag, trainset=trainset, eval_kwargs={"num_threads": 4})

breadth = candidate instructions per step; depth = optimisation steps.

MIPRO and MIPROv2 — joint instruction + demo optimisation#

MIPRO (Multi-prompt Instruction Proposal Optimiser) jointly searches over instructions and few-shot demos using Bayesian optimisation. MIPROv2 is the current recommended large-budget optimiser — significantly better than COPRO for most tasks.

from dspy.teleprompt import MIPROv2

optimiser = MIPROv2(
    metric=exact_match,
    auto="medium",
)
compiled = optimiser.compile(
    rag,
    trainset=trainset,
    valset=devset,
    requires_permission_to_run=False,
)

auto="light" runs ~6 trials, "medium" ~12, "heavy" ~25. Each trial is ~100 LM calls — budget accordingly.

[!WARNING] MIPROv2 cost — auto="heavy" can spend hundreds of dollars on GPT-4-class teachers for a multi-stage RAG. Always cache, start with auto="light", and confirm the metric trends upward before scaling.
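
A back-of-envelope estimate helps before committing to a budget. The sketch below only multiplies the approximate figures quoted above; the per-call price of 0.01 USD is a made-up placeholder, and `estimate_cost` is a hypothetical helper.

```python
# Approximate trial counts and calls per trial, taken from the figures quoted above.
TRIALS = {"light": 6, "medium": 12, "heavy": 25}
CALLS_PER_TRIAL = 100

def estimate_cost(auto: str, usd_per_call: float) -> tuple[int, float]:
    """Return (approximate LM calls, approximate USD) for one compile run."""
    calls = TRIALS[auto] * CALLS_PER_TRIAL
    return calls, calls * usd_per_call

for level in TRIALS:
    calls, usd = estimate_cost(level, usd_per_call=0.01)  # 0.01 USD/call is a made-up figure
    print(f"{level:>6}: ~{calls} calls, ~${usd:.2f}")
```

Real cost per call depends on prompt length, the number of modules in the program, and the provider's pricing, so treat the result as an order of magnitude only.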

KNN-FewShot — demo retrieval by similarity#

For each new query, retrieves the K most similar trainset examples and uses them as few-shot demos. Useful when the task distribution is wide and a fixed demo set generalises poorly.

from dspy.teleprompt import KNNFewShot

optimiser = KNNFewShot(k=4, trainset=trainset, vectorizer=dspy.Embedder("openai/text-embedding-3-small"))
compiled = optimiser.compile(rag, trainset=trainset)
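
The retrieval step behind this approach can be illustrated without an embedding model. The sketch below swaps in bag-of-words cosine similarity for real embeddings, purely so it runs offline; `bow`, `cosine`, and `nearest_demos` are hypothetical helpers, not DSPy functions.

```python
import math
from collections import Counter

def bow(text: str) -> Counter:
    """Bag-of-words vector: lower-cased token counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def nearest_demos(query: str, trainset: list, k: int = 2) -> list:
    """Return the k training examples whose questions are most similar to `query`."""
    q = bow(query)
    ranked = sorted(trainset, key=lambda ex: cosine(q, bow(ex["question"])), reverse=True)
    return ranked[:k]

trainset = [
    {"question": "What is the capital of Spain?", "answer": "Madrid"},
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of Italy?", "answer": "Rome"},
]
for demo in nearest_demos("What is the capital of France?", trainset):
    print(demo["question"])
```

The two geography questions are selected and the arithmetic one is left out, which is the point: each query gets demos that resemble it.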

Saving and loading compiled programs#

compiled.save("./compiled_rag.json")

import dspy
fresh = RAG()
fresh.load("./compiled_rag.json")

The JSON contains the chosen demos and (for COPRO/MIPRO) the refined instructions. Commit it to git as a model artefact.

Evaluation harness#

dspy.evaluate.Evaluate runs a program over a devset and reports the metric.

from dspy.evaluate import Evaluate

evaluator = Evaluate(devset=devset, metric=exact_match, num_threads=4, display_progress=True)
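score = evaluator(compiled_rag)
# DSPy 2.x returns a percentage such as 87.5; some newer releases may return a result object with a .score attribute.
print(f"Devset score: {score:.2f}%")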

Output:

Devset score: 87.50%
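
For intuition, the loop such a harness runs is roughly the following. This is a single-threaded sketch with hypothetical stand-ins (`evaluate`, `program`, `metric`), not the Evaluate class itself; it reports the mean metric as a percentage.

```python
def evaluate(program, devset, metric) -> float:
    """Average the metric over the devset and return it as a percentage."""
    if not devset:
        raise ValueError("devset is empty")
    total = sum(float(metric(example, program(example["question"]))) for example in devset)
    return round(100 * total / len(devset), 2)

# Hypothetical stand-ins for a compiled program and an exact-match metric.
answers = {"Capital of Italy?": "Rome", "2+2": "4"}
program = lambda question: answers.get(question, "unknown")
metric = lambda example, prediction: example["answer"] == prediction

devset = [
    {"question": "Capital of Italy?", "answer": "Rome"},
    {"question": "2+2", "answer": "4"},
    {"question": "Capital of Peru?", "answer": "Lima"},
    {"question": "Square root of 81?", "answer": "9"},
]
print(f"Devset score: {evaluate(program, devset, metric):.2f}%")  # Devset score: 50.00%
```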

ReAct agents#

dspy.ReAct implements the Thought → Action → Observation loop. Tools are Python callables with type-annotated signatures; DSPy generates the JSON-schema description automatically.

import dspy

def calculator(expression: str) -> float:
    """Evaluate a simple arithmetic expression."""
    # Emptying __builtins__ does not make eval() safe; sandbox or replace it for untrusted input.
    return float(eval(expression, {"__builtins__": {}}, {}))

def get_population(country: str) -> int:
    """Return the population of a country (rough)."""
    return {"France": 68_000_000, "Spain": 47_000_000}.get(country, -1)

react = dspy.ReAct("question -> answer", tools=[calculator, get_population])
out = react(question="What is the combined population of France and Spain, plus 100?")
print(out.answer)

Output:

The combined population of France and Spain plus 100 is 115,000,100.

ReAct is itself a Module — pass it to an optimiser to tune the tool-selection prompt.
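
Since the model chooses the tool arguments, a calculator tool is safer when it only accepts arithmetic. One possible approach, sketched here with the standard-library ast module, is to walk the parsed expression and allow numeric literals and four operators only; `safe_calculator` is a hypothetical replacement for the calculator tool above, not something DSPy provides.

```python
import ast
import operator

_BINARY = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul, ast.Div: operator.truediv}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}

def safe_calculator(expression: str) -> float:
    """Evaluate +, -, *, / over numeric literals; reject everything else."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](walk(node.operand))
        # Names, calls, attribute access and so on all end up here.
        raise ValueError(f"unsupported expression element: {type(node).__name__}")
    return float(walk(ast.parse(expression, mode="eval")))

print(safe_calculator("68_000_000 + 47_000_000 + 100"))  # 115000100.0
```

Because the function keeps the same type-annotated signature and docstring shape as the original tool, it could be passed in the tools list in the same way.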

ProgramOfThought — code-generated answers#

For numeric, list, or table tasks, generating Python and executing it beats free-text reasoning.

pot = dspy.ProgramOfThought("question -> answer")
out = pot(question="What is the standard deviation of [10, 12, 23, 23, 16, 23, 21, 16]?")
print(out.answer)

Output:

4.898979485566356

[!WARNING] ProgramOfThought executes generated code — use a sandbox (docker, nsjail, or a Restricted Python interpreter) for untrusted inputs.
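
It is worth checking generated numeric answers against a trusted implementation, because "standard deviation" is ambiguous. The short check below uses Python's statistics module on the list from the example; it suggests the figure shown in the output above is the population form, which is the square root of 24.

```python
import math
import statistics

data = [10, 12, 23, 23, 16, 23, 21, 16]
population_sd = statistics.pstdev(data)  # divides by n
sample_sd = statistics.stdev(data)       # divides by n - 1

print(population_sd)
print(sample_sd)
```

If the task needs the sample form, say so in the question or the signature docstring, since the generated code has no other way to know which one was intended.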

Assertions — runtime guarantees#

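dspy.Assert enforces post-conditions. If the assertion fails, DSPy retries with the assertion’s message included as feedback. These helpers belong to DSPy 2.5 and earlier; from 2.6 the documentation points to dspy.Refine and dspy.BestOfN instead, so check your installed version.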

import dspy

class WriteSummary(dspy.Module):
    def __init__(self):
        super().__init__()
        self.cot = dspy.ChainOfThought("text -> summary")

    def forward(self, text):
        out = self.cot(text=text)
        dspy.Assert(
            len(out.summary.split()) <= 30,
            "Summary must be 30 words or fewer."
        )
        return out

summariser = dspy.assert_transform_module(WriteSummary())
print(summariser(text="DSPy programs are compiled by optimising prompts against a metric.").summary)

Use dspy.Suggest for soft constraints (logged but non-blocking) and dspy.Assert for hard ones.
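
The retry-with-feedback behaviour can be pictured as a small loop. This is a plain-Python sketch of the idea, not DSPy's backtracking machinery: `retry_with_feedback` and `toy_summariser` are hypothetical, and the generator here is a stub rather than an LM.

```python
def retry_with_feedback(generate, check, message, max_retries=2):
    """Call `generate(feedback)` until `check(output)` passes or retries run out."""
    feedback = None
    for _ in range(max_retries + 1):
        output = generate(feedback)
        if check(output):
            return output
        feedback = message  # passed to the next attempt as extra guidance
    raise ValueError(f"constraint still failing after {max_retries} retries: {message}")

# Hypothetical generator: verbose on the first call, terse once it receives feedback.
def toy_summariser(feedback):
    if feedback is None:
        return " ".join(["word"] * 40)
    return "DSPy compiles prompts against a metric."

summary = retry_with_feedback(
    toy_summariser,
    check=lambda s: len(s.split()) <= 30,
    message="Summary must be 30 words or fewer.",
)
print(summary)  # DSPy compiles prompts against a metric.
```

A hard constraint corresponds to the final raise; a soft one would return the last output instead of raising.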

Real-world recipes#

Recipe — multi-hop QA over a knowledge base#

Decompose the question, retrieve per sub-question, and synthesise.

import dspy

class MultiHopRAG(dspy.Module):
    def __init__(self, num_hops=2, num_passages=3):
        super().__init__()
        self.num_hops = num_hops
        self.gen_query = dspy.ChainOfThought("context, question -> next_search_query")
        self.retrieve  = dspy.Retrieve(k=num_passages)
        self.gen_ans   = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = []
        for _ in range(self.num_hops):
            q = self.gen_query(context="\n".join(context), question=question).next_search_query
            context.extend(self.retrieve(q).passages)
        return self.gen_ans(context="\n".join(context), question=question)

Compile with MIPROv2 and a hop-aware metric that rewards good intermediate queries.
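
A hop-aware metric might look like the sketch below. It assumes each trace entry is a (predictor, inputs, outputs) tuple and that query outputs carry a next_search_query attribute, matching the signature used in MultiHopRAG; the word limit and the SimpleNamespace stand-ins are illustrative assumptions.

```python
from types import SimpleNamespace

def hop_aware_metric(example, pred, trace=None, max_query_words=20):
    """Exact match on the answer, plus query checks when a trace is supplied."""
    if example.answer.strip().lower() != pred.answer.strip().lower():
        return False
    if trace is None:
        return True
    # Assumption: each trace entry is a (predictor, inputs, outputs) tuple.
    queries = [
        outputs.next_search_query
        for _, _, outputs in trace
        if hasattr(outputs, "next_search_query")
    ]
    if any(len(q.split()) > max_query_words for q in queries):
        return False  # reject rambling queries
    normalised = [q.strip().lower() for q in queries]
    return len(set(normalised)) == len(normalised)  # reject repeated hops

example = SimpleNamespace(answer="Paris")
pred = SimpleNamespace(answer="paris")
good_trace = [
    (None, {}, SimpleNamespace(next_search_query="capital of France")),
    (None, {}, SimpleNamespace(next_search_query="France seat of government")),
    (None, {}, SimpleNamespace(answer="paris")),
]
repeated_trace = good_trace[:1] * 2
print(hop_aware_metric(example, pred, good_trace))      # True
print(hop_aware_metric(example, pred, repeated_trace))  # False
```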

Recipe — judge-driven metric#

When the task is open-ended, use an LM-as-judge metric. Cache the judge to control cost.

class Faithfulness(dspy.Signature):
    """Is the answer faithful to the context? Output yes or no."""

    context:   str = dspy.InputField()
    answer:    str = dspy.InputField()
    faithful: bool = dspy.OutputField()

judge = dspy.Predict(Faithfulness)

def faithful_metric(example, pred, trace=None):
    return judge(context=example.context, answer=pred.answer).faithful

Recipe — A/B between two compiled programs#

import dspy
from dspy.evaluate import Evaluate

eval_run = Evaluate(devset=devset, metric=exact_match, num_threads=4)
print("Bootstrap:", eval_run(compiled_bootstrap))
print("MIPROv2:  ", eval_run(compiled_miprov2))

Promote whichever wins; keep both JSON artefacts in models/ for rollback.

Recipe — swap teacher / student LMs#

Optimise once with an expensive teacher, deploy with a cheap student.

import dspy
from dspy.teleprompt import MIPROv2

teacher = dspy.LM("openai/gpt-4o")
student = dspy.LM("openai/gpt-4o-mini")

with dspy.context(lm=teacher):
    compiled = MIPROv2(metric=exact_match, auto="medium").compile(rag, trainset=trainset, valset=devset)

dspy.configure(lm=student)
print(compiled(question="...").answer)

dspy.context(...) temporarily overrides the LM for the optimisation block.

Recipe — diff two compiled prompts#

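import json

with open("./old.json") as f: old = json.load(f)
with open("./new.json") as f: new = json.load(f)

def instructions(state):
    # Recent releases nest the text under "signature"; older ones use a flat key.
    sig = state.get("signature") or {}
    return sig.get("instructions", state.get("signature_instructions"))

for mod in old:
    if not isinstance(old[mod], dict) or not isinstance(new.get(mod), dict):
        continue  # skip non-predictor entries and modules missing from new.json
    before, after = instructions(old[mod]), instructions(new[mod])
    if before != after:
        print(f"{mod} instructions changed:")
        print("- ", before)
        print("+ ", after)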

Commit the JSON to git and the diff makes prompt regressions visible in PRs.
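
For longer instructions, a line-level diff is easier to review than two printed strings. The sketch below uses the standard-library difflib module on two instruction strings; `instruction_diff` and the predictor name "generate" are hypothetical, and the strings are invented examples rather than real compiler output.

```python
import difflib

def instruction_diff(old_text: str, new_text: str, name: str = "predictor") -> str:
    """Return a unified diff of two instruction strings, or '' if they match."""
    lines = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=f"old/{name}",
        tofile=f"new/{name}",
        lineterm="",
    )
    return "\n".join(lines)

old_text = "Answer the question concisely using the context."
new_text = "Answer the question in one sentence, citing the context."
print(instruction_diff(old_text, new_text, name="generate"))
```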

Recipe — streaming responses#

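import asyncio
import dspy

lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

# A StreamListener selects which output field is streamed chunk by chunk;
# without one, no StreamResponse objects are yielded.
stream_predict = dspy.streamify(
    dspy.Predict("topic -> explanation"),
    stream_listeners=[dspy.streaming.StreamListener(signature_field_name="explanation")],
)

async def main():
    async for chunk in stream_predict(topic="Why is the sky blue?"):
        if isinstance(chunk, dspy.streaming.StreamResponse):
            print(chunk.chunk, end="", flush=True)

asyncio.run(main())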

dspy.streamify(module) wraps any module for async streaming of its output fields.

Quick reference#

Task	Code
Install	`pip install dspy`
Configure LM	`dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))`
Configure RM	`dspy.configure(rm=ChromadbRM(...))`
Inline signature	`dspy.Predict("inputs -> outputs")`
Class signature	`class S(dspy.Signature): ... InputField/OutputField`
Chain of thought	`dspy.ChainOfThought("q -> a")`
ReAct agent	`dspy.ReAct(sig, tools=[f1, f2])`
Program of thought	`dspy.ProgramOfThought("q -> a")`
Retrieve	`dspy.Retrieve(k=5)`
Example	`dspy.Example(...).with_inputs("q")`
Bootstrap demos	`BootstrapFewShot(metric=m, max_bootstrapped_demos=4)`
Random search	`BootstrapFewShotWithRandomSearch(num_candidate_programs=10)`
Instruction optim	`COPRO(metric=m, breadth=10, depth=3)`
Joint optim	`MIPROv2(metric=m, auto="medium")`
KNN demos	`KNNFewShot(k=4, trainset=ts)`
Evaluate	`Evaluate(devset=ds, metric=m)(program)`
Inspect history	`dspy.inspect_history(n=1)`
Save compiled	`program.save("file.json")`
Load compiled	`program.load("file.json")`
Stream	`dspy.streamify(module)`
Hard assertion	`dspy.Assert(cond, "message")`
Soft suggestion	`dspy.Suggest(cond, "message")`
Temporary LM	`with dspy.context(lm=teacher): ...`
