
LangSmith — LLM Observability & Evaluation

Trace, debug, evaluate, and monitor LLM applications with LangSmith. Covers tracing setup, datasets, evaluators, prompt hub, comparing runs, and CI integration.



What it is#

LangSmith is LangChain, Inc.’s platform for observability and evaluation of LLM applications. It captures prompts, responses, token counts, latency, and errors from LangChain chains and agents automatically, and from any Python code you instrument manually. You use it to debug failures, build evaluation datasets from production traces, run automated regression tests, and compare model and prompt versions. LangSmith has a free tier and integrates with LangChain through two environment variables (LANGCHAIN_TRACING_V2 and LANGCHAIN_API_KEY).

Install#

pip install langsmith
pip install langchain   # optional — auto-traces all LangChain calls

Output: (none — exits 0 on success)

Quick example#

import os

# Enable tracing: LANGCHAIN_TRACING_V2 and LANGCHAIN_API_KEY are all LangChain needs
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"]    = "ls__..."   # from smith.langchain.com
os.environ["LANGCHAIN_PROJECT"]    = "my-project"   # optional; runs go to "default" if unset

from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

chain = (
    ChatPromptTemplate.from_template("Summarise in one sentence: {text}")
    | ChatAnthropic(model="claude-sonnet-4-6")
    | StrOutputParser()
)

result = chain.invoke({"text": "LangSmith traces every LLM call automatically."})
print(result)
# Every call now appears in the LangSmith UI with full prompt/response/latency

Output:

LangSmith automatically traces every LLM call and displays it in the UI.

When / why to use it#

  • Debugging LLM chains: inspect exactly what prompt was sent, what was received, and where the chain failed.
  • Building evaluation datasets: tag production traces as “good” or “bad” and export to a dataset.
  • Regression testing: run a dataset through a chain and compare scores across versions.
  • Prompt management: version prompts in the LangSmith Hub and pull them by commit hash.
  • Monitoring production: alert on latency regressions or error rate spikes.

Common pitfalls#

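[!WARNING] Traces are sent asynchronously: LangSmith batches and sends traces in the background, so a short-lived script may exit before all traces are flushed. For LangChain runs, call wait_for_all_tracers() (from langchain_core.tracers.langchain) before the script exits; for @traceable code, call flush() on the Client that is doing the tracing. A freshly constructed Client() has its own queue, so flushing it may not send traces queued elsewhere.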

[!WARNING] LANGCHAIN_TRACING_V2 must be set before importing LangChain — the tracer registers at import time. Setting the env var after from langchain_core import ... has no effect.

[!TIP] Use @traceable to trace any Python function — not just LangChain objects. This captures non-LangChain steps (database calls, pre/post-processing) in the same trace tree.

[!TIP] with tracing_context(project_name="experiment-v2"): overrides the project for a specific block, making it easy to route A/B experiments to separate projects without changing env vars.

Tracing non-LangChain code with @traceable#

@traceable instruments any Python function so its inputs, outputs, and metadata appear in LangSmith traces.

from langsmith import traceable
import anthropic
import os

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

@traceable(name="Claude API call", run_type="llm")
def call_claude(prompt: str, model: str = "claude-sonnet-4-6") -> str:
    message = client.messages.create(
        model=model,
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text

@traceable(name="Extract keywords", run_type="chain")
def extract_keywords(text: str) -> list[str]:
    response = call_claude(f"List 5 keywords from this text as a comma-separated list: {text}")
    return [k.strip() for k in response.split(",")]

result = extract_keywords("LangSmith traces LLM calls and helps you evaluate and debug.")
print(result)

Output:

['LangSmith', 'traces', 'LLM', 'evaluate', 'debug']

Datasets — ground truth for evaluation#

A dataset is a collection of input/output pairs used to evaluate a chain consistently. Build datasets from production traces (tag and export), from CSV, or programmatically.

import os
from langsmith import Client

ls = Client(api_key=os.environ["LANGCHAIN_API_KEY"])

# Create a dataset
dataset = ls.create_dataset(
    dataset_name="summarisation-v1",
    description="Summarisation test cases",
)

# Add examples
examples = [
    {
        "inputs":  {"text": "The quick brown fox jumps over the lazy dog."},
        "outputs": {"summary": "A fox jumps over a dog."},
    },
    {
        "inputs":  {"text": "Python is a high-level interpreted programming language."},
        "outputs": {"summary": "Python is an interpreted high-level language."},
    },
]
ls.create_examples(inputs=[e["inputs"] for e in examples],
                   outputs=[e["outputs"] for e in examples],
                   dataset_id=dataset.id)

print(f"Dataset '{dataset.name}' created with {len(examples)} examples")

Output:

Dataset 'summarisation-v1' created with 2 examples

Evaluators — scoring predictions#

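An evaluator scores a chain’s output against the expected output. LangSmith exposes LangChain’s off-the-shelf evaluators (for example exact_match, embedding_distance, qa, and criteria) through LangChainStringEvaluator, and accepts custom evaluators written as plain functions that take a run and an example and return a score.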

from langsmith import Client
from langsmith.evaluation import evaluate, LangChainStringEvaluator
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
import os

chain = (
    ChatPromptTemplate.from_template("Summarise in one sentence: {text}")
    | ChatAnthropic(model="claude-sonnet-4-6", api_key=os.environ["ANTHROPIC_API_KEY"])
    | StrOutputParser()
)

def predict(inputs: dict) -> dict:
    return {"summary": chain.invoke(inputs)}

# LLM-as-a-judge evaluator for helpfulness
helpfulness_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": "helpfulness",
        "llm": ChatAnthropic(model="claude-sonnet-4-6", api_key=os.environ["ANTHROPIC_API_KEY"]),
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["summary"],
        "reference":  example.outputs["summary"],
        "input":      example.inputs["text"],
    },
)

results = evaluate(
    predict,
    data="summarisation-v1",
    evaluators=[helpfulness_evaluator],
    experiment_prefix="claude-sonnet-4-6",
)

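# Average the "helpfulness" scores across every evaluated example
scores = [
    r.score
    for row in results
    for r in row["evaluation_results"]["results"]
    if r.key == "helpfulness" and r.score is not None
]
print(f"Mean helpfulness: {sum(scores) / len(scores):.2f}")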

Custom evaluators#

from langsmith.evaluation import run_evaluator
from langsmith.schemas import Run, Example

@run_evaluator
def word_count_evaluator(run: Run, example: Example) -> dict:
    """Penalise summaries that are too long."""
    prediction = run.outputs.get("summary", "")
    expected   = example.outputs.get("summary", "")
    word_ratio = len(prediction.split()) / max(len(expected.split()), 1)
    score = 1.0 if word_ratio <= 1.5 else max(0.0, 1.0 - (word_ratio - 1.5))
    return {"key": "conciseness", "score": score, "comment": f"word_ratio={word_ratio:.2f}"}
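
The scoring rule inside word_count_evaluator can be checked in isolation before it is wired into an evaluation run. The sketch below repeats the same arithmetic in a plain function; conciseness_score is a hypothetical helper name used only for illustration, not part of the LangSmith SDK.

```python
def conciseness_score(prediction: str, expected: str) -> float:
    """Same rule as word_count_evaluator: full marks up to 1.5x the reference length,
    then a linear penalty that bottoms out at 0.0."""
    word_ratio = len(prediction.split()) / max(len(expected.split()), 1)
    if word_ratio <= 1.5:
        return 1.0
    return max(0.0, 1.0 - (word_ratio - 1.5))


reference = "A fox jumps over a dog."  # 6 words

print(conciseness_score("A fox jumps over a lazy dog.", reference))  # ratio 7/6  -> 1.0
print(conciseness_score(" ".join(["word"] * 12), reference))         # ratio 2.0  -> 0.5
print(conciseness_score(" ".join(["word"] * 18), reference))         # ratio 3.0  -> 0.0
```

A summary twice as long as the reference loses half the score, and anything at 2.5x or longer scores zero.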

Prompt hub — versioned prompts#

Store, version, and pull prompts from the LangSmith Hub so experiments are reproducible and rollback is trivial.

from langsmith import Client
from langchain import hub

# Pull a prompt by owner/name (uses LANGCHAIN_API_KEY)
prompt = hub.pull("alicedev/summarise-v1")
print(prompt.messages)

# Pull a specific commit for reproducibility
prompt = hub.pull("alicedev/summarise-v1:abc123")

# Push a new version
from langchain_core.prompts import ChatPromptTemplate
new_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a concise summariser. Use at most 20 words."),
    ("human",  "{text}"),
])
hub.push("alicedev/summarise-v1", new_prompt, new_repo_is_public=False)

Comparing experiments#

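Run the same dataset through two different chains (e.g. Claude vs GPT-4o) to compare scores side-by-side in the LangSmith UI. In the code below, predict_with_claude and predict_with_gpt4 are target functions shaped like predict above, one per model; they are assumed to be defined elsewhere.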

from langsmith.evaluation import evaluate

# Experiment A — Claude
results_a = evaluate(
    predict_with_claude,
    data="summarisation-v1",
    evaluators=[helpfulness_evaluator],
    experiment_prefix="claude-sonnet-4-6",
)

# Experiment B — GPT-4o
results_b = evaluate(
    predict_with_gpt4,
    data="summarisation-v1",
    evaluators=[helpfulness_evaluator],
    experiment_prefix="gpt-4o",
)

# Both experiments appear in the LangSmith UI under the same dataset
# for side-by-side score and latency comparison

Feedback — tagging individual runs#

import os
from langsmith import Client

ls = Client(api_key=os.environ["LANGCHAIN_API_KEY"])

# After a production run, record human feedback
ls.create_feedback(
    run_id="<run-id-from-trace>",
    key="user_rating",
    score=1.0,           # 0.0 = bad, 1.0 = good
    comment="Perfect summary, exactly right length",
)

# Query runs with negative feedback
bad_runs = ls.list_runs(
    project_name="my-project",
    filter='and(eq(feedback_key, "user_rating"), lt(feedback_score, 0.5))',
)
for run in bad_runs:
    print(run.id, run.inputs, run.outputs)

CI integration — fail on score regression#

from langsmith.evaluation import evaluate

def test_summarisation_quality():
    results = evaluate(
        predict,
        data="summarisation-v1",
        evaluators=[helpfulness_evaluator],
        experiment_prefix="ci",
    )
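    # Average the "helpfulness" scores across every evaluated example
    scores = [
        r.score
        for row in results
        for r in row["evaluation_results"]["results"]
        if r.key == "helpfulness" and r.score is not None
    ]
    mean = sum(scores) / len(scores)
    assert mean >= 0.75, f"Helpfulness score {mean:.2f} below threshold 0.75"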

pytest test_eval.py  # fails if mean helpfulness drops below threshold

Output: (none — exits 0 on success)

Run types — semantic categories in the trace tree#

A run type labels a span in the trace tree so the UI can render the right icon, surface token usage, and filter activity. Pick the type that matches what the function does — not the framework that produces it.

| Type | Meaning | Typical inputs/outputs |
| --- | --- | --- |
| llm | A model call (any provider) | {prompt} → {completion, tokens, cost} |
| chain | A multi-step orchestration | composite inputs → composite outputs |
| tool | A function/tool call (search, calc, HTTP) | tool args → tool return value |
| retriever | A vector store or BM25 retriever | {query} → list of documents |
| embedding | An embedding call | {text} → vector |
| parser | Output parsing / JSON extraction | raw text → structured data |
| prompt | A prompt template render | template + vars → final string |

from langsmith import traceable

@traceable(run_type="retriever", name="pgvector_retrieve")
def retrieve(query: str, k: int = 5) -> list[dict]:
    # The UI renders this as a retriever node with a document count badge
    return [{"page_content": "...", "metadata": {"source": "doc-1"}} for _ in range(k)]

@traceable(run_type="tool", name="weather_api")
def get_weather(city: str) -> dict:
    return {"city": city, "temp_c": 18, "conditions": "cloudy"}

Output: (none — exits 0 on success)

Tracing context — overriding project, tags, metadata#

tracing_context is a context manager that mutates the active trace settings for the lifetime of a block. Use it to route A/B variants to separate projects or attach experiment metadata without changing env vars.

from langsmith import traceable
from langsmith.run_helpers import tracing_context

@traceable
def summarise(text: str) -> str:
    return call_claude(f"Summarise: {text}")

# Route this run to a different project with extra tags + metadata
with tracing_context(
    project_name="experiment-v2",
    tags=["ab-test", "treatment"],
    metadata={"variant": "B", "user_segment": "power"},
):
    summarise("LangSmith batches and flushes traces asynchronously.")

Output: (none — exits 0 on success)

Multi-turn chats — threads and sessions#

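Group related runs into a thread so the LangSmith UI shows them as a conversation. LangSmith groups runs whose metadata carries the same value under session_id, thread_id, or conversation_id, so attach one of those keys to every turn, for example with tracing_context(metadata={"session_id": ...}).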

import uuid
from langsmith import traceable
from langsmith.run_helpers import tracing_context

thread_id = str(uuid.uuid4())

@traceable(run_type="chain")
def chat_turn(user_message: str) -> str:
    return call_claude(user_message)

with tracing_context(metadata={"session_id": thread_id, "user_id": "alice-dev"}):
    chat_turn("Hi, what is RAG?")
    chat_turn("How does it differ from fine-tuning?")
    chat_turn("Show me a Python example.")

Output: (none — exits 0 on success)

Programmatic trace inspection#

The Client API lets you query, filter, and download runs without the UI — useful for nightly reports, dataset curation, and custom dashboards.

from langsmith import Client
from datetime import datetime, timedelta, timezone

ls = Client()

# Last 24h of failing runs in this project
runs = ls.list_runs(
    project_name="my-project",
    start_time=datetime.now(timezone.utc) - timedelta(days=1),
    filter='eq(status, "error")',
    limit=100,
)
for r in runs:
    print(f"{r.start_time:%H:%M:%S}  {r.name:30s}  err={r.error[:60] if r.error else '-'}")

Output:

14:02:11  retrieve                       err=ConnectionError: pgvector timed out
14:09:47  call_claude                    err=anthropic.APIStatusError: 529 overloaded
14:31:05  parse_json                     err=ValueError: Expecting value: line 1 col

Filtering runs — the LangSmith query DSL#

filter= accepts a small expression language for selecting runs. Combine predicates with and(...), or(...), and not(...). Operators: eq, neq, gt, gte, lt, lte, has, search.

# All runs where the question contained "rag", token cost > $0.01, and feedback score < 0.5
runs = ls.list_runs(
    project_name="my-project",
    filter=(
        'and('
        '  search("rag"),'
        '  gt(total_cost, 0.01),'
        '  and(eq(feedback_key, "user_rating"), lt(feedback_score, 0.5))'
        ')'
    ),
)
print(f"{sum(1 for _ in runs)} runs matched")

Output:

17 runs matched
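
Hand-concatenated filter strings are easy to get wrong once they nest. As a minimal sketch, the two hypothetical helpers below (pred and all_of are illustrative names, not part of the LangSmith SDK) assemble the same expression used above; string values are double-quoted via json.dumps and numbers are left bare.

```python
import json


def pred(op: str, field: str, value) -> str:
    """Render one comparison, e.g. gt(total_cost, 0.01). Strings get double quotes."""
    return f"{op}({field}, {json.dumps(value)})"


def all_of(*clauses: str) -> str:
    """Combine clauses into a single and(...) expression."""
    return f"and({', '.join(clauses)})"


low_rated = all_of(
    pred("eq", "feedback_key", "user_rating"),
    pred("lt", "feedback_score", 0.5),
)
query = all_of('search("rag")', pred("gt", "total_cost", 0.01), low_rated)
print(query)
# and(search("rag"), gt(total_cost, 0.01), and(eq(feedback_key, "user_rating"), lt(feedback_score, 0.5)))
```

The resulting string can be passed as the filter= argument to list_runs.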

Cost and token tracking#

Every traced LLM call records prompt/completion token counts and a computed dollar cost (LangSmith maintains a per-model price list). Aggregate across a project by iterating list_runs, as below; project-level totals may also be available from read_project(..., include_stats=True), depending on your SDK version.

from langsmith import Client
from collections import defaultdict

ls = Client()
costs = defaultdict(float)
for r in ls.list_runs(project_name="my-project", run_type="llm", limit=1000):
    model = (r.extra or {}).get("invocation_params", {}).get("model", "unknown")
    costs[model] += r.total_cost or 0.0

for model, total in sorted(costs.items(), key=lambda kv: -kv[1]):
    print(f"  {model:30s}  ${total:7.2f}")

Output:

  claude-sonnet-4-6              $  42.18
  gpt-4o                         $  18.74
  text-embedding-3-small         $   0.62

Streaming and partial outputs#

For streamed token output, LangSmith records the full assembled text as the final output once the stream closes. Use streaming=True in LangChain clients; with @traceable, return an iterable or yield from a generator — LangSmith collects the full sequence automatically.

from langsmith import traceable

@traceable(run_type="llm", name="claude_stream")
def stream_completion(prompt: str):
    with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for chunk in stream.text_stream:
            yield chunk

text = "".join(stream_completion("List three uses of LangSmith."))
print(text[:120])

Output:

1. Debugging LangChain failures by inspecting the exact prompt/response per step.
2. Building evaluation
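
To make "collects the full sequence" concrete, here is a conceptual sketch of a decorator that passes chunks through to the caller while keeping a joined copy for later. It only illustrates the idea and is not LangSmith's implementation; collecting, last_output, and fake_stream are hypothetical names.

```python
import functools


def collecting(gen_fn):
    """Wrap a generator function so the joined chunks are kept once the stream ends."""

    @functools.wraps(gen_fn)
    def wrapper(*args, **kwargs):
        chunks = []
        try:
            for chunk in gen_fn(*args, **kwargs):
                chunks.append(chunk)
                yield chunk  # the caller still sees each chunk as it arrives
        finally:
            # Runs when the stream is exhausted or closed early
            wrapper.last_output = "".join(chunks)

    wrapper.last_output = None
    return wrapper


@collecting
def fake_stream(prompt: str):
    # Stand-in for a streaming model call
    yield from ["Lang", "Smith ", "records ", "the ", "full ", "text."]


for piece in fake_stream("hello"):
    pass
print(fake_stream.last_output)  # LangSmith records the full text.
```

Because the assembled text only exists once the generator finishes, a stream that is never consumed has no final output to record.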

Datasets from production — promoting traces to ground truth#

A common workflow: a user thumbs-down a response → you fix the prompt → you want to make sure the fix didn’t regress other queries. The fastest loop is to clone interesting production runs into a dataset, then re-run a candidate chain against that dataset.

from langsmith import Client

ls = Client()

# Find production runs with negative feedback and clone them into a dataset
bad_runs = list(ls.list_runs(
    project_name="prod",
    filter='and(eq(feedback_key, "user_rating"), lt(feedback_score, 0.5))',
    limit=50,
))

dataset = ls.create_dataset("regressions-2026-05", description="Negative feedback from prod")
for r in bad_runs:
    ls.create_example(
        inputs=r.inputs,
        outputs=r.outputs,            # current production output, treat as "what we had"
        dataset_id=dataset.id,
        metadata={"source_run": str(r.id)},
    )
print(f"Promoted {len(bad_runs)} runs into '{dataset.name}'")

Output:

Promoted 24 runs into 'regressions-2026-05'

Pairwise (preference) evaluation#

A pairwise evaluator chooses which of two candidate outputs is better — useful for A/B tests where no single ground-truth answer exists.

from langsmith.evaluation import evaluate_comparative
from langchain_anthropic import ChatAnthropic
import os

judge = ChatAnthropic(model="claude-sonnet-4-6", api_key=os.environ["ANTHROPIC_API_KEY"])

def preference_judge(runs, example):
    """Pick the run with the more concise, accurate summary."""
    a, b = runs
    prompt = (
        "You are judging two summaries. Reply with 'A' or 'B' only.\n"
        f"Question: {example.inputs['text']}\n"
        f"Reference: {example.outputs['summary']}\n"
        f"A: {a.outputs['summary']}\n"
        f"B: {b.outputs['summary']}"
    )
    choice = judge.invoke(prompt).content.strip().upper()
    winner = a if choice == "A" else b
    return {"key": "preferred", "scores": {str(a.id): int(winner is a), str(b.id): int(winner is b)}}

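evaluate_comparative(
    # Two prior experiments, passed positionally. Use their full names or IDs:
    # evaluate() appends a suffix to experiment_prefix when it names an experiment.
    ["claude-sonnet-4-6", "gpt-4o"],
    evaluators=[preference_judge],
)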
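
The judge above treats any reply other than "A" as a vote for B, so a malformed verdict silently favours one side. A hedged alternative is to score unparseable replies as a tie. pairwise_scores is a hypothetical helper, and the 0.5/0.5 tie convention is an assumption rather than something LangSmith prescribes.

```python
def pairwise_scores(raw_verdict: str, id_a: str, id_b: str) -> dict[str, float]:
    """Map a judge reply to per-run scores; anything other than A or B is a tie."""
    verdict = raw_verdict.strip().upper().rstrip(".")
    if verdict == "A":
        return {id_a: 1.0, id_b: 0.0}
    if verdict == "B":
        return {id_a: 0.0, id_b: 1.0}
    return {id_a: 0.5, id_b: 0.5}


print(pairwise_scores("A", "run-a", "run-b"))              # {'run-a': 1.0, 'run-b': 0.0}
print(pairwise_scores(" b.\n", "run-a", "run-b"))          # {'run-a': 0.0, 'run-b': 1.0}
print(pairwise_scores("Both are fine", "run-a", "run-b"))  # {'run-a': 0.5, 'run-b': 0.5}
```

Inside preference_judge, the returned dict could be used as the "scores" value, with str(a.id) and str(b.id) as the two IDs.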

Self-hosted LangSmith#

Set a different LANGCHAIN_ENDPOINT to send traces to a self-hosted LangSmith instance (Helm chart, Docker Compose). The client is identical otherwise.

export LANGCHAIN_ENDPOINT="https://langsmith.internal.example.com"
export LANGCHAIN_API_KEY="ls__..."
export LANGCHAIN_TRACING_V2="true"
python my_app.py

Output: (none — exits 0 on success)

Sampling traces in production#

For high-traffic services, capturing every trace is expensive. Sample 1–10% with LANGCHAIN_TRACING_SAMPLING_RATE or wrap your entry point with a manual sampler.

import random
from langsmith import traceable
from langsmith.run_helpers import tracing_context

@traceable
def handle_request(payload: dict) -> dict:
    return process(payload)

def maybe_trace(payload: dict, sample_rate: float = 0.05) -> dict:
    if random.random() < sample_rate:
        return handle_request(payload)
    # Disable tracing for this call entirely
    with tracing_context(enabled=False):
        return handle_request(payload)

Output: (none — exits 0 on success)
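
A random.random() sampler decides independently on every call, so two calls that belong to the same request can land on different sides of the cut. One possible variant, sketched below, hashes a stable identifier instead so the decision is repeatable. should_trace and the request_id argument are hypothetical names, and the request ID is assumed to be supplied by your own service.

```python
import hashlib


def should_trace(request_id: str, sample_rate: float = 0.05) -> bool:
    """Map the ID to a stable number in [0, 1) and keep it if it falls below the rate."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


ids = [f"req-{i}" for i in range(10_000)]
kept = sum(should_trace(r) for r in ids)
print(f"kept {kept} of {len(ids)} requests")  # roughly 5% of them

# The same ID always gets the same answer
print(should_trace("req-42") == should_trace("req-42"))  # True
```

In maybe_trace above, the random.random() < sample_rate check could be swapped for should_trace(payload_id, sample_rate), where payload_id is whatever identifier the payload carries.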

Real-world recipes#

These recipes string together the building blocks above into common production patterns.

Recipe: nightly evaluation report#

Run a held-out dataset against the current production chain every night and post the deltas to Slack.

import os
from datetime import datetime
from langsmith import Client
from langsmith.evaluation import evaluate

ls = Client()

def predict(inputs: dict) -> dict:
    return {"summary": production_chain.invoke(inputs)}

results = evaluate(
    predict,
    data="regression-suite-v3",
    evaluators=[helpfulness_evaluator, word_count_evaluator],
    experiment_prefix=f"nightly-{datetime.utcnow():%Y-%m-%d}",
)

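# Mean score per feedback key for tonight's run
scores = {}
for row in results:
    for r in row["evaluation_results"]["results"]:
        if r.score is not None:
            scores.setdefault(r.key, []).append(r.score)
agg = {k: sum(v) / len(v) for k, v in scores.items()}

# Convention: the previous green run is kept as an experiment named "nightly-previous".
# An experiment is stored as a project, so read_project can return its feedback stats.
prev = ls.read_project(project_name="nightly-previous", include_stats=True)
prev_agg = {k: v.get("avg", 0) for k, v in (prev.feedback_stats or {}).items()}

delta = {k: agg[k] - prev_agg.get(k, 0) for k in agg}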
print({k: round(v, 3) for k, v in delta.items()})

Output:

{'helpfulness': 0.04, 'conciseness': -0.02}
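
The recipe stops at printing the deltas. Before posting them to a chat channel, it helps to turn the dict into a short readable message. The sketch below shows one way to do that; format_deltas and the 0.01 tolerance are illustrative assumptions, and the actual Slack call (for example an incoming webhook) is left out because it depends on your workspace setup.

```python
def format_deltas(delta: dict[str, float], tolerance: float = 0.01) -> str:
    """Render one line per metric, marking changes beyond the tolerance as up or down."""
    lines = []
    for key, change in sorted(delta.items()):
        if change > tolerance:
            marker = "up"
        elif change < -tolerance:
            marker = "down"
        else:
            marker = "flat"
        lines.append(f"{key}: {change:+.3f} ({marker})")
    return "\n".join(lines)


print(format_deltas({"helpfulness": 0.04, "conciseness": -0.02}))
# conciseness: -0.020 (down)
# helpfulness: +0.040 (up)
```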

Recipe: prompt promotion gate#

Block any merge that pushes a prompt change unless evaluation scores hold or improve on the canonical dataset.

from langsmith.evaluation import evaluate
from langsmith import Client

def gate_prompt_change(prompt_path: str, baseline_score: float = 0.78) -> None:
    new_prompt = open(prompt_path).read()

    def predict(inputs: dict) -> dict:
        return {"answer": call_claude(new_prompt.format(**inputs))}

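    results = evaluate(predict, data="prompt-gate-v1", evaluators=[helpfulness_evaluator])
    # Average the "helpfulness" scores across every evaluated example
    scores = [
        r.score
        for row in results
        for r in row["evaluation_results"]["results"]
        if r.key == "helpfulness" and r.score is not None
    ]
    score = sum(scores) / len(scores)
    if score < baseline_score:
        raise SystemExit(f"FAIL: helpfulness {score:.3f} < baseline {baseline_score:.3f}")
    print(f"PASS: helpfulness {score:.3f} >= {baseline_score:.3f}")

if __name__ == "__main__":
    import sys
    gate_prompt_change(sys.argv[1])   # prompt file path from the command line

python -m scripts.gate_prompt_change ./prompts/summarise.txt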

Output:

PASS: helpfulness 0.812 >= 0.780

Recipe: user feedback → fine-tune dataset#

Collect runs that earned a thumbs-up and export them as a Hugging Face dataset for supervised fine-tuning.

from langsmith import Client
from datasets import Dataset

ls = Client()
runs = list(ls.list_runs(
    project_name="prod",
    filter='and(eq(feedback_key, "user_rating"), eq(feedback_score, 1.0))',
    run_type="llm",
    limit=5000,
))

records = [
    {
        "prompt": (r.inputs.get("messages") or r.inputs.get("prompt") or [""])[0]
                  if isinstance(r.inputs.get("messages"), list) else str(r.inputs),
        "completion": r.outputs.get("output") or r.outputs.get("content") or "",
    }
    for r in runs if r.outputs
]
ds = Dataset.from_list(records)
ds.save_to_disk("./sft_thumbs_up")
print(f"Exported {len(ds)} thumbs-up examples")

Output:

Exported 1438 thumbs-up examples

Recipe: cost alarm on a per-user basis#

Aggregate trace cost by user_id metadata and warn on top spenders.

from collections import defaultdict
from langsmith import Client
from datetime import datetime, timezone, timedelta

ls = Client()
spend = defaultdict(float)
for r in ls.list_runs(
    project_name="prod",
    start_time=datetime.now(timezone.utc) - timedelta(days=1),
    run_type="llm",
    limit=10_000,
):
    user = (r.extra or {}).get("metadata", {}).get("user_id", "anon")
    spend[user] += r.total_cost or 0.0

for user, total in sorted(spend.items(), key=lambda kv: -kv[1])[:10]:
    if total > 5.0:
        print(f"ALERT  {user}  ${total:.2f}/day")

Output:

ALERT  user_4711  $12.93/day
ALERT  alice-dev  $8.40/day

Performance and reliability tips#

  • Flush pending traces at the end of short scripts (flush() on the Client that is doing the tracing, or wait_for_all_tracers() for LangChain runs); otherwise the background sender may drop traces on exit.
  • For high-throughput services, use LANGCHAIN_TRACING_SAMPLING_RATE=0.05 and tag the kept runs with metadata={"sampled": True} so dashboards know the sampling factor.
  • Avoid putting large payloads (>1 MB) directly in inputs/outputs — link to S3/R2 in metadata instead. LangSmith truncates oversized fields.
  • Set LANGCHAIN_HIDE_INPUTS=true to redact inputs on PII-sensitive projects; combine with a custom hash so you can still group identical queries.
  • Pin a prompt version (hub.pull("owner/name:abc123")) in production code; an unpinned pull resolves to the latest commit, which can change underneath you.
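
The "custom hash" mentioned in the redaction tip can be as small as a keyed fingerprint of the normalised query text. The following is a minimal sketch under stated assumptions: query_fingerprint, the normalisation rule, and the 16-character truncation are all illustrative choices, and the secret shown is a placeholder that should come from a secret store.

```python
import hashlib
import hmac


def query_fingerprint(text: str, secret: bytes) -> str:
    """Return a short keyed hash so identical queries group together without exposing text."""
    normalised = " ".join(text.lower().split())  # ignore case and extra whitespace
    return hmac.new(secret, normalised.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


secret = b"replace-with-a-real-secret"  # hypothetical placeholder

a = query_fingerprint("What is RAG?", secret)
b = query_fingerprint("  what is   rag? ", secret)
c = query_fingerprint("What is fine-tuning?", secret)
print(a == b, a == c)  # True False
```

The fingerprint can then travel as run metadata (for example under a key such as query_hash, a hypothetical name) while the raw inputs stay hidden.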

Quick reference#

| Task | Code |
| --- | --- |
| Enable tracing | Set LANGCHAIN_TRACING_V2="true" and LANGCHAIN_API_KEY |
| Set project | os.environ["LANGCHAIN_PROJECT"] = "name" |
| Trace any function | @traceable(name="step", run_type="chain") |
| Override project | with tracing_context(project_name="exp"): |
| Attach metadata | with tracing_context(metadata={"user_id": "..."}): |
| Group as thread | metadata={"session_id": uuid} on each turn |
| Disable a block | with tracing_context(enabled=False): |
| Create dataset | ls.create_dataset("name") |
| Add examples | ls.create_examples(inputs=[...], outputs=[...], dataset_id=...) |
| Run evaluation | evaluate(predict_fn, data="dataset-name", evaluators=[...]) |
| Built-in evaluator | LangChainStringEvaluator("criteria", config={"criteria": "helpfulness"}) |
| Custom evaluator | @run_evaluator def fn(run, example) -> dict: |
| Pairwise eval | evaluate_comparative(["a", "b"], evaluators=[...]) |
| Query runs | ls.list_runs(project_name=..., filter='and(...)') |
| Pull prompt | hub.pull("owner/name") |
| Pin prompt | hub.pull("owner/name:abc123") |
| Push prompt | hub.push("owner/name", prompt) |
| Tag run | ls.create_feedback(run_id, key="rating", score=1.0) |
| Flush traces | flush() on the tracing Client, or wait_for_all_tracers() for LangChain runs |
| Self-hosted | export LANGCHAIN_ENDPOINT=https://... |
| Sample 5% | export LANGCHAIN_TRACING_SAMPLING_RATE=0.05 |