LangSmith: LLM Observability & Evaluation
What it is
LangSmith is LangChain Inc.’s platform for observability and evaluation of LLM applications. It automatically captures prompts, responses, token counts, latency, and errors from LangChain chains and agents, and from any Python code you instrument manually. You use it to debug failures, build evaluation datasets from production traces, run automated regression tests, and compare model/prompt versions. LangSmith has a free tier and integrates with LangChain via two environment variables.
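Install
pip install langsmith
pip install langchain langchain-anthropic  # optional: LangChain calls are traced automatically; langchain-anthropic is used by the examples below
Output: pip prints its usual install log and exits 0 on success.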
Quick example
import os

# Enable tracing with two env vars; that is all LangChain needs
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls__..."  # from smith.langchain.com
os.environ["LANGCHAIN_PROJECT"] = "my-project"  # optional: names the project the traces land in

from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

chain = (
    ChatPromptTemplate.from_template("Summarise in one sentence: {text}")
    | ChatAnthropic(model="claude-sonnet-4-6")
    | StrOutputParser()
)
result = chain.invoke({"text": "LangSmith traces every LLM call automatically."})
print(result)
# Every call now appears in the LangSmith UI with full prompt/response/latency
Output:
LangSmith automatically traces every LLM call and displays it in the UI.
When / why to use it
- Debugging LLM chains: inspect exactly what prompt was sent, what was received, and where the chain failed.
- Building evaluation datasets: tag production traces as “good” or “bad” and export to a dataset.
- Regression testing: run a dataset through a chain and compare scores across versions.
- Prompt management: version prompts in the LangSmith Hub and pull them by commit hash.
- Monitoring production: alert on latency regressions or error rate spikes.
Common pitfalls
[!WARNING] Traces are sent asynchronously. LangSmith batches and sends traces in the background, so in short-lived scripts the process may exit before all traces are flushed. At the end of a script, call flush() on the Client that is doing the tracing (a freshly constructed Client() has its own, empty queue); for LangChain runs, wait_for_all_tracers() from langchain_core.tracers.langchain does the same job.
[!WARNING] LANGCHAIN_TRACING_V2 must be set before importing LangChain, because the tracer registers at import time. Setting the env var after from langchain_core import ... has no effect.
[!TIP] Use @traceable to trace any Python function, not just LangChain objects. This captures non-LangChain steps (database calls, pre/post-processing) in the same trace tree.
[!TIP] with tracing_context(project_name="experiment-v2"): overrides the project for a specific block, making it easy to route A/B experiments to separate projects without changing env vars.
Tracing non-LangChain code with @traceable
@traceable instruments any Python function so its inputs, outputs, and metadata appear in LangSmith traces.
from langsmith import traceable
import anthropic
import os

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

@traceable(name="Claude API call", run_type="llm")
def call_claude(prompt: str, model: str = "claude-sonnet-4-6") -> str:
    message = client.messages.create(
        model=model,
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text

@traceable(name="Extract keywords", run_type="chain")
def extract_keywords(text: str) -> list[str]:
    # The nested call_claude() call shows up as a child run of this one
    response = call_claude(f"List 5 keywords from this text as a comma-separated list: {text}")
    return [k.strip() for k in response.split(",")]

result = extract_keywords("LangSmith traces LLM calls and helps you evaluate and debug.")
print(result)
Output:
['LangSmith', 'traces', 'LLM', 'evaluate', 'debug']
Datasets: ground truth for evaluation
A dataset is a collection of input/output pairs used to evaluate a chain consistently. Build datasets from production traces (tag and export), from CSV, or programmatically.
import os
from langsmith import Client

ls = Client(api_key=os.environ["LANGCHAIN_API_KEY"])

# Create a dataset
dataset = ls.create_dataset(
    dataset_name="summarisation-v1",
    description="Summarisation test cases",
)

# Add examples
examples = [
    {
        "inputs": {"text": "The quick brown fox jumps over the lazy dog."},
        "outputs": {"summary": "A fox jumps over a dog."},
    },
    {
        "inputs": {"text": "Python is a high-level interpreted programming language."},
        "outputs": {"summary": "Python is an interpreted high-level language."},
    },
]
ls.create_examples(
    inputs=[e["inputs"] for e in examples],
    outputs=[e["outputs"] for e in examples],
    dataset_id=dataset.id,
)
print(f"Dataset '{dataset.name}' created with {len(examples)} examples")
Output:
Dataset 'summarisation-v1' created with 2 examples
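The paragraph above mentions CSV as a dataset source. A minimal sketch, assuming a CSV with hypothetical columns text and summary, of reshaping rows into the parallel inputs/outputs lists that create_examples takes in the example above. It uses only the standard library; rows_to_examples is a made-up helper name, and the upload call is left as a comment.

```python
import csv
import io

# Hypothetical CSV content; in practice this would be read from a file.
CSV_TEXT = """text,summary
The quick brown fox jumps over the lazy dog.,A fox jumps over a dog.
Python is a high-level interpreted programming language.,Python is an interpreted high-level language.
"""

def rows_to_examples(csv_text: str, input_cols: list[str], output_cols: list[str]):
    """Split each CSV row into an inputs dict and an outputs dict."""
    inputs, outputs = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        inputs.append({c: row[c] for c in input_cols})
        outputs.append({c: row[c] for c in output_cols})
    return inputs, outputs

inputs, outputs = rows_to_examples(CSV_TEXT, ["text"], ["summary"])
print(len(inputs), inputs[0], outputs[0])

# The two lists can then be passed on unchanged, for example:
# ls.create_examples(inputs=inputs, outputs=outputs, dataset_id=dataset.id)
```

Keeping the column-to-key mapping explicit makes it easy to see which CSV columns become chain inputs and which become reference outputs.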
Evaluators: scoring predictions
An evaluator scores a chain’s output against the expected output. LangSmith provides built-in evaluators (exact_match, embedding_distance, qa) through LangChainStringEvaluator and supports custom evaluators written as plain Python functions.
from langsmith.evaluation import evaluate, LangChainStringEvaluator
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
import os

chain = (
    ChatPromptTemplate.from_template("Summarise in one sentence: {text}")
    | ChatAnthropic(model="claude-sonnet-4-6", api_key=os.environ["ANTHROPIC_API_KEY"])
    | StrOutputParser()
)

def predict(inputs: dict) -> dict:
    return {"summary": chain.invoke(inputs)}

# LLM-as-a-judge evaluator for helpfulness
helpfulness_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": "helpfulness",
        "llm": ChatAnthropic(model="claude-sonnet-4-6", api_key=os.environ["ANTHROPIC_API_KEY"]),
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["summary"],
        "reference": example.outputs["summary"],
        "input": example.inputs["text"],
    },
)

results = evaluate(
    predict,
    data="summarisation-v1",
    evaluators=[helpfulness_evaluator],
    experiment_prefix="claude-sonnet-4-6",
)

# Average the per-example "helpfulness" scores from the result rows
rows = [r for row in results for r in row["evaluation_results"]["results"]]
scores = [r.score for r in rows if r.key == "helpfulness" and r.score is not None]
print(f"Mean helpfulness: {sum(scores) / max(len(scores), 1):.2f}")
Custom evaluators
from langsmith.evaluation import run_evaluator
from langsmith.schemas import Run, Example

@run_evaluator
def word_count_evaluator(run: Run, example: Example) -> dict:
    """Penalise summaries that are too long."""
    prediction = run.outputs.get("summary", "")
    expected = example.outputs.get("summary", "")
    # Ratio of predicted to reference length in words; max(..., 1) avoids division by zero
    word_ratio = len(prediction.split()) / max(len(expected.split()), 1)
    # Full score up to 1.5x the reference length, then a linear penalty down to 0
    score = 1.0 if word_ratio <= 1.5 else max(0.0, 1.0 - (word_ratio - 1.5))
    return {"key": "conciseness", "score": score, "comment": f"word_ratio={word_ratio:.2f}"}
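To make the scoring rule concrete, the same arithmetic is pulled out below into a plain function with no LangSmith dependency. conciseness_score is a hypothetical name used only for this illustration, and the sample strings are made up.

```python
def conciseness_score(prediction: str, expected: str) -> float:
    """Same rule as word_count_evaluator: full marks up to 1.5x the reference length."""
    word_ratio = len(prediction.split()) / max(len(expected.split()), 1)
    return 1.0 if word_ratio <= 1.5 else max(0.0, 1.0 - (word_ratio - 1.5))

reference = "A fox jumps over a dog."        # 6 words
same_length = "The fox leaps over the dog."  # 6 words, ratio 1.0
double = " ".join(["word"] * 12)             # 12 words, ratio 2.0
triple = " ".join(["word"] * 18)             # 18 words, ratio 3.0

print(conciseness_score(same_length, reference))  # 1.0
print(conciseness_score(double, reference))       # 0.5
print(conciseness_score(triple, reference))       # 0.0
```

A summary twice as long as the reference loses half the score, and anything at 2.5x or longer scores zero.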
Prompt hub: versioned prompts
Store, version, and pull prompts from the LangSmith Hub so experiments are reproducible and rollback is trivial.
from langchain import hub
from langchain_core.prompts import ChatPromptTemplate

# Pull a prompt by owner/name (uses LANGCHAIN_API_KEY)
prompt = hub.pull("alicedev/summarise-v1")
print(prompt.messages)

# Pull a specific commit for reproducibility
prompt = hub.pull("alicedev/summarise-v1:abc123")

# Push a new version
new_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a concise summariser. Use at most 20 words."),
    ("human", "{text}"),
])
hub.push("alicedev/summarise-v1", new_prompt, new_repo_is_public=False)
Comparing experiments
Run the same dataset through two different chains (e.g. Claude vs GPT-4o) to compare scores side-by-side in the LangSmith UI.
from langsmith.evaluation import evaluate

# predict_with_claude and predict_with_gpt4 are assumed to be defined like
# predict() in the Evaluators section, one per model

# Experiment A: Claude
results_a = evaluate(
    predict_with_claude,
    data="summarisation-v1",
    evaluators=[helpfulness_evaluator],
    experiment_prefix="claude-sonnet-4-6",
)

# Experiment B: GPT-4o
results_b = evaluate(
    predict_with_gpt4,
    data="summarisation-v1",
    evaluators=[helpfulness_evaluator],
    experiment_prefix="gpt-4o",
)
# Both experiments appear in the LangSmith UI under the same dataset
# for side-by-side score and latency comparison
Feedback: tagging individual runs
import os
from langsmith import Client

ls = Client(api_key=os.environ["LANGCHAIN_API_KEY"])

# After a production run, record human feedback
ls.create_feedback(
    run_id="<run-id-from-trace>",
    key="user_rating",
    score=1.0,  # 0.0 = bad, 1.0 = good
    comment="Perfect summary, exactly right length",
)

# Query runs with negative feedback
bad_runs = ls.list_runs(
    project_name="my-project",
    filter='and(eq(feedback_key, "user_rating"), lt(feedback_score, 0.5))',
)
for run in bad_runs:
    print(run.id, run.inputs, run.outputs)
CI integration: fail on score regression
from langsmith.evaluation import evaluate

def test_summarisation_quality():
    results = evaluate(
        predict,
        data="summarisation-v1",
        evaluators=[helpfulness_evaluator],
        experiment_prefix="ci",
    )
    # Average the per-example "helpfulness" scores from the result rows
    rows = [r for row in results for r in row["evaluation_results"]["results"]]
    scores = [r.score for r in rows if r.key == "helpfulness" and r.score is not None]
    mean = sum(scores) / max(len(scores), 1)
    assert mean >= 0.75, f"Helpfulness score {mean:.2f} below threshold 0.75"
pytest test_eval.py # fails if mean helpfulness drops below threshold
Output: (none — exits 0 on success)
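When a CI job gates on more than one metric, the comparison itself can live in a small pure function that is easy to unit-test. The sketch below is one way to do that under stated assumptions: check_thresholds is a hypothetical helper, and the scores dict stands in for whatever per-metric means the evaluation run produced.

```python
def check_thresholds(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return one readable failure per metric that is missing or below its floor."""
    failures = []
    for key, floor in thresholds.items():
        value = scores.get(key)
        if value is None:
            failures.append(f"{key}: no score recorded")
        elif value < floor:
            failures.append(f"{key}: {value:.2f} below threshold {floor:.2f}")
    return failures

# Hypothetical mean scores, as a CI job might compute them from an evaluation run.
scores = {"helpfulness": 0.81, "conciseness": 0.62}
failures = check_thresholds(scores, {"helpfulness": 0.75, "conciseness": 0.70})
print(failures)  # ['conciseness: 0.62 below threshold 0.70']
```

In a pytest test, assert not failures, "; ".join(failures) reports every regressed metric at once instead of stopping at the first.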
Run types: semantic categories in the trace tree
A run type labels a span in the trace tree so the UI can render the right icon, surface token usage, and filter activity. Pick the type that matches what the function does, not the framework that produces it.
| Type | Meaning | Typical inputs/outputs |
|---|---|---|
| llm | A model call (any provider) | {prompt} → {completion, tokens, cost} |
| chain | A multi-step orchestration | composite inputs → composite outputs |
| tool | A function/tool call (search, calc, HTTP) | tool args → tool return value |
| retriever | A vector store or BM25 retriever | {query} → list of documents |
| embedding | An embedding call | {text} → vector |
| parser | Output parsing / JSON extraction | raw text → structured data |
| prompt | A prompt template render | template + vars → final string |
from langsmith import traceable

@traceable(run_type="retriever", name="pgvector_retrieve")
def retrieve(query: str, k: int = 5) -> list[dict]:
    # The UI renders this as a retriever node with a document count badge
    return [{"page_content": "...", "metadata": {"source": "doc-1"}} for _ in range(k)]

@traceable(run_type="tool", name="weather_api")
def get_weather(city: str) -> dict:
    return {"city": city, "temp_c": 18, "conditions": "cloudy"}
Output: (none — exits 0 on success)
Tracing context: overriding project, tags, metadata
tracing_context is a context manager that mutates the active trace settings for the lifetime of a block. Use it to route A/B variants to separate projects or attach experiment metadata without changing env vars.
from langsmith import traceable
from langsmith.run_helpers import tracing_context

@traceable
def summarise(text: str) -> str:
    return call_claude(f"Summarise: {text}")

# Route this run to a different project with extra tags + metadata
with tracing_context(
    project_name="experiment-v2",
    tags=["ab-test", "treatment"],
    metadata={"variant": "B", "user_segment": "power"},
):
    summarise("LangSmith batches and flushes traces asynchronously.")
Output: (none — exits 0 on success)
Multi-turn chats: threads and sessions
Group related runs into a thread so the LangSmith UI shows them as a conversation. Give every run in the conversation the same session_id, thread_id, or conversation_id metadata value, for example through tracing_context around your @traceable calls.
import uuid
from langsmith import traceable
from langsmith.run_helpers import tracing_context

thread_id = str(uuid.uuid4())

@traceable(run_type="chain")
def chat_turn(user_message: str) -> str:
    return call_claude(user_message)

# Every turn inside the block carries the same session_id, so the UI groups them
with tracing_context(metadata={"session_id": thread_id, "user_id": "alice-dev"}):
    chat_turn("Hi, what is RAG?")
    chat_turn("How does it differ from fine-tuning?")
    chat_turn("Show me a Python example.")
Output: (none — exits 0 on success)
Programmatic trace inspection
The Client API lets you query, filter, and download runs without the UI, which is useful for nightly reports, dataset curation, and custom dashboards.
from langsmith import Client
from datetime import datetime, timedelta, timezone

ls = Client()

# Last 24h of failing runs in this project
runs = ls.list_runs(
    project_name="my-project",
    start_time=datetime.now(timezone.utc) - timedelta(days=1),
    error=True,  # only runs that ended with an error
    limit=100,
)
for r in runs:
    print(f"{r.start_time:%H:%M:%S} {r.name:30s} err={r.error[:60] if r.error else '-'}")
Output:
14:02:11 retrieve err=ConnectionError: pgvector timed out
14:09:47 call_claude err=anthropic.APIStatusError: 529 overloaded
14:31:05 parse_json err=ValueError: Expecting value: line 1 col
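For a nightly report, the individual error lines are usually less useful than a count per error class. A minimal sketch, assuming error strings shaped like the sample output above ("ExceptionName: message"); real run.error values may carry a full traceback, in which case the split rule would need adjusting. error_class is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical error strings, shaped like the run.error values printed above.
errors = [
    "ConnectionError: pgvector timed out",
    "anthropic.APIStatusError: 529 overloaded",
    "ValueError: Expecting value: line 1 col",
    "ConnectionError: pgvector timed out",
]

def error_class(error: str) -> str:
    """Take the text before the first colon as the exception class name."""
    return error.split(":", 1)[0].strip()

counts = Counter(error_class(e) for e in errors)
for name, n in counts.most_common():
    print(f"{n:3d}  {name}")
```

Splitting on the first colon only keeps messages that themselves contain colons (such as the ValueError above) in the right bucket.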
Filtering runs: the LangSmith query DSL
filter= accepts a small expression language for selecting runs. Combine predicates with and(...), or(...), and not(...). Operators: eq, ne, gt, gte, lt, lte, has, search.
# All runs where the question contained "rag", token cost > $0.01, and feedback score < 0.5
runs = ls.list_runs(
    project_name="my-project",
    filter=(
        'and('
        ' search("rag"),'
        ' gt(total_cost, 0.01),'
        ' and(eq(feedback_key, "user_rating"), lt(feedback_score, 0.5))'
        ')'
    ),
)
print(f"{sum(1 for _ in runs)} runs matched")
Output:
17 runs matched
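Hand-written filter strings are easy to get wrong once they nest. The sketch below shows one way to assemble them from small pieces; pred and combine are hypothetical helpers (not part of the langsmith package) that only build the string, which is then passed to filter= as usual.

```python
import json

def pred(op: str, field: str, value) -> str:
    """Render one comparison such as lt(feedback_score, 0.5); strings get double quotes."""
    return f"{op}({field}, {json.dumps(value)})"

def combine(op: str, *parts: str) -> str:
    """Render and(...) or or(...) over already-rendered parts."""
    return f"{op}({', '.join(parts)})"

low_rating = combine(
    "and",
    pred("eq", "feedback_key", "user_rating"),
    pred("lt", "feedback_score", 0.5),
)
print(low_rating)
# and(eq(feedback_key, "user_rating"), lt(feedback_score, 0.5))
```

json.dumps handles the quoting, so a value containing a double quote is escaped rather than breaking the expression.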
Cost and token tracking
Every traced LLM call records prompt/completion token counts and a computed dollar cost (LangSmith maintains a per-model price list). Aggregate across a project by reading the project's stats (for example ls.read_project(project_name=..., include_stats=True)) or by iterating list_runs.
from langsmith import Client
from collections import defaultdict

ls = Client()
costs = defaultdict(float)
for r in ls.list_runs(project_name="my-project", run_type="llm", limit=1000):
    # The model name is recorded under the run's invocation parameters
    model = (r.extra or {}).get("invocation_params", {}).get("model", "unknown")
    costs[model] += float(r.total_cost or 0.0)
for model, total in sorted(costs.items(), key=lambda kv: -kv[1]):
    print(f" {model:30s} ${total:7.2f}")
Output:
claude-sonnet-4-6 $ 42.18
gpt-4o $ 18.74
text-embedding-3-small $ 0.62
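The per-call dollar figure is token counts multiplied by a per-model price. The arithmetic can be sketched as follows; the price table is entirely hypothetical (it is not LangSmith's price list or any provider's real pricing), and call_cost is an illustrative name.

```python
# Hypothetical prices in USD per million tokens; not real provider pricing.
PRICES = {"example-model": {"prompt": 3.00, "completion": 15.00}}

def call_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Cost of one call: tokens times price per token, summed over both directions."""
    price = PRICES[model]
    total = prompt_tokens * price["prompt"] + completion_tokens * price["completion"]
    return total / 1_000_000

cost = call_cost("example-model", prompt_tokens=1_200, completion_tokens=300)
print(f"${cost:.4f}")  # $0.0081
```

Here 1,200 prompt tokens contribute $0.0036 and 300 completion tokens contribute $0.0045, which is why completion-heavy calls dominate cost when output tokens are priced higher.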
Streaming and partial outputs
For streamed token output, LangSmith records the full assembled text as the final output once the stream closes. Use streaming=True in LangChain clients; with @traceable, return an iterable or yield from a generator, and LangSmith collects the full sequence automatically.
from langsmith import traceable

# client is the anthropic.Anthropic() instance created in the @traceable section
@traceable(run_type="llm", name="claude_stream")
def stream_completion(prompt: str):
    with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for chunk in stream.text_stream:
            yield chunk

text = "".join(stream_completion("List three uses of LangSmith."))
print(text[:120])
Output:
1. Debugging LangChain failures by inspecting the exact prompt/response per step.
2. Building evaluation
Datasets from production: promoting traces to ground truth
A common workflow: a user thumbs-down a response → you fix the prompt → you want to make sure the fix didn’t regress other queries. The fastest loop is to clone interesting production runs into a dataset, then re-run a candidate chain against that dataset.
from langsmith import Client

ls = Client()

# Find production runs with negative feedback and clone them into a dataset
bad_runs = list(ls.list_runs(
    project_name="prod",
    filter='and(eq(feedback_key, "user_rating"), lt(feedback_score, 0.5))',
    limit=50,
))
dataset = ls.create_dataset("regressions-2026-05", description="Negative feedback from prod")
for r in bad_runs:
    ls.create_example(
        inputs=r.inputs,
        outputs=r.outputs,  # current production output, treat as "what we had"
        dataset_id=dataset.id,
        metadata={"source_run": str(r.id)},
    )
print(f"Promoted {len(bad_runs)} runs into '{dataset.name}'")
Output:
Promoted 24 runs into 'regressions-2026-05'
Pairwise (preference) evaluation
A pairwise evaluator chooses which of two candidate outputs is better, which is useful for A/B tests where no single ground-truth answer exists.
from langsmith.evaluation import evaluate_comparative
from langchain_anthropic import ChatAnthropic
import os

judge = ChatAnthropic(model="claude-sonnet-4-6", api_key=os.environ["ANTHROPIC_API_KEY"])

def preference_judge(runs, example):
    """Pick the run with the more concise, accurate summary."""
    a, b = runs
    prompt = (
        "You are judging two summaries. Reply with 'A' or 'B' only.\n"
        f"Question: {example.inputs['text']}\n"
        f"Reference: {example.outputs['summary']}\n"
        f"A: {a.outputs['summary']}\n"
        f"B: {b.outputs['summary']}"
    )
    choice = judge.invoke(prompt).content.strip().upper()
    winner = a if choice == "A" else b  # any reply other than "A" counts as a vote for B
    return {"key": "preferred", "scores": {str(a.id): int(winner is a), str(b.id): int(winner is b)}}

evaluate_comparative(
    # Full names of two prior experiments; evaluate() appends a suffix to experiment_prefix
    experiments=["claude-sonnet-4-6-<suffix>", "gpt-4o-<suffix>"],
    evaluators=[preference_judge],
)
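Each pairwise judgement gives one point to the winner, so an overall result is just a win rate per experiment. A minimal sketch of that tally, assuming judgements in the {"key": ..., "scores": {run_id: 0 or 1}} shape returned by preference_judge above; the run ids, experiment names, and win_rates helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping from run id to the experiment that produced the run.
run_to_experiment = {
    "a1": "experiment-a", "b1": "experiment-b",
    "a2": "experiment-a", "b2": "experiment-b",
    "a3": "experiment-a", "b3": "experiment-b",
}
# Hypothetical judgements, one per dataset example.
judgements = [
    {"key": "preferred", "scores": {"a1": 1, "b1": 0}},
    {"key": "preferred", "scores": {"a2": 0, "b2": 1}},
    {"key": "preferred", "scores": {"a3": 1, "b3": 0}},
]

def win_rates(judgements: list[dict], run_to_experiment: dict[str, str]) -> dict[str, float]:
    """Fraction of examples on which each experiment's run was preferred."""
    wins = defaultdict(int)
    for judgement in judgements:
        for run_id, score in judgement["scores"].items():
            wins[run_to_experiment[run_id]] += score
    return {name: count / len(judgements) for name, count in wins.items()}

rates = win_rates(judgements, run_to_experiment)
print(rates)
```

Because LLM judges can favour whichever answer is listed first, it is worth randomising or swapping the A/B order across examples before trusting a narrow margin.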
Self-hosted LangSmith
Set a different LANGCHAIN_ENDPOINT to send traces to a self-hosted LangSmith instance (Helm chart, Docker Compose). The client is identical otherwise.
export LANGCHAIN_ENDPOINT="https://langsmith.internal.example.com"
export LANGCHAIN_API_KEY="ls__..."
export LANGCHAIN_TRACING_V2="true"
python my_app.py
Output: (none — exits 0 on success)
Sampling traces in production
For high-traffic services, capturing every trace is expensive. Sample 1–10% with LANGCHAIN_TRACING_SAMPLING_RATE or wrap your entry point with a manual sampler.
import random
from langsmith import traceable
from langsmith.run_helpers import tracing_context

@traceable
def handle_request(payload: dict) -> dict:
    return process(payload)  # process() is your application's own handler (not shown)

def maybe_trace(payload: dict, sample_rate: float = 0.05) -> dict:
    if random.random() < sample_rate:
        return handle_request(payload)
    # Disable tracing for this call entirely
    with tracing_context(enabled=False):
        return handle_request(payload)
Output: (none — exits 0 on success)
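The manual sampler above uses random.random(), so retries of the same request may or may not be traced. A common alternative, swapped in here as a different technique, is to hash a stable request id so the decision is reproducible. This is a sketch under stated assumptions: keep_trace and the req-N ids are hypothetical, and the result would feed the same if/else as maybe_trace.

```python
import hashlib

def keep_trace(request_id: str, sample_rate: float = 0.05) -> bool:
    """Deterministically keep about sample_rate of ids: hash to [0, 1) and compare."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate

ids = [f"req-{i}" for i in range(10_000)]
kept = sum(keep_trace(i) for i in ids)
print(f"kept {kept} of {len(ids)}")
```

Hashing a user or session id instead of a request id keeps whole conversations together, so a sampled thread is never missing turns.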
Real-world recipes
These recipes string together the building blocks above into common production patterns.
Recipe: nightly evaluation report
Run a held-out dataset against the current production chain every night and post the deltas to Slack.
from datetime import datetime, timezone
from langsmith import Client
from langsmith.evaluation import evaluate

ls = Client()

def predict(inputs: dict) -> dict:
    return {"summary": production_chain.invoke(inputs)}

results = evaluate(
    predict,
    data="regression-suite-v3",
    evaluators=[helpfulness_evaluator, word_count_evaluator],
    experiment_prefix=f"nightly-{datetime.now(timezone.utc):%Y-%m-%d}",
)

# Average each feedback key over the result rows
by_key = {}
for row in results:
    for r in row["evaluation_results"]["results"]:
        if r.score is not None:
            by_key.setdefault(r.key, []).append(r.score)
agg = {k: sum(v) / len(v) for k, v in by_key.items()}

# Convention: "nightly-previous" aliases the previous green run.
# Assumes each feedback_stats entry is a dict carrying an "avg" value.
prev = ls.read_project(project_name="nightly-previous", include_stats=True)
prev_agg = {k: v.get("avg", 0) for k, v in (prev.feedback_stats or {}).items()}
delta = {k: agg[k] - prev_agg.get(k, 0) for k in agg}
print({k: round(v, 3) for k, v in delta.items()})
Output:
{'helpfulness': 0.04, 'conciseness': -0.02}
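Before posting to Slack, the deltas need to be turned into readable text, and a metric that has no previous value should be reported as new rather than compared against zero. A minimal sketch of that formatting step; format_delta_report is a hypothetical helper and the score values are made up. Actually sending the message is left out.

```python
def format_delta_report(current: dict[str, float], previous: dict[str, float]) -> str:
    """One line per metric: current value plus signed change, or '(new)' without a baseline."""
    lines = []
    for key in sorted(current):
        if key in previous:
            delta = current[key] - previous[key]
            lines.append(f"{key}: {current[key]:.2f} ({delta:+.2f})")
        else:
            lines.append(f"{key}: {current[key]:.2f} (new)")
    return "\n".join(lines)

report = format_delta_report(
    {"helpfulness": 0.82, "conciseness": 0.70},
    {"helpfulness": 0.78},
)
print(report)
# conciseness: 0.70 (new)
# helpfulness: 0.82 (+0.04)
```

The explicit sign in the format spec (:+.2f) makes regressions stand out at a glance in a chat message.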
Recipe: prompt promotion gate
Block any merge that pushes a prompt change unless evaluation scores hold or improve on the canonical dataset.
import sys
from langsmith.evaluation import evaluate

def gate_prompt_change(prompt_path: str, baseline_score: float = 0.78) -> None:
    with open(prompt_path) as f:
        new_prompt = f.read()

    def predict(inputs: dict) -> dict:
        # Key must be "summary": helpfulness_evaluator reads run.outputs["summary"]
        return {"summary": call_claude(new_prompt.format(**inputs))}

    results = evaluate(predict, data="prompt-gate-v1", evaluators=[helpfulness_evaluator])
    rows = [r for row in results for r in row["evaluation_results"]["results"]]
    scores = [r.score for r in rows if r.key == "helpfulness" and r.score is not None]
    score = sum(scores) / max(len(scores), 1)
    if score < baseline_score:
        raise SystemExit(f"FAIL: helpfulness {score:.3f} < baseline {baseline_score:.3f}")
    print(f"PASS: helpfulness {score:.3f} >= {baseline_score:.3f}")

if __name__ == "__main__":
    gate_prompt_change(sys.argv[1])
python -m scripts.gate_prompt_change ./prompts/summarise.txt
Output:
PASS: helpfulness 0.812 >= 0.780
Recipe: user feedback → fine-tune dataset
Collect runs that earned a thumbs-up and export them as a Hugging Face dataset for supervised fine-tuning.
from langsmith import Client
from datasets import Dataset

ls = Client()
runs = list(ls.list_runs(
    project_name="prod",
    filter='and(eq(feedback_key, "user_rating"), eq(feedback_score, 1.0))',
    run_type="llm",
    limit=5000,
))
records = [
    {
        # First message if the run logged a message list, else the raw inputs as text
        "prompt": (r.inputs.get("messages") or r.inputs.get("prompt") or [""])[0]
        if isinstance(r.inputs.get("messages"), list) else str(r.inputs),
        "completion": r.outputs.get("output") or r.outputs.get("content") or "",
    }
    for r in runs if r.outputs
]
ds = Dataset.from_list(records)
ds.save_to_disk("./sft_thumbs_up")
print(f"Exported {len(ds)} thumbs-up examples")
Output:
Exported 1438 thumbs-up examples
Recipe: cost alarm on a per-user basis
Aggregate trace cost by user_id metadata and warn on top spenders.
from collections import defaultdict
from langsmith import Client
from datetime import datetime, timezone, timedelta

ls = Client()
spend = defaultdict(float)
for r in ls.list_runs(
    project_name="prod",
    start_time=datetime.now(timezone.utc) - timedelta(days=1),
    run_type="llm",
    limit=10_000,
):
    user = (r.extra or {}).get("metadata", {}).get("user_id", "anon")
    spend[user] += float(r.total_cost or 0.0)
# Report the ten biggest spenders that crossed the $5/day line
for user, total in sorted(spend.items(), key=lambda kv: -kv[1])[:10]:
    if total > 5.0:
        print(f"ALERT {user} ${total:.2f}/day")
Output:
ALERT user_4711 $12.93/day
ALERT alice-dev $8.40/day
Performance and reliability tips
- At the end of short scripts, always call flush() on the Client that is doing the tracing; otherwise the background sender may drop traces on exit.
- For high-throughput services, use LANGCHAIN_TRACING_SAMPLING_RATE=0.05 and tag the kept runs with metadata={"sampled": True} so dashboards know the sampling factor.
- Avoid putting large payloads (>1 MB) directly in inputs/outputs; link to S3/R2 in metadata instead. LangSmith truncates oversized fields.
- Set LANGCHAIN_HIDE_INPUTS=true to redact inputs on PII-sensitive projects; combine with a custom hash so you can still group identical queries.
- Pin a prompt version (hub.pull("owner/name:abc123")) in production code, because the unpinned latest version can change under you.
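The "custom hash" mentioned in the redaction tip can be sketched as a keyed fingerprint stored in run metadata. This is one possible approach under stated assumptions: query_fingerprint and SECRET_KEY are hypothetical, and a keyed HMAC is used rather than a bare hash so that short, guessable queries cannot be recovered by hashing candidates.

```python
import hashlib
import hmac

# Hypothetical secret; in practice load it from a secret store, not source code.
SECRET_KEY = b"replace-me"

def query_fingerprint(text: str) -> str:
    """Normalise case and whitespace, then return a short keyed SHA-256 fingerprint."""
    normalised = " ".join(text.lower().split())
    mac = hmac.new(SECRET_KEY, normalised.encode("utf-8"), hashlib.sha256)
    return mac.hexdigest()[:16]

a = query_fingerprint("What is RAG?")
b = query_fingerprint("  what   is rag? ")
c = query_fingerprint("What is fine-tuning?")
print(a == b, a != c)  # True True
```

The fingerprint could then be attached with tracing_context(metadata={"query_hash": ...}), letting dashboards group identical queries while the raw input stays hidden.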
Quick reference
| Task | Code |
|---|---|
| Enable tracing | os.environ["LANGCHAIN_TRACING_V2"] = "true", plus LANGCHAIN_API_KEY |
| Set project | os.environ["LANGCHAIN_PROJECT"] = "name" |
| Trace any function | @traceable(name="step", run_type="chain") |
| Override project | with tracing_context(project_name="exp"): |
| Attach metadata | with tracing_context(metadata={"user_id": "..."}): |
| Group as thread | metadata={"session_id": uuid} on each turn |
| Disable a block | with tracing_context(enabled=False): |
| Create dataset | ls.create_dataset("name") |
| Add examples | ls.create_examples(inputs=[...], outputs=[...], dataset_id=...) |
| Run evaluation | evaluate(predict_fn, data="dataset-name", evaluators=[...]) |
| Built-in evaluator | LangChainStringEvaluator("criteria", config={"criteria": "helpfulness"}) |
| Custom evaluator | @run_evaluator def fn(run, example) -> dict: |
| Pairwise eval | evaluate_comparative(experiments=["a","b"], evaluators=[...]) |
| Query runs | ls.list_runs(project_name=..., filter='and(...)') |
| Pull prompt | hub.pull("owner/name") |
| Pin prompt | hub.pull("owner/name:abc123") |
| Push prompt | hub.push("owner/name", prompt) |
| Tag run | ls.create_feedback(run_id, key="rating", score=1.0) |
| Flush traces | flush() on the Client used for tracing |
| Self-hosted | export LANGCHAIN_ENDPOINT=https://... |
| Sample 5% | export LANGCHAIN_TRACING_SAMPLING_RATE=0.05 |