TruLens: LLM App Evaluation
What it is
TruLens is a Python library for evaluating and monitoring LLM-powered applications, with a particular focus on RAG pipelines. It defines the RAG Triad — three feedback functions (Answer Relevance, Context Relevance, and Groundedness) that together diagnose whether a RAG system retrieves the right information and generates faithful, on-topic answers. TruLens records every LLM call, computes feedback scores automatically, and surfaces results in a local web dashboard so you can compare runs and catch regressions.
Install
pip install trulens-eval
pip install "trulens-eval[langchain]" # LangChain integration (quotes keep the shell from expanding the brackets)
pip install "trulens-eval[llama-index]" # LlamaIndex integration
Output: pip prints its usual download and install log and exits 0 on success.
Quick example
from trulens_eval import Tru, TruBasicApp, Feedback
from trulens_eval.feedback.provider import OpenAI as TruOpenAI

tru = Tru()  # creates or opens the local SQLite database
provider = TruOpenAI(model_engine="gpt-4o-mini")

# Simple RAG stub
def rag(question: str) -> str:
    return "Attention allows the model to weigh input tokens by relevance."

f_answer_relevance = Feedback(provider.relevance, name="Answer Relevance").on_input_output()
tru_rag = TruBasicApp(rag, app_id="my-rag-v1", feedbacks=[f_answer_relevance])

with tru_rag as recording:
    # Call through the wrapper: a direct rag(...) call would not be recorded.
    answer = tru_rag.app("What is attention in transformers?")
print(answer)
print(tru.get_leaderboard())
Output (illustrative; scores and costs vary by run and judge model):
Attention allows the model to weigh input tokens by relevance.
app_id Answer Relevance total_cost total_tokens
0 my-rag-v1 0.92 0.0003 142
When / why to use it
- Systematically evaluating RAG pipelines across the three RAG Triad dimensions: context relevance, groundedness, and answer relevance.
- Comparing pipeline versions side-by-side in the built-in dashboard — swap the embedding model or chunk size, run both, compare scores.
- Detecting regressions in CI: fail the build if groundedness drops below a threshold.
- Debugging: low Context Relevance means the retriever is fetching off-topic chunks; low Groundedness means the LLM is ignoring retrieved context.
- Building a leaderboard across LLM providers, prompt templates, or retrieval strategies.
Common pitfalls
[!WARNING] TruLens uses LLMs as judges: each feedback function makes one or more LLM calls per evaluated record, so evaluating 1,000 records × 3 feedback functions means at least 3,000 judge calls. Use a cheaper judge model (e.g. `gpt-4o-mini`) to keep costs down. Note that `tru.reset_database()` deletes all recorded results rather than caching them, so call it between runs only when you want a clean slate.
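To put the cost warning in numbers, a back-of-envelope estimate multiplies records, feedback functions, judge calls per feedback, and tokens per call. The sketch below is not part of TruLens; the function name, the tokens-per-call figure, and the price are placeholder assumptions to be replaced with your own measurements.

```python
def estimate_judge_cost(
    n_records: int,
    n_feedbacks: int,
    calls_per_feedback: float = 1.0,        # assumed; per-chunk feedbacks make more calls
    tokens_per_call: int = 500,             # assumed average prompt + completion size
    usd_per_million_tokens: float = 0.50,   # assumed blended price; check your provider
) -> dict:
    """Rough judge cost for one evaluation run."""
    calls = n_records * n_feedbacks * calls_per_feedback
    tokens = calls * tokens_per_call
    return {
        "judge_calls": int(calls),
        "tokens": int(tokens),
        "usd": round(tokens / 1_000_000 * usd_per_million_tokens, 4),
    }


# The scenario from the warning: 1,000 records and 3 feedback functions.
estimate = estimate_judge_cost(n_records=1000, n_feedbacks=3)
print(estimate)
```

Feedback functions that score each retrieved chunk separately make one call per chunk, so raising `calls_per_feedback` toward the retriever's `k` gives a more conservative figure.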
[!WARNING] `Tru()` writes to a local SQLite file, by default `default.sqlite` in the working directory. Set `database_url` to a persistent path: `Tru(database_url="sqlite:///evals/trulens.sqlite")`. Deleting or moving this file loses all recorded runs.
[!WARNING] App ID must be unique per pipeline version: if you reuse the same `app_id` across different code versions, runs are merged in the dashboard. Use versioned IDs like `"rag-v2-chroma-k5"` to keep experiments separate.
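One way to keep IDs unique without naming them by hand is to derive the ID from the pipeline configuration. This is a minimal sketch of that idea, not a TruLens feature; `make_app_id` and the config keys are hypothetical.

```python
import hashlib
import json


def make_app_id(name: str, version: int, config: dict) -> str:
    """Build an app_id that changes whenever the pipeline config changes."""
    # Sort keys so the same config always serialises to the same string.
    canonical = json.dumps(config, sort_keys=True)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:8]
    return f"{name}-v{version}-{digest}"


id_k2 = make_app_id("rag", 2, {"store": "chroma", "k": 2})
id_k5 = make_app_id("rag", 2, {"store": "chroma", "k": 5})
id_k5_reordered = make_app_id("rag", 2, {"k": 5, "store": "chroma"})
print(id_k2, id_k5)
```

The resulting string can be passed as `app_id` when wrapping the app. A hash is less readable than `"rag-v2-chroma-k5"`, so a hand-written suffix may still be preferable for small experiments.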
[!TIP] Use `tru.start_dashboard()` to open the local Streamlit dashboard in the browser. It shows per-run scores, a leaderboard, and a record-level trace viewer, with no external service required.
[!TIP] The three RAG Triad metrics are complementary diagnostics, not a single score. Always evaluate all three together: a pipeline with high Answer Relevance but low Groundedness is hallucinating, while one with high Groundedness but low Context Relevance retrieved the wrong chunks.
The RAG Triad
The RAG Triad is TruLens’s core evaluation framework. Each dimension measures a different failure mode in the retrieve-then-generate pipeline.
Question → Retriever → [context chunks] → LLM → Answer

Context Relevance: question vs. retrieved context chunks (are the chunks on-topic?)
Groundedness: retrieved context chunks vs. answer (is the answer supported by context?)
Answer Relevance: question vs. answer (does the answer address the question?)
| Metric | Question it answers | Low score means… |
|---|---|---|
| Context Relevance | Are retrieved chunks relevant to the question? | Retriever is fetching noise |
| Groundedness | Is the answer supported by retrieved chunks? | LLM is hallucinating |
| Answer Relevance | Does the answer address the question? | LLM is off-topic or verbose |
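The table can be read as a small decision rule. The sketch below turns a set of triad scores into plain-language findings; the `diagnose` helper and the 0.7 cut-off are illustrative assumptions, not values defined by TruLens.

```python
def diagnose(scores: dict, threshold: float = 0.7) -> list:
    """Map RAG Triad scores to the failure modes listed in the table."""
    findings = []
    if scores["Context Relevance"] < threshold:
        findings.append("retriever is fetching off-topic chunks")
    if scores["Groundedness"] < threshold:
        findings.append("answer is not supported by the retrieved context")
    if scores["Answer Relevance"] < threshold:
        findings.append("answer does not address the question")
    return findings or ["no metric below threshold"]


weak_retrieval = {"Context Relevance": 0.42, "Groundedness": 0.91, "Answer Relevance": 0.88}
healthy = {"Context Relevance": 0.87, "Groundedness": 0.94, "Answer Relevance": 0.91}
print(diagnose(weak_retrieval))
print(diagnose(healthy))
```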
Feedback functions
Feedback functions are the building blocks of evaluation. TruLens provides a library of pre-built feedback functions for common tasks (relevance, coherence, sentiment) and lets you define custom ones.
from trulens_eval import Feedback, TruChain
from trulens_eval.feedback.provider import OpenAI as TruOpenAI
import numpy as np

provider = TruOpenAI(model_engine="gpt-4o-mini")

# `chain` is the LangChain app under evaluation (built in the TruChain section).
# select_context needs the app so it can locate the retriever's output.
context = TruChain.select_context(chain)

# Answer Relevance: does the answer address the question?
f_answer_relevance = (
    Feedback(provider.relevance, name="Answer Relevance")
    .on_input_output()
)

# Context Relevance: each retrieved chunk is scored against the question, then averaged
f_context_relevance = (
    Feedback(provider.context_relevance, name="Context Relevance")
    .on_input()
    .on(context)
    .aggregate(np.mean)
)

# Groundedness: is the answer supported by the retrieved chunks?
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(context.collect())  # judge the answer against all chunks together
    .on_output()
)
TruChain: evaluating LangChain RAG pipelines
TruChain wraps a LangChain Runnable or chain, records every invocation, and runs the configured feedback functions after each call.
from trulens_eval import Tru, TruChain, Feedback
from trulens_eval.feedback.provider import OpenAI as TruOpenAI
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
import os
import numpy as np

os.makedirs("evals", exist_ok=True)  # SQLite will not create a missing directory
tru = Tru(database_url="sqlite:///evals/trulens.sqlite")
provider = TruOpenAI(model_engine="gpt-4o-mini")

# --- build a minimal LangChain RAG chain ---
vectorstore = Chroma.from_texts(
    texts=[
        "Transformers use self-attention to process all tokens simultaneously.",
        "BERT uses masked language modelling to learn bidirectional representations.",
        "GPT trains as a left-to-right language model predicting the next token.",
    ],
    embedding=OpenAIEmbeddings(api_key=os.environ["OPENAI_API_KEY"]),
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
prompt = ChatPromptTemplate.from_template(
    "Answer using only the context below.\n\nContext: {context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# --- feedback functions ---
# select_context needs the chain so it can locate the retriever's output.
context = TruChain.select_context(chain)
f_answer_relevance = Feedback(provider.relevance, name="Answer Relevance").on_input_output()
f_context_relevance = (
    Feedback(provider.context_relevance, name="Context Relevance")
    .on_input()
    .on(context)
    .aggregate(np.mean)
)
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(context.collect())  # judge the answer against all chunks together
    .on_output()
)

# --- wrap with TruChain ---
tru_chain = TruChain(
    chain,
    app_id="langchain-rag-v1",
    feedbacks=[f_answer_relevance, f_context_relevance, f_groundedness],
)
questions = [
    "What is self-attention?",
    "How does BERT differ from GPT?",
    "What is positional encoding?",
]
with tru_chain as recording:
    for q in questions:
        chain.invoke(q)

leaderboard = tru.get_leaderboard(app_ids=["langchain-rag-v1"])
print(leaderboard)
Output (illustrative; scores and costs vary by run and judge model):
app_id Answer Relevance Context Relevance Groundedness total_cost
0 langchain-rag-v1 0.91 0.87 0.94 0.0041
TruLlama: evaluating LlamaIndex RAG pipelines
TruLlama is the LlamaIndex equivalent of TruChain — it wraps any LlamaIndex query engine or chat engine.
from trulens_eval import Tru, TruLlama, Feedback
from trulens_eval.feedback.provider import OpenAI as TruOpenAI
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI as LlamaOpenAI
import os
import numpy as np

tru = Tru()
provider = TruOpenAI(model_engine="gpt-4o-mini")

# --- build a LlamaIndex query engine ---
Settings.llm = LlamaOpenAI(model="gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=3)

# --- feedback functions ---
# Text of every source node returned by the query engine
context = TruLlama.select_source_nodes().node.text
f_answer_relevance = Feedback(provider.relevance, name="Answer Relevance").on_input_output()
f_context_relevance = (
    Feedback(provider.context_relevance, name="Context Relevance")
    .on_input()
    .on(context)
    .aggregate(np.mean)
)
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(context.collect())  # judge the answer against all chunks together
    .on_output()
)

# --- wrap with TruLlama ---
tru_query_engine = TruLlama(
    query_engine,
    app_id="llamaindex-rag-v1",
    feedbacks=[f_answer_relevance, f_context_relevance, f_groundedness],
)
with tru_query_engine as recording:
    response = query_engine.query("What are the main topics covered?")
print(response)
print(tru.get_leaderboard())
Output (illustrative; the answer depends on the files in ./docs):
A summary of the main topics in the documents.
app_id Answer Relevance Context Relevance Groundedness
0 llamaindex-rag-v1 0.93 0.89 0.96
Custom feedback functions
Any Python function that returns a float between 0.0 and 1.0 can be a feedback function.
import re

from trulens_eval import Feedback
from trulens_eval.feedback.provider import OpenAI as TruOpenAI

provider = TruOpenAI(model_engine="gpt-4o-mini")

# Built-in: conciseness check via LLM judge
f_conciseness = Feedback(provider.conciseness, name="Conciseness").on_output()

# Custom: Python-only word-count score (no LLM call)
def word_count_score(answer: str) -> float:
    """Score 1.0 for answers of up to 100 words, decaying linearly to 0.0 at 300 words."""
    words = len(answer.split())
    if words <= 100:
        return 1.0
    return max(0.0, 1.0 - (words - 100) / 200)

f_brevity = Feedback(word_count_score, name="Brevity").on_output()

# Custom: check for citation presence
def has_citation(answer: str) -> float:
    """Return 1.0 if the answer contains a citation pattern like [1] or (source:)."""
    return 1.0 if re.search(r"\[\d+\]|\(source:", answer, re.IGNORECASE) else 0.0

f_citation = Feedback(has_citation, name="Has Citation").on_output()
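A custom feedback function can also compare the question with the answer. The sketch below is a hypothetical, deliberately crude keyword-overlap score that needs no LLM call; `keyword_overlap` and its four-letter cut-off are illustrative choices, not part of TruLens.

```python
import re


def _keywords(text: str) -> set:
    """Distinct lowercase words of four or more letters."""
    return set(re.findall(r"[a-z]{4,}", text.lower()))


def keyword_overlap(question: str, answer: str) -> float:
    """Fraction of the question's keywords that reappear in the answer (0.0 to 1.0)."""
    question_words = _keywords(question)
    if not question_words:
        return 0.0
    return len(question_words & _keywords(answer)) / len(question_words)


score = keyword_overlap(
    "What is attention in transformers?",
    "Attention allows the model to weigh input tokens by relevance.",
)
print(round(score, 2))
```

Because the function takes the question first and the answer second, it would presumably be wired up like the Answer Relevance feedback, e.g. `Feedback(keyword_overlap, name="Keyword Overlap").on_input_output()`. Treat the score as a cheap smoke test rather than a substitute for an LLM judge.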
Alternate LLM providers as judges
TruLens supports Anthropic, Hugging Face, Bedrock, and local models as judge LLMs via provider wrappers.
import os

from langchain_anthropic import ChatAnthropic
from trulens_eval import Feedback
from trulens_eval.feedback.provider import Bedrock as TruBedrock
from trulens_eval.feedback.provider import Huggingface as TruHuggingface
from trulens_eval.feedback.provider import Langchain as TruLangchain

# Anthropic Claude as judge (via the LangChain provider wrapper)
provider = TruLangchain(
    chain=ChatAnthropic(model="claude-haiku-4-5-20251001", api_key=os.environ["ANTHROPIC_API_KEY"])
)

# Hugging Face provider: uses hosted classifier models instead of an LLM judge.
# It calls the Hugging Face Inference API, so it reads HUGGINGFACE_API_KEY from the environment.
hf_provider = TruHuggingface()

# Recent trulens_eval releases expose this check as `toxic` (0 = clean, 1 = toxic),
# so lower scores are better; older releases named it `not_toxic`.
f_toxicity = (
    Feedback(hf_provider.toxic, name="Toxicity", higher_is_better=False)
    .on_output()
)
The TruLens dashboard
The dashboard is a local Streamlit web app that shows all recorded runs, leaderboard scores, and per-record trace details.
from trulens_eval import Tru
tru = Tru(database_url="sqlite:///evals/trulens.sqlite")
# Launch the Streamlit dashboard as a separate process (stop it with tru.stop_dashboard())
tru.start_dashboard(port=8501, force=True)
# Or just print the leaderboard to stdout
leaderboard = tru.get_leaderboard()
print(leaderboard.to_string(index=False))
# Export all records as a DataFrame for custom analysis
records, feedback_col = tru.get_records_and_feedback(app_ids=["langchain-rag-v1"])
print(records[["input", "output", "Answer Relevance", "Groundedness"]].head())
Output:
input output Answer Relevance Groundedness
0 What is self-attention? Self-attention computes ... 0.94 0.97
1 How does BERT differ from … BERT is bidirectional w… 0.90 0.93
CI integration: fail on score regression
from trulens_eval import Tru

def test_rag_quality():
    tru = Tru(database_url="sqlite:///evals/trulens.sqlite")
    # app_id may come back as the DataFrame index; reset_index() makes it a regular column.
    leaderboard = tru.get_leaderboard(app_ids=["langchain-rag-v1"]).reset_index()
    row = leaderboard[leaderboard["app_id"] == "langchain-rag-v1"].iloc[0]
    assert row["Answer Relevance"] >= 0.85, (
        f"Answer Relevance {row['Answer Relevance']:.2f} below 0.85"
    )
    assert row["Groundedness"] >= 0.80, (
        f"Groundedness {row['Groundedness']:.2f} below 0.80"
    )
    assert row["Context Relevance"] >= 0.75, (
        f"Context Relevance {row['Context Relevance']:.2f} below 0.75"
    )
pytest test_eval.py # fails if any metric regresses below threshold
Output: pytest's pass/fail summary; the exit code is 0 only when every assertion holds.
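When several metrics are gated, it can help to collect every failure before asserting, so one CI run reports all regressions at once. The sketch below works on a plain dict of scores (for example, one leaderboard row converted with `dict(row)`); `find_regressions` is a hypothetical helper, and the thresholds simply mirror the test above.

```python
import math


def find_regressions(scores: dict, thresholds: dict) -> list:
    """Return one message per metric that is missing, NaN, or below its minimum."""
    failures = []
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        if value is None or math.isnan(value):
            # A NaN would otherwise slip through, because NaN < minimum is False.
            failures.append(f"{metric}: no score recorded")
        elif value < minimum:
            failures.append(f"{metric}: {value:.2f} below {minimum:.2f}")
    return failures


THRESHOLDS = {"Answer Relevance": 0.85, "Groundedness": 0.80, "Context Relevance": 0.75}
example_scores = {
    "Answer Relevance": 0.91,
    "Groundedness": 0.78,
    "Context Relevance": float("nan"),
}
failures = find_regressions(example_scores, THRESHOLDS)
print(failures)
```

In a test, `assert not failures, "\n".join(failures)` would then fail the build with the full list.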
Comparing pipeline versions
from trulens_eval import Tru, TruChain

# chain_k2, chain_k5, eval_questions, and the feedback functions are assumed
# to be defined as in the TruChain section, differing only in the retriever's k.
tru = Tru()

# Version A: k=2 retriever
tru_v1 = TruChain(chain_k2, app_id="rag-k2", feedbacks=[f_answer_relevance, f_groundedness])
# Version B: k=5 retriever
tru_v2 = TruChain(chain_k5, app_id="rag-k5", feedbacks=[f_answer_relevance, f_groundedness])

for question in eval_questions:
    with tru_v1 as rec:
        chain_k2.invoke(question)
    with tru_v2 as rec:
        chain_k5.invoke(question)

# Both appear side-by-side in the leaderboard
print(tru.get_leaderboard(app_ids=["rag-k2", "rag-k5"]))
Output:
app_id Answer Relevance Groundedness total_cost
rag-k2 0.86 0.88 0.0021
rag-k5 0.91 0.94 0.0034
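To read such a comparison at a glance, the per-metric differences can be computed directly. The sketch below uses the two illustrative rows above as plain dicts; `score_deltas` is a hypothetical helper, not a TruLens function.

```python
def score_deltas(baseline: dict, candidate: dict) -> dict:
    """Candidate minus baseline for every metric present in both rows."""
    shared = [metric for metric in baseline if metric in candidate]
    return {metric: round(candidate[metric] - baseline[metric], 4) for metric in shared}


rag_k2 = {"Answer Relevance": 0.86, "Groundedness": 0.88, "total_cost": 0.0021}
rag_k5 = {"Answer Relevance": 0.91, "Groundedness": 0.94, "total_cost": 0.0034}

deltas = score_deltas(rag_k2, rag_k5)
for metric, delta in deltas.items():
    print(f"{metric}: {delta:+.4f}")
```

Here the k=5 variant scores higher on both quality metrics at a higher cost, which is the trade-off the leaderboard is meant to expose.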
Quick reference
| Metric | What it measures | Low score means… |
|---|---|---|
| Answer Relevance | Does the answer address the question? | Off-topic or verbose response |
| Context Relevance | Are retrieved chunks relevant? | Retriever fetching noise |
| Groundedness | Is the answer supported by context? | LLM hallucinating |
| Task | Code |
|---|---|
| Init TruLens | tru = Tru(database_url="sqlite:///eval.sqlite") |
| Create provider | provider = TruOpenAI(model_engine="gpt-4o-mini") |
| Answer relevance | Feedback(provider.relevance).on_input_output() |
| Context relevance | Feedback(provider.context_relevance).on_input().on(TruChain.select_context(chain)).aggregate(np.mean) |
| Groundedness | Feedback(provider.groundedness_measure_with_cot_reasons).on(TruChain.select_context(chain).collect()).on_output() |
| Wrap LangChain | TruChain(chain, app_id="v1", feedbacks=[...]) |
| Wrap LlamaIndex | TruLlama(query_engine, app_id="v1", feedbacks=[...]) |
| Record run | with tru_chain as rec: chain.invoke(q) |
| Leaderboard | tru.get_leaderboard(app_ids=["v1"]) |
| Dashboard | tru.start_dashboard(port=8501) |
| Export records | tru.get_records_and_feedback(app_ids=["v1"]) |
| Reset DB | tru.reset_database() |