TruLens — LLM App Evaluation

Evaluate and monitor LLM applications with TruLens. Covers the RAG triad, feedback functions, TruChain, TruLlama, custom evaluators, the dashboard, and CI integration.


What it is

TruLens is a Python library for evaluating and monitoring LLM-powered applications, with a particular focus on RAG pipelines. It defines the RAG Triad — three feedback functions (Answer Relevance, Context Relevance, and Groundedness) that together diagnose whether a RAG system retrieves the right information and generates faithful, on-topic answers. TruLens records every LLM call, computes feedback scores automatically, and surfaces results in a local web dashboard so you can compare runs and catch regressions.

Install

pip install trulens-eval
pip install "trulens-eval[langchain]"      # LangChain integration (quoted so shells like zsh do not expand the brackets)
pip install "trulens-eval[llama-index]"    # LlamaIndex integration

Output: pip's usual install log; exits 0 on success

Quick example

from trulens_eval import Tru, TruBasicApp, Feedback
from trulens_eval.feedback.provider import OpenAI as TruOpenAI

tru = Tru()   # starts local SQLite database

provider = TruOpenAI(model_engine="gpt-4o-mini")

# Simple RAG stub
def rag(question: str) -> str:
    return "Attention allows the model to weigh input tokens by relevance."

f_answer_relevance = Feedback(provider.relevance, name="Answer Relevance").on_input_output()

tru_rag = TruBasicApp(rag, app_id="my-rag-v1", feedbacks=[f_answer_relevance])

with tru_rag as recording:
    # Call the wrapped app (not the bare function) so the invocation is recorded
    answer = tru_rag.app("What is attention in transformers?")

print(answer)
print(tru.get_leaderboard())

Output:

Attention allows the model to weigh input tokens by relevance.
           app_id  Answer Relevance  total_cost  total_tokens
0  my-rag-v1             0.92        0.0003           142

When / why to use it

Systematically evaluating RAG pipelines across the three RAG Triad dimensions: context relevance, groundedness, and answer relevance.
Comparing pipeline versions side-by-side in the built-in dashboard — swap the embedding model or chunk size, run both, compare scores.
Detecting regressions in CI: fail the build if groundedness drops below a threshold.
Debugging: low Context Relevance means the retriever is fetching off-topic chunks; low Groundedness means the LLM is ignoring retrieved context.
Building a leaderboard across LLM providers, prompt templates, or retrieval strategies.

Common pitfalls

[!WARNING] TruLens uses LLMs as judges: each feedback function makes one or more LLM calls per evaluated record, so evaluating 1 000 records × 3 feedback functions means at least 3 000 judge calls. Use a cheaper judge model (e.g. gpt-4o-mini) and reuse results already stored in the database instead of re-running them. Call tru.reset_database() only when you want a clean slate, because it deletes all recorded runs.

[!WARNING] Tru() writes to a local SQLite file — by default default.sqlite in the working directory. Set database_url to a persistent path: Tru(database_url="sqlite:///evals/trulens.sqlite"). Deleting or moving this file loses all recorded runs.

[!WARNING] App ID must be unique per pipeline version — if you reuse the same app_id across different code versions, runs are merged in the dashboard. Use versioned IDs like "rag-v2-chroma-k5" to keep experiments separate.

[!TIP] Use tru.start_dashboard() to open the local Streamlit dashboard in the browser. It shows per-run scores, a leaderboard, and a record-level trace viewer — no external service required.

[!TIP] The three RAG Triad metrics are complementary diagnostics, not a single score. Always evaluate all three together: a pipeline with high Answer Relevance but low Groundedness is hallucinating, while one with high Groundedness but low Context Relevance retrieved the wrong chunks.
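
To make the cost warning above concrete, here is a rough back-of-the-envelope estimator. It is only a sketch: the function name, the tokens-per-call figure, and the price per million tokens are placeholder assumptions, not TruLens values or real pricing, so substitute numbers from your own provider.

```python
def estimate_judge_cost(
    n_records: int,
    n_feedbacks: int,
    calls_per_feedback: float = 1.0,      # assumption: one judge call per feedback
    tokens_per_call: int = 500,           # assumption: prompt + completion tokens
    usd_per_million_tokens: float = 0.50, # placeholder price, not a real quote
) -> dict:
    """Rough estimate of judge-LLM usage for one evaluation run."""
    calls = n_records * n_feedbacks * calls_per_feedback
    tokens = calls * tokens_per_call
    usd = tokens / 1_000_000 * usd_per_million_tokens
    return {"calls": int(calls), "tokens": int(tokens), "usd": round(usd, 4)}


estimate = estimate_judge_cost(n_records=1000, n_feedbacks=3)
print(estimate)
```

Feedback functions that score each retrieved chunk separately multiply the call count further, so raising `calls_per_feedback` to the retriever's `k` gives a more conservative figure.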

The RAG Triad

The RAG Triad is TruLens’s core evaluation framework. Each dimension measures a different failure mode in the retrieve-then-generate pipeline.

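Question → Retriever → [context chunks] → LLM → Answer

Context Relevance   Question ↔ context chunks   (are the chunks relevant to the question?)
Groundedness        context chunks ↔ Answer     (is the answer supported by context?)
Answer Relevance    Question ↔ Answer           (does the answer address the question?)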

Metric	Question it answers	Low score means…
Context Relevance	Are retrieved chunks relevant to the question?	Retriever is fetching noise
Groundedness	Is the answer supported by retrieved chunks?	LLM is hallucinating
Answer Relevance	Does the answer address the question?	LLM is off-topic or verbose

Feedback functions

Feedback functions are the building blocks of evaluation. TruLens provides a library of pre-built feedback functions for common tasks (relevance, coherence, sentiment) and lets you define custom ones.

from trulens_eval import Feedback, TruChain
from trulens_eval.feedback.provider import OpenAI as TruOpenAI
import numpy as np

provider = TruOpenAI(model_engine="gpt-4o-mini")

# `chain` is the LangChain app under evaluation (built in the TruChain section below);
# select_context needs it to locate the retriever output
context = TruChain.select_context(chain)

# Answer Relevance: does the answer address the question?
f_answer_relevance = (
    Feedback(provider.relevance, name="Answer Relevance")
    .on_input_output()
)

# Context Relevance: are retrieved chunks relevant to the question?
f_context_relevance = (
    Feedback(provider.context_relevance, name="Context Relevance")
    .on_input()
    .on(context)
    .aggregate(np.mean)   # one score per chunk, averaged into a single value
)

# Groundedness: is the answer supported by the retrieved chunks?
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(context)
    .on_output()
    .aggregate(np.mean)
)
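
The `.aggregate(np.mean)` step exists because a context selector can yield several chunks, each scored separately. The sketch below mimics that flow in plain Python. `toy_relevance` is a hypothetical word-overlap scorer standing in for the LLM judge, and `score_chunks` is an illustrative helper, not a TruLens function.

```python
from statistics import mean
from typing import Callable, Iterable, List


def toy_relevance(question: str, chunk: str) -> float:
    """Stand-in for an LLM judge: fraction of question words found in the chunk."""
    q_words = {w.strip("?.,").lower() for w in question.split()}
    c_words = {w.strip("?.,").lower() for w in chunk.split()}
    return len(q_words & c_words) / len(q_words) if q_words else 0.0


def score_chunks(
    question: str,
    chunks: List[str],
    scorer: Callable[[str, str], float],
    aggregate: Callable[[Iterable[float]], float] = mean,
) -> float:
    """Score every chunk against the question, then collapse to one number."""
    per_chunk = [scorer(question, chunk) for chunk in chunks]
    return aggregate(per_chunk)


chunks = ["attention weighs tokens", "bananas are yellow"]
score = score_chunks("what is attention", chunks, toy_relevance)
print(round(score, 3))
```

Swapping `aggregate=max` answers a different question ("did at least one chunk help?"), which is why the choice of aggregator is worth making explicitly.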

TruChain — evaluating LangChain RAG pipelines

TruChain wraps a LangChain Runnable or chain, records every invocation, and runs the configured feedback functions after each call.

from trulens_eval import Tru, TruChain, Feedback
from trulens_eval.feedback.provider import OpenAI as TruOpenAI
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
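import numpy as np
import os

# SQLite does not create missing directories, so make sure evals/ exists first
os.makedirs("evals", exist_ok=True)
tru = Tru(database_url="sqlite:///evals/trulens.sqlite")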
provider = TruOpenAI(model_engine="gpt-4o-mini")

# --- build a minimal LangChain RAG chain ---
vectorstore = Chroma.from_texts(
    texts=[
        "Transformers use self-attention to process all tokens simultaneously.",
        "BERT uses masked language modelling to learn bidirectional representations.",
        "GPT trains as a left-to-right language model predicting the next token.",
    ],
    embedding=OpenAIEmbeddings(api_key=os.environ["OPENAI_API_KEY"]),
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

prompt = ChatPromptTemplate.from_template(
    "Answer using only the context below.\n\nContext: {context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# --- feedback functions ---
f_answer_relevance = Feedback(provider.relevance, name="Answer Relevance").on_input_output()
# select_context needs the chain itself to locate the retriever output
context = TruChain.select_context(chain)

f_context_relevance = (
    Feedback(provider.context_relevance, name="Context Relevance")
    .on_input()
    .on(context)
    .aggregate(np.mean)
)
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(context)
    .on_output()
    .aggregate(np.mean)
)

# --- wrap with TruChain ---
tru_chain = TruChain(
    chain,
    app_id="langchain-rag-v1",
    feedbacks=[f_answer_relevance, f_context_relevance, f_groundedness],
)

questions = [
    "What is self-attention?",
    "How does BERT differ from GPT?",
    "What is positional encoding?",
]

with tru_chain as recording:
    for q in questions:
        chain.invoke(q)

leaderboard = tru.get_leaderboard(app_ids=["langchain-rag-v1"])
print(leaderboard)

Output:

             app_id  Answer Relevance  Context Relevance  Groundedness  total_cost
0  langchain-rag-v1              0.91               0.87          0.94      0.0041

TruLlama — evaluating LlamaIndex RAG pipelines

TruLlama is the LlamaIndex equivalent of TruChain — it wraps any LlamaIndex query engine or chat engine.

from trulens_eval import Tru, TruLlama, Feedback
from trulens_eval.feedback.provider import OpenAI as TruOpenAI
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI as LlamaOpenAI
import numpy as np, os

tru = Tru()
provider = TruOpenAI(model_engine="gpt-4o-mini")

# --- build a LlamaIndex query engine ---
Settings.llm = LlamaOpenAI(model="gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])

documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=3)

# --- feedback functions ---
f_answer_relevance  = Feedback(provider.relevance, name="Answer Relevance").on_input_output()
f_context_relevance = (
    Feedback(provider.context_relevance, name="Context Relevance")
    .on_input()
    .on(TruLlama.select_source_nodes().node.text)
    .aggregate(np.mean)
)
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(TruLlama.select_source_nodes().node.text)
    .on_output()
    .aggregate(np.mean)
)

# --- wrap with TruLlama ---
tru_query_engine = TruLlama(
    query_engine,
    app_id="llamaindex-rag-v1",
    feedbacks=[f_answer_relevance, f_context_relevance, f_groundedness],
)

with tru_query_engine as recording:
    response = query_engine.query("What are the main topics covered?")
    print(response)

print(tru.get_leaderboard())

Output:

A summary of the main topics in the documents.
             app_id  Answer Relevance  Context Relevance  Groundedness
0  llamaindex-rag-v1              0.93               0.89          0.96

Custom feedback functions

Any Python function that returns a float between 0.0 and 1.0 can be a feedback function.

from trulens_eval import Feedback
from trulens_eval.feedback.provider import OpenAI as TruOpenAI

provider = TruOpenAI(model_engine="gpt-4o-mini")

# Built-in: conciseness check via LLM judge
f_conciseness = Feedback(provider.conciseness, name="Conciseness").on_output()

# Custom: Python-only word-count ratio (no LLM call)
def word_count_score(answer: str) -> float:
    """Score 1.0 for answers under 100 words, decaying for longer answers."""
    words = len(answer.split())
    if words <= 100:
        return 1.0
    return max(0.0, 1.0 - (words - 100) / 200)

f_brevity = Feedback(word_count_score, name="Brevity").on_output()

# Custom: check for citation presence
def has_citation(answer: str) -> float:
    """Returns 1.0 if the answer contains a citation pattern like [1] or (source:)."""
    import re
    return 1.0 if re.search(r"\[\d+\]|\(source:", answer, re.IGNORECASE) else 0.0

f_citation = Feedback(has_citation, name="Has Citation").on_output()
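
Since custom scorers are expected to stay within 0.0 to 1.0, it can help to enforce that range mechanically. The decorator below is a minimal sketch in plain Python. `unit_interval` and `sentence_count_score` are hypothetical names for illustration, not TruLens APIs.

```python
from functools import wraps
from typing import Callable


def unit_interval(fn: Callable[[str], float]) -> Callable[[str], float]:
    """Clamp a scorer's return value into the range [0.0, 1.0]."""
    @wraps(fn)
    def wrapper(text: str) -> float:
        return min(1.0, max(0.0, float(fn(text))))
    return wrapper


@unit_interval
def sentence_count_score(answer: str) -> float:
    """Hypothetical scorer: 1.0 up to 3 sentences, minus 0.25 per extra sentence."""
    sentences = [s for s in answer.split(".") if s.strip()]
    return 1.0 - 0.25 * max(0, len(sentences) - 3)


print(sentence_count_score("One. Two."))
print(sentence_count_score("A. B. C. D. E."))
print(sentence_count_score("x. " * 10))  # raw value is negative, clamped to 0.0
```

A scorer wrapped this way can be passed to `Feedback(...)` with `.on_output()` in the same way as `word_count_score` above.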

Alternate LLM providers as judges

TruLens supports Anthropic, Hugging Face, Bedrock, and local models as judge LLMs via provider wrappers.

import os

from langchain_anthropic import ChatAnthropic
from trulens_eval import Feedback
from trulens_eval.feedback.provider import Huggingface as TruHuggingface
from trulens_eval.feedback.provider.langchain import Langchain as TruLangchain

# Anthropic Claude as judge (via the LangChain provider wrapper)
provider = TruLangchain(
    chain=ChatAnthropic(model="claude-haiku-4-5-20251001", api_key=os.environ["ANTHROPIC_API_KEY"])
)

# Hugging Face provider: calls hosted Hugging Face inference endpoints,
# so it expects a Hugging Face API key (HUGGINGFACE_API_KEY) in the environment
hf_provider = TruHuggingface()

f_not_toxic_hf = (
    Feedback(hf_provider.not_toxic, name="Not Toxic")
    .on_output()
)

The TruLens dashboard

The dashboard is a local Streamlit web app that shows all recorded runs, leaderboard scores, and per-record trace details.

from trulens_eval import Tru

tru = Tru(database_url="sqlite:///evals/trulens.sqlite")

# Launch the dashboard; it runs in a background process, so the script continues.
# Stop it later with tru.stop_dashboard()
tru.start_dashboard(port=8501, force=True)

# Or just print the leaderboard to stdout
leaderboard = tru.get_leaderboard()
print(leaderboard.to_string(index=False))

# Export all records as a DataFrame for custom analysis
records, feedback_col = tru.get_records_and_feedback(app_ids=["langchain-rag-v1"])
print(records[["input", "output", "Answer Relevance", "Groundedness"]].head())

Output:

                        input                        output  Answer Relevance  Groundedness
0     What is self-attention?  Self-attention computes ...              0.94          0.97
1  How does BERT differ from …  BERT is bidirectional w…              0.90          0.93
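
Once records are exported, a common next step is to pull out the weakest ones for manual inspection. The sketch below shows the idea on plain dictionaries with made-up sample rows. `lowest_scoring` is a hypothetical helper. With the real DataFrame, `records.sort_values("Groundedness").head()` would serve the same purpose.

```python
from typing import Any, Dict, List


def lowest_scoring(rows: List[Dict[str, Any]], metric: str, n: int = 1) -> List[Dict[str, Any]]:
    """Return the n rows with the lowest value for `metric`, skipping unscored rows."""
    scored = [row for row in rows if row.get(metric) is not None]
    return sorted(scored, key=lambda row: row[metric])[:n]


# Made-up rows standing in for exported records
rows = [
    {"input": "q1", "Groundedness": 0.97},
    {"input": "q2", "Groundedness": 0.41},
    {"input": "q3", "Groundedness": None},  # feedback not computed yet
]
worst = lowest_scoring(rows, "Groundedness")
print(worst)
```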

CI integration — fail on score regression

from trulens_eval import Tru

def test_rag_quality():
    tru = Tru(database_url="sqlite:///evals/trulens.sqlite")
    # reset_index() turns app_id into a regular column when it is returned as the index
    leaderboard = tru.get_leaderboard(app_ids=["langchain-rag-v1"]).reset_index()

    row = leaderboard[leaderboard["app_id"] == "langchain-rag-v1"].iloc[0]

    assert row["Answer Relevance"] >= 0.85, (
        f"Answer Relevance {row['Answer Relevance']:.2f} below 0.85"
    )
    assert row["Groundedness"] >= 0.80, (
        f"Groundedness {row['Groundedness']:.2f} below 0.80"
    )
    assert row["Context Relevance"] >= 0.75, (
        f"Context Relevance {row['Context Relevance']:.2f} below 0.75"
    )

pytest test_eval.py   # fails if any metric regresses below threshold

Output: pytest's pass summary; exits 0 on success and non-zero if any assertion fails
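
Chained `assert` statements stop at the first failure, so a build log shows only one regressed metric at a time. A small helper that collects every violation can make CI output more useful. This is a sketch in plain Python. `failed_thresholds` is a hypothetical name, and the scores below are sample values rather than real results.

```python
from typing import Dict, List


def failed_thresholds(scores: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return one message per metric that is missing or below its minimum."""
    failures = []
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        if value is None:
            failures.append(f"{metric}: missing")
        elif value < minimum:
            failures.append(f"{metric}: {value:.2f} < {minimum:.2f}")
    return failures


thresholds = {"Answer Relevance": 0.85, "Groundedness": 0.80, "Context Relevance": 0.75}
scores = {"Answer Relevance": 0.91, "Groundedness": 0.72}  # sample values
failures = failed_thresholds(scores, thresholds)
print(failures)
```

In a test, `assert not failures, "; ".join(failures)` then reports all regressions in a single message.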

Comparing pipeline versions

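from trulens_eval import Tru, TruChain

# chain_k2, chain_k5, eval_questions, and the feedback functions are assumed
# to be defined as in the TruChain section above
tru = Tru()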

# Version A — k=2 retriever
tru_v1 = TruChain(chain_k2, app_id="rag-k2", feedbacks=[f_answer_relevance, f_groundedness])
# Version B — k=5 retriever
tru_v2 = TruChain(chain_k5, app_id="rag-k5", feedbacks=[f_answer_relevance, f_groundedness])

for question in eval_questions:
    with tru_v1 as rec:
        chain_k2.invoke(question)
    with tru_v2 as rec:
        chain_k5.invoke(question)

# Both appear side-by-side in the leaderboard
print(tru.get_leaderboard(app_ids=["rag-k2", "rag-k5"]))

Output:

  app_id  Answer Relevance  Groundedness  total_cost
  rag-k2              0.86          0.88      0.0021
  rag-k5              0.91          0.94      0.0034
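
Two small helpers can make comparisons like this less error-prone: one builds versioned app IDs from the settings that changed, the other reports per-metric differences between two leaderboard rows. Both are illustrative sketches. `make_app_id` and `score_deltas` are hypothetical names, and the scores are copied from the sample output above.

```python
from typing import Dict


def make_app_id(name: str, version: int, store: str, k: int) -> str:
    """Encode the pipeline settings in the app ID so runs never get merged."""
    return f"{name}-v{version}-{store}-k{k}"


def score_deltas(baseline: Dict[str, float], candidate: Dict[str, float]) -> Dict[str, float]:
    """Candidate minus baseline for every metric present in both rows."""
    return {
        metric: round(candidate[metric] - baseline[metric], 4)
        for metric in baseline
        if metric in candidate
    }


app_id = make_app_id("rag", version=2, store="chroma", k=5)
print(app_id)

rag_k2 = {"Answer Relevance": 0.86, "Groundedness": 0.88}
rag_k5 = {"Answer Relevance": 0.91, "Groundedness": 0.94}
deltas = score_deltas(rag_k2, rag_k5)
print(deltas)
```

Positive deltas mean the candidate improved on the baseline. Cost should be weighed alongside them, since the higher-`k` run in the sample output is also the more expensive one.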

Quick reference

Metric	What it measures	Low score means…
`Answer Relevance`	Does the answer address the question?	Off-topic or verbose response
`Context Relevance`	Are retrieved chunks relevant?	Retriever fetching noise
`Groundedness`	Is the answer supported by context?	LLM hallucinating

Task	Code
Init TruLens	`tru = Tru(database_url="sqlite:///eval.sqlite")`
Create provider	`provider = TruOpenAI(model_engine="gpt-4o-mini")`
Answer relevance	`Feedback(provider.relevance).on_input_output()`
Context relevance	`Feedback(provider.context_relevance).on_input().on(TruChain.select_context(chain)).aggregate(np.mean)`
Groundedness	`Feedback(provider.groundedness_measure_with_cot_reasons).on(TruChain.select_context(chain)).on_output()`
Wrap LangChain	`TruChain(chain, app_id="v1", feedbacks=[...])`
Wrap LlamaIndex	`TruLlama(query_engine, app_id="v1", feedbacks=[...])`
Record run	`with tru_chain as rec: chain.invoke(q)`
Leaderboard	`tru.get_leaderboard(app_ids=["v1"])`
Dashboard	`tru.start_dashboard(port=8501)`
Export records	`tru.get_records_and_feedback(app_ids=["v1"])`
Reset DB	`tru.reset_database()`
