Faithfulness Evaluator

Overview

The FaithfulnessEvaluator evaluates whether agent responses are grounded in the conversation history. It assesses if the agent's statements are faithful to the information available in the preceding context, helping detect hallucinations and unsupported claims. A complete example can be found here.

Key Features

Trace-Level Evaluation: Evaluates the most recent turn in the conversation
Context Grounding: Checks if responses are based on conversation history
Categorical Scoring: Five-level scale from "Not At All" to "Completely Yes"
Structured Reasoning: Provides step-by-step reasoning for each evaluation
Async Support: Supports both synchronous and asynchronous evaluation
Hallucination Detection: Identifies fabricated or unsupported information

When to Use

Use the FaithfulnessEvaluator when you need to:

Detect hallucinations in agent responses
Verify that responses are grounded in available context
Ensure agents don't fabricate information
Validate that claims are supported by conversation history
Assess information accuracy in multi-turn conversations
Debug issues with context adherence

Evaluation Level

This evaluator operates at the TRACE_LEVEL, meaning it evaluates the most recent turn in the conversation (the last agent response and its context).

Parameters

`model` (optional)

Type: Union[Model, str, None]
Default: None (uses default Bedrock model)
Description: The model to use as the judge. Can be a model ID string or a Model instance.

`system_prompt` (optional)

Type: str | None
Default: None (uses built-in template)
Description: Custom system prompt to guide the judge model's behavior.

Scoring System

The evaluator uses a five-level categorical scoring system:

Not At All (0.0): Response contains significant fabrications or unsupported claims
Not Generally (0.25): Response is mostly unfaithful with some grounded elements
Neutral/Mixed (0.5): Response has both faithful and unfaithful elements
Generally Yes (0.75): Response is mostly faithful with minor issues
Completely Yes (1.0): Response is completely grounded in conversation history

A response passes the evaluation if the score is >= 0.5.
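The label-to-score mapping and the pass rule above can be written out directly. The sketch below is a plain-Python illustration of the scale as documented here, not the library's implementation; `LABEL_SCORES`, `PASS_THRESHOLD`, and `score_label` are hypothetical names introduced only for this example.

```python
# Illustrative only: these names are hypothetical and not part of strands_evals.
LABEL_SCORES = {
    "Not At All": 0.0,
    "Not Generally": 0.25,
    "Neutral/Mixed": 0.5,
    "Generally Yes": 0.75,
    "Completely Yes": 1.0,
}
PASS_THRESHOLD = 0.5  # a response passes when score >= 0.5


def score_label(label: str) -> tuple[float, bool]:
    """Return (score, passed) for one of the five categorical labels."""
    score = LABEL_SCORES[label]
    return score, score >= PASS_THRESHOLD


for label in LABEL_SCORES:
    score, passed = score_label(label)
    print(f"{label:<15} {score:.2f} {'pass' if passed else 'fail'}")
```

Because the threshold is inclusive, "Neutral/Mixed" counts as a pass, so a response that mixes grounded and unsupported claims can still pass. If that is too lenient for your use case, apply a stricter cutoff to the returned score in your own reporting.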

Basic Usage

from strands import Agent
from strands_evals import Case, Experiment
from strands_evals.evaluators import FaithfulnessEvaluator
from strands_evals.mappers import StrandsInMemorySessionMapper
from strands_evals.telemetry import StrandsEvalsTelemetry

# Setup telemetry
telemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()
memory_exporter = telemetry.in_memory_exporter

# Define task function
def user_task_function(case: Case) -> dict:
    memory_exporter.clear()

    agent = Agent(
        trace_attributes={
            "gen_ai.conversation.id": case.session_id,
            "session.id": case.session_id
        },
        callback_handler=None
    )
    agent_response = agent(case.input)

    # Map spans to session
    finished_spans = memory_exporter.get_finished_spans()
    mapper = StrandsInMemorySessionMapper()
    session = mapper.map_to_session(finished_spans, session_id=case.session_id)

    return {"output": str(agent_response), "trajectory": session}

# Create test cases
test_cases = [
    Case[str, str](
        name="knowledge-1",
        input="What is the capital of France?",
        metadata={"category": "knowledge"}
    ),
    Case[str, str](
        name="knowledge-2",
        input="What color is the ocean?",
        metadata={"category": "knowledge"}
    ),
]

# Create evaluator
evaluator = FaithfulnessEvaluator()

# Run evaluation
experiment = Experiment[str, str](cases=test_cases, evaluators=[evaluator])
reports = experiment.run_evaluations(user_task_function)
reports[0].run_display()

Evaluation Output

The FaithfulnessEvaluator returns EvaluationOutput objects with:

score: Float taking one of five values: 0.0, 0.25, 0.5, 0.75, or 1.0
test_pass: True if score >= 0.5, False otherwise
reason: Step-by-step reasoning explaining the evaluation
label: One of the categorical labels (e.g., "Completely Yes", "Neutral/Mixed")
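To show how these four fields might be consumed downstream, the sketch below uses a hypothetical stand-in dataclass, `EvalResult`, that mirrors the field names listed above. It is not the library's EvaluationOutput class, and the sample values are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Hypothetical stand-in mirroring the four documented fields."""
    score: float
    test_pass: bool
    reason: str
    label: str


def failing_results(results: list[EvalResult]) -> list[EvalResult]:
    """Keep only the results that did not pass, lowest score first."""
    return sorted((r for r in results if not r.test_pass), key=lambda r: r.score)


# Invented sample data, for illustration only.
results = [
    EvalResult(1.0, True, "All claims appear in the tool output.", "Completely Yes"),
    EvalResult(0.25, False, "Adds a creation date absent from the context.", "Not Generally"),
    EvalResult(0.0, False, "Cites a source that was never retrieved.", "Not At All"),
]

for r in failing_results(results):
    print(f"[{r.label}] {r.score:.2f}: {r.reason}")
```

Sorting failures by score puts the least grounded responses first, which is usually where the `reason` text is most useful for debugging.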

What Gets Evaluated

The evaluator examines:

Conversation History: All prior messages and tool executions
Assistant's Response: The most recent agent response
Context Grounding: Whether claims in the response are supported by the history

The judge determines if the agent's statements are faithful to the available information or if they contain fabrications, assumptions, or unsupported claims.

Best Practices

Use with Proper Telemetry Setup: The evaluator requires trajectory information captured via OpenTelemetry
Provide Complete Context: Ensure full conversation history is captured in traces
Test with Known Facts: Include test cases with verifiable information
Monitor Hallucination Patterns: Track which types of queries lead to unfaithful responses
Combine with Other Evaluators: Use alongside output quality evaluators for comprehensive assessment
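As one way to monitor hallucination patterns, scores can be grouped by the `category` key that the Basic Usage cases store in `metadata`. The sketch below is a minimal, library-independent illustration: the `records` list and the `unfaithful_rate_by_category` helper are hypothetical, and the scores are made-up sample data.

```python
from collections import defaultdict

# Hypothetical records: (case metadata, faithfulness score). In practice these
# would be collected from your own evaluation runs.
records = [
    ({"category": "knowledge"}, 1.0),
    ({"category": "knowledge"}, 0.75),
    ({"category": "search"}, 0.25),
    ({"category": "search"}, 0.5),
    ({"category": "search"}, 0.0),
]


def unfaithful_rate_by_category(records, threshold=0.5):
    """Return {category: fraction of cases scoring below the threshold}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for metadata, score in records:
        category = metadata.get("category", "uncategorized")
        totals[category] += 1
        if score < threshold:
            failures[category] += 1
    return {c: failures[c] / totals[c] for c in totals}


rates = unfaithful_rate_by_category(records)
# Print the categories with the highest share of unfaithful responses first.
for category, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{category}: {rate:.0%} unfaithful")
```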

Common Patterns

Pattern 1: Detecting Fabrications

Identify when agents make up information not present in the context.

Pattern 2: Validating Tool Results

Ensure agents accurately represent information from tool calls.

Pattern 3: Multi-Turn Consistency

Check that agents maintain consistency across conversation turns.

Example Scenarios

Scenario 1: Faithful Response

User: "What did the search results say about Python?"
Agent: "The search results indicated that Python is a high-level programming language."
Evaluation: Completely Yes (1.0) - Response accurately reflects search results

Scenario 2: Unfaithful Response

User: "What did the search results say about Python?"
Agent: "Python was created in 1991 by Guido van Rossum and is the most popular language."
Evaluation: Not Generally (0.25) - Response adds information not in search results

Scenario 3: Mixed Response

User: "What did the search results say about Python?"
Agent: "The search results showed Python is a programming language. It's also the fastest language."
Evaluation: Neutral/Mixed (0.5) - First part faithful, second part unsupported

Common Issues and Solutions

Issue 1: No Evaluation Returned

Problem: Evaluator returns empty results.
Solution: Ensure the trajectory contains at least one agent invocation span.
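One way to surface this problem early is to fail fast inside the task function when no spans were exported. The helper below is a hypothetical sketch, not part of the library. It only checks that the span collection is non-empty and does not verify that an agent invocation span is among them.

```python
from typing import Sequence


def require_spans(finished_spans: Sequence[object], case_name: str) -> Sequence[object]:
    """Raise a descriptive error when no spans were captured for a case."""
    if not finished_spans:
        raise ValueError(
            f"No spans captured for case {case_name!r}; "
            "check that telemetry is set up before the agent runs."
        )
    return finished_spans
```

In the Basic Usage task function, this could wrap the result of `memory_exporter.get_finished_spans()` before it is passed to the mapper, assuming the case name is available as `case.name`.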

Issue 2: Overly Strict Evaluation

Problem: Evaluator marks reasonable inferences as unfaithful.
Solution: Review the system prompt and consider whether the agent is expected to make reasonable inferences.

Issue 3: Context Not Captured

Problem: Evaluation doesn't consider the full conversation history.
Solution: Verify that the telemetry setup captures all messages and tool executions.

Related Evaluators

HelpfulnessEvaluator: Evaluates helpfulness from the user's perspective
OutputEvaluator: Evaluates overall output quality
ToolParameterAccuracyEvaluator: Evaluates if tool parameters are grounded in context
GoalSuccessRateEvaluator: Evaluates if overall goals were achieved