# Output Evaluator

## Overview
The OutputEvaluator is an LLM-based evaluator that assesses the quality of agent outputs against custom criteria. It uses a judge LLM to evaluate responses based on a user-defined rubric, making it ideal for evaluating subjective qualities like safety, relevance, accuracy, and completeness. A complete example can be found here.
## Key Features
- Flexible Rubric System: Define custom evaluation criteria tailored to your use case
- LLM-as-a-Judge: Leverages a language model to perform nuanced evaluations
- Structured Output: Returns standardized evaluation results with scores and reasoning
- Async Support: Supports both synchronous and asynchronous evaluation
- Input Context: Optionally includes input prompts in the evaluation for context-aware scoring
## When to Use
Use the OutputEvaluator when you need to:
- Evaluate subjective qualities of agent responses (e.g., helpfulness, safety, tone)
- Assess whether outputs meet specific business requirements
- Check for policy compliance or content guidelines
- Compare different agent configurations or prompts
- Evaluate responses where ground truth is not available or difficult to define
## Parameters
### rubric (required)

- Type: `str`
- Description: The evaluation criteria that define what constitutes a good response. Should include explicit scoring guidelines (e.g., "Score 1 if..., 0.5 if..., 0 if...").
### model (optional)

- Type: `Union[Model, str, None]`
- Default: `None` (uses the default Bedrock model)
- Description: The model to use as the judge. Can be a model ID string or a `Model` instance.
### system_prompt (optional)

- Type: `str`
- Default: Built-in template
- Description: Custom system prompt to guide the judge model's behavior. If not provided, uses a default template optimized for evaluation.
### include_inputs (optional)

- Type: `bool`
- Default: `True`
- Description: Whether to include the input prompt in the evaluation context. Set to `False` if you only want to evaluate the output in isolation.
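Putting the parameters together, here is a minimal sketch; the Bedrock model ID and the system prompt text below are placeholders for illustration, not the library's defaults:

```python
from strands_evals.evaluators import OutputEvaluator

evaluator = OutputEvaluator(
    rubric="Score 1.0 if the response is polite and on-topic, 0.0 otherwise.",
    model="us.amazon.nova-pro-v1:0",  # placeholder model ID; any judge model works
    system_prompt="You are a strict but fair evaluator. Follow the rubric exactly.",
    include_inputs=False,  # judge the output alone, without the input prompt
)
```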
## Basic Usage
```python
from strands import Agent
from strands_evals import Case, Experiment
from strands_evals.evaluators import OutputEvaluator

# Define your task function
def get_response(case: Case) -> str:
    agent = Agent(
        system_prompt="You are a helpful assistant.",
        callback_handler=None
    )
    response = agent(case.input)
    return str(response)

# Create test cases
test_cases = [
    Case[str, str](
        name="greeting",
        input="Hello, how are you?",
        expected_output="A friendly greeting response",
        metadata={"category": "conversation"}
    ),
]

# Create evaluator with custom rubric
evaluator = OutputEvaluator(
    rubric="""
    Evaluate the response based on:
    1. Accuracy - Is the information correct?
    2. Completeness - Does it fully answer the question?
    3. Clarity - Is it easy to understand?

    Score 1.0 if all criteria are met excellently.
    Score 0.5 if some criteria are partially met.
    Score 0.0 if the response is inadequate.
    """,
    include_inputs=True
)

# Create and run experiment
experiment = Experiment[str, str](cases=test_cases, evaluators=[evaluator])
reports = experiment.run_evaluations(get_response)
reports[0].run_display()
```
## Evaluation Output

The OutputEvaluator returns `EvaluationOutput` objects with:

- `score`: Float between 0.0 and 1.0 representing the evaluation score
- `test_pass`: Boolean indicating whether the test passed (based on a score threshold)
- `reason`: String containing the judge's reasoning for the score
- `label`: Optional label categorizing the result
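As a rough illustration of working with these fields (the attribute names come from the list above; how you retrieve individual `EvaluationOutput` objects from a report is not covered here), a small helper could format a result for logging:

```python
def summarize(result) -> str:
    """Format an EvaluationOutput for logging (hypothetical helper)."""
    status = "PASS" if result.test_pass else "FAIL"
    label = f" [{result.label}]" if result.label else ""
    return f"{status} score={result.score:.2f}{label}: {result.reason}"
```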
## Best Practices
- Write Clear, Specific Rubrics: Include explicit scoring criteria and examples
- Use Appropriate Judge Models: Consider using stronger models for complex evaluations
- Include Input Context When Relevant: Set `include_inputs=True` for context-dependent evaluation
- Validate Your Rubric: Test with known good and bad examples to ensure expected scores (see the sketch after this list)
- Combine with Other Evaluators: Use alongside trajectory and tool evaluators for comprehensive assessment
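One way to validate a rubric is to score canned outputs you already consider good or bad and check that the judge agrees. The sketch below reuses the `Case`/`Experiment` pattern from Basic Usage; the calibration examples and expected scores are illustrative, and the exact report accessors for reading scores back are not shown here:

```python
from strands_evals import Case, Experiment
from strands_evals.evaluators import OutputEvaluator

# Known-good and known-bad outputs paired with the score we expect the judge to give.
calibration = [
    ("What is 2 + 2?", "2 + 2 equals 4.", 1.0),   # clearly adequate
    ("What is 2 + 2?", "I like turtles.", 0.0),   # clearly inadequate
]

evaluator = OutputEvaluator(
    rubric="Score 1.0 if the answer is correct and on-topic, 0.0 otherwise."
)

for prompt, canned_output, expected in calibration:
    case = Case[str, str](name="rubric-calibration", input=prompt)
    experiment = Experiment[str, str](cases=[case], evaluators=[evaluator])
    # Return the canned output instead of calling a live agent, so only the
    # judge's scoring of a known response is exercised.
    reports = experiment.run_evaluations(lambda c, out=canned_output: out)
    # Compare the reported score against `expected` here and adjust the
    # rubric if the judge disagrees with your calibration examples.
```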
## Related Evaluators
- TrajectoryEvaluator: Evaluates the sequence of actions/tools used
- FaithfulnessEvaluator: Checks if responses are grounded in conversation history
- HelpfulnessEvaluator: Specifically evaluates helpfulness from user perspective
- GoalSuccessRateEvaluator: Evaluates if user goals were achieved