Interactions Evaluator¶
Overview¶
The InteractionsEvaluator is designed for evaluating interactions between agents or components in multi-agent systems or complex workflows. It assesses each interaction step-by-step, considering dependencies, message flow, and the overall sequence of interactions.
Key Features¶
- Interaction-Level Evaluation: Evaluates each interaction in a sequence
- Multi-Agent Support: Designed for evaluating multi-agent systems and workflows
- Node-Specific Rubrics: Supports different evaluation criteria for different nodes/agents
- Sequential Context: Maintains context across interactions using a sliding window
- Dependency Tracking: Considers dependencies between interactions
- Async Support: Supports both synchronous and asynchronous evaluation
When to Use¶
Use the InteractionsEvaluator when you need to:
- Evaluate multi-agent system interactions
- Assess workflow execution across multiple components
- Validate message passing between agents
- Ensure proper dependency handling in complex systems
- Track interaction quality in agent orchestration
- Debug multi-agent coordination issues
Parameters¶
rubric (required)¶
- Type: str | dict[str, str]
- Description: Evaluation criteria. Can be a single string applied to all nodes or a dictionary mapping node names to node-specific rubrics.
interaction_description (optional)¶
- Type: dict | None
- Default: None
- Description: A dictionary describing the available interactions. Can be updated dynamically using update_interaction_description().
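For example, the description can be set after constructing the evaluator and refreshed later via update_interaction_description(). The shape of the dictionary below (node names mapped to short descriptions) is an assumption for illustration; adapt it to your workflow.
from strands_evals.evaluators import InteractionsEvaluator

evaluator = InteractionsEvaluator(rubric="Evaluate each interaction for correctness.")

# Assumed description format: node name -> what that node is expected to do
evaluator.update_interaction_description({
    "planner": "Breaks the task into ordered steps",
    "executor": "Carries out the planned steps",
    "validator": "Checks the executor's results"
})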
model (optional)¶
- Type: Union[Model, str, None]
- Default: None (uses the default Bedrock model)
- Description: The model to use as the judge. Can be a model ID string or a Model instance.
system_prompt (optional)¶
- Type: str
- Default: Built-in template
- Description: Custom system prompt to guide the judge model's behavior.
include_inputs (optional)¶
- Type: bool
- Default: True
- Description: Whether to include the case inputs in the evaluation context.
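The optional parameters can be combined when constructing the evaluator. A minimal sketch; the model ID string is an assumption, and any supported model ID or Model instance can be passed.
from strands_evals.evaluators import InteractionsEvaluator

evaluator = InteractionsEvaluator(
    rubric="Evaluate each interaction for correct ordering and clear messages.",
    model="anthropic.claude-3-5-sonnet-20241022-v2:0",  # assumed model ID; a Model instance also works
    system_prompt="You are a strict judge of multi-agent interactions.",
    include_inputs=True
)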
Interaction Structure¶
Each interaction should contain:
- node_name: Name of the agent/component involved
- dependencies: List of nodes this interaction depends on
- messages: Messages exchanged in this interaction
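For example, a single interaction following this structure might look like:
interaction = {
    "node_name": "executor",           # agent/component involved in this step
    "dependencies": ["planner"],       # nodes this interaction depends on
    "messages": "Executed plan steps"  # messages exchanged in this interaction
}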
Basic Usage¶
from strands_evals import Case, Experiment
from strands_evals.evaluators import InteractionsEvaluator
# Define task function that returns interactions
def multi_agent_task(case: Case) -> dict:
    # Execute multi-agent workflow
    # ...
    # Return interactions
    interactions = [
        {
            "node_name": "planner",
            "dependencies": [],
            "messages": "Created execution plan"
        },
        {
            "node_name": "executor",
            "dependencies": ["planner"],
            "messages": "Executed plan steps"
        },
        {
            "node_name": "validator",
            "dependencies": ["executor"],
            "messages": "Validated results"
        }
    ]
    return {
        "output": "Task completed",
        "interactions": interactions
    }
# Create test cases
test_cases = [
    Case[str, str](
        name="workflow-1",
        input="Process data pipeline",
        expected_interactions=[
            {"node_name": "planner", "dependencies": [], "messages": "Plan created"},
            {"node_name": "executor", "dependencies": ["planner"], "messages": "Executed"},
            {"node_name": "validator", "dependencies": ["executor"], "messages": "Validated"}
        ],
        metadata={"category": "workflow"}
    ),
]
# Create evaluator with single rubric for all nodes
evaluator = InteractionsEvaluator(
    rubric="""
    Evaluate the interaction based on:
    1. Correct node execution order
    2. Proper dependency handling
    3. Clear message communication
    Score 1.0 if all criteria are met.
    Score 0.5 if some issues exist.
    Score 0.0 if interaction is incorrect.
    """
)
# Or use node-specific rubrics
evaluator = InteractionsEvaluator(
    rubric={
        "planner": "Evaluate if planning is thorough and logical",
        "executor": "Evaluate if execution follows the plan correctly",
        "validator": "Evaluate if validation is comprehensive"
    }
)
# Run evaluation
experiment = Experiment[str, str](cases=test_cases, evaluators=[evaluator])
reports = experiment.run_evaluations(multi_agent_task)
reports[0].run_display()
Evaluation Output¶
The InteractionsEvaluator returns a list of EvaluationOutput objects (one per interaction) with:
- score: Float between 0.0 and 1.0 for each interaction
- test_pass: Boolean indicating if the interaction passed
- reason: Step-by-step reasoning for the evaluation
- label: Optional label categorizing the result
The final interaction's evaluation includes context from all previous interactions.
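As a rough sketch, the per-interaction outputs can be inspected after running an experiment. The attribute holding them on the report object is an assumption here (shown as results); check the report API in your installed version for the exact name.
for output in reports[0].results:  # assumed attribute exposing the per-interaction EvaluationOutput objects
    print(f"score={output.score}, passed={output.test_pass}")
    print(f"reason: {output.reason}")
    if output.label:
        print(f"label: {output.label}")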
What Gets Evaluated¶
For each interaction, the evaluator examines:
- Current Interaction: Node name, dependencies, and messages
- Expected Sequence: Overview of the expected interaction sequence
- Relevant Expected Interactions: Window of expected interactions around current position
- Previous Evaluations: Context from earlier interactions (for later interactions)
- Final Output: Overall output (only for the last interaction)
Best Practices¶
- Define Clear Interaction Structure: Ensure interactions have consistent node_name, dependencies, and messages
- Use Node-Specific Rubrics: Provide tailored evaluation criteria for different agent types
- Track Dependencies: Clearly specify which nodes depend on others
- Update Descriptions: Use update_interaction_description() to provide context about the available interactions
- Test Sequences: Include test cases with various interaction patterns
Common Patterns¶
Pattern 1: Linear Workflow¶
interactions = [
    {"node_name": "input_validator", "dependencies": [], "messages": "Input validated"},
    {"node_name": "processor", "dependencies": ["input_validator"], "messages": "Data processed"},
    {"node_name": "output_formatter", "dependencies": ["processor"], "messages": "Output formatted"}
]
Pattern 2: Parallel Execution¶
interactions = [
    {"node_name": "coordinator", "dependencies": [], "messages": "Tasks distributed"},
    {"node_name": "worker_1", "dependencies": ["coordinator"], "messages": "Task 1 completed"},
    {"node_name": "worker_2", "dependencies": ["coordinator"], "messages": "Task 2 completed"},
    {"node_name": "aggregator", "dependencies": ["worker_1", "worker_2"], "messages": "Results aggregated"}
]
Pattern 3: Conditional Flow¶
interactions = [
    {"node_name": "analyzer", "dependencies": [], "messages": "Analysis complete"},
    {"node_name": "decision_maker", "dependencies": ["analyzer"], "messages": "Decision: proceed"},
    {"node_name": "executor", "dependencies": ["decision_maker"], "messages": "Action executed"}
]
Example Scenarios¶
Scenario 1: Successful Multi-Agent Workflow¶
# Task: Research and summarize a topic
interactions = [
    {
        "node_name": "researcher",
        "dependencies": [],
        "messages": "Found 5 relevant sources"
    },
    {
        "node_name": "analyzer",
        "dependencies": ["researcher"],
        "messages": "Extracted key points from sources"
    },
    {
        "node_name": "writer",
        "dependencies": ["analyzer"],
        "messages": "Created comprehensive summary"
    }
]
# Evaluation: Each interaction scored based on quality and dependency adherence
Scenario 2: Failed Dependency¶
# Task: Process data pipeline
interactions = [
    {
        "node_name": "validator",
        "dependencies": [],
        "messages": "Validation skipped"  # Should depend on data_loader
    },
    {
        "node_name": "processor",
        "dependencies": ["validator"],
        "messages": "Processing failed"
    }
]
# Evaluation: Low scores due to incorrect dependency handling
Common Issues and Solutions¶
Issue 1: Missing Interaction Keys¶
Problem: Interactions are missing required keys (node_name, dependencies, messages). Solution: Ensure every interaction includes all three fields.
Issue 2: Incorrect Dependency Specification¶
Problem: Dependencies don't match the actual execution order. Solution: Verify that each dependency list accurately reflects the workflow.
Issue 3: Rubric Key Mismatch¶
Problem: The node-specific rubric dictionary is missing keys for some nodes. Solution: Ensure the rubric dictionary contains an entry for every node name, or use a single string rubric. A simple coverage check is sketched below.
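A minimal sketch of that check, assuming interactions is the list of interaction dictionaries your task returns:
from strands_evals.evaluators import InteractionsEvaluator

node_rubrics = {
    "planner": "Evaluate if planning is thorough and logical",
    "executor": "Evaluate if execution follows the plan correctly"
}

# interactions: the interaction dicts produced by your task function
expected_nodes = {step["node_name"] for step in interactions}
missing = expected_nodes - node_rubrics.keys()
if missing:
    raise ValueError(f"Rubric is missing entries for nodes: {missing}")

evaluator = InteractionsEvaluator(rubric=node_rubrics)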
Use Cases¶
Use Case 1: Multi-Agent Orchestration¶
Evaluate coordination between multiple specialized agents.
Use Case 2: Workflow Validation¶
Assess execution of complex, multi-step workflows.
Use Case 3: Agent Handoff Quality¶
Measure quality of information transfer between agents.
Use Case 4: Dependency Compliance¶
Verify that agents respect declared dependencies.
Related Evaluators¶
- TrajectoryEvaluator: Evaluates tool call sequences (single agent)
- GoalSuccessRateEvaluator: Evaluates overall goal achievement
- OutputEvaluator: Evaluates final output quality
- HelpfulnessEvaluator: Evaluates individual response helpfulness