# Tool Parameter Accuracy Evaluator

## Overview
The ToolParameterAccuracyEvaluator is a specialized evaluator that assesses whether tool call parameters faithfully use information from the preceding conversation context. It evaluates each tool call individually to ensure parameters are grounded in available information rather than hallucinated or incorrectly inferred. A complete example can be found here.
## Key Features
- Tool-Level Evaluation: Evaluates each tool call independently
- Context Faithfulness: Checks if parameters are derived from conversation history
- Binary Scoring: Simple Yes/No evaluation for clear pass/fail criteria
- Structured Reasoning: Provides step-by-step reasoning for each evaluation
- Async Support: Supports both synchronous and asynchronous evaluation
- Multiple Evaluations: Returns one evaluation result per tool call
## When to Use
Use the ToolParameterAccuracyEvaluator when you need to:
- Verify that tool parameters are based on actual conversation context
- Detect hallucinated or fabricated parameter values
- Ensure agents don't make assumptions beyond available information
- Validate that agents correctly extract information for tool calls
- Debug issues with incorrect tool parameter usage
- Ensure data integrity in tool-based workflows
## Evaluation Level
This evaluator operates at the TOOL_LEVEL, meaning it evaluates each individual tool call in the trajectory separately. If an agent makes 3 tool calls, you'll receive 3 evaluation results.
## Parameters

### model (optional)

- Type: `Union[Model, str, None]`
- Default: `None` (uses the default Bedrock model)
- Description: The model to use as the judge. Can be a model ID string or a `Model` instance.

### system_prompt (optional)

- Type: `str | None`
- Default: `None` (uses the built-in template)
- Description: Custom system prompt to guide the judge model's behavior.
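
A minimal sketch of overriding these parameters when constructing the evaluator. The model ID and prompt text below are illustrative choices, not library defaults:

```python
from strands_evals.evaluators import ToolParameterAccuracyEvaluator

# Default judge model and built-in prompt template
evaluator = ToolParameterAccuracyEvaluator()

# Customized judge: a Bedrock model ID string plus a stricter system prompt.
# Both values are illustrative; substitute your own judge model and guidance.
strict_evaluator = ToolParameterAccuracyEvaluator(
    model="us.anthropic.claude-3-7-sonnet-20250219-v1:0",
    system_prompt=(
        "You are a strict judge. Answer 'Yes' only when every parameter value "
        "is explicitly supported by the preceding conversation."
    ),
)
```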
## Scoring System
The evaluator uses a binary scoring system:
- Yes (1.0): Parameters faithfully use information from the context
- No (0.0): Parameters contain hallucinated, fabricated, or incorrectly inferred values
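
Because every result is either 1.0 or 0.0, aggregate metrics reduce to simple counting. A library-independent sketch:

```python
def parameter_accuracy_rate(scores: list[float]) -> float:
    """Fraction of tool calls judged 'Yes' (score 1.0)."""
    if not scores:
        return 0.0
    return sum(1 for score in scores if score == 1.0) / len(scores)

# Example: three tool calls, one with a hallucinated parameter
print(parameter_accuracy_rate([1.0, 0.0, 1.0]))  # ~0.67
```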
## Basic Usage

```python
from strands import Agent
from strands_tools import calculator
from strands_evals import Case, Experiment
from strands_evals.evaluators import ToolParameterAccuracyEvaluator
from strands_evals.mappers import StrandsInMemorySessionMapper
from strands_evals.telemetry import StrandsEvalsTelemetry

# Setup telemetry
telemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()
memory_exporter = telemetry.in_memory_exporter

# Define task function
def user_task_function(case: Case) -> dict:
    memory_exporter.clear()

    agent = Agent(
        trace_attributes={
            "gen_ai.conversation.id": case.session_id,
            "session.id": case.session_id
        },
        tools=[calculator],
        callback_handler=None
    )
    agent_response = agent(case.input)

    # Map spans to session
    finished_spans = memory_exporter.get_finished_spans()
    mapper = StrandsInMemorySessionMapper()
    session = mapper.map_to_session(finished_spans, session_id=case.session_id)

    return {"output": str(agent_response), "trajectory": session}

# Create test cases
test_cases = [
    Case[str, str](
        name="simple-calculation",
        input="Calculate the square root of 144",
        metadata={"category": "math", "difficulty": "easy"}
    ),
]

# Create evaluator
evaluator = ToolParameterAccuracyEvaluator()

# Run evaluation
experiment = Experiment[str, str](cases=test_cases, evaluators=[evaluator])
reports = experiment.run_evaluations(user_task_function)
reports[0].run_display()
```
## Evaluation Output

The `ToolParameterAccuracyEvaluator` returns a list of `EvaluationOutput` objects (one per tool call) with:

- `score`: `1.0` (Yes) or `0.0` (No)
- `test_pass`: `True` if score is 1.0, `False` otherwise
- `reason`: Step-by-step reasoning explaining the evaluation
- `label`: "Yes" or "No"
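
A sketch for inspecting these fields per tool call. How you reach the `EvaluationOutput` entries depends on the report object: `report.results` below is a hypothetical accessor, so adapt it to whatever your report exposes alongside `run_display()`.

```python
report = reports[0]

# Hypothetical accessor: replace `report.results` with however your report
# object exposes its per-tool-call EvaluationOutput entries.
for output in report.results:
    status = "PASS" if output.test_pass else "FAIL"
    print(f"[{status}] label={output.label} score={output.score}")
    print(f"  reason: {output.reason}")
```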
## What Gets Evaluated
The evaluator examines:
- Available Tools: The tools that were available to the agent
- Previous Conversation History: All prior messages and tool executions
- Target Tool Call: The specific tool call being evaluated, including:
    - Tool name
    - All parameter values
The judge determines if each parameter value can be traced back to information in the conversation history.
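
For example, a test case that deliberately withholds a value makes ungrounded parameters easy to surface: if the user never provides a recipient address and the agent calls an email tool with one filled in, that parameter cannot be traced to the conversation and should be judged "No". A sketch of such a probe case (the scenario and the email tool are illustrative):

```python
from strands_evals import Case

# Illustrative probe: the input never states a recipient, so any concrete
# 'recipient' value the agent passes to an email tool would be ungrounded.
hallucination_probe = Case[str, str](
    name="missing-recipient",
    input="Send my weekly report as an email.",
    metadata={"category": "email", "expected_issue": "ungrounded recipient parameter"},
)
```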
## Best Practices
- Use with Proper Telemetry Setup: The evaluator requires trajectory information captured via OpenTelemetry
- Test Edge Cases: Include test cases that challenge parameter accuracy (missing info, ambiguous info, etc.)
- Combine with Other Evaluators: Use alongside tool selection and output evaluators for a comprehensive assessment (see the sketch after this list)
- Review Reasoning: Always review the reasoning provided in evaluation results
- Use Appropriate Models: Consider using stronger models for evaluation
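
A minimal sketch of combining evaluators in a single experiment. It assumes `ToolSelectionAccuracyEvaluator` lives in the same module and accepts default construction like `ToolParameterAccuracyEvaluator` does; check your evaluator's constructor before copying this.

```python
from strands_evals import Experiment
from strands_evals.evaluators import (
    ToolParameterAccuracyEvaluator,
    ToolSelectionAccuracyEvaluator,  # assumed: same module, no-arg constructor
)

experiment = Experiment[str, str](
    cases=test_cases,
    evaluators=[
        ToolParameterAccuracyEvaluator(),  # are the parameters grounded in context?
        ToolSelectionAccuracyEvaluator(),  # was the right tool chosen?
    ],
)
reports = experiment.run_evaluations(user_task_function)
```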
## Common Issues and Solutions

### Issue 1: No Evaluations Returned

Problem: The evaluator returns an empty list or no results.

Solution: Ensure the trajectory is properly captured and includes tool calls.
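
One way to debug this is to dump the captured spans and confirm that tool executions were actually recorded; if only agent and model spans appear, the mapped session has no tool calls to evaluate. A sketch using the in-memory exporter from Basic Usage (which span represents a tool call is Strands-specific, so inspect the output rather than matching an exact name):

```python
# The spans are OpenTelemetry ReadableSpan objects, so .name and .attributes
# are standard SDK fields; use them to verify tool executions were captured.
for span in memory_exporter.get_finished_spans():
    print(span.name, dict(span.attributes or {}))
```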
### Issue 2: False Negatives

Problem: The evaluator marks valid parameters as inaccurate.

Solution: Ensure the conversation history is complete and the relevant context is clear. If false negatives persist, consider a stronger judge model.
### Issue 3: Inconsistent Results

Problem: The same test case produces different evaluation results.

Solution: Some variation is expected because the judge is an LLM and is not fully deterministic. Run the evaluation multiple times and aggregate the scores, as sketched below.
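
A sketch of the repeat-and-aggregate approach, reusing the hypothetical `report.results` accessor from Evaluation Output above; adapt it to your report object.

```python
import statistics

N_RUNS = 5
run_scores: list[float] = []
for _ in range(N_RUNS):
    experiment = Experiment[str, str](
        cases=test_cases,
        evaluators=[ToolParameterAccuracyEvaluator()],
    )
    report = experiment.run_evaluations(user_task_function)[0]
    # Hypothetical accessor: adapt to however your report exposes EvaluationOutput entries.
    scores = [output.score for output in report.results]
    run_scores.append(statistics.mean(scores) if scores else 0.0)

print(f"mean score {statistics.mean(run_scores):.2f}, "
      f"stdev {statistics.stdev(run_scores):.2f} over {N_RUNS} runs")
```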
## Related Evaluators
- ToolSelectionAccuracyEvaluator: Evaluates if correct tools were selected
- TrajectoryEvaluator: Evaluates the overall sequence of tool calls
- FaithfulnessEvaluator: Evaluates if responses are grounded in context
- OutputEvaluator: Evaluates the quality of final outputs