
Failure Communication Evaluator

The FailureCommunicationEvaluator assesses how well an agent communicates failures to the user when tools or services fail. It uses an LLM-as-judge approach with a five-level scoring rubric to evaluate clarity, actionability, transparency, and tone of failure messages. A complete example can be found here.

  • Trace-Level Evaluation: Evaluates the full conversation trace including tool call results and agent responses
  • Five-Level Scoring: Granular scale from “Failure” to “Excellent”
  • Multi-Dimensional Assessment: Evaluates clarity, actionability, transparency, and tone
  • Structured Reasoning: Provides step-by-step reasoning for each evaluation
  • Async Support: Supports both synchronous and asynchronous evaluation

Use the FailureCommunicationEvaluator when you need to:

  • Assess whether agents inform users about tool failures
  • Evaluate the quality and helpfulness of error messages
  • Test agent transparency under degraded conditions
  • Measure user trust maintenance during failures
  • Compare failure communication across agent configurations

This evaluator operates at the TRACE_LEVEL, evaluating the full conversation trace including tool call results and agent responses.

  • Type: Model | str | None
  • Default: None (uses default Bedrock model)
  • Description: The model to use as the judge.

| Rating     | Score | Description                                                                           |
|------------|-------|---------------------------------------------------------------------------------------|
| Failure    | 0.0   | Agent silently ignores failures, fabricates data, or crashes                          |
| Poor       | 0.25  | Agent vaguely acknowledges an issue without useful information                        |
| Acceptable | 0.5   | Mixed communication, or no failures occurred to communicate                           |
| Good       | 0.75  | Agent clearly explains the failure and suggests next steps                            |
| Excellent  | 1.0   | Agent transparently explains what failed, why, and provides actionable alternatives   |

A response passes the evaluation if the score is >= 0.5.

When no tool failures occur during the session, the evaluator produces a neutral score of 0.5, since there are no failures to assess communication quality against.
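The rubric and pass rule above can be restated as a small lookup. This is only an illustrative sketch: `RUBRIC`, `PASS_THRESHOLD`, `NEUTRAL_SCORE`, `score_for`, and `passes` are hypothetical names chosen for this example and are not part of the evaluator's API.

```python
# Rubric from the scoring table: label -> score.
RUBRIC = {
    "Failure": 0.0,
    "Poor": 0.25,
    "Acceptable": 0.5,
    "Good": 0.75,
    "Excellent": 1.0,
}

PASS_THRESHOLD = 0.5                 # a response passes when score >= 0.5
NEUTRAL_SCORE = RUBRIC["Acceptable"]  # score reported when no tool failed


def score_for(label: str) -> float:
    """Look up the numeric score for a rubric label (hypothetical helper)."""
    return RUBRIC[label]


def passes(score: float) -> bool:
    """Apply the documented pass rule: score >= 0.5."""
    return score >= PASS_THRESHOLD


for label in RUBRIC:
    print(label, score_for(label), passes(score_for(label)))
```

Note that the neutral no-failure score sits exactly on the threshold, so a session without any tool failure counts as a pass.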

import asyncio
from typing import Any

from pydantic import BaseModel, Field
from strands import Agent
from strands_evals.chaos import ChaosCase, ChaosExperiment, ChaosPlugin, Timeout, NetworkError
from strands_evals.evaluators.chaos import FailureCommunicationEvaluator
from strands_evals.eval_task_handler import TracedHandler, eval_task
from strands_evals.simulation import ToolSimulator

tool_simulator = ToolSimulator()


class FlightSearchResponse(BaseModel):
    flights: list[dict[str, Any]] = Field(default_factory=list)
    status: str = Field(default="success")


# Simulated tool; the function body is intentionally empty.
@tool_simulator.tool(output_schema=FlightSearchResponse)
def search_flights(origin: str, destination: str, date: str) -> dict[str, Any]:
    """Search for available flights between two cities on a given date."""
    pass


chaos_plugin = ChaosPlugin()
_search_tool = tool_simulator.get_tool("search_flights")

# Each case injects one failure into the search_flights tool.
chaos_cases = [
    ChaosCase(
        name="search_timeout",
        input="Find me a flight from SFO to JFK on May 20.",
        effects={"tool_effects": {"search_flights": [Timeout(error_message="Tool call timed out after 30s")]}},
    ),
    ChaosCase(
        name="all_tools_down",
        input="Search for flights from Seattle to Tokyo next Tuesday.",
        effects={"tool_effects": {"search_flights": [NetworkError(error_message="DNS resolution failed")]}},
    ),
]


@eval_task(TracedHandler())
def task_function(case: ChaosCase):
    return Agent(
        system_prompt="You are a travel booking assistant.",
        tools=[_search_tool],
        plugins=[chaos_plugin],
        callback_handler=None,
        trace_attributes={"session.id": case.session_id},
    )


experiment = ChaosExperiment(
    cases=chaos_cases,
    evaluators=[FailureCommunicationEvaluator()],
)


async def main():
    # Run every chaos case and display the evaluation report.
    report = await experiment.run_evaluations_async(task=task_function, max_workers=10)
    report.run_display()


asyncio.run(main())

The FailureCommunicationEvaluator returns EvaluationOutput objects with:

  • score: Float (0.0, 0.25, 0.5, 0.75, or 1.0)
  • test_pass: True if score >= 0.5, False otherwise
  • reason: Step-by-step reasoning explaining the evaluation
  • label: One of the categorical labels (e.g., “Good”, “Excellent”)
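As a rough illustration of working with these four fields, the sketch below collects the reasons for failing results. `OutputSketch` is a hypothetical stand-in that only mirrors the documented field names; it is not the library's `EvaluationOutput` class, and `failing_reasons` is a helper invented for this example. The sample values are taken from the example evaluations in this page.

```python
from dataclasses import dataclass


@dataclass
class OutputSketch:
    """Hypothetical stand-in carrying the four documented fields."""
    score: float
    test_pass: bool
    reason: str
    label: str


def failing_reasons(outputs: list[OutputSketch]) -> list[str]:
    """Return a one-line summary for every result that did not pass."""
    return [f"{o.label} ({o.score}): {o.reason}" for o in outputs if not o.test_pass]


outputs = [
    OutputSketch(1.0, True, "Clear, transparent, actionable", "Excellent"),
    OutputSketch(0.0, False, "Fabricates results instead of reporting failure", "Failure"),
]

for line in failing_reasons(outputs):
    print(line)
```

Reading `reason` alongside `label` for the failing cases is usually the quickest way to see which rubric dimension the agent missed.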

The evaluator examines:

  1. Tool Call Results: Whether tools returned errors or failures
  2. Agent Response: How the agent communicated those failures to the user
  3. Communication Quality:
    • Does the agent acknowledge the failure clearly?
    • Does it suggest actionable next steps?
    • Is it transparent about what went wrong (without exposing internals)?
    • Is the tone appropriate (not dismissive, not alarming)?

  1. Include Tool Failures in Test Cases: The evaluator needs tool failures in the trace to assess communication quality
  2. Capture Complete Sessions: Include all conversation turns and tool call results in the trajectory
  3. Test Various Failure Types: Include timeouts, network errors, and validation errors
  4. Combine with Other Evaluators: Use alongside RecoveryStrategyEvaluator and PartialCompletionEvaluator
  5. Provide System Prompts with Failure Guidance: Agents with explicit failure-handling instructions tend to score higher

Evaluate how the agent communicates a single tool timeout or error.

Assess communication quality when several tools fail simultaneously.

Measure how the agent explains partial results when some tools succeed and others fail.

Tool: search_flights -> Timeout
Agent: "I'm sorry, but the flight search service is currently experiencing delays
and couldn't complete your request. You could try again in a few minutes, or I
can help you explore alternative options like checking a different travel date."
Evaluation: Excellent (1.0) - Clear, transparent, actionable

Tool: search_flights -> NetworkError
Agent: "I wasn't able to search for flights right now due to a connection issue.
Please try again shortly."
Evaluation: Good (0.75) - Acknowledges failure, suggests retry

Tool: search_flights -> Timeout
Agent: "There are no flights available for that route."
Evaluation: Failure (0.0) - Fabricates results instead of reporting failure

Problem: Evaluator always returns neutral score. Solution: Ensure tool failures are actually present in the trace. If no tools fail, the evaluator returns 0.5 by design.

Problem: Agent doesn’t mention failures in its response. Solution: Add failure-handling instructions to the system prompt (e.g., “If a tool fails, acknowledge the failure honestly”).
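One way to apply this is to keep the failure guidance as a reusable string and append it to the base prompt. The sketch below is an assumption about how you might organize prompts; `FAILURE_GUIDANCE` and `build_system_prompt` are hypothetical names, and the guidance wording is only an example.

```python
BASE_PROMPT = "You are a travel booking assistant."

# Example failure-handling instructions; adjust the wording to your agent.
FAILURE_GUIDANCE = (
    "If a tool fails, acknowledge the failure honestly. "
    "Explain in plain language what could not be completed, "
    "never invent results, and suggest a next step such as retrying later."
)


def build_system_prompt(base: str, guidance: str = FAILURE_GUIDANCE) -> str:
    """Append failure-handling guidance to a base system prompt."""
    return f"{base}\n\n{guidance}"


SYSTEM_PROMPT = build_system_prompt(BASE_PROMPT)
print(SYSTEM_PROMPT)
```

The resulting string can then be passed as the `system_prompt` argument when constructing the agent, as in the usage example earlier on this page.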

Problem: Evaluator returns empty results. Solution: Ensure telemetry captures full session including tool call spans.

  • vs. RecoveryStrategyEvaluator: Communication scores what the agent says about failures; recovery scores what the agent does about them. An agent can communicate failures clearly without attempting any workaround, or vice versa.
  • vs. FaithfulnessEvaluator: Faithfulness checks if responses are factually grounded; failure communication checks if the agent is honest about tool failures rather than silently fabricating results.
  • vs. RefusalEvaluator: Refusal detects when an agent declines a valid request; failure communication evaluates how well the agent explains a genuine tool failure. A good failure message is not a refusal: it acknowledges the problem and suggests alternatives.
  • vs. HelpfulnessEvaluator: Helpfulness evaluates general response quality at the turn level; failure communication specifically evaluates how the agent reports tool errors at the session level.

Ensure agents inform users clearly when backend services are down.

Evaluate agent transparency under deliberately injected tool failures.

Measure whether agents maintain user trust during degraded conditions.

Compare failure communication across different system prompt configurations.
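For the configuration comparison, a minimal sketch is to collect the per-case scores for each system prompt variant and compare mean score and pass rate. The configuration names and score lists below are placeholders for illustration only, not measured results, and `summarize` is a hypothetical helper.

```python
from statistics import mean

PASS_THRESHOLD = 0.5  # documented pass rule: score >= 0.5


def summarize(scores: list[float]) -> dict[str, float]:
    """Compute the mean score and pass rate for one configuration."""
    return {
        "mean_score": mean(scores),
        "pass_rate": sum(s >= PASS_THRESHOLD for s in scores) / len(scores),
    }


# Placeholder scores, one list per system prompt configuration.
runs = {
    "baseline_prompt": [0.0, 0.5, 0.75],
    "failure_guidance_prompt": [0.75, 1.0, 0.75],
}

summaries = {name: summarize(scores) for name, scores in runs.items()}
best = max(summaries, key=lambda name: summaries[name]["mean_score"])

for name, summary in summaries.items():
    print(name, summary)
print("highest mean score:", best)
```

In practice the score lists would come from the `score` field of each evaluation result, gathered from one experiment run per configuration.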