Detectors

Detectors answer “why did my agent fail?” by analyzing execution traces for failures, performing root cause analysis, and producing actionable fix recommendations. While evaluators tell you whether an agent performed well, detectors tell you what went wrong and how to fix it.

Detectors operate on Session objects (the same trace format used by trace-based evaluators) and use LLM-based analysis to identify semantic failures that go beyond simple error codes — hallucinations, tool misuse, reasoning breakdowns, policy violations, and more.

Evaluators give you a score. Detectors give you a diagnosis.

Evaluators alone:

  • Tell you a case scored 0.3 / 1.0
  • Don’t explain which step failed or why
  • Require manual trace inspection to debug
  • Don’t suggest what to change

Evaluators + Detectors:

  • Score the case and identify the specific failing spans
  • Classify failures (hallucination, tool error, policy violation, etc.)
  • Trace causal chains between failures (primary vs. secondary)
  • Recommend concrete fixes (system prompt changes, tool description updates)

Use detectors when you need to:

  • Debug failing evaluations: Understand why specific test cases fail
  • Identify failure patterns: Detect recurring issues across sessions (hallucinations, tool misuse)
  • Get fix recommendations: Receive actionable suggestions for system prompt or tool description changes
  • Automate root cause analysis: Replace manual trace inspection with LLM-based diagnosis
  • Monitor production agents: Analyze traces from remote providers for systematic issues

detect_failures

  • Purpose: Identify semantic failures in agent execution traces
  • Output: List of failures with span location, category, confidence, and evidence
  • Categories: ~20 failure types including hallucinations, execution errors, tool misuse, repetitive behavior, and orchestration errors
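
A minimal sketch of running failure detection on its own. The import path mirrors `diagnose_session` below; the exact call signature is an assumption, so check the API of your installed version:

```python
from strands_evals.detectors import detect_failures  # import path assumed

# Assumed standalone entry point; diagnose_session (below) wraps this phase.
failures = detect_failures(session)

for failure in failures:
    # Each failure carries a span location, category, confidence, and evidence.
    print(f"{failure.span_id} [{failure.category}] "
          f"confidence={failure.confidence}: {failure.evidence}")
```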

analyze_root_cause

  • Purpose: Determine the fundamental cause of detected failures
  • Output: Causal analysis with propagation impact, fix type, and recommendation
  • Strategy: 3-tier fallback (direct, pruned, chunked) for handling large sessions
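
Root cause analysis runs on the failures that detection produced. A sketch assuming the function takes the session plus the detected failures (the real signature may differ):

```python
from strands_evals.detectors import analyze_root_cause, detect_failures

failures = detect_failures(session)

# Assumed signature. Internally, the 3-tier fallback tries the full
# session first (direct), then a pruned view, then overlapping chunks
# when the session exceeds the model's context limits.
root_causes = analyze_root_cause(session, failures)

for rc in root_causes:
    print(f"{rc.location}: {rc.root_cause_explanation}")
    print(f"  Fix ({rc.fix_type}): {rc.fix_recommendation}")
```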

diagnose_session

  • Purpose: Run failure detection and root cause analysis as a single pipeline
  • Output: Combined result with failures, root causes, and deduplicated recommendations
  • Integration: Can be wired into Experiment via DiagnosisConfig for automatic diagnosis of failing cases

```python
from strands_evals.detectors import diagnose_session

# session is a Session object from a trace provider or in-memory mapper
result = diagnose_session(session)

# What failed?
for failure in result.failures:
    print(f"Span {failure.span_id}: {failure.category}")
    print(f"  Evidence: {failure.evidence}")

# Why did it fail?
for rc in result.root_causes:
    print(f"Root cause at {rc.location}: {rc.root_cause_explanation}")
    print(f"  Fix ({rc.fix_type}): {rc.fix_recommendation}")

# Deduplicated recommendations
for rec in result.recommendations:
    print(f"  - {rec}")
```

| Aspect | Evaluators | Detectors |
| --- | --- | --- |
| Question | "How well did the agent do?" | "Why did it fail?" |
| Output | Score + pass/fail | Failures + root causes + fix recommendations |
| Granularity | Per-case or per-session | Per-span |
| Purpose | Measure quality | Diagnose problems |
| Use case | Benchmarking, regression testing | Debugging, improvement |

Use Together: Run evaluators to score your agent, then use detectors on failing cases to understand what went wrong and how to fix it. The Experiment class supports this workflow natively via DiagnosisConfig.

Detectors integrate directly into the evaluation pipeline. Pass a DiagnosisConfig to Experiment to automatically diagnose failing cases:

```python
from strands_evals import Experiment, Case, DiagnosisConfig
from strands_evals.evaluators import GoalSuccessRateEvaluator
from strands_evals.types.detector import DiagnosisTrigger

experiment = Experiment(
    cases=test_cases,
    evaluators=[GoalSuccessRateEvaluator()],
    diagnosis_config=DiagnosisConfig(trigger=DiagnosisTrigger.ON_FAILURE),
)
reports = experiment.run_evaluations(my_task)

# View recommendations for failing cases
reports[0].display(include_recommendations=True)
```

See the Session Diagnosis guide for the full integration walkthrough.

Detectors use a two-phase analysis pipeline:

```mermaid
flowchart TD
    A[Session traces] --> B[1. Failure Detection]
    B --> C[2. Root Cause Analysis]
    C --> D[DiagnosisResult]
    B -. "span_id, category,\nconfidence, evidence" .-> B
    C -. "causality, propagation impact,\nfix recommendation" .-> C
    D -. "failures[]\nroot_causes[]\nrecommendations[]" .-> D
```

Both phases handle large sessions that exceed LLM context limits through automatic chunking with overlap and merge strategies.
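
The chunking itself is a standard sliding window. A generic sketch of the idea, not the library's implementation:

```python
def chunk_with_overlap(spans: list, chunk_size: int = 50, overlap: int = 10) -> list[list]:
    """Split a long span list into overlapping windows.

    Generic sliding-window sketch: the overlap preserves context across
    chunk boundaries, and per-chunk results are merged afterwards
    (for example, deduplicated by span_id).
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    return [spans[i:i + chunk_size] for i in range(0, max(len(spans) - overlap, 1), step)]
```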