# Detectors
## Overview

Detectors answer “why did my agent fail?” by analyzing execution traces for failures, performing root cause analysis, and producing actionable fix recommendations. While evaluators tell you whether an agent performed well, detectors tell you what went wrong and how to fix it.
Detectors operate on Session objects (the same trace format used by trace-based evaluators) and use LLM-based analysis to identify semantic failures that go beyond simple error codes — hallucinations, tool misuse, reasoning breakdowns, policy violations, and more.
## Why Detectors?

Evaluators give you a score. Detectors give you a diagnosis.
Evaluators alone:
- Tell you a case scored 0.3 / 1.0
- Don’t explain which step failed or why
- Require manual trace inspection to debug
- Don’t suggest what to change
Evaluators + Detectors:
- Score the case and identify the specific failing spans
- Classify failures (hallucination, tool error, policy violation, etc.)
- Trace causal chains between failures (primary vs. secondary)
- Recommend concrete fixes (system prompt changes, tool description updates)
## When to Use Detectors

Use detectors when you need to:
- Debug failing evaluations: Understand why specific test cases fail
- Identify failure patterns: Detect recurring issues across sessions (hallucinations, tool misuse); see the sketch after this list
- Get fix recommendations: Receive actionable suggestions for system prompt or tool description changes
- Automate root cause analysis: Replace manual trace inspection with LLM-based diagnosis
- Monitor production agents: Analyze traces from remote providers for systematic issues
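For example, monitoring for recurring failure patterns can be as simple as aggregating failure categories across a batch of sessions. The sketch below is illustrative only: it assumes `sessions` is a list of `Session` objects you have already fetched (for instance from a remote trace provider), and it relies only on `diagnose_session` and the failure fields shown in the quick example further down.

```python
from collections import Counter

from strands_evals.detectors import diagnose_session

# sessions: a list of Session objects you have already fetched,
# e.g. from a remote trace provider (see Remote Trace Providers).
category_counts = Counter()
for session in sessions:
    result = diagnose_session(session)
    category_counts.update(failure.category for failure in result.failures)

# Most frequent failure categories across the analyzed sessions
for category, count in category_counts.most_common(5):
    print(f"{category}: {count}")
```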
## Available Detectors

### Failure Detection

- Purpose: Identify semantic failures in agent execution traces
- Output: List of failures with span location, category, confidence, and evidence
- Categories: ~20 failure types including hallucinations, execution errors, tool misuse, repetitive behavior, and orchestration errors
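As a sketch, you can filter the detected failures down to high-confidence findings before reviewing their evidence. The attribute names below follow the quick example further down; `confidence` in particular is an assumption based on the output fields listed above.

```python
from strands_evals.detectors import diagnose_session

result = diagnose_session(session)

# Keep only high-confidence findings. The 0.8 threshold is arbitrary, and
# the `confidence` attribute name is assumed from the fields listed above.
for failure in result.failures:
    if failure.confidence >= 0.8:
        print(f"[{failure.category}] span {failure.span_id}")
        print(f"  evidence: {failure.evidence}")
```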
### Root Cause Analysis

- Purpose: Determine the fundamental cause of detected failures
- Output: Causal analysis with propagation impact, fix type, and recommendation
- Strategy: 3-tier fallback (direct, pruned, chunked) for handling large sessions
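For instance, grouping root causes by their fix type shows whether the recommended changes cluster around the system prompt, tool descriptions, or something else. This sketch uses only the root-cause fields shown in the quick example further down (`fix_type`, `fix_recommendation`).

```python
from collections import defaultdict

from strands_evals.detectors import diagnose_session

result = diagnose_session(session)

# Group fix recommendations by the kind of change they suggest.
fixes_by_type = defaultdict(list)
for rc in result.root_causes:
    fixes_by_type[rc.fix_type].append(rc.fix_recommendation)

for fix_type, recommendations in fixes_by_type.items():
    print(f"{fix_type}:")
    for rec in recommendations:
        print(f"  - {rec}")
```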
### Session Diagnosis

- Purpose: Run failure detection and root cause analysis as a single pipeline
- Output: Combined result with failures, root causes, and deduplicated recommendations
- Integration: Can be wired into `Experiment` via `DiagnosisConfig` for automatic diagnosis of failing cases
## Quick Example

```python
from strands_evals.detectors import diagnose_session

# session is a Session object from a trace provider or in-memory mapper
result = diagnose_session(session)

# What failed?
for failure in result.failures:
    print(f"Span {failure.span_id}: {failure.category}")
    print(f"  Evidence: {failure.evidence}")

# Why did it fail?
for rc in result.root_causes:
    print(f"Root cause at {rc.location}: {rc.root_cause_explanation}")
    print(f"  Fix ({rc.fix_type}): {rc.fix_recommendation}")

# Deduplicated recommendations
for rec in result.recommendations:
    print(f"  - {rec}")
```

## Detectors vs Evaluators
| Aspect | Evaluators | Detectors |
|---|---|---|
| Question | "How well did the agent do?" | "Why did it fail?" |
| Output | Score + pass/fail | Failures + root causes + fix recommendations |
| Granularity | Per-case or per-session | Per-span |
| Purpose | Measure quality | Diagnose problems |
| Use Case | Benchmarking, regression testing | Debugging, improvement |
Use together: Run evaluators to score your agent, then use detectors on failing cases to understand what went wrong and how to fix it. The `Experiment` class supports this workflow natively via `DiagnosisConfig`.
## Integration with Experiments

Detectors integrate directly into the evaluation pipeline. Pass a `DiagnosisConfig` to `Experiment` to automatically diagnose failing cases:
```python
from strands_evals import Experiment, Case, DiagnosisConfig
from strands_evals.evaluators import GoalSuccessRateEvaluator
from strands_evals.types.detector import DiagnosisTrigger

experiment = Experiment(
    cases=test_cases,
    evaluators=[GoalSuccessRateEvaluator()],
    diagnosis_config=DiagnosisConfig(trigger=DiagnosisTrigger.ON_FAILURE),
)

reports = experiment.run_evaluations(my_task)

# View recommendations for failing cases
reports[0].display(include_recommendations=True)
```

See the Session Diagnosis guide for the full integration walkthrough.
## How It Works

Detectors use a two-phase analysis pipeline:
```mermaid
flowchart TD
    A[Session traces] --> B[1. Failure Detection]
    B --> C[2. Root Cause Analysis]
    C --> D[DiagnosisResult]

    B -. "span_id, category,\nconfidence, evidence" .-> B
    C -. "causality, propagation impact,\nfix recommendation" .-> C
    D -. "failures[]\nroot_causes[]\nrecommendations[]" .-> D
```

Both phases handle large sessions that exceed LLM context limits through automatic chunking with overlap and merge strategies.
## Next Steps

- Failure Detection: Identify what went wrong in agent traces
- Root Cause Analysis: Understand why failures happened
- Session Diagnosis: Run the full pipeline and integrate with experiments
## Related Documentation

- Getting Started: Set up your first evaluation experiment
- Evaluators Overview: Score agent performance
- Remote Trace Providers: Fetch traces from production backends