Session Diagnosis
Overview
diagnose_session runs the full detection-and-analysis pipeline in a single call: it detects failures, performs root cause analysis on any failures found, and returns a combined DiagnosisResult with deduplicated fix recommendations. It also integrates directly into the Experiment class via DiagnosisConfig for automatic diagnosis of failing evaluation cases.
Key Features
- Single-call pipeline: Runs detect_failures then analyze_root_cause in sequence
- Deduplicated recommendations: The .recommendations property returns unique fix suggestions across all root causes
- Experiment integration: Wire into Experiment with DiagnosisConfig for automatic diagnosis
- Configurable triggers: Run diagnosis on every case (DiagnosisTrigger.ALWAYS) or only on failing cases (DiagnosisTrigger.ON_FAILURE)
When to Use
Use diagnose_session when you need to:
- Run the full pipeline without managing individual detector calls
- Integrate diagnosis into experiments for automatic debugging of failing cases
- Get a single result object with failures, root causes, and recommendations
Use the individual detectors (detect_failures, analyze_root_cause) when you need finer control — for example, running failure detection with different confidence thresholds before deciding whether to proceed with RCA.
Parameters
session (required)
- Type: Session
- Description: The Session object containing traces and spans to diagnose.

model (optional)
- Type: Model | str | None
- Default: None (uses Claude Sonnet via Bedrock)
- Description: The model used for both failure detection and root cause analysis.

confidence_threshold (optional)
- Type: ConfidenceLevel
- Default: ConfidenceLevel.LOW
- Description: Minimum confidence level for failure detection.
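To illustrate what a minimum confidence level means in practice, here is a minimal sketch of threshold filtering. It does not use the library: Confidence and Finding are hypothetical stand-ins, and both the ordering LOW < MEDIUM < HIGH and the existence of a HIGH member are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Confidence(IntEnum):
    """Stand-in for ConfidenceLevel; the ordering and HIGH member are assumptions."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Finding:
    """Hypothetical, simplified failure record (not the library's FailureItem)."""
    span_id: str
    confidence: Confidence


def filter_by_threshold(findings, threshold):
    """Keep findings whose confidence is at or above the threshold."""
    return [f for f in findings if f.confidence >= threshold]


findings = [
    Finding("span-1", Confidence.LOW),
    Finding("span-2", Confidence.MEDIUM),
    Finding("span-3", Confidence.HIGH),
]

# A LOW threshold keeps everything; MEDIUM drops the low-confidence finding.
kept_low = filter_by_threshold(findings, Confidence.LOW)
kept_medium = filter_by_threshold(findings, Confidence.MEDIUM)
print([f.span_id for f in kept_low])
print([f.span_id for f in kept_medium])
```

Under these assumptions, a lower threshold surfaces more candidate failures at the cost of more noise, which is why the threshold is worth choosing deliberately.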
Basic Usage
```python
from strands_evals.detectors import diagnose_session

# `session` is a Session object obtained elsewhere
result = diagnose_session(session)

# Failures found
print(f"Failures: {len(result.failures)}")
for f in result.failures:
    print(f"  [{f.category[0]}] at span {f.span_id}")

# Root causes
print(f"\nRoot causes: {len(result.root_causes)}")
for rc in result.root_causes:
    print(f"  {rc.causality} at {rc.location}: {rc.root_cause_explanation}")

# Deduplicated recommendations
print("\nRecommendations:")
for rec in result.recommendations:
    print(f"  - {rec}")
```

Output Structure
diagnose_session returns a DiagnosisResult:

```python
class DiagnosisResult(BaseModel):
    session_id: str
    failures: list[FailureItem]
    root_causes: list[RCAItem]

    @property
    def recommendations(self) -> list[str]:
        """Deduplicated fix recommendations from all root causes."""
```

If no failures are detected, root_causes will be empty and recommendations will return an empty list.
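One way such a deduplicating property could behave is sketched below. This is not the library's implementation: RootCause and Diagnosis are hypothetical stand-ins, and preserving first-seen order is an assumption. Only the fix_recommendation field name is taken from the diagnosis data shown elsewhere on this page.

```python
from dataclasses import dataclass, field


@dataclass
class RootCause:
    """Hypothetical stand-in for RCAItem; only the field used here is modeled."""
    fix_recommendation: str


@dataclass
class Diagnosis:
    """Hypothetical stand-in for DiagnosisResult."""
    root_causes: list = field(default_factory=list)

    @property
    def recommendations(self):
        # dict.fromkeys keeps the first occurrence of each string, in order.
        return list(dict.fromkeys(rc.fix_recommendation for rc in self.root_causes))


diagnosis = Diagnosis(
    root_causes=[
        RootCause("Validate tool arguments before calling"),
        RootCause("Add a retry on timeout"),
        RootCause("Validate tool arguments before calling"),
    ]
)
print(diagnosis.recommendations)

# With no root causes, the property yields an empty list.
print(Diagnosis().recommendations)
```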
Integration with Experiments
The most powerful way to use diagnosis is through the Experiment class. Pass a DiagnosisConfig to automatically diagnose cases during evaluation.
DiagnosisConfig
```python
from strands_evals import DiagnosisConfig
from strands_evals.detectors import ConfidenceLevel
from strands_evals.types.detector import DiagnosisTrigger

# Shape of the config, shown for reference
class DiagnosisConfig(BaseModel):
    trigger: DiagnosisTrigger = DiagnosisTrigger.ON_FAILURE
    model: Model | str | None = None
    confidence_threshold: ConfidenceLevel = ConfidenceLevel.MEDIUM
```

| Parameter | Default | Description |
|---|---|---|
| trigger | DiagnosisTrigger.ON_FAILURE | When to run diagnosis. ON_FAILURE runs only when at least one evaluator fails. ALWAYS runs on every case. |
| model | None | Model for the detectors. None uses the default. |
| confidence_threshold | ConfidenceLevel.MEDIUM | Minimum confidence for failure detection. |
Example: Diagnose Failing Cases
```python
from strands import Agent
from strands_evals import Case, Experiment, DiagnosisConfig, eval_task, TracedHandler
from strands_evals.detectors import ConfidenceLevel
from strands_evals.evaluators import GoalSuccessRateEvaluator
from strands_evals.types.detector import DiagnosisTrigger


@eval_task(TracedHandler())
def my_agent_task():
    return Agent(
        system_prompt="You are a helpful travel booking assistant.",
        callback_handler=None,
    )


cases = [
    Case(
        name="booking-1",
        input="Book me a flight from NYC to London for next Friday.",
        metadata={"task_description": "Flight booked with confirmation number"},
    ),
    Case(
        name="booking-2",
        input="I need to cancel my reservation ABC123.",
        metadata={"task_description": "Reservation cancelled successfully"},
    ),
]

experiment = Experiment(
    cases=cases,
    evaluators=[GoalSuccessRateEvaluator()],
    diagnosis_config=DiagnosisConfig(
        trigger=DiagnosisTrigger.ON_FAILURE,
        confidence_threshold=ConfidenceLevel.MEDIUM,
    ),
)

reports = experiment.run_evaluations(my_agent_task)
```

Viewing Recommendations
Display recommendations in the evaluation report:

```python
# Display with recommendations column
reports[0].display(include_recommendations=True)
```

Or access them programmatically:

```python
report = reports[0]
for i, rec in enumerate(report.recommendations):
    if rec is not None:
        case_name = report.cases[i].get("name", f"case_{i}")
        passed = report.test_passes[i]
        print(f"[{'PASS' if passed else 'FAIL'}] {case_name}")
        print(f"  Recommendation: {rec}")
```

Accessing Full Diagnosis Data
The full diagnosis dict (failures + root causes) is available per case:

```python
report = reports[0]
for i, diagnosis in enumerate(report.diagnoses):
    if diagnosis is not None:
        case_name = report.cases[i].get("name", f"case_{i}")
        n_failures = len(diagnosis.get("failures", []))
        n_rca = len(diagnosis.get("root_causes", []))
        print(f"{case_name}: {n_failures} failures, {n_rca} root causes")

        for rc in diagnosis.get("root_causes", []):
            print(f"  [{rc['fix_type']}] {rc['fix_recommendation']}")
```

Trigger Modes
DiagnosisTrigger.ON_FAILURE (default): Diagnosis runs only when at least one evaluator returns test_pass=False for the case. This is the most efficient option: no LLM calls are spent diagnosing passing cases.

```python
DiagnosisConfig(trigger=DiagnosisTrigger.ON_FAILURE)
```

DiagnosisTrigger.ALWAYS: Diagnosis runs on every case regardless of evaluator results. Useful for deep analysis or when you want to detect latent issues in passing cases (e.g., the agent succeeded but through a suboptimal path).

```python
DiagnosisConfig(trigger=DiagnosisTrigger.ALWAYS)
```

Requirements
Diagnosis requires the task function to return a Session object as the trajectory. This means using either:
- The @eval_task(TracedHandler()) decorator (recommended)
- A manual task function that collects spans and maps them to a Session via StrandsInMemorySessionMapper
- A trace provider that returns Session objects
If the trajectory is not a Session (e.g., it’s a plain list of tool names), diagnosis is silently skipped for that case.
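The trigger modes and the Session requirement together decide whether a case gets diagnosed. The sketch below restates those rules as a small decision function. It is illustrative only and does not reflect the library's internals: Trigger, FakeSession, and should_diagnose are hypothetical names, and the enum values are assumptions.

```python
from enum import Enum


class Trigger(Enum):
    """Stand-in for DiagnosisTrigger (member values are assumptions)."""
    ON_FAILURE = "on_failure"
    ALWAYS = "always"


class FakeSession:
    """Hypothetical placeholder for the library's Session type."""


def should_diagnose(trigger, evaluator_passes, trajectory):
    """Return True when diagnosis would run for a case under the rules above."""
    # Diagnosis needs a Session trajectory; anything else is skipped.
    if not isinstance(trajectory, FakeSession):
        return False
    if trigger is Trigger.ALWAYS:
        return True
    # ON_FAILURE: run only if at least one evaluator failed.
    return not all(evaluator_passes)


session = FakeSession()
print(should_diagnose(Trigger.ON_FAILURE, [True, False], session))
print(should_diagnose(Trigger.ON_FAILURE, [True, True], session))
print(should_diagnose(Trigger.ALWAYS, [True, True], session))
print(should_diagnose(Trigger.ALWAYS, [False], ["tool_a", "tool_b"]))
```

Because a non-Session trajectory is skipped without an error, a run that yields no diagnoses at all may be worth checking against this condition first.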
Example: Full Workflow
This example shows the complete workflow: run evaluations with diagnosis, then analyze the results.
```python
from strands import Agent
from strands_evals import Case, Experiment, DiagnosisConfig, eval_task, TracedHandler
from strands_evals.detectors import ConfidenceLevel
from strands_evals.evaluators import OutputEvaluator, GoalSuccessRateEvaluator
from strands_evals.types.detector import DiagnosisTrigger
from strands_evals.types.evaluation_report import EvaluationReport
from strands_tools import calculator


@eval_task(TracedHandler())
def math_agent():
    return Agent(
        tools=[calculator],
        system_prompt="You are a math assistant. Use the calculator tool for computations.",
        callback_handler=None,
    )


cases = [
    Case(
        name="basic-math",
        input="What is 15% of 230?",
        expected_output="34.5",
        metadata={"task_description": "Correct calculation provided"},
    ),
]

experiment = Experiment(
    cases=cases,
    evaluators=[
        OutputEvaluator(rubric="Score 1.0 if the answer is numerically correct. Score 0.0 otherwise."),
        GoalSuccessRateEvaluator(),
    ],
    diagnosis_config=DiagnosisConfig(
        trigger=DiagnosisTrigger.ON_FAILURE,
        confidence_threshold=ConfidenceLevel.MEDIUM,
    ),
)

reports = experiment.run_evaluations(math_agent)

# Flatten all evaluator reports into one and display
combined = EvaluationReport.flatten(reports)
combined.display(include_recommendations=True)
```

Best Practices
- Start with DiagnosisTrigger.ON_FAILURE to minimize LLM costs by diagnosing only the cases that actually fail
- Use ConfidenceLevel.MEDIUM for diagnosis to balance signal and noise
- Use TracedHandler with the @eval_task decorator to automatically collect Session objects
- Group recommendations across cases to identify systematic issues vs. one-off problems
- Use DiagnosisTrigger.ALWAYS sparingly: it’s useful for deep analysis but doubles the LLM cost per case
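Grouping recommendations across cases can be as simple as counting repeats. The following is a minimal sketch that assumes each case contributes one recommendation string or None, as in the report.recommendations loop shown earlier. The sample strings and the count_recommendations helper are hypothetical, and exact string matching is a simplifying assumption, since LLM-generated recommendations may vary in wording.

```python
from collections import Counter


def count_recommendations(per_case_recommendations):
    """Count how many cases share each recommendation, skipping undiagnosed cases."""
    return Counter(rec for rec in per_case_recommendations if rec is not None)


# Hypothetical per-case values, shaped like report.recommendations.
recs = [
    "Add a retry to the booking tool",
    None,  # case that was not diagnosed
    "Clarify the cancellation policy in the system prompt",
    "Add a retry to the booking tool",
]

counts = count_recommendations(recs)

# Recommendations seen in more than one case hint at a systematic issue.
systematic = [rec for rec, n in counts.items() if n > 1]
one_off = [rec for rec, n in counts.items() if n == 1]

print("Systematic:", systematic)
print("One-off:", one_off)
```

If recommendation wording varies between cases, grouping on a coarser key such as the fix_type field from the diagnosis data may be more robust than exact text.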
Related Documentation
- Failure Detection: Standalone failure detection
- Root Cause Analysis: Standalone root cause analysis
- Detectors Overview: High-level detectors guide
- Task Decorator: Simplify task functions with @eval_task
- Remote Trace Providers: Fetch traces from production backends