Session Diagnosis

Overview

diagnose_session runs the full detection-and-analysis pipeline in a single call: it detects failures, performs root cause analysis on any failures found, and returns a combined DiagnosisResult with deduplicated fix recommendations. It also integrates directly into the Experiment class via DiagnosisConfig for automatic diagnosis of failing evaluation cases.

Key Features

Single-call pipeline: Runs detect_failures then analyze_root_cause in sequence
Deduplicated recommendations: The .recommendations property returns unique fix suggestions across all root causes
Experiment integration: Wire into Experiment with DiagnosisConfig for automatic diagnosis
Configurable triggers: Run diagnosis on every case (DiagnosisTrigger.ALWAYS) or only on failing cases (DiagnosisTrigger.ON_FAILURE)

When to Use

Use diagnose_session when you need to:

Run the full pipeline without managing individual detector calls
Integrate diagnosis into experiments for automatic debugging of failing cases
Get a single result object with failures, root causes, and recommendations

Use the individual detectors (detect_failures, analyze_root_cause) when you need finer control — for example, running failure detection with different confidence thresholds before deciding whether to proceed with RCA.
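
The staged flow described above can be sketched as follows. This is a minimal illustration rather than the library's implementation: detect and analyze are hypothetical stand-ins for detect_failures and analyze_root_cause (their real signatures are not shown in this guide), and min_failures and the failure dicts are invented for the example.

```python
from typing import Callable

# Hypothetical stand-ins: the real detect_failures / analyze_root_cause
# signatures are not shown in this guide, so they are passed in as callables.
Detector = Callable[[object], list[dict]]
Analyzer = Callable[[object, list[dict]], list[dict]]


def staged_diagnosis(session: object, detect: Detector, analyze: Analyzer,
                     min_failures: int = 1) -> dict:
    """Run detection first and only proceed to RCA when enough failures exist."""
    failures = detect(session)
    if len(failures) < min_failures:
        # Too few failures: skip root cause analysis entirely.
        return {"failures": failures, "root_causes": []}
    return {"failures": failures, "root_causes": analyze(session, failures)}


# Toy detectors so the sketch runs without any model calls.
def fake_detect(session):
    return [{"span_id": "span-1", "category": ["tool_error"]}]


def fake_analyze(session, failures):
    return [
        {"location": f["span_id"], "fix_recommendation": "Validate tool input"}
        for f in failures
    ]


result = staged_diagnosis("session-placeholder", fake_detect, fake_analyze)
quiet = staged_diagnosis("session-placeholder", lambda s: [], fake_analyze)
```

The decision point between the two stages is where a custom rule (a different threshold, a minimum failure count, a category filter) would go.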

Parameters

`session` (required)

Type: Session
Description: The Session object containing traces and spans to diagnose.

`model` (optional)

Type: Model | str | None
Default: None (uses Claude Sonnet via Bedrock)
Description: The model for both failure detection and root cause analysis.

`confidence_threshold` (optional)

Type: ConfidenceLevel
Default: ConfidenceLevel.LOW
Description: Minimum confidence level for failure detection.
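
One way to read a minimum confidence level is as a filter over detected failures. The sketch below illustrates that reading only; the library may apply the threshold differently. Confidence and Finding are illustrative stand-ins, and the HIGH member and the numeric ordering are assumptions.

```python
from dataclasses import dataclass
from enum import IntEnum


class Confidence(IntEnum):
    """Illustrative stand-in for ConfidenceLevel; the HIGH member and the
    numeric ordering are assumptions made for this sketch."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Finding:
    """Hypothetical detected failure, reduced to two fields."""
    span_id: str
    confidence: Confidence


def apply_threshold(findings, threshold):
    """Keep findings whose confidence is at or above the threshold."""
    return [f for f in findings if f.confidence >= threshold]


findings = [
    Finding("span-1", Confidence.LOW),
    Finding("span-2", Confidence.MEDIUM),
    Finding("span-3", Confidence.HIGH),
]

kept_low = apply_threshold(findings, Confidence.LOW)        # keeps all three
kept_medium = apply_threshold(findings, Confidence.MEDIUM)  # drops span-1
```

Under this reading, a lower threshold surfaces more candidate failures at the cost of more noise.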

Basic Usage

from strands_evals.detectors import diagnose_session

# `session` is an existing Session containing the traces and spans to diagnose
result = diagnose_session(session)

# Failures found
print(f"Failures: {len(result.failures)}")
for f in result.failures:
    print(f"  [{f.category[0]}] at span {f.span_id}")

# Root causes
print(f"\nRoot causes: {len(result.root_causes)}")
for rc in result.root_causes:
    print(f"  {rc.causality} at {rc.location}: {rc.root_cause_explanation}")

# Deduplicated recommendations
print("\nRecommendations:")
for rec in result.recommendations:
    print(f"  - {rec}")

Output Structure

diagnose_session returns a DiagnosisResult:

class DiagnosisResult(BaseModel):
    session_id: str
    failures: list[FailureItem]
    root_causes: list[RCAItem]

    @property
    def recommendations(self) -> list[str]:
        """Deduplicated fix recommendations from all root causes."""

If no failures are detected, root_causes will be empty and recommendations will return an empty list.
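
To make the deduplication concrete, here is a minimal sketch of how unique recommendations could be derived from a list of root causes. RootCause is a stand-in for RCAItem, the fix_recommendation attribute mirrors the key used in the diagnosis dicts, and preserving first-seen order is an assumption, not documented behavior.

```python
from dataclasses import dataclass


@dataclass
class RootCause:
    """Stand-in for RCAItem, reduced to the one field this sketch needs."""
    fix_recommendation: str


def unique_recommendations(root_causes):
    """Return each recommendation once, in first-seen order."""
    # dict preserves insertion order, so this deduplicates without reordering.
    return list(dict.fromkeys(rc.fix_recommendation for rc in root_causes))


root_causes = [
    RootCause("Add input validation to the booking tool"),
    RootCause("Clarify the system prompt"),
    RootCause("Add input validation to the booking tool"),
]
recs = unique_recommendations(root_causes)
```

With no root causes the helper returns an empty list, matching the behavior described above.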

Integration with Experiments

Diagnosis can also run automatically as part of an Experiment. Pass a DiagnosisConfig to the Experiment class and cases are diagnosed during evaluation according to the configured trigger.

DiagnosisConfig

from strands_evals import DiagnosisConfig
from strands_evals.detectors import ConfidenceLevel
from strands_evals.types.detector import DiagnosisTrigger

class DiagnosisConfig(BaseModel):
    trigger: DiagnosisTrigger = DiagnosisTrigger.ON_FAILURE
    model: Model | str | None = None
    confidence_threshold: ConfidenceLevel = ConfidenceLevel.MEDIUM

| Parameter | Default | Description |
| --- | --- | --- |
| `trigger` | `DiagnosisTrigger.ON_FAILURE` | When to run diagnosis. `ON_FAILURE` runs only when at least one evaluator fails. `ALWAYS` runs on every case. |
| `model` | `None` | Model for the detectors. `None` uses the default. |
| `confidence_threshold` | `ConfidenceLevel.MEDIUM` | Minimum confidence for failure detection. |

Example: Diagnose Failing Cases

from strands import Agent
from strands_evals import Case, Experiment, DiagnosisConfig, eval_task, TracedHandler
from strands_evals.detectors import ConfidenceLevel
from strands_evals.evaluators import GoalSuccessRateEvaluator
from strands_evals.types.detector import DiagnosisTrigger

@eval_task(TracedHandler())
def my_agent_task():
    return Agent(
        system_prompt="You are a helpful travel booking assistant.",
        callback_handler=None,
    )

cases = [
    Case(
        name="booking-1",
        input="Book me a flight from NYC to London for next Friday.",
        metadata={"task_description": "Flight booked with confirmation number"},
    ),
    Case(
        name="booking-2",
        input="I need to cancel my reservation ABC123.",
        metadata={"task_description": "Reservation cancelled successfully"},
    ),
]

experiment = Experiment(
    cases=cases,
    evaluators=[GoalSuccessRateEvaluator()],
    diagnosis_config=DiagnosisConfig(
        trigger=DiagnosisTrigger.ON_FAILURE,
        confidence_threshold=ConfidenceLevel.MEDIUM,
    ),
)

reports = experiment.run_evaluations(my_agent_task)

Viewing Recommendations

Display recommendations in the evaluation report:

# Display with recommendations column
reports[0].display(include_recommendations=True)

Or access them programmatically:

report = reports[0]
for i, rec in enumerate(report.recommendations):
    if rec is not None:
        case_name = report.cases[i].get("name", f"case_{i}")
        passed = report.test_passes[i]
        print(f"[{'PASS' if passed else 'FAIL'}] {case_name}")
        print(f"  Recommendation: {rec}")

Accessing Full Diagnosis Data

The full diagnosis dict (failures + root causes) is available per case:

report = reports[0]
for i, diagnosis in enumerate(report.diagnoses):
    if diagnosis is not None:
        case_name = report.cases[i].get("name", f"case_{i}")
        n_failures = len(diagnosis.get("failures", []))
        n_rca = len(diagnosis.get("root_causes", []))
        print(f"{case_name}: {n_failures} failures, {n_rca} root causes")

        for rc in diagnosis.get("root_causes", []):
            print(f"  [{rc['fix_type']}] {rc['fix_recommendation']}")

Trigger Modes

DiagnosisTrigger.ON_FAILURE (default): Diagnosis runs only when at least one evaluator returns test_pass=False for the case. This is the cheaper of the two modes, because no LLM calls are spent diagnosing passing cases.

DiagnosisConfig(trigger=DiagnosisTrigger.ON_FAILURE)

DiagnosisTrigger.ALWAYS: Diagnosis runs on every case regardless of evaluator results. Useful for deep analysis or when you want to detect latent issues in passing cases (e.g., the agent succeeded but through a suboptimal path).

DiagnosisConfig(trigger=DiagnosisTrigger.ALWAYS)
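
The two modes can be summarized as a small decision function. This is a sketch of the rule as described above, not the Experiment class's actual code; Trigger is an illustrative stand-in for DiagnosisTrigger and its string values are assumptions.

```python
from enum import Enum


class Trigger(Enum):
    """Illustrative stand-in for DiagnosisTrigger (values are assumed)."""
    ON_FAILURE = "on_failure"
    ALWAYS = "always"


def should_diagnose(trigger, evaluator_passes):
    """Decide whether a case gets diagnosed, given one pass flag per evaluator."""
    if trigger is Trigger.ALWAYS:
        return True
    # ON_FAILURE: diagnose when at least one evaluator reported a failure.
    return not all(evaluator_passes)


all_passed = should_diagnose(Trigger.ON_FAILURE, [True, True])
one_failed = should_diagnose(Trigger.ON_FAILURE, [True, False])
always = should_diagnose(Trigger.ALWAYS, [True, True])
```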

Requirements

Diagnosis requires the task function to return a Session object as the trajectory. This means using one of the following:

The @eval_task(TracedHandler()) decorator (recommended)
A manual task function that collects spans and maps them to a Session via StrandsInMemorySessionMapper
A trace provider that returns Session objects

If the trajectory is not a Session (e.g., it’s a plain list of tool names), diagnosis is silently skipped for that case.
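
Because skipped cases produce no warning, a check after the run can help surface them. The sketch below assumes that report.test_passes and report.diagnoses are parallel per-case lists, as in the snippets earlier on this page, and uses plain lists in their place. Under ON_FAILURE, a failed case with no diagnosis may indicate that its trajectory was not a Session, though other causes are possible.

```python
def undiagnosed_failures(test_passes, diagnoses):
    """Return indices of cases that failed but carry no diagnosis."""
    return [
        i
        for i, (passed, diagnosis) in enumerate(zip(test_passes, diagnoses))
        if not passed and diagnosis is None
    ]


# Plain lists standing in for report.test_passes and report.diagnoses.
test_passes = [True, False, False]
diagnoses = [None, {"failures": [], "root_causes": []}, None]

missing = undiagnosed_failures(test_passes, diagnoses)
```

Here the third case failed without a diagnosis, so it is the one worth inspecting.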

Example: Full Workflow

This example shows the complete workflow: run evaluations with diagnosis, then analyze the results.

from strands import Agent
from strands_evals import Case, Experiment, DiagnosisConfig, eval_task, TracedHandler
from strands_evals.detectors import ConfidenceLevel
from strands_evals.evaluators import OutputEvaluator, GoalSuccessRateEvaluator
from strands_evals.types.detector import DiagnosisTrigger
from strands_tools import calculator

@eval_task(TracedHandler())
def math_agent():
    return Agent(
        tools=[calculator],
        system_prompt="You are a math assistant. Use the calculator tool for computations.",
        callback_handler=None,
    )

cases = [
    Case(
        name="basic-math",
        input="What is 15% of 230?",
        expected_output="34.5",
        metadata={"task_description": "Correct calculation provided"},
    ),
]

experiment = Experiment(
    cases=cases,
    evaluators=[
        OutputEvaluator(rubric="Score 1.0 if the answer is numerically correct. Score 0.0 otherwise."),
        GoalSuccessRateEvaluator(),
    ],
    diagnosis_config=DiagnosisConfig(
        trigger=DiagnosisTrigger.ON_FAILURE,
        confidence_threshold=ConfidenceLevel.MEDIUM,
    ),
)

reports = experiment.run_evaluations(math_agent)

# Flatten all evaluator reports into one and display
from strands_evals.types.evaluation_report import EvaluationReport

combined = EvaluationReport.flatten(reports)
combined.display(include_recommendations=True)

Best Practices

Start with DiagnosisTrigger.ON_FAILURE to minimize LLM costs — only diagnose cases that actually fail
Use ConfidenceLevel.MEDIUM for diagnosis to balance signal and noise
Use TracedHandler with the @eval_task decorator to automatically collect Session objects
Group recommendations across cases to identify systematic issues vs. one-off problems
Use DiagnosisTrigger.ALWAYS sparingly — it’s useful for deep analysis but doubles the LLM cost per case
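
Grouping recommendations across cases can be sketched with a simple counter. The helper name and the min_cases parameter are hypothetical, and the input list stands in for report.recommendations (one entry per case, None where no diagnosis ran). Exact string matching is a simplification: model-generated recommendations may vary in wording, so real grouping might need normalization or grouping by fix type instead.

```python
from collections import Counter


def recurring_recommendations(per_case_recommendations, min_cases=2):
    """Count how many cases share each recommendation; keep the recurring ones."""
    counts = Counter(rec for rec in per_case_recommendations if rec is not None)
    return [(rec, n) for rec, n in counts.most_common() if n >= min_cases]


# Stand-in for report.recommendations.
per_case = [
    "Clarify the system prompt",
    None,
    "Add a retry to the search tool",
    "Clarify the system prompt",
]
systematic = recurring_recommendations(per_case)
```

Recommendations that recur across cases point to systematic issues; those that appear once are more likely one-off problems.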

Related Documentation

Failure Detection: Standalone failure detection
Root Cause Analysis: Standalone root cause analysis
Detectors Overview: High-level detectors guide
Task Decorator: Simplify task functions with @eval_task
Remote Trace Providers: Fetch traces from production backends