Command-Line Interface

Overview

Installing strands-agents-evals also installs the strands-evals console script — a thin wrapper over the public Python API for CI gates and one-off use. It exposes five subcommands that map directly to library calls so behavior in CI matches what you get from a Python script.

Command	Purpose
`strands-evals run`	Execute an `Experiment` against an `--agent` factory or `--task` callable, or run a single ad-hoc case via `--input` + `--evaluator`/`--expected-output`/`--rubric`.
`strands-evals validate`	Schema-check a serialized `Experiment` JSON file. Useful as a CI gate before `run`.
`strands-evals report`	Render an existing `EvaluationReport` JSON via Rich, or dump it as JSON.
`strands-evals diagnose`	Run `detect_failures`, `analyze_root_cause`, or the full `diagnose_session` pipeline on a `Session` JSON file.
`strands-evals generate`	Synthesize an `Experiment` via `ExperimentGenerator` from a free-form `--context` or an existing `--experiment` file.

Run any subcommand with --help for the full flag set.

Installation

pip install strands-agents-evals

The strands-evals script is registered as a console entry point and is on your PATH after installation.

Entry Point Convention

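--agent, --task, --evaluator, and --custom-evaluator all accept a MODULE:ATTR reference. This is the same module:attribute style that gunicorn uses for its application argument; pytest --pyargs and inspect-ai eval resolve dotted or path-like references in a similar spirit.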

Two forms are accepted:

Dotted module: pkg.module:attr, resolved via importlib.import_module. The current working directory is added to sys.path, so a sibling file like agent.py works as agent:build_agent with no need to point PYTHONPATH at the current directory.
Path-like: ./agent.py:build_agent, ../sibling/agent:build_agent, or /abs/path/agent.py:build_agent. A reference is treated as path-like when the module part contains a path separator, starts with ./, ../, or ~, or ends in .py.
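
The two forms above can be told apart and resolved with the standard library alone. The following is a minimal sketch, not the CLI's actual implementation: the helper names is_path_like and resolve_entry_point are hypothetical, and loading a path-like reference through importlib.util is an assumption about how such a file could be imported.

```python
import importlib
import importlib.util
import os
import sys
from pathlib import Path


def is_path_like(module_ref: str) -> bool:
    """Apply the path-like test to the part of the reference before the colon."""
    return (
        "/" in module_ref
        or os.sep in module_ref
        or module_ref.startswith(("./", "../", "~"))
        or module_ref.endswith(".py")
    )


def resolve_entry_point(ref: str):
    """Resolve a MODULE:ATTR reference to the object it names (hypothetical helper)."""
    # Split on the last colon so drive letters or earlier colons stay in the module part.
    module_ref, sep, attr = ref.rpartition(":")
    if not sep or not module_ref or not attr:
        raise ValueError(f"expected MODULE:ATTR, got {ref!r}")

    if is_path_like(module_ref):
        path = Path(module_ref).expanduser()
        if path.suffix != ".py":
            # Allow ../sibling/agent as shorthand for ../sibling/agent.py.
            path = path.with_name(path.name + ".py")
        spec = importlib.util.spec_from_file_location(path.stem, path)
        if spec is None or spec.loader is None:
            raise ImportError(f"cannot load {path}")
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
    else:
        # Make sibling files in the working directory importable.
        cwd = os.getcwd()
        if cwd not in sys.path:
            sys.path.insert(0, cwd)
        module = importlib.import_module(module_ref)

    return getattr(module, attr)
```

With this sketch, resolve_entry_point("json:dumps") returns the standard library's json.dumps, and resolve_entry_point("./agent.py:build_agent") loads the file directly without touching sys.path.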

`run` — Execute an Experiment

Two modes:

Experiment file mode: pass an EXPERIMENT_FILE (a JSON document produced by Experiment.to_file).
Ad-hoc mode: omit the file and provide --input + at least one of --evaluator, --expected-output, or --rubric for a single-case run without authoring an experiment.

The two modes are mutually exclusive — argparse rejects mixing them.
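
The mode check can be pictured as a small argparse validation step. This is a sketch under assumptions, not the CLI's real parser: the function names and error messages are hypothetical, and only the flags named above are modeled.

```python
import argparse


def build_run_parser() -> argparse.ArgumentParser:
    """Model only the arguments that decide the run mode."""
    parser = argparse.ArgumentParser(prog="strands-evals run")
    parser.add_argument("experiment_file", nargs="?")
    parser.add_argument("--input")
    parser.add_argument("--evaluator", action="append", default=[])
    parser.add_argument("--expected-output")
    parser.add_argument("--rubric")
    return parser


def parse_run_args(argv: list) -> argparse.Namespace:
    """Parse argv and reject combinations that mix or underspecify the modes."""
    parser = build_run_parser()
    args = parser.parse_args(argv)
    if args.experiment_file and args.input:
        # parser.error prints a usage message to stderr and exits with status 2.
        parser.error("EXPERIMENT_FILE and --input are mutually exclusive")
    if not args.experiment_file:
        if not args.input:
            parser.error("provide EXPERIMENT_FILE or --input")
        if not (args.evaluator or args.expected_output or args.rubric):
            parser.error("--input needs --evaluator, --expected-output, or --rubric")
    return args
```

Because parser.error exits with status 2, a mixed invocation in this sketch lines up with the "bad input" exit code listed later in this page.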

Choosing `--agent` vs `--task`

--agent is the standard path. It expects a factory callable that returns a fresh strands.Agent per invocation:

from strands import Agent
from strands_tools import calculator

def build_agent():
    return Agent(tools=[calculator], callback_handler=None)

The CLI synthesizes the standard task wrapper around it: telemetry setup → per-case OTel context (session.id, gen_ai.conversation.id) → factory call → invoke with case.input → map spans to a Session → return {"output", "trajectory"}. Trace-based evaluators (HelpfulnessEvaluator, FaithfulnessEvaluator, GoalSuccessRateEvaluator, etc.) read the trajectory directly.

The factory may also take a single Case argument for per-case customization:

def build_agent(case):
    tools = [calculator] if (case.metadata or {}).get("use_calc") else []
    return Agent(tools=tools, callback_handler=None)

A prebuilt strands.Agent instance or an Agent subclass is rejected — the conversation state would leak across cases.
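
One way to support both factory shapes is to inspect the callable's signature before invoking it. The sketch below is an assumption about how such dispatch could work, not the CLI's code: call_factory is a hypothetical name, and the instance/class rejection is approximated by accepting only plain functions and methods, which is stricter than the rule stated above.

```python
import inspect
from typing import Any, Callable


def call_factory(factory: Callable[..., Any], case: Any) -> Any:
    """Call a zero-argument factory as-is, or pass the case to a one-argument one."""
    # Callable instances (such as a prebuilt agent) and classes are not routines,
    # so they are rejected here. This also rejects functools.partial objects.
    if not inspect.isroutine(factory):
        raise TypeError("expected a factory function, not an instance or class")

    positional = [
        p
        for p in inspect.signature(factory).parameters.values()
        if p.kind in (inspect.Parameter.POSITIONAL_ONLY, inspect.Parameter.POSITIONAL_OR_KEYWORD)
    ]
    if len(positional) == 0:
        return factory()
    if len(positional) == 1:
        return factory(case)
    raise TypeError("factory must take zero arguments or a single Case argument")
```

Calling the factory once per case is what keeps conversation state from leaking: every case gets an object that no other case has touched.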

--task is the escape hatch for non-standard task shapes — multi-turn loops, custom session mapping, etc. It expects a Callable[[Case], dict|str]. When --task is used, the user owns agent instantiation; --trace-attributes is a no-op and is logged as a warning.
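
As an illustration of a task shape that --agent cannot express, here is a minimal multi-turn sketch. Everything in it is a stand-in: StandInCase mimics only the input attribute of the library's Case, echo_model replaces a real agent call, and the list-of-dicts trajectory is an assumed shape chosen for readability rather than the Session format trace-based evaluators expect.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class StandInCase:
    """Hypothetical stand-in for the library's Case; only `input` is used here."""

    input: Any
    metadata: dict = field(default_factory=dict)


def echo_model(prompt: str) -> str:
    """Placeholder for a real agent invocation."""
    return f"echo: {prompt}"


def multi_turn_task(case: StandInCase) -> dict:
    """Feed each turn to the model in order and report the final reply."""
    turns = case.input if isinstance(case.input, list) else [case.input]
    transcript = []
    reply = ""
    for turn in turns:
        reply = echo_model(turn)
        transcript.append({"user": turn, "assistant": reply})
    # The task owns the whole loop, so it also decides what counts as the output.
    return {"output": reply, "trajectory": transcript}
```

A function like this would be referenced as --task my_module:multi_turn_task, with the real Case type in place of the stand-in.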

Experiment file run

# Schema-check first, then run against a factory
strands-evals validate experiments/customer_service.json
strands-evals run experiments/customer_service.json \
  --agent my_pkg.agents:build_agent \
  --display

--display renders a Rich table on stdout with input, expected output, actual output, and per-evaluator scores.

run is the one subcommand whose primary stdout output does not follow the global --rich/--json TTY auto-detection. Building the Rich table eagerly walks every case row, which is wasteful on large experiments where output is typically piped to strands-evals report or written via -o. Concretely, without --display:

--json (or stdout is a pipe) → flattened report JSON on stdout.
TTY with no --json/--rich → silent on stdout (pass --display to see results).
-o PATH → JSON written to the file; nothing on stdout.

Ad-hoc run

For a one-off check with no experiment file:

# Substring match against the agent's response
strands-evals run \
  --input "What is the capital of France?" \
  --expected-output "Paris" \
  --agent my_pkg.agents:build_agent

# LLM-as-judge with a rubric
strands-evals run \
  --input "Explain recursion in one paragraph." \
  --rubric "Score 1.0 if accurate and one paragraph. Score 0.0 otherwise." \
  --agent my_pkg.agents:build_agent

# Built-in shortname evaluator
strands-evals run \
  --input "Is 17 prime?" \
  --evaluator helpfulness \
  --agent my_pkg.agents:build_agent

Auto-wiring rules in ad-hoc mode:

--expected-output TEXT (without --evaluator) → Contains(value=TEXT).
--rubric TEXT (without --evaluator) → OutputEvaluator(rubric=TEXT).
--expected-output and --rubric compose — both auto-evaluators are appended.
An explicit --evaluator disables the auto-wiring; pass it again to add more (--evaluator is repeatable).
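
The auto-wiring rules amount to a short decision function. The sketch below returns descriptive strings instead of real evaluator objects so that it runs without the library; auto_wire is a hypothetical name and the error for an empty selection is an assumption.

```python
from typing import List, Optional


def auto_wire(
    evaluators: List[str],
    expected_output: Optional[str] = None,
    rubric: Optional[str] = None,
) -> List[str]:
    """Return the evaluators an ad-hoc run would use, described as strings."""
    if evaluators:
        # Any explicit --evaluator switches auto-wiring off entirely.
        return list(evaluators)

    wired = []
    if expected_output is not None:
        wired.append(f"Contains(value={expected_output!r})")
    if rubric is not None:
        wired.append(f"OutputEvaluator(rubric={rubric!r})")
    if not wired:
        raise ValueError("ad-hoc mode needs --evaluator, --expected-output, or --rubric")
    return wired
```

For example, auto_wire([], expected_output="Paris") yields a single Contains entry, while auto_wire(["helpfulness"], expected_output="Paris") yields only helpfulness.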

--evaluator accepts either a built-in shortname or MODULE:CLASS for a custom Evaluator subclass. Built-in shortnames instantiate with no arguments; richer config (custom rubrics, judge models, target tool names) belongs in an experiment file.

Built-in shortnames: coherence, conciseness, correctness, equals, faithfulness, goal-success-rate, harmfulness, helpfulness, instruction-following, refusal, response-relevance, stereotyping, tool-parameter-accuracy, tool-selection-accuracy.

Concurrency, caching, and exit codes

strands-evals run experiments/regression.json \
  --agent my_pkg.agents:build_agent \
  --max-workers 8 \
  --data-store ./.cache/regression \
  --fail-on threshold:0.8 \
  -o reports/regression.json

--max-workers controls parallelism for run_evaluations_async (default 1).
--data-store DIR enables LocalFileTaskResultStore so cached task outputs short-circuit reruns. See Result Caching for details.
--fail-on chooses the exit-code rule: any (default — exit non-zero on any case failure), none (always exit 0 on completion), or threshold:0.X (exit non-zero when the report’s overall score falls below the threshold).
--exit-zero overrides --fail-on and always returns 0. Useful when you want to record the report without breaking the build.
-o PATH writes the flattened report JSON to a file. Without -o, the JSON goes to stdout when --json is passed or stdout is a pipe.

Exit codes:

Code	Meaning
`0`	Success (no failure under the active `--fail-on` rule, or `--fail-on=none` / `--exit-zero`).
`1`	Evaluation failures triggered by `--fail-on`.
`2`	Bad input (invalid flags, missing entry point, schema error).
`3`	Unexpected runtime error.
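
The interaction between --fail-on and --exit-zero can be summarized as a pure function. This is a sketch of the documented rules only: the function name and its inputs (a list of per-case pass flags and an overall score) are hypothetical and do not reflect the report's real structure.

```python
from typing import List


def exit_code(fail_on: str, exit_zero: bool, case_passed: List[bool], overall_score: float) -> int:
    """Map a finished run to exit code 0 or 1 under the given --fail-on rule."""
    if exit_zero or fail_on == "none":
        return 0
    if fail_on == "any":
        return 0 if all(case_passed) else 1
    if fail_on.startswith("threshold:"):
        threshold = float(fail_on.split(":", 1)[1])
        return 1 if overall_score < threshold else 0
    # In the CLI an unknown rule would be bad input (code 2); the sketch just raises.
    raise ValueError(f"unknown --fail-on rule: {fail_on!r}")
```

Note that a threshold rule can return 0 even when individual cases fail, as long as the overall score stays at or above the threshold.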

Diagnosis during a run

Combine run with on-failure diagnosis to capture root causes alongside scores:

strands-evals run experiments/regression.json \
  --agent my_pkg.agents:build_agent \
  --diagnose on_failure \
  --confidence medium \
  --display

--diagnose accepts on_failure or always. Diagnosis requires Session trajectories, which only --agent produces. With --display, recommendations render in the Rich table.

Trace attributes and custom evaluators

strands-evals run experiments/regression.json \
  --agent my_pkg.agents:build_agent \
  --trace-attributes service.name=billing \
  --trace-attributes deployment.env=staging \
  --custom-evaluator my_pkg.evaluators:DomainSafetyEvaluator

--trace-attributes KEY=VALUE is repeatable. The pairs are set as W3C Baggage on the per-case context and stamped on every span the agent emits. session.id and gen_ai.conversation.id are always set from the case — --trace-attributes is for additional keys. No-op when --task is used.
--custom-evaluator MODULE:CLASS registers a custom Evaluator subclass before Experiment.from_file so the deserializer can rehydrate it. Repeatable. Ignored in ad-hoc mode (pass MODULE:CLASS directly to --evaluator instead).
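
A repeatable KEY=VALUE flag is typically collected with argparse's append action and then split into a mapping. The following sketch assumes that approach; the helper names are hypothetical and the CLI's own validation may differ.

```python
import argparse
from typing import Dict, List


def build_parser() -> argparse.ArgumentParser:
    """Model only the repeatable --trace-attributes flag."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--trace-attributes", action="append", default=[], metavar="KEY=VALUE")
    return parser


def parse_trace_attributes(pairs: List[str]) -> Dict[str, str]:
    """Turn ['a=1', 'b=2'] into {'a': '1', 'b': '2'}, splitting on the first '='."""
    attrs: Dict[str, str] = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        attrs[key] = value
    return attrs
```

Splitting on the first = keeps values that themselves contain an equals sign intact.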

`validate` — Schema-check an Experiment

strands-evals validate experiments/customer_service.json
# valid: 12 case(s), 3 evaluator(s) [OutputEvaluator, TrajectoryEvaluator, HelpfulnessEvaluator]

validate loads the file via Experiment.from_file and reports case + evaluator counts. It exits non-zero on schema or I/O errors, making it a fast CI gate before run. Use --custom-evaluator MODULE:CLASS (repeatable) when the experiment references custom evaluators.

`report` — Render an existing report

# Static Rich rendering on stdout (Rich on a TTY, JSON when piped — pass --rich to force)
strands-evals report reports/regression.json --rich

# Interactive Rich table (expand/collapse rows)
strands-evals report reports/regression.json --interactive

# Include diagnosis recommendations
strands-evals report reports/regression.json --recommendations

# Re-emit as JSON
strands-evals report reports/regression.json --json

report accepts - to read from stdin, so it composes with run:

strands-evals run experiments/regression.json --agent my_pkg.agents:build_agent \
  | strands-evals report - --recommendations

-o PATH always writes JSON regardless of --interactive/--rich, so you can pipe through report to persist a stable on-disk format.

`diagnose` — Detect failures and analyze root causes

diagnose operates on a serialized Session (the same Session object trace-based evaluators consume). Three modes:

# Full pipeline: detect failures and analyze root causes
strands-evals diagnose session.json --confidence medium

# Detection only
strands-evals diagnose session.json --detect-only --confidence high

# Root cause analysis only
strands-evals diagnose session.json --rca-only

# Read from stdin, write JSON to a file
cat session.json | strands-evals diagnose - --output diagnosis.json

--confidence is the minimum confidence threshold for failure detection (low | medium | high, default low).
--model MODEL_ID overrides the judge model used for detection and RCA.
--detect-only and --rca-only are mutually exclusive; omit both for the full pipeline.
A one-line summary is always written to stderr (diagnosis: N failure(s), M root cause(s)), so the command is scriptable even when the rich output goes to a TTY.
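
Because the summary line has a fixed shape, a wrapper script can extract the counts from captured stderr. This sketch assumes the line looks exactly like the format quoted above; parse_diagnosis_summary is a hypothetical helper.

```python
import re
from typing import Optional, Tuple

# Matches the summary format quoted above, e.g. "diagnosis: 3 failure(s), 2 root cause(s)".
SUMMARY_RE = re.compile(r"diagnosis: (\d+) failure\(s\), (\d+) root cause\(s\)")


def parse_diagnosis_summary(stderr_text: str) -> Optional[Tuple[int, int]]:
    """Return (failures, root_causes) from stderr, or None if no summary line is found."""
    match = SUMMARY_RE.search(stderr_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

The stderr text would typically come from subprocess.run(..., capture_output=True, text=True).stderr after invoking strands-evals diagnose.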

See Detectors for the underlying API.

`generate` — Synthesize an Experiment

generate wraps ExperimentGenerator to produce a starter experiment from either a free-form context or an existing experiment file. The two source flags are mutually exclusive.

From a context description

strands-evals generate \
  --context "$(cat tools.txt)" \
  --num-cases 10 \
  --evaluator TrajectoryEvaluator \
  --task-description "Calculation and time-aware assistant" \
  --num-topics 3 \
  -o experiments/generated.json

--context accepts free-form text. Use shell substitution for file contents.
--num-cases (default 5) is the number of test cases to generate.
--evaluator (context mode only) attaches a default evaluator with a generated rubric. Choices: OutputEvaluator, TrajectoryEvaluator, InteractionsEvaluator. Omit to produce an experiment with a placeholder Evaluator.
--num-topics (context mode only) splits generation across N topic-specific prompts for diverse coverage.

From an existing experiment

strands-evals generate \
  --experiment experiments/baseline.json \
  --num-cases 20 \
  --extra-information "Focus on edge cases involving timezone handling." \
  -o experiments/expanded.json

New cases are inspired by the source; evaluators are inherited from the source’s defaults (so --evaluator and --num-topics are rejected).
--custom-evaluator MODULE:CLASS (experiment mode only, repeatable) registers custom evaluators before loading the source.
--extra-information (experiment mode only) is extra context for the new cases and rubric.

--model MODEL_ID overrides the judge model used by the generator. With -o, the experiment is written via Experiment.to_file (a .json extension is enforced). Without -o, the JSON document is written to stdout. A one-line summary on stderr reports the case and evaluator counts.

See Experiment Generator for the underlying API.

Global flags

Every subcommand accepts the same global flags from the parent parser:

Flag	Purpose
`--json`	Emit machine-readable JSON to stdout.
`--rich`	Emit Rich-rendered output to stdout. Default when stdout is a TTY.
`-v`, `--verbose`	Increase log verbosity. Repeat (`-vv`) for `DEBUG`.
`--debug`	`DEBUG` logging plus full tracebacks on errors.

--json and --rich are mutually exclusive; without either, the format is auto-detected from whether stdout is a TTY.
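
The global flags map naturally onto an argparse parent parser with a mutually exclusive group. The sketch below shows one plausible arrangement; the function names are hypothetical and the real parser may be structured differently.

```python
import argparse
import sys
from typing import Optional


def build_parent_parser() -> argparse.ArgumentParser:
    """Parent parser holding the flags shared by every subcommand."""
    parent = argparse.ArgumentParser(add_help=False)
    fmt = parent.add_mutually_exclusive_group()
    fmt.add_argument("--json", action="store_true")
    fmt.add_argument("--rich", action="store_true")
    parent.add_argument("-v", "--verbose", action="count", default=0)
    parent.add_argument("--debug", action="store_true")
    return parent


def output_format(args: argparse.Namespace, stdout_is_tty: Optional[bool] = None) -> str:
    """Pick 'json' or 'rich': explicit flags win, otherwise follow the TTY check."""
    if args.json:
        return "json"
    if args.rich:
        return "rich"
    if stdout_is_tty is None:
        stdout_is_tty = sys.stdout.isatty()
    return "rich" if stdout_is_tty else "json"
```

A subcommand parser would pick these flags up with argparse.ArgumentParser(parents=[build_parent_parser()]).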

CI Integration

A typical CI flow combines validate (fast schema gate) with run (the actual evaluation):

# .github/workflows/evals.yml (excerpt)
- name: Validate experiments
  run: strands-evals validate experiments/regression.json

- name: Run evaluations
  run: |
    strands-evals run experiments/regression.json \
      --agent my_pkg.agents:build_agent \
      --max-workers 8 \
      --data-store ./.cache/regression \
      --fail-on threshold:0.85 \
      -o regression-report.json

- name: Upload report
  if: always()
  uses: actions/upload-artifact@v4
  with:
    name: eval-report
    path: regression-report.json

validate exits non-zero on schema errors before any agent calls, and run exits non-zero on evaluation failures via --fail-on. Because --data-store caches task outputs, reruns are cheap when only the evaluators change: cached cases are served from the store instead of invoking the agent again.
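
Outside GitHub Actions, the same gate can be driven from a small Python wrapper that stops at the first failing step. This is a sketch: run_gate and EXIT_MEANINGS are hypothetical names, the command lines reuse only flags shown in this page, and the injectable runner exists so the logic can be exercised without the CLI installed.

```python
import subprocess
import sys
from typing import Callable, Optional, Sequence

# Exit-code meanings as documented in the exit codes table.
EXIT_MEANINGS = {
    0: "success",
    1: "evaluation failures",
    2: "bad input",
    3: "unexpected runtime error",
}

GATE_STEPS = [
    ["strands-evals", "validate", "experiments/regression.json"],
    [
        "strands-evals", "run", "experiments/regression.json",
        "--agent", "my_pkg.agents:build_agent",
        "--fail-on", "threshold:0.85",
        "-o", "regression-report.json",
    ],
]


def run_gate(
    steps: Sequence[Sequence[str]],
    runner: Optional[Callable[[Sequence[str]], int]] = None,
) -> int:
    """Run each command in order and return the first non-zero exit code, else 0."""
    if runner is None:
        runner = lambda cmd: subprocess.run(list(cmd)).returncode
    for cmd in steps:
        code = runner(cmd)
        if code != 0:
            meaning = EXIT_MEANINGS.get(code, "unknown")
            print(f"{' '.join(cmd[:2])} failed with {code} ({meaning})", file=sys.stderr)
            return code
    return 0
```

A CI job would call sys.exit(run_gate(GATE_STEPS)) so that the wrapper's exit status mirrors the failing step.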

Next Steps

Task Decorator — the Python equivalent of --agent’s synthesized task wrapper, for use in scripts.
Result Caching — what --data-store writes and how cache hits work.
Serialization — the on-disk shapes consumed by validate, report, and generate --experiment.
Experiment Generator — the API behind strands-evals generate.
Detectors — the API behind strands-evals diagnose.