
Command-Line Interface

Installing strands-agents-evals also installs the strands-evals console script — a thin wrapper over the public Python API for CI gates and one-off use. It exposes five subcommands that map directly to library calls so behavior in CI matches what you get from a Python script.

| Command | Purpose |
| --- | --- |
| strands-evals run | Execute an Experiment against an --agent factory or --task callable, or run a single ad-hoc case via --input + --evaluator/--expected-output/--rubric. |
| strands-evals validate | Schema-check a serialized Experiment JSON file. Useful as a CI gate before run. |
| strands-evals report | Render an existing EvaluationReport JSON via Rich, or dump it as JSON. |
| strands-evals diagnose | Run detect_failures, analyze_root_cause, or the full diagnose_session pipeline on a Session JSON file. |
| strands-evals generate | Synthesize an Experiment via ExperimentGenerator from a free-form --context or an existing --experiment file. |

Run any subcommand with --help for the full flag set.

pip install strands-agents-evals

The strands-evals script is registered as a console entry point and is on your PATH after installation.

--agent, --task, --evaluator, and --custom-evaluator all accept a MODULE:ATTR reference. The same convention is used by pytest --pyargs, gunicorn, and inspect-ai eval.

Two forms are accepted:

  • Dotted module: pkg.module:attr, resolved via importlib.import_module. The current working directory is added to sys.path, so a sibling file like agent.py works as agent:build_agent without having to set PYTHONPATH=. first.
  • Path-like: ./agent.py:build_agent, ../sibling/agent:build_agent, or /abs/path/agent.py:build_agent. Any module part that contains a path separator, starts with ./, ../, or ~, or ends in .py is treated as a file path.
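The two reference forms above can be sketched with the standard library alone. This is a minimal illustration, not the CLI's actual resolver: the helper names looks_like_path and resolve_reference are hypothetical, and the choice to append .py to an extensionless path is an assumption based on the ../sibling/agent:build_agent example.

```python
import importlib
import importlib.util
import os
import sys
from pathlib import Path


def looks_like_path(module_part: str) -> bool:
    """Heuristic from the text: a separator, a ./ ../ ~ prefix, or a .py suffix."""
    return (
        "/" in module_part
        or os.sep in module_part
        or module_part.startswith(("./", "../", "~"))
        or module_part.endswith(".py")
    )


def resolve_reference(ref: str):
    """Resolve a 'MODULE:ATTR' string to the named attribute (hypothetical helper)."""
    # Split on the last colon so Windows drive letters stay in the module part.
    module_part, sep, attr = ref.rpartition(":")
    if not sep or not module_part or not attr:
        raise ValueError(f"expected MODULE:ATTR, got {ref!r}")

    if looks_like_path(module_part):
        path = Path(module_part).expanduser()
        if path.suffix != ".py":
            # Assumption: ../sibling/agent means ../sibling/agent.py
            path = path.with_suffix(".py")
        spec = importlib.util.spec_from_file_location(path.stem, path)
        if spec is None or spec.loader is None:
            raise ImportError(f"cannot load {path}")
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
    else:
        # Make sibling files importable without setting PYTHONPATH.
        cwd = os.getcwd()
        if cwd not in sys.path:
            sys.path.insert(0, cwd)
        module = importlib.import_module(module_part)

    return getattr(module, attr)
```

For example, resolve_reference("json:dumps") returns the standard library's json.dumps, and a reference such as ./agent.py:build_agent loads the file directly without touching sys.path.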

Two modes:

  • Experiment file mode: pass an EXPERIMENT_FILE (a JSON document produced by Experiment.to_file).
  • Ad-hoc mode: omit the file and provide --input + at least one of --evaluator, --expected-output, or --rubric for a single-case run without authoring an experiment.

The two modes are mutually exclusive — argparse rejects mixing them.

--agent is the standard path. It expects a factory callable that returns a fresh strands.Agent per invocation:

# my_pkg/agents.py
from strands import Agent
from strands_tools import calculator


def build_agent():
    # Return a fresh Agent on every call so no state is shared between cases.
    return Agent(tools=[calculator], callback_handler=None)

The CLI synthesizes the standard task wrapper around it: telemetry setup → per-case OTel context (session.id, gen_ai.conversation.id) → factory call → invoke with case.input → map spans to a Session → return {"output", "trajectory"}. Trace-based evaluators (HelpfulnessEvaluator, FaithfulnessEvaluator, GoalSuccessRateEvaluator, etc.) read the trajectory directly.

The factory may also take a single Case argument for per-case customization:

def build_agent(case):
    # Enable the calculator only for cases that ask for it in their metadata.
    tools = [calculator] if (case.metadata or {}).get("use_calc") else []
    return Agent(tools=tools, callback_handler=None)

A prebuilt strands.Agent instance or an Agent subclass is rejected — the conversation state would leak across cases.
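One way a wrapper could tell the zero-argument factory from the single-Case factory, and reject instances and subclasses, is to inspect the callable's signature. The sketch below is an assumption about how such a check might look, not the CLI's code: call_factory is a hypothetical name, the Agent class is a local stand-in so the example runs without the SDK, and treating parameters with defaults as optional is a design choice made here.

```python
import inspect


class Agent:
    """Stand-in for strands.Agent so this sketch runs without the SDK."""


def call_factory(factory, case):
    """Call a zero-argument or single-Case factory (hypothetical helper)."""
    # Reject prebuilt instances and Agent subclasses: state would leak across cases.
    if isinstance(factory, Agent) or (
        inspect.isclass(factory) and issubclass(factory, Agent)
    ):
        raise TypeError("expected a factory function, not an Agent instance or class")
    if not callable(factory):
        raise TypeError(f"{factory!r} is not callable")

    # Count only positional parameters that have no default value.
    positional = (
        inspect.Parameter.POSITIONAL_ONLY,
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
    )
    required = [
        p
        for p in inspect.signature(factory).parameters.values()
        if p.kind in positional and p.default is inspect.Parameter.empty
    ]
    if not required:
        return factory()
    if len(required) == 1:
        return factory(case)
    raise TypeError("factory must take zero arguments or a single Case")
```

Checking the signature up front keeps the error message specific; calling the factory and catching TypeError would also mask genuine errors raised inside the factory body.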

--task is the escape hatch for non-standard task shapes — multi-turn loops, custom session mapping, etc. It expects a Callable[[Case], dict|str]. When --task is used, the user owns agent instantiation; --trace-attributes is a no-op and is logged as a warning.

# Schema-check first, then run against a factory
strands-evals validate experiments/customer_service.json
strands-evals run experiments/customer_service.json \
  --agent my_pkg.agents:build_agent \
  --display

--display renders a Rich table on stdout with input, expected output, actual output, and per-evaluator scores.

run is the one subcommand whose primary stdout output does not follow the global --rich/--json TTY auto-detection. Building the Rich table eagerly walks every case row, which is wasteful on large experiments where output is typically piped to strands-evals report or written via -o. Concretely, when --display is not passed:

  • --json (or stdout is a pipe) → flattened report JSON on stdout.
  • TTY with no --json/--rich → silent on stdout (pass --display to see results).
  • -o PATH → JSON written to the file; nothing on stdout.

For a one-off check with no experiment file:

# Substring match against the agent's response
strands-evals run \
  --input "What is the capital of France?" \
  --expected-output "Paris" \
  --agent my_pkg.agents:build_agent

# LLM-as-judge with a rubric
strands-evals run \
  --input "Explain recursion in one paragraph." \
  --rubric "Score 1.0 if accurate and one paragraph. Score 0.0 otherwise." \
  --agent my_pkg.agents:build_agent

# Built-in shortname evaluator
strands-evals run \
  --input "Is 17 prime?" \
  --evaluator helpfulness \
  --agent my_pkg.agents:build_agent

Auto-wiring rules in ad-hoc mode:

  • --expected-output TEXT (without --evaluator) → Contains(value=TEXT).
  • --rubric TEXT (without --evaluator) → OutputEvaluator(rubric=TEXT).
  • --expected-output and --rubric compose — both auto-evaluators are appended.
  • An explicit --evaluator disables the auto-wiring; pass it again to add more (--evaluator is repeatable).
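The auto-wiring rules above amount to a small decision function. The sketch below models them with plain (name, kwargs) tuples instead of real evaluator objects, so it runs without the library. The function name select_evaluators, the tuple representation, and the ordering when both flags are given are assumptions made for illustration.

```python
def select_evaluators(evaluators=None, expected_output=None, rubric=None):
    """Return evaluator descriptors for an ad-hoc run (hypothetical helper).

    Each descriptor is a (name, kwargs) tuple standing in for a real evaluator.
    """
    # An explicit --evaluator (repeatable) disables auto-wiring entirely.
    if evaluators:
        return [(name, {}) for name in evaluators]

    wired = []
    if expected_output is not None:
        # --expected-output TEXT -> Contains(value=TEXT)
        wired.append(("Contains", {"value": expected_output}))
    if rubric is not None:
        # --rubric TEXT -> OutputEvaluator(rubric=TEXT)
        wired.append(("OutputEvaluator", {"rubric": rubric}))

    if not wired:
        raise ValueError(
            "ad-hoc mode needs --evaluator, --expected-output, or --rubric"
        )
    return wired
```

With both flags set and no explicit evaluator, the result holds two descriptors; passing evaluators=["helpfulness"] returns only that one, regardless of the other two arguments.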

--evaluator accepts either a built-in shortname or MODULE:CLASS for a custom Evaluator subclass. Built-in shortnames instantiate with no arguments; richer config (custom rubrics, judge models, target tool names) belongs in an experiment file.

Built-in shortnames: coherence, conciseness, correctness, equals, faithfulness, goal-success-rate, harmfulness, helpfulness, instruction-following, refusal, response-relevance, stereotyping, tool-parameter-accuracy, tool-selection-accuracy.

strands-evals run experiments/regression.json \
  --agent my_pkg.agents:build_agent \
  --max-workers 8 \
  --data-store ./.cache/regression \
  --fail-on threshold:0.8 \
  -o reports/regression.json

  • --max-workers controls parallelism for run_evaluations_async (default 1).
  • --data-store DIR enables LocalFileTaskResultStore so cached task outputs short-circuit reruns. See Result Caching for details.
  • --fail-on chooses the exit-code rule: any (the default: exit non-zero on any case failure), none (always exit 0 on completion), or threshold:0.X (exit non-zero when the report’s overall score falls below the threshold).
  • --exit-zero overrides --fail-on and always returns 0. Useful when you want to record the report without breaking the build.
  • -o PATH writes the flattened report JSON to a file. Without -o, the JSON goes to stdout when --json is passed or stdout is a pipe.

Exit codes:

| Code | Meaning |
| --- | --- |
| 0 | Success (all cases passed, or --fail-on=none / --exit-zero). |
| 1 | Evaluation failures triggered by --fail-on. |
| 2 | Bad input (invalid flags, missing entry point, schema error). |
| 3 | Unexpected runtime error. |
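The --fail-on and --exit-zero rules reduce to a few comparisons. The following sketch shows how exit codes 0 and 1 could be derived from a finished run; it is not the CLI's implementation. The function name evaluation_exit_code and its parameters are hypothetical, and a malformed rule raises ValueError here, which a real command-line front end would presumably map to exit code 2.

```python
from typing import Sequence


def evaluation_exit_code(
    fail_on: str,
    case_passed: Sequence[bool],
    overall_score: float,
    exit_zero: bool = False,
) -> int:
    """Map a finished run to exit code 0 or 1 (hypothetical helper)."""
    # --exit-zero overrides every --fail-on rule.
    if exit_zero or fail_on == "none":
        return 0
    if fail_on == "any":
        # Non-zero as soon as a single case failed.
        return 0 if all(case_passed) else 1
    if fail_on.startswith("threshold:"):
        # float() raises ValueError on a malformed threshold.
        threshold = float(fail_on.split(":", 1)[1])
        return 1 if overall_score < threshold else 0
    raise ValueError(f"unknown --fail-on rule: {fail_on!r}")
```

Note that with threshold:0.8 a run can exit 0 even though individual cases failed, as long as the overall score is at least 0.8.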

Combine run with on-failure diagnosis to capture root causes alongside scores:

strands-evals run experiments/regression.json \
  --agent my_pkg.agents:build_agent \
  --diagnose on_failure \
  --confidence medium \
  --display

--diagnose accepts on_failure or always. Diagnosis requires Session trajectories, which only --agent produces. With --display, recommendations render in the Rich table.

strands-evals run experiments/regression.json \
  --agent my_pkg.agents:build_agent \
  --trace-attributes service.name=billing \
  --trace-attributes deployment.env=staging \
  --custom-evaluator my_pkg.evaluators:DomainSafetyEvaluator

  • --trace-attributes KEY=VALUE is repeatable. The pairs are set as W3C Baggage on the per-case context and stamped on every span the agent emits. session.id and gen_ai.conversation.id are always set from the case; --trace-attributes is for additional keys. No-op when --task is used.
  • --custom-evaluator MODULE:CLASS registers a custom Evaluator subclass before Experiment.from_file so the deserializer can rehydrate it. Repeatable. Ignored in ad-hoc mode (pass MODULE:CLASS directly to --evaluator instead).
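A repeatable KEY=VALUE flag of this kind is commonly collected with argparse's append action and then split into a dictionary. The sketch below illustrates that pattern under stated assumptions: parse_trace_attributes is a hypothetical helper, splitting on the first = (so values may contain = themselves) is a choice made here, and the parser is a stand-in rather than the CLI's real one.

```python
import argparse


def parse_trace_attributes(pairs):
    """Turn ['k=v', ...] into a dict, splitting on the first '=' (hypothetical)."""
    attributes = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        attributes[key] = value
    return attributes


# Stand-in parser: action="append" collects every occurrence of the flag.
parser = argparse.ArgumentParser(prog="example")
parser.add_argument(
    "--trace-attributes", action="append", default=[], metavar="KEY=VALUE"
)

args = parser.parse_args(
    [
        "--trace-attributes", "service.name=billing",
        "--trace-attributes", "deployment.env=staging",
    ]
)
attributes = parse_trace_attributes(args.trace_attributes)
```

After this runs, attributes maps service.name to billing and deployment.env to staging, which is the shape a caller would then attach to the per-case context.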
strands-evals validate experiments/customer_service.json
# valid: 12 case(s), 3 evaluator(s) [OutputEvaluator, TrajectoryEvaluator, HelpfulnessEvaluator]

validate loads the file via Experiment.from_file and reports case + evaluator counts. It exits non-zero on schema or I/O errors, making it a fast CI gate before run. Use --custom-evaluator MODULE:CLASS (repeatable) when the experiment references custom evaluators.

# Static Rich rendering on stdout (Rich on a TTY, JSON when piped — pass --rich to force)
strands-evals report reports/regression.json --rich
# Interactive Rich table (expand/collapse rows)
strands-evals report reports/regression.json --interactive
# Include diagnosis recommendations
strands-evals report reports/regression.json --recommendations
# Re-emit as JSON
strands-evals report reports/regression.json --json

report accepts - to read from stdin, so it composes with run:

strands-evals run experiments/regression.json --agent my_pkg.agents:build_agent \
  | strands-evals report - --recommendations

-o PATH always writes JSON regardless of --interactive/--rich, so you can pipe through report to persist a stable on-disk format.

diagnose — Detect failures and analyze root causes


diagnose operates on a serialized Session (the same Session object trace-based evaluators consume). Three modes:

# Full pipeline: detect failures and analyze root causes
strands-evals diagnose session.json --confidence medium

# Detection only
strands-evals diagnose session.json --detect-only --confidence high

# Root cause analysis only
strands-evals diagnose session.json --rca-only

# Read from stdin, write JSON to a file
cat session.json | strands-evals diagnose - --output diagnosis.json

  • --confidence is the minimum confidence threshold for failure detection (low | medium | high, default low).
  • --model MODEL_ID overrides the judge model used for detection and RCA.
  • --detect-only and --rca-only are mutually exclusive; omit both for the full pipeline.
  • A one-line summary is always written to stderr (diagnosis: N failure(s), M root cause(s)), so the command is scriptable even when the rich output goes to a TTY.
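Because the summary goes to stderr in a fixed shape, a wrapper script can extract the two counts with a regular expression. The sketch below assumes the line matches the literal format quoted above; parse_diagnosis_summary and run_diagnose are hypothetical helper names, and run_diagnose is defined but not called so the example does not require the CLI to be installed.

```python
import re
import subprocess

# Assumes the literal format quoted in the text above.
SUMMARY_RE = re.compile(r"diagnosis: (\d+) failure\(s\), (\d+) root cause\(s\)")


def parse_diagnosis_summary(stderr_text):
    """Return (failures, root_causes), or None if no summary line is present."""
    match = SUMMARY_RE.search(stderr_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def run_diagnose(session_path):
    """Invoke the CLI and parse its stderr summary (not called in this sketch)."""
    completed = subprocess.run(
        ["strands-evals", "diagnose", session_path, "--json"],
        capture_output=True,
        text=True,
    )
    return parse_diagnosis_summary(completed.stderr)
```

Searching rather than matching from the start of the text keeps the parser tolerant of log lines that -v or --debug may write to stderr before the summary.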

See Detectors for the underlying API.

generate wraps ExperimentGenerator to produce a starter experiment from either a free-form context or an existing experiment file. The two source flags are mutually exclusive.

strands-evals generate \
  --context "$(cat tools.txt)" \
  --num-cases 10 \
  --evaluator TrajectoryEvaluator \
  --task-description "Calculation and time-aware assistant" \
  --num-topics 3 \
  -o experiments/generated.json

  • --context accepts free-form text. Use shell substitution for file contents.
  • --num-cases (default 5) is the number of test cases to generate.
  • --evaluator (context mode only) attaches a default evaluator with a generated rubric. Choices: OutputEvaluator, TrajectoryEvaluator, InteractionsEvaluator. Omit to produce an experiment with a placeholder Evaluator.
  • --num-topics (context mode only) splits generation across N topic-specific prompts for diverse coverage.

strands-evals generate \
  --experiment experiments/baseline.json \
  --num-cases 20 \
  --extra-information "Focus on edge cases involving timezone handling." \
  -o experiments/expanded.json

  • New cases are inspired by the source; evaluators are inherited from the source’s defaults (so --evaluator and --num-topics are rejected).
  • --custom-evaluator MODULE:CLASS (experiment mode only, repeatable) registers custom evaluators before loading the source.
  • --extra-information (experiment mode only) is extra context for the new cases and rubric.

--model MODEL_ID overrides the judge model used by the generator. With -o, the experiment is written via Experiment.to_file (a .json extension is enforced). Without -o, the JSON document is written to stdout. A one-line summary on stderr reports the case and evaluator counts.

See Experiment Generator for the underlying API.

Every subcommand accepts the same global flags from the parent parser:

| Flag | Purpose |
| --- | --- |
| --json | Emit machine-readable JSON to stdout. |
| --rich | Emit Rich-rendered output to stdout. Default when stdout is a TTY. |
| -v, --verbose | Increase log verbosity. Repeat (-vv) for DEBUG. |
| --debug | DEBUG logging plus full tracebacks on errors. |

--json and --rich are mutually exclusive; without either, the format is auto-detected from whether stdout is a TTY.
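The auto-detection described above can be approximated with a single isatty() check. This is a minimal sketch of the rule as stated, not the CLI's code; choose_format and its parameters are hypothetical names, and the stream parameter exists only so the behavior can be exercised without a real terminal.

```python
import sys


def choose_format(json_flag=False, rich_flag=False, stream=None):
    """Pick 'json' or 'rich' following the global-flag rules (hypothetical)."""
    if json_flag and rich_flag:
        raise ValueError("--json and --rich are mutually exclusive")
    if json_flag:
        return "json"
    if rich_flag:
        return "rich"
    # No explicit flag: Rich on a TTY, JSON when stdout is piped or redirected.
    stream = sys.stdout if stream is None else stream
    return "rich" if stream.isatty() else "json"
```

Passing an in-memory stream such as io.StringIO() yields "json", since such a stream reports that it is not a TTY. As noted earlier, run is the exception and does not apply this rule to its primary output.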

A typical CI flow combines validate (fast schema gate) with run (the actual evaluation):

# .github/workflows/evals.yml (excerpt)
- name: Validate experiments
  run: strands-evals validate experiments/regression.json
- name: Run evaluations
  run: |
    strands-evals run experiments/regression.json \
      --agent my_pkg.agents:build_agent \
      --max-workers 8 \
      --data-store ./.cache/regression \
      --fail-on threshold:0.85 \
      -o regression-report.json
- name: Upload report
  if: always()
  uses: actions/upload-artifact@v4
  with:
    name: eval-report
    path: regression-report.json

validate exits non-zero on schema errors before any agent calls, and run exits non-zero on evaluation failures via --fail-on. The cached results from --data-store make reruns cheap when only the evaluators or the agent change.

  • Task Decorator — the Python equivalent of --agent’s synthesized task wrapper, for use in scripts.
  • Result Caching — what --data-store writes and how cache hits work.
  • Serialization — the on-disk shapes consumed by validate, report, and generate --experiment.
  • Experiment Generator — the API behind strands-evals generate.
  • Detectors — the API behind strands-evals diagnose.