Tag: Evaluation

6 posts

Reduced cost, better isolation, and more resilience: Strands Agents evolves next-gen capabilities

Strands Agents ships context management that cuts costs in half, Strands Shell for sandboxed agent execution, and chaos testing and red teaming in Strands Evals 1.0.

Multimodal evaluators: MLLM-as-a-judge for image-to-text tasks in Strands Evals

Announcing four new MLLM-as-a-Judge evaluators for image-to-text tasks in Strands Evals (Overall Quality, Correctness, Faithfulness, and Instruction Following), which provide automated, image-grounded scoring with reasoning.

ToolSimulator: scalable tool testing for AI agents

ToolSimulator is an LLM-powered framework within Strands Evals that enables safe, scalable agent testing by using simulated tool responses instead of risky live API calls.

Simulate realistic users to evaluate multi-turn AI agents in Strands Evals

ActorSimulator in the Strands Evals SDK enables teams to test conversational agents through realistic, goal-driven simulated users rather than relying on static test cases or manual testing.

Evaluating AI agents for production: A practical guide to Strands Evals

Learn how to systematically evaluate AI agents using Strands Evals, covering core concepts like cases, experiments, evaluators, multi-turn simulation capabilities, and practical integration patterns for production deployment.

How Steering Hooks Achieved 100% Agent Accuracy Where Prompts and Workflows Failed

Steering hooks achieved a 100% pass rate across 600 evaluation runs, compared to 82.5% for simple prompt-based instructions and 80.8% for graph-based workflows.