Tag: Evaluation

Name: Strands Agents SDK
Author: Strands Agents

6 posts

June 18, 2026

Reduced cost, better isolation, and more resilience: Strands Agents evolves next-gen capabilities

Strands Agents ships context management that cuts costs in half, Strands Shell for sandboxed agent execution, and chaos testing and red teaming in Strands Evals 1.0.

Ryan Coleman, Strands Agents Team

May 20, 2026

Multimodal evaluators: MLLM-as-a-judge for image-to-text tasks in Strands Evals

Announcing four new MLLM-as-a-Judge evaluators for image-to-text tasks in Strands Evals: Overall Quality, Correctness, Faithfulness, and Instruction Following. Each provides automated, image-grounded scoring with reasoning.

Sangmin Woo, Haibo Ding, Sungyeon Kim, Vinayak Arannil

April 20, 2026

ToolSimulator: scalable tool testing for AI agents

ToolSimulator is an LLM-powered framework within Strands Evals that enables safe, scalable agent testing by using simulated tool responses instead of risky live API calls.

Darren Wang, Smeet Dhakecha, Vinayak Arannil, Xuan Qi

April 2, 2026

Simulate realistic users to evaluate multi-turn AI agents in Strands Evals

ActorSimulator in the Strands Evals SDK enables teams to test conversational agents through realistic, goal-driven simulated users rather than relying on static test cases or manual testing.

Ishan Singh, Abhishek Kumar, Jonathan Buck, Vinayak Arannil

March 18, 2026

Evaluating AI agents for production: A practical guide to Strands Evals

Learn how to systematically evaluate AI agents using Strands Evals, covering core concepts like cases, experiments, evaluators, multi-turn simulation capabilities, and practical integration patterns for production deployment.

Ishan Singh, Akarsha Sehwag, Jonathan Buck, Po-Shin Chen, Smeet Dhakecha

March 18, 2026

How Steering Hooks Achieved 100% Agent Accuracy Where Prompts and Workflows Failed

Steering hooks achieved 100% accuracy across 600 evaluation runs, compared to 82.5% for simple prompt-based instructions and 80.8% for graph-based workflows.

Clare Liguori