Tag: Evaluation
6 posts
Reduced cost, better isolation, and more resilience: Strands Agents evolves next-gen capabilities
Strands Agents ships context management that cuts costs in half, Strands Shell for sandboxed agent execution, and chaos testing and red teaming in Strands Evals 1.0.
Multimodal evaluators: MLLM-as-a-judge for image-to-text tasks in Strands Evals
Announcing four new MLLM-as-a-Judge evaluators for image-to-text tasks in Strands Evals: Overall Quality, Correctness, Faithfulness, and Instruction Following. Each provides automated, image-grounded scoring with reasoning.
ToolSimulator: scalable tool testing for AI agents
ToolSimulator is an LLM-powered framework within Strands Evals that enables safe, scalable agent testing by using simulated tool responses instead of risky live API calls.
Simulate realistic users to evaluate multi-turn AI agents in Strands Evals
ActorSimulator in the Strands Evals SDK enables teams to test conversational agents through realistic, goal-driven simulated users rather than relying on static test cases or manual testing.
Evaluating AI agents for production: A practical guide to Strands Evals
Learn how to systematically evaluate AI agents using Strands Evals, covering core concepts such as cases, experiments, and evaluators, along with multi-turn simulation capabilities and practical integration patterns for production deployment.
How Steering Hooks Achieved 100% Agent Accuracy Where Prompts and Workflows Failed
Steering hooks achieved a 100% pass rate across 600 evaluation runs, compared with 82.5% for simple prompt-based instructions and 80.8% for graph-based workflows.