Red Teaming
Overview
Section titled “Overview”Red teaming answers “can an attacker make my agent misbehave?” It runs jailbreaks — prompts crafted to get a model to do something its instructions forbid — against your agent and scores whether each one got through. While evaluators measure whether an agent does the right thing on cooperative input, red teaming probes what it does under deliberately hostile input: prompts engineered to leak its system prompt, extract data it should keep private, produce harmful content, or trigger tool calls beyond its authority. An attack that gets through is a breach.
You assemble adversarial cases and one or more attack strategies against your agent in a RedTeamExperiment, run it, and read the breaches off the report. Under the hood each attack is driven by an AttackStrategy (a technique drawn from published jailbreak research) and scored by an LLM judge — but you only interact with the experiment.
Why Red Teaming?
Section titled “Why Red Teaming?”Evaluators tell you the agent is helpful. Red teaming tells you whether it stays safe when someone tries to break it.
Evaluators alone:
- Measure quality on cooperative, in-distribution input
- Assume the user is trying to get a correct answer
- Don’t tell you how the agent behaves under an adversarial prompt
Evaluators + red teaming:
- Probe the agent with attacks engineered to bypass its guardrails
- Cover the threat types that matter for a deployed agent (prompt leak, data exfiltration, harmful content, excessive tool use)
- Score each attack with an LLM judge and surface the conversations that breached
When to Use Red Teaming
Section titled “When to Use Red Teaming”Use red teaming when you need to:
- Pre-deployment safety checks: Probe an agent’s guardrails before it ships
- Safety regression checks: Re-run the same attacks after a prompt or tool change to catch new weaknesses
- Coverage across threat types: Probe prompt-leak, data-exfiltration, harmful-content, and excessive-agency risks in one run
- Comparing defenses: Measure attack success rate across model or system-prompt variants
Risk Categories
Section titled “Risk Categories”Cases are organized by risk category — the kind of violation an attack targets. The built-in categories are:
| Risk Category | What an attack targets |
|---|---|
guideline_bypass | Getting the agent to ignore its safety guidelines via roleplay, hypotheticals, or creative framing |
system_prompt_leak | Getting the agent to reveal its system prompt, internal rules, or hidden configuration |
harmful_content | Getting the agent to produce dangerous, illegal, or policy-violating content |
data_exfiltration | Getting the agent to disclose data it should not share — other users’ records, internal-only fields |
excessive_agency | Getting the agent to make a tool call that violates policy or exceeds its authorization |
A case’s risk category tells the judge what counts as a breach for that case. These categories align with the threat types in standard taxonomies like the OWASP LLM Top 10 (for example, system_prompt_leak maps to LLM07 and excessive_agency to LLM06), so findings translate into the language a security review expects.
Quick Example
Section titled “Quick Example”Define a zero-arg factory that builds a fresh target, generate adversarial cases from it, run a strategy across the case x strategy cross-product in parallel, and read the breaches:
import asyncio
from strands import Agentfrom strands_evals.experimental.redteam import ( AdversarialCaseGenerator, CrescendoStrategy, RedTeamExperiment,)
def agent_factory() -> Agent: return Agent(system_prompt="You are a helpful customer-support assistant.")
cases = AdversarialCaseGenerator().generate_cases(agent=agent_factory(), num_cases=3)experiment = RedTeamExperiment( cases=cases, agent_factory=agent_factory, attack_strategies=[CrescendoStrategy()])report = asyncio.run(experiment.run_evaluations_async(max_workers=5))report.display()The Quickstart walks through each step (including a sync run_evaluations() path for notebook-style runs) and shows the report output.
Red Teaming vs Evaluators
Section titled “Red Teaming vs Evaluators”| Aspect | Evaluators | Red Teaming |
|---|---|---|
| Question | ”How well did the agent do?" | "Can an attacker make it misbehave?” |
| Input | Cooperative test cases | Adversarial attacks (multi-turn or scripted) |
| Output | Score + pass/fail | Attack success score + breaching conversations |
| Driver | A fixed task function | An AttackStrategy that adapts per turn |
| Use Case | Quality evaluation | Safety probing, guardrail regression |
Use Together: Evaluate the agent for quality, then red team it for safety. A high quality score and an undefended jailbreak are both true at once.
How It Works
Section titled “How It Works”flowchart TD A[RedTeamCase: risk category + actor goal] --> B[RedTeamExperiment] G[Your Agent or MultiAgentBase] --> B S[AttackStrategy: a published jailbreak technique] --> B B --> C[Strategy drives the attack via TargetSession.invoke] C --> D[AttackSuccessEvaluator scores the conversation 0.0-1.0] D --> E[RedTeamReport: breaches by risk category and strategy]A strategy runs against the target one case at a time when you call run_evaluations(), and against the case x strategy cross-product in parallel (default max_workers=5) when you call run_evaluations_async(). Each strategy carries its own cheap in-loop “should I stop?” gate, but the authoritative breach verdict always comes from the AttackSuccessEvaluator over the full conversation and tool trace.
Best Practices
Section titled “Best Practices”Informed by general LLM red-teaming guidance like the OWASP Top 10 for LLM Applications and the NIST AI Risk Management Framework, scoped to what this module does.
- Cover several threat types, not one. Spread cases across the risk categories (
AdversarialCaseGeneratordoes this automatically;report.by_risk_category()breaks results down by type). A pass on one category says nothing about the others. - Run multiple strategies. Which technique breaks a given target varies, so run several and compare
report.by_strategy()— coverage from a portfolio beats betting on one. This is the module’s core capability. - Test the agent in context, with its tools. Point the strategies at your real
Agent(system prompt, tools, guardrails) rather than a bare model — application-layer risks likeexcessive_agencyanddata_exfiltrationonly surface when the tools are present. - Fix and re-run. A breach is the start of a loop: read the conversation, mitigate, then re-run the same cases to confirm the fix held (see Acting on a breach). Keep your breaching cases as a regression suite.
- Don’t read a clean run as proof of safety. Scores come from an LLM judge over a finite set of cases and strategies, and models are stochastic. A
PASSis evidence, not a guarantee.
Next Steps
Section titled “Next Steps”- Quickstart: Run your first red-team experiment end to end
- Attack Strategies: The built-in strategies and how to choose
- Writing Custom Cases: Hand-author cases instead of generating them
- Scoring Attacks: How
AttackSuccessEvaluatordecides a breach - Reading the Report: Read the breach matrix and act on findings
Related Documentation
Section titled “Related Documentation”- Getting Started: Set up your first evaluation experiment
- Evaluators Overview: Score agent performance on cooperative input
- Harmfulness Evaluator: Score a single response for harmful content