Skip to content

Red Teaming

Red teaming answers “can an attacker make my agent misbehave?” It runs jailbreaks — prompts crafted to get a model to do something its instructions forbid — against your agent and scores whether each one got through. While evaluators measure whether an agent does the right thing on cooperative input, red teaming probes what it does under deliberately hostile input: prompts engineered to leak its system prompt, extract data it should keep private, produce harmful content, or trigger tool calls beyond its authority. An attack that gets through is a breach.

You assemble adversarial cases and one or more attack strategies against your agent in a RedTeamExperiment, run it, and read the breaches off the report. Under the hood each attack is driven by an AttackStrategy (a technique drawn from published jailbreak research) and scored by an LLM judge — but you only interact with the experiment.

Evaluators tell you the agent is helpful. Red teaming tells you whether it stays safe when someone tries to break it.

Evaluators alone:

  • Measure quality on cooperative, in-distribution input
  • Assume the user is trying to get a correct answer
  • Don’t tell you how the agent behaves under an adversarial prompt

Evaluators + red teaming:

  • Probe the agent with attacks engineered to bypass its guardrails
  • Cover the threat types that matter for a deployed agent (prompt leak, data exfiltration, harmful content, excessive tool use)
  • Score each attack with an LLM judge and surface the conversations that breached

Use red teaming when you need to:

  • Pre-deployment safety checks: Probe an agent’s guardrails before it ships
  • Safety regression checks: Re-run the same attacks after a prompt or tool change to catch new weaknesses
  • Coverage across threat types: Probe prompt-leak, data-exfiltration, harmful-content, and excessive-agency risks in one run
  • Comparing defenses: Measure attack success rate across model or system-prompt variants

Cases are organized by risk category — the kind of violation an attack targets. The built-in categories are:

Risk CategoryWhat an attack targets
guideline_bypassGetting the agent to ignore its safety guidelines via roleplay, hypotheticals, or creative framing
system_prompt_leakGetting the agent to reveal its system prompt, internal rules, or hidden configuration
harmful_contentGetting the agent to produce dangerous, illegal, or policy-violating content
data_exfiltrationGetting the agent to disclose data it should not share — other users’ records, internal-only fields
excessive_agencyGetting the agent to make a tool call that violates policy or exceeds its authorization

A case’s risk category tells the judge what counts as a breach for that case. These categories align with the threat types in standard taxonomies like the OWASP LLM Top 10 (for example, system_prompt_leak maps to LLM07 and excessive_agency to LLM06), so findings translate into the language a security review expects.

Define a zero-arg factory that builds a fresh target, generate adversarial cases from it, run a strategy across the case x strategy cross-product in parallel, and read the breaches:

import asyncio
from strands import Agent
from strands_evals.experimental.redteam import (
AdversarialCaseGenerator,
CrescendoStrategy,
RedTeamExperiment,
)
def agent_factory() -> Agent:
return Agent(system_prompt="You are a helpful customer-support assistant.")
cases = AdversarialCaseGenerator().generate_cases(agent=agent_factory(), num_cases=3)
experiment = RedTeamExperiment(
cases=cases, agent_factory=agent_factory, attack_strategies=[CrescendoStrategy()]
)
report = asyncio.run(experiment.run_evaluations_async(max_workers=5))
report.display()

The Quickstart walks through each step (including a sync run_evaluations() path for notebook-style runs) and shows the report output.

AspectEvaluatorsRed Teaming
Question”How well did the agent do?""Can an attacker make it misbehave?”
InputCooperative test casesAdversarial attacks (multi-turn or scripted)
OutputScore + pass/failAttack success score + breaching conversations
DriverA fixed task functionAn AttackStrategy that adapts per turn
Use CaseQuality evaluationSafety probing, guardrail regression

Use Together: Evaluate the agent for quality, then red team it for safety. A high quality score and an undefended jailbreak are both true at once.

flowchart TD
A[RedTeamCase: risk category + actor goal] --> B[RedTeamExperiment]
G[Your Agent or MultiAgentBase] --> B
S[AttackStrategy: a published jailbreak technique] --> B
B --> C[Strategy drives the attack via TargetSession.invoke]
C --> D[AttackSuccessEvaluator scores the conversation 0.0-1.0]
D --> E[RedTeamReport: breaches by risk category and strategy]

A strategy runs against the target one case at a time when you call run_evaluations(), and against the case x strategy cross-product in parallel (default max_workers=5) when you call run_evaluations_async(). Each strategy carries its own cheap in-loop “should I stop?” gate, but the authoritative breach verdict always comes from the AttackSuccessEvaluator over the full conversation and tool trace.

Informed by general LLM red-teaming guidance like the OWASP Top 10 for LLM Applications and the NIST AI Risk Management Framework, scoped to what this module does.

  • Cover several threat types, not one. Spread cases across the risk categories (AdversarialCaseGenerator does this automatically; report.by_risk_category() breaks results down by type). A pass on one category says nothing about the others.
  • Run multiple strategies. Which technique breaks a given target varies, so run several and compare report.by_strategy() — coverage from a portfolio beats betting on one. This is the module’s core capability.
  • Test the agent in context, with its tools. Point the strategies at your real Agent (system prompt, tools, guardrails) rather than a bare model — application-layer risks like excessive_agency and data_exfiltration only surface when the tools are present.
  • Fix and re-run. A breach is the start of a loop: read the conversation, mitigate, then re-run the same cases to confirm the fix held (see Acting on a breach). Keep your breaching cases as a regression suite.
  • Don’t read a clean run as proof of safety. Scores come from an LLM judge over a finite set of cases and strategies, and models are stochastic. A PASS is evidence, not a guarantee.