# Evaluating Remote Traces
Trace providers fetch agent execution data from observability backends and convert it into the format the evaluation pipeline expects. This lets you run evaluators against traces from production or staging agents without re-running them.
## Available Providers

| Provider | Backend | Auth |
|---|---|---|
| `CloudWatchProvider` | AWS CloudWatch Logs (Bedrock AgentCore runtime logs) | AWS credentials (boto3) |
| `LangfuseProvider` | Langfuse | API keys |
## Installation

The CloudWatch provider works out of the box since boto3 is a core dependency:

```bash
pip install strands-agents-evals
```

For the Langfuse provider, install the optional `langfuse` extra:

```bash
pip install strands-agents-evals[langfuse]
```

## CloudWatch Provider

The `CloudWatchProvider` queries CloudWatch Logs Insights to retrieve OpenTelemetry log records from Bedrock AgentCore runtime log groups.
```python
from strands_evals.providers import CloudWatchProvider

# Option 1: Provide the log group directly
provider = CloudWatchProvider(
    log_group="/aws/bedrock-agentcore/runtimes/my-agent-abc123-DEFAULT",
    region="us-east-1",
)

# Option 2: Discover the log group from the agent name
provider = CloudWatchProvider(agent_name="my-agent", region="us-east-1")
```

You must provide either `log_group` or `agent_name`. When using `agent_name`, the provider calls `describe_log_groups` to find the runtime log group automatically.

The `region` parameter falls back to the `AWS_REGION` environment variable, then `AWS_DEFAULT_REGION`, then `us-east-1`.
### Configuration

| Parameter | Default | Description |
|---|---|---|
| `region` | `AWS_REGION` env var | AWS region for the CloudWatch client |
| `log_group` | — | Full CloudWatch log group path |
| `agent_name` | — | Agent name used to discover the log group |
| `lookback_days` | 30 | How many days back to search for traces |
| `query_timeout_seconds` | 60.0 | Maximum seconds to wait for a Logs Insights query |
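To illustrate what `lookback_days` controls: CloudWatch Logs Insights queries take start and end times as Unix epoch seconds, so a lookback window presumably translates into a time range like the one below. The helper name and exact mechanics are assumptions for illustration, not the provider's internals:

```python
from datetime import datetime, timedelta, timezone

def query_window(lookback_days: int = 30) -> tuple[int, int]:
    # Logs Insights start/end times are Unix epoch seconds.
    end = datetime.now(timezone.utc)
    start = end - timedelta(days=lookback_days)
    return int(start.timestamp()), int(end.timestamp())
```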
## Langfuse Provider

The `LangfuseProvider` fetches traces and observations via the Langfuse Python SDK, converting them to typed spans for evaluation.

```python
from strands_evals.providers import LangfuseProvider

# Reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY from env by default
provider = LangfuseProvider()

# Or pass credentials explicitly
provider = LangfuseProvider(
    public_key="pk-...",
    secret_key="sk-...",
    host="https://us.cloud.langfuse.com",
)
```

### Configuration

| Parameter | Default | Description |
|---|---|---|
| `public_key` | `LANGFUSE_PUBLIC_KEY` env var | Langfuse public API key |
| `secret_key` | `LANGFUSE_SECRET_KEY` env var | Langfuse secret API key |
| `host` | `LANGFUSE_HOST` env var or `https://us.cloud.langfuse.com` | Langfuse API host URL |
| `timeout` | 120 | Request timeout in seconds |
## Running Evaluations on Remote Traces

All providers implement the same `TraceProvider` interface with a single method:

```python
data = provider.get_evaluation_data(session_id="my-session-id")
# data.output -> str (final agent response)
# data.trajectory -> Session (traces and spans)
```

Pass the provider's data into the standard `Experiment` pipeline by wrapping it in a task function:

```python
from strands_evals import Case, Experiment
from strands_evals.evaluators import CoherenceEvaluator, OutputEvaluator
from strands_evals.providers import CloudWatchProvider

provider = CloudWatchProvider(log_group="/aws/...", region="us-east-1")

def task(case: Case) -> dict:
    return provider.get_evaluation_data(case.input)

cases = [
    Case(
        name="session_1",
        input="my-session-id",
        expected_output="any",
    ),
]

evaluators = [
    OutputEvaluator(
        rubric="Score 1.0 if the output is coherent. Score 0.0 otherwise."
    ),
    CoherenceEvaluator(),
]

experiment = Experiment(cases=cases, evaluators=evaluators)
reports = experiment.run_evaluations(task)

for report in reports:
    print(f"{report.overall_score:.2f} - {report.reasons}")
```

The same pattern works with `LangfuseProvider`; just swap the provider initialization.
## Error Handling

Providers raise specific exceptions when traces cannot be retrieved:

```python
from strands_evals.providers import SessionNotFoundError, ProviderError

try:
    data = provider.get_evaluation_data("unknown-session")
except SessionNotFoundError:
    print("No traces found for that session")
except ProviderError:
    print("Provider unreachable or query failed")
```

Both exceptions inherit from `TraceProviderError`, so you can catch that for a single handler:

```python
from strands_evals.providers import TraceProviderError

try:
    data = provider.get_evaluation_data(session_id)
except TraceProviderError as e:
    print(f"Failed to retrieve traces: {e}")
```

## Implementing a Custom Provider
Subclass `TraceProvider` and implement `get_evaluation_data` to integrate with any observability backend:

```python
from strands_evals.providers import TraceProvider
from strands_evals.types.evaluation import TaskOutput

class MyProvider(TraceProvider):
    def get_evaluation_data(self, session_id: str) -> TaskOutput:
        # 1. Fetch traces from your backend
        # 2. Convert to a Session object with Trace and Span types
        # 3. Extract the final agent response
        return TaskOutput(output="final response", trajectory=session)
```

The returned `TaskOutput` must contain:

- `output`: the final agent response text
- `trajectory`: a `Session` object containing `Trace` objects with typed spans (`AgentInvocationSpan`, `InferenceSpan`, `ToolExecutionSpan`)
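To make the fetch/convert/extract flow concrete, here is a self-contained sketch against a toy in-memory backend. The `Session` and `TaskOutput` stand-ins below are simplified dataclasses, not the real `strands_evals` types, and `InMemoryProvider` and its record format are hypothetical:

```python
from dataclasses import dataclass, field

# Stand-in types: the real Session and TaskOutput come from strands_evals
# and carry richer typed spans; these exist only to make the sketch runnable.
@dataclass
class Session:
    traces: list = field(default_factory=list)

@dataclass
class TaskOutput:
    output: str
    trajectory: Session

class SessionNotFoundError(Exception):
    pass

class InMemoryProvider:
    """Hypothetical provider over a dict of session_id -> raw trace records."""

    def __init__(self, backend: dict):
        self.backend = backend

    def get_evaluation_data(self, session_id: str) -> TaskOutput:
        # 1. Fetch traces from the backend
        records = self.backend.get(session_id)
        if not records:
            raise SessionNotFoundError(session_id)
        # 2. Convert them to the session/trace structure
        session = Session(traces=records)
        # 3. Extract the final agent response (here: the last record's text)
        return TaskOutput(output=records[-1]["text"], trajectory=session)

provider = InMemoryProvider({"s1": [{"text": "hi"}, {"text": "final answer"}]})
data = provider.get_evaluation_data("s1")
```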
## Related Documentation

- Getting Started: Set up your first evaluation experiment
- Output Evaluator: Evaluate agent response quality
- Trajectory Evaluator: Evaluate tool usage and execution paths
- Helpfulness Evaluator: Assess agent helpfulness