SGLang¶
Community Contribution
This is a community-maintained package that is not owned or supported by the Strands team. Validate and review the package before using it in your project.
Have your own integration? We'd love to add it here too!
Language Support
This provider is only supported in Python.
strands-sglang is an SGLang model provider for the Strands Agents SDK with Token-In/Token-Out (TITO) support for agentic RL training. It integrates directly with SGLang servers through the native `/generate` endpoint and is optimized for reinforcement learning workflows.
Features:
- SGLang Native API: Uses SGLang's native `/generate` endpoint with non-streaming POST for optimal parallelism
- TITO Support: Tracks complete token trajectories with logprobs for RL training, with no retokenization drift (see the sketch after this list)
- Tool Call Parsing: Customizable tool parsing aligned with model chat templates (Hermes/Qwen format)
- Iteration Limiting: Built-in hook to limit tool iterations with clean trajectory truncation
- RL Training Optimized: Connection pooling, aggressive retry (60 attempts), and non-streaming design aligned with Slime's `http_utils.py`
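To make TITO concrete, the sketch below shows the shape of the trajectory the token manager accumulates. The values are invented, and the mask convention (1 = model-generated token, 0 = prompt or tool-result token) is an assumption inferred from the Slime example later on this page, not a documented guarantee.

```python
# Illustrative values only -- not real model output.
# Assumed convention: loss_mask is 1 for model-generated tokens (trainable)
# and 0 for tokens injected from the prompt or tool results.
token_ids = [151644, 8948, 271, 3838, 374, 220, 17, 20, 9, 16, 22, 30]
loss_mask = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
logprobs  = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -0.42, -0.01]

# The three lists stay per-token aligned, which is why no retokenization is needed.
assert len(token_ids) == len(loss_mask) == len(logprobs)
```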
Installation¶
Install strands-sglang along with the Strands Agents SDK:
pip install strands-sglang strands-agents-tools
Requirements¶
- SGLang server running with your model
- HuggingFace tokenizer for the model
Usage¶
1. Start SGLang Server¶
First, start an SGLang server with your model:
python -m sglang.launch_server \
--model-path Qwen/Qwen3-4B-Instruct-2507 \
--port 30000 \
--host 0.0.0.0
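Before pointing an agent at the server, you can verify it is up. A minimal probe in Python, using the same `/health` endpoint shown in the Troubleshooting section below:

```python
import requests

# Probe the SGLang server's health endpoint (equivalent to `curl .../health`).
resp = requests.get("http://localhost:30000/health", timeout=5)
resp.raise_for_status()
print("SGLang server is up")
```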
2. Basic Agent¶
import asyncio
from transformers import AutoTokenizer
from strands import Agent
from strands_tools import calculator
from strands_sglang import SGLangModel
async def main():
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")
    model = SGLangModel(tokenizer=tokenizer, base_url="http://localhost:30000")
    agent = Agent(model=model, tools=[calculator])

    model.reset()  # Reset TITO state for new episode
    result = await agent.invoke_async("What is 25 * 17?")
    print(result)

    # Access TITO data for RL training
    print(f"Tokens: {model.token_manager.token_ids}")
    print(f"Loss mask: {model.token_manager.loss_mask}")
    print(f"Logprobs: {model.token_manager.logprobs}")

asyncio.run(main())
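Continuing the example above (inside `main()`), a hedged sketch of slicing out just the model-generated tokens with the loss mask; this post-processing is illustrative and not part of the strands-sglang API:

```python
# Hypothetical post-processing: keep only tokens the model generated,
# assuming loss_mask is 1 for generated tokens and 0 otherwise.
tm = model.token_manager
generated = [tok for tok, m in zip(tm.token_ids, tm.loss_mask) if m == 1]
print(tokenizer.decode(generated))
```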
3. Slime RL Training¶
For RL training with Slime, SGLangModel's TITO support eliminates the retokenization step:
import logging

from strands import Agent, tool
from strands_sglang import SGLangClient, SGLangModel, ToolIterationLimiter
from slime.utils.types import Sample

logger = logging.getLogger(__name__)

SYSTEM_PROMPT = "..."
MAX_TOOL_ITERATIONS = 5

_client_cache: dict[str, SGLangClient] = {}


def get_client(args) -> SGLangClient:
    """Get shared client for connection pooling (like Slime)."""
    base_url = f"http://{args.sglang_router_ip}:{args.sglang_router_port}"
    if base_url not in _client_cache:
        _client_cache[base_url] = SGLangClient.from_slime_args(args)
    return _client_cache[base_url]


@tool
def execute_python_code(code: str):
    """Execute Python code and return the output."""
    ...


async def generate(args, sample: Sample, sampling_params) -> Sample:
    """Generate with TITO: tokens captured during generation, no retokenization."""
    assert not args.partial_rollout, "Partial rollout not supported."
    state = GenerateState(args)  # GenerateState is provided by Slime's rollout utilities

    # Set up Agent with SGLangModel and ToolIterationLimiter hook
    model = SGLangModel(
        tokenizer=state.tokenizer,
        client=get_client(args),
        model_id=args.hf_checkpoint.split("/")[-1],
        params={k: sampling_params[k] for k in ["max_new_tokens", "temperature", "top_p"]},
    )
    limiter = ToolIterationLimiter(max_iterations=MAX_TOOL_ITERATIONS)
    agent = Agent(
        model=model,
        tools=[execute_python_code],
        hooks=[limiter],
        callback_handler=None,
        system_prompt=SYSTEM_PROMPT,
    )

    # Run Agent Loop
    prompt = sample.prompt if isinstance(sample.prompt, str) else sample.prompt[0]["content"]
    try:
        await agent.invoke_async(prompt)
        sample.status = Sample.Status.COMPLETED
    except Exception as e:
        # Always use TRUNCATED instead of ABORTED because Slime doesn't properly
        # handle ABORTED samples in reward processing. See: https://github.com/THUDM/slime/issues/200
        sample.status = Sample.Status.TRUNCATED
        logger.warning(f"TRUNCATED: {type(e).__name__}: {e}")

    # TITO: extract trajectory from token_manager
    tm = model.token_manager
    prompt_len = len(tm.segments[0])  # system + user are first segment
    sample.tokens = tm.token_ids
    sample.loss_mask = tm.loss_mask[prompt_len:]
    sample.rollout_log_probs = tm.logprobs[prompt_len:]
    sample.response_length = len(sample.tokens) - prompt_len
    sample.response = model.tokenizer.decode(sample.tokens[prompt_len:], skip_special_tokens=False)

    # Cleanup and return
    model.reset()
    agent.cleanup()
    return sample
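A few sanity checks you might run on the returned sample before handing it to the trainer; these invariants follow from the slicing above and are an assumption, not part of the library:

```python
# Hypothetical invariants implied by the extraction above (not library-provided):
# the masked and logprob views should cover exactly the response portion.
assert len(sample.loss_mask) == sample.response_length
assert len(sample.rollout_log_probs) == sample.response_length
assert sample.response_length == len(sample.tokens) - prompt_len
```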
Configuration¶
Model Configuration¶
The SGLangModel accepts the following parameters:
| Parameter | Description | Example | Required |
|---|---|---|---|
| `tokenizer` | HuggingFace tokenizer instance | `AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")` | Yes |
| `base_url` | SGLang server URL | `"http://localhost:30000"` | Yes (or `client`) |
| `client` | Pre-configured `SGLangClient` | `SGLangClient.from_slime_args(args)` | Yes (or `base_url`) |
| `model_id` | Model identifier for logging | `"Qwen3-4B-Instruct-2507"` | No |
| `params` | Generation parameters | `{"max_new_tokens": 2048, "temperature": 0.7}` | No |
| `enable_thinking` | Enable thinking mode for Qwen3 hybrid models | `True` or `False` | No |
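Putting the table together, a sketch of a fully configured model; the parameter values are illustrative, and `enable_thinking` only applies to Qwen3 hybrid models:

```python
from transformers import AutoTokenizer
from strands_sglang import SGLangModel

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")
model = SGLangModel(
    tokenizer=tokenizer,
    base_url="http://localhost:30000",  # or pass client=SGLangClient(...) instead
    model_id="Qwen3-4B-Instruct-2507",  # optional, used for logging
    params={"max_new_tokens": 2048, "temperature": 0.7},
    enable_thinking=False,  # optional, Qwen3 hybrid models only
)
```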
Client Configuration¶
For RL training, use a centralized SGLangClient with connection pooling:
from strands_sglang import SGLangClient, SGLangModel
# Option 1: Direct configuration
client = SGLangClient(
    base_url="http://localhost:30000",
    max_connections=1000,  # Default: 1000
    timeout=None,          # Default: None (infinite, like Slime)
    max_retries=60,        # Default: 60 (aggressive retry for RL stability)
    retry_delay=1.0,       # Default: 1.0 seconds
)

# Option 2: Adapt from Slime's training args
client = SGLangClient.from_slime_args(args)

model = SGLangModel(tokenizer=tokenizer, client=client)
| Parameter | Description | Default |
|---|---|---|
| `base_url` | SGLang server URL | Required |
| `max_connections` | Maximum concurrent connections | `1000` |
| `timeout` | Request timeout (`None` = infinite) | `None` |
| `max_retries` | Retry attempts on transient errors | `60` |
| `retry_delay` | Delay between retries (seconds) | `1.0` |
Troubleshooting¶
Connection errors to SGLang server¶
Ensure your SGLang server is running and accessible:
# Check if server is responding
curl http://localhost:30000/health
Token trajectory mismatch¶
If TITO data doesn't match expected output, ensure you call model.reset() before each new episode to clear the token manager state.
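For example, a minimal episode loop continuing the Basic Agent example (inside an async function); the prompts are placeholders:

```python
# Sketch: reset TITO state before every episode so trajectories never mix.
for prompt in ["What is 25 * 17?", "What is 9 ** 3?"]:
    model.reset()
    await agent.invoke_async(prompt)
    trajectory = model.token_manager.token_ids  # tokens for this episode only
```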