
SGLang

Community Contribution

This is a community-maintained package that is not owned or supported by the Strands team. Validate and review the package before using it in your project.

Have your own integration? We'd love to add it here too!

Language Support

This provider is only supported in Python.

strands-sglang is an SGLang model provider for Strands Agents SDK with Token-In/Token-Out (TITO) support for agentic RL training. It provides direct integration with SGLang servers using the native /generate endpoint, optimized for reinforcement learning workflows.

Features:

  • SGLang Native API: Uses SGLang's native /generate endpoint with non-streaming POST for optimal parallelism
  • TITO Support: Tracks complete token trajectories with logprobs for RL training - no retokenization drift
  • Tool Call Parsing: Customizable tool parsing aligned with model chat templates (Hermes/Qwen format)
  • Iteration Limiting: Built-in hook to limit tool iterations with clean trajectory truncation (see the sketch after this list)
  • RL Training Optimized: Connection pooling, aggressive retry (60 attempts), and non-streaming design aligned with Slime's http_utils.py
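
A minimal sketch of the iteration-limiting hook, reusing the model setup shown in the Usage section below (the SGLangModel and ToolIterationLimiter signatures follow the examples on this page):

from transformers import AutoTokenizer
from strands import Agent
from strands_tools import calculator
from strands_sglang import SGLangModel, ToolIterationLimiter

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")
model = SGLangModel(tokenizer=tokenizer, base_url="http://localhost:30000")

# Stop the agent after 3 tool iterations; the trajectory is truncated cleanly.
limiter = ToolIterationLimiter(max_iterations=3)
agent = Agent(model=model, tools=[calculator], hooks=[limiter])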

Installation

Install strands-sglang along with the strands-agents-tools package, which provides the example tools used below:

pip install strands-sglang strands-agents-tools
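
To confirm the installation:

pip show strands-sglang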

Requirements

  • SGLang server running with your model
  • HuggingFace tokenizer for the model

Usage

1. Start SGLang Server

First, start an SGLang server with your model:

python -m sglang.launch_server \
    --model-path Qwen/Qwen3-4B-Instruct-2507 \
    --port 30000 \
    --host 0.0.0.0
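
Optionally, smoke-test the native /generate endpoint that the provider calls (the prompt and sampling_params below are illustrative):

curl -X POST http://localhost:30000/generate \
    -H "Content-Type: application/json" \
    -d '{"text": "The capital of France is", "sampling_params": {"max_new_tokens": 16, "temperature": 0}}'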

2. Basic Agent

import asyncio
from transformers import AutoTokenizer
from strands import Agent
from strands_tools import calculator
from strands_sglang import SGLangModel

async def main():
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")
    model = SGLangModel(tokenizer=tokenizer, base_url="http://localhost:30000")
    agent = Agent(model=model, tools=[calculator])

    model.reset()  # Reset TITO state for new episode
    result = await agent.invoke_async("What is 25 * 17?")
    print(result)

    # Access TITO data for RL training
    print(f"Tokens: {model.token_manager.token_ids}")
    print(f"Loss mask: {model.token_manager.loss_mask}")
    print(f"Logprobs: {model.token_manager.logprobs}")

asyncio.run(main())
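
Depending on your training stack, the TITO fields can then be packed into tensors. A minimal sketch with PyTorch, assuming loss_mask marks model-generated tokens with 1 and prompt/tool tokens with 0 (verify against your strands-sglang version):

import torch

# Continues from main() above: token_ids, loss_mask, and logprobs are
# assumed to cover the full trajectory and share the same length.
tokens = torch.tensor(model.token_manager.token_ids)
loss_mask = torch.tensor(model.token_manager.loss_mask, dtype=torch.bool)
logprobs = torch.tensor(model.token_manager.logprobs)

# Only model-generated tokens should contribute to a policy-gradient loss.
rollout_logprobs = logprobs[loss_mask]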

3. Slime RL Training

For RL training with Slime, SGLangModel with TITO eliminates the retokenization step: the token ids sampled during generation are reused directly, so the training tokens cannot drift from the rollout tokens:

import logging

from strands import Agent, tool
from strands_sglang import SGLangClient, SGLangModel, ToolIterationLimiter
from slime.utils.types import Sample
# GenerateState ships with Slime's rollout utilities; adjust the import
# path to match your Slime version.
from slime.rollout.sglang_example import GenerateState

logger = logging.getLogger(__name__)

SYSTEM_PROMPT = "..."
MAX_TOOL_ITERATIONS = 5
_client_cache: dict[str, SGLangClient] = {}

def get_client(args) -> SGLangClient:
    """Get shared client for connection pooling (like Slime)."""
    base_url = f"http://{args.sglang_router_ip}:{args.sglang_router_port}"
    if base_url not in _client_cache:
        _client_cache[base_url] = SGLangClient.from_slime_args(args)
    return _client_cache[base_url]

@tool
def execute_python_code(code: str):
    """Execute Python code and return the output."""
    ...

async def generate(args, sample: Sample, sampling_params) -> Sample:
    """Generate with TITO: tokens captured during generation, no retokenization."""
    assert not args.partial_rollout, "Partial rollout not supported."

    state = GenerateState(args)

    # Set up Agent with SGLangModel and ToolIterationLimiter hook
    model = SGLangModel(
        tokenizer=state.tokenizer,
        client=get_client(args),
        model_id=args.hf_checkpoint.split("/")[-1],
        params={k: sampling_params[k] for k in ["max_new_tokens", "temperature", "top_p"]},
    )
    limiter = ToolIterationLimiter(max_iterations=MAX_TOOL_ITERATIONS)
    agent = Agent(
        model=model,
        tools=[execute_python_code],
        hooks=[limiter],
        callback_handler=None,
        system_prompt=SYSTEM_PROMPT,
    )

    # Run Agent Loop
    prompt = sample.prompt if isinstance(sample.prompt, str) else sample.prompt[0]["content"]
    try:
        await agent.invoke_async(prompt)
        sample.status = Sample.Status.COMPLETED
    except Exception as e:
        # Always use TRUNCATED instead of ABORTED because Slime doesn't properly
        # handle ABORTED samples in reward processing. See: https://github.com/THUDM/slime/issues/200
        sample.status = Sample.Status.TRUNCATED
        logger.warning(f"TRUNCATED: {type(e).__name__}: {e}")

    # TITO: extract trajectory from token_manager
    tm = model.token_manager
    prompt_len = len(tm.segments[0])  # system + user are first segment
    sample.tokens = tm.token_ids
    sample.loss_mask = tm.loss_mask[prompt_len:]
    sample.rollout_log_probs = tm.logprobs[prompt_len:]
    sample.response_length = len(sample.tokens) - prompt_len
    sample.response = model.tokenizer.decode(sample.tokens[prompt_len:], skip_special_tokens=False)

    # Cleanup and return
    model.reset()
    agent.cleanup()
    return sample

Configuration

Model Configuration

The SGLangModel accepts the following parameters:

| Parameter | Description | Example | Required |
| --- | --- | --- | --- |
| tokenizer | HuggingFace tokenizer instance | AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507") | Yes |
| base_url | SGLang server URL | "http://localhost:30000" | Yes (or client) |
| client | Pre-configured SGLangClient | SGLangClient.from_slime_args(args) | Yes (or base_url) |
| model_id | Model identifier for logging | "Qwen3-4B-Instruct-2507" | No |
| params | Generation parameters | {"max_new_tokens": 2048, "temperature": 0.7} | No |
| enable_thinking | Enable thinking mode for Qwen3 hybrid models | True or False | No |
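
A sketch combining the parameters above (values are illustrative):

from transformers import AutoTokenizer
from strands_sglang import SGLangModel

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")
model = SGLangModel(
    tokenizer=tokenizer,
    base_url="http://localhost:30000",
    model_id="Qwen3-4B-Instruct-2507",
    params={"max_new_tokens": 2048, "temperature": 0.7},
    enable_thinking=False,  # only relevant for Qwen3 hybrid thinking models
)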

Client Configuration

For RL training, use a centralized SGLangClient with connection pooling:

from strands_sglang import SGLangClient, SGLangModel

# Option 1: Direct configuration
client = SGLangClient(
    base_url="http://localhost:30000",
    max_connections=1000,  # Default: 1000
    timeout=None,          # Default: None (infinite, like Slime)
    max_retries=60,        # Default: 60 (aggressive retry for RL stability)
    retry_delay=1.0,       # Default: 1.0 seconds
)

# Option 2: Adaptive to Slime's training args
client = SGLangClient.from_slime_args(args)

model = SGLangModel(tokenizer=tokenizer, client=client)

| Parameter | Description | Default |
| --- | --- | --- |
| base_url | SGLang server URL | Required |
| max_connections | Maximum concurrent connections | 1000 |
| timeout | Request timeout (None = infinite) | None |
| max_retries | Retry attempts on transient errors | 60 |
| retry_delay | Delay between retries (seconds) | 1.0 |

Troubleshooting

Connection errors to SGLang server

Ensure your SGLang server is running and accessible:

# Check if server is responding
curl http://localhost:30000/health
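
If the health check passes, you can also confirm which model the server is serving via SGLang's native /get_model_info endpoint:

curl http://localhost:30000/get_model_info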

Token trajectory mismatch

If TITO data doesn't match expected output, ensure you call model.reset() before each new episode to clear the token manager state.
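
A minimal episode-loop sketch of this pattern, reusing the model and calculator tool from the Basic Agent example (a fresh Agent per episode also keeps conversation history from leaking between episodes):

async def run_episodes(prompts: list[str]) -> list[tuple]:
    episodes = []
    for prompt in prompts:
        model.reset()  # clear token_manager state from the previous episode
        agent = Agent(model=model, tools=[calculator])  # fresh conversation history
        await agent.invoke_async(prompt)
        tm = model.token_manager
        episodes.append((tm.token_ids, tm.loss_mask, tm.logprobs))
    return episodes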
