vLLM¶
Community Contribution
This is a community-maintained package that is not owned or supported by the Strands team. Validate and review the package before using it in your project.
Have your own integration? We'd love to add it here too!
Language Support
This provider is only supported in Python.
strands-vllm is a vLLM model provider for the Strands Agents SDK with Token-In/Token-Out (TITO) support for agentic RL training. It integrates with vLLM's OpenAI-compatible API and is optimized for reinforcement learning workflows with Agent Lightning.
Features:
- OpenAI-Compatible API: Uses vLLM's OpenAI-compatible `/v1/chat/completions` endpoint with streaming
- TITO Support: Captures `prompt_token_ids` and `token_ids` directly from vLLM, with no retokenization drift
- Tool Call Validation: Optional hooks for RL-friendly error messages (allowed tools list, schema validation)
- Agent Lightning Integration: Automatically adds token IDs to OpenTelemetry spans for RL training data extraction
- Streaming: Full streaming support with token ID capture via `VLLMTokenRecorder`
Why TITO?
Retokenizing generated text for training can cause drift: the same text may tokenize differently during training than it did during inference (e.g., "HAVING" emitted as H+AVING but retokenized as HAV+ING). TITO captures the exact tokens from vLLM, eliminating this issue. See No More Retokenization Drift for details.
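A minimal sketch of the failure mode (this is not the package's bundled `[drift]` demo; it assumes a HuggingFace tokenizer is available for your model):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("<YOUR_MODEL_ID>")

# Two token sequences that can decode to the same text: a model may emit
# something like the first during generation, while retokenizing the decoded
# text yields the second (canonical) split.
ids_generated = tok.encode("HAV", add_special_tokens=False) + tok.encode("ING", add_special_tokens=False)
ids_retokenized = tok.encode("HAVING", add_special_tokens=False)

print(tok.decode(ids_generated) == tok.decode(ids_retokenized))  # often True for BPE tokenizers
print(ids_generated == ids_retokenized)                          # typically False: different IDs
```

If a trainer retokenizes the decoded text, it trains on the second sequence even though the model actually emitted the first; TITO sidesteps this by keeping the generated IDs as-is.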
Installation¶
Install strands-vllm along with the Strands Agents SDK:
```bash
pip install strands-vllm strands-agents-tools
```
For the retokenization drift demos (requires a HuggingFace tokenizer):
```bash
pip install "strands-vllm[drift]" strands-agents-tools
```
Requirements¶
- vLLM server running with your model (v0.10.2+ for `return_token_ids` support)
- For tool calling: vLLM must be started with tool calling enabled and an appropriate chat template
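To confirm the vLLM version on the server host (assuming vLLM is installed in the active environment):

```bash
python -c "import vllm; print(vllm.__version__)"
```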
Usage¶
1. Start vLLM Server¶
First, start a vLLM server with your model:
```bash
vllm serve <MODEL_ID> \
  --host 0.0.0.0 \
  --port 8000
```
For tool calling support, add the appropriate flags for your model:
```bash
vllm serve <MODEL_ID> \
  --host 0.0.0.0 \
  --port 8000 \
  --enable-auto-tool-choice \
  --tool-call-parser <PARSER>  # e.g., llama3_json, hermes, etc.
```
See vLLM tool calling documentation for supported parsers and chat templates.
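Once the server is up, you can sanity-check the OpenAI-compatible endpoint before wiring up an agent; `/v1/models` should list the served model:

```bash
curl http://localhost:8000/v1/models
```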
2. Basic Agent¶
```python
import os

from strands import Agent
from strands_vllm import VLLMModel, VLLMTokenRecorder

# Configure via environment variables or directly
base_url = os.getenv("VLLM_BASE_URL", "http://localhost:8000/v1")
model_id = os.getenv("VLLM_MODEL_ID", "<YOUR_MODEL_ID>")

model = VLLMModel(
    base_url=base_url,
    model_id=model_id,
    return_token_ids=True,
)

recorder = VLLMTokenRecorder()
agent = Agent(model=model, callback_handler=recorder)

result = agent("What is the capital of France?")
print(result)

# Access TITO data for RL training
print(f"Prompt tokens: {len(recorder.prompt_token_ids or [])}")
print(f"Response tokens: {len(recorder.token_ids or [])}")
```
3. Tool Call Validation (Optional, Recommended for RL)¶
The Strands SDK already handles unknown tools and malformed JSON gracefully; `VLLMToolValidationHooks` adds RL-friendly enhancements:
```python
import os

from strands import Agent
from strands_tools.calculator import calculator
from strands_vllm import VLLMModel, VLLMToolValidationHooks

model = VLLMModel(
    base_url=os.getenv("VLLM_BASE_URL", "http://localhost:8000/v1"),
    model_id=os.getenv("VLLM_MODEL_ID", "<YOUR_MODEL_ID>"),
    return_token_ids=True,
)

agent = Agent(
    model=model,
    tools=[calculator],
    hooks=[VLLMToolValidationHooks()],
)

result = agent("Compute 17 * 19 using the calculator tool.")
print(result)
```
What it adds beyond Strands defaults:
- Unknown-tool errors include the allowed tools list, helping RL training learn valid tool names
- Schema validation catches missing required arguments and unknown arguments before tool execution
Invalid tool calls receive deterministic error messages, providing cleaner RL training signals.
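To see the validation behavior with the agent above, you can provoke a bad call and inspect the conversation. A rough sketch; the key names follow the Strands SDK's Bedrock-style message format, so treat them as an assumption:

```python
# deliberately ask for a tool that does not exist
agent("Call a tool named fake_tool with no arguments.")

# scan the conversation for tool results; "content" / "toolResult" keys
# are assumed from the Strands message format
for message in agent.messages:
    for block in message.get("content", []):
        if "toolResult" in block:
            print(block["toolResult"])
```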
4. Agent Lightning Integration¶
VLLMTokenRecorder automatically adds token IDs to OpenTelemetry spans for Agent Lightning compatibility:
```python
import os

from strands import Agent
from strands_vllm import VLLMModel, VLLMTokenRecorder

model = VLLMModel(
    base_url=os.getenv("VLLM_BASE_URL", "http://localhost:8000/v1"),
    model_id=os.getenv("VLLM_MODEL_ID", "<YOUR_MODEL_ID>"),
    return_token_ids=True,
)

# add_to_span=True (default) adds token IDs to OpenTelemetry spans
recorder = VLLMTokenRecorder(add_to_span=True)
agent = Agent(model=model, callback_handler=recorder)

result = agent("Hello!")
```
The following span attributes are set:
| Attribute | Description |
|---|---|
| `llm.token_count.prompt` | Token count for the prompt (OpenTelemetry semantic convention) |
| `llm.token_count.completion` | Token count for the completion (OpenTelemetry semantic convention) |
| `llm.hosted_vllm.prompt_token_ids` | Token ID array for the prompt |
| `llm.hosted_vllm.response_token_ids` | Token ID array for the response |
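These attributes only land on spans if an OpenTelemetry tracer provider is configured in your process. A minimal sketch using the standard OpenTelemetry SDK (your real setup, e.g. an OTLP exporter feeding Agent Lightning, will differ):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# print spans to stdout so the llm.* attributes are visible during development
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
```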
5. RL Training with TokenManager¶
For building RL-ready trajectories with loss masks:
```python
import asyncio
import os

from strands import Agent, tool
from strands_tools.calculator import calculator as _calculator_impl
from strands_vllm import TokenManager, VLLMModel, VLLMTokenRecorder, VLLMToolValidationHooks


@tool
def calculator(expression: str) -> dict:
    return _calculator_impl(expression=expression)


async def main():
    model = VLLMModel(
        base_url=os.getenv("VLLM_BASE_URL", "http://localhost:8000/v1"),
        model_id=os.getenv("VLLM_MODEL_ID", "<YOUR_MODEL_ID>"),
        return_token_ids=True,
    )
    recorder = VLLMTokenRecorder()
    agent = Agent(
        model=model,
        tools=[calculator],
        hooks=[VLLMToolValidationHooks()],
        callback_handler=recorder,
    )
    await agent.invoke_async("What is 25 * 17?")

    # Build RL trajectory with loss mask
    tm = TokenManager()
    for entry in recorder.history:
        if entry.get("prompt_token_ids"):
            tm.add_prompt(entry["prompt_token_ids"])  # loss_mask=0
        if entry.get("token_ids"):
            tm.add_response(entry["token_ids"])  # loss_mask=1

    print(f"Total tokens: {len(tm)}")
    print(f"Prompt tokens: {sum(1 for m in tm.loss_mask if m == 0)}")
    print(f"Response tokens: {sum(1 for m in tm.loss_mask if m == 1)}")
    print(f"Token IDs: {tm.token_ids[:20]}...")  # First 20 tokens
    print(f"Loss mask: {tm.loss_mask[:20]}...")


asyncio.run(main())
```
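From here the trajectory can be handed to a trainer. A hypothetical conversion to PyTorch tensors (assumes torch is installed; place this inside `main()` after the loop above, since `tm` is local to it):

```python
import torch

# loss is typically applied only where loss_mask == 1 (response tokens)
input_ids = torch.tensor(tm.token_ids, dtype=torch.long)
loss_mask = torch.tensor(tm.loss_mask, dtype=torch.bool)
```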
Configuration¶
Model Configuration¶
The VLLMModel accepts the following parameters:
| Parameter | Description | Example | Required |
|---|---|---|---|
| `base_url` | vLLM server URL | `"http://localhost:8000/v1"` | Yes |
| `model_id` | Model identifier | `"<YOUR_MODEL_ID>"` | Yes |
| `api_key` | API key (usually "EMPTY" for local vLLM) | `"EMPTY"` | No (default: `"EMPTY"`) |
| `return_token_ids` | Request token IDs from vLLM | `True` | No (default: `False`) |
| `disable_tools` | Remove tools/tool_choice from requests | `True` | No (default: `False`) |
| `params` | Additional generation parameters | `{"temperature": 0, "max_tokens": 256}` | No |
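For example, a deterministic setup for evaluation runs, using the `params` passthrough documented above:

```python
model = VLLMModel(
    base_url="http://localhost:8000/v1",
    model_id="<YOUR_MODEL_ID>",
    api_key="EMPTY",  # local vLLM servers usually accept any key
    return_token_ids=True,
    params={"temperature": 0, "max_tokens": 256},
)
```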
VLLMTokenRecorder Configuration¶
| Parameter | Description | Default |
|---|---|---|
| `inner` | Inner callback handler to chain | `None` |
| `add_to_span` | Add token IDs to OpenTelemetry spans | `True` |
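The `inner` parameter chains another callback handler, so you can keep streamed console output while capturing tokens. A sketch assuming the Strands convention of keyword-argument callbacks with a `data` key for text deltas:

```python
def print_stream(**kwargs):
    # assumed Strands callback convention: text deltas arrive under "data"
    if "data" in kwargs:
        print(kwargs["data"], end="")

recorder = VLLMTokenRecorder(inner=print_stream, add_to_span=True)
```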
VLLMToolValidationHooks Configuration¶
| Parameter | Description | Default |
|---|---|---|
| `include_allowed_tools_in_errors` | Include list of allowed tools in error messages | `True` |
| `max_allowed_tools_in_error` | Maximum tool names to show in error messages | `25` |
| `validate_input_shape` | Validate required/unknown args against schema | `True` |
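For example, trimming the allowed-tools list in error messages for agents with large tool sets (parameter names as documented above):

```python
hooks = VLLMToolValidationHooks(
    include_allowed_tools_in_errors=True,
    max_allowed_tools_in_error=10,  # show at most 10 tool names per error
    validate_input_shape=True,
)
```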
Example error messages (more informative than Strands defaults):
- Unknown tool: `Error: unknown tool: fake_tool | allowed_tools=[calculator, search, ...]`
- Missing argument: `Error: tool_name=<calculator> | missing required argument(s): expression`
- Unknown argument: `Error: tool_name=<calculator> | unknown argument(s): invalid_param`
Troubleshooting¶
Connection errors to vLLM server¶
Ensure your vLLM server is running and accessible:
```bash
# Check if server is responding
curl http://localhost:8000/health
```
No token IDs captured¶
Ensure:
- vLLM version is 0.10.2 or later
- `return_token_ids=True` is set on `VLLMModel`
- Your vLLM server supports `return_token_ids` in streaming mode
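A quick runtime check (the recorder attributes are the same ones used in the Basic Agent example above):

```python
# after a run, both fields should be populated when TITO is working
assert recorder.prompt_token_ids, "no prompt token IDs captured"
assert recorder.token_ids, "no response token IDs captured"
```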
RL training needs cleaner error signals¶
Strands handles unknown tools gracefully, but for RL training you may want more informative errors. Add VLLMToolValidationHooks to get errors that include the list of allowed tools and validate argument schemas.
Model only supports single tool calls¶
Some models/chat templates only support one tool call per message. If you see "This model only supports single tool-calls at once!", adjust your prompts to request one tool at a time.
References¶
- strands-vllm Repository
- vLLM Documentation
- Agent Lightning GitHub - The absolute trainer to light up AI agents
- Agent Lightning Blog Post - No More Retokenization Drift
- Strands Agents API