vLLM

Community Contribution

This is a community-maintained package that is not owned or supported by the Strands team. Validate and review the package before using it in your project.

Have your own integration? We'd love to add it here too!

Language Support

This provider is only supported in Python.

strands-vllm is a vLLM model provider for the Strands Agents SDK with Token-In/Token-Out (TITO) support for agentic RL training. It integrates with vLLM's OpenAI-compatible API and is optimized for reinforcement learning workflows with Agent Lightning.

Features:

  • OpenAI-Compatible API: Uses vLLM's OpenAI-compatible /v1/chat/completions endpoint with streaming
  • TITO Support: Captures prompt_token_ids and token_ids directly from vLLM - no retokenization drift
  • Tool Call Validation: Optional hooks for RL-friendly error messages (allowed tools list, schema validation)
  • Agent Lightning Integration: Automatically adds token IDs to OpenTelemetry spans for RL training data extraction
  • Streaming: Full streaming support with token ID capture via VLLMTokenRecorder

Why TITO?

Traditional retokenization can cause drift in RL training—the same text may tokenize differently during inference vs. training (e.g., "HAVING" → H+AVING vs. HAV+ING). TITO captures exact tokens from vLLM, eliminating this issue. See No More Retokenization Drift for details.
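
To see the difference concretely, you can compare TITO-captured tokens against a retokenization of the decoded text. The sketch below assumes the HuggingFace tokenizer for your served model (the tokenizer dependency is what the [drift] extra installs) and a list of token IDs captured by VLLMTokenRecorder; the model ID is a placeholder.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("<YOUR_MODEL_ID>")

def has_drift(tito_ids: list[int]) -> bool:
    # Decode the exact tokens vLLM generated, then tokenize that text again.
    text = tokenizer.decode(tito_ids, skip_special_tokens=False)
    retokenized = tokenizer.encode(text, add_special_tokens=False)
    # If the round trip changes the token sequence, training on retokenized
    # text would not match the tokens the model actually sampled.
    return retokenized != list(tito_ids)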

Installation

Install strands-vllm along with the Strands Agents SDK:

pip install strands-vllm strands-agents-tools

For retokenization drift demos (requires HuggingFace tokenizer):

pip install "strands-vllm[drift]" strands-agents-tools

Requirements

  • vLLM server running with your model (v0.10.2+ for return_token_ids support)
  • For tool calling: vLLM must be started with tool calling enabled and an appropriate chat template

Usage

1. Start vLLM Server

First, start a vLLM server with your model:

vllm serve <MODEL_ID> \
    --host 0.0.0.0 \
    --port 8000

For tool calling support, add the appropriate flags for your model:

vllm serve <MODEL_ID> \
    --host 0.0.0.0 \
    --port 8000 \
    --enable-auto-tool-choice \
    --tool-call-parser <PARSER>  # e.g., llama3_json, hermes, etc.

See vLLM tool calling documentation for supported parsers and chat templates.

2. Basic Agent

import os
from strands import Agent
from strands_vllm import VLLMModel, VLLMTokenRecorder

# Configure via environment variables or directly
base_url = os.getenv("VLLM_BASE_URL", "http://localhost:8000/v1")
model_id = os.getenv("VLLM_MODEL_ID", "<YOUR_MODEL_ID>")

model = VLLMModel(
    base_url=base_url,
    model_id=model_id,
    return_token_ids=True,
)

recorder = VLLMTokenRecorder()
agent = Agent(model=model, callback_handler=recorder)

result = agent("What is the capital of France?")
print(result)

# Access TITO data for RL training
print(f"Prompt tokens: {len(recorder.prompt_token_ids or [])}")
print(f"Response tokens: {len(recorder.token_ids or [])}")

3. Tool Call Validation

The Strands SDK already handles unknown tools and malformed JSON gracefully. VLLMToolValidationHooks adds RL-friendly enhancements:

import os
from strands import Agent
from strands_tools.calculator import calculator
from strands_vllm import VLLMModel, VLLMToolValidationHooks

model = VLLMModel(
    base_url=os.getenv("VLLM_BASE_URL", "http://localhost:8000/v1"),
    model_id=os.getenv("VLLM_MODEL_ID", "<YOUR_MODEL_ID>"),
    return_token_ids=True,
)

agent = Agent(
    model=model,
    tools=[calculator],
    hooks=[VLLMToolValidationHooks()],
)

result = agent("Compute 17 * 19 using the calculator tool.")
print(result)

What it adds beyond Strands defaults:

  • Unknown tool errors include allowed tools list — helps RL training learn valid tool names
  • Schema validation — catches missing required args and unknown args before tool execution

Invalid tool calls receive deterministic error messages, providing cleaner RL training signals.

4. Agent Lightning Integration

VLLMTokenRecorder automatically adds token IDs to OpenTelemetry spans for Agent Lightning compatibility:

import os
from strands import Agent
from strands_vllm import VLLMModel, VLLMTokenRecorder

model = VLLMModel(
    base_url=os.getenv("VLLM_BASE_URL", "http://localhost:8000/v1"),
    model_id=os.getenv("VLLM_MODEL_ID", "<YOUR_MODEL_ID>"),
    return_token_ids=True,
)

# add_to_span=True (default) adds token IDs to OpenTelemetry spans
recorder = VLLMTokenRecorder(add_to_span=True)
agent = Agent(model=model, callback_handler=recorder)

result = agent("Hello!")

The following span attributes are set:

| Attribute | Description |
| --- | --- |
| llm.token_count.prompt | Token count for the prompt (OpenTelemetry semantic convention) |
| llm.token_count.completion | Token count for the completion (OpenTelemetry semantic convention) |
| llm.hosted_vllm.prompt_token_ids | Token ID array for the prompt |
| llm.hosted_vllm.response_token_ids | Token ID array for the response |
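
A minimal way to inspect these attributes is to register an in-memory OpenTelemetry exporter before running the agent. The sketch below assumes the Strands SDK emits its spans through the global OpenTelemetry tracer provider; telemetry setup may differ by SDK version.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

# Register an in-memory exporter on the global tracer provider before
# creating the agent, so its spans are captured here.
exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# ... create the model, recorder, and agent as in the example above, then run it ...

for span in exporter.get_finished_spans():
    ids = span.attributes.get("llm.hosted_vllm.response_token_ids")
    if ids is not None:
        print(span.name, "response tokens:", len(ids))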

5. RL Training with TokenManager

For building RL-ready trajectories with loss masks:

import asyncio
import os
from strands import Agent, tool
from strands_tools.calculator import calculator as _calculator_impl
from strands_vllm import TokenManager, VLLMModel, VLLMTokenRecorder, VLLMToolValidationHooks

@tool
def calculator(expression: str) -> dict:
    return _calculator_impl(expression=expression)

async def main():
    model = VLLMModel(
        base_url=os.getenv("VLLM_BASE_URL", "http://localhost:8000/v1"),
        model_id=os.getenv("VLLM_MODEL_ID", "<YOUR_MODEL_ID>"),
        return_token_ids=True,
    )

    recorder = VLLMTokenRecorder()
    agent = Agent(
        model=model,
        tools=[calculator],
        hooks=[VLLMToolValidationHooks()],
        callback_handler=recorder,
    )

    await agent.invoke_async("What is 25 * 17?")

    # Build RL trajectory with loss mask
    tm = TokenManager()
    for entry in recorder.history:
        if entry.get("prompt_token_ids"):
            tm.add_prompt(entry["prompt_token_ids"])  # loss_mask=0
        if entry.get("token_ids"):
            tm.add_response(entry["token_ids"])       # loss_mask=1

    print(f"Total tokens: {len(tm)}")
    print(f"Prompt tokens: {sum(1 for m in tm.loss_mask if m == 0)}")
    print(f"Response tokens: {sum(1 for m in tm.loss_mask if m == 1)}")
    print(f"Token IDs: {tm.token_ids[:20]}...")  # First 20 tokens
    print(f"Loss mask: {tm.loss_mask[:20]}...")

asyncio.run(main())
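
The loss mask is what lets a trainer compute loss only on tokens the policy actually generated. As a generic illustration (not part of strands-vllm), a PyTorch-style masked next-token loss over the trajectory might look like the following; policy_model is a hypothetical causal LM that returns logits.

import torch
import torch.nn.functional as F

token_ids = torch.tensor(tm.token_ids)          # full trajectory from TokenManager
loss_mask = torch.tensor(tm.loss_mask).float()  # 0 = prompt, 1 = response

# Shift for next-token prediction: predict token t from the tokens before t.
inputs, targets = token_ids[:-1], token_ids[1:]
mask = loss_mask[1:]  # the mask applies to the predicted (target) positions

logits = policy_model(inputs.unsqueeze(0)).logits.squeeze(0)  # hypothetical model
per_token = F.cross_entropy(logits, targets, reduction="none")
loss = (per_token * mask).sum() / mask.sum().clamp(min=1.0)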

Configuration

Model Configuration

The VLLMModel accepts the following parameters:

| Parameter | Description | Example | Required |
| --- | --- | --- | --- |
| base_url | vLLM server URL | "http://localhost:8000/v1" | Yes |
| model_id | Model identifier | "<YOUR_MODEL_ID>" | Yes |
| api_key | API key (usually "EMPTY" for local vLLM) | "EMPTY" | No (default: "EMPTY") |
| return_token_ids | Request token IDs from vLLM | True | No (default: False) |
| disable_tools | Remove tools/tool_choice from requests | True | No (default: False) |
| params | Additional generation parameters | {"temperature": 0, "max_tokens": 256} | No |
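
For reference, a model configured with every parameter above (values taken from the examples in the table) would look like:

model = VLLMModel(
    base_url="http://localhost:8000/v1",
    model_id="<YOUR_MODEL_ID>",
    api_key="EMPTY",        # default; local vLLM typically ignores it
    return_token_ids=True,  # required for TITO capture
    disable_tools=False,    # set True to strip tools/tool_choice from requests
    params={"temperature": 0, "max_tokens": 256},
)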

VLLMTokenRecorder Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| inner | Inner callback handler to chain | None |
| add_to_span | Add token IDs to OpenTelemetry spans | True |

VLLMToolValidationHooks Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| include_allowed_tools_in_errors | Include list of allowed tools in error messages | True |
| max_allowed_tools_in_error | Maximum tool names to show in error messages | 25 |
| validate_input_shape | Validate required/unknown args against schema | True |
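
For example, to cap the number of tool names included in error messages (the value below is illustrative) while keeping schema validation on:

hooks = [
    VLLMToolValidationHooks(
        include_allowed_tools_in_errors=True,
        max_allowed_tools_in_error=10,
        validate_input_shape=True,
    )
]
agent = Agent(model=model, tools=[calculator], hooks=hooks)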

Example error messages (more informative than Strands defaults):

  • Unknown tool: Error: unknown tool: fake_tool | allowed_tools=[calculator, search, ...]
  • Missing argument: Error: tool_name=<calculator> | missing required argument(s): expression
  • Unknown argument: Error: tool_name=<calculator> | unknown argument(s): invalid_param

Troubleshooting

Connection errors to vLLM server

Ensure your vLLM server is running and accessible:

# Check if server is responding
curl http://localhost:8000/health

No token IDs captured

Ensure:

  1. vLLM version is 0.10.2 or later
  2. return_token_ids=True is set on VLLMModel
  3. Your vLLM server supports return_token_ids in streaming mode
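
If all three conditions hold but IDs still appear to be missing, a quick check of the recorder fields (as used in the basic example above) confirms what was captured:

print("prompt_token_ids captured:", recorder.prompt_token_ids is not None)
print("token_ids captured:", recorder.token_ids is not None)
print("history entries:", len(recorder.history))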

RL training needs cleaner error signals

Strands handles unknown tools gracefully, but for RL training you may want more informative errors. Add VLLMToolValidationHooks to get errors that include the list of allowed tools and validate argument schemas.

Model only supports single tool calls

Some models/chat templates only support one tool call per message. If you see "This model only supports single tool-calls at once!", adjust your prompts to request one tool at a time.

References