llama.cpp¶
llama.cpp is a high-performance C++ inference engine for running large language models locally. The Strands Agents SDK implements a llama.cpp provider, allowing you to run agents against any llama.cpp server with quantized models.
Installation¶
llama.cpp support is included in the base Strands Agents package; no separate provider package needs to be installed.
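If you are starting from scratch, a minimal install sketch is shown below. The package names are assumptions based on the SDK's standard PyPI distributions (`strands-agents` for the SDK, `strands-agents-tools` for the optional community tools used in the examples):

```bash
# Core SDK; assumed to ship the llama.cpp provider
pip install strands-agents

# Optional: community tools such as the calculator used below (assumed package name)
pip install strands-agents-tools
```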
Usage¶
After setting up a llama.cpp server, you can import and initialize the Strands Agents' llama.cpp provider as follows:
```python
from strands import Agent
from strands.models.llamacpp import LlamaCppModel
from strands_tools import calculator

model = LlamaCppModel(
    base_url="http://localhost:8080",
    # **model_config
    model_id="default",
    params={
        "max_tokens": 1000,
        "temperature": 0.7,
        "repeat_penalty": 1.1,
    },
)

agent = Agent(model=model, tools=[calculator])
response = agent("What is 2+2")
print(response)
```
To connect to a remote llama.cpp server, you can specify a different base URL:
```python
model = LlamaCppModel(
    base_url="http://your-server:8080",
    model_id="default",
    params={
        "temperature": 0.7,
        "cache_prompt": True,
    },
)
```
Configuration¶
Server Setup¶
Before using `LlamaCppModel`, you need a running llama.cpp server with a GGUF model:
```bash
# Download a model (e.g., using Hugging Face CLI)
huggingface-cli download ggml-org/Qwen2.5-7B-GGUF \
    Qwen2.5-7B-Q4_K_M.gguf --local-dir ./models

# Start the server
llama-server -m models/Qwen2.5-7B-Q4_K_M.gguf \
    --host 0.0.0.0 --port 8080 -c 8192 --jinja
```
Model Configuration¶
The `model_config` configures the underlying model selected for inference; the same fields can also be updated after construction (see the sketch following the table). The supported configurations are:
| Parameter | Description | Example | Default |
|---|---|---|---|
| `base_url` | llama.cpp server URL | `http://localhost:8080` | `http://localhost:8080` |
| `model_id` | Model identifier | `default` | `default` |
| `params` | Model parameters | `{"temperature": 0.7, "max_tokens": 1000}` | `None` |
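If `LlamaCppModel` follows the SDK's common `update_config()` / `get_config()` pattern (an assumption; check the provider's API reference), the configuration can be adjusted after the model is created:

```python
from strands.models.llamacpp import LlamaCppModel

model = LlamaCppModel(base_url="http://localhost:8080", model_id="default")

# Assumes the provider exposes the SDK's update_config()/get_config() helpers
model.update_config(params={"temperature": 0.2, "max_tokens": 512})
print(model.get_config())
```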
Supported Parameters¶
Standard parameters: `temperature`, `max_tokens`, `top_p`, `frequency_penalty`, `presence_penalty`, `stop`, `seed`

llama.cpp-specific parameters: `repeat_penalty`, `top_k`, `min_p`, `typical_p`, `tfs_z`, `mirostat`, `grammar`, `json_schema`, `cache_prompt`
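Both groups can be combined in a single `params` dict. A minimal sketch using only parameters from the lists above (the specific values are illustrative):

```python
from strands.models.llamacpp import LlamaCppModel

# Mixes standard parameters (temperature, max_tokens, stop) with
# llama.cpp-specific ones (top_k, min_p, repeat_penalty, cache_prompt)
model = LlamaCppModel(
    base_url="http://localhost:8080",
    model_id="default",
    params={
        "temperature": 0.7,
        "max_tokens": 512,
        "stop": ["\n\n"],
        "top_k": 40,
        "min_p": 0.05,
        "repeat_penalty": 1.1,
        "cache_prompt": True,
    },
)
```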
Troubleshooting¶
Connection Refused¶
If you encounter connection errors, ensure:

- The llama.cpp server is running (the `llama-server` command); a quick health check is sketched after this list
- The server URL and port are correct
- No firewall is blocking the connection
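To confirm the server is reachable, you can query `llama-server`'s built-in `/health` endpoint (adjust the host and port to match your setup):

```bash
# Should report an OK status once the model has finished loading
curl http://localhost:8080/health
```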
Context Window Overflow¶
If you get context overflow errors:

- Increase the context size with the `-c` flag when starting the server
- Reduce the input size
- Enable prompt caching with `cache_prompt: True` (sketched below)
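For example, prompt caching can be enabled alongside a larger server context; this sketch reuses the flags and parameters shown earlier:

```python
# Restart the server with more context first, e.g.:
#   llama-server -m models/Qwen2.5-7B-Q4_K_M.gguf --host 0.0.0.0 --port 8080 -c 16384 --jinja
from strands.models.llamacpp import LlamaCppModel

model = LlamaCppModel(
    base_url="http://localhost:8080",
    model_id="default",
    params={
        "cache_prompt": True,  # reuse the cached prompt prefix across requests
        "max_tokens": 512,     # keep the generated portion of the context bounded
    },
)
```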
Advanced Features¶
Structured Output¶
llama.cpp models support structured output through native JSON schema validation. When you use `Agent.structured_output()`, the SDK uses llama.cpp's `json_schema` parameter to constrain the output:
```python
from pydantic import BaseModel, Field

from strands import Agent
from strands.models.llamacpp import LlamaCppModel


class PersonInfo(BaseModel):
    """Extract person information from text."""

    name: str = Field(description="Full name of the person")
    age: int = Field(description="Age in years")
    occupation: str = Field(description="Job or profession")


model = LlamaCppModel(
    base_url="http://localhost:8080",
    model_id="default",
)
agent = Agent(model=model)

result = agent.structured_output(
    PersonInfo,
    "John Smith is a 30-year-old software engineer working at a tech startup.",
)

print(f"Name: {result.name}")       # "John Smith"
print(f"Age: {result.age}")         # 30
print(f"Job: {result.occupation}")  # "software engineer"
```
Grammar Constraints¶
llama.cpp supports GBNF grammar constraints to ensure output follows specific patterns:
```python
model = LlamaCppModel(
    base_url="http://localhost:8080",
    params={
        "grammar": '''
root ::= answer
answer ::= "yes" | "no" | "maybe"
'''
    },
)
agent = Agent(model=model)

response = agent("Is the Earth flat?")  # Will only output "yes", "no", or "maybe"
```
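Grammars are not limited to simple enumerations. The sketch below uses an illustrative grammar (written for this example, not taken from llama.cpp's bundled grammars) to restrict output to a comma-separated list of lowercase words:

```python
from strands import Agent
from strands.models.llamacpp import LlamaCppModel

model = LlamaCppModel(
    base_url="http://localhost:8080",
    params={
        # GBNF: one or more lowercase words separated by ", "
        "grammar": '''
root ::= word (", " word)*
word ::= [a-z]+
'''
    },
)
agent = Agent(model=model)

response = agent("Name three primary colors.")  # e.g. "red, yellow, blue"
```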
Advanced Sampling¶
llama.cpp offers sophisticated sampling parameters for fine-tuning output:
```python
# High-quality output (slower)
model = LlamaCppModel(
    base_url="http://localhost:8080",
    params={
        "temperature": 0.3,
        "top_k": 10,
        "repeat_penalty": 1.2,
    },
)

# Creative writing
model = LlamaCppModel(
    base_url="http://localhost:8080",
    params={
        "temperature": 0.9,
        "top_p": 0.95,
        "mirostat": 2,
        "mirostat_ent": 5.0,
    },
)
```
Multimodal Support¶
For multimodal models like Qwen2.5-Omni, llama.cpp can process images and audio; the example below sends an image:
```python
# Requires a multimodal model and the --mmproj flag when starting the server
import base64
import io

from PIL import Image

from strands import Agent
from strands.models.llamacpp import LlamaCppModel

model = LlamaCppModel(base_url="http://localhost:8080", model_id="default")
agent = Agent(model=model)

# Image analysis
img = Image.open("example.png")
img_bytes = io.BytesIO()
img.save(img_bytes, format="PNG")
img_base64 = base64.b64encode(img_bytes.getvalue()).decode()

image_message = {
    "role": "user",
    "content": [
        {"type": "image", "image": {"data": img_base64, "format": "png"}},
        {"type": "text", "text": "Describe this image"},
    ],
}

response = agent([image_message])
```