Serve LLMs with SGLang on DGX Station for prefix-cached multi-turn and structured output inference
If you haven't already, add your user to the docker group to run Docker without sudo:
sudo usermod -aG docker $USER
newgrp docker
# HuggingFace token (required)
# Get a token from https://huggingface.co/settings/tokens
export HF_TOKEN="your_huggingface_token"
# Model to serve
export MODEL_HANDLE="Qwen/Qwen3-8B"
# Maximum context length
export MAX_MODEL_LEN=8192
Pull the SGLang container image with CUDA 13.0 support (required for Blackwell SM103):
docker pull lmsysorg/sglang:latest-cu130
On DGX Station with multiple GPUs, identify the GB300's device index:
nvidia-smi --query-gpu=index,name --format=csv,noheader
Look for the row showing NVIDIA GB300. Note its index (e.g., 1).
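The lookup can also be scripted. The sketch below parses the `index, name` rows that the query above prints and returns the index of the first row whose name contains GB300. The function names and the sample output are illustrative assumptions, not part of the playbook.

```python
import subprocess


def find_gpu_index(csv_text, needle="GB300"):
    """Return the index of the first GPU whose name contains `needle`, else None."""
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        # Each row looks like "<index>, <name>"; split on the first comma only.
        index, _, name = line.partition(",")
        if needle in name:
            return int(index.strip())
    return None


def query_gpus():
    """Run the same nvidia-smi query as above and return its stdout."""
    return subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout


# Placeholder output for illustration; the rows on your system will differ.
sample = "0, Some Other GPU\n1, NVIDIA GB300\n"
print(find_gpu_index(sample))  # 1
```

On the DGX Station itself, `find_gpu_index(query_gpus())` would give the value to use for `device=` in the next command.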
Launch the SGLang server:
# Replace device=1 with the GB300 index reported by nvidia-smi above
docker run -d \
--name sglang-server \
--gpus '"device=1"' \
--ipc host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 30000:30000 \
-e HF_TOKEN="$HF_TOKEN" \
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
lmsysorg/sglang:latest-cu130 \
sglang serve --model-path "$MODEL_HANDLE" \
--host 0.0.0.0 \
--port 30000 \
--context-length "$MAX_MODEL_LEN" \
--mem-fraction-static 0.85
Check the server logs:
docker logs -f sglang-server
Wait for the server to show it is ready:
INFO: Uvicorn running on http://0.0.0.0:30000
Press Ctrl+C to exit the log view.
NOTE
First launch downloads the model and compiles kernels. Subsequent starts are faster thanks to cached weights and compiled artifacts.
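If you would rather script the wait than watch the logs, a polling loop is one option. The sketch below is a minimal example under assumptions: it treats `/v1/models` as a cheap OpenAI-compatible endpoint to probe, and the function name, timeout, and interval are arbitrary choices rather than anything SGLang prescribes.

```python
import time
import urllib.request


def wait_until_ready(url, timeout_s=600.0, interval_s=5.0,
                     opener=urllib.request.urlopen):
    """Poll `url` until it answers with HTTP 200 or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with opener(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # URLError and connection failures are OSError subclasses:
            # the server is not listening yet, so keep waiting.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example call (assumed probe endpoint):
# wait_until_ready("http://localhost:30000/v1/models")
```

The generous default timeout reflects the first-launch behavior described in the note: model download and kernel compilation can take several minutes.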
Send a chat completion request using the OpenAI-compatible API:
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "'"$MODEL_HANDLE"'",
"messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}],
"max_tokens": 256
}'
The response follows the standard OpenAI format with a choices array containing the model's answer.
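The same request can be sent from Python with only the standard library. This is a sketch mirroring the curl call above; the helper names `build_chat_request`, `chat`, and `extract_answer` are made up for illustration.

```python
import json
import urllib.request


def build_chat_request(model, user_text, max_tokens=256):
    """Build the same JSON body as the curl example."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
    }


def extract_answer(response):
    """Return the assistant text from an OpenAI-style chat completion dict."""
    return response["choices"][0]["message"]["content"]


def chat(base_url, body):
    """POST `body` to /v1/chat/completions and return the decoded JSON."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Usage against the server started above:
# body = build_chat_request("Qwen/Qwen3-8B", "Explain quantum computing in simple terms.")
# print(extract_answer(chat("http://localhost:30000", body)))
```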
SGLang's RadixAttention automatically keeps the KV cache of processed tokens in a radix tree. When a follow-up request shares a prefix with an earlier one, the cached entries are reused and prefill is skipped for the matching tokens.
Send a multi-turn conversation. The system prompt is deliberately long so the shared prefix exceeds SGLang's page size (64 tokens), which is the minimum unit for cache reuse:
# Turn 1
curl -s http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "'"$MODEL_HANDLE"'",
"messages": [
{"role": "system", "content": "You are an expert physics tutor who explains concepts clearly and concisely. You use real-world analogies and everyday examples to make abstract ideas concrete. When answering, first state the key concept in one sentence, then give a short explanation with an example."},
{"role": "user", "content": "What is the difference between speed and velocity?"}
],
"max_tokens": 256
}' | python3 -m json.tool
# Turn 2 — extends the same conversation
curl -s http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "'"$MODEL_HANDLE"'",
"messages": [
{"role": "system", "content": "You are an expert physics tutor who explains concepts clearly and concisely. You use real-world analogies and everyday examples to make abstract ideas concrete. When answering, first state the key concept in one sentence, then give a short explanation with an example."},
{"role": "user", "content": "What is the difference between speed and velocity?"},
{"role": "assistant", "content": "Speed is a scalar quantity that measures how fast an object moves, while velocity is a vector quantity that includes both speed and direction. For example, a car driving at 60 km/h has a speed of 60 km/h regardless of where it is headed. But if that car is driving 60 km/h north, that is its velocity — change direction to south and the velocity changes even though the speed stays the same."},
{"role": "user", "content": "Can you give me another example that shows why the distinction matters in real physics problems?"}
],
"max_tokens": 256
}' | python3 -m json.tool
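The second request shares its leading messages with the first, so the server reuses the KV cache for that prefix and runs prefill only over the tokens that follow it. In this example the reusable prefix is the system message and the first user turn; the assistant message is reused as well only when it matches, token for token, what the server generated in turn 1. Reusing the prefix reduces first-token latency for follow-up turns.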
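Cache reuse depends on a follow-up request repeating the earlier turns exactly, so a client normally builds turn 2 by appending to the turn 1 messages rather than retyping them. The sketch below shows that pattern with hypothetical helper names. It compares whole messages, which only approximates the token-level matching the server performs, and `page_aligned_tokens` assumes reuse happens in whole pages of 64 tokens as described above.

```python
def extend_conversation(messages, assistant_reply, next_user_text):
    """Return a new list: the old turns, the server's reply, then the new question."""
    return messages + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": next_user_text},
    ]


def shared_message_prefix(a, b):
    """Count how many leading messages are identical in both lists."""
    count = 0
    for left, right in zip(a, b):
        if left != right:
            break
        count += 1
    return count


def page_aligned_tokens(prefix_tokens, page_size=64):
    """Round a shared-prefix token count down to a whole number of pages."""
    return (prefix_tokens // page_size) * page_size


turn1 = [
    {"role": "system", "content": "You are an expert physics tutor."},
    {"role": "user", "content": "What is the difference between speed and velocity?"},
]
# The reply string is a placeholder for the text returned by turn 1.
turn2 = extend_conversation(turn1, "<reply returned by turn 1>",
                            "Can you give me another example?")
print(shared_message_prefix(turn1, turn2))  # 2
print(page_aligned_tokens(150))             # 128
```

Passing the reply actually returned by turn 1 into `extend_conversation` gives the server the best chance of matching the assistant tokens it already cached.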
Check the cache hit rate in the server logs. SGLang logs each prefill batch with the number of cached tokens reused:
docker logs sglang-server 2>&1 | grep "cached-token" | tail -10
Look for #cached-token values greater than 0 on the second and later turns. They confirm that RadixAttention is reusing the KV cache from the shared prefix.
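To turn those log lines into a single number, the counts can be extracted with a regular expression. The sketch below assumes each prefill line carries a `#new-token` field before `#cached-token`; the sample lines are hand-written for illustration, and the exact log format may vary between SGLang versions.

```python
import re

# Assumed layout: "... #new-token: <n>, #cached-token: <m> ..."
CACHED_RE = re.compile(r"#new-token:\s*(\d+).*?#cached-token:\s*(\d+)")


def cache_stats(log_text):
    """Return (new_tokens, cached_tokens) pairs for each matching log line."""
    pairs = []
    for line in log_text.splitlines():
        match = CACHED_RE.search(line)
        if match:
            pairs.append((int(match.group(1)), int(match.group(2))))
    return pairs


def hit_rate(stats):
    """Fraction of prompt tokens served from cache across all batches."""
    new = sum(n for n, _ in stats)
    cached = sum(c for _, c in stats)
    total = new + cached
    return cached / total if total else 0.0


# Illustrative lines only, not captured from a real server.
sample = (
    "Prefill batch. #new-seq: 1, #new-token: 80, #cached-token: 0\n"
    "Prefill batch. #new-seq: 1, #new-token: 40, #cached-token: 120\n"
)
stats = cache_stats(sample)
print(stats)            # [(80, 0), (40, 120)]
print(hit_rate(stats))  # 0.5
```

In practice the input would be the output of the `docker logs` command above, read from a pipe or a saved file.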
SGLang's constrained decoding guarantees valid JSON output matching a schema. This uses the xGrammar backend to overlap grammar mask generation with the model's forward pass, adding minimal latency.
Generate a structured response:
curl -s http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "'"$MODEL_HANDLE"'",
"messages": [
{"role": "user", "content": "List three programming languages with their primary use case and year created."}
],
"max_tokens": 512,
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "languages",
"schema": {
"type": "object",
"properties": {
"languages": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {"type": "string"},
"primary_use": {"type": "string"},
"year_created": {"type": "integer"}
},
"required": ["name", "primary_use", "year_created"]
}
}
},
"required": ["languages"]
}
}
}
}' | python3 -m json.tool
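The constrained decoder only emits tokens that keep the output valid against the schema, so the choices[0].message.content field parses as a JSON object matching it. The exception is truncation: if generation stops at max_tokens (finish_reason is "length"), the JSON can be cut off partway through, so check finish_reason before parsing.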
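On the client side, the content still arrives as a string and has to be parsed. A minimal sketch of that step follows, using only the standard library; the function name `parse_languages` and the sample response are hypothetical, and the field checks simply mirror the schema in the request above.

```python
import json

REQUIRED_FIELDS = {"name", "primary_use", "year_created"}


def parse_languages(response):
    """Parse and sanity-check the structured content of a chat completion dict."""
    choice = response["choices"][0]
    # A "length" finish reason means max_tokens cut the output short,
    # so the JSON may be incomplete even with constrained decoding.
    if choice.get("finish_reason") == "length":
        raise ValueError("output truncated by max_tokens")
    data = json.loads(choice["message"]["content"])
    for item in data["languages"]:
        missing = REQUIRED_FIELDS - set(item)
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
        if not isinstance(item["year_created"], int):
            raise ValueError("year_created must be an integer")
    return data["languages"]


# Hand-written stand-in for a server response, for illustration only.
sample = {
    "choices": [{
        "finish_reason": "stop",
        "message": {"content": json.dumps({"languages": [
            {"name": "Python", "primary_use": "scripting", "year_created": 1991},
        ]})},
    }]
}
print(parse_languages(sample)[0]["name"])  # Python
```

The explicit checks are redundant when decoding completes normally, but they make a truncated or unexpected response fail loudly instead of propagating downstream.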
Run the included benchmark script to measure how prefix caching improves multi-turn latency. The script is in the assets/ directory of this playbook.
python3 -m venv .venv && source .venv/bin/activate
pip install requests
python3 assets/benchmark_multiturn.py \
--base-url http://localhost:30000 \
--model "$MODEL_HANDLE" \
--num-conversations 20 \
--turns-per-conversation 5
The script sends parallel multi-turn conversations and measures the latency of each turn.
You should see latency decrease for later turns in each conversation as the shared prefix grows, demonstrating RadixAttention's cache reuse.
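If you collect your own timings, grouping them by turn index makes the trend easy to read. The sketch below is not the benchmark script's implementation; it is a small stand-alone aggregation with made-up numbers, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_latency_by_turn(samples):
    """Average latency per turn index from (turn_index, seconds) samples."""
    by_turn = defaultdict(list)
    for turn, seconds in samples:
        by_turn[turn].append(seconds)
    return {turn: mean(values) for turn, values in sorted(by_turn.items())}


# Made-up timings for two conversations of three turns each.
samples = [(1, 0.5), (2, 0.25), (3, 0.25), (1, 1.5), (2, 0.75), (3, 0.25)]
print(mean_latency_by_turn(samples))  # {1: 1.0, 2: 0.5, 3: 0.25}
```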
TIP
If you downloaded this playbook as a zip, the assets/ directory is already present. If you cloned the full repository, navigate to nvidia/station-sglang-inference/ first.
Stop and remove the container:
docker stop sglang-server
docker rm sglang-server
Optionally remove the image:
docker rmi lmsysorg/sglang:latest-cu130