LLM Inference with SGLang

Step 1
Set up Docker permissions

If you haven't already, add your user to the docker group to run Docker without sudo:

sudo usermod -aG docker $USER
newgrp docker

Step 2
Set up environment variables

# HuggingFace token (only required for gated models such as Llama 3.3).
# Leave empty for public models like Qwen3-8B; for gated models get a token at
# https://huggingface.co/settings/tokens.
export HF_TOKEN=""

# Model to serve (see "Example model IDs" below).
# Default uses Qwen3-8B for fast first-run validation (~10–15 min boot on Station).
# Switch to Qwen3.6-35B-A3B once the workflow is working end-to-end.
export MODEL_HANDLE="Qwen/Qwen3-8B"

# Maximum context length
export MAX_MODEL_LEN=8192

Example model IDs (`MODEL_HANDLE`)

Use any Hugging Face text-generation or chat checkpoint that your SGLang build supports. The table below lists common starting points on DGX Station; always check the model card for license / gated access, VRAM, and context length.

Model ID	Notes
`Qwen/Qwen3-8B`	Default in this playbook. Dense Qwen3 8B; ~16 GB download, fast warmup, ideal for validating the workflow end-to-end.
`Qwen/Qwen3.6-35B-A3B`	Qwen3.6 MoE (~3B active parameters per token); strong quality per GPU hour on Blackwell. ~70 GB download; allow ~30–45 min to first request. Hybrid mamba/SSM architecture, so the Step 7 prefix-cache check does not apply (see the note there).
`Qwen/Qwen3.6-27B`	Dense Qwen3.6; higher VRAM than the MoE row above at equal batch settings.
`google/gemma-3-12b-it`	Popular Gemma 3 instruct (text + vision in full stack; chat API usage is typically text-only).
`google/gemma-3-27b-it`	Larger Gemma 3 instruct variant.
`meta-llama/Llama-3.3-70B-Instruct`	Llama 3.3 70B instruct (gated on Hugging Face; accept the license in the model card before download).

Heavyweight MoE (very large weights; confirm SGLang version + GPU memory before serving):

Model ID	Notes
`deepseek-ai/DeepSeek-V4-Flash`	DeepSeek-V4 family (MoE). Intended to showcase large local models on Station; expect long downloads, strict VRAM headroom, and possible extra flags per SGLang docs.
`deepseek-ai/DeepSeek-V4-Pro`	Larger V4 variant; only if you have sufficient GPU memory and a supported SGLang build.

Choosing an inference backend (DGX Station)

Several OpenAI-compatible servers run well on NVIDIA hardware. None is universally “best”—pick by workload shape and operational constraints.

Backend	Strengths	Typical “use this when…”
SGLang	RadixAttention for shared-prefix workloads; strong structured / grammar decoding; active Blackwell + CUDA 13 paths.	Highly multi-turn, RAG (repeated system + documents), agents, or schema-constrained JSON at scale.
vLLM	Mature PagedAttention, broad model coverage, common default in examples.	You want a well-trodden OSS server with maximum community recipes and straightforward PagedAttention behavior.
TensorRT-LLM	NVIDIA-optimized kernels and quantization workflows for throughput-focused deployment.	You are productionizing on NVIDIA GPUs and can invest in TensorRT-LLM export / engines for peak throughput.

This playbook focuses on SGLang; consult each project’s documentation for model support matrices and quantization modes.

Step 3
Pull the SGLang container

Pull the SGLang container image with CUDA 13.0 support (required for Blackwell SM103):

docker pull lmsysorg/sglang:latest-cu130

Step 4
Identify the GB300 GPU

Identify the GB300's device index:

nvidia-smi --query-gpu=index,name --format=csv,noheader

Look for the row showing NVIDIA GB300. Note its index — on DGX Station the GB300 may be at index 0 or 1 depending on configuration. If nvidia-smi shows only a single GB300, you can simply use --gpus all in the next step.
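
The index lookup above can also be scripted. The following is a minimal sketch, not part of the playbook: it parses the two-column CSV that the nvidia-smi query prints and returns the --gpus argument for Step 5. The sample output, the first device name in it, and the helper names parse_gpus and gpus_flag are assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample of `nvidia-smi --query-gpu=index,name --format=csv,noheader`
# output; the first device name is made up for illustration.
SAMPLE_OUTPUT = "0, NVIDIA RTX PRO 6000\n1, NVIDIA GB300\n"


def parse_gpus(smi_output):
    """Parse 'index, name' CSV rows into a list of (index, name) tuples."""
    gpus = []
    for row in csv.reader(io.StringIO(smi_output), skipinitialspace=True):
        if len(row) >= 2:  # skip blank or malformed lines
            gpus.append((int(row[0]), row[1].strip()))
    return gpus


def gpus_flag(smi_output, name_fragment="GB300"):
    """Return the docker --gpus argument as it would be typed in the shell."""
    gpus = parse_gpus(smi_output)
    matches = [index for index, name in gpus if name_fragment in name]
    if not matches:
        raise ValueError(f"no GPU whose name contains {name_fragment!r}")
    if len(gpus) == 1:
        return "--gpus all"  # single-GPU Station: no need to pin a device
    return f"--gpus '\"device={matches[0]}\"'"


print(gpus_flag(SAMPLE_OUTPUT))
```

In practice the text would come from running the nvidia-smi command (for example through subprocess.run with capture_output=True) rather than from a hard-coded sample.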

Step 5
Start SGLang server

Launch the SGLang server. The flags below are tuned for GB300 (Blackwell SM103) — see notes after the command:

# Use --gpus all on a single-GPU Station, or --gpus '"device=N"' with the
# index from Step 4 if multiple GPUs are present.
docker run -d \
  --name sglang-server \
  --gpus all \
  --ipc host \
  --cap-add SYS_NICE \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 30000:30000 \
  -e HF_TOKEN="$HF_TOKEN" \
  -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
  lmsysorg/sglang:latest-cu130 \
  sglang serve --model-path "$MODEL_HANDLE" \
    --host 0.0.0.0 \
    --port 30000 \
    --context-length $MAX_MODEL_LEN \
    --mem-fraction-static 0.85 \
    --attention-backend flashinfer \
    --enable-cache-report

IMPORTANT

Why these flags on GB300:

--attention-backend flashinfer — the auto-selected trtllm_mha backend currently fails CUDA-graph capture on Blackwell SM103 with buildNdTmaDescriptor errors; fa3 is also rejected (it requires SM ≤ 90). FlashInfer is the safe default.
--cap-add SYS_NICE — lets SGLang set NUMA affinity; otherwise the server logs a warning on every launch.
--enable-cache-report — populates usage.prompt_tokens_details.cached_tokens in OpenAI-style responses so the benchmark in Step 9 can report cached prefill tokens.

Check the server logs:

docker logs -f sglang-server

Wait for the server to show it is ready:

INFO:     Uvicorn running on http://0.0.0.0:30000

Press Ctrl+C to exit the log view.

NOTE

First launch downloads the model and captures CUDA graphs. Plan for ~10–15 min for Qwen/Qwen3-8B and ~30–45 min for Qwen/Qwen3.6-35B-A3B before the server is ready. Subsequent starts are faster thanks to cached weights and compiled artifacts.
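
Because the first boot can take tens of minutes, it may be more convenient to poll the server than to watch the logs. The sketch below is one possible approach, under stated assumptions: it treats an HTTP 200 from the OpenAI-compatible /v1/models route as "ready", and the helper names http_probe and wait_until_ready are hypothetical. The probe is injected so the loop can be exercised without a running server.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout=5.0):
    """Return True if a GET to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, HTTP error status, etc.


def wait_until_ready(probe, timeout_s=3600.0, interval_s=10.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Example call against the Step 5 container (assumed readiness URL):
# ready = wait_until_ready(lambda: http_probe("http://localhost:30000/v1/models"))
```

The one-hour default timeout is an arbitrary choice that comfortably covers the ~30–45 min estimate above; adjust it to the model you serve.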

Step 6
Test basic inference

Send a chat completion request using the OpenAI-compatible API:

curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$MODEL_HANDLE"'",
    "messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}],
    "max_tokens": 256
  }'

The response follows the standard OpenAI format with a choices array containing the model's answer.
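
For scripted use, the same request can be sent from Python with only the standard library. This is a sketch rather than an official client: the helper names are hypothetical, and SAMPLE_RESPONSE is a trimmed, made-up payload that only illustrates where the answer sits in the choices array.

```python
import json
import urllib.request


def build_chat_request(model, user_text, max_tokens=256):
    """Build the same JSON body the curl command sends."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
    }


def extract_answer(response):
    """Return the assistant text from an OpenAI-style chat completion."""
    return response["choices"][0]["message"]["content"]


def chat(base_url, payload, timeout=120.0):
    """POST payload to /v1/chat/completions and return the parsed JSON."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Hypothetical, trimmed response used only to illustrate the shape.
SAMPLE_RESPONSE = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Qubits can hold 0 and 1 at once."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 16, "completion_tokens": 9, "total_tokens": 25},
}

print(extract_answer(SAMPLE_RESPONSE))
```

Against the running server, the call would look like extract_answer(chat("http://localhost:30000", build_chat_request(os.environ["MODEL_HANDLE"], "Explain quantum computing in simple terms."))), with os imported.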

Step 7
Multi-turn conversation with prefix caching

SGLang's RadixAttention automatically keeps the KV cache entries for tokens it has already processed. When follow-up messages share the same conversation prefix, the cached entries are reused, skipping prefill for the tokens in that shared prefix.

Send a multi-turn conversation. The system prompt is deliberately long so the shared prefix exceeds SGLang's page size (64 tokens), which is the minimum unit for cache reuse:

# Turn 1
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$MODEL_HANDLE"'",
    "messages": [
      {"role": "system", "content": "You are an expert physics tutor who explains concepts clearly and concisely. You use real-world analogies and everyday examples to make abstract ideas concrete. When answering, first state the key concept in one sentence, then give a short explanation with an example."},
      {"role": "user", "content": "What is the difference between speed and velocity?"}
    ],
    "max_tokens": 256
  }' | python3 -m json.tool

# Turn 2 — extends the same conversation
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$MODEL_HANDLE"'",
    "messages": [
      {"role": "system", "content": "You are an expert physics tutor who explains concepts clearly and concisely. You use real-world analogies and everyday examples to make abstract ideas concrete. When answering, first state the key concept in one sentence, then give a short explanation with an example."},
      {"role": "user", "content": "What is the difference between speed and velocity?"},
      {"role": "assistant", "content": "Speed is a scalar quantity that measures how fast an object moves, while velocity is a vector quantity that includes both speed and direction. For example, a car driving at 60 km/h has a speed of 60 km/h regardless of where it is headed. But if that car is driving 60 km/h north, that is its velocity — change direction to south and the velocity changes even though the speed stays the same."},
      {"role": "user", "content": "Can you give me another example that shows why the distinction matters in real physics problems?"}
    ],
    "max_tokens": 256
  }' | python3 -m json.tool

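The second request shares its prefix with the first, so RadixAttention reuses the cached KV entries for that prefix instead of repeating prefill. The system message and first user turn are part of the reusable prefix. The assistant message is reused only if it tokenizes identically to what the server actually generated in Turn 1; the hard-coded reply in this example will generally differ from the model's own output, so expect the cache hit to cover roughly the system message and first user turn. End-to-end HTTP latency can still go up on later turns: the transcript is longer (more tokens to attend to even with cache hits on the prefix), each assistant reply adds decode work, and the client measures full request time, not prefill alone.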
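
The way a follow-up request extends the earlier one can be sketched in a few lines. The helpers below are hypothetical and compare whole messages, which is only a proxy: the server matches prefixes token by token after applying the chat template, so the real reusable span may be shorter or longer than the message count suggests.

```python
def extend_conversation(messages, assistant_reply, next_user_message):
    """Return a new list: old history + assistant reply + new user turn."""
    return messages + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": next_user_message},
    ]


def shared_prefix_len(a, b):
    """Count leading messages that are identical in both lists."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


turn1 = [
    {"role": "system", "content": "You are an expert physics tutor."},
    {"role": "user", "content": "What is the difference between speed and velocity?"},
]
# In a real client, assistant_reply would be the text returned for turn 1.
turn2 = extend_conversation(turn1, "Speed is a scalar...", "Can you give another example?")

print(shared_prefix_len(turn1, turn2))  # leading messages common to both requests
```

Appending the server's own reply, rather than a rewritten one, keeps the history identical to what the server has already processed and gives the prefix cache the best chance of covering the assistant turn as well.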

Check cache reuse in the server logs. SGLang logs each prefill batch with the number of cached tokens reused:

docker logs sglang-server 2>&1 | grep "cached-token" | tail -10

Look for #cached-token values greater than 0 on later turns — this confirms RadixAttention is reusing the KV cache from the shared prefix. Treat that as the primary signal of prefix caching; wall-clock curl latency alone can be misleading.
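
The same check can be automated by extracting the #cached-token counters from the log text. This is a sketch under assumptions: the lines in SAMPLE_LOG are illustrative and the exact layout of SGLang's prefill log line may differ between versions; only the "#cached-token: N" fragment is relied on. The function names are hypothetical.

```python
import re

# Matches the counter regardless of what else is on the log line.
CACHED_TOKEN_RE = re.compile(r"#cached-token:\s*(\d+)")

# Illustrative log excerpt; the surrounding fields are an assumption.
SAMPLE_LOG = """\
Prefill batch. #new-seq: 1, #new-token: 88, #cached-token: 0
Decode batch. #running-req: 1, #token: 121
Prefill batch. #new-seq: 1, #new-token: 97, #cached-token: 64
"""


def cached_token_counts(log_text):
    """Return every #cached-token value in the order it was logged."""
    return [int(m.group(1)) for m in CACHED_TOKEN_RE.finditer(log_text)]


def prefix_cache_hit(log_text):
    """True if at least one prefill batch reused cached tokens."""
    return any(count > 0 for count in cached_token_counts(log_text))


print(cached_token_counts(SAMPLE_LOG), prefix_cache_hit(SAMPLE_LOG))
```

To run it on real logs, feed it the output of docker logs sglang-server 2>&1, for example by reading sys.stdin in a small wrapper script.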

NOTE

This prefix-cache check does not apply to hybrid mamba/SSM models such as Qwen/Qwen3.6-35B-A3B (and any other Qwen3.6 mamba variant). SGLang serves mamba-bearing models with mamba_scheduler_strategy: no_buffer, which does not carry SSM state across requests. Cross-request prefix reuse is therefore skipped, and #cached-token / cached_tokens stay 0 on every turn even though the server still reports disable_radix_cache: false. This is expected for these architectures, not a misconfiguration, and passing --disable-radix-cache does not change the cached-token counts. To validate prefix caching, use a standard-attention model such as the default Qwen/Qwen3-8B. On a mamba model you may still optionally pass --disable-radix-cache to sglang serve for higher throughput and concurrency.

Step 8
Structured JSON output

SGLang's constrained decoding guarantees valid JSON output matching a schema. This uses the xGrammar backend to overlap grammar mask generation with the model's forward pass, adding minimal latency.

Generate a structured response:

curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$MODEL_HANDLE"'",
    "messages": [
      {"role": "user", "content": "List three programming languages with their primary use case and year created."}
    ],
    "max_tokens": 512,
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "languages",
        "schema": {
          "type": "object",
          "properties": {
            "languages": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "name": {"type": "string"},
                  "primary_use": {"type": "string"},
                  "year_created": {"type": "integer"}
                },
                "required": ["name", "primary_use", "year_created"]
              }
            }
          },
          "required": ["languages"]
        }
      }
    }
  }' | python3 -m json.tool

As long as generation is not cut off by max_tokens, the response content is valid JSON matching the provided schema. Parse the choices[0].message.content field to obtain the JSON object.
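
On the client side it is still worth parsing the content defensively, since a reply truncated by max_tokens would not be complete JSON. The following sketch checks the fields required by the schema above using only the standard library; the function name check_languages and the sample content string are assumptions for illustration, not output captured from a server.

```python
import json

# Required fields and their Python types, mirroring the schema in the request.
REQUIRED_FIELDS = {"name": str, "primary_use": str, "year_created": int}

# Hypothetical content string, shaped like choices[0].message.content.
SAMPLE_CONTENT = (
    '{"languages": [{"name": "Python", '
    '"primary_use": "General-purpose scripting", "year_created": 1991}]}'
)


def check_languages(content):
    """Return a list of problems; an empty list means the payload looks right."""
    try:
        data = json.loads(content)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    languages = data.get("languages") if isinstance(data, dict) else None
    if not isinstance(languages, list):
        return ["top-level 'languages' array is missing"]
    problems = []
    for i, entry in enumerate(languages):
        if not isinstance(entry, dict):
            problems.append(f"languages[{i}] is not an object")
            continue
        for field, expected in REQUIRED_FIELDS.items():
            if not isinstance(entry.get(field), expected):
                problems.append(
                    f"languages[{i}].{field} missing or not {expected.__name__}"
                )
    return problems


print(check_languages(SAMPLE_CONTENT))
```

For full JSON Schema validation a dedicated library would be the better tool; the hand-written check here only covers the three required fields.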

Step 9
Benchmark multi-turn throughput

This step uses benchmark_multiturn.py from this playbook's assets/ directory. Clone (or download) the playbook repository first so the script is available locally:

git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-spark-playbooks/nvidia/station-sglang-inference

TIP

If git is not available, download the repository as a ZIP from the playbook repository and extract it. All commands below assume your working directory is the playbook root (the nvidia/station-sglang-inference/ directory inside the repository), so assets/benchmark_multiturn.py resolves correctly.

The benchmark stress-tests the server with parallel conversations (default: 20) and reports per-turn wall time, token counts, and (when the API exposes it) cached prefill tokens.
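
To make the idea of parallel multi-turn load concrete, here is a minimal sketch of how such a driver could be structured. It is not the code of benchmark_multiturn.py: the function names are hypothetical, and fake_send stands in for the HTTP call so the example runs without a server.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_conversation(conv_id, turns, send):
    """Run one conversation; return per-turn wall times in seconds."""
    messages = []
    wall_times = []
    for turn in range(turns):
        messages.append(
            {"role": "user", "content": f"Conversation {conv_id}, question {turn + 1}"}
        )
        start = time.perf_counter()
        reply = send(messages)  # send(messages) -> assistant text
        wall_times.append(time.perf_counter() - start)
        # Keep the reply in the history so the next turn extends the same prefix.
        messages.append({"role": "assistant", "content": reply})
    return wall_times


def run_parallel(num_conversations, turns, send):
    """Run all conversations concurrently, one worker thread each."""
    with ThreadPoolExecutor(max_workers=num_conversations) as pool:
        futures = [
            pool.submit(run_conversation, i, turns, send)
            for i in range(num_conversations)
        ]
        return [f.result() for f in futures]


def fake_send(messages):
    """Stand-in for an HTTP request to /v1/chat/completions."""
    return f"reply to message {len(messages)}"


results = run_parallel(num_conversations=3, turns=2, send=fake_send)
print(len(results), [len(r) for r in results])
```

Threads are sufficient here because each worker spends its time waiting on the server; replacing fake_send with a real HTTP call turns the sketch into a small load generator.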

Install the requests dependency. The virtualenv approach below is the preferred, default installation path — it keeps the script's dependencies isolated from the system Python interpreter so you cannot accidentally damage Ubuntu's own Python packages. Ubuntu 24.04 on DGX Station does not ship python3-venv by default, so install it once before creating the virtualenv:

sudo apt update && sudo apt install -y python3-venv
python3 -m venv .venv && source .venv/bin/activate
pip install requests

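If you cannot run sudo apt install python3-venv (for example, on a locked-down host), the next safest option is a per-user install that still respects PEP 668. Because it respects that guard, pip may refuse the command with an externally-managed-environment error when the interpreter is marked as externally managed: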

python3 -m pip install --user requests

CAUTION

Last-resort only — --break-system-packages can damage your system Python. Ubuntu 24.04 ships an "externally managed" system Python (PEP 668). The --break-system-packages flag tells pip to ignore that guard and install into the system or per-user site-packages anyway. This can shadow or conflict with packages installed by apt and break system tooling that depends on them. Only use this command when both the venv and plain --user paths above are unavailable, and only if you are willing to take that risk on the host you are running on:

python3 -m pip install --user --break-system-packages requests

Run the benchmark:

python3 assets/benchmark_multiturn.py \
  --base-url http://localhost:30000 \
  --model "$MODEL_HANDLE" \
  --num-conversations 20 \
  --turns-per-conversation 5 \
  --cache-detail-file ./sglang_benchmark_cache_details.log

The script prints:

Median / P90 wall time per turn — often increases as prompts grow and under parallel load; that does not contradict RadixAttention.
Median prompt tokens per turn — should climb as history lengthens.
Median cached prefill tokens (when returned in usage) — populated by --enable-cache-report (already set in Step 5); this is the primary cache signal from the OpenAI-style usage payload.
A short summary of cache-related /server_info or /metrics lines; the full responses are written to --cache-detail-file (default ./sglang_benchmark_cache_details.log) so you are not flooded with an unparsed metrics blob in the terminal.
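
The per-turn figures listed above can be reproduced from raw measurements with a few lines of Python. This is a sketch of one way to compute them, not the script's actual implementation: the record layout, the nearest-rank P90 definition, and the sample numbers are all assumptions.

```python
import math
import statistics


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cached_tokens(usage):
    """Read usage.prompt_tokens_details.cached_tokens, or None if absent."""
    details = usage.get("prompt_tokens_details") or {}
    return details.get("cached_tokens")


def summarize_turn(records):
    """records: list of {"wall_s": float, "usage": dict} for one turn index."""
    wall = [r["wall_s"] for r in records]
    cached = [c for c in (cached_tokens(r["usage"]) for r in records) if c is not None]
    return {
        "median_wall_s": statistics.median(wall),
        "p90_wall_s": percentile_nearest_rank(wall, 90),
        "median_prompt_tokens": statistics.median(
            r["usage"]["prompt_tokens"] for r in records
        ),
        "median_cached_tokens": statistics.median(cached) if cached else None,
    }


# Made-up measurements for one turn across three conversations.
SAMPLE_RECORDS = [
    {"wall_s": 1.0, "usage": {"prompt_tokens": 100, "prompt_tokens_details": {"cached_tokens": 64}}},
    {"wall_s": 2.0, "usage": {"prompt_tokens": 120, "prompt_tokens_details": {"cached_tokens": 64}}},
    {"wall_s": 4.0, "usage": {"prompt_tokens": 110, "prompt_tokens_details": None}},
]

summary = summarize_turn(SAMPLE_RECORDS)
print(summary)
```

Records without cached-token data are skipped instead of being counted as zero, so a server started without --enable-cache-report yields None for that metric.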

NOTE

The Step 5 launch enables --enable-cache-report (which fills usage.prompt_tokens_details.cached_tokens) but does not enable the Prometheus /metrics endpoint, since cached-prefill data is already exposed through usage and the docker logs #cached-token lines. If /metrics returns 404/empty in the detail log, that is expected — the benchmark's primary cache signals (usage.prompt_tokens_details.cached_tokens and Docker logs) still work. To populate /metrics as well, add --enable-metrics to the sglang serve invocation in Step 5 and restart the container.

To isolate prefix-cache behavior from multi-client contention, rerun with a single conversation:

python3 assets/benchmark_multiturn.py \
  --base-url http://localhost:30000 \
  --model "$MODEL_HANDLE" \
  --num-conversations 1 \
  --turns-per-conversation 5

Always correlate behavior with docker logs (#cached-token lines) as in Step 7.

Next steps: heavier models on Station

To stress GPU memory and throughput after completing the steps above, point MODEL_HANDLE at a larger checkpoint (for example deepseek-ai/DeepSeek-V4-Flash), lower --mem-fraction-static if you hit OOM, and reduce --context-length until the server starts cleanly. Confirm your SGLang image version supports the architecture (see SGLang documentation) and accept any gated model licenses on Hugging Face before pulling weights.

Step 10
Cleanup

Stop and remove the container:

docker stop sglang-server
docker rm sglang-server

Optionally remove the image:

docker rmi lmsysorg/sglang:latest-cu130
