LLM Inference with SGLang | DGX Station

30 MIN

Serve LLMs with SGLang on DGX Station (Qwen3-8B default; Qwen3.6 MoE optional)—prefix-cached multi-turn, structured output, benchmarks, and inference-server guidance

Common issues

Symptom	Cause	Fix
"permission denied" when running docker	User not in docker group	Run `sudo usermod -aG docker $USER && newgrp docker`
Container fails to start with GPU error	NVIDIA Container Toolkit not configured	Run `sudo nvidia-ctk runtime configure --runtime=docker`, then restart Docker with `sudo systemctl restart docker`
`device >= 0 && device < num_gpus INTERNAL ASSERT FAILED`	`--gpus '"device=N"'` index does not exist on this Station	Re-run `nvidia-smi --query-gpu=index,name --format=csv,noheader` and use the actual GB300 index, or `--gpus all` if there is only one GPU
`RuntimeError: ... buildNdTmaDescriptor ... Check failed: false` during CUDA-graph capture	Default `trtllm_mha` attention backend is incompatible with Blackwell SM103	Pass `--attention-backend flashinfer` to `sglang serve`
`AssertionError: FlashAttention v3 Backend requires SM>=80 and SM<=90`	`--attention-backend fa3` selected on Blackwell (SM103)	Use `--attention-backend flashinfer` instead
`User lacks permission to set NUMA affinity ... try adding --cap-add SYS_NICE` warning	Docker dropped the `SYS_NICE` capability	Add `--cap-add SYS_NICE` to the `docker run` command
`python3 -m venv .venv` fails with `apt install python3.12-venv` hint	Ubuntu 24.04 ships without `python3-venv`	Run `sudo apt update && sudo apt install -y python3-venv` (or use `python3 -m pip install --user --break-system-packages requests`)
"Token is required" or 401 error	Missing HuggingFace token for a gated model	Export `HF_TOKEN` before running the docker command and accept the model license on huggingface.co
Server exits with OOM error	Model too large for available GPU memory	Lower `--mem-fraction-static` (e.g., 0.7) or reduce `--context-length`. Check GPU memory with `nvidia-smi`
`json_schema` response_format returns error	SGLang version too old	Ensure you are using `lmsysorg/sglang:latest-cu130`. Older versions may not support `json_schema` format
Server starts but CUDA errors on inference	Wrong CUDA version for Blackwell	Use the `latest-cu130` image tag. SM103 requires CUDA 13.0+
Slow first request after server start	Kernel JIT + CUDA-graph capture	Expected on first launch: allow 10–15 min for `Qwen/Qwen3-8B` and 30–45 min for `Qwen/Qwen3.6-35B-A3B` before the server prints "fired up and ready to roll!". Wait for that message before sending requests; subsequent requests are fast.
Connection refused on port 30000	Server still loading model or capturing CUDA graphs	Check `docker logs sglang-server` — wait for the Uvicorn startup message and "The server is fired up and ready to roll!"
`Med cached prefill` column is `n/a` in the benchmark	OpenAI-style `cached_tokens` not enabled on the server	Add `--enable-cache-report` to `sglang serve` so `usage.prompt_tokens_details.cached_tokens` is populated
`/server_info` body floods the benchmark "cache highlights" output	Older `benchmark_multiturn.py` matched any line containing "cache" — including the single-line `/server_info` JSON	Use the version of `benchmark_multiturn.py` shipped with this playbook (it skips JSON blobs and lines longer than 200 chars); the full body is still saved to `--cache-detail-file`
Benchmark shows higher median latency on later turns	Expected under parallel load + longer transcripts	RadixAttention still reduces repeated prefill on shared prefixes. To confirm cache hits, look for `#cached-token` in `docker logs sglang-server`, and optionally rerun with `--num-conversations 1` to remove parallel load. See Step 9 and `sglang_benchmark_cache_details.log`
`deepseek-ai/DeepSeek-V4-*` fails to load	Unsupported in this SGLang build or insufficient VRAM	Check SGLang docs for model support; try `DeepSeek-V4-Flash` before Pro; lower `--mem-fraction-static` and `--context-length`
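
Several rows above ("Connection refused on port 30000", "Slow first request after server start") come down to waiting until the server has finished loading. Rather than watching `docker logs` by hand, a small poller can do the waiting. The following is a minimal sketch, not part of the playbook: it assumes your SGLang build exposes a `GET /health` endpoint that returns 200 once the server is ready, and the function name `wait_for_server` is hypothetical.

```python
import http.client
import time
import urllib.error
import urllib.request


def wait_for_server(base_url: str = "http://localhost:30000",
                    timeout_s: float = 3600.0,
                    interval_s: float = 10.0) -> bool:
    """Poll the server until it answers HTTP 200 or the timeout expires.

    Assumption: GET {base_url}/health returns 200 once the model is loaded.
    If your build differs, substitute any endpoint with that behavior.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, http.client.HTTPException, OSError):
            # Connection refused, reset, or a non-2xx status: the server is
            # probably still loading weights or capturing CUDA graphs.
            pass
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        time.sleep(min(interval_s, remaining))
```

Calling `wait_for_server()` with the defaults allows up to an hour, which covers the 30–45 min first launch quoted above for `Qwen/Qwen3.6-35B-A3B`. If it returns `False`, check `docker logs sglang-server` for the actual error.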

NOTE

On DGX Station the GB300 may be at device 0 or 1 depending on configuration (some Stations also expose a workstation GPU at 0). Always verify with `nvidia-smi --query-gpu=index,name --format=csv,noheader` before launching the container.
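
The check in the note can be scripted so the `--gpus` flag is derived from the actual device list instead of being typed by hand. This is an illustrative sketch: the helper names (`parse_gpu_list`, `find_gpu_index`, `docker_gpus_flag`, `query_gpus`) are hypothetical, and it assumes the GPU's reported name contains the string `GB300`, which you should confirm against your own `nvidia-smi` output.

```python
import subprocess


def parse_gpu_list(csv_text: str) -> list[tuple[int, str]]:
    """Parse 'index, name' rows as printed by --format=csv,noheader."""
    gpus = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        index, name = line.split(",", 1)
        gpus.append((int(index.strip()), name.strip()))
    return gpus


def find_gpu_index(gpus, needle="GB300"):
    """Return the index of the first GPU whose name contains needle, else None."""
    for index, name in gpus:
        if needle.lower() in name.lower():
            return index
    return None


def docker_gpus_flag(index: int) -> str:
    """Build the quoted --gpus argument in the form used in this playbook."""
    return f"--gpus '\"device={index}\"'"


def query_gpus() -> str:
    """Run the same nvidia-smi query as in the note and return its stdout."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

On the Station, `find_gpu_index(parse_gpu_list(query_gpus()))` gives the index to pass to `docker_gpus_flag`. A result of `None` means no device name matched, in which case inspect the raw output before launching the container.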

Resources