Serve LLMs with SGLang on DGX Station (Qwen3-8B default; Qwen3.6 MoE optional): prefix-cached multi-turn chat, structured output, benchmarks, and inference-server guidance
| Symptom | Cause | Fix |
|---|---|---|
| "permission denied" when running docker | User not in docker group | Run `sudo usermod -aG docker $USER && newgrp docker` |
| Container fails to start with GPU error | NVIDIA Container Toolkit not configured | Run `nvidia-ctk runtime configure --runtime=docker` and restart Docker |
| `device >= 0 && device < num_gpus INTERNAL ASSERT FAILED` | `--gpus '"device=N"'` index does not exist on this Station | Re-run `nvidia-smi --query-gpu=index,name --format=csv,noheader` and use the actual GB300 index, or `--gpus all` if there is only one GPU |
| `RuntimeError: ... buildNdTmaDescriptor ... Check failed: false` during CUDA-graph capture | Default `trtllm_mha` attention backend is incompatible with Blackwell SM103 | Pass `--attention-backend flashinfer` to `sglang serve` |
| `AssertionError: FlashAttention v3 Backend requires SM>=80 and SM<=90` | `--attention-backend fa3` selected on Blackwell (SM103) | Use `--attention-backend flashinfer` instead |
| `User lacks permission to set NUMA affinity ... try adding --cap-add SYS_NICE` warning | Docker dropped the `SYS_NICE` capability | Add `--cap-add SYS_NICE` to the `docker run` command |
| `python3 -m venv .venv` fails with `apt install python3.12-venv` hint | Ubuntu 24.04 ships without `python3-venv` | Run `sudo apt update && sudo apt install -y python3-venv` (or use `python3 -m pip install --user --break-system-packages requests`) |
| "Token is required" or 401 error | Missing HuggingFace token for a gated model | Export `HF_TOKEN` before running the docker command and accept the model license on huggingface.co |
| Server exits with OOM error | Model too large for available GPU memory | Lower `--mem-fraction-static` (e.g., 0.7) or reduce `--context-length`. Check GPU memory with `nvidia-smi` |
| `json_schema` `response_format` returns error | SGLang version too old | Ensure you are using `lmsysorg/sglang:latest-cu130`. Older versions may not support the `json_schema` format |
| Server starts but CUDA errors on inference | Wrong CUDA version for Blackwell | Use the `latest-cu130` image tag. SM103 requires CUDA 13.0+ |
| Slow first request after server start | Kernel JIT + CUDA-graph capture | First launch can take 10–15 min for `Qwen/Qwen3-8B` and 30–45 min for `Qwen/Qwen3.6-35B-A3B` before the server prints "fired up and ready to roll!". Subsequent requests are fast. |
| Connection refused on port 30000 | Server still loading model or capturing CUDA graphs | Check `docker logs sglang-server` and wait for the Uvicorn startup message and "The server is fired up and ready to roll!" |
| Med cached prefill column is n/a in the benchmark | OpenAI-style `cached_tokens` not enabled on the server | Add `--enable-cache-report` to `sglang serve` so `usage.prompt_tokens_details.cached_tokens` is populated |
| `/server_info` body floods the benchmark "cache highlights" output | Older `benchmark_multiturn.py` matched any line containing "cache", including the single-line `/server_info` JSON | Use the version of `benchmark_multiturn.py` shipped with this playbook (it skips JSON blobs and lines longer than 200 chars); the full body is still saved to `--cache-detail-file` |
| Benchmark shows higher median latency on later turns | Expected under parallel load + longer transcripts | RadixAttention reduces repeated prefill on shared prefixes. To confirm cache hits, look for `#cached-token` in the `docker logs` output, and optionally rerun with `--num-conversations 1`. See Step 9 and `sglang_benchmark_cache_details.log` |
| `deepseek-ai/DeepSeek-V4-*` fails to load | Unsupported in this SGLang build or insufficient VRAM | Check SGLang docs for model support; try `DeepSeek-V4-Flash` before Pro; lower `--mem-fraction-static` and `--context-length` |
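
Because "Connection refused on port 30000" is expected while the server loads the model and captures CUDA graphs, a client script can poll until the server answers instead of failing on the first attempt. The following is a minimal sketch, not part of the playbook: `http_probe` and `wait_until_ready` are hypothetical helper names, the `/v1/models` path is assumed to be served by the OpenAI-compatible API, and the 45-minute default timeout simply mirrors the upper bound quoted in the table.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200, else False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeouts, and HTTP errors all mean "not ready yet".
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 45 * 60,  # assumption: upper bound for first launch
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` every `interval_s` seconds until it succeeds or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

A caller might use it as `wait_until_ready(lambda: http_probe("http://localhost:30000/v1/models"))` before sending the first request. Passing the probe as a callable keeps the retry loop independent of the endpoint, so a different readiness check can be swapped in if this path does not apply to your build.
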
NOTE
On DGX Station the GB300 may be at device 0 or 1 depending on configuration (some Stations also expose a workstation GPU at 0). Always verify with `nvidia-smi --query-gpu=index,name --format=csv,noheader` before launching the container.
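
The index check in the note above can be scripted. This is a rough sketch under stated assumptions: the helper names (`parse_gpu_list`, `find_gpu_indices`, `query_gpu_list`) are hypothetical, and it assumes the name reported by `nvidia-smi` contains the substring `GB300`, which you should confirm against your own output.

```python
import subprocess
from typing import List, Tuple


def parse_gpu_list(csv_text: str) -> List[Tuple[int, str]]:
    """Parse `index, name` lines as printed with --format=csv,noheader."""
    gpus = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        index, name = line.split(",", 1)  # split once; names may contain commas
        gpus.append((int(index), name.strip()))
    return gpus


def find_gpu_indices(csv_text: str, needle: str = "GB300") -> List[int]:
    """Return the indices of GPUs whose name contains `needle` (case-insensitive)."""
    return [i for i, name in parse_gpu_list(csv_text) if needle.lower() in name.lower()]


def query_gpu_list() -> str:
    """Run the nvidia-smi query from the note and return its raw output."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

On the Station, `find_gpu_indices(query_gpu_list())` returns the candidate indices; an empty list means no GPU name matched, in which case inspect the raw output before choosing a value for `--gpus '"device=N"'`.
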