Serve LLMs with SGLang on DGX Station for prefix-cached multi-turn and structured output inference
| Symptom | Cause | Fix |
|---|---|---|
| "permission denied" when running docker | User not in docker group | Run sudo usermod -aG docker $USER && newgrp docker |
| Container fails to start with GPU error | NVIDIA Container Toolkit not configured | Run nvidia-ctk runtime configure --runtime=docker and restart Docker |
| device >= 0 && device < num_gpus INTERNAL ASSERT FAILED | Using --gpus all on a multi-GPU system | Use --gpus '"device=N"' to target the GB300 specifically (check index with nvidia-smi) |
| "Token is required" or 401 error | Missing HuggingFace token | Ensure HF_TOKEN is exported before running the docker command |
| Server exits with OOM error | Model too large for available GPU memory | Lower --mem-fraction-static (e.g., 0.7) or reduce --context-length. Check GPU memory with nvidia-smi |
| json_schema response_format returns error | SGLang version too old | Ensure you are using lmsysorg/sglang:latest-cu130. Older versions may not support json_schema format |
| Server starts but CUDA errors on inference | Wrong CUDA version for Blackwell | Use the latest-cu130 image tag. SM103 requires CUDA 13.0+ |
| Model runs on wrong GPU | Default GPU selection | Use --gpus '"device=N"' to select the GB300 specifically |
| Slow first request after server start | Kernel JIT compilation | The first request triggers kernel compilation, so allow about 30 seconds for it to finish. Subsequent requests are fast |
| Connection refused on port 30000 | Server still loading model | Check docker logs sglang-server and wait for the Uvicorn startup message before retrying |
| /server_info shows no cache stats | Endpoint may differ across versions | Try curl http://localhost:30000/v1/models to verify the server is responsive. Cache metrics may be under /metrics (requires --enable-metrics server flag) |
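
For the "Connection refused" and slow-start rows above, it can help to poll the server instead of retrying by hand. The following is a minimal sketch, assuming the server listens on http://localhost:30000 and answers /v1/models once the model has loaded, as the table suggests. The helper names (server_is_up, wait_until_ready) and the retry counts are illustrative choices, not part of SGLang.

```python
import time
import urllib.request


def server_is_up(url="http://localhost:30000/v1/models", timeout=2.0):
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, connection refused, and socket timeouts
        # are all OSError subclasses; treat them as "not ready yet".
        return False


def wait_until_ready(probe=server_is_up, attempts=60, delay=5.0, sleep=time.sleep):
    """Call probe() up to `attempts` times, pausing `delay` seconds between tries.

    Returns the 1-based attempt number that succeeded, or None if every
    attempt failed. `probe` and `sleep` are injectable so the loop can be
    exercised without a running server.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None
```

Calling wait_until_ready() with the defaults waits up to roughly five minutes. If it returns None, docker logs sglang-server is the next place to look.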
NOTE
On DGX Station, the GB300 is typically device 1 (device 0 is the RTX Pro 6000 workstation GPU). Always verify with nvidia-smi --query-gpu=index,name --format=csv,noheader.
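
The index check in the note above can be scripted rather than read off by eye. This is a rough sketch that parses the output of the nvidia-smi query shown in the note and builds the matching --gpus argument. The function names are hypothetical, and the device names in the usage example are placeholders: the exact strings nvidia-smi reports on a given DGX Station may differ, so match on a fragment you have confirmed locally.

```python
import subprocess


def parse_gpu_list(csv_text):
    """Parse 'index, name' lines into a list of (index, name) tuples."""
    gpus = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        index, name = line.split(",", 1)
        gpus.append((int(index), name.strip()))
    return gpus


def find_gpu_index(csv_text, name_fragment):
    """Return the index of the first GPU whose name contains name_fragment, else None."""
    for index, name in parse_gpu_list(csv_text):
        if name_fragment.lower() in name.lower():
            return index
    return None


def query_gpus():
    """Run the nvidia-smi query from the note and return its stdout."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def docker_gpus_arg(index):
    """Format the docker flag in the quoting style used in the table."""
    return f"--gpus '\"device={index}\"'"
```

For example, with placeholder output such as "0, NVIDIA RTX PRO 6000" and "1, NVIDIA GB300", find_gpu_index(text, "GB300") gives 1, and docker_gpus_arg(1) produces the flag to pass to docker run. On the real machine, find_gpu_index(query_gpus(), "GB300") performs the same lookup against live output.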