Basic Idea
SGLang is a fast serving framework for large language models and vision language models that makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language. This setup uses the optimized SGLang CUDA container on a single NVIDIA Spark device with Blackwell architecture, providing GPU-accelerated inference with all dependencies pre-installed.
What you'll accomplish
You'll deploy SGLang in both server and offline inference modes on your NVIDIA Spark device, enabling high-performance LLM serving with support for text generation, chat completion, and vision-language tasks using models like DeepSeek-V2-Lite.
What to know before starting
- Working in a terminal environment on Linux systems
- Basic understanding of Docker containers and container management
- Familiarity with NVIDIA GPU drivers and CUDA toolkit concepts
- Experience with HTTP API endpoints and JSON request/response handling
Prerequisites
- NVIDIA Spark device with Blackwell architecture
- Docker Engine installed and running:
docker --version - NVIDIA GPU drivers installed:
nvidia-smi - NVIDIA Container Toolkit configured:
docker run --rm --gpus all lmsysorg/sglang@sha256:ceaf8b16e02d165143633ac228bbb994a05fe77d7e0526cf035ae4bbf4eacc36 nvidia-smi - Sufficient disk space (>20GB available):
df -h - Network connectivity for pulling containers:
docker pull lmsysorg/sglang@sha256:ceaf8b16e02d165143633ac228bbb994a05fe77d7e0526cf035ae4bbf4eacc36
Ancillary files
- An offline inference python script found here on GitHub
Model Support Matrix
The following models are supported with SGLang on Spark. All listed models are available and ready to use:
| Model | Quantization | Support Status | HF Handle |
|---|---|---|---|
| Nemotron-3-Nano-Omni-30B-A3B-Reasoning | BF16 | ✅ | nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 |
| GPT-OSS-20B | MXFP4 | ✅ | openai/gpt-oss-20b |
| GPT-OSS-120B | MXFP4 | ✅ | openai/gpt-oss-120b |
| Llama-3.1-8B-Instruct | FP8 | ✅ | nvidia/Llama-3.1-8B-Instruct-FP8 |
| Llama-3.1-8B-Instruct | NVFP4 | ✅ | nvidia/Llama-3.1-8B-Instruct-FP4 |
| Llama-3.3-70B-Instruct | NVFP4 | ✅ | nvidia/Llama-3.3-70B-Instruct-FP4 |
| Qwen3-8B | FP8 | ✅ | nvidia/Qwen3-8B-FP8 |
| Qwen3-8B | NVFP4 | ✅ | nvidia/Qwen3-8B-FP4 |
| Qwen3-14B | FP8 | ✅ | nvidia/Qwen3-14B-FP8 |
| Qwen3-14B | NVFP4 | ✅ | nvidia/Qwen3-14B-FP4 |
| Qwen3-32B | NVFP4 | ✅ | nvidia/Qwen3-32B-FP4 |
| Phi-4-multimodal-instruct | FP8 | ✅ | nvidia/Phi-4-multimodal-instruct-FP8 |
| Phi-4-multimodal-instruct | NVFP4 | ✅ | nvidia/Phi-4-multimodal-instruct-FP4 |
| Phi-4-reasoning-plus | FP8 | ✅ | nvidia/Phi-4-reasoning-plus-FP8 |
| Phi-4-reasoning-plus | NVFP4 | ✅ | nvidia/Phi-4-reasoning-plus-FP4 |
Note: for NVFP4 models, add the --quantization modelopt_fp4 flag.
Time & risk
- Estimated time: 30 minutes for initial setup and validation
- Risk level: Low - Uses pre-built, validated SGLang container with minimal configuration
- Rollback: Stop and remove containers with
docker stopanddocker rmcommands - Last Updated: 04/28/2026
- Introduce Nemotron-3-Nano-Omni reasoning FP8 support