SGLang for Inference

Basic Idea

SGLang is a fast serving framework for large language models and vision-language models. It co-designs the backend runtime and the frontend language to make interaction with models faster and more controllable. This setup uses the optimized NVIDIA SGLang NGC Container on a single NVIDIA Spark device with Blackwell architecture, providing GPU-accelerated inference with all dependencies pre-installed.

What you'll accomplish

You'll deploy SGLang in both server and offline inference modes on your NVIDIA Spark device, enabling high-performance LLM serving with support for text generation, chat completion, and vision-language tasks using models like DeepSeek-V2-Lite.

What to know before starting

Working in a terminal environment on Linux systems
Basic understanding of Docker containers and container management
Familiarity with NVIDIA GPU drivers and CUDA toolkit concepts
Experience with HTTP API endpoints and JSON request/response handling

Prerequisites

NVIDIA Spark device with Blackwell architecture
Docker Engine installed and running: `docker --version`
NVIDIA GPU drivers installed: `nvidia-smi`
NVIDIA Container Toolkit configured: `docker run --rm --gpus all lmsysorg/sglang:spark nvidia-smi`
Sufficient disk space (>20GB available): `df -h`
Network connectivity for pulling NGC containers: `ping nvcr.io`
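
The checks above can be bundled into a single script. The following is a minimal sketch rather than an official tool: the function names `have_cmd`, `free_gb`, and `check_prereqs` are made up for illustration, and checking free space on `/` assumes Docker stores its images on the root filesystem, which may not hold on your system.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the named command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Hypothetical helper: prints free space, in whole GB, on the filesystem
# that holds the given path (df -Pk reports available space in KB).
free_gb() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Runs each prerequisite check in order and reports the first failure.
check_prereqs() {
    have_cmd docker     || { echo "docker not found"; return 1; }
    have_cmd nvidia-smi || { echo "nvidia-smi not found"; return 1; }
    docker run --rm --gpus all lmsysorg/sglang:spark nvidia-smi >/dev/null \
        || { echo "GPU not visible inside containers"; return 1; }
    # Assumption: Docker images are stored on the root filesystem.
    [ "$(free_gb /)" -gt 20 ] || { echo "need more than 20 GB free"; return 1; }
    ping -c 1 nvcr.io >/dev/null 2>&1 || { echo "cannot reach nvcr.io"; return 1; }
    echo "all prerequisites satisfied"
}

# Uncomment to run the checks:
# check_prereqs
```

Stopping at the first failure keeps the output short. If you would rather see every failing check at once, replace each `return 1` with a counter increment and return the total at the end.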

Ancillary files

An offline inference Python script, available on GitHub

Model Support Matrix

The following models are supported with SGLang on Spark. All listed models are available and ready to use:

Model	Quantization	Support Status	HF Handle
GPT-OSS-20B	MXFP4	✅	`openai/gpt-oss-20b`
GPT-OSS-120B	MXFP4	✅	`openai/gpt-oss-120b`
Llama-3.1-8B-Instruct	FP8	✅	`nvidia/Llama-3.1-8B-Instruct-FP8`
Llama-3.1-8B-Instruct	NVFP4	✅	`nvidia/Llama-3.1-8B-Instruct-FP4`
Llama-3.3-70B-Instruct	NVFP4	✅	`nvidia/Llama-3.3-70B-Instruct-FP4`
Qwen3-8B	FP8	✅	`nvidia/Qwen3-8B-FP8`
Qwen3-8B	NVFP4	✅	`nvidia/Qwen3-8B-FP4`
Qwen3-14B	FP8	✅	`nvidia/Qwen3-14B-FP8`
Qwen3-14B	NVFP4	✅	`nvidia/Qwen3-14B-FP4`
Qwen3-32B	NVFP4	✅	`nvidia/Qwen3-32B-FP4`
Phi-4-multimodal-instruct	FP8	✅	`nvidia/Phi-4-multimodal-instruct-FP8`
Phi-4-multimodal-instruct	NVFP4	✅	`nvidia/Phi-4-multimodal-instruct-FP4`
Phi-4-reasoning-plus	FP8	✅	`nvidia/Phi-4-reasoning-plus-FP8`
Phi-4-reasoning-plus	NVFP4	✅	`nvidia/Phi-4-reasoning-plus-FP4`

Note: For NVFP4 models, add the `--quantization modelopt_fp4` flag when launching SGLang.
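
As a rough illustration of the note above, the sketch below builds a server launch command from a model handle and appends the NVFP4 flag only when it is needed. The helper names `quant_flags` and `launch_cmd` are hypothetical. The sketch assumes that NVFP4 checkpoints can be recognized by the `-FP4` suffix used in the table, and the host and port values are illustrative choices, not requirements.

```shell
# Hypothetical helper: prints the extra flag for NVFP4 checkpoints, whose
# handles in the support matrix end in "-FP4"; prints nothing otherwise.
quant_flags() {
    case "$1" in
        *-FP4) echo "--quantization modelopt_fp4" ;;
        *)     : ;;
    esac
}

# Hypothetical helper: prints a launch command for the given model handle.
# Host 0.0.0.0 and port 30000 are assumptions; adjust them as needed.
launch_cmd() {
    extra=$(quant_flags "$1")
    echo "python3 -m sglang.launch_server --model-path $1 --host 0.0.0.0 --port 30000${extra:+ $extra}"
}

# Example: print the command for an NVFP4 model from the table.
# launch_cmd nvidia/Qwen3-8B-FP4
```

The functions only print the command, so you can inspect it before running it inside the SGLang container.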

Time & risk

Estimated time: 30 minutes for initial setup and validation
Risk level: Low - Uses pre-built, validated SGLang container with minimal configuration
Rollback: Stop and remove containers with the `docker stop` and `docker rm` commands
Last Updated: 01/02/2026 (added Model Support Matrix)
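
The rollback step listed above could look like the following sketch. The `rollback` function, the `DOCKER` override variable, and the container name `sglang-server` are all assumptions made for illustration; substitute the names reported by `docker ps -a` on your system.

```shell
# DOCKER can be overridden (for example, DOCKER=echo) to preview the
# commands without touching any containers.
DOCKER=${DOCKER:-docker}

# Hypothetical helper: stops, then removes, each named container.
rollback() {
    for name in "$@"; do
        "$DOCKER" stop "$name" && "$DOCKER" rm "$name"
    done
}

# Example (the container name "sglang-server" is an assumption):
# rollback sglang-server
```

A container started with `docker run --rm` is removed automatically when it stops, so in that case `docker stop` alone is enough and the `docker rm` step will report that the container no longer exists.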