SGLang Inference Server

30 MIN

Install and use SGLang on DGX Spark

Basic Idea

SGLang is a fast serving framework for large language models and vision-language models. By co-designing the backend runtime and the frontend language, it makes interaction with models faster and more controllable. This setup uses the optimized NVIDIA SGLang NGC Container on a single NVIDIA DGX Spark device (Blackwell architecture), providing GPU-accelerated inference with all dependencies pre-installed.
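
As a minimal sketch of server mode, the container from the prerequisites below can launch SGLang's HTTP server. The model, container name, and cache mount here are illustrative assumptions; 30000 is SGLang's default port:

```bash
# Minimal sketch: launch the SGLang server in the container.
# Model, container name, and cache mount are illustrative assumptions.
docker run --gpus all -d --name sglang-server \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  lmsysorg/sglang:spark \
  python3 -m sglang.launch_server \
    --model-path openai/gpt-oss-20b \
    --host 0.0.0.0 \
    --port 30000
```

Once the server reports readiness, it exposes an OpenAI-compatible API under /v1 as well as SGLang's native endpoints on the mapped port.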

What you'll accomplish

You'll deploy SGLang in both server and offline inference modes on your NVIDIA Spark device, enabling high-performance LLM serving with support for text generation, chat completion, and vision-language tasks using models like DeepSeek-V2-Lite.
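
For example, once the server from the sketch above is running, a chat completion can be requested through the OpenAI-compatible endpoint. The model name must match what the server is serving; the prompt and token limit are illustrative:

```bash
# Query the OpenAI-compatible chat completions endpoint (illustrative prompt).
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "openai/gpt-oss-20b",
        "messages": [{"role": "user", "content": "Summarize what SGLang does in one sentence."}],
        "max_tokens": 100
      }'
```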

What to know before starting

  • Working in a terminal environment on Linux systems
  • Basic understanding of Docker containers and container management
  • Familiarity with NVIDIA GPU drivers and CUDA toolkit concepts
  • Experience with HTTP API endpoints and JSON request/response handling (see the example after this list)
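
As a reference for the JSON request/response pattern, SGLang also exposes a native /generate endpoint alongside the OpenAI-compatible one. A minimal sketch, with an illustrative prompt and sampling parameters:

```bash
# Call SGLang's native /generate endpoint with raw text and sampling params.
curl -s http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
        "text": "The DGX Spark is",
        "sampling_params": {"temperature": 0.2, "max_new_tokens": 32}
      }'
```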

Prerequisites

  • NVIDIA Spark device with Blackwell architecture
  • Docker Engine installed and running: docker --version
  • NVIDIA GPU drivers installed: nvidia-smi
  • NVIDIA Container Toolkit configured: docker run --rm --gpus all lmsysorg/sglang:spark nvidia-smi
  • Sufficient disk space (>20GB available): df -h
  • Network connectivity for pulling NGC containers: ping nvcr.io
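
The container image referenced throughout this guide can be pulled ahead of time so the first launch does not block on the download:

```bash
# Pre-pull the SGLang container image used in the examples below.
docker pull lmsysorg/sglang:spark
```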

Ancillary files

Model Support Matrix

The following models are supported by SGLang on DGX Spark. All listed models are available and ready to use:

| Model | Quantization | Support Status | HF Handle |
|---|---|---|---|
| GPT-OSS-20B | MXFP4 | Available | openai/gpt-oss-20b |
| GPT-OSS-120B | MXFP4 | Available | openai/gpt-oss-120b |
| Llama-3.1-8B-Instruct | FP8 | Available | nvidia/Llama-3.1-8B-Instruct-FP8 |
| Llama-3.1-8B-Instruct | NVFP4 | Available | nvidia/Llama-3.1-8B-Instruct-FP4 |
| Llama-3.3-70B-Instruct | NVFP4 | Available | nvidia/Llama-3.3-70B-Instruct-FP4 |
| Qwen3-8B | FP8 | Available | nvidia/Qwen3-8B-FP8 |
| Qwen3-8B | NVFP4 | Available | nvidia/Qwen3-8B-FP4 |
| Qwen3-14B | FP8 | Available | nvidia/Qwen3-14B-FP8 |
| Qwen3-14B | NVFP4 | Available | nvidia/Qwen3-14B-FP4 |
| Qwen3-32B | NVFP4 | Available | nvidia/Qwen3-32B-FP4 |
| Phi-4-multimodal-instruct | FP8 | Available | nvidia/Phi-4-multimodal-instruct-FP8 |
| Phi-4-multimodal-instruct | NVFP4 | Available | nvidia/Phi-4-multimodal-instruct-FP4 |
| Phi-4-reasoning-plus | FP8 | Available | nvidia/Phi-4-reasoning-plus-FP8 |
| Phi-4-reasoning-plus | NVFP4 | Available | nvidia/Phi-4-reasoning-plus-FP4 |
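
For offline (non-server) inference, SGLang's Python engine can be driven directly inside the container. A minimal sketch, assuming the offline Engine API available in recent SGLang releases; the model, prompt, and sampling parameters are illustrative:

```bash
# One-shot offline generation with sglang.Engine (illustrative model/prompt).
docker run --gpus all --rm \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  lmsysorg/sglang:spark \
  python3 -c "
import sglang as sgl

# Load one of the models from the support matrix above.
engine = sgl.Engine(model_path='openai/gpt-oss-20b')
out = engine.generate('The capital of France is',
                      {'temperature': 0.2, 'max_new_tokens': 32})
print(out['text'])
engine.shutdown()
"
```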

Time & risk

  • Estimated time: 30 minutes for initial setup and validation
  • Risk level: Low. Uses a pre-built, validated SGLang container with minimal configuration
  • Rollback: Stop and remove containers with docker stop and docker rm (see the cleanup sketch after this list)
  • Last Updated: 01/02/2026
    • Added Model Support Matrix
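
To roll back, stop and remove the serving container. The name sglang-server below assumes the launch sketch earlier in this guide:

```bash
# Stop and remove the SGLang container started earlier (name is an assumption).
docker stop sglang-server
docker rm sglang-server
```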