SGLang Inference Server

30 MIN

Install and use SGLang on DGX Spark

View on GitHub

Basic Idea

SGLang is a fast serving framework for large language models and vision-language models. It makes interaction with models faster and more controllable by co-designing the backend runtime and the frontend language. This setup uses the optimized NVIDIA SGLang NGC container on a single NVIDIA DGX Spark device with the Blackwell architecture, providing GPU-accelerated inference with all dependencies pre-installed.
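As a concrete sketch of what serving looks like, the container can launch the SGLang server directly. The image tag below comes from the prerequisites in this playbook; the model, container name, and port (30000, SGLang's default) are illustrative assumptions rather than values fixed by this playbook:

```bash
# Launch the SGLang server from the Spark container (sketch).
# Model, container name, and port are illustrative assumptions.
docker run --gpus all -p 30000:30000 --name sglang \
  lmsysorg/sglang:spark \
  python3 -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V2-Lite \
    --trust-remote-code \
    --host 0.0.0.0 --port 30000
```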

What you'll accomplish

You'll deploy SGLang in both server and offline inference modes on your DGX Spark, enabling high-performance LLM serving with support for text generation, chat completion, and vision-language tasks using models such as DeepSeek-V2-Lite.
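Once the server is running, SGLang exposes an OpenAI-compatible HTTP API. A minimal chat-completion request might look like the following, assuming the server, model, and port from the launch sketch above:

```bash
# Query the OpenAI-compatible chat completions endpoint (sketch;
# assumes the server from the launch example is listening on :30000).
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-V2-Lite",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
```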

What to know before starting

  • Working in a terminal environment on Linux systems
  • Basic understanding of Docker containers and container management
  • Familiarity with NVIDIA GPU drivers and CUDA toolkit concepts
  • Experience with HTTP API endpoints and JSON request/response handling

Prerequisites

  • NVIDIA DGX Spark device with Blackwell architecture
  • Docker Engine installed and running: docker --version
  • NVIDIA GPU drivers installed: nvidia-smi
  • NVIDIA Container Toolkit configured: docker run --rm --gpus all lmsysorg/sglang:spark nvidia-smi
  • Sufficient disk space (>20GB available): df -h
  • Network connectivity for pulling NGC containers: ping nvcr.io
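These checks can also be chained into a single preflight pass. The snippet below is just the commands from the list above, with a bounded ping so it terminates:

```bash
# Preflight: run the prerequisite checks from the list above in one pass.
docker --version && \
nvidia-smi && \
docker run --rm --gpus all lmsysorg/sglang:spark nvidia-smi && \
df -h && \
ping -c 3 nvcr.io
```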

Ancillary files

  • An offline inference Python script, found here on GitHub (a run sketch follows below)
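One way to run the script is to mount it into the container and execute it with the container's Python. The filename below is a placeholder for the script linked above:

```bash
# Run the offline inference script inside the container (sketch;
# offline_inference.py is a placeholder name for the GitHub script).
docker run --gpus all --rm \
  -v "$(pwd)":/workspace \
  lmsysorg/sglang:spark \
  python3 /workspace/offline_inference.py
```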

Model Support Matrix

The following models are supported with SGLang on Spark. All listed models are available and ready to use:

| Model | Quantization | Support Status | HF Handle |
| --- | --- | --- | --- |
| GPT-OSS-20B | MXFP4 | ✅ | openai/gpt-oss-20b |
| GPT-OSS-120B | MXFP4 | ✅ | openai/gpt-oss-120b |
| Llama-3.1-8B-Instruct | FP8 | ✅ | nvidia/Llama-3.1-8B-Instruct-FP8 |
| Llama-3.1-8B-Instruct | NVFP4 | ✅ | nvidia/Llama-3.1-8B-Instruct-FP4 |
| Llama-3.3-70B-Instruct | NVFP4 | ✅ | nvidia/Llama-3.3-70B-Instruct-FP4 |
| Qwen3-8B | FP8 | ✅ | nvidia/Qwen3-8B-FP8 |
| Qwen3-8B | NVFP4 | ✅ | nvidia/Qwen3-8B-FP4 |
| Qwen3-14B | FP8 | ✅ | nvidia/Qwen3-14B-FP8 |
| Qwen3-14B | NVFP4 | ✅ | nvidia/Qwen3-14B-FP4 |
| Qwen3-32B | NVFP4 | ✅ | nvidia/Qwen3-32B-FP4 |
| Phi-4-multimodal-instruct | FP8 | ✅ | nvidia/Phi-4-multimodal-instruct-FP8 |
| Phi-4-multimodal-instruct | NVFP4 | ✅ | nvidia/Phi-4-multimodal-instruct-FP4 |
| Phi-4-reasoning-plus | FP8 | ✅ | nvidia/Phi-4-reasoning-plus-FP8 |
| Phi-4-reasoning-plus | NVFP4 | ✅ | nvidia/Phi-4-reasoning-plus-FP4 |
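Any row in the matrix can be served by passing its HF handle as the model path. As a sketch, here is the same launch command as above pointed at the FP8 Llama build; only the model changes:

```bash
# Serve a model from the support matrix by its HF handle (sketch).
docker run --gpus all -p 30000:30000 --rm \
  lmsysorg/sglang:spark \
  python3 -m sglang.launch_server \
    --model-path nvidia/Llama-3.1-8B-Instruct-FP8 \
    --host 0.0.0.0 --port 30000
```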

Time & risk

  • Estimated time: 30 minutes for initial setup and validation
  • Risk level: Low - Uses pre-built, validated SGLang container with minimal configuration
  • Rollback: Stop and remove containers with the docker stop and docker rm commands (see the sketch after this list)
  • Last Updated: 01/02/2026 (added Model Support Matrix)
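If the server was started with a named container as in the launch sketch above, rollback is a stop-and-remove pair (the name sglang is that sketch's assumption):

```bash
# Roll back: stop and remove the serving container (name is illustrative).
docker stop sglang && docker rm sglang
```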

Resources

  • SGLang Documentation
  • DGX Spark Documentation
  • DGX Spark Forum