SGLang for Inference

30 MIN

Install and use SGLang on DGX Spark

DGX Spark
View on GitHub

Step 1
Verify system prerequisites

Check that your NVIDIA DGX Spark meets all requirements before proceeding. This step runs on your host system and ensures that Docker, the GPU drivers, and the NVIDIA Container Toolkit are properly configured.

Note: If you experience timeouts or "connection refused" errors while pulling the container image, you may need to use a VPN or a proxy, as some registries may be restricted by your local network or ISP.

# Verify Docker installation
docker --version

# Check NVIDIA GPU drivers
nvidia-smi

# Verify Docker GPU support (this pulls the SGLang image if it is not already cached)
docker run --rm --gpus all lmsysorg/sglang:spark nvidia-smi

# Check available disk space
df -h /

If you see a permission error such as "permission denied while trying to connect to the Docker daemon socket", add your user to the docker group so that you don't need to prefix every command with sudo.

sudo usermod -aG docker $USER
newgrp docker
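The checks above can also be scripted. A minimal Python sketch is below; the version-string format is assumed from typical `docker --version` output, and the helper names are illustrative:

```python
import re
import shutil
import subprocess

def parse_docker_version(output: str):
    """Extract (major, minor, patch) from `docker --version` output, or None."""
    m = re.search(r"Docker version (\d+)\.(\d+)\.(\d+)", output)
    return tuple(int(x) for x in m.groups()) if m else None

def check_prerequisites():
    """Run the same commands as above and report which ones succeed."""
    results = {}
    for name, cmd in [("docker", ["docker", "--version"]),
                      ("nvidia-smi", ["nvidia-smi"])]:
        ok = shutil.which(cmd[0]) is not None
        if ok:
            ok = subprocess.run(cmd, capture_output=True).returncode == 0
        results[name] = ok
    return results

if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'OK' if ok else 'MISSING'}")
```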

Step 2
Pull the SGLang container

Download the SGLang container image. This step runs on the host and may take several minutes, depending on your network connection.

# Pull the SGLang container
docker pull lmsysorg/sglang:spark

# Verify the image was downloaded
docker images | grep sglang

Step 3
Launch SGLang container for server mode

Start the SGLang container in server mode to enable HTTP API access. This runs the inference server inside the container, exposing it on port 30000 for client connections.

# Launch container with GPU support and port mapping
docker run --gpus all -it --rm \
  -p 30000:30000 \
  -v /tmp:/tmp \
  lmsysorg/sglang:spark \
  bash

Step 4
Start the SGLang inference server

Inside the container, launch the HTTP inference server with a supported model. This step runs inside the Docker container and starts the SGLang server daemon.

# Start the inference server with DeepSeek-V2-Lite model
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V2-Lite \
  --host 0.0.0.0 \
  --port 30000 \
  --trust-remote-code \
  --tp 1 \
  --attention-backend flashinfer \
  --mem-fraction-static 0.75 &

# Wait for server to initialize
sleep 30

# Check server status
curl http://localhost:30000/health
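Instead of a fixed `sleep 30`, you can poll the health endpoint until the server responds. A small standard-library sketch; the URL, timeout, and polling interval are illustrative defaults, not part of SGLang itself:

```python
import time
import urllib.error
import urllib.request

def wait_for_server(url: str = "http://localhost:30000/health",
                    timeout: float = 300.0, interval: float = 5.0) -> bool:
    """Poll the health endpoint until the server answers or the deadline passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Server not up yet; retry after the interval.
        time.sleep(interval)
    return False

if __name__ == "__main__":
    print("ready" if wait_for_server() else "server did not come up in time")
```

Model download and weight loading can take a while on first launch, so a generous timeout is safer than a fixed sleep.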

Step 5
Test client-server inference

From a new terminal on your host system, test the SGLang server API to ensure it's working correctly. This validates that the server is accepting requests and generating responses.

# Test with curl
curl -X POST http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
      "text": "What does NVIDIA love?",
      "sampling_params": {
          "temperature": 0.7,
          "max_new_tokens": 100
      }
  }'

Step 6
Test Python client API

Create a simple Python script to test programmatic access to the SGLang server. This runs on the host system and demonstrates how to integrate SGLang into applications.

import requests

# Send prompt to server
response = requests.post('http://localhost:30000/generate', json={
  'text': 'What does NVIDIA love?',
  'sampling_params': {
      'temperature': 0.7,
      'max_new_tokens': 100,
  },
})

print(f"Response: {response.json()['text']}")
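To send several prompts at once, the native `/generate` endpoint also accepts a list of texts. A hedged sketch that separates payload construction from the HTTP call; the batching behavior (a list in, a list of results out) is assumed from SGLang's native generate API:

```python
def build_generate_payload(prompts, temperature=0.7, max_new_tokens=100):
    """Build a /generate request body; a list of prompts requests a batch."""
    return {
        "text": prompts,
        "sampling_params": {
            "temperature": temperature,
            "max_new_tokens": max_new_tokens,
        },
    }

def generate(prompts, url="http://localhost:30000/generate"):
    # Imported here so the payload helper above stays dependency-free.
    import requests

    resp = requests.post(url, json=build_generate_payload(prompts), timeout=120)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    # With a list input, the server returns one result per prompt.
    for item in generate(["What does NVIDIA love?", "What is SGLang?"]):
        print(item["text"])
```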

Step 7
Validate installation

Confirm that both server and offline modes are working correctly. This step verifies the complete SGLang setup and ensures reliable operation.

# Check server mode (from host)
curl http://localhost:30000/health
curl -X POST http://localhost:30000/generate -H "Content-Type: application/json" \
  -d '{"text": "Hello", "sampling_params": {"max_new_tokens": 10}}'

# Check container logs
docker ps
docker logs <CONTAINER_ID>

Step 8
Cleanup and rollback

Stop and remove containers to clean up resources. This step returns your system to its original state.

WARNING

This will stop all SGLang containers and remove temporary data.

# Stop all running SGLang containers (xargs -r avoids an error when none are running)
docker ps | grep sglang | awk '{print $1}' | xargs -r docker stop

# Remove stopped containers
docker container prune -f

# Remove SGLang images (optional)
docker rmi lmsysorg/sglang:spark

Step 9
Next steps

With SGLang successfully deployed, you can now:

  • Integrate the HTTP API into your applications using the /generate endpoint
  • Experiment with different models by changing the --model-path parameter
  • Scale up using multiple GPUs by adjusting the --tp (tensor parallel) setting
  • Deploy production workloads using the container orchestration platform of your choice
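As one example of API integration, the SGLang server also exposes an OpenAI-compatible API under `/v1`. A standard-library sketch; the endpoint path and response shape are assumed from the OpenAI chat-completions format, and the model name must match what the server was launched with:

```python
import json
import urllib.request

def chat_payload(model, user_message, max_tokens=100):
    """Build an OpenAI-style chat-completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }

def chat(base_url, model, message):
    """POST to the OpenAI-compatible endpoint and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(chat_payload(model, message)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("http://localhost:30000",
               "deepseek-ai/DeepSeek-V2-Lite",
               "What does NVIDIA love?"))
```

Using the OpenAI-compatible surface lets existing OpenAI client code point at your local server by changing only the base URL.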

Resources

  • SGLang Documentation
  • DGX Spark Documentation
  • DGX Spark Forum
  • DGX Spark User Performance Guide

Copyright © 2026 NVIDIA Corporation