NVIDIA
Explore
Models
Blueprints
GPUs
Docs
⌘KCtrl+K
View All Playbooks
View All Playbooks

onboarding

  • MIG on DGX Station

data science

  • Topic Modeling
  • Text to Knowledge Graph on DGX Station

tools

  • NVFP4 Quantization

fine tuning

  • Nanochat Training

use case

  • NemoClaw with Nemotron-3-Super and vLLM on DGX Station
  • Local Coding Agent
  • Secure Long Running AI Agents with OpenShell on DGX Station

inference

  • Serve Qwen3-235B with vLLM

LLM Inference with SGLang

25 MIN

Serve LLMs with SGLang on DGX Station for prefix-cached multi-turn and structured output inference

BlackwellDGX StationGB300InferenceRadixAttentionSGLangStructured Output
View on GitHub
OverviewOverviewInstructionsInstructionsTroubleshootingTroubleshooting

Basic idea

SGLang is a high-performance serving framework for large language models, optimized for workloads where requests share common prefixes — multi-turn conversations, RAG pipelines, and agentic workflows. Its core innovation, RadixAttention, automatically caches and reuses KV cache entries across requests using a radix tree, eliminating redundant prefill computation. SGLang also provides best-in-class structured output generation (JSON, regex, grammar-constrained decoding) through its xGrammar backend, running up to 3x faster than standard guided decoding.

  • RadixAttention — Automatically reuses KV cache across requests sharing common prefixes. Multi-turn conversations and repeated system prompts skip re-computation entirely, reducing first-token latency and increasing throughput.
  • Structured output — Compressed finite-state machine decoding with grammar mask generation overlapped with the LLM forward pass. Produces valid JSON, regex-matched, or grammar-constrained output with minimal overhead.
  • OpenAI-compatible API — Drop-in replacement for OpenAI and vLLM endpoints. Supports /v1/chat/completions, /v1/completions, and /v1/embeddings.
  • Blackwell optimized — SGLang includes optimizations for SM100+ GPUs and CUDA 13, with NVFP4 GEMM support and accelerated softmax kernels.

What you'll accomplish

Launch SGLang on DGX Station to serve an LLM, then exercise its two key differentiators: prefix-cached multi-turn chat and structured JSON output generation. You will also benchmark multi-turn throughput to see RadixAttention's effect.

  • Serve Qwen3-8B with SGLang's Blackwell-optimized backend
  • Send multi-turn conversations and observe prefix cache hits in server metrics
  • Generate structured JSON output using schema-constrained decoding
  • Benchmark multi-turn throughput with and without prefix caching

What to know before starting

  • Basic Docker container usage
  • Familiarity with REST APIs (curl or Python requests)

Prerequisites

  • NVIDIA DGX Station with GB300 GPU (Blackwell SM103)
  • Docker installed: docker --version
  • NVIDIA Container Toolkit configured: nvidia-smi should show the GB300
  • HuggingFace account with access token
  • Network access to HuggingFace and Docker Hub

Ancillary files

  • assets/benchmark_multiturn.py — Python script that benchmarks multi-turn conversation throughput and demonstrates structured output generation

Time & risk

  • Duration: 20–25 minutes (including model download)
  • Risks: Model download requires HuggingFace authentication
  • Rollback: Stop and remove the container to restore state
  • Last Updated: 04/06/2026
    • First Publication

Resources

  • SGLang (GitHub)
  • SGLang Documentation
  • SGLang OpenAI API Reference
Terms of Use
Privacy Policy
Your Privacy Choices
Contact

Copyright © 2026 NVIDIA Corporation