Basic idea
SGLang is a high-performance serving framework for large language models, optimized for workloads where requests share common prefixes — multi-turn conversations, RAG pipelines, and agentic workflows. Its core innovation, RadixAttention, automatically caches and reuses KV cache entries across requests using a radix tree, eliminating redundant prefill computation. SGLang also provides best-in-class structured output generation (JSON, regex, grammar-constrained decoding) through its xGrammar backend, running up to 3x faster than standard guided decoding.
- RadixAttention — Automatically reuses KV cache across requests sharing common prefixes. Multi-turn conversations and repeated system prompts skip re-computation entirely, reducing first-token latency and increasing throughput.
- Structured output — Compressed finite-state machine decoding with grammar mask generation overlapped with the LLM forward pass. Produces valid JSON, regex-matched, or grammar-constrained output with minimal overhead.
- OpenAI-compatible API — Drop-in replacement for OpenAI and vLLM endpoints. Supports
/v1/chat/completions, /v1/completions, and /v1/embeddings.
- Blackwell optimized — SGLang includes optimizations for SM100+ GPUs and CUDA 13, with NVFP4 GEMM support and accelerated softmax kernels.
What you'll accomplish
Launch SGLang on DGX Station to serve an LLM, then exercise its two key differentiators: prefix-cached multi-turn chat and structured JSON output generation. You will also benchmark multi-turn throughput to see RadixAttention's effect.
- Serve Qwen3-8B with SGLang's Blackwell-optimized backend
- Send multi-turn conversations and observe prefix cache hits in server metrics
- Generate structured JSON output using schema-constrained decoding
- Benchmark multi-turn throughput with and without prefix caching
What to know before starting
- Basic Docker container usage
- Familiarity with REST APIs (curl or Python requests)
Prerequisites
- NVIDIA DGX Station with GB300 GPU (Blackwell SM103)
- Docker installed:
docker --version
- NVIDIA Container Toolkit configured:
nvidia-smi should show the GB300
- HuggingFace account with access token
- Network access to HuggingFace and Docker Hub
Ancillary files
assets/benchmark_multiturn.py — Python script that benchmarks multi-turn conversation throughput and demonstrates structured output generation
Time & risk
- Duration: 20–25 minutes (including model download)
- Risks: Model download requires HuggingFace authentication
- Rollback: Stop and remove the container to restore state
- Last Updated: 04/06/2026