LLM Inference with SGLang

Basic idea

SGLang is a high-performance serving framework for large language models, optimized for workloads where requests share common prefixes, such as multi-turn conversations, RAG pipelines, and agentic workflows. Its core innovation, RadixAttention, automatically caches KV cache entries in a radix tree and reuses them across requests, so the shared portion of a prompt does not have to be prefilled again. SGLang also provides fast structured output generation (JSON, regex, and grammar-constrained decoding) through its XGrammar backend, running up to 3x faster than standard guided decoding.

RadixAttention — Automatically reuses KV cache across requests sharing common prefixes. Multi-turn conversations and repeated system prompts skip re-computation entirely, reducing first-token latency and increasing throughput.
Structured output — Compressed finite-state machine decoding with grammar mask generation overlapped with the LLM forward pass. Produces valid JSON, regex-matched, or grammar-constrained output with minimal overhead.
OpenAI-compatible API — Drop-in replacement for OpenAI and vLLM endpoints. Supports /v1/chat/completions, /v1/completions, and /v1/embeddings.
Blackwell optimized — SGLang includes optimizations for SM100+ GPUs and CUDA 13, with NVFP4 GEMM support and accelerated softmax kernels.
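
To make the prefix-reuse idea concrete, here is a toy sketch of prefix matching over token IDs. It is an illustration only, not SGLang's implementation: it uses a plain, uncompressed trie rather than a radix tree, stores no KV tensors, and never evicts anything. The class and method names (ToyPrefixCache, process) are hypothetical.

```python
from typing import Dict, List, Tuple


class PrefixNode:
    """One node per token; children are keyed by token id."""

    def __init__(self) -> None:
        self.children: Dict[int, "PrefixNode"] = {}


class ToyPrefixCache:
    """Toy stand-in for a prefix cache: remembers token sequences seen so far."""

    def __init__(self) -> None:
        self.root = PrefixNode()

    def match_prefix(self, tokens: List[int]) -> int:
        """Return how many leading tokens are already present in the tree."""
        node, matched = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens: List[int]) -> None:
        """Record a full token sequence so later requests can reuse it."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixNode())

    def process(self, tokens: List[int]) -> Tuple[int, int]:
        """Return (cached_tokens, new_tokens) for a request, then store it."""
        cached = self.match_prefix(tokens)
        self.insert(tokens)
        return cached, len(tokens) - cached
```

In this sketch, a second chat turn that extends the first turn's tokens reports the whole first turn as cached, and a different conversation with the same system prompt reports only the system prompt as cached. That is the effect the feature list above describes for multi-turn chat and repeated system prompts.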

What you'll accomplish

Launch SGLang on DGX Station to serve an LLM, then exercise its two key differentiators: prefix-cached multi-turn chat and structured JSON output generation. You will also benchmark multi-turn throughput and interpret results together with server logs (wall time alone is not a reliable cache signal under parallel load).

Serve Qwen/Qwen3-8B (the default, chosen for fast first-run validation) or another checkpoint from the in-playbook model table, including the larger Qwen3.6 MoE (Qwen/Qwen3.6-35B-A3B) once the workflow is verified
Send multi-turn conversations and observe prefix cache hits in Docker logs (#cached-token)
Generate structured JSON output using schema-constrained decoding
Benchmark multi-turn throughput, optionally with a single-conversation run to reduce contention, and write the full cache and metrics scrape to a log file for review
Optional next step: serve a large MoE such as DeepSeek-V4 on DGX Station when your SGLang build and VRAM allow
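
As a rough preview of the two request shapes involved, the sketch below builds a multi-turn chat payload and a schema-constrained payload for /v1/chat/completions. It rests on several assumptions: the server listens on http://localhost:30000 (adjust to your launch command), the response_format field follows the OpenAI-style json_schema shape, and the helper names (append_turn, chat_payload, json_schema_payload, post_chat) are hypothetical and not taken from the playbook's scripts.

```python
import json
import urllib.request
from typing import Any, Dict, List

MODEL = "Qwen/Qwen3-8B"              # default checkpoint named in this playbook
BASE_URL = "http://localhost:30000"  # assumption: change to match your server

Message = Dict[str, str]


def append_turn(history: List[Message], role: str, content: str) -> List[Message]:
    """Return a new history with one more message; earlier turns stay unchanged."""
    return history + [{"role": role, "content": content}]


def chat_payload(messages: List[Message], model: str = MODEL,
                 max_tokens: int = 128) -> Dict[str, Any]:
    """Build a plain chat completion request body."""
    return {"model": model, "messages": messages, "max_tokens": max_tokens}


def json_schema_payload(messages: List[Message], schema: Dict[str, Any],
                        name: str = "answer", model: str = MODEL) -> Dict[str, Any]:
    """Build a request body that asks for output matching a JSON schema."""
    payload = chat_payload(messages, model)
    payload["response_format"] = {
        "type": "json_schema",
        "json_schema": {"name": name, "schema": schema},
    }
    return payload


def post_chat(payload: Dict[str, Any], base_url: str = BASE_URL,
              timeout: float = 60.0) -> Dict[str, Any]:
    """POST a payload to the chat completions endpoint and decode the JSON reply."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Because each turn is appended to the same history, every follow-up request begins with exactly the messages of the previous one. That shared prefix is what the server can serve from cache. With a server running, post_chat(chat_payload(history)) sends one turn, and post_chat(json_schema_payload(history, schema)) requests schema-constrained output.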

What to know before starting

Basic Docker container usage
Familiarity with REST APIs (curl or Python requests)

Prerequisites

NVIDIA DGX Station with GB300 GPU (Blackwell SM103)
Docker installed: docker --version
NVIDIA Container Toolkit configured: nvidia-smi should show the GB300
HuggingFace account with access token
Network access to HuggingFace and Docker Hub
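
The Docker and GPU checks above can be scripted if you prefer. The following is a minimal sketch, assuming that a successful docker --version means Docker is usable and that the string "GB300" appears somewhere in the nvidia-smi output. The function names (run_command, preflight) are hypothetical.

```python
import shutil
import subprocess
from typing import Callable, Dict, List, Optional


def run_command(cmd: List[str]) -> Optional[str]:
    """Return the command's stdout, or None if it is missing or fails."""
    if shutil.which(cmd[0]) is None:
        return None
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout if result.returncode == 0 else None


def preflight(run: Callable[[List[str]], Optional[str]] = run_command,
              gpu_name: str = "GB300") -> Dict[str, bool]:
    """Check the two command-line prerequisites listed above."""
    docker_out = run(["docker", "--version"])
    smi_out = run(["nvidia-smi"])
    return {
        "docker": docker_out is not None,
        # Assumption: the GPU name shows up verbatim in the nvidia-smi table.
        "gpu_visible": smi_out is not None and gpu_name in smi_out,
    }
```

Calling preflight() on the Station should return True for both keys before you continue. The run parameter exists only so the check can be exercised without real hardware.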

Ancillary files

assets/benchmark_multiturn.py benchmarks multi-turn chat under parallel load and structured JSON output, and writes the full /server_info and /metrics bodies to a detail log (the terminal shows only a short summary)
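
The "full bodies to a log, short summary to the terminal" pattern can be sketched as follows. This is not the contents of benchmark_multiturn.py. It is a hypothetical illustration that assumes the two endpoint paths named above are reachable on your server, and that /metrics is enabled there. The function names (http_get, scrape_to_log) are invented for this example.

```python
import urllib.request
from typing import Callable, Iterable


def http_get(url: str, timeout: float = 10.0) -> str:
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def scrape_to_log(base_url: str, log_path: str,
                  paths: Iterable[str] = ("/server_info", "/metrics"),
                  fetch: Callable[[str], str] = http_get) -> str:
    """Write each endpoint's full body to log_path; return a one-line summary."""
    summary = []
    with open(log_path, "w", encoding="utf-8") as log:
        for path in paths:
            try:
                body = fetch(base_url.rstrip("/") + path)
            except OSError as exc:  # urllib's URLError and HTTPError derive from OSError
                body = f"<fetch failed: {exc}>"
            log.write(f"===== {path} =====\n{body}\n")
            summary.append(f"{path}: {len(body)} chars")
    return "; ".join(summary)
```

Printing only the returned summary keeps the terminal readable, while the log file retains every metric line for later comparison against the server logs.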

Time & risk

Duration: 20–30 minutes for the default Qwen/Qwen3-8B; 45–60 minutes if you switch to Qwen/Qwen3.6-35B-A3B (download + Blackwell CUDA-graph capture)
Risks: Gated models (e.g., Llama 3.3) require HuggingFace authentication and license acceptance
Rollback: Stop and remove the container to restore state
Last Updated: 05/26/2026
- First Publication