Serve LLMs with SGLang on DGX Station (Qwen3-8B default; Qwen3.6 MoE optional)—prefix-cached multi-turn, structured output, benchmarks, and inference-server guidance
SGLang is a high-performance serving framework for large language models, optimized for workloads where requests share common prefixes — multi-turn conversations, RAG pipelines, and agentic workflows. Its core innovation, RadixAttention, automatically caches and reuses KV cache entries across requests using a radix tree, eliminating redundant prefill computation. SGLang also provides best-in-class structured output generation (JSON, regex, grammar-constrained decoding) through its xGrammar backend, running up to 3x faster than standard guided decoding.
/v1/chat/completions, /v1/completions, and /v1/embeddings.Launch SGLang on DGX Station to serve an LLM, then exercise its two key differentiators: prefix-cached multi-turn chat and structured JSON output generation. You will also benchmark multi-turn throughput and interpret results together with server logs (wall time alone is not a reliable cache signal under parallel load).
Qwen/Qwen3-8B by default for fast first-run validation) or another checkpoint from the in-playbook model table — including the larger Qwen3.6 MoE (Qwen/Qwen3.6-35B-A3B) once the workflow is verified#cached-token)docker --versionnvidia-smi should show the GB300assets/benchmark_multiturn.py — Benchmarks multi-turn chat under parallel load, structured JSON output, and writes full /server_info + /metrics bodies to a detail log (terminal shows a short summary only)Qwen/Qwen3-8B; 45–60 minutes if you switch to Qwen/Qwen3.6-35B-A3B (download + Blackwell CUDA-graph capture)