Serve Qwen3-235B with vLLM

20 MIN

Set up a vLLM server with Qwen3-235B on DGX Station

Inference · vLLM

Basic idea

vLLM is an inference engine designed to run large language models efficiently. Its core ideas are to maximize throughput and minimize memory waste when serving LLMs:

  • PagedAttention stores the KV cache in fixed-size blocks, so long sequences don't run the GPU out of memory.
  • Continuous batching keeps GPUs fully utilized by adding new requests to batches in progress.
  • OpenAI-compatible API allows applications built for OpenAI to switch to vLLM with minimal changes.
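
As a quick illustration of that last point, a running vLLM server can be exercised with plain curl against its OpenAI-compatible endpoint. This is a minimal sketch: the port (8000, vLLM's default) and the model ID are assumptions and must match however the server was actually launched.

    # Assumes a server is already listening on vLLM's default port 8000;
    # the model ID below is a placeholder and must match the served model.
    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
            "model": "Qwen3-235B-A22B-NVFP4",
            "messages": [{"role": "user", "content": "Say hello."}],
            "max_tokens": 64
          }'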

What you'll accomplish

Serve the Qwen3-235B-A22B-NVFP4 model using vLLM on NVIDIA DGX Station. This 235B-parameter model uses NVFP4 quantization and fits entirely in VRAM on the GB300 GPU.
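
A launch along these lines is sketched below, assuming the upstream vllm/vllm-openai container image; the Serve Qwen3-235B section of this playbook has the authoritative DGX Station steps, and the exact HuggingFace repository ID for the NVFP4 checkpoint may differ.

    # Sketch only: image tag, model ID, and cache path are assumptions.
    docker run --rm --gpus all --ipc=host \
      -p 8000:8000 \
      -v ~/.cache/huggingface:/root/.cache/huggingface \
      -e HF_TOKEN=<your HuggingFace access token> \
      vllm/vllm-openai:latest \
      --model nvidia/Qwen3-235B-A22B-NVFP4

The first launch downloads the weights into the mounted cache, which is why the Time & risk section below flags a longer first run; later starts reuse the cache.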

What to know before starting

  • Basic Docker container usage
  • Familiarity with REST APIs

Prerequisites

  • NVIDIA DGX Station with GB300 and RTX 6000 Pro GPUs
  • Docker installed: docker --version
  • NVIDIA Container Toolkit configured
  • HuggingFace account with access token
  • Network access to NGC and HuggingFace
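
The one-liners below are one way to confirm these prerequisites before pulling any large images; the last command is the sample workload from the NVIDIA Container Toolkit documentation.

    docker --version     # Docker is installed
    nvidia-smi           # the GB300 and RTX 6000 Pro GPUs are visible to the driver
    # The toolkit should expose GPUs inside a container:
    docker run --rm --gpus all ubuntu nvidia-smi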

Time & risk

  • Duration: 15-20 minutes (longer on first run due to model download)
  • Risks: Model download requires HuggingFace authentication
  • Rollback: Stop and remove the container to restore state (commands sketched after this list)
  • Last Updated: 03/02/2026
    • First Publication
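
For the rollback noted above, stopping and removing the serving container is enough. The container name here is hypothetical, so look yours up with docker ps first; a container started with --rm removes itself when stopped.

    docker ps                  # find the serving container's name or ID
    docker stop vllm-server    # "vllm-server" is a hypothetical name
    docker rm vllm-server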

Resources

  • vLLM Documentation