Skip to main content
NVIDIA
Explore
Models
Skills
Blueprints
GPUs
Docs
⌘KCtrl+K
View All Playbooks
View All Playbooks

onboarding

  • MIG on DGX Station

data science

  • Topic Modeling
  • Text to Knowledge Graph on DGX Station

tools

  • NVFP4 Quantization

fine tuning

  • NVFP4 Pretraining with Megatron Bridge
  • Nanochat Training

use case

  • Run NemoClaw with a Local LLM
  • DGX Station AI Skills for Coding Agents
  • Profiler-Driven Kernel Optimization for Fine-Tuning
  • Local Healthcare Agent on DGX Station
  • Secure Long Running AI Agents with OpenShell on DGX Station
  • Local Coding Agent

inference

  • vLLM for Inference
  • Image & Video Generation with ComfyUI
  • Isaac GR00T N1.6 Fine-Tuning
  • LLM Inference with SGLang

vLLM for Inference

30 MIN

Install and use vLLM on DGX Station

InferencevLLM
OverviewOverviewInstructionsInstructionsTroubleshootingTroubleshooting

Basic idea

vLLM is an inference engine designed to run large language models efficiently. The key idea is maximizing throughput and minimizing memory waste when serving LLMs.

  • PagedAttention handles long sequences without running out of GPU memory.
  • Continuous batching keeps GPUs fully utilized by adding new requests to batches in progress.
  • OpenAI-compatible API allows applications built for OpenAI to switch to vLLM with minimal changes.

What you'll accomplish

Serve a supported model using vLLM on NVIDIA DGX Station. Refer to the table below to see the supported models.

You'll set up vLLM high-throughput LLM serving on NVIDIA DGX Station with Blackwell architecture.

What to know before starting

  • Basic Docker container usage
  • Familiarity with REST APIs

Prerequisites

  • NVIDIA DGX Station with GB300 and RTX 6000 Pro GPUs
  • Docker installed: docker --version
  • NVIDIA Container Toolkit configured
  • HuggingFace account with access token
  • Network access to NGC and HuggingFace

Model Support Matrix

The following models are supported with vLLM on DGX Station. All listed models are available and ready to use:

ModelQuantizationSupport StatusHF Handle
Step-3.7-Flash-FP8FP8✅stepfun-ai/Step-3.7-Flash-FP8
Step-3.7-Flash-NVFP4NVFP4✅stepfun-ai/Step-3.7-Flash-NVFP4
Qwen3-235B-A22B-NVFP4NVFP4✅nvidia/Qwen3-235B-A22B-NVFP4

Time & risk

  • Duration: 30 minutes (longer on first run due to model download)
  • Risks: Model download requires HuggingFace authentication
  • Rollback: Stop and remove the container to restore state
  • Last Updated: 05/28/2026
    • Update models

Resources

  • vLLM Documentation
Terms of Use
Privacy Policy
Your Privacy Choices
Contact

Copyright © 2026 NVIDIA Corporation