Copyright © 2026 NVIDIA Corporation
Generate images and videos with FLUX, Wan 2.1, HunyuanVideo, and Cosmos on DGX Station
Confirm Docker, GPU access, and available disk space.
docker --version
nvidia-smi
df -h /
Make sure / has enough free space for model weights and the Docker image. On DGX Station /home is on the root filesystem, so checking / covers both. You can download fewer models by choosing a tier (see Step 4).
If you haven't already, add your user to the docker group:
sudo usermod -aG docker $USER
newgrp docker
Set your HuggingFace token so the download script and container can access gated models.
# HuggingFace token (required). Run this in the SAME shell that will
# launch `bash assets/scripts/download-models.sh` in Step 4 — the script
# reads $HF_TOKEN from the environment and exits early if it is unset.
# Get a token from https://huggingface.co/settings/tokens
export HF_TOKEN="your_huggingface_token"
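Because the download script exits early when the token is missing, it can help to check the variable before starting a long download. The following is a minimal sketch, not part of the playbook; the helper name `hf_token_problem` is hypothetical, and it only checks for an empty value or the placeholder string shown above.

```python
import os

def hf_token_problem(env=None):
    """Return a short description of what is wrong with HF_TOKEN, or None if it looks set."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        return "HF_TOKEN is unset or empty"
    if token == "your_huggingface_token":
        # The placeholder from the export line above was never replaced.
        return "HF_TOKEN still holds the placeholder value"
    return None

problem = hf_token_problem()
print(problem or "HF_TOKEN is set")
```

This does not validate the token against HuggingFace; it only catches the two most common local mistakes.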
Some models (FLUX.1, HiDream-I1) require accepting the model license on HuggingFace before downloading. Visit each gated model's page and click "Agree and access" if prompted.
Clone the playbook repository and build the ComfyUI Docker image. The image is built on top of the NGC PyTorch container, which is already optimized for the GB300's ARM64 architecture.
git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-spark-playbooks/nvidia/station-comfyui
Build the container image:
docker build -t comfyui-gb300 -f assets/Dockerfile .
The build clones ComfyUI, installs dependencies (preserving the NGC-optimized PyTorch), and pre-installs custom nodes for video generation, ControlNet, and IP-Adapter. This takes approximately 5–10 minutes.
This playbook uses models organized into three tiers. Download only what you need, or download everything.
| Tier | Models | Disk space | Peak VRAM (approx.) | Workflows enabled |
|---|---|---|---|---|
| 1 — Getting Started | FLUX.1 dev, Wan 2.1 T2V 14B | ~70 GB | ~80 GB (Wan 720p clip) | Text-to-image, text-to-video |
| 2 — Intermediate | + HiDream-I1, Wan 2.1 I2V, Cosmos-Predict2 | ~180 GB | ~100 GB (FLUX→Wan two-model graph) | + HiDream image gen, image-to-video, FLUX→Wan pipeline, Cosmos Video2World |
| 3 — Advanced | + HunyuanVideo, FLUX ControlNet (Canny) | ~230 GB | ~120 GB (Hunyuan 1080p / long clips) | + 1080p video, ControlNet-guided generation |
Peak VRAM depends on resolution, frame count, and precision; values above are order-of-magnitude for the default graphs in this playbook on a GB300 (252 GB HBM3e).
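To decide which tier to download, the disk-space column of the tier table can be compared with the free space on /. The sketch below assumes the approximate sizes from the table; the `margin_gb` allowance for the Docker image and generated outputs is a hypothetical value, not a figure from this playbook.

```python
import shutil

# Approximate cumulative disk footprint per tier in GB, copied from the tier table.
TIER_DISK_GB = {1: 70, 2: 180, 3: 230}

def largest_tier_that_fits(free_gb, margin_gb=20):
    """Return the highest tier whose weights plus margin fit in free_gb, or None."""
    fitting = [tier for tier, need in TIER_DISK_GB.items() if need + margin_gb <= free_gb]
    return max(fitting) if fitting else None

def free_gb(path="/"):
    """Free space on the filesystem that holds path, in decimal GB."""
    return shutil.disk_usage(path).free / 1e9

available = free_gb()
print(f"Free on /: {available:.0f} GB, largest tier that fits: {largest_tier_that_fits(available)}")
```

A result of `None` means that even Tier 1 would not fit with the chosen margin.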
Install the Hugging Face Hub CLI (provides the hf command) if you do not already have it. The CLI installs to ~/.local/bin/, which is not on the default non-interactive PATH, so add it before continuing:
pip3 install --break-system-packages huggingface-hub
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
export PATH="$HOME/.local/bin:$PATH"
hf --version # confirms PATH is correct
Run the download script with your chosen tier (default downloads all):
# Download Tier 1 only (Getting Started):
bash assets/scripts/download-models.sh 1
# Download all tiers:
bash assets/scripts/download-models.sh
Model downloads can take 30–60 minutes depending on network speed. The script uses the Hugging Face Hub hf download command (from the huggingface_hub package). If a download fails, the script exits with an error and prints which file was expected — check your token, network, and that you have accepted gated model licenses on Hugging Face.
After Tier 1 completes, verify weights landed under models/:
ls -la ./models/diffusion_models/
ls -la ./models/text_encoders/ | head
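The same check can be scripted to see file sizes at a glance, which makes truncated downloads easier to spot. This is a minimal sketch assuming the two subdirectories listed above; the helper name `weight_files` is hypothetical.

```python
from pathlib import Path

def weight_files(models_dir="models", subdirs=("diffusion_models", "text_encoders")):
    """Map each file under the given subdirectories to its size in decimal GB."""
    root = Path(models_dir)
    sizes = {}
    for sub in subdirs:
        folder = root / sub
        if not folder.is_dir():
            continue  # tier not downloaded yet, or a different layout
        for path in sorted(folder.rglob("*")):
            if path.is_file():
                sizes[str(path.relative_to(root))] = path.stat().st_size / 1e9
    return sizes

for name, size in weight_files().items():
    print(f"{size:8.2f} GB  {name}")
```

An empty listing, or a multi-gigabyte model reported at a few kilobytes, suggests the download did not complete.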
Start the ComfyUI container with all model and output directories mounted as volumes. On DGX Station, identify the GB300 GPU index with nvidia-smi and use --gpus '"device=N"' to target it. If the GB300 is your only GPU, --gpus all also works.
# Find the GB300 device index (look for "GB300" in the Name column)
nvidia-smi --query-gpu=index,name --format=csv,noheader
The default --gpus '"device=0"' works on single-GPU stations where the GB300 is index 0. If nvidia-smi reports the GB300 at a different index (for example index 1 on dual-GPU stations with an RTX PRO 6000 + GB300), substitute that index in the command below.
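The index lookup can also be automated by parsing the CSV output of the nvidia-smi query above. The sketch below is illustrative only: the GPU names in `sample` are hypothetical, and the exact name string reported on your station may differ, so the match is a case-insensitive substring search.

```python
def find_gpu_index(csv_text, needle="GB300"):
    """Return the index of the first GPU whose name contains needle, or None."""
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        index, _, name = line.partition(",")
        if needle.lower() in name.lower():
            return int(index.strip())
    return None

# Hypothetical output of: nvidia-smi --query-gpu=index,name --format=csv,noheader
sample = "0, NVIDIA RTX PRO 6000 Blackwell\n1, NVIDIA GB300\n"
print(find_gpu_index(sample))  # prints 1
```

The returned number is the value to substitute for N in `--gpus '"device=N"'`.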
# device=0 by default; replace with the GB300 index from the command above
docker run -d \
--name comfyui \
--gpus '"device=0"' \
--ipc host \
--ulimit memlock=-1 \
-p 8188:8188 \
-v "$(pwd)/models:/opt/ComfyUI/models" \
-v "$(pwd)/output:/opt/ComfyUI/output" \
-v "$(pwd)/input:/opt/ComfyUI/input" \
-v "$(pwd)/assets/workflows:/opt/ComfyUI/user/default/workflows" \
-v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
comfyui-gb300
Check startup logs:
docker logs -f comfyui
Expected output includes custom node loading messages and:
To see the GUI go to: http://0.0.0.0:8188
Press Ctrl+C to exit the log view. Open a web browser and navigate to http://<STATION_IP>:8188 where <STATION_IP> is your DGX Station's IP address.
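For scripted setups, readiness can be detected by polling the web port instead of watching the logs. The following is a rough sketch under the assumption that ComfyUI answers a plain GET on its root URL with HTTP 200 once the GUI is up; the function name and default timings are hypothetical.

```python
import time
import urllib.error
import urllib.request

# Talk to the local server directly, ignoring any proxy settings in the environment.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

def wait_until_ready(url="http://127.0.0.1:8188/", timeout_s=120.0, interval_s=2.0):
    """Poll url until it answers with HTTP 200 or timeout_s elapses; return True if ready."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with _opener.open(url, timeout=interval_s) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server is not accepting connections yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_until_ready()` right after `docker run` blocks until the GUI responds or two minutes pass, whichever comes first.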
NOTE
The startup logs include several benign warnings you can ignore: aimdo: ... funchook_prepare(cuMemFree_v2) failed (NGC PyTorch's CUDA hooks tool falling back to no-op), urllib3 / charset_normalizer doesn't match a supported version, torchaudio missing (covered by the import-only stub — no playbook workflow uses audio VAE), DWPose: Onnxruntime not found ... switch to OpenCV with CPU device (aarch64 has no onnxruntime-gpu wheel; CPU preprocessing still works), and accelerate / GPTQModel / optimum / bitsandbytes not installed from the HiDream sampler. The real "ready" signal is the To see the GUI go to: ... line above; treat anything else as suspect.
ComfyUI uses two different JSON shapes:
| Location | Format | Use |
|---|---|---|
| assets/workflows/*.json mounted at user/default/workflows/ | UI workflow (has "nodes" and "links") | Load in the web UI, edit in the canvas, then Queue Prompt |
| assets/workflow_api/*.api.json (on the host repo, not mounted into the default workflow folder) | API prompt graph (flat node ids → class_type / inputs) | POST /prompt, curl, automation |
If you open an .api.json file with Load, the UI shows "Error: the workflow does not contain any nodes" — that is expected; those files are not UI workflows.
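The two shapes can be told apart programmatically using the markers from the table: a UI workflow has top-level "nodes" and "links", while an API prompt graph maps node ids to objects with a class_type. The sketch below is an illustration of that rule; the helper name and the sample documents are hypothetical.

```python
def workflow_kind(doc):
    """Classify a parsed ComfyUI JSON document as 'ui', 'api', or 'unknown'."""
    if not isinstance(doc, dict):
        return "unknown"
    if "nodes" in doc and "links" in doc:
        return "ui"
    # API graphs are flat: node id -> {"class_type": ..., "inputs": ...}
    if any(isinstance(v, dict) and "class_type" in v for v in doc.values()):
        return "api"
    return "unknown"

# Hypothetical minimal examples of each shape
ui_doc = {"nodes": [], "links": []}
api_doc = {"_comment": "demo", "3": {"class_type": "KSampler", "inputs": {}}}
print(workflow_kind(ui_doc), workflow_kind(api_doc))  # prints: ui api
```

Pass it the result of `json.load` on a file to check which folder the file belongs in before loading or posting it.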
Optional — run the same graph via HTTP API (from the playbook root, with ComfyUI listening on port 8188). Strip any non-node keys (for example _comment in some API files), minify to one line, and POST:
PROMPT=$(python3 -c "import json; d=json.load(open('assets/workflow_api/flux-text-to-image.api.json')); print(json.dumps({k:v for k,v in d.items() if str(k).isdigit()}, separators=(',',':')))")
curl -sS http://127.0.0.1:8188/prompt \
-X POST \
-H "Content-Type: application/json" \
-d "{\"prompt\":${PROMPT}}" | python3 -m json.tool
The response includes a prompt_id you can correlate with server logs and the output/ folder.
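For automation beyond a single curl call, the same request can be built with the Python standard library. This is a sketch equivalent to the shell commands above, not an official client; the function names are hypothetical, and the base URL assumes ComfyUI is listening locally on port 8188.

```python
import json
import urllib.request

def node_entries(graph):
    """Keep only entries whose key is a numeric node id (drops keys such as _comment)."""
    return {k: v for k, v in graph.items() if str(k).isdigit()}

def build_request(graph, base_url="http://127.0.0.1:8188"):
    """Build, without sending, the POST /prompt request for an API prompt graph."""
    body = json.dumps({"prompt": node_entries(graph)}, separators=(",", ":")).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def submit(graph, base_url="http://127.0.0.1:8188", timeout=30):
    """Send the graph to a running ComfyUI server and return the parsed JSON response."""
    with urllib.request.urlopen(build_request(graph, base_url), timeout=timeout) as resp:
        return json.load(resp)
```

With the container running, load `assets/workflow_api/flux-text-to-image.api.json` with `json.load`, pass the result to `submit`, and read `prompt_id` from the returned dictionary.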
ComfyUI interface orientation:
The pre-built workflows flux-text-to-image.json, wan-text-to-video.json, etc. are listed in the workflow sidebar under the mounted folder.
Requires: Tier 1 models
Load the pre-built FLUX text-to-image workflow. In ComfyUI, click Load and select flux-text-to-image.json (UI format). Do not use the *.api.json files in assets/workflow_api/ with Load — they are for the HTTP API only.
What this workflow does:
The workflow connects these nodes in sequence:
weight_dtype=default.
euler/simple).
output/ directory.
Try it:
A majestic snow leopard resting on a cliff at golden hour, photorealistic, 8k detail
The generated image appears in the output/ directory and in the preview node.
Experiment with different prompts, resolutions (512x512 up to 2048x2048), and step counts. FLUX.1 dev produces high-quality results even at 20 steps.
Requires: Tier 1 models
Load wan-text-to-video.json from the workflow browser.
What this workflow does:
Try it:
A drone shot flying over a misty mountain forest at sunrise, cinematic
Watch nvidia-smi in another terminal: the 14B model at 720p uses approximately 65–80 GB of the GB300's 252 GB HBM3e.
An animated .webp file (animated WEBP from SaveAnimatedWEBP) appears in the output/ directory. To convert to MP4, use ffmpeg -i output/wan_t2v_output_00001_.webp output/wan_t2v_output.mp4.
Tips:
Requires: Tier 2 models
This step introduces four additional workflows. Each builds on the basics from Steps 6–7.
Load hidream-text-to-image.json.
HiDream-I1 Full is a 17B parameter image model that uses four text encoders — CLIP-L, CLIP-G, T5-XXL, and Llama-3.1-8B-Instruct. The Llama encoder gives it exceptional prompt understanding, especially for complex or nuanced descriptions.
The full pipeline uses approximately 60–65 GB in bf16 — well within the GB300's capacity but impossible on most GPUs.
Try it: Use a detailed, complex prompt to see the difference from FLUX — for example: An astronaut riding a horse on Mars, with Earth visible in the sky, oil painting style by Rembrandt, dramatic chiaroscuro lighting
Load wan-image-to-video.json.
This workflow takes an input image and animates it into a video clip. Place your source image in the input/ directory before running.
Try it: Generate an image with FLUX first (Step 6), copy it from output/ to input/, then animate it.
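Copying the most recent generation from output/ to input/ can be scripted. The sketch below assumes the FLUX workflow saves PNG files directly under output/, which is an assumption about the default graph; adjust `pattern` if your files differ. The helper name `copy_latest` is hypothetical.

```python
import shutil
from pathlib import Path

def copy_latest(src_dir="output", dst_dir="input", pattern="*.png"):
    """Copy the most recently modified match from src_dir into dst_dir.

    Returns the path of the copy, or None if nothing matched.
    """
    candidates = [p for p in Path(src_dir).glob(pattern) if p.is_file()]
    if not candidates:
        return None
    newest = max(candidates, key=lambda p: p.stat().st_mtime)
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(newest, Path(dst_dir) / newest.name))
```

Run `copy_latest()` from the playbook root after the FLUX generation finishes, then pick the copied file in the image-loading node of the image-to-video workflow.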
Load flux-to-wan-pipeline.json.
This workflow chains two models in a single graph:
This avoids manually moving files between workflows. Both models load into GPU memory simultaneously (~95 GB total in bf16).
Load cosmos-text-to-video.json.
NVIDIA Cosmos-Predict2 14B is NVIDIA's world foundation model for Video2World generation. It takes an input image and generates a physically plausible video extending from that scene. Place your source image in the input/ directory before running.
The Cosmos VAE is extremely efficient — it can encode/decode 1280x704 at 121 frames without tiling.
Try it: Use an image from a previous FLUX generation as the start frame, with a prompt describing the motion: A red ball rolling down a wooden ramp and bouncing off a wall, physics simulation, realistic lighting
Requires: Tier 3 models
Load hunyuan-1080p-video.json.
This is the true GB300 showcase. HunyuanVideo's 13B model generating at 1080p resolution uses approximately 100–120 GB of VRAM — impossible on any consumer or professional GPU, but well within the GB300's 252 GB.
Monitor with nvidia-smi: you should see 100+ GB GPU memory usage.
Try it: A time-lapse of cherry blossoms falling in a Japanese garden with a koi pond, 4K cinematic
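To log memory usage during a run rather than reading it by eye, nvidia-smi can emit machine-readable values with `nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits`, which reports MiB. The parser below is a minimal sketch for that output; the numbers in `sample` are hypothetical, not measurements.

```python
def parse_memory_csv(csv_text):
    """Parse 'index, used_mib, total_mib' lines into {index: (used_gib, total_gib)}."""
    usage = {}
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        index, used, total = (field.strip() for field in line.split(","))
        usage[int(index)] = (int(used) / 1024, int(total) / 1024)
    return usage

# Hypothetical sample line: GPU 0 using 112640 MiB of 258048 MiB
sample = "0, 112640, 258048\n"
print(parse_memory_csv(sample))  # prints {0: (110.0, 252.0)}
```

Feeding it the command's output once per second gives a simple trace of peak usage for the workflow.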
Load flux-controlnet.json.
ControlNet lets you guide image generation with structural conditioning — edges, depth maps, or pose skeletons extracted from a reference image.
Place the reference image in input/.
Use cases: Architectural visualization, consistent character poses, style transfer while preserving composition.
Stop and remove the ComfyUI container:
docker stop comfyui
docker rm comfyui
NOTE
Files in output/ and models/ are written by the container as root, so removing them from the host shell needs sudo (e.g. sudo rm -rf models/). To avoid this in future runs, add --user "$(id -u):$(id -g)" to the docker run command in Step 5 — note that this requires the host UID to have write access to all mounted directories.
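Before adding `--user`, it is worth confirming that the host user can write to every mounted directory. The sketch below checks the host paths used in the Step 5 `docker run` command; the helper name `unwritable` is hypothetical, and the check reflects the current user's permissions only.

```python
import os

def unwritable(paths):
    """Return the paths that are missing or not writable by the current user."""
    return [
        p for p in paths
        if not (os.path.isdir(p) and os.access(p, os.W_OK | os.X_OK))
    ]

# Host side of the volume mounts from the docker run command in Step 5
MOUNTS = [
    "models",
    "output",
    "input",
    "assets/workflows",
    os.path.expanduser("~/.cache/huggingface"),
]
print(unwritable(MOUNTS))
```

An empty list means all mounts are writable. Any directory it prints needs its ownership or permissions fixed first, for example with `sudo chown`.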
Optionally remove the Docker image:
docker rmi comfyui-gb300
Optionally remove downloaded models to reclaim disk space (sudo is needed for any files the container wrote as root):
sudo rm -rf models/
Generated images and videos in output/ are preserved on the host regardless of container state.