Image & Video Generation with ComfyUI

Step 1
Verify your environment

Confirm Docker, GPU access, and available disk space.

docker --version
nvidia-smi
df -h /

Docker: Must be running (version 24+ recommended).
nvidia-smi: Should list the GB300 GPU with 252 GB HBM3e.
Disk space: At least 200 GB free on / for model weights and the Docker image. On DGX Station, /home is on the root filesystem, so checking / covers both. The full Tier 3 model set in Step 4 is listed at ~230 GB, so allow more space if you plan to download everything; choosing a lower tier reduces the requirement.

If you haven't already, add your user to the docker group:

sudo usermod -aG docker $USER
newgrp docker

Step 2
Set up environment variables

Set your HuggingFace token so the download script and container can access gated models.

# HuggingFace token (required). Run this in the SAME shell that will
# launch `bash assets/scripts/download-models.sh` in Step 4 — the script
# reads $HF_TOKEN from the environment and exits early if it is unset.
# Get a token from https://huggingface.co/settings/tokens
export HF_TOKEN="your_huggingface_token"
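Because the download script exits early when HF_TOKEN is unset, a quick pre-flight check in the same shell can save a failed run. The following is a minimal Python sketch and not part of the playbook: the helper name token_problem is hypothetical, and it only checks that a value is present, not that Hugging Face accepts it.

```python
import os

PLACEHOLDER = "your_huggingface_token"  # the literal placeholder from the export line


def token_problem(env):
    """Return a short description of what is wrong with HF_TOKEN, or None.

    `env` is any mapping of environment variables (for example os.environ).
    The token is not validated against Hugging Face.
    """
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        return "HF_TOKEN is not set in this shell"
    if token == PLACEHOLDER:
        return "HF_TOKEN still holds the placeholder value"
    return None


problem = token_problem(os.environ)
print(problem or "HF_TOKEN is set")
```

Run it from the shell that will later launch the download script, since an export made in a different terminal is not visible there.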

Some models (FLUX.1, HiDream-I1) require accepting the model license on Hugging Face before downloading. Visit each model page and click "Agree and access" if prompted.

Step 3
Clone the playbook and build the container

Clone the playbook repository and build the ComfyUI Docker image. The image is built on top of the NGC PyTorch container, which is already optimized for the GB300's ARM64 architecture.

git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-spark-playbooks/nvidia/station-comfyui

Build the container image:

docker build -t comfyui-gb300 -f assets/Dockerfile .

The build clones ComfyUI, installs dependencies (preserving the NGC-optimized PyTorch), and pre-installs custom nodes for video generation, ControlNet, and IP-Adapter. This takes approximately 5–10 minutes.

Step 4
Download models

This playbook uses models organized into three tiers. Download only what you need, or download everything.

Tier	Models	Disk space	Peak VRAM (approx.)	Workflows enabled
1 — Getting Started	FLUX.1 dev, Wan 2.1 T2V 14B	~70 GB	~80 GB (Wan 720p clip)	Text-to-image, text-to-video
2 — Intermediate	+ HiDream-I1, Wan 2.1 I2V, Cosmos-Predict2	~180 GB	~100 GB (FLUX→Wan two-model graph)	+ HiDream image gen, image-to-video, FLUX→Wan pipeline, Cosmos Video2World
3 — Advanced	+ HunyuanVideo, FLUX ControlNet (Canny)	~230 GB	~120 GB (Hunyuan 1080p / long clips)	+ 1080p video, ControlNet-guided generation

Peak VRAM depends on resolution, frame count, and precision; the values above are order-of-magnitude estimates for the default graphs in this playbook on a GB300 (252 GB HBM3e).
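To decide which tier fits before starting a long download, the disk figures in the table can be compared against the free space on /. This is a rough sketch only: the sizes are the approximate values copied from the table, and the reserve_gb margin is a hypothetical parameter that the playbook does not define.

```python
import shutil

# Approximate cumulative disk needs per tier, copied from the table above (GB).
TIER_DISK_GB = {1: 70, 2: 180, 3: 230}


def largest_tier_that_fits(free_gb, reserve_gb=0.0):
    """Return the highest tier whose models fit in `free_gb`, or 0 if none fit.

    `reserve_gb` is an optional, hypothetical safety margin for the Docker
    image and generated outputs.
    """
    usable = free_gb - reserve_gb
    fitting = [tier for tier, need in TIER_DISK_GB.items() if need <= usable]
    return max(fitting, default=0)


free_gb = shutil.disk_usage("/").free / 1e9  # decimal GB
print(f"{free_gb:.0f} GB free on /, largest tier that fits: {largest_tier_that_fits(free_gb)}")
```

A result of 0 means even Tier 1 would not fit and space should be freed first.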

Install the Hugging Face Hub CLI (provides the hf command) if you do not already have it. The CLI installs to ~/.local/bin/, which is not on the default non-interactive PATH, so add it before continuing:

pip3 install --break-system-packages huggingface-hub
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
export PATH="$HOME/.local/bin:$PATH"
hf --version   # confirms PATH is correct

Run the download script with your chosen tier (with no argument, it downloads all tiers):

# Download Tier 1 only (Getting Started):
bash assets/scripts/download-models.sh 1

# Download all tiers:
bash assets/scripts/download-models.sh

Model downloads can take 30–60 minutes depending on network speed. The script uses the Hugging Face Hub hf download command (from the huggingface_hub package). If a download fails, the script exits with an error and prints which file was expected — check your token, network, and that you have accepted gated model licenses on Hugging Face.

After Tier 1 completes, verify weights landed under models/:

ls -la ./models/diffusion_models/
ls -la ./models/text_encoders/ | head
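For a more compact view than the ls listings, the per-folder file counts and sizes under models/ can be summarized in a few lines of Python. This is an illustrative sketch; the helper names are hypothetical, and it makes no assumption about which file names the download script produces.

```python
from pathlib import Path


def summarize_weights(models_dir):
    """Map each immediate subdirectory of `models_dir` to (file_count, total_bytes)."""
    summary = {}
    for sub in sorted(Path(models_dir).iterdir()):
        if not sub.is_dir():
            continue
        files = [p for p in sub.rglob("*") if p.is_file()]
        summary[sub.name] = (len(files), sum(p.stat().st_size for p in files))
    return summary


def report(models_dir="models"):
    """Print one line per subdirectory and flag folders that are still empty."""
    for name, (count, size) in summarize_weights(models_dir).items():
        flag = "" if count else "  <- empty"
        print(f"{name:24s} {count:4d} files {size / 1e9:8.1f} GB{flag}")
```

Calling report("models") from the playbook root prints the summary; an empty diffusion_models or text_encoders folder after a Tier 1 download suggests the script did not finish.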

Step 5
Launch ComfyUI

Start the ComfyUI container with all model and output directories mounted as volumes. On DGX Station, identify the GB300 GPU index with nvidia-smi and use --gpus '"device=N"' to target it. If the GB300 is your only GPU, --gpus all also works.

# Find the GB300 device index (look for "GB300" in the Name column)
nvidia-smi --query-gpu=index,name --format=csv,noheader

The default --gpus '"device=0"' works on single-GPU stations where the GB300 is index 0. If nvidia-smi reports the GB300 at a different index (for example index 1 on dual-GPU stations with an RTX PRO 6000 + GB300), substitute that index in the command below.
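Picking the index by eye works, but the same lookup can be scripted. The sketch below parses the "index, name" lines that the query above prints. The sample name strings are illustrative assumptions, not captured output, and find_gpu_index is a hypothetical helper.

```python
import subprocess


def find_gpu_index(csv_text, needle="GB300"):
    """Return the index of the first GPU whose name contains `needle`, or None.

    `csv_text` is the output of
    `nvidia-smi --query-gpu=index,name --format=csv,noheader`,
    with one "index, name" pair per line.
    """
    for line in csv_text.splitlines():
        index, sep, name = line.partition(",")
        if sep and needle.lower() in name.lower():
            return int(index.strip())
    return None


def query_gpus():
    """Run the nvidia-smi query from this step and return its raw text output."""
    return subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout


# Illustrative sample only; the exact name strings on your station may differ.
sample = "0, NVIDIA RTX PRO 6000 Blackwell\n1, NVIDIA GB300\n"
print(find_gpu_index(sample))
```

On the station itself, find_gpu_index(query_gpus()) gives the N to use in --gpus '"device=N"'; a result of None means no GPU name matched and the needle should be adjusted.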

# device=0 by default; replace with the GB300 index from the command above
docker run -d \
  --name comfyui \
  --gpus '"device=0"' \
  --ipc host \
  --ulimit memlock=-1 \
  -p 8188:8188 \
  -v "$(pwd)/models:/opt/ComfyUI/models" \
  -v "$(pwd)/output:/opt/ComfyUI/output" \
  -v "$(pwd)/input:/opt/ComfyUI/input" \
  -v "$(pwd)/assets/workflows:/opt/ComfyUI/user/default/workflows" \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  comfyui-gb300

Check startup logs:

docker logs -f comfyui

Expected output includes custom node loading messages and:

To see the GUI go to: http://0.0.0.0:8188

Press Ctrl+C to exit the log view. Open a web browser and navigate to http://<STATION_IP>:8188 where <STATION_IP> is your DGX Station's IP address.
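For scripted setups, readiness can also be detected by polling the HTTP port instead of watching the logs. This is a minimal sketch under the assumption that the root URL starts answering at about the same time the "To see the GUI" line is printed; http_ok and wait_until are hypothetical helper names.

```python
import http.client
import time
import urllib.error
import urllib.request


def http_ok(url, timeout_s=2.0):
    """Return True if a GET to `url` answers with HTTP 200, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


def wait_until(probe, timeout_s=300.0, interval_s=2.0):
    """Call `probe()` repeatedly until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A typical call on the station would be wait_until(lambda: http_ok("http://127.0.0.1:8188/")). Separating the probe from the retry loop keeps the loop testable without a running server.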

NOTE

The startup logs include several benign warnings you can ignore:

aimdo: ... funchook_prepare(cuMemFree_v2) failed (NGC PyTorch's CUDA hooks tool falling back to a no-op)
urllib3 / charset_normalizer doesn't match a supported version
torchaudio missing (covered by the import-only stub; no playbook workflow uses the audio VAE)
DWPose: Onnxruntime not found ... switch to OpenCV with CPU device (aarch64 has no onnxruntime-gpu wheel; CPU preprocessing still works)
accelerate / GPTQModel / optimum / bitsandbytes not installed (reported by the HiDream sampler)

The real "ready" signal is the To see the GUI go to: ... line above; treat any other warning or error as suspect.

UI workflows vs API graphs (important)

ComfyUI uses two different JSON shapes:

Location	Format	Use
`assets/workflows/*.json` mounted at `user/default/workflows/`	UI workflow (has `"nodes"` and `"links"`)	Load in the web UI, edit in the canvas, then Queue Prompt
`assets/workflow_api/*.api.json` (on the host repo, not mounted into the default workflow folder)	API prompt graph (flat node ids → `class_type` / `inputs`)	`POST /prompt`, `curl`, automation

If you open an .api.json file with Load, the UI shows "Error: the workflow does not contain any nodes" — that is expected; those files are not UI workflows.
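The two shapes in the table can be told apart programmatically before a file is loaded or posted. The following is a heuristic sketch based only on the keys named in the table; classify_workflow is a hypothetical helper, and the example dictionaries are made up for illustration.

```python
import json


def classify_workflow(data):
    """Return "ui", "api", or "unknown" for a parsed ComfyUI JSON document.

    UI workflows carry top-level "nodes" and "links"; API prompt graphs map
    node ids to dicts with "class_type" and "inputs". Non-dict extras such as
    a "_comment" string are ignored.
    """
    if not isinstance(data, dict):
        return "unknown"
    if "nodes" in data and "links" in data:
        return "ui"
    node_like = [v for v in data.values() if isinstance(v, dict)]
    if node_like and all("class_type" in v and "inputs" in v for v in node_like):
        return "api"
    return "unknown"


def classify_file(path):
    """Load a JSON file and classify it."""
    with open(path, encoding="utf-8") as f:
        return classify_workflow(json.load(f))


ui_example = {"nodes": [], "links": []}
api_example = {"_comment": "demo", "6": {"class_type": "CLIPTextEncode", "inputs": {"text": "hi"}}}
print(classify_workflow(ui_example), classify_workflow(api_example))
```

A file classified as "api" belongs with POST /prompt, while a "ui" file is the kind the Load button expects.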

Optional — run the same graph via HTTP API (from the playbook root, with ComfyUI listening on port 8188). Strip any non-node keys (for example _comment in some API files), minify to one line, and POST:

PROMPT=$(python3 -c "import json; d=json.load(open('assets/workflow_api/flux-text-to-image.api.json')); print(json.dumps({k:v for k,v in d.items() if str(k).isdigit()}, separators=(',',':')))")
curl -sS http://127.0.0.1:8188/prompt \
  -X POST \
  -H "Content-Type: application/json" \
  -d "{\"prompt\":${PROMPT}}" | python3 -m json.tool

The response includes a prompt_id you can correlate with server logs and the output/ folder.
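One way to correlate a prompt_id with its files is to query the server's history for that id. The sketch below assumes the GET /history/<prompt_id> endpoint and the response layout used in ComfyUI's bundled API examples (an entry keyed by prompt id, with per-node "outputs" lists containing "filename" and "subfolder"). Verify both against your ComfyUI version; the file name in the usage example is made up.

```python
import json
import urllib.request


def fetch_history(base_url, prompt_id, timeout_s=10.0):
    """GET {base_url}/history/{prompt_id} and return the parsed JSON dict."""
    url = f"{base_url}/history/{prompt_id}"
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return json.load(resp)


def output_files(history, prompt_id):
    """List (node_id, filename, subfolder) for every file recorded for the prompt.

    Returns an empty list when the history has no entry for the prompt yet,
    which is assumed to mean it is still queued or running.
    """
    entry = history.get(prompt_id)
    if not entry:
        return []
    found = []
    for node_id, node_output in entry.get("outputs", {}).items():
        for items in node_output.values():
            if not isinstance(items, list):
                continue
            for item in items:
                if isinstance(item, dict) and "filename" in item:
                    found.append((node_id, item["filename"], item.get("subfolder", "")))
    return found
```

For example, output_files(fetch_history("http://127.0.0.1:8188", prompt_id), prompt_id) can be called in a loop until it returns a non-empty list.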

ComfyUI interface orientation:

Canvas — The central area where you build and view node workflows.
Queue Prompt — The button (top right) that runs the current workflow.
Load — Opens a UI workflow such as flux-text-to-image.json or wan-text-to-video.json; the files are listed in the workflow sidebar under the mounted folder.
Manager — Access ComfyUI-Manager to install additional custom nodes.

Step 6
Image generation with FLUX.1 dev

Requires: Tier 1 models

Load the pre-built FLUX text-to-image workflow. In ComfyUI, click Load and select flux-text-to-image.json (UI format). Do not use the *.api.json files in assets/workflow_api/ with Load — they are for the HTTP API only.

What this workflow does:

The workflow connects these nodes in sequence:

UNETLoader — Loads the FLUX.1 dev 12B transformer (~24 GB in bf16) with weight_dtype=default.
DualCLIPLoader — Loads CLIP-L and T5-XXL text encoders that convert your prompt into conditioning vectors.
CLIP Text Encode — Takes your text prompt and produces positive conditioning.
FluxGuidance — Applies FLUX's guidance value (default 3.5) to the conditioning.
EmptySD3LatentImage — Creates a blank latent at your chosen resolution (default: 1024x1024).
ModelSamplingFlux + BasicScheduler + KSamplerSelect + BasicGuider + RandomNoise — Configure FLUX's flow-matching schedule (20 steps, euler/simple).
SamplerCustomAdvanced — The diffusion sampling loop that denoises the latent.
VAE Decode — Converts the latent back into a pixel image.
Save Image — Writes the result to the output/ directory.

Try it:

Find the CLIP Text Encode node and enter a prompt, for example: A majestic snow leopard resting on a cliff at golden hour, photorealistic, 8k detail
Click Queue Prompt.
The image generates in approximately 15–30 seconds. Results appear in the output/ directory and in the preview node.

Experiment with different prompts, resolutions (512x512 up to 2048x2048), and step counts. FLUX.1 dev produces high-quality results even at 20 steps.
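The same experiments can be scripted against the API graph by editing node inputs before posting. This sketch assumes the graph uses the CLIPTextEncode and EmptySD3LatentImage node classes listed above, with their usual text, width, and height inputs; with_prompt is a hypothetical helper and the node ids in the example are invented.

```python
import copy


def with_prompt(graph, text, width=None, height=None):
    """Return a copy of an API prompt graph with a new prompt and optional size.

    Sets inputs["text"] on every CLIPTextEncode node (so a graph with a
    separate negative prompt would need finer targeting) and, when given,
    width/height on every EmptySD3LatentImage node.
    """
    new_graph = copy.deepcopy(graph)
    for node in new_graph.values():
        if not isinstance(node, dict):
            continue  # skip non-node keys such as "_comment"
        class_type = node.get("class_type")
        if class_type == "CLIPTextEncode":
            node.setdefault("inputs", {})["text"] = text
        elif class_type == "EmptySD3LatentImage":
            inputs = node.setdefault("inputs", {})
            if width is not None:
                inputs["width"] = width
            if height is not None:
                inputs["height"] = height
    return new_graph


# Invented two-node fragment, only to show the effect of the helper.
demo = {
    "6": {"class_type": "CLIPTextEncode", "inputs": {"text": "old prompt"}},
    "27": {"class_type": "EmptySD3LatentImage", "inputs": {"width": 1024, "height": 1024, "batch_size": 1}},
}
edited = with_prompt(demo, "A snow leopard at golden hour", width=512, height=512)
print(edited["6"]["inputs"]["text"], edited["27"]["inputs"]["width"])
```

Working on a deep copy keeps the loaded graph unchanged, so several variations can be generated from one file.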

Step 7
Video generation with Wan 2.1

Requires: Tier 1 models

Load wan-text-to-video.json from the workflow browser.

What this workflow does:

Load Diffusion Model — Loads the Wan 2.1 T2V 14B model (~28 GB in bf16).
CLIPLoader — Loads the UMT5-XXL text encoder for Wan.
CLIP Text Encode — Encodes your video description prompt.
EmptyHunyuanLatentVideo — Creates a blank video latent (default: 720p, 81 frames at ~16 fps ≈ 5 seconds). Wan reuses this latent format.
KSampler — Diffusion sampling over the video latent. This is the slowest step — expect 3–5 minutes for a 5-second clip on the GB300.
VAE Decode — Converts latents to video frames.
SaveAnimatedWEBP — Encodes frames into an animated WEBP file.

Try it:

Enter a prompt: A drone shot flying over a misty mountain forest at sunrise, cinematic
Click Queue Prompt.
Generation takes 3–10 minutes at 720p with 81 frames. Monitor GPU memory with nvidia-smi in another terminal — the 14B model at 720p uses approximately 65–80 GB of the GB300's 252 GB HBM3e.
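The output .webp (animated WEBP from SaveAnimatedWEBP) appears in the output/ directory. To convert to MP4, try ffmpeg -i output/wan_t2v_output_00001_.webp output/wan_t2v_output.mp4. Some ffmpeg builds cannot decode animated WebP; if the command fails to read the input, extract the frames with another tool first.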
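To log memory use during a run rather than reading the nvidia-smi table by eye, the query form of nvidia-smi can be parsed. The sketch below assumes the output of nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits, which reports values in MiB. The sample line is illustrative, not a measurement.

```python
MIB_TO_GB = 1024 * 1024 / 1e9  # nvidia-smi reports MiB; convert to decimal GB


def parse_memory(csv_text):
    """Parse "index, used, total" lines into {index: (used_gb, total_gb)}.

    Lines that do not contain three integer fields (for example "[N/A]"
    values) are skipped.
    """
    usage = {}
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue
        try:
            index, used_mib, total_mib = (int(p) for p in parts)
        except ValueError:
            continue
        usage[index] = (used_mib * MIB_TO_GB, total_mib * MIB_TO_GB)
    return usage


# Illustrative sample line only.
for index, (used, total) in parse_memory("0, 70000, 240000\n").items():
    print(f"GPU {index}: {used:.1f} GB used of {total:.1f} GB")
```

Feeding the function fresh nvidia-smi output every few seconds gives a simple peak-memory log for a generation run.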

Tips:

Reduce frame count (e.g., 49 frames ≈ 3 seconds) for faster iteration.
Wan 2.1 responds well to cinematic, descriptive prompts with camera movement descriptions.
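The frame counts quoted in this step follow from simple arithmetic at roughly 16 fps. The sketch below converts between frames and seconds. The snapping to counts of the form 4k + 1 is an assumption about Wan 2.1's temporal compression that the playbook does not state, although its own defaults (81 and 49) both fit that pattern.

```python
def clip_seconds(frames, fps=16.0):
    """Duration in seconds of a clip with `frames` frames at `fps`."""
    return frames / fps


def frames_for(seconds, fps=16.0):
    """Nearest frame count of the form 4k + 1 for a target duration (assumed constraint)."""
    k = max(0, round((seconds * fps - 1) / 4))
    return 4 * k + 1


print(clip_seconds(81), clip_seconds(49))  # 5.0625 3.0625
print(frames_for(5), frames_for(3))        # 81 49
```

Since sampling time grows with frame count, choosing the target duration first and deriving the frame count from it keeps iteration times predictable.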

Step 8
Intermediate workflows

Requires: Tier 2 models

This step introduces four additional workflows. Each builds on the basics from Steps 6–7.

HiDream-I1 image generation

Load hidream-text-to-image.json.

HiDream-I1 Full is a 17B parameter image model that uses four text encoders — CLIP-L, CLIP-G, T5-XXL, and Llama-3.1-8B-Instruct. The Llama encoder gives it exceptional prompt understanding, especially for complex or nuanced descriptions.

The full pipeline uses approximately 60–65 GB in bf16, well within the GB300's capacity but more than most GPUs can hold.

Try it: Use a detailed, complex prompt to see the difference from FLUX — for example: An astronaut riding a horse on Mars, with Earth visible in the sky, oil painting style by Rembrandt, dramatic chiaroscuro lighting

Wan 2.1 image-to-video

Load wan-image-to-video.json.

This workflow takes an input image and animates it into a video clip. Place your source image in the input/ directory before running.

The LoadImage node reads from input/.
The Wan 2.1 I2V 14B model generates motion that is consistent with the source image.

Try it: Generate an image with FLUX first (Step 6), copy it from output/ to input/, then animate it.
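The copy from output/ to input/ can be scripted so that the most recent FLUX result is always the one that gets animated. This is a minimal sketch: copy_latest is a hypothetical helper, and the *.png pattern assumes the Save Image node wrote PNG files. If Docker created input/ as a root-owned directory, the copy may need elevated permissions.

```python
import shutil
from pathlib import Path


def copy_latest(src_dir="output", dst_dir="input", pattern="*.png"):
    """Copy the most recently modified file matching `pattern` into dst_dir.

    Returns the destination path, or None if nothing matched.
    """
    candidates = [p for p in Path(src_dir).glob(pattern) if p.is_file()]
    if not candidates:
        return None
    newest = max(candidates, key=lambda p: p.stat().st_mtime)
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(newest, Path(dst_dir) / newest.name))
```

Running copy_latest() from the playbook root returns the new path, whose file name can then be selected in the LoadImage node.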

FLUX → Wan combined pipeline

Load flux-to-wan-pipeline.json.

This workflow chains two models in a single graph:

FLUX.1 dev generates a high-quality still image from your text prompt.
The image is passed directly to Wan 2.1 I2V 14B, which animates it into a video.

This avoids manually moving files between workflows. Both models load into GPU memory simultaneously (~95 GB total in bf16).

Cosmos-Predict2 Video2World

Load cosmos-video2world.json.

Cosmos-Predict2 14B is NVIDIA's world foundation model for Video2World generation. It takes an input image and generates a physically plausible video extending from that scene. Place your source image in the input/ directory before running.

The Cosmos VAE is extremely efficient — it can encode/decode 1280x704 at 121 frames without tiling.

Try it: Use an image from a previous FLUX generation as the start frame, with a prompt describing the motion: A red ball rolling down a wooden ramp and bouncing off a wall, physics simulation, realistic lighting

Step 9
Advanced workflows

Requires: Tier 3 models

HunyuanVideo 1080p generation

Load hunyuan-1080p-video.json.

This workflow is the clearest showcase of the GB300's memory capacity. HunyuanVideo's 13B model generating at 1080p resolution uses approximately 100–120 GB of VRAM, which is beyond consumer and professional GPUs but well within the GB300's 252 GB.

Default: 1920x1056, 49 frames (~3 seconds). Height must be divisible by 16 for HunyuanVideo's latent space, which 1080 is not, so the default uses 1056 (66 × 16).
Generation time: 2–5 minutes for 49 frames, longer for higher frame counts.
Monitor with nvidia-smi — you should see 100+ GB GPU memory usage.
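When trying other resolutions, the divisibility rule mentioned above is easy to check before queuing a long run. The sketch below tests only the multiple-of-16 rule stated in this step; is_valid_dim and snap_down are hypothetical helpers. Note that 1072 also passes the check, so the default of 1056 may reflect a further constraint or preference that this step does not spell out.

```python
def is_valid_dim(value, multiple=16):
    """True if `value` is a positive multiple of `multiple`."""
    return value > 0 and value % multiple == 0


def snap_down(value, multiple=16):
    """Largest multiple of `multiple` that does not exceed `value`."""
    return (value // multiple) * multiple


for height in (1080, 1072, 1056):
    print(height, is_valid_dim(height))

# 1056 is also a multiple of 32, which 1072 is not; whether that matters
# for this workflow is not stated in the playbook.
print(snap_down(1080), 1056 % 32 == 0, 1072 % 32 == 0)
```

The default width of 1920 already satisfies the rule (120 × 16), so only the height needs adjusting relative to standard 1080p.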

Try it: A time-lapse of cherry blossoms falling in a Japanese garden with a koi pond, 4K cinematic

ControlNet with FLUX

Load flux-controlnet.json.

ControlNet lets you guide image generation with structural conditioning — edges, depth maps, or pose skeletons extracted from a reference image.

Place a reference image in input/.
The Canny Edge Preprocessor extracts edge structure from the reference.
The FLUX.1 Canny Dev model (a full FLUX variant fine-tuned for canny conditioning) generates an image following that structure while applying the text prompt's style and content.
Both the preprocessed canny image and the final output are saved for comparison.

Use cases: Architectural visualization, consistent character poses, style transfer while preserving composition.

Step 10
Cleanup

Stop and remove the ComfyUI container:

docker stop comfyui
docker rm comfyui

NOTE

Files in output/ and models/ are written by the container as root, so removing them from the host shell needs sudo (e.g. sudo rm -rf models/). To avoid this in future runs, add --user "$(id -u):$(id -g)" to the docker run command in Step 5 — note that this requires the host UID to have write access to all mounted directories.

Optionally remove the Docker image:

docker rmi comfyui-gb300

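Optionally remove downloaded models to reclaim disk space (prefix the command with sudo if the directory contains root-owned files, as described in the note above):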

rm -rf models/

Generated images and videos in output/ are preserved on the host regardless of container state.
