Fine-tune and benchmark NVIDIA's GR00T N1.6 robotics foundation model on DGX Station
NVIDIA Isaac GR00T N1.6 is a 3-billion-parameter open vision-language-action (VLA) foundation model for generalist humanoid robot skills. It combines a Cosmos-family vision-language backbone with a 32-layer Diffusion Transformer (DiT) action head that denoises continuous robot actions from multimodal input — language instructions and camera images. The model is pre-trained on a large mixture of robot demonstration data, then adapted to specific embodiments and tasks through fine-tuning.
High-level architecture (VLM + DiT action head), as in the upstream Isaac GR00T repo:
Source: NVIDIA Isaac GR00T — media/GR00T-reference-arch-diagram.png. If the local image above is missing, the upstream copy is at https://raw.githubusercontent.com/NVIDIA/Isaac-GR00T/n1.6-release/media/GR00T-reference-arch-diagram.png.
In this playbook you will fine-tune GR00T N1.6 on the LIBERO Spatial benchmark on a DGX Station with GB300 (large unified memory). That setup supports a high global batch size (128) on a single GPU, which improves training throughput compared to typical 24–80 GB consumer or datacenter GPUs.
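The batch-size point above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming the usual relation global batch = per-device batch × number of GPUs × gradient-accumulation steps. The per-device batch of 16 for a smaller GPU is a hypothetical figure chosen for illustration, not a measured limit, and `accum_steps` is a helper name introduced here.

```shell
#!/usr/bin/env bash
# accum_steps GLOBAL PER_DEVICE NUM_GPUS
# Prints the number of gradient-accumulation steps needed to reach the
# GLOBAL batch size, rounded up. Helper name is hypothetical.
accum_steps() {
  local global=$1 per_device=$2 gpus=$3
  local per_step=$(( per_device * gpus ))
  echo $(( (global + per_step - 1) / per_step ))
}

# Single GPU that fits the whole global batch of 128 (this playbook's setup).
echo "per-device 128: $(accum_steps 128 128 1) step(s) per optimizer update"

# Hypothetical smaller GPU that only fits 16 samples per forward/backward pass.
echo "per-device 16:  $(accum_steps 128 16 1) step(s) per optimizer update"
```

When the whole global batch fits on one device, each optimizer update needs a single forward/backward pass instead of several accumulated ones, which is where the throughput benefit comes from.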
LIBERO Spatial is part of the LIBERO suite of simulated tabletop manipulation benchmarks. The spatial split emphasizes where objects need to be placed: tasks such as putting a bowl on a stove burner vs a plate, placing utensils in a mug vs next to it, or moving objects to left/right/front targets on the table. Episodes include third-person RGB video, proprioceptive state, language instructions, and continuous end-effector actions in a consistent LeRobot v2 layout. Knowing this task structure and data layout helps when you read training logs or open-loop evaluation plots.
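As a quick sanity check on a downloaded dataset, the sketch below tests for the top-level entries that LeRobot v2 datasets commonly contain (`meta/info.json`, `data/`, `videos/`). This is an assumption about the general LeRobot v2 convention, not a statement of what the LIBERO download in this playbook contains. `check_lerobot_layout` and the `DATASET_DIR` path are hypothetical names introduced here.

```shell
#!/usr/bin/env bash
# check_lerobot_layout DIR
# Succeeds if DIR contains the assumed LeRobot v2 top-level entries.
# The list of entries is an assumption; adjust it to the real dataset.
check_lerobot_layout() {
  local dir=$1 status=0 path
  for path in meta/info.json data videos; do
    if [ -e "$dir/$path" ]; then
      echo "ok       $path"
    else
      echo "missing  $path"
      status=1
    fi
  done
  return "$status"
}

# DATASET_DIR is a hypothetical placeholder; point it at your download.
DATASET_DIR=${DATASET_DIR:-./path/to/libero_spatial}
check_lerobot_layout "$DATASET_DIR" \
  || echo "Layout differs from the assumed LeRobot v2 convention"
```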
This playbook runs the default Isaac GR00T fine-tuning recipe from launch_finetune.py, which does not update the weights of the entire 3B model. In the stock configuration, training focuses on the action head (DiT) and the projector / adapter paths that map observations into the action model, with strong state dropout and color jitter so the policy leans on vision. Optional flags such as --tune-llm or --tune-visual (mentioned under Next steps) trade compute and memory for updating more of the backbone. LoRA is not the default here; if your team uses LoRA or other PEFT variants, treat that as a separate configuration branch from this playbook.
DGX Station is a deskside AI system built for large-memory GPU training and inference (this playbook targets GB300 with 284 GB HBM3e). Beyond robotics, the same class of machine supports large-model fine-tuning, RAG serving, multi-modal training, and CUDA research where single-GPU memory and bandwidth dominate. For GR00T, the headline benefit is fitting much larger batch sizes per GPU than on smaller cards, which stabilizes gradients and improves samples per second when the data pipeline keeps up.
- n1.6-release branch of Isaac GR00T so commands, embodiment tags, and demo_data/ match GR00T N1.6
- uv (project-local .venv) and understand what the optional install_deps.sh script changes on the system
- get_frames_by_indices patch when torchcodec is unavailable so LIBERO AV1 video decoding does not stall on an ffmpeg subprocess fallback
- source .venv/bin/activate
- sudo for system packages (or use the documented user-space alternative)
- nvcc --version should show CUDA 12.8+ (often already under /usr/local/cuda on DGX images)
- git lfs version: LFS is required for some demo assets and submodules; install with sudo apt-get install -y git-lfs then git lfs install if missing
- .venv, checkpoints, and the LIBERO download
- scripts/deployment/dgpu/install_deps.sh performs system-level apt operations and may install the CUDA 12.8 toolkit if /usr/local/cuda is absent (see Instructions). Model download requires Hugging Face authentication.
- Isaac-GR00T directory and optionally rm -rf ~/.local/share/uv if you want to reclaim uv caches. Reverting apt-installed packages is a separate admin task; the playbook does not uninstall them automatically.
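The tool checks listed above can be bundled into one read-only script. This is a minimal sketch, not part of the upstream repository: `version_ge` and `cuda_release` are helper names introduced here, the version comparison assumes GNU `sort -V` is available, and the parsing assumes `nvcc --version` prints a line containing `release <major>.<minor>`. It only reports status and changes nothing on the system.

```shell
#!/usr/bin/env bash
# version_ge A B: succeed if version A >= version B (relies on GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# cuda_release: read `nvcc --version` text on stdin, print e.g. "12.8".
cuda_release() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# CUDA toolkit: the playbook expects 12.8 or newer.
if command -v nvcc >/dev/null 2>&1; then
  ver=$(nvcc --version | cuda_release)
  if version_ge "$ver" 12.8; then
    echo "CUDA $ver: ok"
  else
    echo "CUDA '$ver': older than 12.8 or not parsed"
  fi
else
  echo "nvcc not found on PATH (check /usr/local/cuda/bin)"
fi

# Git LFS: needed for some demo assets and submodules.
if git lfs version >/dev/null 2>&1; then
  echo "git-lfs: ok"
else
  echo "git-lfs missing: sudo apt-get install -y git-lfs, then git lfs install"
fi

# uv: used to create the project-local .venv.
if command -v uv >/dev/null 2>&1; then
  echo "uv: ok"
else
  echo "uv not found on PATH"
fi
```

Running it before cloning gives a quick view of which prerequisites still need attention; none of the checks require sudo.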