Copyright © 2026 NVIDIA Corporation
Fine-tune and benchmark NVIDIA's GR00T N1.6 robotics foundation model on DGX Station
NVIDIA Isaac GR00T N1.6 is a 3-billion-parameter open vision-language-action (VLA) foundation model for generalist humanoid robot skills. It combines a Cosmos-Reason-2B vision-language backbone with a 32-layer Diffusion Transformer (DiT) action head that denoises continuous robot actions from multimodal input — language instructions and camera images. The model is pre-trained on over 10,000 hours of robot demonstration data spanning bimanual arms, semi-humanoid platforms, and full humanoids, then adapted to specific embodiments and tasks through fine-tuning.
In this playbook you will fine-tune GR00T N1.6 on the LIBERO Spatial benchmark — a manipulation task suite that tests spatial reasoning with a Panda robot arm. DGX Station's GB300 GPU with 284 GB of HBM3e memory enables a per-device batch size of 128, far exceeding the typical 32–64 used on smaller GPUs, which accelerates convergence and improves training throughput.
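To make the batch-size comparison concrete, the sketch below counts optimizer steps per epoch at the batch sizes mentioned above. The dataset size is a hypothetical placeholder, not the actual LIBERO Spatial sample count, and fewer steps per epoch does not by itself guarantee faster convergence.

```python
import math

# Hypothetical number of training samples; NOT the real LIBERO Spatial size.
HYPOTHETICAL_NUM_SAMPLES = 10_000


def steps_per_epoch(num_samples, per_device_batch_size, num_devices=1):
    """Optimizer steps needed to see every sample once (no gradient accumulation)."""
    global_batch = per_device_batch_size * num_devices
    return math.ceil(num_samples / global_batch)


if __name__ == "__main__":
    for batch_size in (32, 64, 128):
        steps = steps_per_epoch(HYPOTHETICAL_NUM_SAMPLES, batch_size)
        print(f"batch size {batch_size:>3}: {steps} steps per epoch")
```

With these placeholder numbers, moving from a batch size of 32 to 128 cuts the steps per epoch roughly fourfold, which is where the throughput gain comes from when the GPU can hold the larger batch in memory.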
uv (fast, reproducible Python packaging)
nvcc --version should show CUDA 12.8+
git --version
uv sync installs packages into a project-local .venv
Delete the Isaac-GR00T directory to restore state. No system-level changes are made.
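The prerequisite checks above can also be scripted. The following is a minimal sketch, not part of the playbook itself: the helper names `parse_cuda_release` and `check_prerequisites` are hypothetical, and it assumes `nvcc --version` prints a line containing `release X.Y`, as current CUDA toolkits do.

```python
import re
import shutil
import subprocess

MIN_CUDA = (12, 8)  # minimum CUDA version stated in the prerequisites


def parse_cuda_release(nvcc_output):
    """Extract (major, minor) from the 'release X.Y' text in nvcc --version output."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def check_prerequisites():
    """Return a dict mapping each check name to True (passed) or False (failed)."""
    # A tool counts as present if it can be found on PATH.
    results = {tool: shutil.which(tool) is not None for tool in ("uv", "git", "nvcc")}
    cuda_ok = False
    if results["nvcc"]:
        completed = subprocess.run(
            ["nvcc", "--version"], capture_output=True, text=True
        )
        version = parse_cuda_release(completed.stdout)
        cuda_ok = version is not None and version >= MIN_CUDA
    results["cuda>=12.8"] = cuda_ok
    return results


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

Running the script before cloning the repository reports which tools are missing, so any gaps can be fixed before `uv sync` is attempted.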