Fine-tune and benchmark NVIDIA's GR00T N1.6 robotics foundation model on DGX Station
If git clone fails with errors about Git LFS or missing pointer files, install and initialize LFS, then remove any partial Isaac-GR00T directory and clone again:
sudo apt-get update
sudo apt-get install -y git-lfs
git lfs install
n1.6-release
The main branch tracks ongoing development (for example newer GR00T milestones) and does not always match this N1.6 playbook. Embodiment tags such as GR1, paths like demo_data/gr1.PickNPlace, and tutorial scripts are aligned with the n1.6-release branch.
git clone --recurse-submodules https://github.com/NVIDIA/Isaac-GR00T
cd Isaac-GR00T
git fetch origin
git checkout n1.6-release
git submodule update --init --recursive
Option A: install_deps.sh (matches upstream docs; uses sudo)
This script is the supported path. It may make system-level changes:
- Runs apt-get update and installs ffmpeg and libaio-dev
- If /usr/local/cuda is missing, adds the NVIDIA CUDA apt repository and installs cuda-toolkit-12-8
- Installs uv into your user account if needed, then runs uv sync and uv pip install -e . into the project .venv
- Builds torchcodec from source into .venv
I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 bash scripts/deployment/dgpu/install_deps.sh
Option B (without install_deps.sh)
Use this only when CUDA 12.8+ is already installed, system ffmpeg / libaio-dev are already present, and your policy forbids the script's apt or CUDA steps. From the Isaac-GR00T repo root, install uv if needed, then:
command -v uv >/dev/null || curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="/usr/local/cuda/bin:$HOME/.local/bin:$PATH"
export CUDA_HOME=/usr/local/cuda
uv sync
uv pip install -e .
You still need a working video backend for LIBERO (see Step 2). On aarch64, building torchcodec inside .venv without the script is possible but manual; see Troubleshooting.
IMPORTANT
PATH and CUDA_HOME matter on multi-toolkit hosts. If the system has both an old Ubuntu nvidia-cuda-toolkit package (/usr/bin/nvcc ≈ 12.0) and a current NVIDIA CUDA repo install (/usr/local/cuda-13.x/bin/nvcc), uv will pick whichever appears first on PATH. Putting /usr/local/cuda/bin first (and exporting CUDA_HOME) is required for flash-attn's source build to find the matching toolkit. Verify with nvcc --version after the export.
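The PATH ordering effect described above can be illustrated with a small sketch. It is a simplified stand-in for a shell lookup, not what uv or the flash-attn build actually runs; the helper name first_on_path and the two throwaway directories are hypothetical.

```python
import os
import tempfile


def first_on_path(name, path_value):
    """Return the first file called `name` in a PATH-style string, or None.

    Simplification: a real shell lookup also requires the execute bit.
    """
    for entry in path_value.split(os.pathsep):
        candidate = os.path.join(entry, name)
        if entry and os.path.isfile(candidate):
            return candidate
    return None


# Two throwaway directories stand in for /usr/bin and /usr/local/cuda/bin.
with tempfile.TemporaryDirectory() as root:
    distro_bin = os.path.join(root, "usr_bin")
    cuda_bin = os.path.join(root, "usr_local_cuda_bin")
    for directory in (distro_bin, cuda_bin):
        os.makedirs(directory)
        with open(os.path.join(directory, "nvcc"), "w") as handle:
            handle.write("placeholder\n")

    # Same two toolkits, different order: the first entry wins each time.
    distro_first = first_on_path("nvcc", os.pathsep.join([distro_bin, cuda_bin]))
    cuda_first = first_on_path("nvcc", os.pathsep.join([cuda_bin, distro_bin]))
    picked_distro = distro_first == os.path.join(distro_bin, "nvcc")
    picked_cuda = cuda_first == os.path.join(cuda_bin, "nvcc")

print(picked_distro, picked_cuda)
# On a real host, this shows which nvcc the current PATH would select.
print(first_on_path("nvcc", os.environ.get("PATH", "")))
```

Running the last line before and after the export is a quick way to see whether the ordering changed; nvcc --version remains the authoritative check.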
WARNING
flash-attn build on aarch64 takes ~2 hours from source. The upstream pyproject.toml only lists pre-built flash-attn==2.7.4.post1 wheels for x86_64; on aarch64 (Grace + GB300), uv sync falls back to compiling ~72 CUDA kernels from source. A faster route is to pin flash-attn==2.8.1 and reuse the GitHub release's prebuilt aarch64 wheel:
# In pyproject.toml under [project] dependencies:
"flash-attn==2.8.1",
# In [tool.uv.sources]:
flash-attn = [
{ url = "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.1/flash_attn-2.8.1+cu12torch2.10cxx11abiTRUE-cp312-cp312-linux_aarch64.whl",
marker = "sys_platform == 'linux' and platform_machine == 'aarch64' and python_version == '3.12'" },
]
With this pin, uv sync finishes in ~1 minute on aarch64 instead of ~2 hours. The wheel works against torch 2.10. Verified on GB300 + CUDA 13.1 in this playbook's validation run.
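Before pinning, it may help to check that the wheel's filename tags match the interpreter in .venv. The sketch below splits the filename from the URL above into its standard wheel fields; the helper names are hypothetical, and it assumes a five-field filename without the optional build tag.

```python
import platform
import sys

WHEEL = "flash_attn-2.8.1+cu12torch2.10cxx11abiTRUE-cp312-cp312-linux_aarch64.whl"


def parse_wheel_filename(filename):
    """Split a wheel filename into its fields (assumes no optional build tag)."""
    stem = filename[: -len(".whl")]
    name, version, python_tag, abi_tag, platform_tag = stem.split("-")
    return {
        "name": name,
        "version": version,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


def wheel_matches(tags, machine, py_version):
    """True when the CPython tag and the machine suffix both match."""
    expected_python = f"cp{py_version[0]}{py_version[1]}"
    return tags["python"] == expected_python and tags["platform"].endswith("_" + machine)


tags = parse_wheel_filename(WHEEL)
print(tags)
# Compare against the interpreter running this snippet.
print(wheel_matches(tags, platform.machine(), sys.version_info[:2]))
```

This mirrors the intent of the marker in [tool.uv.sources] (aarch64 and Python 3.12); it does not check the torch or CUDA versions encoded in the local version string.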
Activate the virtual environment:
source .venv/bin/activate
Verify GPU access:
CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_name(0))"
Expected output: NVIDIA GB300
NOTE
Examples in this playbook use CUDA_VISIBLE_DEVICES=0 because the GB300 is at index 0 on a single-GPU Station. On a multi-GPU Station (for example RTX PRO 6000 + GB300), the GB300 may be at a different index — run nvidia-smi --query-gpu=index,name --format=csv,noheader, find the GB300 row, and substitute that index everywhere CUDA_VISIBLE_DEVICES=0 appears below.
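Finding the GB300 row can be scripted. The following sketch parses text in the index,name CSV shape that the nvidia-smi query above prints; the function name and the sample rows are illustrative assumptions, not captured output.

```python
def find_gpu_index(csv_text, name_fragment):
    """Return the index of the first GPU whose name contains name_fragment."""
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        index, _, name = line.partition(",")
        if name_fragment.lower() in name.lower():
            return int(index.strip())
    return None


# Illustrative sample for a two-GPU Station; real names may differ.
SAMPLE = "0, NVIDIA RTX PRO 6000\n1, NVIDIA GB300\n"

gb300_index = find_gpu_index(SAMPLE, "GB300")
print(f"CUDA_VISIBLE_DEVICES={gb300_index}")
```

To use it on a live system, feed it the output of the nvidia-smi command (for example via subprocess.run with capture_output=True and text=True).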
On many stacks torchcodec fails to import or build, so the video backend resolver falls back to pyav (the fallback order is torchcodec → decord → pyav → ffmpeg). On stock n1.6-release, the pyav backend can raise NotImplementedError from get_frames_by_indices. Without the patch applied below, training may appear hung: the GPU sits idle and no traceback is printed while ffmpeg spawns per-frame decode work on the CPU.
From the Isaac-GR00T repo root with n1.6-release checked out and .venv activated:
git apply /path/to/dgx-station-playbooks/nvidia/station-gr00t/assets/patches/001-pyav-get-frames-by-indices.patch
uv pip install av
If you copied nvidia/station-gr00t/assets/patches/ into the Isaac-GR00T root instead, use git apply assets/patches/001-pyav-get-frames-by-indices.patch.
Details and re-apply rules: nvidia/station-gr00t/assets/patches/README.md.
After patching, repeated log lines such as Video backend 'torchcodec' is not available, falling back to 'pyav' are expected and noisy but not fatal.
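To see which backend a host is likely to land on, the fallback order described in this step can be approximated as below. This is a sketch under stated assumptions, not the repository's resolver: the function names are hypothetical, and find_spec only reports that a package is installed, so a torchcodec that is installed but fails at import time would still look available here.

```python
import importlib.util
import shutil

# Order taken from this playbook's description of the fallback chain.
FALLBACK_ORDER = ("torchcodec", "decord", "pyav", "ffmpeg")


def default_probe(backend):
    """Best-effort availability check; does not import the package."""
    if backend == "ffmpeg":
        return shutil.which("ffmpeg") is not None
    module = "av" if backend == "pyav" else backend  # PyAV is imported as `av`
    return importlib.util.find_spec(module) is not None


def pick_backend(is_available=default_probe, order=FALLBACK_ORDER):
    """Return the first backend in `order` that the probe accepts."""
    for backend in order:
        if is_available(backend):
            return backend
    return None


print(pick_backend())
```

If this prints pyav on your machine, the patch in this step is the relevant path; if it prints torchcodec, confirm with a real import inside .venv.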
export HF_TOKEN="your_huggingface_token"
Get a token from https://huggingface.co/settings/tokens if you don't have one.
Download the LIBERO Spatial dataset and the GR00T N1.6 base model:
# Download LIBERO Spatial dataset (~2-3 GB)
huggingface-cli download \
--repo-type dataset IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot \
--local-dir examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/
# Copy the LIBERO modality config into the dataset's meta/ directory
cp examples/LIBERO/modality.json \
examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/
# Download GR00T N1.6 base model (~6 GB)
huggingface-cli download nvidia/GR00T-N1.6-3B
NOTE
HF cache permission errors: If huggingface-cli download fails with Permission denied: '/home/.../.cache/huggingface/hub/...', the cache directory was previously created by a Docker container running as root (common on shared dev boxes). Point HF at a user-owned cache for this run:
export HF_HOME=$HOME/hf_cache_gr00t
Transient xet-read-token 500 errors: Hugging Face's xet backend occasionally returns 500 Internal Server Error for dataset downloads. Disable it:
export HF_HUB_DISABLE_XET=1
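A quick way to decide whether the HF_HOME override is needed is to test whether the cache directory accepts a write. The sketch below is an assumption-laden helper (both function names are hypothetical); it only probes the top-level directory, so a root-owned subdirectory deeper in the cache can still fail.

```python
import os
import tempfile


def is_writable_dir(path):
    """True if a temporary file can be created inside `path`."""
    try:
        with tempfile.TemporaryFile(dir=path):
            return True
    except OSError:
        return False


def choose_hf_home(default_home, fallback_home):
    """Prefer the default cache location; fall back when it rejects writes."""
    return default_home if is_writable_dir(default_home) else fallback_home


# Demo: a fresh temp directory is writable, a missing one is not.
with tempfile.TemporaryDirectory() as scratch:
    writable_ok = is_writable_dir(scratch)
    missing = os.path.join(scratch, "does_not_exist")
    missing_ok = is_writable_dir(missing)
    chosen = choose_hf_home(missing, "fallback_cache")

print(writable_ok, missing_ok, chosen)
```

On a real host the two arguments would be the existing cache directory and a user-owned location such as the $HOME/hf_cache_gr00t path used above.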
Verify the dataset is ready:
ls examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/modality.json
Expected result: the command prints the path to modality.json and ls exits 0. That confirms the copied modality file sits next to the downloaded LeRobot dataset metadata.
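The ls check only proves the file exists. A slightly stricter check, sketched here, also confirms that it parses as JSON. The function name is hypothetical, and the demo writes an empty placeholder object rather than a real modality config.

```python
import json
import tempfile
from pathlib import Path


def modality_config_ready(dataset_root):
    """True if <dataset_root>/meta/modality.json exists and is valid JSON."""
    path = Path(dataset_root) / "meta" / "modality.json"
    if not path.is_file():
        return False
    try:
        json.loads(path.read_text())
    except json.JSONDecodeError:
        return False
    return True


# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    before_copy = modality_config_ready(root)
    meta = Path(root) / "meta"
    meta.mkdir()
    (meta / "modality.json").write_text("{}")  # placeholder content only
    after_copy = modality_config_ready(root)
    (meta / "modality.json").write_text("{not json")
    corrupted = modality_config_ready(root)

print(before_copy, after_copy, corrupted)
```

For this playbook the argument would be examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot, run from the Isaac-GR00T repo root.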
Confirm the GR00T N1.6 base model loads and produces actions using the GR1 demo shipped on n1.6-release:
TORCHDYNAMO_DISABLE=1 CUDA_VISIBLE_DEVICES=0 python scripts/deployment/standalone_inference_script.py \
--model-path nvidia/GR00T-N1.6-3B \
--dataset-path demo_data/gr1.PickNPlace \
--embodiment-tag GR1 \
--traj-ids 0 \
--inference-mode pytorch \
--action-horizon 8 \
--steps 32
TORCHDYNAMO_DISABLE=1 avoids torch.compile / Triton paths that can fail on GB300 with ptxas-blackwell fatal: Value 'sm_103a' is not defined for option 'gpu-name'. Keep it on all standalone_inference_script.py invocations in this playbook unless you have a Triton build that supports SM103.
You should see per-step timing output and no errors. This confirms the model, CUDA, and data pipeline work before a long fine-tuning run.
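If you launch the inference script from Python rather than the shell, the two environment variables used above can be set on the child process instead of being exported globally. A minimal sketch, with a hypothetical helper name:

```python
import os


def inference_env(gpu_index=0, base_env=None):
    """Copy an environment and add the two variables this playbook sets."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    env["TORCHDYNAMO_DISABLE"] = "1"  # skip torch.compile / Triton paths
    return env


env = inference_env(gpu_index=0, base_env={"PATH": "/usr/local/cuda/bin"})
print(env)
```

The result can be passed as the env argument of subprocess.run together with the same command-line arguments shown above.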
NOTE
The base model's pretrained processor does not include the LIBERO_PANDA embodiment configuration, so you cannot run this standalone script on the LIBERO dataset with the base checkpoint alone. The LIBERO modality config is registered during fine-tuning. That is expected — LIBERO is a post-training benchmark.
Fine-tune the base model on LIBERO Spatial. DGX Station's GB300 GPU, with 284 GB of HBM3e, allows a global batch size of 128 on a single device, several times what fits on a typical 80 GB GPU. Larger batches stabilize gradients and improve wall-clock throughput when the dataloader keeps the GPU fed.
CUDA_VISIBLE_DEVICES=0 python \
gr00t/experiment/launch_finetune.py \
--base-model-path nvidia/GR00T-N1.6-3B \
--dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \
--embodiment-tag LIBERO_PANDA \
--num-gpus 1 \
--output-dir output/libero_spatial_ft \
--save-steps 500 \
--save-total-limit 5 \
--max-steps 2000 \
--global-batch-size 128 \
--learning-rate 1e-4 \
--warmup-ratio 0.05 \
--weight-decay 1e-5 \
--state-dropout-prob 0.8 \
--color-jitter-params brightness 0.3 contrast 0.4 saturation 0.5 hue 0.08 \
--dataloader-num-workers 4
If GPU utilization stays near zero for many minutes while the process is alive, suspect video decoding (see Step 2 patch and Troubleshooting). You can try --dataloader-num-workers 8 if CPU cores are available.
Training runs for 2000 steps at batch size 128 and takes approximately 20–25 minutes on GB300 when torchcodec is the active video backend.
IMPORTANT
With the PyAV fallback (Step 2 patch + no torchcodec), expect ~5–6 s per step instead of <1 s, so 2000 steps takes roughly 3 hours (2.8–3.3 hours at that rate), and GPU utilization sits in the 3–30% range while CPU-side video decoding starves the GPU. To validate the workflow without the long wait, lower --max-steps (e.g. 100) and --save-steps (e.g. 50); loss should still drop visibly (validated drop 1.07 → 0.63 in 100 steps in this playbook's GB300 run). If you need full-throughput training, build torchcodec from source (Troubleshooting → "Video decoding errors") or run Option A, which builds it for you.
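The run-time estimates above are simple products of step count and seconds per step. The sketch below redoes that arithmetic so you can plug in your own measured step time; the function name is hypothetical and the step times are the approximate figures quoted in this section.

```python
def wall_clock_hours(steps, seconds_per_step):
    """Estimated training time in hours, ignoring checkpoint and startup cost."""
    return steps * seconds_per_step / 3600


pyav_low = wall_clock_hours(2000, 5)    # PyAV fallback, faster end
pyav_high = wall_clock_hours(2000, 6)   # PyAV fallback, slower end
smoke_minutes = wall_clock_hours(100, 6) * 60  # reduced --max-steps 100 run

print(f"full run on PyAV: {pyav_low:.2f} to {pyav_high:.2f} h")
print(f"100-step validation run: about {smoke_minutes:.0f} min")
```

At 5 to 6 s per step, the 100-step validation run finishes in about ten minutes, which is why it is the suggested way to check the workflow on the PyAV path.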
NOTE
This playbook uses 2000 steps to keep execution time under an hour. For production-quality results closer to the published 97.65% success rate on LIBERO Spatial, increase to 20,000 steps (--max-steps 20000). Published settings used batch size 640 across 8 GPUs, or 80 samples per GPU, so 128 on one GB300 exceeds the per-GPU batch in that reference.
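The per-GPU comparison in this note is a single division. A minimal sketch of the arithmetic, with a hypothetical helper name:

```python
def per_gpu_batch(global_batch, num_gpus):
    """Samples each GPU processes per step under an even split."""
    if global_batch % num_gpus:
        raise ValueError("global batch must divide evenly across GPUs")
    return global_batch // num_gpus


reference = per_gpu_batch(640, 8)   # published multi-GPU setting
playbook = per_gpu_batch(128, 1)    # this playbook on one GB300
samples_seen = 2000 * 128           # total samples in the 2000-step run

print(reference, playbook, samples_seen)
```

The same helper is handy if you later change --num-gpus or --global-batch-size and want to keep the per-device load comparable.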
What the training flags mean:
| Flag | Value | Purpose |
|---|---|---|
| --global-batch-size | 128 | Total samples per training step; enabled by GB300 memory. |
| --state-dropout-prob | 0.8 | Drops proprioceptive state 80% of the time so the model relies on vision. |
| --color-jitter-params | brightness/contrast/saturation/hue | Photometric augmentation for lighting robustness. |
| --warmup-ratio | 0.05 | Linear LR warmup over the first 5% of steps. |
| --save-steps | 500 | Checkpoint cadence under output/libero_spatial_ft/. |
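As an illustration of the --warmup-ratio row, the sketch below computes the learning rate during a linear warmup with this playbook's values (2000 steps, peak 1e-4, ratio 0.05). It is an assumption-based sketch, not the trainer's scheduler: the function name is hypothetical, and the rate is simply held at its peak after warmup because the post-warmup schedule is not described here.

```python
def warmup_lr(step, max_steps=2000, peak_lr=1e-4, warmup_ratio=0.05):
    """Learning rate at `step` under linear warmup; constant afterwards."""
    warmup_steps = round(max_steps * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr  # post-warmup decay is intentionally not modeled


for step in (0, 25, 50, 100, 2000):
    print(step, warmup_lr(step))
```

With these values the warmup covers the first 100 steps, so a 100-step validation run spends its entire duration below the peak learning rate.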
Monitor the Hugging Face Trainer loss in the terminal. Checkpoints land under output/libero_spatial_ft/.
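To predict which checkpoint directories will exist at the end of a run, the following sketch combines --max-steps, --save-steps, and --save-total-limit. It assumes checkpoints are written at multiples of the save interval and that only the most recent ones up to the limit are kept, which is how the Hugging Face Trainer treats save_total_limit; the function name is hypothetical.

```python
def retained_checkpoints(max_steps, save_steps, save_total_limit):
    """Checkpoint directory names expected to remain after training."""
    saved = list(range(save_steps, max_steps + 1, save_steps))
    kept = saved[-save_total_limit:]  # assumes save_total_limit >= 1
    return [f"checkpoint-{step}" for step in kept]


full_run = retained_checkpoints(2000, 500, 5)
short_run = retained_checkpoints(100, 50, 5)
long_run = retained_checkpoints(20000, 500, 5)

print(full_run)
print(short_run)
print(long_run)
```

For the default command, all four checkpoints fit under the limit, so checkpoint-2000 is available for the evaluation steps; with --max-steps 100 and --save-steps 50, substitute checkpoint-100 in those commands.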
Open-loop evaluation compares predicted actions to dataset ground truth and writes plots to /tmp/open_loop_eval/:
CUDA_VISIBLE_DEVICES=0 python gr00t/eval/open_loop_eval.py \
--dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \
--embodiment-tag LIBERO_PANDA \
--model-path output/libero_spatial_ft/checkpoint-2000/ \
--traj-ids 0 1 2 \
--action-horizon 16
How to read the run: the terminal prints per-trajectory MSE/MAE and averages. The JPEGs under /tmp/open_loop_eval/ overlay predicted vs ground-truth trajectories per action dimension (translation, rotation, gripper). Use them to confirm the policy tracks pick-and-place phases and gripper open/close timing on spatial tasks.
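For readers unfamiliar with the two metrics, here is a minimal sketch of MSE and MAE over a short action trajectory. It illustrates the definitions only; the evaluation script's exact aggregation (per dimension, per trajectory, any normalization) may differ, and the toy numbers are made up.

```python
def mse_mae(predicted, ground_truth):
    """Mean squared and mean absolute error over all steps and action dims."""
    errors = [
        p - g
        for pred_step, true_step in zip(predicted, ground_truth)
        for p, g in zip(pred_step, true_step)
    ]
    count = len(errors)
    mse = sum(e * e for e in errors) / count
    mae = sum(abs(e) for e in errors) / count
    return mse, mae


# Toy trajectory: 2 time steps, 2 action dimensions each.
predicted = [[0.0, 0.0], [1.0, 1.0]]
ground_truth = [[0.0, 1.0], [1.0, 3.0]]

mse, mae = mse_mae(predicted, ground_truth)
print(mse, mae)
```

Because MSE squares each error, a single large miss (such as a mistimed gripper toggle) raises it much more than MAE, which is worth keeping in mind when the two numbers move differently.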
TIP
At 2000 steps you should see clear improvement over a random policy; at 20,000 steps, published LIBERO Spatial success reaches 97.65% in closed-loop sim.
This step passes LIBERO Spatial observations through the fine-tuned checkpoint (the base model cannot run this embodiment). TORCHDYNAMO_DISABLE=1 is included for GB300:
TORCHDYNAMO_DISABLE=1 CUDA_VISIBLE_DEVICES=0 python scripts/deployment/standalone_inference_script.py \
--model-path output/libero_spatial_ft/checkpoint-2000/ \
--dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \
--embodiment-tag LIBERO_PANDA \
--traj-ids 0 \
--inference-mode pytorch \
--action-horizon 8
What to inspect: the script prints a timing breakdown (data processing, backbone, action head, end-to-end). Compare MSE/MAE and latency to Step 5's base-model smoke test. In eager mode (with TORCHDYNAMO_DISABLE=1), per-step latency on GB300 depends heavily on the torch + CUDA stack — expect ~3–4 s/step on torch 2.10 + cu130 in eager mode (validated in this playbook's run on a fine-tuned checkpoint-100); a compiled torch 2.7 + cu128 stack with Triton support for sm_103 can be much faster. Treat the "Backbone vs Action head" split as the more stable signal across stacks.
deactivate
cd ..
rm -rf Isaac-GR00T
Fine-tuned checkpoints under output/libero_spatial_ft/ are removed with the repo. Copy them elsewhere first if you want to keep them.
- Use --max-steps 20000 for stronger LIBERO Spatial alignment (~3.5 hours at the same throughput).
- Try the other LIBERO suites: libero_10_no_noops, libero_goal_no_noops, libero_object_no_noops from IPEC-COMMUNITY on Hugging Face.
- --tune-llm / --tune-visual raise memory use; probe batch size if you enable them.