Fine-tune and benchmark NVIDIA's GR00T N1.6 robotics foundation model on DGX Station
Clone the repository and run the dGPU install script. This uses uv for fast, reproducible dependency management and automatically detects the aarch64 architecture:
git clone --recurse-submodules https://github.com/NVIDIA/Isaac-GR00T
cd Isaac-GR00T
I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 bash scripts/deployment/dgpu/install_deps.sh
The install script:
- Installs system packages (ffmpeg, libaio-dev)
- Installs uv if not present
- Runs uv sync to create a .venv with all Python dependencies (PyTorch 2.7.1 + CUDA 12.8)
- Builds torchcodec from source on aarch64 (required for video decoding)
Activate the virtual environment:
source .venv/bin/activate
Verify GPU access:
CUDA_VISIBLE_DEVICES=1 python -c "import torch; print(torch.cuda.get_device_name(0))"
Expected output: NVIDIA GB300
NOTE
Replace CUDA_VISIBLE_DEVICES=1 with the index of your GB300 GPU throughout this playbook. Run nvidia-smi --query-gpu=index,name --format=csv,noheader to find it.
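If you prefer to pick the index programmatically, the lookup can be sketched in Python as below. This is a minimal sketch, not part of the repository: the helper names find_gpu_index and query_gpus are hypothetical, and the first GPU name in the sample text is made up for illustration.

```python
import subprocess
from typing import Optional


def find_gpu_index(csv_text: str, name_fragment: str = "GB300") -> Optional[int]:
    """Return the index of the first GPU whose name contains name_fragment."""
    for line in csv_text.strip().splitlines():
        index, _, name = line.partition(",")
        if name_fragment in name:
            return int(index.strip())
    return None


def query_gpus() -> str:
    """Run the nvidia-smi query from the note above and return its CSV output."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


# Illustrative text only (the first GPU name is invented for this example).
sample = "0, Some Other GPU\n1, NVIDIA GB300\n"
print(find_gpu_index(sample))  # prints 1
```

On a real system, call find_gpu_index(query_gpus()) and use the returned value for CUDA_VISIBLE_DEVICES.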
export HF_TOKEN="your_huggingface_token"
Get a token from https://huggingface.co/settings/tokens if you don't have one.
Download the LIBERO Spatial dataset and the GR00T N1.6 base model:
# Download LIBERO Spatial dataset (~2-3 GB)
huggingface-cli download \
--repo-type dataset IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot \
--local-dir examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/
# Copy the LIBERO modality config into the dataset's meta/ directory
cp examples/LIBERO/modality.json \
examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/
# Download GR00T N1.6 base model (~6 GB)
huggingface-cli download nvidia/GR00T-N1.6-3B
Verify the dataset is ready:
ls examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/modality.json
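The ls check only confirms that the file exists. A slightly stronger check, sketched below, also confirms that it parses as JSON. The helper name check_modality_config is hypothetical, and the sketch makes no assumption about which keys the file contains.

```python
import json
from pathlib import Path


def check_modality_config(dataset_dir: str):
    """Load <dataset_dir>/meta/modality.json, failing loudly if it is missing or invalid."""
    path = Path(dataset_dir) / "meta" / "modality.json"
    if not path.is_file():
        raise FileNotFoundError(f"modality config not found: {path}")
    with path.open() as f:
        # json.load raises json.JSONDecodeError if the copy was truncated or corrupted.
        return json.load(f)


# Usage, from the repository root:
# config = check_modality_config("examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot")
# print(sorted(config))  # top-level keys
```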
Confirm the GR00T N1.6 base model loads correctly and can produce actions. The base model ships with a GR1 demo dataset for quick verification:
CUDA_VISIBLE_DEVICES=1 python scripts/deployment/standalone_inference_script.py \
--model-path nvidia/GR00T-N1.6-3B \
--dataset-path demo_data/gr1.PickNPlace \
--embodiment-tag GR1 \
--traj-ids 0 \
--inference-mode pytorch \
--action-horizon 8 \
--steps 32
You should see per-step timing output and no errors. This confirms the model, CUDA, and data pipeline all work before committing to a longer fine-tuning run.
NOTE
If you see Triton/PTXAS errors about sm_103a, prepend TORCHDYNAMO_DISABLE=1 to the command. See Troubleshooting for details.
NOTE
The base model's pretrained processor does not include the LIBERO_PANDA embodiment configuration, so you cannot run evaluation directly on the LIBERO dataset with the base model. The LIBERO modality config is registered during fine-tuning. This is expected — LIBERO is a post-training benchmark.
Fine-tune the base model on the LIBERO Spatial dataset. DGX Station's GB300 GPU with 284 GB of HBM3e lets you use a batch size of 128, roughly 4x what fits on a typical 80 GB GPU. Larger batches generally give lower-variance gradient estimates and can speed up convergence per wall-clock hour.
CUDA_VISIBLE_DEVICES=1 python \
gr00t/experiment/launch_finetune.py \
--base-model-path nvidia/GR00T-N1.6-3B \
--dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \
--embodiment-tag LIBERO_PANDA \
--num-gpus 1 \
--output-dir output/libero_spatial_ft \
--save-steps 500 \
--save-total-limit 5 \
--max-steps 2000 \
--global-batch-size 128 \
--learning-rate 1e-4 \
--warmup-ratio 0.05 \
--weight-decay 1e-5 \
--state-dropout-prob 0.8 \
--color-jitter-params brightness 0.3 contrast 0.4 saturation 0.5 hue 0.08 \
--dataloader-num-workers 4
Training runs for 2000 steps at batch size 128 and takes approximately 20–25 minutes on the GB300.
NOTE
This playbook uses 2000 steps to keep execution time under an hour. For production-quality results matching the published 97.65% success rate on LIBERO Spatial, set --max-steps 20000. The published results used a global batch size of 640 across 8 GPUs (80 per GPU), so batch size 128 on a single GB300 exceeds the per-GPU batch size of the published runs, although the global batch size is 5x smaller.
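The batch-size comparison in the note is simple arithmetic, sketched here with the numbers quoted above. The helper names are hypothetical and only illustrate the calculation.

```python
def per_gpu_batch(global_batch: int, num_gpus: int) -> int:
    """Samples each GPU processes per step when the global batch is split evenly."""
    if global_batch % num_gpus != 0:
        raise ValueError("global batch must divide evenly across GPUs")
    return global_batch // num_gpus


def samples_seen(global_batch: int, steps: int) -> int:
    """Total training samples consumed over a run."""
    return global_batch * steps


published = per_gpu_batch(640, 8)  # published setup: 640 across 8 GPUs
playbook = per_gpu_batch(128, 1)   # this playbook: 128 on one GB300
print(published, playbook)         # prints 80 128

print(samples_seen(128, 2000))     # prints 256000
print(samples_seen(128, 20000))    # prints 2560000
```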
What the training flags mean:
| Flag | Value | Purpose |
|---|---|---|
| --global-batch-size | 128 | Total samples per training step. GB300's 284 GB HBM3e makes this possible on a single GPU. |
| --state-dropout-prob | 0.8 | Drops state input 80% of the time during training, forcing the model to rely on vision. Improves generalization. |
| --color-jitter-params | brightness/contrast/saturation/hue | Randomly perturbs image colors during training for robustness to lighting variation. |
| --warmup-ratio | 0.05 | Linearly ramps learning rate from 0 to 1e-4 over the first 5% of steps (100 steps). |
| --save-steps | 500 | Saves a checkpoint every 500 steps. |
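The warmup row can be made concrete with a short sketch of the linear ramp. It models only the warmup phase described in the table. How the learning rate decays afterwards is not stated here, so the function simply holds the peak value after warmup, which is an assumption made for illustration.

```python
def warmup_steps(max_steps: int = 2000, warmup_ratio: float = 0.05) -> int:
    """Number of steps spent ramping the learning rate."""
    return round(max_steps * warmup_ratio)


def lr_during_warmup(step: int, peak_lr: float = 1e-4,
                     max_steps: int = 2000, warmup_ratio: float = 0.05) -> float:
    """Linear ramp from 0 to peak_lr; holds peak_lr afterwards (decay not modeled)."""
    n = warmup_steps(max_steps, warmup_ratio)
    if n == 0 or step >= n:
        return peak_lr
    return peak_lr * step / n


print(warmup_steps())                    # prints 100
print(f"{lr_during_warmup(0):.1e}")      # prints 0.0e+00
print(f"{lr_during_warmup(50):.1e}")     # prints 5.0e-05
print(f"{lr_during_warmup(100):.1e}")    # prints 1.0e-04
```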
Monitor the training loss in the terminal. The Hugging Face Trainer logs progress at each step; look for the loss field decreasing over time. Checkpoints are saved every 500 steps to output/libero_spatial_ft/.
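To inspect the loss after the fact rather than in the terminal, one option is to read the trainer_state.json file that the Hugging Face Trainer normally writes into each checkpoint directory. The sketch below assumes that this fine-tuning script keeps that default behavior and that the file sits at output/libero_spatial_ft/checkpoint-2000/trainer_state.json. Neither is confirmed here, and the helper names are hypothetical.

```python
import json
from pathlib import Path


def load_loss_curve(trainer_state_path):
    """Return [(step, loss), ...] from the log_history of a Trainer state file."""
    state = json.loads(Path(trainer_state_path).read_text())
    return [
        (entry["step"], entry["loss"])
        for entry in state.get("log_history", [])
        if "loss" in entry and "step" in entry
    ]


def loss_improved(curve, window=5):
    """True if the mean of the last few losses is below the mean of the first few."""
    if len(curve) < 2:
        return False
    w = max(1, min(window, len(curve) // 2))
    first = sum(loss for _, loss in curve[:w]) / w
    last = sum(loss for _, loss in curve[-w:]) / w
    return last < first


# Usage (path is an assumption, see above):
# curve = load_loss_curve("output/libero_spatial_ft/checkpoint-2000/trainer_state.json")
# print(loss_improved(curve))
```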
Run open-loop evaluation on the fine-tuned checkpoint. This compares the model's predicted actions against the ground truth from the dataset:
CUDA_VISIBLE_DEVICES=1 python gr00t/eval/open_loop_eval.py \
--dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \
--embodiment-tag LIBERO_PANDA \
--model-path output/libero_spatial_ft/checkpoint-2000/ \
--traj-ids 0 1 2 \
--action-horizon 16
The evaluation outputs:
- Plots saved to /tmp/open_loop_eval/ showing ground truth vs. predicted actions for each action dimension (x, y, z, roll, pitch, yaw, gripper)
Key things to look for in the plots:
TIP
Even at 2000 steps, the fine-tuned model's predicted actions should track the ground truth far more closely than random predictions would. With 20,000 steps, the published closed-loop success rate on LIBERO Spatial is 97.65%.
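Open-loop evaluation compares predicted and ground-truth actions dimension by dimension. The sketch below shows one common way to quantify that comparison, a per-dimension mean squared error. It is an illustration only: the metric that open_loop_eval.py actually reports may differ, and the sample arrays are made up.

```python
DIMS = ["x", "y", "z", "roll", "pitch", "yaw", "gripper"]


def per_dim_mse(ground_truth, predicted):
    """Mean squared error for each action dimension over all timesteps."""
    if len(ground_truth) != len(predicted) or not ground_truth:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(ground_truth)
    return {
        name: sum((g[i] - p[i]) ** 2 for g, p in zip(ground_truth, predicted)) / n
        for i, name in enumerate(DIMS)
    }


# Two made-up timesteps: the prediction is wrong only on the gripper at step 2.
gt = [[0.0] * 7, [1.0] * 7]
pred = [[0.0] * 7, [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0]]
errors = per_dim_mse(gt, pred)
print(errors["x"], errors["gripper"])  # prints 0.0 0.5
```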
Measure the fine-tuned model's per-step inference latency:
CUDA_VISIBLE_DEVICES=1 python scripts/deployment/standalone_inference_script.py \
--model-path output/libero_spatial_ft/checkpoint-2000/ \
--dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \
--embodiment-tag LIBERO_PANDA \
--traj-ids 0 \
--inference-mode pytorch \
--action-horizon 8
NOTE
If you see Triton/PTXAS errors about sm_103a, prepend TORCHDYNAMO_DISABLE=1 to the command. This runs inference in eager mode. See Troubleshooting for details.
The timing output breaks down into:
In eager mode (without torch.compile), expect ~240 ms per step. With torch.compile working, expect ~38 ms per step, comparable to an H100.
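To put the two latency figures in perspective, the sketch below converts them to inference steps per second and computes the speedup. It uses only the approximate numbers quoted above, and the helper name is hypothetical.

```python
def steps_per_second(latency_ms: float) -> float:
    """Inference steps per second for a given per-step latency."""
    return 1000.0 / latency_ms


eager_ms = 240.0     # approximate eager-mode latency from the text
compiled_ms = 38.0   # approximate torch.compile latency from the text

print(round(steps_per_second(eager_ms), 1))     # prints 4.2
print(round(steps_per_second(compiled_ms), 1))  # prints 26.3
print(round(eager_ms / compiled_ms, 1))         # prints 6.3 (speedup factor)
```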
To remove the environment:
deactivate
cd ..
rm -rf Isaac-GR00T
Your fine-tuned checkpoints in output/libero_spatial_ft/ are deleted with the repo. Copy them elsewhere first if you want to keep them.
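Copying the checkpoints out before running the cleanup commands can be sketched as follows. The helper name and the destination path in the usage comment are hypothetical; choose any location outside the Isaac-GR00T directory.

```python
import shutil
from pathlib import Path


def backup_checkpoints(src: str, dest: str) -> Path:
    """Copy the whole checkpoint directory tree to dest and return the destination path."""
    src_path, dest_path = Path(src), Path(dest)
    if not src_path.is_dir():
        raise FileNotFoundError(f"no checkpoint directory at {src_path}")
    # dirs_exist_ok (Python 3.8+) lets the copy be re-run into an existing backup.
    shutil.copytree(src_path, dest_path, dirs_exist_ok=True)
    return dest_path


# Usage, from inside Isaac-GR00T and before `rm -rf` (destination is an example):
# backup_checkpoints("output/libero_spatial_ft", str(Path.home() / "libero_spatial_ft_backup"))
```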
- Increase --max-steps to 20000 for results closer to the published 97.65% success rate on LIBERO Spatial. Training time scales linearly (~3.5 hours at 20K steps).
- Download the libero_10_no_noops, libero_goal_no_noops, or libero_object_no_noops datasets from the IPEC-COMMUNITY HuggingFace organization and repeat the workflow. Published success rates: Object 98.45%, Goal 97.5%, 10-Long 94.35%.
- Adding --tune-llm or --tune-visual increases memory usage significantly.
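The 20K-step time estimate follows from linear scaling of the 20 to 25 minute figure for 2000 steps. A minimal sketch of that arithmetic, assuming per-step time stays constant (the helper name is hypothetical):

```python
def scaled_minutes(base_minutes: float, base_steps: int, target_steps: int) -> float:
    """Estimate run time at target_steps, assuming constant time per step."""
    return base_minutes * target_steps / base_steps


low = scaled_minutes(20, 2000, 20000)    # lower bound of the 2000-step timing
high = scaled_minutes(25, 2000, 20000)   # upper bound of the 2000-step timing
print(f"{low / 60:.1f} to {high / 60:.1f} hours")  # prints 3.3 to 4.2 hours
```

The ~3.5 hour figure quoted for 20K steps falls inside this range.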