Fine-tune and benchmark NVIDIA's GR00T N1.6 robotics foundation model on DGX Station
install_deps.sh fails building torchcodec
Solution:
Ensure the license confirmation env var is set:
I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 bash scripts/deployment/dgpu/install_deps.sh
If the build still fails, ensure FFmpeg dev libraries are installed:
sudo apt-get install -y libavdevice-dev libavfilter-dev libavformat-dev \
libavcodec-dev libavutil-dev libswresample-dev libswscale-dev
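Before re-running the build, it can help to confirm which FFmpeg development libraries the build system can actually see. The following is a minimal sketch, assuming `pkg-config` is installed and that each library is registered under its usual pkg-config module name; `missing_ffmpeg_libs` is a hypothetical helper, not part of the repository.

```shell
# Hypothetical helper: prints the FFmpeg libraries that pkg-config cannot find.
# The optional first argument overrides the check command (useful for testing).
missing_ffmpeg_libs() {
  checker="${1:-pkg-config --exists}"
  missing=""
  for lib in libavdevice libavfilter libavformat libavcodec \
             libavutil libswresample libswscale; do
    # $checker is intentionally unquoted so "pkg-config --exists" splits into words
    $checker "$lib" 2>/dev/null || missing="$missing $lib"
  done
  # Strip the leading space; empty output means nothing is missing
  echo "${missing# }"
}
```

Calling `missing_ffmpeg_libs` with no arguments prints nothing when every library is found; any names it prints correspond to `-dev` packages that are probably still missing.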
huggingface-cli download fails with 401 Unauthorized
Solution:
Verify your HuggingFace token is set and valid:
echo $HF_TOKEN
huggingface-cli whoami
If the token is not set:
export HF_TOKEN="your_token_here"
Make sure you have accepted any required model agreements on the HuggingFace model page.
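Note that `echo $HF_TOKEN` prints the token itself to the terminal. If you are working in a shared or recorded session, a check along the following lines may be preferable. This is a minimal sketch; `hf_token_status` is a hypothetical helper that only reports whether the variable is set and how long it is.

```shell
# Hypothetical helper: reports whether HF_TOKEN is set without revealing it.
hf_token_status() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "unset"
  else
    # ${#HF_TOKEN} expands to the length of the value
    echo "set (${#HF_TOKEN} characters)"
  fi
}
```

This only confirms that the variable is present in the current shell. Whether the token is valid still has to be checked with `huggingface-cli whoami`.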
Out-of-memory (OOM) error during fine-tuning
Solution:
If fine-tuning fails with an OOM error at batch size 128, reduce the batch size:
--global-batch-size 64
Also check that no other processes are using GPU memory:
nvidia-smi
If you are tuning additional model components (--tune-llm or --tune-visual), these significantly increase memory usage. The default configuration (projector + diffusion model only) is the most memory-efficient.
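To quantify how much GPU memory other processes are holding, the per-process query mode of `nvidia-smi` can be summed. The following is a minimal sketch, assuming your driver's `nvidia-smi` supports `--query-compute-apps` with the `used_memory` field; `sum_gpu_mem_mib` is a hypothetical helper.

```shell
# Hypothetical helper: sums the second column (used memory in MiB)
# from lines shaped like "1234, 5000" read on standard input.
sum_gpu_mem_mib() {
  awk -F', *' '{ total += $2 } END { print total + 0 }'
}

# On a machine with NVIDIA drivers, feed it the per-process query:
# nvidia-smi --query-compute-apps=pid,used_memory \
#   --format=csv,noheader,nounits | sum_gpu_mem_mib
```

A result of 0 before launching fine-tuning suggests no other compute process is occupying GPU memory.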
Error mentioning sm_103a during inference
Solution:
The bundled Triton version may not yet support SM103 (GB300). This causes errors like:
ptxas-blackwell fatal: Value 'sm_103a' is not defined for option 'gpu-name'
Disable torch.compile by prepending:
TORCHDYNAMO_DISABLE=1 python scripts/deployment/standalone_inference_script.py ...
This runs inference in eager mode (~240 ms/step instead of ~38 ms/step with compile). Training and open-loop evaluation are not affected since they use eager mode by default.
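If the same launch script is shared across machines, the workaround can be applied only where it is needed. The following is a minimal sketch, assuming the GB300 reports compute capability 10.3 and that your driver's `nvidia-smi` supports the `compute_cap` query field; `inference_env_prefix` is a hypothetical helper.

```shell
# Hypothetical helper: given a compute capability string, prints the
# environment assignment needed for inference (empty if none is needed).
inference_env_prefix() {
  if [ "$1" = "10.3" ]; then
    # SM103: disable torch.compile so inference runs in eager mode
    echo "TORCHDYNAMO_DISABLE=1"
  else
    echo ""
  fi
}

# Example (requires NVIDIA drivers; <index> is the GB300's device index):
# cap=$(nvidia-smi -i <index> --query-gpu=compute_cap --format=csv,noheader)
# env $(inference_env_prefix "$cap") python scripts/deployment/standalone_inference_script.py ...
```

Once a Triton release with SM103 support is bundled, the helper can simply be dropped and the faster compiled path is used again.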
ModuleNotFoundError: No module named 'gr00t'
Solution:
The virtual environment is not activated. Run:
source .venv/bin/activate
Verify you are in the Isaac-GR00T directory:
pwd
# Should show: .../Isaac-GR00T
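Both checks can be combined into one quick diagnostic. The following is a minimal sketch, assuming the repository was cloned into a directory named Isaac-GR00T and that activation sets the standard `VIRTUAL_ENV` variable; `check_groot_env` is a hypothetical helper.

```shell
# Hypothetical helper: prints one line per detected problem;
# no output means both checks pass.
check_groot_env() {
  if [ -z "${VIRTUAL_ENV:-}" ]; then
    echo "virtual environment not activated"
  fi
  if [ "$(basename "$PWD")" != "Isaac-GR00T" ]; then
    echo "not in the Isaac-GR00T directory"
  fi
}
```

If you cloned the repository under a different directory name, adjust the comparison accordingly.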
Training loss does not decrease
Solution:
At 2000 steps, the model may not have fully converged; this is expected for the shortened playbook run. If the loss remains flat after 500+ steps:
Verify the dataset was downloaded correctly and the modality config was copied:
ls examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/modality.json
Check that the correct embodiment tag is used (LIBERO_PANDA, not NEW_EMBODIMENT).
Try increasing the learning rate to 5e-4 for faster initial convergence on short runs.
nvidia-smi shows the wrong GPU
Solution:
On DGX Station, the GB300 may not be device 0. Find the correct index:
nvidia-smi --query-gpu=index,name --format=csv,noheader
Use the GB300's index with CUDA_VISIBLE_DEVICES:
CUDA_VISIBLE_DEVICES=1 python ...
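The index lookup can also be scripted so the value is not hard-coded. The following is a minimal sketch, assuming the GPU name reported by `nvidia-smi` contains the substring "GB300"; `gpu_index_by_name` is a hypothetical helper.

```shell
# Hypothetical helper: reads "index, name" lines on standard input and prints
# the index of the first GPU whose name contains the given substring.
gpu_index_by_name() {
  awk -F', *' -v pat="$1" 'index($2, pat) { print $1; exit }'
}

# Example (requires NVIDIA drivers):
# idx=$(nvidia-smi --query-gpu=index,name --format=csv,noheader | gpu_index_by_name GB300)
# CUDA_VISIBLE_DEVICES="$idx" python ...
```

The helper prints nothing if no GPU name matches, so check that `idx` is non-empty before using it.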
Solution:
Increase the number of dataloader workers:
--dataloader-num-workers 8
NotImplementedError or torchcodec not found
Solution:
The install_deps.sh script builds torchcodec from source on aarch64. If it wasn't built correctly, reinstall:
sudo apt-get install -y libavdevice-dev libavfilter-dev libavformat-dev \
libavcodec-dev libavutil-dev libswresample-dev libswscale-dev \
pkg-config cmake build-essential pybind11-dev
git clone --depth 1 --branch release/0.4 https://github.com/meta-pytorch/torchcodec.git /tmp/torchcodec
cd /tmp/torchcodec
# .venv lives in the Isaac-GR00T directory, which is $OLDPWD after the cd above
I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 ENABLE_CUDA=1 \
uv pip install --python "$OLDPWD/.venv/bin/python" . --no-build-isolation
cd - && rm -rf /tmp/torchcodec