Fine-tune and benchmark NVIDIA's GR00T N1.6 robotics foundation model on DGX Station
git clone fails or demo videos are tiny / missing (Git LFS)
Solution:
sudo apt-get install -y git-lfs
git lfs install
Remove any partial Isaac-GR00T directory, then clone again with --recurse-submodules.
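One way to confirm that the "tiny" videos are un-fetched LFS pointers is to look for the pointer header that Git LFS writes in place of real content. The sketch below is illustrative only; the function names is_lfs_pointer and list_lfs_pointers are hypothetical helpers, not part of the repository.

```shell
# Succeed if FILE is a Git LFS pointer stub rather than real content.
# Pointer stubs are small text files whose first line names the LFS spec.
is_lfs_pointer() {
    [ -f "$1" ] || return 1
    head -c 200 "$1" 2>/dev/null | grep -q '^version https://git-lfs.github.com/spec/v1'
}

# Print every pointer stub under DIR (default: current directory).
# Only files of at most 1 KiB are inspected, since pointer stubs are tiny.
list_lfs_pointers() {
    find "${1:-.}" -type f -size -2k ! -path '*/.git/*' | while IFS= read -r f; do
        if is_lfs_pointer "$f"; then printf '%s\n' "$f"; fi
    done
}

# Example (run from the repo root):
#   list_lfs_pointers demo_data
```

If the listing is non-empty after a fresh clone, Git LFS was most likely not installed when the clone ran; git lfs pull inside the repository is another way to fetch the real files.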
GR1, demo_data/gr1.PickNPlace, or scripts do not match the playbook
Cause: The repository default branch (main) may track a newer GR00T line (for example N1.7) with different embodiment tags and demo layouts.
Solution:
cd Isaac-GR00T
git fetch origin
git checkout n1.6-release
git submodule update --init --recursive
Always run playbook commands from n1.6-release for N1.6 + GR00T-N1.6-3B.
install_deps.sh is not allowed on your machine (policy) or you need to know what it changes
Facts: scripts/deployment/dgpu/install_deps.sh runs sudo apt-get to install ffmpeg, libaio-dev, and (on aarch64) FFmpeg development libraries for the torchcodec build. If /usr/local/cuda does not exist, it adds the NVIDIA CUDA apt repo and installs cuda-toolkit-12-8. It also installs uv into the user account if missing, then runs uv sync and uv pip install -e . into .venv.
Solution (policy-friendly): Pre-install the same system packages and CUDA using your IT process, ensure nvcc works, then from the repo root:
export PATH="$HOME/.local/bin:$PATH"
uv sync
uv pip install -e .
On aarch64, you still need torchcodec in .venv, or you can rely on the PyAV patch (Instructions Step 2) plus uv pip install av.
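Before running uv sync by hand, it can help to confirm that the tools install_deps.sh would normally provide are already on PATH. This is a minimal sketch; missing_tools is a hypothetical helper name, and the tool list in the example comment is an assumption based on the packages named above.

```shell
# Print each command name from the arguments that is not found on PATH.
# Prints nothing when every tool is available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example:
#   missing_tools nvcc ffmpeg uv git git-lfs
```

An empty result suggests the manual path should work; any name printed points at a package to request through your IT process.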
uv sync (Option B) appears stuck for hours building flash-attn on aarch64
Cause: Upstream pyproject.toml lists pre-built flash-attn==2.7.4.post1 wheels only for linux_x86_64. On aarch64 (Grace + GB300), uv falls back to a from-source build that compiles ~72 CUDA kernels, which typically takes ~2 hours end-to-end.
Solution: Pin to flash-attn==2.8.1 and use the GitHub release's prebuilt aarch64 wheel. Edit pyproject.toml in the repo root:
# under [project] dependencies, replace:
# "flash-attn==2.7.4.post1",
"flash-attn==2.8.1",
# under [tool.uv.sources], add:
flash-attn = [
{ url = "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.1/flash_attn-2.8.1+cu12torch2.10cxx11abiTRUE-cp312-cp312-linux_aarch64.whl",
marker = "sys_platform == 'linux' and platform_machine == 'aarch64' and python_version == '3.12'" },
]
The cu12torch2.10 aarch64 wheel works against torch 2.10 (cu128 or cu130 builds). Validated on GB300 + CUDA 13.1 — uv sync completes in ~1 minute instead of ~2 hours. Track upstream Isaac-GR00T for a future commit that bakes this in.
If you must keep flash-attn==2.7.4.post1 (Option A path), expect the 2-hour build on first sync; subsequent uv sync invocations re-use the cached wheel.
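A quick way to check that a pinned wheel URL targets the current machine is to read the platform tag from the wheel filename. The following is a sketch under stated assumptions: wheel_platform_tag and wheel_matches_machine are hypothetical helpers, and the comparison only handles plain linux_<machine> tags like the one in the URL above (not manylinux tags).

```shell
# Print the platform tag (last dash-separated field) of a wheel filename or URL.
wheel_platform_tag() {
    wheel_name="${1##*/}"             # strip any URL or directory prefix
    wheel_name="${wheel_name%.whl}"   # strip the extension
    printf '%s\n' "${wheel_name##*-}"
}

# Succeed if the wheel targets MACHINE (default: this host's `uname -m`).
# Assumes a plain "linux_<machine>" tag; manylinux tags will not match.
wheel_matches_machine() {
    wheel_machine="${2:-$(uname -m)}"
    [ "$(wheel_platform_tag "$1")" = "linux_${wheel_machine}" ]
}

# Example:
#   wheel_matches_machine "$WHEEL_URL" || echo "wheel does not match $(uname -m)"
```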
install_deps.sh fails building torchcodec
Solution:
Ensure the license confirmation env var is set:
I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 bash scripts/deployment/dgpu/install_deps.sh
If the build still fails, install FFmpeg development libraries:
sudo apt-get install -y libavdevice-dev libavfilter-dev libavformat-dev \
libavcodec-dev libavutil-dev libswresample-dev libswscale-dev \
pkg-config cmake build-essential pybind11-dev
Then apply Instructions Step 2 (PyAV patch) so training does not depend on a working torchcodec for indexed frame reads.
huggingface-cli download fails with 401 Unauthorized
Solution:
echo $HF_TOKEN
huggingface-cli whoami
If the token is not set:
export HF_TOKEN="your_token_here"
Accept any required license or gated-model agreements on the Hugging Face model page.
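On shared machines or in recorded terminals, printing the token itself may be undesirable. As an alternative sketch, the hypothetical helper hf_token_status below reports only whether HF_TOKEN is set and how long it is, without echoing its value.

```shell
# Report whether HF_TOKEN is set without printing the secret itself.
# Returns non-zero when the variable is unset or empty.
hf_token_status() {
    if [ -n "${HF_TOKEN:-}" ]; then
        printf 'HF_TOKEN is set (%s characters)\n' "${#HF_TOKEN}"
    else
        printf 'HF_TOKEN is not set\n'
        return 1
    fi
}

# Example:
#   hf_token_status || export HF_TOKEN="your_token_here"
```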
huggingface-cli download fails with Permission denied: '/home/.../.cache/huggingface/hub/...'
Cause: The shared cache directory was previously created by a Docker container running as root (common on multi-user dev boxes that mount ~/.cache/huggingface into containers without --user). The current user (nvidia) cannot write into it.
Solution: point HF at a user-owned cache location for this run:
export HF_HOME=$HOME/hf_cache_gr00t
mkdir -p "$HF_HOME"
huggingface-cli download nvidia/GR00T-N1.6-3B
Re-export HF_HOME for the rest of the playbook (Step 5 onward) so model loads find the right cache. To permanently un-stick the original cache, ask whoever owns the container session to chown ~/.cache/huggingface back to your user.
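To see whether a cache directory is affected before starting a long download, a check along these lines may help. It is a sketch only; hf_cache_writable and hf_cache_foreign_files are hypothetical helper names, and the default path assumes the standard ~/.cache/huggingface location when HF_HOME is unset.

```shell
# Succeed if DIR (default: $HF_HOME or ~/.cache/huggingface) exists and is writable.
hf_cache_writable() {
    cache_dir="${1:-${HF_HOME:-$HOME/.cache/huggingface}}"
    [ -d "$cache_dir" ] && [ -w "$cache_dir" ]
}

# List entries under DIR that are not owned by the current user,
# for example files left behind by a container that ran as root.
hf_cache_foreign_files() {
    cache_dir="${1:-${HF_HOME:-$HOME/.cache/huggingface}}"
    find "$cache_dir" ! -user "$(id -u)" 2>/dev/null
}

# Example:
#   hf_cache_writable || echo "cache not writable; set HF_HOME to a user-owned path"
#   hf_cache_foreign_files | head
```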
huggingface-cli download returns 500 Internal Server Error from the xet-read-token endpoint
Cause: Hugging Face's xet content-addressable backend occasionally returns transient 5xx. This blocks dataset downloads even though the underlying files are reachable via the legacy backend.
Solution: disable xet for the download:
export HF_HUB_DISABLE_XET=1
huggingface-cli download --repo-type dataset \
IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot \
--local-dir examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/
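Because the 5xx responses are described as transient, wrapping the download in a small retry loop is another option. The sketch below is a generic shell pattern, not something the playbook or huggingface-cli provides; the retry function and the RETRY_DELAY variable are hypothetical names introduced for illustration.

```shell
# Run a command up to MAX_TRIES times, doubling the wait between attempts.
# RETRY_DELAY (seconds, default 5) is a hypothetical knob for this sketch.
retry() {
    retry_max="$1"; shift
    retry_delay="${RETRY_DELAY:-5}"
    retry_attempt=1
    while :; do
        "$@" && return 0
        if [ "$retry_attempt" -ge "$retry_max" ]; then
            echo "retry: giving up after $retry_attempt attempts" >&2
            return 1
        fi
        echo "retry: attempt $retry_attempt failed, sleeping ${retry_delay}s" >&2
        sleep "$retry_delay"
        retry_attempt=$((retry_attempt + 1))
        retry_delay=$((retry_delay * 2))
    done
}

# Example:
#   retry 4 huggingface-cli download --repo-type dataset \
#       IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot \
#       --local-dir examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/
```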
externally-managed-environment or pip installs not going into .venv
Cause: Debian/Ubuntu PEP 668 blocks pip install onto the system Python. Mixing sudo pip with the project venv breaks the playbook.
Solution:
Activate the venv first: source .venv/bin/activate (the prompt should show (.venv)).
Run uv pip install ... (or python -m pip install ...) only with the venv activated; never use sudo pip for this project.
If the venv was damaged by a stray pip install, recreate it: rm -rf .venv and run uv sync again from the repo root (after the n1.6-release checkout).
Out of memory on the GPU during fine-tuning
Solution:
Reduce batch size:
--global-batch-size 64
Check for other GPU processes: nvidia-smi.
--tune-llm / --tune-visual increase memory use substantially.
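To quantify how much memory other processes already hold, the per-process listing from nvidia-smi can be summed. This is a minimal sketch; sum_gpu_mem_mib is a hypothetical helper that reads CSV on stdin and assumes the used-memory value is the last column, in MiB, as produced by the query in the example comment.

```shell
# Sum the last CSV column (used memory in MiB) of per-process nvidia-smi output.
# Non-numeric values such as "[N/A]" count as 0.
sum_gpu_mem_mib() {
    awk -F', *' 'NF >= 3 { total += $NF } END { print total + 0 }'
}

# Example (on a GPU host):
#   nvidia-smi --query-compute-apps=pid,process_name,used_memory \
#              --format=csv,noheader,nounits | sum_gpu_mem_mib
```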
sm_103a (GB300 / Blackwell)
Symptom:
ptxas-blackwell fatal: Value 'sm_103a' is not defined for option 'gpu-name'
Solution:
For scripts/deployment/standalone_inference_script.py (which may use torch.compile), prepend:
TORCHDYNAMO_DISABLE=1 python scripts/deployment/standalone_inference_script.py ...
This forces eager inference (higher latency per step but stable on SM103 until Triton catches up). Fine-tuning and open_loop_eval.py typically run without this compile path; use the same prefix there only if you see the same crash.
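If you capture script output to a log, the decision to fall back to eager mode can be scripted. The sketch below assumes the error text appears verbatim as in the symptom above; log_has_sm103a_error and run_eager are hypothetical helper names.

```shell
# Succeed if LOGFILE contains the ptxas "sm_103a is not defined" failure.
log_has_sm103a_error() {
    grep -q "Value 'sm_103a' is not defined for option 'gpu-name'" "$1"
}

# Run a command with TorchDynamo disabled for that invocation only.
run_eager() {
    TORCHDYNAMO_DISABLE=1 "$@"
}

# Example:
#   python scripts/deployment/standalone_inference_script.py ... 2>&1 | tee run.log
#   log_has_sm103a_error run.log && run_eager python scripts/deployment/standalone_inference_script.py ...
```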
ModuleNotFoundError: No module named 'gr00t'
Solution:
source .venv/bin/activate
pwd # .../Isaac-GR00T
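The two checks above can be combined into one guard that confirms the current directory looks like the repository root before importing the package. This is a sketch only; check_gr00t_root is a hypothetical helper, and it assumes the root is recognizable by pyproject.toml and .venv/bin/python, both of which this playbook refers to.

```shell
# Succeed if DIR (default: current directory) has pyproject.toml and an
# executable .venv/bin/python, which is how the repo root is recognized here.
check_gr00t_root() {
    root_dir="${1:-$PWD}"
    [ -f "$root_dir/pyproject.toml" ] && [ -x "$root_dir/.venv/bin/python" ]
}

# Example:
#   check_gr00t_root && .venv/bin/python -c "import gr00t" \
#       || echo "run from the Isaac-GR00T root after uv sync"
```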
NotImplementedError in get_frames_by_indices when backend is pyav
Cause: On n1.6-release, resolve_backend can select pyav, but stock get_frames_by_indices did not implement the pyav branch.
Solution: Apply the playbook patch and install PyAV (see Instructions Step 2 and assets/patches/README.md).
Training is slow or GPU utilization is low
Cause: Training falls back to per-frame ffmpeg subprocess decoding for AV1 LIBERO clips, so the dataloaders starve the GPU.
Solution:
Apply the PyAV patch and run uv pip install av.
Increase --dataloader-num-workers (for example 8) if CPUs are free.
Expected noise after patching: logs may repeat Video backend 'torchcodec' is not available, falling back to 'pyav'. That is normal if torchcodec is absent.
torchcodec not found (general)
Solution:
Prefer the PyAV patch + av path above for LIBERO on GB300.
If you must build torchcodec into .venv manually (aarch64), with FFmpeg dev packages installed:
# Run this from inside the Isaac-GR00T repo root (the directory that
# contains .venv). Capture its absolute path BEFORE changing directories
# so we can still reach the virtualenv after cd'ing into /tmp/torchcodec.
GR00T_ROOT="$(pwd)"
# Sanity check — the virtualenv interpreter must already exist.
test -x "$GR00T_ROOT/.venv/bin/python" || { echo "Not in Isaac-GR00T root (missing .venv/bin/python)"; }
# Clone the torchcodec source into /tmp/torchcodec (skipped if already cloned).
[ -d /tmp/torchcodec/.git ] || git clone https://github.com/pytorch/torchcodec.git /tmp/torchcodec
cd /tmp/torchcodec
# Build torchcodec into the Isaac-GR00T virtualenv using the absolute
# path captured above (do NOT use the relative ".venv/bin/python" here —
# the current directory is /tmp/torchcodec, which has no .venv).
I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 ENABLE_CUDA=1 \
uv pip install --python "$GR00T_ROOT/.venv/bin/python" . --no-build-isolation
CUDA-enabled builds can fail when system FFmpeg or CUDA does not match torchcodec expectations — in that case use the PyAV patch instead.
Loss is flat or not decreasing
Solution:
At 2000 steps the model may still be early in training. If loss is flat after many steps:
Check that the modality metadata exists: ls examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/modality.json
Confirm that the run uses --embodiment-tag LIBERO_PANDA
Try --learning-rate 5e-4 for faster early movement on short runs
nvidia-smi shows the wrong GPU
Solution:
nvidia-smi --query-gpu=index,name --format=csv,noheader
CUDA_VISIBLE_DEVICES=<gb300_index> python ...
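Looking up the index by name avoids hard-coding it when device ordering differs between machines. The sketch below parses the index,name listing from the query above; gpu_index_by_name is a hypothetical helper, and the GPU name strings in the example are assumptions rather than verified nvidia-smi output.

```shell
# Print the index of the first GPU whose name contains PATTERN (case-insensitive).
# Reads "index, name" CSV lines on stdin; exits non-zero if nothing matches.
gpu_index_by_name() {
    awk -F', *' -v pat="$1" '
        index(tolower($2), tolower(pat)) { print $1; found = 1; exit }
        END { exit !found }
    '
}

# Example (on a GPU host; "GB300" as a name fragment is an assumption):
#   idx="$(nvidia-smi --query-gpu=index,name --format=csv,noheader | gpu_index_by_name GB300)"
#   CUDA_VISIBLE_DEVICES="$idx" python ...
```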
Notes: OpenCV often fails on AV1 in LIBERO assets. decord may lack a compatible wheel for your platform. The PyAV patch path is the supported mitigation in this playbook.