Isaac GR00T N1.6 Fine-Tuning

Common Issues

Issue: `git clone` fails or demo videos are tiny / missing (Git LFS)

Solution:

sudo apt-get install -y git-lfs
git lfs install

Remove any partial Isaac-GR00T directory, then clone again with --recurse-submodules.
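
To confirm whether tiny demo videos are really un-fetched Git LFS stubs, the following sketch may help. It relies on the fact that LFS pointer files are small text files starting with a "version https://git-lfs.github.com/spec/v1" line. The function names is_lfs_pointer and list_lfs_stubs are hypothetical helpers, not part of Isaac-GR00T.

```shell
# Return 0 if the file looks like a Git LFS pointer stub rather than real content.
is_lfs_pointer() {
    [ -f "$1" ] || return 1
    # Read only the first bytes; strip NULs so binary files are handled quietly.
    _head=$(head -c 64 "$1" 2>/dev/null | tr -d '\000')
    case "$_head" in
        "version https://git-lfs.github.com/spec/v1"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print every small file under a directory (default: .) that is an LFS stub.
# Assumes file names do not contain newlines.
list_lfs_stubs() {
    find "${1:-.}" -type f -size -2k ! -path '*/.git/*' |
        while IFS= read -r _f; do
            if is_lfs_pointer "$_f"; then printf '%s\n' "$_f"; fi
        done
}
```

Running list_lfs_stubs Isaac-GR00T should print nothing once LFS content has been fetched; any paths it prints are files that still need git lfs pull or a fresh clone.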

Issue: `GR1`, `demo_data/gr1.PickNPlace`, or scripts do not match the playbook

Cause: The repository default branch (main) may track a newer GR00T line (for example N1.7) with different embodiment tags and demo layouts.

Solution:

cd Isaac-GR00T
git fetch origin
git checkout n1.6-release
git submodule update --init --recursive

Always run playbook commands from n1.6-release for N1.6 + GR00T-N1.6-3B.
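
A small guard can catch an accidental switch back to main before a long run starts. This is a sketch only; on_branch is a hypothetical helper name, and it treats a detached HEAD as "not on the branch".

```shell
# Succeed only when the checkout in directory $2 (default: current directory)
# has branch $1 checked out. Fails on a detached HEAD or outside a git repo.
on_branch() {
    _current=$(git -C "${2:-.}" symbolic-ref --short HEAD 2>/dev/null) || return 1
    [ "$_current" = "$1" ]
}
```

For example, on_branch n1.6-release Isaac-GR00T || echo "check out n1.6-release first" can be placed at the top of a wrapper script.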

Issue: `install_deps.sh` is not allowed on your machine (policy) or you need to know what it changes

Facts: scripts/deployment/dgpu/install_deps.sh runs sudo apt-get to install ffmpeg, libaio-dev, and (on aarch64) the FFmpeg development libraries needed for the torchcodec build. If /usr/local/cuda does not exist, it adds the NVIDIA CUDA apt repo and installs cuda-toolkit-12-8. It also installs uv into the user account if missing, then runs uv sync and uv pip install -e . to populate .venv.

Solution (policy-friendly): Pre-install the same system packages and CUDA using your IT process, ensure nvcc works, then from the repo root:

export PATH="$HOME/.local/bin:$PATH"
uv sync
uv pip install -e .

On aarch64, you still need either torchcodec in .venv or the PyAV patch (Instructions Step 2) plus uv pip install av.
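
Before running uv sync by hand, it can help to confirm that the tools normally provided by install_deps.sh are already on PATH. The sketch below assumes the relevant commands are nvcc, ffmpeg, and uv (taken from the description above); missing_tools is a hypothetical helper name.

```shell
# Print the name of every command in the argument list that is not on PATH.
missing_tools() {
    for _tool in "$@"; do
        command -v "$_tool" >/dev/null 2>&1 || printf '%s\n' "$_tool"
    done
}
```

An empty result from missing_tools nvcc ffmpeg uv suggests the prerequisites are in place; anything it prints should be installed through your IT process first.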

Issue: `uv sync` (Option B) appears stuck for hours building `flash-attn` on aarch64

Cause: Upstream pyproject.toml lists pre-built flash-attn==2.7.4.post1 wheels only for linux_x86_64. On aarch64 (Grace + GB300), uv falls back to a from-source build that compiles ~72 CUDA kernels — typically ~2 hours end-to-end.

Solution: Pin to flash-attn==2.8.1 and use the GitHub release's prebuilt aarch64 wheel. Edit pyproject.toml in the repo root:

# under [project] dependencies, replace:
# "flash-attn==2.7.4.post1",
"flash-attn==2.8.1",

# under [tool.uv.sources], add:
# Keep the inline table on a single line: TOML 1.0 does not allow newlines inside { }.
flash-attn = [
    { url = "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.1/flash_attn-2.8.1+cu12torch2.10cxx11abiTRUE-cp312-cp312-linux_aarch64.whl", marker = "sys_platform == 'linux' and platform_machine == 'aarch64' and python_version == '3.12'" },
]

The cu12torch2.10 aarch64 wheel works against torch 2.10 (cu128 or cu130 builds). Validated on GB300 + CUDA 13.1 — uv sync completes in ~1 minute instead of ~2 hours. Track upstream Isaac-GR00T for a future commit that bakes this in.

If you must keep flash-attn==2.7.4.post1 (Option A path), expect the 2-hour build on first sync; subsequent uv sync invocations re-use the cached wheel.

Issue: `install_deps.sh` fails building torchcodec

Solution:

Ensure the license confirmation env var is set:

I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 bash scripts/deployment/dgpu/install_deps.sh

If the build still fails, install FFmpeg development libraries:

sudo apt-get install -y libavdevice-dev libavfilter-dev libavformat-dev \
    libavcodec-dev libavutil-dev libswresample-dev libswscale-dev \
    pkg-config cmake build-essential pybind11-dev

Then apply Instructions Step 2 (PyAV patch) so training does not depend on a working torchcodec for indexed frame reads.

Issue: `huggingface-cli download` fails with 401 Unauthorized

Solution:

echo $HF_TOKEN
huggingface-cli whoami

If the token is not set:

export HF_TOKEN="your_token_here"

Accept any required license or gated-model agreements on the Hugging Face model page.
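
On shared machines, echoing the token prints the secret into the terminal and scrollback. A sketch of a check that reports only whether the variable is set follows; hf_token_status is a hypothetical helper name.

```shell
# Report whether HF_TOKEN is set without printing its value.
# Returns 0 when set and non-empty, 1 otherwise.
hf_token_status() {
    if [ -n "${HF_TOKEN:-}" ]; then
        printf 'HF_TOKEN is set (%s characters)\n' "${#HF_TOKEN}"
    else
        printf 'HF_TOKEN is not set\n'
        return 1
    fi
}
```

This only checks the environment variable; huggingface-cli whoami remains the way to confirm that the token is actually accepted.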

Issue: `huggingface-cli download` fails with `Permission denied: '/home/.../.cache/huggingface/hub/...'`

Cause: The shared cache directory was previously created by a Docker container running as root (common on multi-user dev boxes that mount ~/.cache/huggingface into containers without --user). The current user (nvidia) cannot write into it.

Solution: point HF at a user-owned cache location for this run:

export HF_HOME=$HOME/hf_cache_gr00t
mkdir -p "$HF_HOME"
huggingface-cli download nvidia/GR00T-N1.6-3B

Re-export HF_HOME for the rest of the playbook (Step 5 onward) so model loads find the right cache. To permanently un-stick the original cache, ask whoever owns the container session to chown ~/.cache/huggingface back to your user.
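
To see how much of the shared cache is affected before asking for a chown, the entries not owned by the current user can be listed. This is a sketch; list_foreign_owned is a hypothetical helper name, and subdirectories the current user cannot read are silently skipped.

```shell
# List entries under directory $1 that are NOT owned by the current user.
# Errors from unreadable subdirectories are suppressed.
list_foreign_owned() {
    find "$1" ! -user "$(id -u)" 2>/dev/null
}
```

For example, list_foreign_owned "$HOME/.cache/huggingface" | head shows the first few root-owned entries; an empty result suggests the permission problem lies elsewhere.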

Issue: `huggingface-cli download` returns `500 Internal Server Error` from the `xet-read-token` endpoint

Cause: Hugging Face's xet content-addressable backend occasionally returns transient 5xx errors. These block dataset downloads even though the underlying files are reachable via the legacy backend.

Solution: disable xet for the download:

export HF_HUB_DISABLE_XET=1
huggingface-cli download --repo-type dataset \
    IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot \
    --local-dir examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/
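
Because the failures are described as transient, wrapping the download in a simple retry loop may also help. The sketch below is a generic helper, not part of huggingface-cli; retry and the RETRY_DELAY variable are hypothetical names.

```shell
# Run a command up to $1 times, sleeping RETRY_DELAY seconds (default 10)
# between failed attempts. Returns 0 on the first success, 1 if all attempts fail.
retry() {
    _max=$1
    shift
    _n=1
    while :; do
        "$@" && return 0
        if [ "$_n" -ge "$_max" ]; then
            echo "retry: giving up after $_n attempts: $*" >&2
            return 1
        fi
        echo "retry: attempt $_n failed, retrying in ${RETRY_DELAY:-10}s" >&2
        sleep "${RETRY_DELAY:-10}"
        _n=$((_n + 1))
    done
}
```

With HF_HUB_DISABLE_XET=1 exported as above, prefixing the same download command with retry 3 repeats it up to three times before giving up.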

Issue: `externally-managed-environment` or `pip` installs not going into `.venv`

Cause: Debian/Ubuntu PEP 668 blocks pip install onto the system Python. Mixing sudo pip with the project venv breaks the playbook.

Solution:

source .venv/bin/activate — prompt should show (.venv).
Use uv pip install ... (or python -m pip install ...) only with the venv activated — never sudo pip for this project.
If the venv was created with a broken pip, recreate: rm -rf .venv and run uv sync again from the repo root (after n1.6-release checkout).
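
The activation check above can be scripted. This sketch compares the VIRTUAL_ENV variable (set by the venv's activate script) against the expected path; in_project_venv is a hypothetical helper name, and the comparison is a plain string match, so a symlinked repo path may give a false negative.

```shell
# Succeed only if the active virtual environment is <repo_root>/.venv,
# where <repo_root> is $1 (default: the current directory).
in_project_venv() {
    _root=${1:-$PWD}
    [ -n "${VIRTUAL_ENV:-}" ] && [ "$VIRTUAL_ENV" = "$_root/.venv" ]
}
```

Run from the repo root, in_project_venv || echo "activate .venv first" gives an early warning before any pip command touches the wrong interpreter.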

Issue: CUDA out of memory during fine-tuning

Solution:

Reduce batch size:

--global-batch-size 64

Check for other GPU processes with nvidia-smi. The --tune-llm and --tune-visual flags increase memory use substantially.

Issue: Triton / PTXAS errors about `sm_103a` (GB300 / Blackwell)

Symptom:

ptxas-blackwell fatal: Value 'sm_103a' is not defined for option 'gpu-name'

Solution:

For scripts/deployment/standalone_inference_script.py (which may use torch.compile), prepend:

TORCHDYNAMO_DISABLE=1 python scripts/deployment/standalone_inference_script.py ...

This forces eager inference (higher latency per step but stable on SM103 until Triton catches up). Fine-tuning and open_loop_eval.py typically run without this compile path; use the same prefix there only if you see the same crash.

Issue: `ModuleNotFoundError: No module named 'gr00t'`

Solution: activate the project venv and run commands from the repo root:

source .venv/bin/activate
pwd   # .../Isaac-GR00T

Issue: `NotImplementedError` in `get_frames_by_indices` when backend is `pyav`

Cause: On n1.6-release, resolve_backend can select pyav, but the stock get_frames_by_indices does not implement the pyav branch.

Solution: Apply the playbook patch and install PyAV (see Instructions Step 2 and assets/patches/README.md).

Issue: Training “hangs” — low GPU utilization, no traceback, very slow steps

Cause: Fallback to per-frame ffmpeg subprocess decoding for AV1 LIBERO clips; dataloaders starve the GPU.

Solution:

Apply the PyAV patch (Step 2) and uv pip install av.
Optionally increase --dataloader-num-workers (for example 8) if CPUs are free.

Expected noise after patching: logs may repeat Video backend 'torchcodec' is not available, falling back to 'pyav' — that is normal if torchcodec is absent.

Issue: Video decoding errors / `torchcodec` not found (general)

Solution:

Prefer the PyAV patch + av path above for LIBERO on GB300.

If you must build torchcodec into .venv manually (aarch64), with FFmpeg dev packages installed:

# Run this from inside the Isaac-GR00T repo root (the directory that
# contains .venv). Capture its absolute path BEFORE changing directories
# so we can still reach the virtualenv after cd'ing into /tmp/torchcodec.
GR00T_ROOT="$(pwd)"

# Sanity check: the virtualenv interpreter must already exist. If this prints
# an error, stop here and cd to the Isaac-GR00T root before continuing.
test -x "$GR00T_ROOT/.venv/bin/python" || echo "ERROR: not in Isaac-GR00T root (missing .venv/bin/python)" >&2

# Clone the torchcodec source into /tmp/torchcodec (skip if already cloned).
[ -d /tmp/torchcodec/.git ] || git clone https://github.com/pytorch/torchcodec.git /tmp/torchcodec
cd /tmp/torchcodec

# Build torchcodec into the Isaac-GR00T virtualenv using the absolute
# path captured above (do NOT use the relative ".venv/bin/python" here —
# the current directory is /tmp/torchcodec, which has no .venv).
I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 ENABLE_CUDA=1 \
  uv pip install --python "$GR00T_ROOT/.venv/bin/python" . --no-build-isolation

CUDA-enabled builds can fail when system FFmpeg or CUDA does not match torchcodec expectations — in that case use the PyAV patch instead.

Issue: Training loss is not decreasing

Solution:

At 2000 steps the model may still be early in training. If the loss stays flat after many more steps, check the following:

Verify modality file: ls examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/modality.json
Confirm --embodiment-tag LIBERO_PANDA
Try --learning-rate 5e-4 for faster early movement on short runs

Issue: `nvidia-smi` shows the wrong GPU

Solution:

nvidia-smi --query-gpu=index,name --format=csv,noheader
CUDA_VISIBLE_DEVICES=<gb300_index> python ...
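
Looking up the index by hand can be replaced with a small filter over the same query output. This is a sketch that assumes the "index, name" line format produced by the query above; gpu_index_by_name is a hypothetical helper name, and the substring passed to it (for example GB300) must match how your driver reports the device.

```shell
# Read "index, name" lines on stdin and print the index of the first GPU
# whose name contains the substring $1. Exits non-zero if none matches.
gpu_index_by_name() {
    awk -F', ' -v pat="$1" '
        index($2, pat) { print $1; found = 1; exit }
        END { exit !found }
    '
}
```

Piping nvidia-smi --query-gpu=index,name --format=csv,noheader into gpu_index_by_name GB300 yields a value that can be assigned to CUDA_VISIBLE_DEVICES.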

Issue: OpenCV or decord cannot decode LIBERO AV1

Notes: OpenCV often fails on AV1 in LIBERO assets. decord may lack a compatible wheel for your platform. The PyAV patch path is the supported mitigation in this playbook.
