Isaac GR00T N1.6 Fine-Tuning

Step 1
Clone Isaac GR00T and install dependencies

1a. Git LFS (required for a clean clone)

If git clone fails with errors about Git LFS or missing pointer files, install and initialize LFS, then remove any partial Isaac-GR00T directory and clone again:

sudo apt-get update
sudo apt-get install -y git-lfs
git lfs install

1b. Clone and check out `n1.6-release`

The main branch tracks ongoing development (for example newer GR00T milestones) and does not always match this N1.6 playbook. Embodiment tags such as GR1, paths like demo_data/gr1.PickNPlace, and tutorial scripts are aligned with the n1.6-release branch.

git clone --recurse-submodules https://github.com/NVIDIA/Isaac-GR00T
cd Isaac-GR00T
git fetch origin
git checkout n1.6-release
git submodule update --init --recursive

1c. Install Python dependencies

Option A — `install_deps.sh` (matches upstream docs; uses `sudo`)

This script is the supported path. It may make system-level changes:

Runs apt-get update and installs ffmpeg and libaio-dev
If /usr/local/cuda is missing, adds the NVIDIA CUDA apt repository and installs cuda-toolkit-12-8
Installs uv into your user account if needed, then runs uv sync and uv pip install -e . into the project .venv
On aarch64 only: installs FFmpeg development packages and builds torchcodec from source into .venv

I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 bash scripts/deployment/dgpu/install_deps.sh

Option B — User-space only (no `install_deps.sh`)

Use this only when CUDA 12.8+ is already installed, system ffmpeg / libaio-dev are already present, and your policy forbids the script's apt or CUDA steps. From the Isaac-GR00T repo root, install uv if needed, then:

command -v uv >/dev/null || curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="/usr/local/cuda/bin:$HOME/.local/bin:$PATH"
export CUDA_HOME=/usr/local/cuda
uv sync
uv pip install -e .

You still need a working video backend for LIBERO (see Step 2). On aarch64, building torchcodec inside .venv without the script is possible but manual; see Troubleshooting.

IMPORTANT

PATH and CUDA_HOME matter on multi-toolkit hosts. If the system has both an old Ubuntu nvidia-cuda-toolkit package (/usr/bin/nvcc ≈ 12.0) and a current NVIDIA CUDA repo install (/usr/local/cuda-13.x/bin/nvcc), builds launched by uv use whichever nvcc appears first on PATH. Putting /usr/local/cuda/bin first (and exporting CUDA_HOME) is required for flash-attn's source build to find the matching toolkit. Verify with nvcc --version after the export.
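
As a quick sanity check after the exports, the short Python sketch below reports which nvcc a PATH lookup would select. It is an illustration only: the helper name first_nvcc is hypothetical and is not part of Isaac GR00T or uv.

```python
import os
import shutil
from typing import Optional


def first_nvcc(path_value: str) -> Optional[str]:
    """Return the nvcc executable that a PATH search would find first, or None."""
    return shutil.which("nvcc", path=path_value)


if __name__ == "__main__":
    # Expect a path under /usr/local/cuda/bin once the exports above are in effect.
    print("nvcc resolves to:", first_nvcc(os.environ.get("PATH", "")))
    print("CUDA_HOME is:", os.environ.get("CUDA_HOME"))
```

If the first line shows /usr/bin/nvcc, the older toolkit is still ahead on PATH and the export has not taken effect in this shell.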

WARNING

flash-attn build on aarch64 takes ~2 hours from source. The upstream pyproject.toml only lists pre-built flash-attn==2.7.4.post1 wheels for x86_64; on aarch64 (Grace + GB300), uv sync falls back to compiling ~72 CUDA kernels from source. A faster route is to pin flash-attn==2.8.1 and reuse the GitHub release's prebuilt aarch64 wheel:

# In pyproject.toml under [project] dependencies.
# The wheel below is built against torch 2.10 (see "torch2.10" in its filename), but upstream
# n1.6-release pins torch==2.7.1 — you MUST bump torch and its companions to match, otherwise
# `import flash_attn` fails at runtime with an undefined-symbol (C++ ABI) error and the pipeline
# dies at Step 5 (base-model load):
"torch==2.10.0",          # was 2.7.1
"torchvision==0.25.0",    # pairs with torch 2.10
"triton==3.6.0",          # torch 2.10.0 requires triton 3.6.0
"flash-attn==2.8.1",

# In [tool.uv.sources]:
flash-attn = [
    { url = "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.1/flash_attn-2.8.1+cu12torch2.10cxx11abiTRUE-cp312-cp312-linux_aarch64.whl",
      marker = "sys_platform == 'linux' and platform_machine == 'aarch64' and python_version == '3.12'" },
]

With this pin, uv sync finishes in ~1 minute on aarch64 instead of ~2 hours. The torch/torchvision/triton bump is required, not optional — the wheel will not import against the upstream 2.7.1 pin. This matches the torch 2.10 stack this playbook validates at Step 8.
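
To confirm the environment actually picked up the bumped pins, a small check like the one below can be run inside .venv after uv sync. This is a sketch, not part of the playbook's tooling: the helper names are hypothetical, and it assumes local version suffixes (such as a +cu... tag) should be ignored when comparing.

```python
from importlib import metadata
from typing import Dict, Optional

# Pins taken from the pyproject.toml edit above.
EXPECTED = {
    "torch": "2.10.0",
    "torchvision": "0.25.0",
    "triton": "3.6.0",
    "flash-attn": "2.8.1",
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version without any local suffix, or None if absent."""
    try:
        return metadata.version(package).split("+")[0]
    except metadata.PackageNotFoundError:
        return None


def mismatches(expected: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Map each package whose installed version differs from its pin to what was found."""
    found = {name: installed_version(name) for name in expected}
    return {name: ver for name, ver in found.items() if ver != expected[name]}


if __name__ == "__main__":
    problems = mismatches(EXPECTED)
    print("all pins match" if not problems else f"mismatched or missing: {problems}")
```

A version match does not prove that import flash_attn succeeds, so it is still worth running that import once before Step 5.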

Activate the virtual environment:

source .venv/bin/activate

Verify GPU access:

CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_name(0))"

Expected output: NVIDIA GB300

NOTE

Examples in this playbook use CUDA_VISIBLE_DEVICES=0 because the GB300 is at index 0 on a single-GPU Station. On a multi-GPU Station (for example RTX PRO 6000 + GB300), the GB300 may be at a different index — run nvidia-smi --query-gpu=index,name --format=csv,noheader, find the GB300 row, and substitute that index everywhere CUDA_VISIBLE_DEVICES=0 appears below.
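
The lookup described in this note can also be scripted. The sketch below wraps the same nvidia-smi query and scans its CSV output for a GPU name containing "GB300"; the function names are hypothetical and the parsing assumes the "index, name" line format that --format=csv,noheader produces.

```python
import subprocess
from typing import Optional


def query_gpus() -> str:
    """Run the nvidia-smi query from the note above and return its CSV text."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def find_gpu_index(csv_text: str, name_fragment: str = "GB300") -> Optional[int]:
    """Return the index of the first GPU whose name contains name_fragment."""
    for line in csv_text.splitlines():
        index, sep, name = line.partition(",")
        if sep and name_fragment in name:
            return int(index.strip())
    return None
```

Calling find_gpu_index(query_gpus()) on the Station returns the value to use for CUDA_VISIBLE_DEVICES, or None if no GB300 is listed.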

Step 2
PyAV patch for LIBERO video (strongly recommended)

On many stacks torchcodec fails to import or build, so the video-backend resolver falls back to pyav (the fallback order is torchcodec → decord → pyav → ffmpeg). On stock n1.6-release, get_frames_by_indices can then raise NotImplementedError for the pyav backend. Without this patch, training may also appear hung: the GPU sits idle and no traceback is printed while ffmpeg spawns per-frame decode work on the CPU.

The patch ships with this playbook, not with the Isaac-GR00T clone, so fetch it first. Clone the playbook repo (the public mirror is shown here; it is also the "View on GitHub" link on this page), then apply the patch from the Isaac-GR00T repo root with n1.6-release checked out and .venv activated:

# Fetch the playbook (contains the patch) somewhere outside your Isaac-GR00T clone:
git clone https://github.com/NVIDIA/dgx-spark-playbooks /tmp/dgx-spark-playbooks

# From the Isaac-GR00T repo root (n1.6-release checked out, .venv active):
git apply /tmp/dgx-spark-playbooks/nvidia/station-gr00t/assets/patches/001-pyav-get-frames-by-indices.patch
uv pip install av

Already have this playbook checked out (or its assets bundle)? Use that path instead of the clone above. If you copied nvidia/station-gr00t/assets/patches/ into the Isaac-GR00T root instead, use git apply assets/patches/001-pyav-get-frames-by-indices.patch.

Details, re-apply rules, and the copy-into-clone alternative: nvidia/station-gr00t/assets/patches/README.md.

After patching, repeated log lines such as Video backend 'torchcodec' is not available, falling back to 'pyav' are expected and noisy but not fatal.
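
To see which backend your environment would land on, the fallback order described in this step can be probed directly. The sketch below is not GR00T's resolver: it is a hypothetical stand-in that tries each import in the stated order, and it assumes "av" is the import name for PyAV and that the ffmpeg fallback only needs the ffmpeg binary on PATH.

```python
import importlib
import shutil
from typing import Callable, Optional, Sequence

# Fallback order described above; "av" is the import name of the PyAV package.
BACKEND_MODULES = {"torchcodec": "torchcodec", "decord": "decord", "pyav": "av"}
ORDER = ("torchcodec", "decord", "pyav", "ffmpeg")


def is_usable(backend: str) -> bool:
    """True if the backend's module imports cleanly (or ffmpeg is on PATH)."""
    if backend == "ffmpeg":
        return shutil.which("ffmpeg") is not None
    try:
        importlib.import_module(BACKEND_MODULES[backend])
        return True
    except Exception:  # torchcodec can be installed yet still fail at import time
        return False


def first_available(order: Sequence[str] = ORDER,
                    probe: Callable[[str], bool] = is_usable) -> Optional[str]:
    """Return the first backend in order that the probe accepts, or None."""
    return next((name for name in order if probe(name)), None)


if __name__ == "__main__":
    print("first usable video backend:", first_available())
```

If this prints pyav, the patch in this step applies to your setup and the slower timings noted for the PyAV fallback in Step 6 are the ones to expect.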

Step 3
Set up Hugging Face authentication

export HF_TOKEN="your_huggingface_token"

Get a token from https://huggingface.co/settings/tokens if you don't have one.

Step 4
Download the dataset and model

Download the LIBERO Spatial dataset and the GR00T N1.6 base model:

# Download LIBERO Spatial dataset (~2-3 GB)
huggingface-cli download \
    --repo-type dataset IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot \
    --local-dir examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/

# Copy the LIBERO modality config into the dataset's meta/ directory
cp examples/LIBERO/modality.json \
    examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/

# Download GR00T N1.6 base model (~6 GB)
huggingface-cli download nvidia/GR00T-N1.6-3B

NOTE

HF cache permission errors: If huggingface-cli download fails with Permission denied: '/home/.../.cache/huggingface/hub/...', the cache directory was previously created by a Docker container running as root (common on shared dev boxes). Point HF at a user-owned cache for this run:

export HF_HOME=$HOME/hf_cache_gr00t

Transient xet-read-token 500 errors: Hugging Face's xet backend occasionally returns 500 Internal Server Error for dataset downloads. Disable it:

export HF_HUB_DISABLE_XET=1
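
Before starting a multi-gigabyte download, it can help to check whether the cache location is writable by the current user. The sketch below is a coarse check under stated assumptions: the helper names are hypothetical, the default location is taken to be ~/.cache/huggingface when HF_HOME is unset, and only the top-level directory is tested, so a root-owned subdirectory deeper in hub/ can still cause the error above.

```python
import os
from pathlib import Path


def hf_home(env=os.environ) -> Path:
    """Resolve the Hugging Face home directory: HF_HOME if set, else the usual default."""
    return Path(env.get("HF_HOME") or Path.home() / ".cache" / "huggingface")


def is_writable(directory: Path) -> bool:
    """True if the directory (or its nearest existing parent) is writable by this user."""
    probe = Path(directory)
    while not probe.exists() and probe != probe.parent:
        probe = probe.parent
    return os.access(probe, os.W_OK)


if __name__ == "__main__":
    hub = hf_home() / "hub"
    print(hub, "is writable" if is_writable(hub) else "is NOT writable; set HF_HOME")
```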

Verify the dataset is ready:

ls examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/modality.json

Expected result: the command prints the path to modality.json and ls exits 0. That confirms the copied modality file sits next to the downloaded LeRobot dataset metadata.
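
The ls check only shows that the file exists. A slightly stronger check is to parse it, which catches a truncated or corrupt copy. The sketch below assumes only that modality.json holds a JSON object at the top level; it makes no assumption about which keys GR00T expects, and the function name is hypothetical.

```python
import json
from pathlib import Path
from typing import List


def modality_keys(meta_dir: str) -> List[str]:
    """Parse meta/modality.json and return its top-level keys, sorted."""
    with (Path(meta_dir) / "modality.json").open() as f:
        data = json.load(f)  # raises json.JSONDecodeError on a truncated or corrupt copy
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level of modality.json")
    return sorted(data)
```

For this playbook the call would be modality_keys("examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta").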

Step 5
Verify the base model loads and runs

Confirm the GR00T N1.6 base model loads and produces actions using the GR1 demo shipped on n1.6-release:

TORCHDYNAMO_DISABLE=1 CUDA_VISIBLE_DEVICES=0 python scripts/deployment/standalone_inference_script.py \
    --model-path nvidia/GR00T-N1.6-3B \
    --dataset-path demo_data/gr1.PickNPlace \
    --embodiment-tag GR1 \
    --traj-ids 0 \
    --inference-mode pytorch \
    --action-horizon 8 \
    --steps 32

TORCHDYNAMO_DISABLE=1 avoids torch.compile / Triton paths that can fail on GB300 with ptxas-blackwell fatal: Value 'sm_103a' is not defined for option 'gpu-name'. Keep it on all standalone_inference_script.py invocations in this playbook unless you have a Triton build that supports SM103.

You should see per-step timing output and no errors. This confirms the model, CUDA, and data pipeline work before a long fine-tuning run.

NOTE

The base model's pretrained processor does not include the LIBERO_PANDA embodiment configuration, so you cannot run this standalone script on the LIBERO dataset with the base checkpoint alone. The LIBERO modality config is registered during fine-tuning. That is expected — LIBERO is a post-training benchmark.

Step 6
Fine-tune GR00T N1.6 on LIBERO Spatial

Fine-tune the base model on LIBERO Spatial. DGX Station's GB300 GPU with 284 GB HBM3e allows a global batch size of 128, several times what fits on a typical 80 GB GPU. Larger batches stabilize gradients and improve wall-clock throughput when the dataloader keeps the GPU fed.

CUDA_VISIBLE_DEVICES=0 python \
    gr00t/experiment/launch_finetune.py \
    --base-model-path nvidia/GR00T-N1.6-3B \
    --dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \
    --embodiment-tag LIBERO_PANDA \
    --num-gpus 1 \
    --output-dir output/libero_spatial_ft \
    --save-steps 500 \
    --save-total-limit 5 \
    --max-steps 2000 \
    --global-batch-size 128 \
    --learning-rate 1e-4 \
    --warmup-ratio 0.05 \
    --weight-decay 1e-5 \
    --state-dropout-prob 0.8 \
    --color-jitter-params brightness 0.3 contrast 0.4 saturation 0.5 hue 0.08 \
    --dataloader-num-workers 4

If GPU utilization stays near zero for many minutes while the process is alive, suspect video decoding (see Step 2 patch and Troubleshooting). You can try --dataloader-num-workers 8 if CPU cores are available.

Training runs for 2000 steps at batch size 128 and takes approximately 20–25 minutes on GB300 when torchcodec is the active video backend.

IMPORTANT

With the PyAV fallback (Step 2 patch + no torchcodec), expect ~5–6 s per step instead of <1 s, so 2000 steps takes roughly 3 hours (2000 × 5–6 s ≈ 2.8–3.3 hours). GPU utilization sits in the 3–30 % range while CPU-side video decoding starves the GPU. To validate the workflow without the long wait, lower --max-steps (e.g. 100) and --save-steps (e.g. 50); loss should still drop visibly (validated drop 1.07 → 0.63 in 100 steps in this playbook's GB300 run). If you need full-throughput training, build torchcodec from source (Troubleshooting → "Video decoding errors") or run Option A, which builds it for you.
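
The run-time figures in this step are simple products of step count and per-step time. The sketch below reproduces that arithmetic so you can plug in your own --max-steps or a per-step time read from your logs; the function names are illustrative.

```python
def hours_for(steps: int, seconds_per_step: float) -> float:
    """Wall-clock hours for a run of `steps` at a steady per-step time."""
    return steps * seconds_per_step / 3600


def seconds_per_step(total_minutes: float, steps: int) -> float:
    """Per-step time implied by a total run time."""
    return total_minutes * 60 / steps


if __name__ == "__main__":
    # torchcodec figures quoted above: 2000 steps in 20-25 minutes.
    print(seconds_per_step(20, 2000), seconds_per_step(25, 2000))  # 0.6 0.75
    # PyAV fallback: 5-6 s per step over 2000 steps.
    print(round(hours_for(2000, 5), 2), round(hours_for(2000, 6), 2))  # 2.78 3.33
    # A 100-step validation run on the fallback path, in minutes.
    print(round(hours_for(100, 6) * 60, 1))  # 10.0
```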

NOTE

This playbook uses 2000 steps to keep execution time under an hour. For production-quality results closer to the published 97.65% success rate on LIBERO Spatial, increase to 20,000 steps (--max-steps 20000). Published settings used batch size 640 across 8 GPUs (80 per GPU), so 128 on one GB300 exceeds the per-GPU batch in that reference.
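
The batch-size comparison in this note comes down to two small calculations: the per-GPU share of a global batch, and the number of samples a run consumes. A minimal sketch, using only the numbers quoted in this step (function names are illustrative):

```python
def per_gpu_batch(global_batch: int, num_gpus: int) -> float:
    """Samples each GPU processes per step when the global batch is split evenly."""
    return global_batch / num_gpus


def samples_seen(steps: int, global_batch: int) -> int:
    """Total training samples consumed over a run."""
    return steps * global_batch


if __name__ == "__main__":
    print(per_gpu_batch(640, 8))     # 80.0  (published reference setup)
    print(per_gpu_batch(128, 1))     # 128.0 (this playbook on one GB300)
    print(samples_seen(2000, 128))   # 256000
    print(samples_seen(20000, 128))  # 2560000
```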

What the training flags mean:

| Flag | Value | Purpose |
| --- | --- | --- |
| `--global-batch-size` | 128 | Total samples per training step; enabled by GB300 memory. |
| `--state-dropout-prob` | 0.8 | Drops proprioceptive state 80% of the time so the model relies on vision. |
| `--color-jitter-params` | brightness/contrast/saturation/hue | Photometric augmentation for lighting robustness. |
| `--warmup-ratio` | 0.05 | Linear LR warmup over the first 5% of steps. |
| `--save-steps` | 500 | Checkpoint cadence under `output/libero_spatial_ft/`. |

Monitor the Hugging Face Trainer loss in the terminal. Checkpoints land under output/libero_spatial_ft/.
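
If you shorten the run (for example --max-steps 100), the final checkpoint directory is no longer checkpoint-2000. The sketch below finds the highest-numbered checkpoint under the output directory so later steps can point at it. It assumes the checkpoint-<step> directory naming used elsewhere in this playbook; the function name is hypothetical.

```python
import re
from pathlib import Path
from typing import Optional


def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint-<step> directory with the highest step, or None."""
    best_step, best_path = -1, None
    for child in Path(output_dir).glob("checkpoint-*"):
        match = re.fullmatch(r"checkpoint-(\d+)", child.name)
        # Compare steps numerically: a plain string sort would rank 500 above 2000.
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path


if __name__ == "__main__":
    print(latest_checkpoint("output/libero_spatial_ft"))
```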

Step 7
Evaluate the fine-tuned model

Open-loop evaluation compares predicted actions to dataset ground truth and writes plots to /tmp/open_loop_eval/:

CUDA_VISIBLE_DEVICES=0 python gr00t/eval/open_loop_eval.py \
    --dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \
    --embodiment-tag LIBERO_PANDA \
    --model-path output/libero_spatial_ft/checkpoint-2000/ \
    --traj-ids 0 1 2 \
    --action-horizon 16

How to read the run: the terminal prints per-trajectory MSE/MAE and averages. The JPEGs under /tmp/open_loop_eval/ overlay predicted vs ground-truth trajectories per action dimension (translation, rotation, gripper). Use them to confirm the policy tracks pick-and-place phases and gripper open/close timing on spatial tasks.

TIP

At 2000 steps you should see clear improvement over a random policy; at 20,000 steps, published LIBERO Spatial success reaches 97.65% in closed-loop sim.

Step 8
Run inference on a LIBERO sample (timing + actions)

This step passes LIBERO Spatial observations through the fine-tuned checkpoint (the base model cannot run this embodiment). TORCHDYNAMO_DISABLE=1 is included for GB300:

TORCHDYNAMO_DISABLE=1 CUDA_VISIBLE_DEVICES=0 python scripts/deployment/standalone_inference_script.py \
    --model-path output/libero_spatial_ft/checkpoint-2000/ \
    --dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \
    --embodiment-tag LIBERO_PANDA \
    --traj-ids 0 \
    --inference-mode pytorch \
    --action-horizon 8

What to inspect: the script prints a detailed timing summary covering model-load and dataset-load times, per-trajectory episode-loading, data-preparation, and inference timings, and per-step inference statistics (avg / min / max / P90), alongside the MSE/MAE of predicted vs. ground-truth actions. Compare these to Step 5's base-model smoke test. With TORCHDYNAMO_DISABLE=1 (eager mode), per-step latency on GB300 depends heavily on the torch + CUDA stack. Expect ~3–4 s/step on torch 2.10 + cu130 (validated in this playbook's run on a fine-tuned checkpoint-100); a compiled torch 2.7 + cu128 stack with Triton support for sm_103 can be much faster. Treat average per-step inference latency as the most stable signal across stacks.
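
If you collect per-step latencies yourself (for example from several runs), the same four statistics are easy to recompute. The sketch below is an illustration, not the script's own code: it uses the nearest-rank definition of P90, which is an assumption, so small differences from the script's reported P90 are possible.

```python
from typing import Dict, Sequence


def summarize_latencies(seconds: Sequence[float]) -> Dict[str, float]:
    """Return avg / min / max / p90 for a list of per-step latencies in seconds."""
    if not seconds:
        raise ValueError("need at least one latency sample")
    ordered = sorted(seconds)
    rank = (9 * len(ordered) + 9) // 10  # ceil(0.9 * n) using integer arithmetic
    return {
        "avg": sum(ordered) / len(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "p90": ordered[rank - 1],  # nearest-rank 90th percentile
    }


if __name__ == "__main__":
    # Hypothetical sample values, only to show the output shape.
    print(summarize_latencies([3.0, 1.0, 2.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]))
```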

Step 9
Clean up

deactivate
cd ..
rm -rf Isaac-GR00T

Fine-tuned checkpoints under output/libero_spatial_ft/ are removed with the repo. Copy them elsewhere first if you want to keep them.

Next steps

Increase training steps — --max-steps 20000 for stronger LIBERO Spatial alignment (~3.5 hours at the same throughput).
Other LIBERO suites — libero_10_no_noops, libero_goal_no_noops, libero_object_no_noops from IPEC-COMMUNITY on Hugging Face.
Closed-loop sim — LIBERO sim server/client: LIBERO evaluation in Isaac GR00T.
Custom embodiments — Fine-tune a new embodiment (LeRobot v2 + modality JSON).
Tune more of the stack — --tune-llm / --tune-visual raise memory use; probe batch size if you enable them.
