Fine-tune with NeMo
Use NVIDIA NeMo to fine-tune models locally
Verify system requirements
Check that your NVIDIA Spark device meets the prerequisites for NeMo AutoModel installation. This step runs on the host system to confirm CUDA toolkit availability and Python version compatibility.
# Verify CUDA installation
nvcc --version
# Check Python version (3.10+ required)
python3 --version
# Verify GPU accessibility
nvidia-smi
# Check available system memory
free -h
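The steps that follow run inside a container, so it is also worth confirming that Docker can reach the GPU before pulling the large image. A quick smoke test, assuming Docker and the NVIDIA Container Toolkit are already installed (the toolkit injects nvidia-smi into containers started with --gpus all):
# Run nvidia-smi inside a minimal throwaway container
docker run --rm --gpus all ubuntu nvidia-smi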
Get the container image
Pull the NVIDIA PyTorch container from NGC. It provides the CUDA-enabled PyTorch stack that the NeMo AutoModel installation below builds on.
docker pull nvcr.io/nvidia/pytorch:25.08-py3
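Once the pull finishes, you can confirm the image is present and check its size on disk:
# List the downloaded image
docker images nvcr.io/nvidia/pytorch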
Launch Docker
docker run \
--gpus all \
--ulimit memlock=-1 \
-it --ulimit stack=67108864 \
--entrypoint /usr/bin/bash \
--rm nvcr.io/nvidia/pytorch:25.08-py3
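Because --rm discards the container's filesystem on exit, anything created inside (the cloned repository, virtual environment, and training checkpoints) is lost when you leave the shell. A hedged variant that mounts a host directory so your work persists (the host path is an example; adjust as needed):
docker run \
--gpus all \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-v $HOME/automodel-work:/workspace/automodel-work \
-w /workspace/automodel-work \
-it --entrypoint /usr/bin/bash \
--rm nvcr.io/nvidia/pytorch:25.08-py3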
Install package management tools
Install uv for efficient package management and virtual environment isolation. NeMo AutoModel uses uv for dependency management and automatic environment handling.
# Install uv package manager
pip3 install uv
# Verify installation
uv --version
If system installation fails:
# Install for current user only
pip3 install --user uv
# Add to PATH if needed
export PATH="$HOME/.local/bin:$PATH"
Clone NeMo AutoModel repository
Clone the official NeMo AutoModel repository to access recipes and examples. This provides ready-to-use training configurations for various model types and training scenarios.
# Clone the repository
git clone https://github.com/NVIDIA-NeMo/Automodel.git
# Navigate to the repository
cd Automodel
Install NeMo AutoModel
Set up the virtual environment and install NeMo AutoModel. Choose between the wheel package for stability or a source installation for the latest features; the commands below install from the cloned repository, reusing the PyTorch stack that the container already provides.
Install from the cloned repository:
# Initialize a virtual environment that reuses the container's system site packages
uv venv --system-site-packages
# Install dependencies, skipping packages the PyTorch container already provides
uv sync --inexact --frozen --all-extras \
--no-install-package torch \
--no-install-package torchvision \
--no-install-package triton \
--no-install-package nvidia-cublas-cu12 \
--no-install-package nvidia-cuda-cupti-cu12 \
--no-install-package nvidia-cuda-nvrtc-cu12 \
--no-install-package nvidia-cuda-runtime-cu12 \
--no-install-package nvidia-cudnn-cu12 \
--no-install-package nvidia-cufft-cu12 \
--no-install-package nvidia-cufile-cu12 \
--no-install-package nvidia-curand-cu12 \
--no-install-package nvidia-cusolver-cu12 \
--no-install-package nvidia-cusparse-cu12 \
--no-install-package nvidia-cusparselt-cu12 \
--no-install-package nvidia-nccl-cu12 \
--no-install-package transformer-engine \
--no-install-package nvidia-modelopt \
--no-install-package nvidia-modelopt-core \
--no-install-package flash-attn \
--no-install-package transformer-engine-cu12 \
--no-install-package transformer-engine-torch
# Install bitsandbytes
CMAKE_ARGS="-DCOMPUTE_BACKEND=cuda -DCOMPUTE_CAPABILITY=80;86;87;89;90" \
CMAKE_BUILD_PARALLEL_LEVEL=8 \
uv pip install --no-deps git+https://github.com/bitsandbytes-foundation/bitsandbytes.git@50be19c39698e038a1604daf3e1b939c9ac1c342
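Before moving on, a quick import check confirms the bitsandbytes build succeeded inside the environment:
# Verify the bitsandbytes build
uv run --frozen --no-sync python -c "import bitsandbytes; print(bitsandbytes.__version__)"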
Verify installation
Confirm NeMo AutoModel is properly installed and accessible. This step validates the installation and checks for any missing dependencies.
# Test NeMo AutoModel import
uv run --frozen --no-sync python -c "import nemo_automodel; print('✅ NeMo AutoModel ready')"
# Check available examples
ls -la examples/
# Below is an example of the expected output (username and domain-users are placeholders).
# $ ls -la examples/
# total 36
# drwxr-xr-x 9 username domain-users 4096 Oct 16 14:52 .
# drwxr-xr-x 16 username domain-users 4096 Oct 16 14:52 ..
# drwxr-xr-x 3 username domain-users 4096 Oct 16 14:52 benchmark
# drwxr-xr-x 3 username domain-users 4096 Oct 16 14:52 diffusion
# drwxr-xr-x 20 username domain-users 4096 Oct 16 14:52 llm_finetune
# drwxr-xr-x 3 username domain-users 4096 Oct 14 09:27 llm_kd
# drwxr-xr-x 2 username domain-users 4096 Oct 16 14:52 llm_pretrain
# drwxr-xr-x 6 username domain-users 4096 Oct 14 09:27 vlm_finetune
# drwxr-xr-x 2 username domain-users 4096 Oct 14 09:27 vlm_generate
Explore available examples
Review the pre-configured training recipes available for different model types and training scenarios. These recipes provide optimized configurations for the ARM64 and Blackwell architectures.
# List LLM fine-tuning examples
ls examples/llm_finetune/
# View example recipe configuration
cat examples/llm_finetune/finetune.py | head -20
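You can also inspect the full YAML recipe used by the LoRA example below to see which options the command-line overrides later replace:
# View a complete recipe configuration
cat examples/llm_finetune/llama3_2/llama3_2_1b_squad_peft.yaml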
Run sample fine-tuning
The following commands show how to perform full fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) with LoRA or QLoRA.
First, export your HF_TOKEN so that gated models can be downloaded.
# Export your Hugging Face token so gated models can be downloaded
export HF_TOKEN=<your_huggingface_token>
NOTE
Replace <your_huggingface_token> with your personal Hugging Face access token. A valid token is required to download any gated model.
- Generate a token in your Hugging Face account settings, under Access Tokens.
- Request and receive access on each model's page (and accept license/terms) before attempting downloads.
- Llama-3.1-8B: meta-llama/Llama-3.1-8B
- Qwen3-8B: Qwen/Qwen3-8B
- Mixtral-8x7B: mistralai/Mixtral-8x7B-v0.1
The same steps apply for any other gated model you use: visit its model card on Hugging Face, request access, accept the license, and wait for approval.
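Before starting a long download, you can confirm the exported token is valid. A minimal check, assuming huggingface_hub is available in the environment (it is pulled in as a transformers dependency):
# Print the account name associated with HF_TOKEN
uv run --frozen --no-sync python -c "from huggingface_hub import whoami; print(whoami()['name'])"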
LoRA fine-tuning example:
Execute a fine-tuning example to validate the complete setup. This demonstrates parameter-efficient fine-tuning with LoRA. The examples below use YAML files for configuration; parameter overrides are passed as command-line arguments.
# Run basic LLM fine-tuning example
uv run --frozen --no-sync \
examples/llm_finetune/finetune.py \
-c examples/llm_finetune/llama3_2/llama3_2_1b_squad_peft.yaml \
--model.pretrained_model_name_or_path meta-llama/Llama-3.1-8B \
--packed_sequence.packed_sequence_size 1024 \
--step_scheduler.max_steps 100
These overrides ensure the Llama-3.1-8B LoRA run behaves as expected:
- --model.pretrained_model_name_or_path: selects the Llama-3.1-8B model to fine-tune from the Hugging Face model hub (weights are fetched via your Hugging Face token).
- --packed_sequence.packed_sequence_size: sets the packed sequence size to 1024 to enable packed sequence training.
- --step_scheduler.max_steps: sets the maximum number of training steps. We set it to 100 for demonstration purposes; please adjust this based on your needs.
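While a run is in progress, you can watch GPU memory and utilization from a second shell on the host:
# Refresh GPU stats every two seconds
watch -n 2 nvidia-smi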
QLoRA fine-tuning example:
We can use QLoRA to fine-tune large models in a memory-efficient manner.
uv run --frozen --no-sync \
examples/llm_finetune/finetune.py \
-c examples/llm_finetune/llama3_1/llama3_1_8b_squad_qlora.yaml \
--model.pretrained_model_name_or_path meta-llama/Meta-Llama-3-70B \
--loss_fn._target_ nemo_automodel.components.loss.te_parallel_ce.TEParallelCrossEntropy \
--step_scheduler.local_batch_size 1 \
--packed_sequence.packed_sequence_size 1024 \
--step_scheduler.max_steps 100
These overrides ensure the 70B QLoRA run behaves as expected:
- --model.pretrained_model_name_or_path: selects the 70B base model to fine-tune (weights are fetched via your Hugging Face token).
- --loss_fn._target_: uses the TransformerEngine-parallel cross-entropy loss variant compatible with tensor-parallel training for large LLMs.
- --step_scheduler.local_batch_size: sets the per-GPU micro-batch size to 1 to fit 70B in memory; the overall effective batch size is still driven by gradient accumulation and the data/tensor parallel settings from the recipe.
- --step_scheduler.max_steps: sets the maximum number of training steps. We set it to 100 for demonstration purposes; please adjust this based on your needs.
- --packed_sequence.packed_sequence_size: sets the packed sequence size to 1024 to enable packed sequence training.
Full Fine-tuning example:
Once inside the Automodel directory you cloned from GitHub, run:
uv run --frozen --no-sync \
examples/llm_finetune/finetune.py \
-c examples/llm_finetune/qwen/qwen3_8b_squad_spark.yaml \
--model.pretrained_model_name_or_path Qwen/Qwen3-8B \
--step_scheduler.local_batch_size 1 \
--step_scheduler.max_steps 100 \
--packed_sequence.packed_sequence_size 1024
These overrides ensure the Qwen3-8B SFT run behaves as expected:
- --model.pretrained_model_name_or_path: selects the Qwen/Qwen3-8B model to fine-tune from the Hugging Face model hub (weights are fetched via your Hugging Face token). Adjust this if you want to fine-tune a different model.
- --step_scheduler.max_steps: sets the maximum number of training steps. We set it to 100 for demonstration purposes; please adjust this based on your needs.
- --step_scheduler.local_batch_size: sets the per-GPU micro-batch size to 1 to fit in memory; the overall effective batch size is still driven by gradient accumulation and the data/tensor parallel settings from the recipe.
Validate successful training completion
Validate the fine-tuned model by inspecting artifacts contained in the checkpoint directory.
# Inspect the checkpoint output.
# LATEST is a symlink to the most recent checkpoint saved during training.
# Below is an example of the expected output (username and domain-users are placeholders).
ls -lah checkpoints/LATEST/
# $ ls -lah checkpoints/LATEST/
# total 32K
# drwxr-xr-x 6 username domain-users 4.0K Oct 16 22:33 .
# drwxr-xr-x 4 username domain-users 4.0K Oct 16 22:33 ..
# -rw-r--r-- 1 username domain-users 1.6K Oct 16 22:33 config.yaml
# drwxr-xr-x 2 username domain-users 4.0K Oct 16 22:33 dataloader
# drwxr-xr-x 2 username domain-users 4.0K Oct 16 22:33 model
# drwxr-xr-x 2 username domain-users 4.0K Oct 16 22:33 optim
# drwxr-xr-x 2 username domain-users 4.0K Oct 16 22:33 rng
# -rw-r--r-- 1 username domain-users 1.3K Oct 16 22:33 step_scheduler.pt
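The model directory holds the fine-tuned weights. If the checkpoint was saved in standard Hugging Face format (which the optional upload step below assumes), a minimal generation smoke test might look like the sketch below; for LoRA runs the directory may contain only adapter weights, in which case you would load the base model first and attach the adapter.
# Minimal generation smoke test; a sketch, assuming an HF-format checkpoint
uv run --frozen --no-sync python - <<'EOF'
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "checkpoints/LATEST/model"  # directory from the listing above
# If the tokenizer was not saved with the weights, load it from the base model id instead.
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, device_map="auto")
inputs = tok("The capital of France is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))
EOF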
Cleanup and rollback (Optional)
Remove the installation and restore the original environment if needed. These commands safely remove all installed components.
WARNING
This will delete all virtual environments and downloaded models. Ensure you have backed up any important training checkpoints.
# Remove virtual environment
rm -rf .venv
# Remove cloned repository
cd ..
rm -rf Automodel
# Remove uv (if installed with --user)
pip3 uninstall uv
# Clear Python cache
rm -rf ~/.cache/pip
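Downloaded model weights live in the Hugging Face cache, which the commands above do not touch. To reclaim that space as well (assuming the default cache location):
# Remove cached Hugging Face downloads
rm -rf ~/.cache/huggingface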
Optional: Publish your fine-tuned model checkpoint on Hugging Face Hub
NOTE
This step is optional and is not required for using the fine-tuned model.
It is useful if you want to share the model with others or reuse it in other projects, which can then pull the checkpoint directly from the Hub.
Uploading requires the Hugging Face CLI, which ships with the huggingface_hub package; install it with pip install -U huggingface_hub.
For more information, please refer to the Hugging Face CLI documentation.
TIP
You can use the hf upload command to push the fine-tuned model checkpoint to the Hugging Face Hub.
For more information, please refer to the Hugging Face CLI documentation.
# Publish the fine-tuned model checkpoint to the Hugging Face Hub.
# It will be published under <your_huggingface_username>/my-cool-model; adjust the name as needed.
hf upload my-cool-model checkpoints/LATEST/model
TIP
The above command can fail if the HF_TOKEN you used does not have write permissions on the Hugging Face Hub. Sample error message:
username@host:~/Automodel$ hf upload my-cool-model checkpoints/LATEST/model
Traceback (most recent call last):
File "/home/username/.local/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 409, in hf_raise_for_status
response.raise_for_status()
File "/home/username/.local/lib/python3.10/site-packages/requests/models.py", line 1024, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://huggingface.co/api/repos/create
To fix this, create an access token with write permissions in your Hugging Face account settings and use it instead; see the Hugging Face documentation on user access tokens for instructions.
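After creating a write-scoped token, make it visible to the CLI before retrying the upload, either by re-exporting the environment variable or via the interactive login prompt:
# Option 1: re-export the new token
export HF_TOKEN=<your_write_token>
# Option 2: store the token via the CLI login prompt
hf auth login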
Next steps
Begin using NeMo AutoModel for your specific fine-tuning tasks. Start with provided recipes and customize based on your model requirements and dataset.
# Copy a recipe YAML for customization
cp examples/llm_finetune/llama3_2/llama3_2_1b_squad_peft.yaml my_custom_recipe.yaml
# Edit the YAML for your specific model and data
# Then run: uv run --frozen --no-sync examples/llm_finetune/finetune.py -c my_custom_recipe.yaml
Explore the NeMo AutoModel GitHub repository for more recipes, documentation, and community examples. Consider setting up custom datasets, experimenting with different model architectures, and scaling to multi-node distributed training for larger models.