Fine-tune with PyTorch
Use PyTorch to fine-tune models locally
Configure Docker permissions
To easily manage containers without sudo, you must be in the docker group. If you choose to skip this step, you will need to run Docker commands with sudo.
Open a new terminal and test Docker access. In the terminal, run:
docker ps
If you see a permission denied error (for example, "permission denied while trying to connect to the Docker daemon socket"), add your user to the docker group so that you no longer need to prefix Docker commands with sudo.
sudo usermod -aG docker $USER
newgrp docker
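If `docker ps` still fails after these commands, it can help to confirm that the docker group is actually active in the current session. The following is a minimal sketch using only the Python standard library on Linux; `in_group` is a hypothetical helper name, not part of Docker or the playbook.

```python
import grp
import os


def in_group(group_name: str, group_ids: list[int]) -> bool:
    """Return True if group_name exists and its GID is among group_ids."""
    try:
        target_gid = grp.getgrnam(group_name).gr_gid
    except KeyError:
        # The group does not exist on this system (Docker may not be installed).
        return False
    return target_gid in group_ids


if __name__ == "__main__":
    # os.getgroups() lists the supplementary group IDs of the current process.
    print("docker group active:", in_group("docker", os.getgroups()))
```

If this prints `False` right after `usermod`, the new membership has probably not been picked up by the current shell yet; `newgrp docker` or logging out and back in usually resolves that.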
Pull the latest PyTorch container
docker pull nvcr.io/nvidia/pytorch:25.11-py3
Launch Docker
docker run --gpus all -it --rm --ipc=host \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
-v ${PWD}:/workspace -w /workspace \
nvcr.io/nvidia/pytorch:25.11-py3
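Once the container shell is open, a quick check that PyTorch can see the GPU may save time before starting a long run. This is a small sketch; `describe_gpus` is a hypothetical helper, and it returns an empty list both when PyTorch is not importable and when no CUDA device is visible.

```python
def describe_gpus() -> list[str]:
    """Return the names of the CUDA devices visible to PyTorch, if any."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return []
    if not torch.cuda.is_available():
        # PyTorch is installed, but no CUDA device is visible.
        return []
    return [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]


if __name__ == "__main__":
    names = describe_gpus()
    print("Visible GPUs:", names if names else "none")
```

An empty result inside the container usually points to a missing `--gpus all` flag or a driver issue on the host.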
Install dependencies inside the container
pip install transformers peft datasets trl bitsandbytes
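To confirm that the install succeeded, you can check that each package is importable without actually loading it. The sketch below uses only the standard library; `missing_packages` is a hypothetical helper, and it assumes that the import names match the pip package names listed above.

```python
from importlib.util import find_spec

# Packages installed in the step above (import names assumed equal to pip names).
REQUIRED = ["transformers", "peft", "datasets", "trl", "bitsandbytes"]


def missing_packages(names: list[str]) -> list[str]:
    """Return the subset of names that cannot be found on the import path."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("Missing packages:", missing if missing else "none")
```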
Authenticate with Hugging Face
hf auth login
# <Enter your Hugging Face token>
# <Enter n for git credential>
Clone the git repo with fine-tuning recipes
git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-spark-playbooks/nvidia/pytorch-fine-tune/assets
Run the fine-tuning recipes
Available Fine-Tuning Scripts
The following fine-tuning scripts are provided, each optimized for different model sizes and training approaches:
| Script | Model | Fine-Tuning Type | Description |
|---|---|---|---|
| Llama3_3B_full_finetuning.py | Llama 3.2 3B | Full SFT | Full supervised fine-tuning (all parameters trainable) |
| Llama3_8B_LoRA_finetuning.py | Llama 3.1 8B | LoRA | Low-Rank Adaptation (parameter-efficient) |
| Llama3_70B_LoRA_finetuning.py | Llama 3.1 70B | LoRA | Low-Rank Adaptation with FSDP support |
| Llama3_70B_qLoRA_finetuning.py | Llama 3.1 70B | QLoRA | Quantized LoRA (4-bit quantization for memory efficiency) |
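A rough estimate of weight memory helps explain why the larger models use LoRA and 4-bit quantization. The sketch below multiplies a parameter count by the bytes per parameter at a given precision. It assumes the nominal sizes from the model names (3B, 8B, 70B) rather than exact parameter counts, and it covers weights only: gradients, optimizer state, activations, and quantization overhead are not included.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "bfloat16": 2.0, "4bit": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate memory for model weights alone, in gigabytes (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for label, params in [("3B", 3e9), ("8B", 8e9), ("70B", 70e9)]:
        bf16 = weight_memory_gb(params, "bfloat16")
        q4 = weight_memory_gb(params, "4bit")
        print(f"{label}: about {bf16:.0f} GB in bfloat16, about {q4:.1f} GB in 4-bit")
```

Under these assumptions, the 70B weights take about 140 GB in bfloat16 and about 35 GB in 4-bit, which is the gap QLoRA is designed to close.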
Basic Usage
Run any script with default settings:
# Full fine-tuning on Llama 3.2 3B
python Llama3_3B_full_finetuning.py
# LoRA fine-tuning on Llama 3.1 8B
python Llama3_8B_LoRA_finetuning.py
# LoRA fine-tuning on Llama 3.1 70B
python Llama3_70B_LoRA_finetuning.py
Common Command-Line Arguments
All scripts support the following command-line arguments for customization:
Model Configuration
--model_name: Model name or path (default: varies by script)
--dtype: Model precision: float32, float16, or bfloat16 (default: bfloat16)
Training Configuration
--batch_size: Per-device training batch size (default: varies by script)
--seq_length: Maximum sequence length (default: 2048)
--num_epochs: Number of training epochs (default: 1)
--gradient_accumulation_steps: Gradient accumulation steps (default: 1)
--learning_rate: Learning rate (default: varies by script)
--gradient_checkpointing: Enable gradient checkpointing to save memory (flag)
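The batch size, gradient accumulation, and epoch settings together determine how many optimizer updates a run performs. The sketch below shows the usual arithmetic. It is an approximation, not the scripts' own logic: `num_devices` is a hypothetical parameter (not a script flag), and the final partial batch is assumed to be kept rather than dropped.

```python
import math


def optimizer_steps(
    dataset_size: int,
    batch_size: int,
    grad_accum: int = 1,
    num_epochs: int = 1,
    num_devices: int = 1,
) -> tuple[int, int]:
    """Return (effective batch size, approximate total optimizer steps)."""
    # Samples consumed per optimizer update across all devices.
    effective_batch = batch_size * grad_accum * num_devices
    # Round up so that a final partial batch still counts as one step.
    steps_per_epoch = math.ceil(dataset_size / effective_batch)
    return effective_batch, steps_per_epoch * num_epochs


if __name__ == "__main__":
    print(optimizer_steps(dataset_size=500, batch_size=4))
```

For example, with the documented default of 500 samples and an assumed batch size of 4, `optimizer_steps(500, 4)` returns `(4, 125)`. Raising `--gradient_accumulation_steps` to 2 halves the number of updates while keeping per-step memory roughly unchanged.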
LoRA Configuration (LoRA and QLoRA scripts only)
--lora_rank: LoRA rank; higher values mean more trainable parameters (default: 8)
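To see how the rank controls the number of trainable parameters: LoRA freezes a weight matrix of shape d_out x d_in and trains two small matrices, A (rank x d_in) and B (d_out x rank), in its place. The sketch below counts those parameters. The 4096 x 4096 layer is a hypothetical example size, not a value taken from the scripts.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # A has shape (rank, d_in) and B has shape (d_out, rank).
    return rank * d_in + d_out * rank


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the original (frozen) weight matrix."""
    return d_in * d_out


if __name__ == "__main__":
    d = 4096  # hypothetical hidden size, for illustration only
    for rank in (8, 16, 64):
        share = lora_params(d, d, rank) / full_params(d, d)
        print(f"rank={rank}: {lora_params(d, d, rank)} trainable ({share:.2%} of the layer)")
```

The count grows linearly with the rank, so doubling `--lora_rank` doubles the adapter size for every adapted layer.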
Dataset Configuration
--dataset_size: Number of samples to use from the Alpaca dataset (default: 500)
Logging Configuration
--logging_steps: Log metrics every N steps (default: 1)
--log_dir: Directory for TensorBoard logs (default: logs)
Model Saving
--output_dir: Directory to save the fine-tuned model (default: None, model not saved)
Performance Optimization
--use_torch_compile: Enable torch.compile() for faster training (flag)
WARNING
Important: The --use_torch_compile flag is not compatible with QLoRA (Llama3_70B_qLoRA_finetuning.py).
Only use this flag with full fine-tuning and standard LoRA scripts.
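If you launch the scripts from your own wrapper, a small pre-flight check can catch this combination before a long job starts. The following is only a sketch: `validate_flags` is a hypothetical helper that is not part of the repository, and it identifies the QLoRA script purely by its file name.

```python
def validate_flags(script: str, args: list[str]) -> None:
    """Raise ValueError if --use_torch_compile is combined with the QLoRA script."""
    is_qlora = "qlora" in script.lower()
    if is_qlora and "--use_torch_compile" in args:
        raise ValueError(
            f"--use_torch_compile is not compatible with QLoRA ({script}); "
            "remove the flag or use a full fine-tuning or LoRA script."
        )


if __name__ == "__main__":
    # Accepted: standard LoRA with torch.compile enabled.
    validate_flags("Llama3_8B_LoRA_finetuning.py", ["--use_torch_compile"])
    # Rejected: QLoRA with torch.compile enabled.
    try:
        validate_flags("Llama3_70B_qLoRA_finetuning.py", ["--use_torch_compile"])
    except ValueError as err:
        print("Rejected:", err)
```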
Usage Examples
python Llama3_8B_LoRA_finetuning.py \
--dataset_size 100 \
--num_epochs 1 \
--batch_size 2