LLaMA Factory


Install and fine-tune models with LLaMA Factory

| Symptom | Cause | Fix |
| --- | --- | --- |
| CUDA out of memory during training | Batch size too large for GPU VRAM | Reduce `per_device_train_batch_size` or increase `gradient_accumulation_steps` |
| Cannot access gated repo | Certain Hugging Face models have restricted access | Regenerate your Hugging Face token, and request access to the gated model in your web browser |
| Model download fails or is slow | Network connectivity or Hugging Face Hub issues | Check your internet connection, or set `HF_HUB_OFFLINE=1` to use cached models |
| Training loss not decreasing | Learning rate too high or too low, or insufficient data | Adjust the `learning_rate` parameter or check dataset quality |
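For the out-of-memory fix above, note that reducing `per_device_train_batch_size` and increasing `gradient_accumulation_steps` trade off against each other: the optimizer still sees the product of the two (times the number of GPUs) as its effective batch size. A minimal sketch of that relationship (the helper function here is illustrative, not part of LLaMA Factory):

```python
def effective_batch_size(per_device_train_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_gpus: int = 1) -> int:
    """Global batch size seen by the optimizer per update step."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Original setting: 8 samples per device, no accumulation.
assert effective_batch_size(8, 1) == 8
# Halving the per-device batch and doubling accumulation halves peak
# activation memory while keeping the effective batch size the same.
assert effective_batch_size(4, 2) == 8
```

This is why lowering the per-device batch size alone can change training dynamics, while pairing it with a proportional increase in accumulation steps preserves them.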

NOTE

DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. Because many applications have not yet been updated to take full advantage of UMA, you may encounter memory errors even when your workload is within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:

```shell
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
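To confirm that the buffer cache is what is consuming memory before (and after) flushing, you can inspect the kernel's memory counters; a quick check might look like:

```shell
# Show available memory and how much is held by the page cache.
# On a UMA system like DGX Spark, this cached pool is shared with the GPU.
grep -E 'MemAvailable|^Cached' /proc/meminfo
```

If `Cached` is large and `MemAvailable` is low, flushing the cache as shown above should free memory for the GPU.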