Fine-tune with PyTorch

Symptom	Cause	Fix
Cannot access gated repo for URL	Certain Hugging Face models have restricted access	Regenerate your Hugging Face token and request access to the gated model in your web browser
Errors and time-outs in multi-Spark runs	Various reasons	Set the following environment variables to enable extra logging and runtime consistency checks: `ACCELERATE_DEBUG_MODE=1`, `ACCELERATE_LOG_LEVEL=DEBUG`, `TORCH_CPP_LOG_LEVEL=INFO`, `TORCH_DISTRIBUTED_DEBUG=DETAIL`
task: non-zero exit (255)	Container exited with error code 255	Run `docker ps -a --filter "name=finetuning-multinode"` to get the container ID, then run `docker logs <container_id>` to see the detailed error messages
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?	Docker daemon crash caused by Docker Swarm attempting to bind to a stale or unreachable link-local IP address	1. Stop Docker: `sudo systemctl stop docker`. 2. Remove the Swarm state: `sudo rm -rf /var/lib/docker/swarm`. 3. Restart Docker: `sudo systemctl start docker`. 4. Re-initialize Swarm with a valid advertise address on an active interface, for example `docker swarm init --advertise-addr <ip_address>`
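
The debug variables from the multi-Spark row can be set as in the sketch below. It assumes they are exported in the shell that launches the training job. If the job runs inside a container, they would need to be passed into the container's environment instead, and how that is done depends on your launch setup.

```shell
# Assumption: exported in the shell that launches training on each node.
export ACCELERATE_DEBUG_MODE=1
export ACCELERATE_LOG_LEVEL=DEBUG
export TORCH_CPP_LOG_LEVEL=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL

# Confirm the values are visible before starting the run.
env | grep -E '^(ACCELERATE_|TORCH_)' | sort
```

These settings make the logs considerably more verbose, so it is probably best to unset them once the problem is diagnosed.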

NOTE

DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. Because many applications are still being updated to take advantage of UMA, you may encounter memory issues even when usage is within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
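
To see whether the flush had any effect, one option is to compare the kernel's memory counters before and after running it. The sketch below only reads `/proc/meminfo`; `meminfo_kb` is a hypothetical helper name introduced here for illustration, not part of any tool.

```shell
# Print one /proc/meminfo field in kB (for example MemFree or Cached).
# meminfo_kb is a hypothetical helper, defined only for this example.
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2 }' /proc/meminfo
}

echo "MemFree: $(meminfo_kb MemFree) kB"
echo "Cached:  $(meminfo_kb Cached) kB"
```

Run it once before and once after the flush. If the page cache was holding memory, you would expect `Cached` to drop and `MemFree` to rise.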
