Fine-tune with PyTorch
1 HR
Use PyTorch to fine-tune models locally
| Symptom | Cause | Fix |
|---|---|---|
| Cannot access gated repo for URL | Certain Hugging Face models have restricted access | Regenerate your Hugging Face token and request access to the gated model in your web browser |
| Errors and time-outs in multi-Spark runs | Various reasons | Set the following environment variables to enable extra logging and runtime consistency checks: `ACCELERATE_DEBUG_MODE=1`, `ACCELERATE_LOG_LEVEL=DEBUG`, `TORCH_CPP_LOG_LEVEL=INFO`, `TORCH_DISTRIBUTED_DEBUG=DETAIL` |
| task: non-zero exit (255) | Container exited with error code 255 | Run `docker ps -a --filter "name=finetuning-multinode"` to get the container ID, then run `docker logs <container_id>` to see detailed error messages |
| Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? | Docker daemon crash caused by Docker Swarm attempting to bind to a stale or unreachable link-local IP address | (1) Stop Docker: `sudo systemctl stop docker`; (2) Remove Swarm state: `sudo rm -rf /var/lib/docker/swarm`; (3) Restart Docker: `sudo systemctl start docker`; (4) Re-initialize Swarm with a valid advertise address on an active interface |
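For the multi-Spark debugging row, here is a minimal sketch of setting the four variables in a shell session before launching a run. It assumes the training job is started from the same shell and inherits its environment; if the job runs inside a container, the variables must also be passed into that container, which this sketch does not cover.

```shell
# Enable extra logging and runtime consistency checks for a multi-Spark run.
export ACCELERATE_DEBUG_MODE=1
export ACCELERATE_LOG_LEVEL=DEBUG
export TORCH_CPP_LOG_LEVEL=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL

# List the variables a process launched from this shell would inherit.
env | grep -E '^(ACCELERATE_|TORCH_)' | sort
```

These settings make the logs much more verbose, so you may want to unset them once the problem is diagnosed.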
NOTE
DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. Because many applications have not yet been updated to take advantage of UMA, you may encounter memory issues even when usage is within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
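To check whether the flush actually freed memory, you can compare the `MemFree` value in `/proc/meminfo` before and after. The following is a minimal sketch, not part of the official procedure: `meminfo_kb` and `flush_and_report` are hypothetical helper names, and the optional file argument exists only so the parser can be tried against a sample file.

```shell
# Print the value (in kB) of one /proc/meminfo field, e.g. MemFree or Cached.
# $1 = field name, $2 = optional file path (defaults to /proc/meminfo).
meminfo_kb() {
    awk -v key="$1:" '$1 == key {print $2}' "${2:-/proc/meminfo}"
}

# Hypothetical helper: flush the buffer cache and report MemFree before and after.
# Requires sudo; nothing is flushed until this function is called.
flush_and_report() {
    fr_before=$(meminfo_kb MemFree)
    sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
    fr_after=$(meminfo_kb MemFree)
    echo "MemFree before: ${fr_before} kB, after: ${fr_after} kB"
}
```

After sourcing these definitions, run `flush_and_report` to perform the flush and print both readings.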