Nanochat on Dual-Spark

Estimated time: 5 days

Troubleshooting

Common Issues

NCCL timeout or connection errors

Symptom:

RuntimeError: NCCL error in: ...

Cause: Network connectivity issues between nodes, or a firewall blocking the port used for distributed training communication.

Solution:

  • Verify network connectivity between nodes: ping <WORKER_IP>
  • Check firewall rules allow traffic on port 29500: sudo ufw allow 29500/tcp
  • Ensure NCCL_SOCKET_IFNAME matches your network interface (default: enp1s0f0np0)
  • Try disabling InfiniBand: export NCCL_IB_DISABLE=1
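The environment checks above can be collected into a short preamble run on each node before launching training. A minimal sketch, assuming the default interface name and port mentioned above; adjust both for your setup:

```shell
# Assumed defaults from this guide; change NCCL_SOCKET_IFNAME and the port
# to match your own network configuration.
export NCCL_SOCKET_IFNAME=enp1s0f0np0   # NIC carrying inter-node traffic
export NCCL_IB_DISABLE=1                # fall back to TCP if InfiniBand misbehaves
export MASTER_PORT=29500                # must match the firewall rule above

# Warn early if the configured interface does not exist on this node.
if ! ip link show "$NCCL_SOCKET_IFNAME" > /dev/null 2>&1; then
    echo "warning: interface $NCCL_SOCKET_IFNAME not found" >&2
fi
```

Running this on both nodes before `torchrun` starts surfaces interface typos immediately instead of as an opaque NCCL timeout minutes into setup.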

Out of memory (OOM) errors

Symptom:

RuntimeError: CUDA out of memory

Cause: Training or inference is using more memory than available on the GPU.

Solution: Reduce batch size in the training scripts:

# Edit speedrun_spark.sh and change the device_batch_size parameter
--device_batch_size=16  # or 8, 4, 2, 1

You can also reduce the depth parameter:

--depth=18  # instead of 20 for d20 model

Docker container not starting

Symptom: Container fails to start or immediately exits.

Cause: GPU not available, Docker doesn't have GPU access, or another container is using the GPU.

Solution:

  • Check GPU availability: nvidia-smi
  • Ensure no other containers are using the GPU: docker ps
  • Verify Docker has GPU access: docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi
  • Check Docker logs: docker logs nanochat
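The checks above can be chained so the first missing prerequisite is reported before anything else runs. A sketch; `require_cmd` is an illustrative helper, not part of nanochat or Docker:

```shell
# Illustrative helper: succeed only if the named command is installed.
require_cmd() {
    command -v "$1" > /dev/null 2>&1 || { echo "missing: $1" >&2; return 1; }
}

# Run the diagnostics in order, stopping at the first missing prerequisite.
if require_cmd nvidia-smi && require_cmd docker; then
    nvidia-smi > /dev/null || echo "nvidia-smi failed; check the driver" >&2
    docker ps --filter name=nanochat    # is the nanochat container present?
fi
```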

SSH permission denied to worker node

Symptom:

Permission denied (publickey,password)

Cause: SSH key authentication not set up between nodes.

Solution: Set up SSH key authentication:

ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa
ssh-copy-id $USER@<WORKER_IP>

Test the connection:

ssh $USER@<WORKER_IP> "echo 'Connection successful'"

GPU freezes during training

Symptom: GPU becomes unresponsive, training stops, or system becomes unstable.

Cause: An out-of-memory condition has crashed the GPU or left it unresponsive.

Solution: Reboot the system:

# Soft reboot
sudo reboot

# Or power the machine off, wait 30 seconds, then power it back on

After reboot, reduce memory usage by lowering batch size or depth parameter in speedrun_spark.sh.

Disk space full error

Symptom:

OSError: [Errno 28] No space left on device

Cause: The disk is full, so new files cannot be written and basic filesystem operations fail. A common trigger is the large dataset download (~24 GB for FineWeb).

Solution: Free up filesystem space:

# Check disk usage
df -h

# Clear Docker cache
docker system prune -a

# Remove old training checkpoints if not needed
rm -rf ~/.cache/nanochat/old_runs

# Remove downloaded datasets if redownload is acceptable
rm -rf ~/.cache/nanochat/data

To prevent this issue, ensure you have at least 50GB of free disk space before starting training.
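That pre-flight check can be scripted so training never starts on a nearly full disk. A sketch; `has_free_gb` is an illustrative helper, not part of nanochat:

```shell
# Illustrative helper: succeed if the filesystem holding $1 has >= $2 GB free.
has_free_gb() {
    local dir="$1" need_gb="$2"
    # df -P reports available 1K blocks in column 4; convert to whole GB.
    local free_gb=$(( $(df -P "$dir" | awk 'NR==2 {print $4}') / 1024 / 1024 ))
    [ "$free_gb" -ge "$need_gb" ]
}

# Warn before training if less than 50 GB is free on the home filesystem,
# where ~/.cache/nanochat lives.
if ! has_free_gb "$HOME" 50; then
    echo "Less than 50 GB free; clear space before running speedrun_spark.sh" >&2
fi
```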

Training loss not decreasing

Symptom: Training loss remains constant or increases instead of decreasing.

Cause: Learning rate too high, data not loading properly, or distributed training not coordinating correctly.

Solution:

  • Check if data is downloading correctly: ls -lh ~/.cache/nanochat/data/
  • Verify both nodes are participating: Check logs on both host and worker containers
  • Ensure GPUs are being utilized: nvidia-smi
  • Check W&B dashboard (if enabled) for training curves

Web UI not accessible

Symptom: Cannot access the web UI at http://localhost:8000 or http://<HOST_IP>:8000.

Cause: Web server not running, port not forwarded, or firewall blocking access.

Solution:

  • Verify the web server is running: ps aux | grep chat_web
  • Check if port 8000 is open: netstat -tuln | grep 8000
  • If using SSH tunnel, ensure port forwarding is active: ssh -L 8000:localhost:8000 username@<HOST_IP>
  • Check firewall rules: sudo ufw allow 8000/tcp
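The first two checks above can be combined into a quick port probe. A sketch; `check_port` is an illustrative helper that uses bash's /dev/tcp pseudo-device, so run it with bash rather than sh:

```shell
# Illustrative helper: succeed if host:port accepts a TCP connection.
# Relies on bash's /dev/tcp redirection, so this requires bash.
check_port() {
    local host="$1" port="$2"
    (exec 3<> "/dev/tcp/${host}/${port}") 2>/dev/null
}

if check_port localhost 8000; then
    echo "Port 8000 is accepting connections."
else
    echo "Nothing listening on port 8000; is chat_web running?" >&2
fi
```

If the probe succeeds locally but the UI is still unreachable remotely, the problem is the SSH tunnel or firewall rather than the web server itself.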

Getting Additional Help

If you encounter issues not covered here:

  1. Check the nanochat repository issues: https://github.com/karpathy/nanochat/issues
  2. Review Docker container logs: docker logs nanochat
  3. Check system logs: sudo journalctl -xe
  4. Verify system resources: htop and nvidia-smi

Resources

  • DGX Spark Documentation
  • DGX Spark Forum
  • DGX Spark User Performance Guide