Set Up Nanochat on Dual-Spark
To manage containers without sudo, your user must be in the docker group. If you skip this step, you will need to prefix every Docker command with sudo.
Open a new terminal and test Docker access. In the terminal, run:
docker ps
If you see a permission denied error (something like permission denied while trying to connect to the Docker daemon socket), add your user to the docker group so that you don't need to run the command with sudo.
sudo usermod -aG docker $USER
newgrp docker
Before starting, ensure you have:
Test network connectivity and SSH access:
# From host node - replace <WORKER_IP> with your worker node IP
ping <WORKER_IP>
# Test SSH access
ssh $USER@<WORKER_IP> "echo 'Connection successful'"
Set your host and worker IP addresses. You can find your IP address using hostname -I or ip addr show.
export HOST_IP=<HOST_IP>
export WORKER_IP=<WORKER_IP>
NOTE
Replace <HOST_IP> and <WORKER_IP> with your actual IP addresses. Use the IP address of the network interface that will be used for distributed training (default: enp1s0f0np0). To find your network interface, use ip addr show and look for the interface with an active connection.
For training visualization and logging, set up your W&B API key:
export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
export WANDB_RUN=speedrun # Optional, name your run
If you don't have a W&B account, create one at wandb.ai. Without W&B, the training will run but skip logging.
git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-spark-playbooks/nvidia/nanochat-dual-spark/assets
Run the setup script to clone nanochat and build the Docker image on both nodes:
chmod +x setup.sh
sh setup.sh $HOST_IP $WORKER_IP
The setup script will:
- Clone nanochat (pinned to commit c6b7ab744055d5915e6ccb61088de80c10cbaff9)
- Build the nanochat Docker image on both nodes
- Configure the speedrun_spark.sh script for dual-node training

This step can take 10 to 20 minutes depending on network speed and Docker build performance.
Ensure the Docker image was built successfully on both nodes:
# On host
docker images | grep nanochat
# On worker
ssh $USER@$WORKER_IP "docker images | grep nanochat"
You should see the nanochat image listed on both systems.
Start the distributed training across both DGX Spark nodes:
# Make sure environment variables are set
export HOST_IP=<HOST_IP>
export WORKER_IP=<WORKER_IP>
export WANDB_API_KEY=<YOUR_WANDB_API_KEY> # Optional
# Launch training on both nodes
sh launch.sh $HOST_IP $WORKER_IP
The training script will automatically:
Expected duration: ~4 hours for the complete pipeline
NOTE
Training will run in the foreground. Keep the terminal open or use a terminal multiplexer like tmux or screen. The training containers will automatically coordinate using NCCL on port 29500. You can monitor progress by watching the terminal output.
Watch the training output in the terminal where you launched launch.sh. You should see:
If using W&B, monitor your training at:
https://wandb.ai/<your-username>/<your-project>/runs/<run-id>
Track key metrics:
Training checkpoints are automatically saved in ~/.cache/nanochat/:
- model_base.pt: Pretrained base model
- model_mid.pt: After midtraining
- model_sft.pt: Final fine-tuned model
- tokenizer.model: Trained BPE tokenizer

After training completes, a comprehensive report is generated. View it with:
cat nanochat/report.md
The report includes:
Launch the ChatGPT-style web interface:
# Navigate to nanochat directory
cd nanochat
# Activate the virtual environment
source ../.venv/bin/activate
# Start the web server
python -m scripts.chat_web
Access the UI at: http://localhost:8000
NOTE
If you are running this on a remote GPU over SSH, open a new terminal on your local machine and forward the port so you can reach the UI at localhost:8000:
ssh -L 8000:localhost:8000 username@<HOST_IP>
Try these prompts to test your model:
Creative Writing:
Write a short story about two GPUs falling in love
Reasoning:
Why is distributed training important for large language models?
Math:
If I have 2 DGX Spark systems with 1 GPU each, and training takes 4 hours at $3/GPU/hour, what is the total cost?
Code:
Write a Python function to calculate fibonacci numbers
NOTE
The speedrun d20 model (561M params, ~4e19 FLOPs) performs at a kindergarten level and will make mistakes, hallucinate, and occasionally give silly answers. This is expected for micro-models trained on limited compute!
You can also use the CLI for quick interactions:
# Interactive chat mode
python -m scripts.chat_cli
# Single prompt mode
python -m scripts.chat_cli -p "Why is the sky blue?"
# Specify checkpoint (base, mid, or sft)
python -m scripts.chat_cli -i sft -p "Write me a haiku about distributed training"
Follow these steps to completely remove the containers and free up resources.
To stop training early, interrupt both containers:
# From the terminal running launch.sh
Ctrl+C
# Or manually stop containers
docker stop nanochat
ssh $USER@$WORKER_IP "docker stop nanochat"
To free up disk space after training:
# On both nodes - clear training cache
rm -rf ~/.cache/nanochat
# Remove Docker image
docker rmi nanochat
ssh $USER@$WORKER_IP "docker rmi nanochat"
# Clear Docker system (optional)
docker system prune -a
To train a larger model (e.g., d26 with 1.1B parameters):
Edit speedrun_spark.sh:
# Download more data (450 shards for d26)
python -m nanochat.dataset -n 450 &
# Increase depth and reduce batch size to fit in memory
torchrun ... -m scripts.base_train -- --depth=26 --device_batch_size=16
To infuse your model with a custom personality:
{"conversations": [
{"role": "user", "content": "Who are you?"},
{"role": "assistant", "content": "I am YourBot, an AI assistant trained on DGX Spark systems."}
]}
Add these conversations to identity_conversations.jsonl before midtraining; they are consumed by scripts/mid_train.py. See the nanochat customization guide for detailed instructions.