Set Up Nanochat on Dual-Spark
To manage containers without sudo, your user must be in the docker group. If you skip this step, you will need to prefix every Docker command with sudo.
Open a new terminal and test Docker access. In the terminal, run:
docker ps
If you see a permission denied error (something like permission denied while trying to connect to the Docker daemon socket), add your user to the docker group so that you don't need to run the command with sudo.
sudo usermod -aG docker $USER
newgrp docker
Before starting, ensure you have:
Test network connectivity and SSH access:
# From host node - replace <WORKER_IP> with your worker node IP
ping <WORKER_IP>
# Test SSH access
ssh $USER@<WORKER_IP> "echo 'Connection successful'"
Set your host and worker IP addresses. You can find your IP address using hostname -I or ip addr show.
export HOST_IP=<HOST_IP>
export WORKER_IP=<WORKER_IP>
NOTE
Replace <HOST_IP> and <WORKER_IP> with your actual IP addresses. Use the IP address of the network interface that will be used for distributed training (default: enp1s0f0np0). To find your network interface, use ip addr show and look for the interface with an active connection.
For training visualization and logging, set up your W&B API key:
export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
export WANDB_RUN=speedrun # Optional, name your run
If you don't have a W&B account, create one at wandb.ai. Without W&B, the training will run but skip logging.
git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-spark-playbooks/nvidia/nanochat-dual-spark/assets
Run the setup script to clone nanochat and build the Docker image on both nodes:
chmod +x setup.sh
sh setup.sh $HOST_IP $WORKER_IP
The setup script will:
- Clone nanochat (pinned to commit c6b7ab744055d5915e6ccb61088de80c10cbaff9)
- Build the nanochat Docker image on both nodes
- Configure the speedrun_spark.sh script for dual-node training

This step can take 10 to 20 minutes depending on network speed and Docker build performance.
Ensure the Docker image was built successfully on both nodes:
# On host
docker images | grep nanochat
# On worker
ssh $USER@$WORKER_IP "docker images | grep nanochat"
You should see the nanochat image listed on both systems.
Start the distributed training across both DGX Spark nodes:
# Make sure environment variables are set
export HOST_IP=<HOST_IP>
export WORKER_IP=<WORKER_IP>
export WANDB_API_KEY=<YOUR_WANDB_API_KEY> # Optional
# Launch training on both nodes
sh launch.sh $HOST_IP $WORKER_IP
The training script will automatically:
Expected duration: ~4 hours for the complete pipeline
NOTE
Training will run in the foreground. Keep the terminal open or use a terminal multiplexer like tmux or screen. The training containers will automatically coordinate using NCCL on port 29500. You can monitor progress by watching the terminal output.
Watch the training output in the terminal where you launched launch.sh. You should see:
If using W&B, monitor your training at:
https://wandb.ai/<your-username>/<your-project>/runs/<run-id>
Track key metrics:
Training checkpoints are automatically saved in ~/.cache/nanochat/:
- model_base.pt: Pretrained base model
- model_mid.pt: After midtraining
- model_sft.pt: Final fine-tuned model
- tokenizer.model: Trained BPE tokenizer

After training completes, a comprehensive report is generated. View it with:
cat nanochat/report.md
The report includes:
Launch the ChatGPT-style web interface:
# Navigate to nanochat directory
cd nanochat
# Activate the virtual environment
source ../.venv/bin/activate
# Start the web server
python -m scripts.chat_web
Access the UI at: http://localhost:8000
NOTE
If you are running this on a remote GPU over SSH, open a new terminal on your local machine and forward the port so you can reach the UI at localhost:8000:
ssh -L 8000:localhost:8000 username@<HOST_IP>
Try these prompts to test your model:
Creative Writing:
Write a short story about two GPUs falling in love
Reasoning:
Why is distributed training important for large language models?
Math:
If I have 2 DGX Spark systems with 1 GPU each, and training takes 4 hours at $3/GPU/hour, what is the total cost?
Code:
Write a Python function to calculate fibonacci numbers
NOTE
The speedrun d20 model (561M params, ~4e19 FLOPs) performs at a kindergarten level and will make mistakes, hallucinate, and occasionally give silly answers. This is expected for micro-models trained on limited compute!
You can also use the CLI for quick interactions:
# Interactive chat mode
python -m scripts.chat_cli
# Single prompt mode
python -m scripts.chat_cli -p "Why is the sky blue?"
# Specify checkpoint (base, mid, or sft)
python -m scripts.chat_cli -i sft -p "Write me a haiku about distributed training"
Follow these steps to completely remove the containers and free up resources.
To stop training early, interrupt both containers:
# From the terminal running launch.sh
Ctrl+C
# Or manually stop containers
docker stop nanochat
ssh $USER@$WORKER_IP "docker stop nanochat"
To free up disk space after training:
# On both nodes - clear training cache
rm -rf ~/.cache/nanochat
# Remove Docker image
docker rmi nanochat
ssh $USER@$WORKER_IP "docker rmi nanochat"
# Clear Docker system (optional)
docker system prune -a
To train a larger model (e.g., d26 with 1.1B parameters):
Edit speedrun_spark.sh:
# Download more data (450 shards for d26)
python -m nanochat.dataset -n 450 &
# Increase depth and reduce batch size to fit in memory
torchrun ... -m scripts.base_train -- --depth=26 --device_batch_size=16
To infuse your model with a custom personality:
{"conversations": [
{"role": "user", "content": "Who are you?"},
{"role": "assistant", "content": "I am YourBot, an AI assistant trained on DGX Spark systems."}
]}
Add these conversations to identity_conversations.jsonl before midtraining; they are consumed by scripts/mid_train.py. See the nanochat customization guide for detailed instructions.