Copyright © 2026 NVIDIA Corporation

Nanochat on Dual-Spark


Set Up Nanochat on Dual-Spark


Step 1
Configure Docker permissions

To manage containers without sudo, your user must belong to the docker group. If you skip this step, prefix every Docker command in this playbook with sudo.

Open a new terminal and test Docker access. In the terminal, run:

docker ps

If you see a permission denied error (something like permission denied while trying to connect to the Docker daemon socket), add your user to the docker group so that you don't need to run the command with sudo.

sudo usermod -aG docker $USER
newgrp docker  # applies the new group to the current shell only; log out and back in for all sessions
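To double-check that the group change took effect in your current shell, you can inspect your active group list:

```shell
# Confirm the docker group is active in the current shell
if id -nG | tr ' ' '\n' | grep -qx docker; then
  echo "docker group active"
else
  echo "docker group NOT active -- run newgrp docker or log out and back in"
fi
```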

Step 2
Verify prerequisites

Before starting, ensure you have:

  • Two DGX Spark systems with network connectivity
  • SSH access configured between the nodes
  • Docker installed on both systems
  • GPU available on both systems

Test network connectivity and SSH access:

# From host node - replace <WORKER_IP> with your worker node IP
ping <WORKER_IP>

# Test SSH access
ssh $USER@<WORKER_IP> "echo 'Connection successful'"
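If passwordless SSH between the nodes is not yet set up, one common approach (key type and path are standard OpenSSH defaults, not something this playbook mandates) is:

```shell
# Create a key pair if one doesn't exist yet (standard defaults; adjust path/type as needed)
mkdir -p ~/.ssh && chmod 700 ~/.ssh
[ -f ~/.ssh/id_ed25519 ] || ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N "" -q

# Then install the public key on the worker node (prompts for the password once):
#   ssh-copy-id -i ~/.ssh/id_ed25519.pub $USER@<WORKER_IP>
```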

Step 3
Set environment variables

Set your host and worker IP addresses. You can find your IP address using hostname -I or ip addr show.

export HOST_IP=<HOST_IP>
export WORKER_IP=<WORKER_IP>

NOTE

Replace <HOST_IP> and <WORKER_IP> with your actual IP addresses. Use the IP address of the network interface that will be used for distributed training (default: enp1s0f0np0). To find your network interface, use ip addr show and look for the interface with an active connection.
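As a convenience, you can extract the IPv4 address of a given interface directly (a small sketch; the interface name below is the default mentioned above, so substitute your own):

```shell
# Print the IPv4 address of the training interface
IFACE=enp1s0f0np0   # replace with your interface from `ip addr show`
ip -4 addr show "$IFACE" 2>/dev/null | grep -oP '(?<=inet )[\d.]+' | head -n1
```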

Step 4
(Optional) Configure Weights & Biases

For training visualization and logging, set up your W&B API key:

export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
export WANDB_RUN=speedrun  # Optional, name your run

If you don't have a W&B account, create one at wandb.ai. Without W&B, the training will run but skip logging.

Step 5
Clone the repository

git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-spark-playbooks/nvidia/nanochat-dual-spark/assets

Step 6
Run the setup script

Run the setup script to clone nanochat and build the Docker image on both nodes:

chmod +x setup.sh
sh setup.sh $HOST_IP $WORKER_IP

The setup script will:

  • Clone the nanochat repository (specific commit: c6b7ab744055d5915e6ccb61088de80c10cbaff9)
  • Copy the modified speedrun_spark.sh script for dual-node training
  • Build the Docker image on both nodes

This step can take 10 to 20 minutes depending on network speed and Docker build performance.

Step 7
Verify Docker image

Ensure the Docker image was built successfully on both nodes:

# On host
docker images | grep nanochat

# On worker
ssh $USER@$WORKER_IP "docker images | grep nanochat"

You should see the nanochat image listed on both systems.

Step 8
Launch distributed training

Start the distributed training across both DGX Spark nodes:

# Make sure environment variables are set
export HOST_IP=<HOST_IP>
export WORKER_IP=<WORKER_IP>
export WANDB_API_KEY=<YOUR_WANDB_API_KEY>  # Optional

# Launch training on both nodes
sh launch.sh $HOST_IP $WORKER_IP

The training script will automatically:

  1. Download ~24GB of FineWeb pretraining data
  2. Train a BPE tokenizer with 65K vocabulary
  3. Pretrain a 561M parameter Transformer model (d20)
  4. Run midtraining to teach conversation format
  5. Fine-tune with supervised learning (SFT)
  6. Generate evaluation reports

Expected duration: ~4 hours for the complete pipeline

NOTE

Training will run in the foreground. Keep the terminal open or use a terminal multiplexer like tmux or screen. The training containers will automatically coordinate using NCCL on port 29500. You can monitor progress by watching the terminal output.
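If the nodes fail to rendezvous, a quick sanity check is a plain TCP probe of the NCCL port from the worker node. This uses bash's /dev/tcp feature (no extra tools required) and assumes HOST_IP is exported as in Step 3; it is a generic connectivity check, not part of the playbook scripts:

```shell
# Probe the NCCL rendezvous port (29500) on the host node
if timeout 2 bash -c "exec 3<>/dev/tcp/${HOST_IP}/29500"; then
  echo "port 29500 reachable"
else
  echo "port 29500 not reachable"
fi
```

Note that the port only accepts connections while the rendezvous is active, so probe it shortly after launching training.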

Step 9
Monitor training progress

Watch the training output in the terminal where you launched launch.sh. You should see:

  • Tokenizer training progress
  • Data download status
  • Training loss decreasing from ~3.5 to ~2.5
  • Checkpoint saving notifications

If using W&B, monitor your training at:

https://wandb.ai/<your-username>/<your-project>/runs/<run-id>

Track key metrics:

  • Training loss: Should decrease steadily
  • Validation loss: Monitor for overfitting
  • Learning rate: Follows cosine decay schedule
  • Throughput: Tokens processed per second

Training checkpoints are automatically saved in ~/.cache/nanochat/:

  • model_base.pt: Pretrained base model
  • model_mid.pt: After midtraining
  • model_sft.pt: Final fine-tuned model
  • tokenizer.model: Trained BPE tokenizer

Step 10
View training report

After training completes, a comprehensive report is generated. View it with:

cat nanochat/report.md

The report includes:

  • System information and training configuration
  • Training curves and loss plots
  • Evaluation metrics across all benchmarks (CORE, ARC, GSM8K, HumanEval, MMLU)
  • Sample generations at each training stage
  • Total training time and cost breakdown

Step 11
Access the web UI for inference

Launch the ChatGPT-style web interface:

# Navigate to nanochat directory
cd nanochat

# Activate the virtual environment
source ../.venv/bin/activate

# Start the web server
python -m scripts.chat_web

Access the UI at: http://localhost:8000

NOTE

If you are running this on a remote machine over SSH, open a new terminal on your local machine and forward the port so you can reach the UI at localhost:8000:

ssh -L 8000:localhost:8000 username@<HOST_IP>

Step 12
Try out sample prompts

Try these prompts to test your model:

Creative Writing:

Write a short story about two GPUs falling in love

Reasoning:

Why is distributed training important for large language models?

Math:

If I have 2 DGX Spark systems with 1 GPU each, and training takes 4 hours at $3/GPU/hour, what is the total cost?

Code:

Write a Python function to calculate fibonacci numbers

NOTE

The speedrun d20 model (561M params, ~4e19 FLOPs) performs at a kindergarten level and will make mistakes, hallucinate, and occasionally give silly answers. This is expected for micro-models trained on limited compute!

You can also use the CLI for quick interactions:

# Interactive chat mode
python -m scripts.chat_cli

# Single prompt mode
python -m scripts.chat_cli -p "Why is the sky blue?"

# Specify checkpoint (base, mid, or sft)
python -m scripts.chat_cli -i sft -p "Write me a haiku about distributed training"

Step 13
Cleanup and rollback

Follow these steps to stop training, remove the containers and images, and free up disk space.

Stop training early

To stop training early, interrupt both containers:

# From the terminal running launch.sh
Ctrl+C

# Or manually stop containers
docker stop nanochat
ssh $USER@$WORKER_IP "docker stop nanochat"

Clear cache and free disk space

To free up disk space after training:

# On both nodes - clear training cache
rm -rf ~/.cache/nanochat

# Remove Docker image
docker rmi nanochat
ssh $USER@$WORKER_IP "docker rmi nanochat"

# Clear Docker system (optional)
docker system prune -a

Step 14
Next steps

  • Try different prompts with the trained model
  • Experiment with training larger models (d26 with 1.1B parameters for ~12 hours)
  • Customize model personality by modifying identity conversations
  • Evaluate model on additional benchmarks
  • Fine-tune on domain-specific datasets

Training Larger Models

To train a larger model (e.g., d26 with 1.1B parameters):

  1. Modify speedrun_spark.sh:
# Download more data (450 shards for d26)
python -m nanochat.dataset -n 450 &

# Increase depth and reduce batch size to fit in memory
torchrun ... -m scripts.base_train -- --depth=26 --device_batch_size=16
  2. Training time and cost:
    • d26: ~12 hours, ~$300
    • d32: ~33 hours, ~$800

Customizing Personality

To infuse your model with a custom personality:

  1. Create identity conversations in JSONL format:
{"conversations": [{"role": "user", "content": "Who are you?"}, {"role": "assistant", "content": "I am YourBot, an AI assistant trained on DGX Spark systems."}]}
  2. Replace identity_conversations.jsonl before midtraining
  3. Adjust the mixing ratio in scripts/mid_train.py
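Since JSONL requires one JSON object per line, a quick way to generate a starter file is a heredoc (the records below are placeholders to edit, not the playbook's defaults):

```shell
# Write two example identity records, one JSON object per line
cat > identity_conversations.jsonl <<'EOF'
{"conversations": [{"role": "user", "content": "Who are you?"}, {"role": "assistant", "content": "I am YourBot, an AI assistant trained on DGX Spark systems."}]}
{"conversations": [{"role": "user", "content": "Who made you?"}, {"role": "assistant", "content": "I was fine-tuned with the nanochat pipeline on two DGX Spark nodes."}]}
EOF

wc -l identity_conversations.jsonl
```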

See the nanochat customization guide for detailed instructions.

Resources

  • DGX Spark Documentation
  • DGX Spark Forum
  • DGX Spark User Performance Guide