Nanochat on Dual-Spark

5 days

Set up Nanochat on Dual-Spark

Basic idea

This playbook shows you how to run Andrej Karpathy’s Nanochat on Spark. Nanochat is popularly described as “the best ChatGPT that $100 can buy.” This playbook makes it possible to train and run Nanochat locally on your dual-Spark setup.

What you'll accomplish

You’ll set up a local, end-to-end ChatGPT-like training pipeline, including pre-training, mid-training, post-training, and optional reinforcement learning. You will also be able to chat with your model through a simple web UI.
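The stages above map onto Nanochat's training scripts roughly as follows. This is a sketch, not the exact contents of speedrun.sh: the script module names follow the upstream Nanochat repository at the time of writing and may change, and the commands are only printed here, not executed.

```shell
# Dry-run overview of the Nanochat pipeline stages (assumed script names
# from the upstream repo; speedrun_spark.sh runs the real thing).
STAGES="torchrun --standalone --nproc_per_node=1 -m scripts.base_train  # pre-training
torchrun --standalone --nproc_per_node=1 -m scripts.mid_train           # mid-training
torchrun --standalone --nproc_per_node=1 -m scripts.chat_sft            # post-training (SFT)
torchrun --standalone --nproc_per_node=1 -m scripts.chat_rl             # optional RL"

# Print each stage in order instead of running it.
printf '%s\n' "$STAGES"
```

Each stage reads the checkpoint produced by the previous one, which is why the stages must run in this order.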

What to know before starting

  • Working with Docker containers and GPU passthrough
  • Command-line tools for GPU workloads
  • Basic understanding of training foundation LLMs

Prerequisites

  • Dual-Spark setup with QSFP cable
  • Docker installed and accessible to current user
  • NVIDIA Container Runtime configured
  • Hugging Face token and WandB API key
  • Verify GPU access: nvidia-smi
  • Check Docker GPU integration: docker run --rm --gpus all nvcr.io/nvidia/pytorch:25.11-py3 nvidia-smi

Ancillary files

The reference training scripts can be found in the Nanochat repository on GitHub.

  • Dockerfile - Builds a custom Docker image to set up the environment
  • setup.sh - Sets up the Docker image on both Spark machines
  • speedrun_spark.sh - Modified version of speedrun.sh that supports distributed training on dual-Spark
  • launch.sh - Launches the Nanochat training on both Spark machines
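As an illustration, a distributed launch across two machines typically boils down to running the same torchrun command on each Spark with a different node rank. The sketch below only assembles and prints such a command; the master address, port, and script name are placeholder assumptions, not values taken from launch.sh.

```shell
# Hypothetical sketch of a two-node torchrun launch. Run with NODE_RANK=0
# on the first Spark and NODE_RANK=1 on the second. All values below are
# placeholders, not the actual launch.sh configuration.
MASTER_ADDR="192.168.100.1"   # assumed: first Spark's IP over the QSFP link
NODE_RANK="${NODE_RANK:-0}"

CMD="torchrun --nnodes=2 --node_rank=$NODE_RANK --master_addr=$MASTER_ADDR --master_port=29500 --nproc_per_node=1 -m scripts.base_train"

# Dry run: print the command instead of executing it.
echo "$CMD"
```

Both nodes must agree on the master address and port; only the node rank differs between the two machines.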

Time & risk

  • Duration: Up to 5 days, depending on model size and the number of training stages.

  • Risks:

    • Model instantiation and training are memory-intensive
    • Modifying hyperparameters such as batch size, model dimensions, or precision settings can increase memory usage and may cause out-of-memory (OOM) errors
    • Downloading large datasets and storing trained checkpoints can consume significant disk space
  • Rollback:

    • Delete the downloaded dataset and checkpoints from $HOME/.cache/nanochat
    • Then exit the container environment
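The rollback steps above can be scripted as follows. The cache path comes from the steps above; everything else is a straightforward guarded delete.

```shell
# Remove Nanochat's downloaded datasets and checkpoints.
CACHE_DIR="$HOME/.cache/nanochat"

if [ -d "$CACHE_DIR" ]; then
    rm -rf "$CACHE_DIR"
    echo "Removed $CACHE_DIR"
else
    echo "Nothing to remove at $CACHE_DIR"
fi

# Then leave the container shell:
# exit
```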