Nanochat on Dual-Spark
5 days
Set up Nanochat on Dual-Spark
Basic idea
This playbook shows you how to run Andrej Karpathy’s Nanochat on Spark. Nanochat has been popularized as “the best ChatGPT that $100 can buy.” This playbook makes it possible to train and run Nanochat locally on your dual-Spark setup.
What you'll accomplish
You’ll set up a local, end-to-end ChatGPT-like training pipeline, including pre-training, mid-training, post-training, and optional reinforcement learning. You will also be able to chat with your model through a simple web UI.
What to know before starting
- Working with Docker containers and GPU passthrough
- Command-line tools for GPU workloads
- Basic understanding of training foundation LLMs
Prerequisites
- Dual-Spark setup with QSFP cable
- Docker installed and accessible to current user
- NVIDIA Container Runtime configured
- Hugging Face token and WandB API key
- Verify GPU access:
  nvidia-smi
- Check Docker GPU integration:
  docker run --rm --gpus all nvcr.io/nvidia/pytorch:25.11-py3 nvidia-smi
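Since distributed training depends on the two Sparks reaching each other over the QSFP link, it is also worth verifying basic connectivity up front. A minimal sketch, assuming a hypothetical peer address of 192.168.100.2 (substitute whatever addressing your link actually uses):

```shell
#!/usr/bin/env bash
# Sketch: confirm the second Spark is reachable before a distributed run.
# The peer IP used in the example below is hypothetical; replace it with
# the address assigned to the other machine's QSFP interface.

check_peer() {
  local peer="$1"
  if ping -c 2 -W 2 "$peer" >/dev/null 2>&1; then
    echo "Peer $peer is reachable"
  else
    echo "Peer $peer is NOT reachable -- check the QSFP link" >&2
    return 1
  fi
}

# Example: check_peer 192.168.100.2
```

Run `check_peer <peer-ip>` from each machine so you know the link works in both directions before starting a multi-day run.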
Ancillary files
The reference training scripts can be found in the Nanochat repository on GitHub:
- Dockerfile - Build the custom Docker image that sets up the environment
- setup.sh - Set up the Docker image on both Spark machines
- speedrun_spark.sh - Modified version of speedrun.sh to support distributed training on dual-Spark
- launch.sh - Launch the Nanochat training on both Spark machines
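To give a sense of what a dual-node launch involves, here is a hedged sketch of a two-node torchrun invocation. This is not the actual launch.sh: the node ranks, master address, port, one-GPU-per-Spark assumption, and the scripts.base_train entry point are all assumptions to confirm against the scripts in the repository.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a dual-node launch; NOT the actual launch.sh.
# Assumptions: one GPU per Spark, node 0 is the rendezvous master, and the
# training entry point is scripts.base_train (as in nanochat's speedrun.sh).

build_launch_cmd() {
  local node_rank="$1"    # 0 on the master Spark, 1 on the other
  local master_addr="$2"  # IP of node 0 over the QSFP link
  echo "torchrun --nnodes=2 --nproc_per_node=1 --node_rank=${node_rank}" \
       "--master_addr=${master_addr} --master_port=29500 -m scripts.base_train"
}

# On the first Spark:  eval "$(build_launch_cmd 0 192.168.100.1)"
# On the second Spark: eval "$(build_launch_cmd 1 192.168.100.1)"
build_launch_cmd 0 192.168.100.1
```

Both machines must agree on `--master_addr` and `--master_port`, and each must pass its own `--node_rank`; the provided launch.sh handles this coordination for you.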
Time & risk
- Duration: Up to 5 days, depending on model size and the number of training stages.
- Risks:
  - Model instantiation and training are memory-intensive.
  - Modifying hyperparameters such as batch size, model dimensions, or precision settings can increase memory usage and may result in out-of-memory (OOM) errors.
  - Downloading large datasets and storing trained checkpoints consumes significant disk space.
- Rollback:
  - Delete the downloaded dataset and checkpoints from $HOME/.cache/nanochat
  - Then exit the container environment
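The rollback steps can be sketched as a small cleanup script. The cache path comes from the steps above; the directory argument is only there so the function is easy to point elsewhere if your cache lives in a non-default location.

```shell
#!/usr/bin/env bash
# Sketch of the rollback: remove Nanochat's downloaded data and checkpoints.

cleanup_nanochat() {
  local cache_dir="${1:-$HOME/.cache/nanochat}"
  if [ -d "$cache_dir" ]; then
    du -sh "$cache_dir" 2>/dev/null   # show how much space is reclaimed
    rm -rf "$cache_dir"
    echo "Removed $cache_dir"
  else
    echo "Nothing to clean: $cache_dir not found"
  fi
}

# Example: cleanup_nanochat    # uses the default $HOME/.cache/nanochat
```

After the cleanup completes, type `exit` to leave the container environment.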