Nanochat on Dual-Spark

5 days

Set up Nanochat on Dual-Spark

Basic idea

This playbook shows you how to run Andrej Karpathy’s Nanochat on Spark. Nanochat is popularly described as “the best ChatGPT that $100 can buy.” This playbook makes it possible to train and run Nanochat locally on your dual-Spark setup.

What you'll accomplish

You’ll set up a local, end-to-end ChatGPT-like training pipeline, including pre-training, mid-training, post-training, and optional reinforcement learning. You will also be able to chat with your model through a simple web UI.
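The stages above map onto Nanochat's training scripts roughly as follows. This is a sketch, not the exact contents of speedrun.sh: the script module names follow the upstream Nanochat repository at the time of writing and may change, and the commands are only printed here, not executed.

```shell
# Dry-run overview of the Nanochat pipeline stages (assumed script names
# from the upstream repo; speedrun_spark.sh runs the real thing).
STAGES="torchrun --standalone --nproc_per_node=1 -m scripts.base_train  # pre-training
torchrun --standalone --nproc_per_node=1 -m scripts.mid_train           # mid-training
torchrun --standalone --nproc_per_node=1 -m scripts.chat_sft            # post-training (SFT)
torchrun --standalone --nproc_per_node=1 -m scripts.chat_rl             # optional RL"

# Print each stage in order instead of running it.
printf '%s\n' "$STAGES"
```

Each stage reads the checkpoint produced by the previous one, which is why the stages must run in this order.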

What to know before starting

  • Working with Docker containers and GPU passthrough
  • Command-line tools for GPU workloads
  • Basic understanding of training foundation LLMs

Prerequisites

  • Dual-Spark setup with QSFP cable
  • Docker installed and accessible to current user
  • NVIDIA Container Runtime configured
  • Hugging Face token and WandB API key
  • Verify GPU access: nvidia-smi
  • Check Docker GPU integration: docker run --rm --gpus all nvcr.io/nvidia/pytorch:25.11-py3 nvidia-smi

Ancillary files

The reference training scripts can be found in the Nanochat repository on GitHub.

  • Dockerfile - Builds a custom Docker image to set up the environment
  • setup.sh - Sets up the Docker image on both Spark machines
  • speedrun_spark.sh - Modified version of speedrun.sh that supports distributed training on dual-Spark
  • launch.sh - Launches the Nanochat training on both Spark machines
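As an illustration, a distributed launch across two machines typically boils down to running the same torchrun command on each Spark with a different node rank. The sketch below only assembles and prints such a command; the master address, port, and script name are placeholder assumptions, not values taken from launch.sh.

```shell
# Hypothetical sketch of a two-node torchrun launch. Run with NODE_RANK=0
# on the first Spark and NODE_RANK=1 on the second. All values below are
# placeholders, not the actual launch.sh configuration.
MASTER_ADDR="192.168.100.1"   # assumed: first Spark's IP over the QSFP link
NODE_RANK="${NODE_RANK:-0}"

CMD="torchrun --nnodes=2 --node_rank=$NODE_RANK --master_addr=$MASTER_ADDR --master_port=29500 --nproc_per_node=1 -m scripts.base_train"

# Dry run: print the command instead of executing it.
echo "$CMD"
```

Both nodes must agree on the master address and port; only the node rank differs between the two machines.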

Time & risk

  • Duration: Up to 5 days, depending on model size and the number of training stages.

  • Risks:

    • Model instantiation and training are memory-intensive
    • Modifying hyperparameters such as batch size, model dimensions, or precision settings can increase memory usage and may cause out-of-memory (OOM) errors
    • Downloading large datasets and storing trained checkpoints can consume significant disk space
  • Rollback:

    • Delete the downloaded dataset and checkpoints from $HOME/.cache/nanochat
    • Then exit the container environment
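The rollback steps above can be scripted as follows. The cache path comes from the steps above; everything else is a straightforward guarded delete.

```shell
# Remove Nanochat's downloaded datasets and checkpoints.
CACHE_DIR="$HOME/.cache/nanochat"

if [ -d "$CACHE_DIR" ]; then
    rm -rf "$CACHE_DIR"
    echo "Removed $CACHE_DIR"
else
    echo "Nothing to remove at $CACHE_DIR"
fi

# Then leave the container shell:
# exit
```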