Nanochat Training

Basic idea

This playbook shows how to train nanochat on a DGX Station with the GB300 Ultra Superchip. You run the full pipeline on a single system: custom BPE tokenizer training, base model pretraining, supervised fine-tuning (SFT), and inference via CLI or web UI.

The project uses the PyTorch NGC container, FineWeb for pretraining data, SmolTalk for SFT, and Weights & Biases for logging. The default configuration trains a ~1B parameter (d24) Transformer model with FP8 precision.
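The "~1B parameter" figure can be sanity-checked from the d24 dimensions listed under Model architecture (24 layers, 12 heads, head dimension 128, 65,536-token vocabulary). The sketch below is a back-of-envelope estimate only. It assumes a standard Transformer block (4·d² attention weights plus 8·d² MLP weights) and untied embedding and output matrices; the real nanochat model may differ in details, and `estimate_params` is a hypothetical helper, not part of nanochat.

```python
# Rough parameter estimate for the d24 configuration.
# ASSUMPTIONS (not taken from the nanochat source):
#   - each block holds 4*d^2 attention weights and 8*d^2 MLP weights (4x expansion)
#   - input embedding and output projection are separate (untied) matrices
#   - norms, biases, and any extra embeddings are ignored


def estimate_params(n_layer: int, n_head: int, head_dim: int, vocab_size: int) -> dict:
    d_model = n_head * head_dim            # model width
    per_layer = 12 * d_model ** 2          # attention (4*d^2) + MLP (8*d^2)
    blocks = n_layer * per_layer
    embeddings = 2 * vocab_size * d_model  # token embedding + output head
    return {
        "d_model": d_model,
        "blocks": blocks,
        "embeddings": embeddings,
        "total": blocks + embeddings,
    }


if __name__ == "__main__":
    est = estimate_params(n_layer=24, n_head=12, head_dim=128, vocab_size=65536)
    print(f"model width: {est['d_model']}")
    print(f"total parameters: {est['total'] / 1e9:.2f}B")
```

Under these assumptions the estimate lands a little under one billion parameters, which is consistent with the "~1B" description.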

What you'll accomplish

Environment: Docker image with PyTorch NGC and nanochat dependencies on your DGX Station.
Training pipeline: BPE tokenizer (65K vocab), base model pretraining with FP8, SFT, and automated report generation.
Inference: ChatGPT-style web UI and CLI to chat with the base or SFT checkpoints.
Monitoring: W&B dashboards and nanochat_cache/report/report.md with metrics and samples.

What to know before starting

Basic Linux command line and shell usage.
Familiarity with Docker and GPU containers (e.g. docker run --gpus all).
Optional: understanding of LLM training (tokenizer, pretraining, fine-tuning).

Prerequisites

Hardware:

NVIDIA DGX Station with GB300 Ultra Superchip (288GB VRAM).
Adequate storage for cache (~25GB+ for FineWeb data and checkpoints).

Software:

Docker with NVIDIA Container Toolkit. Verify GPU access from a container with: docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi
Network access to download datasets (Hugging Face, FineWeb) and container images (nvcr.io).
Weights & Biases account and API key.
Hugging Face token for evaluation datasets.
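Before starting, the prerequisites above can be checked with a small pre-flight script. This is a minimal sketch, not part of the playbook assets. It assumes the keys are supplied through the conventional `WANDB_API_KEY` and `HF_TOKEN` environment variables (the names that launch.sh actually reads are not stated here), and it uses the ~25GB storage figure from the hardware list as its threshold.

```python
import os
import shutil

# ASSUMPTION: conventional variable names for W&B and Hugging Face credentials;
# check launch.sh for the names it actually expects.
REQUIRED_ENV = ("WANDB_API_KEY", "HF_TOKEN")
MIN_FREE_GB = 25  # from the "~25GB+" storage guidance


def missing_env(env, required=REQUIRED_ENV):
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not env.get(name)]


def free_gb(path="."):
    """Free disk space at `path`, in gigabytes."""
    return shutil.disk_usage(path).free / 1e9


def preflight(env=os.environ, path="."):
    """Collect human-readable problems; an empty list means all checks passed."""
    problems = [f"environment variable {name} is not set" for name in missing_env(env)]
    if free_gb(path) < MIN_FREE_GB:
        problems.append(f"less than {MIN_FREE_GB} GB free at {os.path.abspath(path)}")
    if shutil.which("docker") is None:
        problems.append("docker was not found on PATH")
    return problems


if __name__ == "__main__":
    issues = preflight()
    for issue in issues:
        print("WARNING:", issue)
    if not issues:
        print("Pre-flight checks passed.")
```

Run it from the directory where the cache will live so the disk check measures the right filesystem.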

Model architecture (d24)

Layers: 24
Attention Heads: 12
Head Dimension: 128
Context Length: 2048 tokens
Vocabulary Size: 65,536 (2^16, trained BPE)
Precision: FP8 (e4m3, tensorwise scaling)
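To illustrate what "e4m3, tensorwise scaling" means: e4m3 is an 8-bit float with 4 exponent bits and 3 mantissa bits, and in the common finite-only variant its largest representable value is 448. Tensorwise scaling picks one scale factor per tensor so that the tensor's largest magnitude maps to that maximum. The sketch below simulates this in pure Python for clarity. It is not the kernel path used during training, and the function names are hypothetical.

```python
import math

E4M3_MAX = 448.0  # largest finite value of the finite-only e4m3 variant


def round_to_e4m3(v: float) -> float:
    """Round a value to the nearest number representable with 3 mantissa bits."""
    if v == 0.0:
        return 0.0
    v = max(-E4M3_MAX, min(E4M3_MAX, v))  # saturate instead of overflowing
    _, e = math.frexp(abs(v))             # abs(v) = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)                  # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)               # spacing between neighbouring values
    return round(v / step) * step


def quantize_tensorwise(values):
    """Scale a whole tensor by one factor, then round every element to e4m3."""
    amax = max(abs(v) for v in values)
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    return [round_to_e4m3(v * scale) for v in values], scale


def dequantize(quantized, scale):
    """Undo the scaling to recover approximate original values."""
    return [q / scale for q in quantized]


if __name__ == "__main__":
    weights = [0.01, -0.5, 2.0]
    q, scale = quantize_tensorwise(weights)
    print("scale:", scale)
    print("quantized:", q)
    print("recovered:", dequantize(q, scale))
```

A single scale per tensor is cheap to compute and store; the trade-off is that one outlier value stretches the range for every other element in that tensor.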

Training stages

Stage	Description
Tokenizer	Trains BPE tokenizer (65K vocab) on ~2B characters from FineWeb
Base pretraining	Pretrains d24 model on FineWeb with FP8, at a target data ratio of 8 (training tokens per model parameter)
SFT	Fine-tunes on synthetic identity conversations + SmolTalk
Report	Generates `report.md` with metrics, samples, and system info
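The data ratio in the base pretraining stage can be turned into a rough token budget. The sketch below assumes the ratio means training tokens per model parameter and uses a round 1B parameter count; exactly which parameters nanochat counts toward the ratio is not stated here, so treat the result as an order-of-magnitude figure. `token_budget` is a hypothetical helper.

```python
# Rough pretraining token budget from the target data ratio.
# ASSUMPTION: ratio = training tokens per model parameter, applied to a
# round parameter count of 1B; the real accounting may differ.


def token_budget(n_params: int, data_ratio: float = 8, seq_len: int = 2048):
    """Return (total training tokens, number of full-length sequences)."""
    tokens = int(n_params * data_ratio)
    sequences = tokens // seq_len
    return tokens, sequences


if __name__ == "__main__":
    tokens, sequences = token_budget(n_params=1_000_000_000)
    print(f"training tokens: {tokens / 1e9:.1f}B")
    print(f"sequences of 2048 tokens: {sequences:,}")
```

With these inputs the budget works out to about 8B tokens, or roughly 3.9 million sequences at the 2048-token context length.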

Ancillary files

All required assets are in nvidia/station-nanochat/assets/:

Dockerfile – PyTorch NGC image with nanochat pip dependencies.
setup.sh – Clones nanochat, checks out the supported commit, copies speedrun_station.sh, and builds the Docker image.
launch.sh – Runs the training container (full pipeline: tokenizer → pretrain → SFT → report).
speedrun_station.sh – Modified speedrun script adapted for single-GPU DGX Station.

Time & risk

Estimated time: ~30 minutes for setup. Full d24 training takes roughly 12 hours or more on a single GB300 Ultra.
Risk level: Medium
- Large downloads (FineWeb) can be slow; ensure stable network and disk space.
- API keys (W&B, HF) must be set or launch.sh will exit immediately.
Rollback: Stop containers with docker stop, remove caches, and, if needed, run docker system prune -a (note that this deletes all unused images, not only the nanochat image).

Credits

nanochat by Andrej Karpathy
FineWeb by Hugging Face (pretraining data)
SmolTalk by Hugging Face (SFT data)