Nanochat Training
30 MIN
Train a small ChatGPT-style LLM (nanochat) with tokenizer, pretraining, midtraining, and SFT on DGX Station with GB300 Ultra
Basic idea
This playbook demonstrates training of nanochat on DGX Station with the GB300 Ultra Superchip. You run the full pipeline on a single system: custom BPE tokenizer training, base model pretraining, supervised fine-tuning (SFT), and inference via CLI or web UI.
The project uses the PyTorch NGC container, FineWeb for pretraining data, SmolTalk for SFT, and Weights & Biases for logging. The default configuration trains a ~1B parameter (d24) Transformer model with FP8 precision.
What you'll accomplish
- Environment: Docker image with PyTorch NGC and nanochat dependencies on your DGX Station.
- Training pipeline: BPE tokenizer (65K vocab), base model pretraining with FP8, SFT, and automated report generation.
- Inference: ChatGPT-style web UI and CLI to chat with the base or SFT checkpoints.
- Monitoring: W&B dashboards and `nanochat_cache/report/report.md` with metrics and samples.
What to know before starting
- Basic Linux command line and shell usage.
- Familiarity with Docker and GPU containers (e.g. `docker run --gpus all`).
- Optional: understanding of LLM training (tokenizer, pretraining, fine-tuning).
Prerequisites
Hardware:
- NVIDIA DGX Station with GB300 Ultra Superchip (288GB VRAM).
- Adequate storage for cache (~25GB+ for FineWeb data and checkpoints).
Software:
- Docker with NVIDIA Container Toolkit: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi`
- Network access to download datasets (Hugging Face, FineWeb) and container images (nvcr.io).
- Weights & Biases account and API key.
- Hugging Face token for evaluation datasets.
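The storage and credential prerequisites above can be checked before launching anything. The following is a minimal sketch, not part of the playbook assets: the helper name `preflight` is hypothetical, and `WANDB_API_KEY` and `HF_TOKEN` are assumed to be the variable names in use (they are the names the W&B and Hugging Face client libraries conventionally read, but the playbook does not state which names its scripts check). The 25 GB threshold comes from the storage estimate above.

```python
import os
import shutil


def preflight(cache_dir, env, min_free_gb=25,
              required_keys=("WANDB_API_KEY", "HF_TOKEN")):
    """Return a list of human-readable problems; an empty list means OK.

    required_keys are ASSUMED variable names, not confirmed by the playbook.
    """
    problems = []
    # Credentials: treat unset and empty values the same way.
    for key in required_keys:
        if not env.get(key):
            problems.append(f"missing environment variable: {key}")
    # Storage: compare free space on the cache filesystem to the threshold.
    free_gb = shutil.disk_usage(cache_dir).free / 1e9
    if free_gb < min_free_gb:
        problems.append(
            f"only {free_gb:.1f} GB free in {cache_dir}, want >= {min_free_gb} GB"
        )
    return problems


# Check the current directory and the real environment.
for problem in preflight(os.getcwd(), os.environ):
    print("WARNING:", problem)
```

Passing the environment in as a plain mapping keeps the check easy to exercise with a fake dictionary instead of real credentials.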
Model architecture (d24)
- Layers: 24
- Attention Heads: 12
- Head Dimension: 128
- Context Length: 2048 tokens
- Vocabulary Size: 65,536 (2^16, trained BPE)
- Precision: FP8 (e4m3, tensorwise scaling)
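As a sanity check on the "~1B parameter" figure, the architecture values above can be turned into a rough parameter count. This is a back-of-the-envelope sketch under assumptions the playbook does not state: model width equals heads times head dimension, the MLP uses a 4x expansion, there are no bias terms, and input and output embeddings are untied. The function name `estimate_params` is hypothetical, and the actual nanochat model may include components not counted here.

```python
def estimate_params(n_layers, n_heads, head_dim, vocab_size):
    """Rough Transformer parameter count (weights only, ASSUMED layout)."""
    d_model = n_heads * head_dim              # assumed: width = heads * head_dim
    attn = 4 * d_model * d_model              # Q, K, V, and output projections
    mlp = 2 * d_model * (4 * d_model)         # assumed 4x up and down projections
    blocks = n_layers * (attn + mlp)
    embeddings = 2 * vocab_size * d_model     # assumed untied embedding + output head
    return d_model, blocks, embeddings, blocks + embeddings


d_model, blocks, embeddings, total = estimate_params(24, 12, 128, 65_536)
print(f"d_model={d_model}")
print(f"blocks={blocks / 1e6:.0f}M embeddings={embeddings / 1e6:.0f}M "
      f"total={total / 1e6:.0f}M")
# prints: d_model=1536
# prints: blocks=679M embeddings=201M total=881M
```

Under these assumptions the estimate comes out near 0.88B parameters, which is the same order as the ~1B quoted for d24.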
Training stages
| Stage | Description |
|---|---|
| Tokenizer | Trains BPE tokenizer (65K vocab) on ~2B characters from FineWeb |
| Base pretraining | Pretrains d24 model on FineWeb with FP8, target data ratio of 8 |
| SFT | Fine-tunes on synthetic identity conversations + SmolTalk |
| Report | Generates report.md with metrics, samples, and system info |
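The "target data ratio of 8" in the pretraining row can be read as a token budget. The sketch below assumes the ratio means training tokens per model parameter, which the playbook does not spell out, and uses the nominal ~1B parameter figure rather than an exact count. The function name `token_budget` is hypothetical.

```python
def token_budget(n_params, data_ratio, context_len):
    """Training tokens implied by a tokens-per-parameter ratio (ASSUMED meaning)."""
    tokens = n_params * data_ratio
    # Number of full context-length sequences that budget corresponds to.
    sequences = tokens // context_len
    return tokens, sequences


# Nominal ~1B parameters, ratio 8, 2048-token context (values from this playbook).
tokens, sequences = token_budget(1_000_000_000, 8, 2048)
print(f"tokens={tokens / 1e9:.0f}B sequences={sequences:,}")
# prints: tokens=8B sequences=3,906,250
```

If the ratio is defined differently in the training script (for example, counting only some parameter groups), the budget shifts accordingly.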
Ancillary files
All required assets are in nvidia/station-nanochat/assets/:
- `Dockerfile` – PyTorch NGC image with nanochat pip dependencies.
- `setup.sh` – Clones nanochat, checks out the supported commit, copies `speedrun_station.sh`, and builds the Docker image.
- `launch.sh` – Runs the training container (full pipeline: tokenizer → pretrain → SFT → report).
- `speedrun_station.sh` – Modified speedrun script adapted for single-GPU DGX Station.
Time & risk
- Estimated time: ~30 minutes for setup. Full d24 training takes on the order of 12+ hours on a single GB300 Ultra.
- Risk level: Medium
- Large downloads (FineWeb) can be slow; ensure stable network and disk space.
- API keys (W&B, HF) must be set or `launch.sh` will exit immediately.
- Rollback: Stop containers with `docker stop`, remove caches, and run `docker system prune -a` if needed.