Train a small ChatGPT-style LLM (nanochat) with tokenizer, pretraining, midtraining, and SFT on DGX Station with GB300 Ultra
This playbook shows how to train nanochat on DGX Station with the GB300 Ultra Superchip. The full pipeline runs on a single system: custom BPE tokenizer training, base model pretraining, supervised fine-tuning (SFT), and inference via CLI or web UI.
The project uses the PyTorch NGC container, FineWeb for pretraining data, SmolTalk for SFT, and Weights & Biases for logging. The default configuration trains a ~1B-parameter (d24) Transformer model in FP8 precision.
nanochat_cache/report/report.md with metrics and samples.
docker run --gpus all).
Hardware:
Software:
docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi
Layers: 24
Attention Heads: 12
Head Dimension: 128
Context Length: 2048 tokens
Vocabulary Size: 65,536 (2^16, trained BPE)
Precision: FP8 (e4m3, tensorwise scaling)
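To make "tensorwise scaling" concrete, the following is a minimal pure-Python sketch of the idea, not nanochat's or PyTorch's actual implementation. It assumes the common e4m3 variant whose largest finite value is 448, and it only computes and applies the single per-tensor scale factor; rounding to 8 bits is not simulated. The helper names `tensorwise_scale` and `scale_for_fp8` are hypothetical.

```python
# Assumption: e4m3 here means the variant with a maximum finite value of 448.
E4M3_MAX = 448.0


def tensorwise_scale(values):
    """Return one scale factor that maps the largest |value| onto E4M3_MAX."""
    amax = max(abs(v) for v in values)
    return E4M3_MAX / amax if amax > 0 else 1.0


def scale_for_fp8(values):
    """Scale a whole tensor with a single factor and clamp to the e4m3 range.

    Hypothetical helper: the cast to 8 bits (rounding) is deliberately omitted.
    """
    scale = tensorwise_scale(values)
    scaled = [max(-E4M3_MAX, min(E4M3_MAX, v * scale)) for v in values]
    return scaled, scale


# Toy "tensor" of weights; the largest magnitude is 0.5.
weights = [0.02, -0.5, 0.125, 0.0031]
scaled, scale = scale_for_fp8(weights)

# Dividing by the same scale recovers the original magnitudes.
restored = [s / scale for s in scaled]

print("scale   :", scale)
print("scaled  :", scaled)
print("restored:", restored)
```

Using one scale per tensor keeps the bookkeeping simple, at the cost of losing resolution for small entries when a single outlier dominates the maximum.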
| Stage | Description |
|---|---|
| Tokenizer | Trains BPE tokenizer (65K vocab) on ~2B characters from FineWeb |
| Base pretraining | Pretrains d24 model on FineWeb with FP8, target data ratio of 8 |
| SFT | Fine-tunes on synthetic identity conversations + SmolTalk |
| Report | Generates report.md with metrics, samples, and system info |
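As a rough sanity check on the d24 configuration and the data ratio in the table, the sketch below estimates the parameter count and the pretraining token budget. It relies on several assumptions that are not stated in this playbook: the model width equals heads × head dimension, each layer holds about 12·d² weights (four attention projections plus a 4x-wide MLP, no biases), the input embedding and output projection are untied, and "data ratio" means training tokens per parameter. The exact figures depend on nanochat's actual architecture, so treat the result as a ballpark only.

```python
# Values taken from the model configuration and the stage table.
n_layers = 24
n_heads = 12
head_dim = 128
vocab_size = 65_536
data_ratio = 8  # assumption: training tokens per model parameter

# Assumption: model width is heads * head dimension.
model_dim = n_heads * head_dim  # 1536

# Assumption: bias-free blocks with Q, K, V, and output projections (4*d^2)
# plus a 4x-wide MLP (8*d^2), giving 12*d^2 weights per layer.
params_per_layer = 12 * model_dim**2
block_params = n_layers * params_per_layer

# Assumption: separate (untied) input embedding and output projection.
embedding_params = 2 * vocab_size * model_dim

total_params = block_params + embedding_params
target_tokens = data_ratio * total_params

print(f"model_dim        = {model_dim}")
print(f"total parameters = {total_params / 1e6:.1f}M")
print(f"target tokens    = {target_tokens / 1e9:.2f}B")
```

Under these assumptions the estimate lands somewhat below the ~1B figure quoted for d24, which is expected for a count that ignores any architecture-specific extras.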
All required assets are in nvidia/station-nanochat/assets/:
Dockerfile – PyTorch NGC image with nanochat pip dependencies.
setup.sh – Clones nanochat, checks out the supported commit, copies speedrun_station.sh, and builds the Docker image.
launch.sh – Runs the training container (full pipeline: tokenizer → pretrain → SFT → report).
speedrun_station.sh – Modified speedrun script adapted for single-GPU DGX Station.
launch.sh will exit immediately.
docker stop, remove caches, and run docker system prune -a if needed.