---
title: "Nanochat Training"
publisher: "nvidia"
type: "playbook"
updated: "2026-03-05T18:17:08.048Z"
description: "Train a small ChatGPT-style LLM (nanochat) with tokenizer, pretraining, midtraining, and SFT on DGX Station with GB300 Ultra"
canonical: "https://build.nvidia.com/station/nanochat.md"
---

# Basic idea

This playbook demonstrates training of [nanochat](https://github.com/karpathy/nanochat) on DGX Station with the GB300 Ultra Superchip. You run the full pipeline on a single system: custom BPE tokenizer training, base model pretraining, supervised fine-tuning (SFT), and inference via CLI or web UI.

The project uses the PyTorch NGC container, FineWeb for pretraining data, SmolTalk for SFT, and Weights & Biases for logging. The default configuration trains a ~1B parameter (d24) Transformer model with FP8 precision.

# What you'll accomplish

- **Environment:** Docker image with PyTorch NGC and nanochat dependencies on your DGX Station.
- **Training pipeline:** BPE tokenizer (65K vocab), base model pretraining with FP8, SFT, and automated report generation.
- **Inference:** ChatGPT-style web UI and CLI to chat with the base or SFT checkpoints.
- **Monitoring:** W&B dashboards and `nanochat_cache/report/report.md` with metrics and samples.

# What to know before starting

- Basic Linux command line and shell usage.
- Familiarity with Docker and GPU containers (e.g. `docker run --gpus all`).
- Optional: understanding of LLM training (tokenizer, pretraining, fine-tuning).

# Prerequisites

**Hardware:**

- NVIDIA DGX Station with GB300 Ultra Superchip (288GB VRAM).
- Adequate storage for cache (~25GB+ for FineWeb data and checkpoints).

**Software:**

- Docker with NVIDIA Container Toolkit: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi`
- Network access to download datasets (Hugging Face, FineWeb) and container images (nvcr.io)
- [Weights & Biases](https://wandb.ai/) account and API key.
- [Hugging Face](https://huggingface.co/docs/hub/en/security-tokens) token for evaluation datasets.

# Model architecture (d24)

```
Layers: 24
Attention Heads: 12
Head Dimension: 128
Context Length: 2048 tokens
Vocabulary Size: 65,536 (2^16, trained BPE)
Precision: FP8 (e4m3, tensorwise scaling)
```

# Training stages

| Stage | Description |
|-------|-------------|
| Tokenizer | Trains BPE tokenizer (65K vocab) on ~2B characters from FineWeb |
| Base pretraining | Pretrains d24 model on FineWeb with FP8, target data:param ratio of 8 |
| SFT | Fine-tunes on synthetic identity conversations + SmolTalk |
| Report | Generates `report.md` with metrics, samples, and system info |

# Ancillary files

All required assets are in `nvidia/station-nanochat/assets/`:

- `Dockerfile` – PyTorch NGC image with nanochat pip dependencies.
- `setup.sh` – Clones nanochat, checks out the supported commit, copies `speedrun_station.sh`, and builds the Docker image.
- `launch.sh` – Runs the training container (full pipeline: tokenizer → pretrain → SFT → report).
- `speedrun_station.sh` – Modified speedrun script adapted for single-GPU DGX Station.

# Time & risk

- **Estimated time:** ~30 minutes for setup. Full d24 training takes on the order of 12+ hours on a single GB300 Ultra.
- **Risk level:** Medium
- Large downloads (FineWeb) can be slow; ensure stable network and disk space.
- API keys (W&B, HF) must be set or `launch.sh` will exit immediately.
- **Rollback:** Stop containers with `docker stop`, remove caches, and run `docker system prune -a` if needed.

# Credits

- [nanochat](https://github.com/karpathy/nanochat) by Andrej Karpathy
- [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) by HuggingFace (pretraining data)
- [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) by HuggingFace (SFT data)

## More

- [Instructions](/station/nanochat/instructions.md)
- [Troubleshooting](/station/nanochat/troubleshooting.md)