Nanochat Training

Step 1
Prerequisites and environment

Ensure your DGX Station has Docker installed with the NVIDIA runtime and GPU access. Nanochat uses Weights & Biases (W&B) for training visualization and a Hugging Face token to download evaluation datasets.

# Verify GPU and Docker
nvidia-smi
docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi

Create a W&B account and a Hugging Face token if you don't have them. Export both keys in your shell:

export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
export HF_TOKEN=<YOUR_HF_TOKEN>
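
Since the launch step expects both keys to be exported, a quick preflight check can catch a missing one before a long run starts. The following is a minimal sketch; `check_keys` is a hypothetical helper name and is not part of the playbook.

```shell
# Hypothetical helper (not part of the playbook): report which of the
# required keys are missing or empty in the current environment.
check_keys() {
    missing=0
    for name in WANDB_API_KEY HF_TOKEN; do
        # Indirect lookup via eval keeps this portable to POSIX sh;
        # the variable names are fixed, so nothing untrusted is evaluated.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Calling `check_keys && ./launch.sh` would then start training only when both variables are set.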

Step 2
Clone and set up

Clone the playbook repository and navigate to the assets directory:

git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-spark-playbooks/nvidia/station-nanochat/assets

Run the setup script. It clones nanochat, checks out the supported commit, copies the station-adapted speedrun_station.sh, and builds the nanochat Docker image (PyTorch NGC base with dependencies):

./setup.sh

You should see the nanochat image listed if you run docker images. Your directory structure after setup should look like this:

assets/
├── Dockerfile
├── launch.sh
├── setup.sh
├── speedrun_station.sh
└── nanochat/
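
The layout above can also be checked from the shell. This is a sketch only; `check_layout` is a hypothetical helper, and it tests only for the entries listed in the tree.

```shell
# Hypothetical helper: confirm the files and directory shown in the tree
# exist under the given directory (defaults to the current directory).
check_layout() {
    dir=${1:-.}
    status=0
    for f in Dockerfile launch.sh setup.sh speedrun_station.sh; do
        [ -f "$dir/$f" ] || { echo "missing file: $f" >&2; status=1; }
    done
    [ -d "$dir/nanochat" ] || { echo "missing directory: nanochat/" >&2; status=1; }
    return "$status"
}

# Example, run from the assets directory:
#   check_layout . && docker image inspect nanochat >/dev/null && echo "setup looks complete"
```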

Step 3
Launch training

Ensure your API keys are exported, then launch:

./launch.sh

On a single-GPU DGX Station, the default --gpus all selects the GB300. On a multi-GPU host, set GPU_DEVICE to target the GB300 explicitly (N is the GB300's device ID from nvidia-smi):

GPU_DEVICE='"device=N"' ./launch.sh
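
To find N without reading the nvidia-smi table by eye, the device list can be filtered by name. The sketch below assumes the GB300's reported name contains the string "GB300", which may not match your driver's naming; `pick_gpu_index` is a hypothetical helper.

```shell
# Hypothetical helper: read "index, name" CSV lines on stdin and print the
# index of the first GPU whose name contains the given substring.
pick_gpu_index() {
    awk -F', ' -v pat="$1" 'index($2, pat) { print $1; exit }'
}

# Example (assumes the device name contains "GB300"):
#   N=$(nvidia-smi --query-gpu=index,name --format=csv,noheader | pick_gpu_index GB300)
#   GPU_DEVICE="\"device=$N\"" ./launch.sh
```

If the helper prints nothing, no device name matched, and the ID should be read from nvidia-smi directly.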

The training runs inside the nanochat container and executes the full pipeline automatically:

Tokenizer training: downloads ~2B characters from ClimbMix and trains a 65K BPE tokenizer
Base model pretraining: downloads additional ClimbMix shards and pretrains a d24 model (~1B params) with FP8
SFT: downloads synthetic identity conversations and fine-tunes the model for chat
Report generation: produces report.md with metrics and samples

The full d24 run takes roughly 12 hours or more on a single GB300 Ultra.
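
For a run of this length, keeping a copy of the console output can be useful. The sketch below assumes launch.sh does not need an interactive terminal; `run_logged` and the file name train.log are hypothetical.

```shell
# Hypothetical helper: run a command and mirror its stdout and stderr
# to a log file while still printing them to the terminal.
# Note: the pipeline's exit status is tee's, not the command's.
run_logged() {
    log_file=$1
    shift
    "$@" 2>&1 | tee "$log_file"
}

# Example:
#   run_logged train.log ./launch.sh
```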

Step 4
Monitor training

W&B dashboard:

Track training at wandb.ai under the nanochat project. The training logs print the exact link to the W&B run. Key metrics:

Training loss
Validation BPB
Throughput (tokens/sec)
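
If the training output was saved to a file, the run link can be pulled out with grep. This is a sketch under assumptions: the log file path is whatever you chose when capturing output, and `find_wandb_urls` is a hypothetical helper.

```shell
# Hypothetical helper: print each distinct wandb.ai URL found in a log file.
find_wandb_urls() {
    grep -o 'https://wandb\.ai/[^[:space:]]*' "$1" | sort -u
}

# Example (train.log is an assumed file name):
#   find_wandb_urls train.log
```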

Step 5
Inference

After training, checkpoints are saved under the nanochat_cache/ directory. Run inference in the nanochat container through either the web UI or the CLI. The commands below mount ./nanochat and ./nanochat_cache, so run them from the assets directory.

On a multi-GPU host, replace --gpus all in the commands below with --gpus '"device=N"' (where N is the GB300's device ID from nvidia-smi) to pin inference to the GB300.

Web UI (recommended):

docker run --rm --gpus all --net=host \
    -v $(pwd)/nanochat:/workspace/nanochat \
    -v $(pwd)/nanochat_cache:/root/.cache/nanochat \
    -w /workspace/nanochat \
    nanochat \
    python -m scripts.chat_web

Open a browser to http://<STATION_IP>:8000 where <STATION_IP> is your DGX Station’s IP address.

CLI:

docker run --rm -it --gpus all \
    -v $(pwd)/nanochat:/workspace/nanochat \
    -v $(pwd)/nanochat_cache:/root/.cache/nanochat \
    -w /workspace/nanochat \
    nanochat \
    python -m scripts.chat_cli -p "Why is the sky blue?"
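
For repeated prompts, the CLI command above can be wrapped in a small function. This is only a convenience sketch: `nanochat_cli` is a hypothetical name, and reusing the GPU_DEVICE variable here is an assumption that mirrors the launch step.

```shell
# Hypothetical wrapper around the CLI command above.
# GPU_DEVICE defaults to "all"; set it to '"device=N"' on a multi-GPU host.
nanochat_cli() {
    docker run --rm -it --gpus "${GPU_DEVICE:-all}" \
        -v "$(pwd)/nanochat:/workspace/nanochat" \
        -v "$(pwd)/nanochat_cache:/root/.cache/nanochat" \
        -w /workspace/nanochat \
        nanochat \
        python -m scripts.chat_cli -p "$1"
}

# Example:
#   nanochat_cli "Why is the sky blue?"
```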

Step 6
Cleanup

To stop training early, interrupt the launch script or stop the container:

WARNING

This stops the training run and any in-progress work in the container.

# If launch.sh is running: press Ctrl+C

# Or stop the container directly
docker stop $(docker ps -q --filter ancestor=nanochat)
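
When no nanochat container is running, the command substitution above is empty and docker stop reports an error about missing arguments. A guarded variant might look like the sketch below; `stop_nanochat` is a hypothetical helper.

```shell
# Hypothetical helper: stop running containers created from the nanochat
# image, and do nothing if there are none.
stop_nanochat() {
    ids=$(docker ps -q --filter ancestor=nanochat)
    if [ -z "$ids" ]; then
        echo "no running nanochat containers"
        return 0
    fi
    # Word splitting is intentional: one argument per container ID.
    docker stop $ids
}
```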

To free disk space (the cache directories are root-owned because the container runs as root, so sudo is required):

sudo rm -rf ./nanochat_cache ./hf_cache
docker rmi nanochat

Step 7
Customization

Smaller/faster run: Edit speedrun_station.sh before running setup to reduce data and model size:

# Fewer data shards (10 instead of default)
python -m nanochat.dataset -n 10 &

# Smaller model (d4 instead of d24), smaller batch size
python -m scripts.base_train --depth=4 --device-batch-size=32

Batch size: The default --device-batch-size=64 is tuned for the GB300's 288 GB of VRAM. Lower it if training runs out of memory (OOM), or raise it if GPU utilization is low.

After editing speedrun_station.sh, re-run ./setup.sh to rebuild the image with your changes.
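
The depth edit can also be scripted instead of done by hand. The sketch below assumes speedrun_station.sh passes the depth as a --depth=<number> flag, which should be confirmed by reading the script first; `set_depth` is a hypothetical helper.

```shell
# Hypothetical helper: rewrite every --depth=<number> flag in a script.
# Writes through a temporary file so the original file keeps its permissions.
set_depth() {
    file=$1
    depth=$2
    tmp=$(mktemp)
    sed "s/--depth=[0-9][0-9]*/--depth=$depth/g" "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Example:
#   set_depth speedrun_station.sh 4 && grep -n -- '--depth=' speedrun_station.sh
```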
