Copyright © 2026 NVIDIA Corporation

Nanochat Training

30 MIN

Train a small ChatGPT-style LLM (nanochat), covering tokenizer training, pretraining, midtraining, and SFT, on a DGX Station with GB300 Ultra

Tags: DGX Station, Fine-tuning, GB300, LLM, PyTorch, Training, nanochat

Step 1
Prerequisites and environment

Ensure your DGX Station has Docker with the NVIDIA runtime and GPU access. Nanochat uses Weights & Biases (W&B) for training visualization and needs a Hugging Face token to access evaluation datasets.

# Verify GPU and Docker
nvidia-smi
docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi
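Beyond confirming that the GPU is visible, it can help to check how much GPU memory is reported before committing to a long run. The following is a minimal sketch, not part of the playbook: the helper names and the threshold value are hypothetical, and it assumes nvidia-smi reports memory.total as a number in MiB on your system.

```shell
# Print total memory in MiB for each visible GPU, one value per line.
gpu_memory_mib() {
    nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits
}

# Succeed only if at least one GPU is listed and every GPU reports
# at least $1 MiB. Non-numeric output (for example "N/A") counts as 0.
gpus_have_at_least() {
    min=$1
    gpu_memory_mib | awk -v min="$min" '
        $1 + 0 < min { bad = 1 }
        END { exit (NR == 0 || bad) ? 1 : 0 }'
}

# Example (hypothetical threshold of 200000 MiB):
#   gpus_have_at_least 200000 && echo "GPU memory looks sufficient"
```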

Create a W&B account and a Hugging Face token if you don't have them. Export both keys in your shell:

export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
export HF_TOKEN=<YOUR_HF_TOKEN>
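Because a missing key may only surface partway through a long run, a quick check before launching can save time. This is a minimal sketch; missing_keys is a hypothetical helper, not something the playbook ships.

```shell
# Hypothetical helper: print the names of required keys that are
# unset or empty in the current environment (space-separated).
missing_keys() {
    missing=""
    for name in WANDB_API_KEY HF_TOKEN; do
        # Look up the value of the variable whose name is in $name
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            missing="$missing $name"
        fi
    done
    # Strip the leading space before printing
    echo "${missing# }"
}

# Example:
#   m=$(missing_keys)
#   [ -z "$m" ] || echo "Please export: $m"
```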

Step 2
Clone and set up

Clone the playbook repository and navigate to the assets directory:

git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-spark-playbooks/nvidia/station-nanochat/assets

Run the setup script. It clones nanochat, checks out the supported commit, copies the station-adapted speedrun_station.sh, and builds the nanochat Docker image (PyTorch NGC base with dependencies):

./setup.sh

You should see the nanochat image listed if you run docker images. Your directory structure after setup should look like this:

assets/
├── Dockerfile
├── launch.sh
├── setup.sh
├── speedrun_station.sh
└── nanochat/
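The layout above can be verified with a short script. This is a sketch under the assumption that the listing is complete; check_assets is a hypothetical helper name.

```shell
# Hypothetical helper: verify that the assets directory ($1, default ".")
# contains the files and the nanochat/ checkout listed above.
check_assets() {
    dir=${1:-.}
    status=0
    for f in Dockerfile launch.sh setup.sh speedrun_station.sh; do
        [ -f "$dir/$f" ] || { echo "missing file: $f"; status=1; }
    done
    [ -d "$dir/nanochat" ] || { echo "missing directory: nanochat/"; status=1; }
    return $status
}

# Example, run from the assets directory:
#   check_assets . && echo "setup looks complete"
```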

Step 3
Launch training

Ensure your API keys are exported, then launch:

./launch.sh

The training runs inside the nanochat container and executes the full pipeline automatically:

  1. Tokenizer training — downloads ~2B characters from FineWeb, trains a 65K BPE tokenizer
  2. Base model pretraining — downloads additional FineWeb shards, pretrains a d24 model (~1B params) with FP8
  3. SFT — downloads synthetic identity conversations, fine-tunes for chat
  4. Report generation — produces report.md with metrics and samples

The full d24 run takes roughly 12 hours or more on a single GB300 Ultra.

Step 4
Monitor training

W&B dashboard:

Track training at wandb.ai under the nanochat project. The exact link to the W&B run is printed in the training logs. Key metrics:

  • Training loss
  • Validation BPB
  • Throughput (tokens/sec)
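Validation BPB (bits per byte) normalizes the loss by the byte length of the text rather than the token count, so it stays comparable across tokenizers. The sketch below shows the usual conversion from a mean per-token cross-entropy in nats; bits_per_byte is a hypothetical helper, the example numbers are made up, and nanochat's own evaluation may differ in details such as which tokens are counted.

```shell
# bits_per_byte MEAN_LOSS_NATS NUM_TOKENS NUM_BYTES
# Total nats = mean loss * tokens; divide by ln(2) to get bits,
# then by the number of bytes the tokens decode to.
bits_per_byte() {
    awk -v loss="$1" -v tokens="$2" -v bytes="$3" \
        'BEGIN { printf "%.4f\n", (loss * tokens) / (log(2) * bytes) }'
}

# Example with made-up numbers: 3 bits per token (about 2.0794 nats)
# over 1000 tokens covering 4000 bytes gives 0.75 bits per byte.
bits_per_byte 2.0794415417 1000 4000
```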

Step 5
Inference

After training, checkpoints are saved under the nanochat_cache/ directory. Run inference in the nanochat container, either through the web UI or from the command line:

Web UI (recommended):

docker run --rm --gpus all --net=host \
    -v "$(pwd)/nanochat:/workspace/nanochat" \
    -v "$(pwd)/nanochat_cache:/root/.cache/nanochat" \
    -w /workspace/nanochat \
    nanochat \
    python -m scripts.chat_web

Open a browser to http://<STATION_IP>:8000 where <STATION_IP> is your DGX Station’s IP address.

CLI:

docker run --rm -it --gpus all \
    -v "$(pwd)/nanochat:/workspace/nanochat" \
    -v "$(pwd)/nanochat_cache:/root/.cache/nanochat" \
    -w /workspace/nanochat \
    nanochat \
    python -m scripts.chat_cli -p "Why is the sky blue?"

Step 6
Cleanup

To stop training early, interrupt the launch script or stop the container:

WARNING

This stops the training run and any in-progress work in the container.

# If launch.sh is running: press Ctrl+C

# Or stop the container directly
docker stop $(docker ps -q --filter ancestor=nanochat)
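When no nanochat container is running, the command substitution above expands to nothing and docker stop exits with a usage error. A guarded variant might look like the following sketch; stop_nanochat is a hypothetical helper name.

```shell
# Hypothetical helper: stop every running container created from the
# nanochat image, and do nothing if there are none.
stop_nanochat() {
    ids=$(docker ps -q --filter ancestor=nanochat)
    if [ -z "$ids" ]; then
        echo "no running nanochat containers"
        return 0
    fi
    # Intentionally unquoted so each container ID is a separate argument
    docker stop $ids
}

# Example:
#   stop_nanochat
```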

To free disk space:

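# Delete cached data, including trained checkpoints in nanochat_cache
rm -rf ./nanochat_cache ./hf_cache
# Remove ALL unused Docker images, containers, and networks on this host, not only nanochat
docker system prune -a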

Step 7
Customization

Smaller/faster run: Edit speedrun_station.sh before running setup to reduce data and model size:

# Fewer data shards (10 instead of default)
python -m nanochat.dataset -n 10 &

# Smaller model (d4 instead of d24), smaller batch size
python -m scripts.base_train --depth=4 --device-batch-size=32

Batch size: The default --device-batch-size=64 is tuned for the GB300's 288 GB of VRAM. Increase it if GPU utilization is low, or decrease it if training runs out of memory (OOM).

After editing speedrun_station.sh, re-run ./setup.sh to rebuild with the changes.
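If you change the batch size often, the edit can be scripted. The sketch below is an illustration only: set_device_batch_size is a hypothetical helper, and it assumes speedrun_station.sh contains the flag literally as --device-batch-size=<number>, which you should confirm in your copy of the script.

```shell
# Hypothetical helper: rewrite every --device-batch-size=<number> in
# file $1 to use the value $2. Writes back through cat so the file
# keeps its permissions (for example the executable bit).
set_device_batch_size() {
    file=$1
    size=$2
    # Accept only a non-empty string of digits
    case $size in
        ''|*[!0-9]*) echo "batch size must be a positive integer" >&2; return 1 ;;
    esac
    tmp=$(mktemp) || return 1
    sed "s/--device-batch-size=[0-9][0-9]*/--device-batch-size=$size/g" "$file" > "$tmp" \
        && cat "$tmp" > "$file"
    rc=$?
    rm -f "$tmp"
    return $rc
}

# Example:
#   set_device_batch_size speedrun_station.sh 32 && ./setup.sh
```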

Resources

  • nanochat (GitHub)
  • Weights & Biases
  • Hugging Face (datasets / token)