Copyright © 2026 NVIDIA Corporation

Nanochat Training

30 MIN

Train a small ChatGPT-style LLM (nanochat) end to end, from tokenizer training through pretraining and SFT, on DGX Station with GB300 Ultra

Tags: DGX Station, Fine-tuning, GB300, LLM, PyTorch, Training, nanochat

Basic idea

This playbook shows how to train nanochat on a DGX Station with the GB300 Ultra Superchip. The full pipeline runs on a single system: custom BPE tokenizer training, base model pretraining, supervised fine-tuning (SFT), and inference via CLI or web UI.

The project uses the PyTorch NGC container, FineWeb for pretraining data, SmolTalk for SFT, and Weights & Biases for logging. The default configuration trains a ~1B parameter (d24) Transformer model with FP8 precision.

What you'll accomplish

  • Environment: Docker image with PyTorch NGC and nanochat dependencies on your DGX Station.
  • Training pipeline: BPE tokenizer (65K vocab), base model pretraining with FP8, SFT, and automated report generation.
  • Inference: ChatGPT-style web UI and CLI to chat with the base or SFT checkpoints.
  • Monitoring: W&B dashboards and nanochat_cache/report/report.md with metrics and samples.

What to know before starting

  • Basic Linux command line and shell usage.
  • Familiarity with Docker and GPU containers (e.g. docker run --gpus all).
  • Optional: understanding of LLM training (tokenizer, pretraining, fine-tuning).

Prerequisites

Hardware:

  • NVIDIA DGX Station with GB300 Ultra Superchip (288GB VRAM).
  • At least ~25GB of free storage for the cache (FineWeb data and checkpoints).

Software:

  • Docker with NVIDIA Container Toolkit. To verify GPU access from a container, run: docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi
  • Network access to download datasets (Hugging Face, FineWeb) and container images (nvcr.io).
  • Weights & Biases account and API key.
  • Hugging Face token for evaluation datasets.
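
The prerequisites above can be checked before launching anything. The following is a minimal sketch, not part of the playbook's assets. It assumes the API keys are supplied through the environment variables WANDB_API_KEY and HF_TOKEN (the names the wandb and huggingface_hub libraries read by default; launch.sh may expect different names), and the function name preflight is hypothetical.

```python
import os
import shutil

# Assumed variable names: these are the defaults read by the wandb and
# huggingface_hub libraries. The playbook's launch.sh may use other names.
REQUIRED_ENV = ("WANDB_API_KEY", "HF_TOKEN")
MIN_FREE_GB = 25  # taken from the storage prerequisite (~25GB+)


def preflight(env, free_bytes, docker_path):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if docker_path is None:
        problems.append("docker not found on PATH")
    for name in REQUIRED_ENV:
        if not env.get(name):
            problems.append(f"{name} is not set")
    free_gb = free_bytes / 1e9
    if free_gb < MIN_FREE_GB:
        problems.append(f"only {free_gb:.1f} GB free, need about {MIN_FREE_GB} GB")
    return problems


if __name__ == "__main__":
    issues = preflight(
        env=os.environ,
        free_bytes=shutil.disk_usage(".").free,  # free space where the cache will live
        docker_path=shutil.which("docker"),
    )
    for issue in issues:
        print("MISSING:", issue)
    print("ready" if not issues else f"{len(issues)} problem(s) found")
```

This only checks that the pieces are present. It does not confirm that the container can see the GPU, which the docker run command in the software list does.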

Model architecture (d24)

Layers: 24
Attention Heads: 12
Head Dimension: 128
Context Length: 2048 tokens
Vocabulary Size: 65,536 (2^16, trained BPE)
Precision: FP8 (e4m3, tensorwise scaling)
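
The precision line can be made concrete with a small simulation. The sketch below is pure Python and is not the training code. It assumes the e4m3 variant whose largest finite value is 448 (the range of PyTorch's torch.float8_e4m3fn), and it assumes "tensorwise scaling" means one scale factor per tensor, chosen so that the largest magnitude maps to that maximum. All function names are hypothetical.

```python
import math

E4M3_MAX = 448.0  # largest finite value of the assumed e4m3 variant


def round_e4m3(value):
    """Round a float to the nearest value on a simulated e4m3 grid."""
    if value == 0:
        return 0.0
    sign = -1.0 if value < 0 else 1.0
    mag = min(abs(value), E4M3_MAX)       # saturate instead of overflowing
    _, exp = math.frexp(mag)              # mag = m * 2**exp with 0.5 <= m < 1
    exp = max(exp, -5)                    # below 2**-6 the grid step stays fixed
    step = 2.0 ** (exp - 4)               # 1 implicit + 3 stored mantissa bits
    return sign * min(round(mag / step) * step, E4M3_MAX)


def quantize_tensorwise(values):
    """Scale the whole tensor by one factor, then round each element."""
    amax = max(abs(v) for v in values)
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    return [round_e4m3(v * scale) for v in values], scale


def dequantize(quantized, scale):
    """Undo the scaling; rounding error from quantization remains."""
    return [q / scale for q in quantized]


if __name__ == "__main__":
    weights = [0.001, -0.5, 0.25, 2.0]
    quantized, scale = quantize_tensorwise(weights)
    for original, restored in zip(weights, dequantize(quantized, scale)):
        print(f"{original:+.6f} -> {restored:+.6f}")
```

With a single scale per tensor, the largest element is represented exactly while small elements share the same coarse grid, which is the trade-off that finer-grained scaling schemes try to reduce.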

Training stages

Stage: Description
Tokenizer: Trains BPE tokenizer (65K vocab) on ~2B characters from FineWeb
Base pretraining: Pretrains d24 model on FineWeb with FP8, target data ratio of 8
SFT: Fine-tunes on synthetic identity conversations + SmolTalk
Report: Generates report.md with metrics, samples, and system info
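
The pretraining row ties the amount of training data to the model size. As a rough back-of-envelope sketch, the arithmetic below uses the d24 dimensions listed under Model architecture. It rests on several assumptions that the playbook does not state: that the data ratio means training tokens per model parameter, that each Transformer block holds about 12·d² weights (4·d² for attention, 8·d² for an MLP with 4x expansion, no biases), and that input and output embeddings are untied. The real model may count parameters differently, so treat the result as an order-of-magnitude check against the "~1B" figure, not an exact number.

```python
N_LAYER = 24
N_HEAD = 12
HEAD_DIM = 128
VOCAB_SIZE = 65_536
DATA_RATIO = 8  # assumed to mean training tokens per model parameter


def model_dim(n_head, head_dim):
    """Model width is the number of heads times the per-head dimension."""
    return n_head * head_dim


def estimate_params(n_layer, d_model, vocab_size):
    """Rough parameter count under the assumptions named in the lead-in."""
    blocks = 12 * n_layer * d_model ** 2      # attention + MLP weights
    embeddings = 2 * vocab_size * d_model     # untied input and output tables
    return blocks + embeddings


d_model = model_dim(N_HEAD, HEAD_DIM)
params = estimate_params(N_LAYER, d_model, VOCAB_SIZE)
token_budget = DATA_RATIO * params

if __name__ == "__main__":
    print(f"model dim:    {d_model}")
    print(f"parameters:   {params / 1e9:.2f}B (estimate)")
    print(f"token budget: {token_budget / 1e9:.2f}B tokens at ratio {DATA_RATIO}")
```

Under these assumptions the estimate lands a little under one billion parameters, which is consistent with the playbook's "~1B" description, and the token budget is eight times that.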

Ancillary files

All required assets are in nvidia/station-nanochat/assets/:

  • Dockerfile – PyTorch NGC image with nanochat pip dependencies.
  • setup.sh – Clones nanochat, checks out the supported commit, copies speedrun_station.sh, and builds the Docker image.
  • launch.sh – Runs the training container (full pipeline: tokenizer → pretrain → SFT → report).
  • speedrun_station.sh – Modified speedrun script adapted for single-GPU DGX Station.

Time & risk

  • Estimated time: ~30 minutes for setup. Full d24 training takes 12 or more hours on a single GB300 Ultra.
  • Risk level: Medium
    • Large downloads (FineWeb) can be slow; ensure stable network and disk space.
    • API keys (W&B, HF) must be set or launch.sh will exit immediately.
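  • Rollback: Stop containers with docker stop and remove caches. If needed, run docker system prune -a, keeping in mind that it deletes all unused images and stopped containers on the system, not only those from this playbook.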

Credits

  • nanochat by Andrej Karpathy
  • FineWeb by Hugging Face (pretraining data)
  • SmolTalk by Hugging Face (SFT data)

Resources

  • nanochat (GitHub)
  • Weights & Biases
  • Hugging Face (datasets / token)