Nanochat Training

30 MIN

Train a small ChatGPT-style LLM (nanochat) with tokenizer, pretraining, midtraining, and SFT on DGX Station with GB300 Ultra

DGX Station, Fine-tuning, GB300, LLM, PyTorch, Training, nanochat
Troubleshooting

Symptom: WANDB_API_KEY is not set or HF_TOKEN is not set
Cause: Required env vars not exported before launch.sh
Fix: export WANDB_API_KEY=<key> and export HF_TOKEN=<token> in the same shell, then re-run ./launch.sh

Symptom: RuntimeError: CUDA out of memory
Cause: Batch size too large for available VRAM
Fix: Edit speedrun_station.sh: reduce --device-batch-size (try 64, 32, 16, 8). Re-run ./setup.sh then ./launch.sh

Symptom: Docker container exits immediately
Cause: Missing env vars, bad cache paths, or build failure
Fix: Check logs: docker ps -a then docker logs <container_id>. Fix env vars or paths as needed

Symptom: nanochat image not found
Cause: Setup not run or Docker build failed
Fix: From the assets/ directory, run ./setup.sh and confirm with docker images | grep nanochat

Symptom: No such file or directory for cache paths
Cause: Cache directories don't exist
Fix: launch.sh creates them automatically under $(pwd)/nanochat_cache and $(pwd)/hf_cache. If using custom paths, create them: mkdir -p $NANOCHAT_CACHE $HF_CACHE

Symptom: Training hangs at "Waiting for dataset download"
Cause: Network issue downloading FineWeb shards
Fix: Check network connectivity. The download can take time depending on bandwidth. If it persists, restart ./launch.sh

Symptom: W&B shows wrong user / stale login
Cause: Cached W&B credentials in container volume
Fix: speedrun_station.sh runs wandb login --relogin with your key automatically. Ensure WANDB_API_KEY is correct

Symptom: Container runs but launch.sh says "Training complete!" immediately
Cause: Container failed fast and exited before the poll loop detected it
Fix: Check docker ps -a for the exited container and inspect logs with docker logs <id>

Symptom: GPU not visible inside container
Cause: Docker NVIDIA runtime not configured
Fix: Test: docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi. If it fails, install/configure NVIDIA Container Toolkit
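
Several of the fixes above (missing env vars, missing cache directories, containers that exit early) can be checked before starting a run. The following is a minimal sketch of such a pre-flight helper, not part of the playbook's own scripts. The function names (missing_env_vars, ensure_cache_dirs, show_exited_containers, preflight) are hypothetical, and the default cache locations are assumed to match the ones launch.sh is described as creating.

```shell
#!/bin/sh
# Hypothetical pre-flight helper; the playbook does not ship this script.

# Print the name of each required variable that is unset or empty.
missing_env_vars() {
  [ -n "${WANDB_API_KEY:-}" ] || echo "WANDB_API_KEY"
  [ -n "${HF_TOKEN:-}" ] || echo "HF_TOKEN"
}

# Create the cache directories. The defaults are an assumption based on the
# paths launch.sh is said to use: $(pwd)/nanochat_cache and $(pwd)/hf_cache.
ensure_cache_dirs() {
  NANOCHAT_CACHE="${NANOCHAT_CACHE:-$(pwd)/nanochat_cache}"
  HF_CACHE="${HF_CACHE:-$(pwd)/hf_cache}"
  mkdir -p "$NANOCHAT_CACHE" "$HF_CACHE"
}

# List exited containers with their status so a fast failure is visible.
# Requires the docker CLI; follow up with: docker logs <container_id>
show_exited_containers() {
  docker ps -a --filter "status=exited" \
    --format '{{.ID}}  {{.Image}}  {{.Status}}'
}

# Run the checks that need no GPU or Docker access.
preflight() {
  missing="$(missing_env_vars | tr '\n' ' ')"
  if [ -n "$missing" ]; then
    echo "Missing required variables: $missing" >&2
    return 1
  fi
  ensure_cache_dirs
  echo "Pre-flight checks passed"
}
```

One way to use it would be to source the file in the same shell that exports WANDB_API_KEY and HF_TOKEN, run preflight, and only then run ./launch.sh. If launch.sh reports completion right away, show_exited_containers gives the container ID to pass to docker logs.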

Resources

  • nanochat (GitHub)
  • Weights & Biases
  • Hugging Face (datasets / token)

Copyright © 2026 NVIDIA Corporation