Profiler-Driven Kernel Optimization for Fine-Tuning

2 HRS

Use torch.profiler to find training bottlenecks, then write custom Triton kernels to optimize LLaMA 8B fine-tuning

DGX Station, Fine-Tuning, GB300, Kernel Development, LLaMA, Performance Optimization, Training, Triton
Troubleshooting
| Symptom | Cause | Fix |
| --- | --- | --- |
| `ModuleNotFoundError: No module named 'triton'` | Container missing Triton | Use the `kernel-dev-ft` container built from the playbook's Dockerfile. Triton ships with PyTorch NGC containers. Verify: `python -c "import triton; print(triton.__version__)"`. |
| `triton.compiler.errors.CompilationError` referencing `sm_100` | Triton version too old for Blackwell | Use PyTorch NGC container 26.01+, which includes Triton with Blackwell support. Check: `python -c "import triton; print(triton.__version__)"`. |
| Correctness test fails with large differences in BF16 | Using FP32 tolerance for BF16 comparison | BF16 has only 7 mantissa bits. Use `atol=1e-2, rtol=1e-2` for `torch.allclose`. Differences up to ~0.01 are normal. |
| `torch.cuda.OutOfMemoryError` during baseline profiling | Batch size or sequence length too large | Reduce `--batch-size` or `--seq-len` in `profile_baseline.py`. LLaMA 3.1 8B in BF16 needs ~16 GB for weights alone, plus ~32 GB for AdamW optimizer states. |
| `torch.cuda.OutOfMemoryError` during PyTorch cross-entropy but NOT during custom kernel | Standard cross-entropy materializes the full `[B*T, V]` logit tensor | This demonstrates exactly why the custom kernel is needed. Reduce batch size or sequence length for the baseline comparison, or run only the custom kernel path. |
| Profiler trace JSON is very large (>1 GB) | Too many training steps profiled | Reduce `wait`, `warmup`, and `active` in the profiler schedule. The default script profiles only 1 active step. |
| `401 Client Error` when downloading LLaMA 3.1 8B | Missing or invalid Hugging Face token, or no LLaMA access | Set the `HF_TOKEN` environment variable. Accept the LLaMA 3.1 license at https://huggingface.co/meta-llama/Llama-3.1-8B. Verify the token: `huggingface-cli whoami`. |
| Custom RMSNorm backward produces NaN gradients | Epsilon value too small or input contains extreme values | Ensure epsilon is `1e-6` (LLaMA default). Check the input tensor for NaN/Inf with `torch.isfinite(x).all()`. |
| Benchmark shows no speedup for RMSNorm on small hidden dimensions | Kernel launch overhead dominates for small tensors | RMSNorm speedup is most visible at `hidden_size >= 2048`. LLaMA 3.1 8B uses 4096, which is well above the threshold. |
| `docker: Error response from daemon: could not select device driver` | NVIDIA Container Toolkit not installed or Docker not restarted | Install: `sudo apt install nvidia-container-toolkit && sudo systemctl restart docker`. Verify: `docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu24.04 nvidia-smi`. |
| Fused cross-entropy loss differs from PyTorch by more than 0.1 | Bug in the chunked online softmax implementation | Verify the running-max update: `m_new = max(m_old, chunk_max)` must happen BEFORE updating the running sum-of-exp `d`. Check that the target index masking uses the correct chunk offset. |
| Fine-tuning throughput is not improved despite faster kernels | GPU is compute-bound on matmuls, not bandwidth-bound on norms/loss | This is expected if the batch size is large enough that matmuls dominate. The primary benefit is memory reduction (enabling larger batches or longer sequences) rather than pure latency. |
| `ImportError: cannot import name 'LlamaForCausalLM'` | `transformers` library version too old | Update: `pip install --upgrade "transformers>=4.45.0"` (the quotes stop the shell from treating `>` as a redirect). The container's Dockerfile pins a compatible version. |
| Chrome trace file won't open in browser | Trace file too large for `chrome://tracing` | Use Perfetto UI instead, which handles larger traces. Or reduce the number of profiled steps. |
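
The BF16 tolerance entry in the table can be illustrated without a GPU. The sketch below rounds a value to BF16 in plain Python and applies the same closeness criterion that `torch.allclose` documents, `|a - b| <= atol + rtol * |b|`. The helper names `to_bf16` and `is_close` are hypothetical and not part of the playbook; NaN handling is omitted for brevity.

```python
import struct

def to_bf16(x: float) -> float:
    """Round x to the nearest BF16 value (round-to-nearest-even).

    BF16 is the top 16 bits of an IEEE float32: 1 sign bit, 8 exponent
    bits, 7 explicit mantissa bits. NaN inputs are not handled here.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)          # ties go to even
    bits = ((bits + rounding_bias) >> 16) << 16          # drop the low 16 bits
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

def is_close(a: float, b: float, atol: float, rtol: float) -> bool:
    """Scalar version of the criterion used by torch.allclose."""
    return abs(a - b) <= atol + rtol * abs(b)

reference = 3.14159
rounded = to_bf16(reference)
print(rounded)                                            # 3.140625
print(is_close(rounded, reference, atol=1e-8, rtol=1e-5))  # False (FP32-style tolerance)
print(is_close(rounded, reference, atol=1e-2, rtol=1e-2))  # True  (BF16 tolerance)
```

With 7 explicit mantissa bits, a single rounding can introduce a relative error of up to 2^-8 (about 0.0039), so an FP32-style tolerance rejects results that are correct to BF16 precision.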
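
The memory figures in the out-of-memory entries follow from simple arithmetic, sketched below. Several inputs are assumptions rather than values from the playbook: a rounded parameter count of 8e9, optimizer states stored at 2 bytes each (which is what reproduces the ~32 GB figure; FP32 states would double it), a vocabulary size of 128256 for LLaMA 3.1, and a hypothetical batch of 8 sequences of 4096 tokens.

```python
GB = 1e9  # decimal gigabytes, matching the round figures in the table

def weights_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights (BF16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param / GB

def adamw_states_gb(n_params: float, bytes_per_state: int = 2) -> float:
    """AdamW keeps two moment estimates per parameter.

    bytes_per_state=2 is an assumption chosen to match the ~32 GB figure.
    """
    return 2 * n_params * bytes_per_state / GB

def logits_gb(batch_size: int, seq_len: int, vocab_size: int,
              bytes_per_elem: int = 2) -> float:
    """Memory for a fully materialized [B*T, V] logit tensor."""
    return batch_size * seq_len * vocab_size * bytes_per_elem / GB

n_params = 8e9  # rounded, assumed parameter count
print(weights_gb(n_params))        # 16.0
print(adamw_states_gb(n_params))   # 32.0
print(logits_gb(8, 4096, 128256))  # about 8.4 for this hypothetical batch
```

The logit tensor grows linearly with both batch size and sequence length, which is why the standard cross-entropy path can run out of memory while a chunked kernel that never materializes the full tensor does not.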
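
For the fused cross-entropy entry, the ordering of the running-max update and the chunk offset for the target index can be checked against a reference in plain Python. This is a minimal sketch of the online softmax recurrence for a single row of logits, not the playbook's Triton kernel; the function names and the `chunk_size` parameter are hypothetical.

```python
import math

def cross_entropy_reference(logits, target):
    """Cross-entropy over the full row: logsumexp(logits) - logits[target]."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[target]

def cross_entropy_chunked(logits, target, chunk_size):
    """Same loss, computed one chunk of the vocabulary at a time."""
    m = -math.inf        # running max over all chunks seen so far
    d = 0.0              # running sum of exp(x - m)
    target_logit = None
    for start in range(0, len(logits), chunk_size):
        chunk = logits[start:start + chunk_size]
        # 1. Update the running max BEFORE touching the running sum.
        m_new = max(m, max(chunk))
        # 2. Rescale the old sum to the new max, then add this chunk.
        d = d * math.exp(m - m_new) + sum(math.exp(x - m_new) for x in chunk)
        m = m_new
        # 3. The target index is global, so subtract the chunk offset.
        if start <= target < start + len(chunk):
            target_logit = chunk[target - start]
    return m + math.log(d) - target_logit

logits = [2.0, -1.0, 0.5, 3.0, 1.5, -2.0, 4.0]
for size in (1, 2, 3, 7):
    diff = abs(cross_entropy_chunked(logits, 3, size)
               - cross_entropy_reference(logits, 3))
    print(size, diff < 1e-9)
```

If step 2 used the old max for the new chunk, or step 3 indexed the chunk with the global target index, the chunked result would diverge from the reference, which is the kind of discrepancy the table entry describes.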
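
The RMSNorm NaN entry comes down to two checks: epsilon keeps the denominator away from zero, and non-finite inputs propagate to the output. The sketch below shows the RMSNorm forward formula in plain Python with the `1e-6` epsilon from the table; `rms_norm` and `all_finite` are hypothetical helpers, with `all_finite` standing in for `torch.isfinite(x).all()`.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """y_i = x_i / sqrt(mean(x^2) + eps) * w_i"""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)  # eps keeps this finite for all-zero x
    return [v * inv_rms * w for v, w in zip(x, weight)]

def all_finite(values):
    """True when no element is NaN or +/-Inf."""
    return all(math.isfinite(v) for v in values)

ones = [1.0, 1.0, 1.0, 1.0]
print(all_finite(rms_norm([0.0, 0.0, 0.0, 0.0], ones)))        # True: eps avoids division by zero
print(all_finite(rms_norm([math.inf, 1.0, 2.0, 3.0], ones)))  # False: Inf input yields NaN output
```

Checking the input before the kernel runs separates a data problem from a kernel bug: if the input is already non-finite, no choice of epsilon will fix the gradients.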

Resources

  • Triton Language Documentation
  • PyTorch Profiler Documentation
  • Liger-Kernel (reference implementations)
  • Blackwell Architecture Tuning Guide

Copyright © 2026 NVIDIA Corporation