Use torch.profiler to find training bottlenecks, then write custom Triton kernels to optimize LLaMA 8B fine-tuning
| Symptom | Cause | Fix |
|---|---|---|
| `ModuleNotFoundError: No module named 'triton'` | Container is missing Triton | Use the `kernel-dev-ft` container built from the playbook's Dockerfile; Triton ships with the PyTorch NGC containers. Verify: `python -c "import triton; print(triton.__version__)"`. |
| `triton.compiler.errors.CompilationError` referencing `sm_100` | Triton version too old for Blackwell | Use PyTorch NGC container 26.01+, which includes Triton with Blackwell support. Check: `python -c "import triton; print(triton.__version__)"`. |
| Correctness test fails with large differences in BF16 | Using FP32 tolerance for BF16 comparison | BF16 has only 7 mantissa bits. Use `atol=1e-2, rtol=1e-2` for `torch.allclose`. Differences up to ~0.01 are normal. |
| `torch.cuda.OutOfMemoryError` during baseline profiling | Batch size or sequence length too large | Reduce `--batch-size` or `--seq-len` in `profile_baseline.py`. LLaMA 3.1 8B in BF16 needs ~16 GB for weights alone, plus ~32 GB for the two AdamW moment buffers when they are also stored in BF16 (roughly double that if they are kept in FP32). |
| `torch.cuda.OutOfMemoryError` during PyTorch cross-entropy but NOT during the custom kernel | Standard cross-entropy materializes the full `[B*T, V]` logit tensor | This demonstrates exactly why the custom kernel is needed. Reduce batch size or sequence length for the baseline comparison, or run only the custom kernel path. |
| Profiler trace JSON is very large (>1 GB) | Too many training steps profiled | Lower `active` (and `repeat`, if set) in the profiler schedule; steps counted by `wait` and `warmup` are not written to the trace. The default script profiles only 1 active step. |
| `401 Client Error` when downloading LLaMA 3.1 8B | Missing or invalid Hugging Face token, or no LLaMA access | Set the `HF_TOKEN` environment variable. Accept the LLaMA 3.1 license at https://huggingface.co/meta-llama/Llama-3.1-8B. Verify the token: `huggingface-cli whoami`. |
| Custom RMSNorm backward produces NaN gradients | Epsilon value too small or input contains extreme values | Ensure epsilon is `1e-6` (LLaMA default). Check the input tensor for NaN/Inf with `torch.isfinite(x).all()`. |
| Benchmark shows no speedup for RMSNorm on small hidden dimensions | Kernel launch overhead dominates for small tensors | RMSNorm speedup is most visible at `hidden_size >= 2048`. LLaMA 3.1 8B uses 4096, which is well above the threshold. |
| `docker: Error response from daemon: could not select device driver` | NVIDIA Container Toolkit not installed or Docker not restarted | Install: `sudo apt install nvidia-container-toolkit && sudo systemctl restart docker`. Verify: `docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu24.04 nvidia-smi`. |
| Fused cross-entropy loss differs from PyTorch by more than 0.1 | Bug in the chunked online softmax implementation | Verify the running-max update: `m_new = max(m_old, chunk_max)` must happen BEFORE updating the running sum-of-exp `d`, and the old `d` must be rescaled by `exp(m_old - m_new)` before the current chunk's terms are added. Check that the target index masking uses the correct chunk offset. |
| Fine-tuning throughput is not improved despite faster kernels | GPU is compute-bound on matmuls, not bandwidth-bound on norms/loss | This is expected if batch size is large enough that matmuls dominate. The primary benefit is memory reduction (enabling larger batches or longer sequences) rather than pure latency. |
| `ImportError: cannot import name 'LlamaForCausalLM'` | `transformers` library version too old | Update: `pip install --upgrade "transformers>=4.45.0"` (the quotes keep the shell from treating `>` as a redirect). The container's Dockerfile pins a compatible version. |
| Chrome trace file won't open in browser | Trace file too large for `chrome://tracing` | Use the Perfetto UI (https://ui.perfetto.dev) instead, which handles larger traces, or reduce the number of profiled steps. |
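
The running-max check in the fused cross-entropy row is easier to reason about against a small reference. The sketch below is plain Python, not the playbook's Triton kernel: it streams one row of logits in chunks and keeps a running max `m` and running sum-of-exp `d`, as that row describes. The function names `chunked_cross_entropy` and `reference_cross_entropy` and the `chunk_size` parameter are hypothetical, chosen only for illustration.

```python
import math


def chunked_cross_entropy(logits, target, chunk_size):
    """Cross-entropy for one row of logits, streamed chunk by chunk.

    Illustrative reference only (hypothetical helper, not the playbook's kernel).
    """
    m = float("-inf")    # running max over the chunks seen so far
    d = 0.0              # running sum of exp(logit - m)
    target_logit = None  # logit at the target index, captured when its chunk arrives
    for start in range(0, len(logits), chunk_size):
        chunk = logits[start:start + chunk_size]
        # Update the max first, then rescale the old sum to the new max
        # before adding the current chunk's terms.
        m_new = max(m, max(chunk))
        d = d * math.exp(m - m_new) + sum(math.exp(x - m_new) for x in chunk)
        m = m_new
        # The target index is global, so subtract the chunk offset.
        local = target - start
        if 0 <= local < len(chunk):
            target_logit = chunk[local]
    # loss = logsumexp(logits) - logits[target]
    return m + math.log(d) - target_logit


def reference_cross_entropy(logits, target):
    """Single-pass reference that sees the whole row at once."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits)) - logits[target]
```

Comparing the two functions on a row that contains one very large logit is a quick way to catch a missing rescale or a wrong chunk offset, since either bug makes the chunked result drift from the reference.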