Profiler-Driven Kernel Optimization for Fine-Tuning

Symptom	Cause	Fix
`ModuleNotFoundError: No module named 'triton'`	Container missing Triton	Use the `kernel-dev-ft` container built from the playbook's Dockerfile. Triton ships with PyTorch NGC containers. Verify: `python -c "import triton; print(triton.__version__)"`.
`triton.compiler.errors.CompilationError` referencing `sm_100`	Triton version too old for Blackwell	Use PyTorch NGC container 26.01 or later, which includes a Triton build with Blackwell support. Check: `python -c "import triton; print(triton.__version__)"`.
Cross-entropy BF16 test fails on loss or gradient	BF16 rounding error accumulates across the 128K-entry vocabulary, so results drift from PyTorch's cross-entropy path	`cross_entropy_test.py` uses relaxed loss tolerances and compares gradients in float32 with wider `atol`/`rtol`. If it still fails, check your PyTorch and CUDA versions, and include `torch.__version__` when filing an issue.
`RuntimeError: Trace is already saved` from profiler	An older `profile_baseline.py` saved the same trace twice in one run (TensorBoard `on_trace_ready` handler and a manual `export_chrome_trace`); on PyTorch 2.10 kineto enforces save-once	Use the current `profile_baseline.py`, which exports the Chrome trace exactly once. This is an in-process double-save, not a stale-directory issue, so deleting `traces/*` does not fix the older script.
`torch.cuda.OutOfMemoryError` during baseline profiling	Batch size or sequence length too large	Reduce `--batch-size` or `--seq-len` in `profile_baseline.py`. LLaMA 3.1 8B in BF16 needs ~16 GB for weights alone, plus ~32 GB for AdamW optimizer states (two moment tensors per parameter, assuming they are also stored in BF16).
`torch.cuda.OutOfMemoryError` during PyTorch cross-entropy but NOT during custom kernel	Standard cross-entropy materializes full `[B*T, V]` logit tensor	This demonstrates exactly why the custom kernel is needed. Reduce batch size or sequence length for the baseline comparison, or run only the custom kernel path.
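Profiler trace JSON is very large (>1 GB)	Too many training steps profiled	Reduce `active` (and `repeat`, if set) in the profiler schedule. Only active steps are written to the trace, so changing `wait` or `warmup` does not shrink it. The default script profiles only 1 active step.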
`401 Client Error` when downloading LLaMA 3.1 8B	Missing or invalid Hugging Face token, or no LLaMA access	Set `HF_TOKEN` environment variable. Accept the LLaMA 3.1 license at `https://huggingface.co/meta-llama/Llama-3.1-8B`. Verify token: `huggingface-cli whoami`.
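Custom RMSNorm backward produces NaN gradients	Epsilon value too small or input contains extreme values	Ensure the kernel's epsilon matches `rms_norm_eps` in the checkpoint's `config.json` (the `LlamaConfig` default is `1e-6`, but checkpoints can override it). Check the input tensor for NaN/Inf with `torch.isfinite(x).all()`.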
Benchmark shows no speedup for RMSNorm on small hidden dimensions	Kernel launch overhead dominates for small tensors	RMSNorm speedup is most visible at `hidden_size >= 2048`. LLaMA 3.1 8B uses 4096, which is well above the threshold.
`docker: Error response from daemon: could not select device driver`	NVIDIA Container Toolkit not installed or Docker not restarted	Install: `sudo apt install nvidia-container-toolkit && sudo systemctl restart docker`. Verify: `docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu24.04 nvidia-smi`.
Fused cross-entropy loss differs from PyTorch by more than 0.1	Bug in the chunked online softmax implementation	Verify the running-max update: `m_new = max(m_old, chunk_max)` must happen BEFORE updating the running sum-of-exp `d`, and `d` must be rescaled by `exp(m_old - m_new)` before the current chunk's `exp(x - m_new)` terms are added. Check that the target index masking uses the correct chunk offset.
Fine-tuning throughput is not improved despite faster kernels	GPU is compute-bound on matmuls, not bandwidth-bound on norms/loss	This is expected if batch size is large enough that matmuls dominate. The primary benefit is memory reduction (enabling larger batches or longer sequences) rather than pure latency.
`ImportError: cannot import name 'LlamaForCausalLM'`	`transformers` library version too old	Update: `pip install --upgrade "transformers>=4.45.0"` (the quotes keep the shell from treating `>` as a redirect). The container's Dockerfile pins a compatible version.
Chrome trace file won't open in browser	Trace file too large for `chrome://tracing`	Use Perfetto UI instead, which handles larger traces. Or reduce the number of profiled steps.
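
The memory figures quoted in the two out-of-memory rows can be reproduced with back-of-the-envelope arithmetic. The sketch below is illustrative only: the parameter count (about 8.03 billion), the vocabulary size (128,256), and the example batch size and sequence length are assumptions that do not come from the playbook scripts, and the helper names `tensor_gb` and `logits_gb` are hypothetical.

```python
GB = 10 ** 9  # decimal gigabytes, matching the "~16 GB" style used in the table

# Assumed values, not read from the playbook scripts.
NUM_PARAMS = 8_030_000_000  # approximate parameter count of LLaMA 3.1 8B
VOCAB_SIZE = 128_256        # approximate "128K" vocabulary
BF16, FP32 = 2, 4           # bytes per element


def tensor_gb(num_elements, bytes_per_element):
    """Memory footprint of a dense tensor in decimal GB."""
    return num_elements * bytes_per_element / GB


def logits_gb(batch_size, seq_len, bytes_per_element=BF16):
    """Size of the full [B*T, V] logit tensor built by the standard loss path."""
    return tensor_gb(batch_size * seq_len * VOCAB_SIZE, bytes_per_element)


weights = tensor_gb(NUM_PARAMS, BF16)
# AdamW keeps two moment tensors per parameter (assumed BF16 here).
adamw_states = 2 * tensor_gb(NUM_PARAMS, BF16)

print(f"weights (BF16):            {weights:.1f} GB")
print(f"AdamW states (BF16):       {adamw_states:.1f} GB")
print(f"logits B=4, T=8192 (BF16): {logits_gb(4, 8192):.1f} GB")
print(f"logits B=4, T=8192 (FP32): {logits_gb(4, 8192, FP32):.1f} GB")
```

The logit tensor grows linearly with both batch size and sequence length, which is why the standard cross-entropy path can run out of memory at settings where a chunked kernel still fits.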

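The update order described in the fused cross-entropy row can be checked against a small reference. This is a minimal pure-Python sketch of chunked online softmax for a single row of logits, not the playbook's Triton kernel; the function names are hypothetical and the logits are assumed to be finite.

```python
import math


def chunked_cross_entropy(logits, target, chunk_size):
    """Cross-entropy for one row, streaming over the vocabulary in chunks."""
    m = float("-inf")    # running max of all logits seen so far
    d = 0.0              # running sum of exp(logit - m)
    target_logit = None
    for start in range(0, len(logits), chunk_size):
        chunk = logits[start:start + chunk_size]
        # 1) Update the running max first.
        m_new = max(m, max(chunk))
        # 2) Rescale the old sum to the new max, then add this chunk's terms.
        d = d * math.exp(m - m_new) + sum(math.exp(x - m_new) for x in chunk)
        m = m_new
        # 3) The target index is global, so subtract the chunk offset
        #    before testing whether it falls inside this chunk.
        local = target - start
        if 0 <= local < len(chunk):
            target_logit = chunk[local]
    # loss = logsumexp(logits) - logits[target]
    return m + math.log(d) - target_logit


def reference_cross_entropy(logits, target):
    """Single-pass reference that sees the whole row at once."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits)) - logits[target]


row = [2.0, -1.0, 0.5, 3.0, 0.0, 1.5, -2.0]
print(chunked_cross_entropy(row, target=5, chunk_size=3))
print(reference_cross_entropy(row, target=5))
```

If the two printed values disagree by more than floating-point noise after porting this logic to a kernel, the usual suspects are a missing `exp(m - m_new)` rescale or a target lookup that ignores `start`.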

Resources