Profiler-Driven Kernel Optimization for Fine-Tuning

Basic idea

DGX Station puts a full Blackwell GPU on your desk, which makes it an ideal environment for profiling and optimizing GPU kernels used during model training. This playbook walks through a real optimization workflow: profiling a LLaMA 3.1 8B fine-tuning run to identify bottlenecks, then writing custom Triton kernels that eliminate those bottlenecks — specifically a fused RMSNorm and a fused cross-entropy loss using online softmax.

For inference workloads, tools like torch.compile and serving frameworks (vLLM, TensorRT-LLM) already ship highly optimized fused kernels. But training workloads are different. Backward passes double the kernel count, large vocabularies create massive intermediate tensors during loss computation, and torch.compile does not restructure algorithms to avoid these allocations. Projects like Liger-Kernel and Unsloth demonstrate that custom training kernels deliver real results: 20-60% memory reduction and 10-30% throughput improvement.
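To make the "massive intermediate tensors" point concrete, here is a rough arithmetic sketch. The vocabulary size is LLaMA 3.1's 128,256 tokens; the batch size and sequence length are illustrative assumptions, not values taken from this playbook. It also assumes the common pattern in which an unfused loss upcasts the logits to FP32 and holds a second tensor of the same shape.

```python
# Rough memory arithmetic for cross-entropy intermediates.
VOCAB_SIZE = 128_256   # LLaMA 3.1 tokenizer vocabulary size
BATCH_SIZE = 8         # assumption, for illustration only
SEQ_LEN = 2048         # assumption, for illustration only


def tensor_gib(num_elements: int, bytes_per_element: int) -> float:
    """Size of a dense tensor in GiB."""
    return num_elements * bytes_per_element / 2**30


num_logits = BATCH_SIZE * SEQ_LEN * VOCAB_SIZE

# The logits themselves, stored in BF16 (2 bytes per element).
bf16_logits_gib = tensor_gib(num_logits, 2)

# One same-shaped FP32 intermediate (for example log-probabilities).
fp32_copy_gib = tensor_gib(num_logits, 4)

print(f"BF16 logits:           {bf16_logits_gib:.2f} GiB")
print(f"One FP32 intermediate: {fp32_copy_gib:.2f} GiB")
```

Under these assumptions the logits alone take about 3.9 GiB, and each same-shaped FP32 intermediate adds about 7.8 GiB, which is the kind of allocation a fused loss kernel is meant to avoid.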

This playbook uses Triton instead of raw CUDA C++. Triton is a Python-embedded GPU programming language that JIT-compiles to optimized GPU code: no nvcc compiler, no C++ build systems, no manual thread indexing. It is widely used for custom training kernels: Liger-Kernel and Unsloth write their kernels in Triton, and FlashAttention, whose reference implementation is CUDA C++, also has Triton implementations.

No prior Triton, CUDA, or GPU programming experience is required. The instructions explain each concept as it comes up.

What you'll accomplish

You will profile a LLaMA 3.1 8B fine-tuning workload, identify the key performance bottlenecks, and write custom Triton kernels that address them.

Profile a baseline fine-tuning step using torch.profiler and interpret the results to identify two targets: RMSNorm (memory-bandwidth-bound) and cross-entropy loss (memory-capacity-bound).
Write a fused RMSNorm kernel in Triton that processes normalization in a single GPU pass instead of multiple separate operations, improving memory bandwidth utilization from ~11% to ~80-90% of peak.
Write a fused cross-entropy kernel using the online softmax algorithm (Milakov-Gimelshein) that computes loss without materializing intermediate softmax tensors, achieving ~6x memory reduction and up to 4x latency improvement at realistic batch sizes.
Verify correctness of both kernels (forward and backward passes) against PyTorch reference implementations.
Benchmark the kernels to measure latency, throughput, and memory savings.
Integrate both kernels into an end-to-end LLaMA 3.1 8B fine-tuning loop and measure real training throughput and memory improvements.
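The online softmax idea named in the list above can be sketched in plain Python for a single row of logits. This is an illustration of the algorithm only, not the playbook's Triton kernel; the function names and the chunk size are hypothetical. The key point is that one pass over the vocabulary keeps just a running maximum and a rescaled running sum, so no softmax-sized tensor is ever stored.

```python
import math


def cross_entropy_two_pass(logits, target):
    """Reference: find the max, then sum exponentials, then form the loss."""
    row_max = max(logits)
    exp_sum = sum(math.exp(x - row_max) for x in logits)
    return math.log(exp_sum) + row_max - logits[target]


def cross_entropy_online(logits, target, chunk_size=4):
    """Single pass over the vocabulary in chunks (online softmax).

    Only two scalars are carried between chunks: the running maximum and
    the running sum of exponentials, expressed relative to that maximum.
    """
    running_max = float("-inf")
    running_sum = 0.0
    for start in range(0, len(logits), chunk_size):
        chunk = logits[start:start + chunk_size]
        new_max = max(running_max, max(chunk))
        # Rescale the old sum to the new maximum before adding this chunk.
        running_sum *= math.exp(running_max - new_max)
        running_sum += sum(math.exp(x - new_max) for x in chunk)
        running_max = new_max
    # loss = logsumexp(logits) - logits[target]
    return math.log(running_sum) + running_max - logits[target]


example_logits = [1.0, 3.5, -2.0, 0.25, 7.0, 6.5, -1.0, 2.0, 4.0]
example_target = 4
loss_reference = cross_entropy_two_pass(example_logits, example_target)
loss_online = cross_entropy_online(example_logits, example_target)
print(abs(loss_reference - loss_online) < 1e-9)
```

Subtracting the running maximum before exponentiating also keeps the computation stable for large logits, which is why the rescaling step is needed whenever a later chunk raises the maximum.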

What to know before starting

Comfort with the Linux command line and shell scripting.
Basic familiarity with Python and PyTorch (tensors, autograd, training loops).
Understanding of what fine-tuning is (training a pre-trained model on new data).
No Triton, CUDA, or GPU programming experience required — all code is explained.

Prerequisites

Hardware:

NVIDIA DGX Station with the GB300 Grace Blackwell Ultra Desktop Superchip.
At least 150 GB available storage for the container image, model weights (~16 GB for LLaMA 3.1 8B in BF16), profiler traces, and optimizer states.
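The storage figure above can be sanity-checked with rough arithmetic. This is a sketch under stated assumptions: roughly 8.03 billion parameters, and, purely for illustration, a checkpoint that also stores two FP32 AdamW moment tensors per parameter. The playbook does not say which optimizer or checkpoint format it uses.

```python
# Back-of-the-envelope storage estimate (decimal GB).
NUM_PARAMS = 8_030_000_000   # approximate LLaMA 3.1 8B parameter count
BYTES_BF16 = 2
BYTES_FP32 = 4

# Model weights stored in BF16.
weights_gb = NUM_PARAMS * BYTES_BF16 / 1e9

# Assumption: AdamW keeps two FP32 moments per parameter.
adamw_state_gb = NUM_PARAMS * 2 * BYTES_FP32 / 1e9

print(f"BF16 weights:        {weights_gb:.2f} GB")
print(f"FP32 AdamW moments:  {adamw_state_gb:.2f} GB")
```

The weights come to about 16 GB, matching the figure quoted above; the optimizer estimate only shows why a saved training state can be several times larger than the weights.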

Software:

Docker with the NVIDIA Container Toolkit. Verify that containers can see the GPU by running: docker run --rm --gpus all nvcr.io/nvidia/cuda:12.8.0-devel-ubuntu24.04 nvidia-smi
On a DGX Station, also confirm which device index belongs to the GB300 so later steps can target it explicitly. Run nvidia-smi --query-gpu=index,name --format=csv,noheader and note the index of the row showing NVIDIA GB300. Subsequent steps recommend --gpus '"device=N"' (with N = that index) instead of --gpus all so that profiling and benchmark numbers come from a single, known GPU.
Network access to pull container images from NGC and download model weights from Hugging Face.
A Hugging Face account with access to meta-llama/Llama-3.1-8B and a Hugging Face access token.
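The device-index lookup described above can be scripted. The following is a minimal sketch: the helper names and the sample output are hypothetical (in particular the second GPU's name is made up for illustration), and the parsing assumes the "index, name" layout that the nvidia-smi query above prints.

```python
import subprocess


def query_gpus() -> str:
    """Run the nvidia-smi query from the prerequisites and return its text."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def find_gpu_index(csv_text: str, name_substring: str = "GB300") -> int:
    """Return the index of the single GPU whose name contains name_substring."""
    matches = []
    for line in csv_text.strip().splitlines():
        index_field, _, name_field = line.partition(",")
        if name_substring in name_field:
            matches.append(int(index_field.strip()))
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match, found {len(matches)}")
    return matches[0]


# Hypothetical sample output; on real hardware use query_gpus() instead.
sample_output = "0, NVIDIA Example Display GPU\n1, NVIDIA GB300\n"

gpu_index = find_gpu_index(sample_output)
docker_gpu_flag = f"--gpus '\"device={gpu_index}\"'"
print(docker_gpu_flag)
```

Raising an error when zero or several rows match is a deliberate choice here: silently picking the wrong device would make later profiling numbers misleading.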

Ancillary files

All required assets are in the playbook directory nvidia/station-kernel-dev-ft/assets (see the dgx-spark-playbooks repository).

assets/Dockerfile — Development container based on NVIDIA's PyTorch NGC image with Triton, transformers, and profiling dependencies.
assets/requirements.txt — Python dependencies installed inside the container.
assets/profile_baseline.py — Profiling script that captures a torch.profiler trace of a LLaMA 3.1 8B training step and prints a breakdown of GPU time by operation. Supports flags to enable custom kernels for re-profiling.
assets/rmsnorm_kernel.py — Fused RMSNorm Triton kernel with forward and backward passes, wrapped as a drop-in torch.nn.Module replacement. Heavily commented with explanations of each Triton concept.
assets/rmsnorm_test.py — Correctness tests comparing the custom RMSNorm against PyTorch's reference implementation (forward and backward, FP32 and BF16).
assets/cross_entropy_kernel.py — Fused cross-entropy Triton kernel using online softmax, with forward and backward passes. Processes the vocabulary in chunks to avoid materializing the full softmax tensor.
assets/cross_entropy_test.py — Correctness tests and memory usage comparison against torch.nn.CrossEntropyLoss.
assets/benchmark_kernels.py — Benchmarking script that measures latency, throughput, bandwidth utilization, and peak memory for both custom kernels.
assets/finetune_baseline.py — Minimal LLaMA 3.1 8B fine-tuning script using vanilla PyTorch, reporting tokens/sec and peak memory.
assets/finetune_optimized.py — Identical fine-tuning script with both custom kernels monkey-patched in for direct comparison.
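For orientation before opening the RMSNorm files listed above, here is a plain-Python sketch of the operation for one row and of the kind of comparison a correctness test makes. It is not the playbook's kernel or test code: the function names are hypothetical and the eps default is an illustrative assumption. The unfused version mirrors a chain of separate operations, each producing an intermediate; the fused version reads the input once for the reduction and once for the output.

```python
import math


def rmsnorm_unfused(x, weight, eps=1e-6):
    """RMSNorm as a chain of separate steps, each with its own intermediate."""
    squares = [v * v for v in x]                     # elementwise square
    mean_square = sum(squares) / len(x)              # reduction
    inv_rms = 1.0 / math.sqrt(mean_square + eps)     # scalar
    normalized = [v * inv_rms for v in x]            # elementwise scale
    return [n * w for n, w in zip(normalized, weight)]  # apply learned weight


def rmsnorm_fused(x, weight, eps=1e-6):
    """Same math with no stored intermediates beyond one scalar."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def max_abs_diff(a, b):
    """Largest elementwise difference, as a correctness test would check."""
    return max(abs(p - q) for p, q in zip(a, b))


row = [3.0, -4.0, 0.5, 2.0]
gain = [1.0, 0.5, 2.0, 1.5]
difference = max_abs_diff(rmsnorm_unfused(row, gain), rmsnorm_fused(row, gain))
print(difference < 1e-12)
```

On a GPU the benefit of fusing comes from avoiding round trips to memory for each intermediate, which this CPU sketch cannot show; it only pins down the math that both versions must agree on.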

Time & risk

Estimated time: About 2 hours.
- Steps 1-4 (setup through baseline profiling): about 30 minutes.
- Steps 5-7 (RMSNorm kernel): about 30 minutes.
- Steps 8-10 (cross-entropy kernel): about 40 minutes.
- Step 11 (end-to-end integration): about 20 minutes.
- Steps 12-13 (cleanup and next steps): a few minutes.
Risk level: Low
- All work runs inside a Docker container — no host system modifications.
- LLaMA 3.1 8B model weights (~16 GB in BF16) are downloaded from Hugging Face on first run and cached locally.
- Requires a Hugging Face token with access to the LLaMA 3.1 model.
Rollback: Exit the container. Your source files are preserved in the mounted assets/ directory; everything else is discarded.
Last Updated: 05/26/2026
- First publication