cuTile Kernels
60 MIN
Run cuTile kernel benchmarks, an FMHA implementation tutorial, and LLM inference on DGX Spark and B300
Basic idea
TileGym is NVIDIA's benchmark suite and integration framework for cuTile kernels - high-performance GPU kernels written using the cuTile Python DSL. cuTile compiles to Tile IR, enabling developers to write efficient kernels without low-level CUDA programming.
This playbook covers three workflows:
- Kernel Benchmarks - Run standalone cuTile kernel benchmarks (FMHA, MatMul, RMSNorm, etc.)
- End-to-End Inference - Run LLM inference with cuTile-optimized kernels via monkey-patching
- FMHA Implementation - Step-by-step tutorial building a Flash Multi-Head Attention kernel from pseudocode to optimized cuTile, with companion scripts to run and benchmark
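As a rough illustration of what monkey-patching means in the second workflow, the sketch below swaps a class's forward method for an optimized stand-in at runtime, without editing the class itself. The names ToyRMSNorm, fast_forward, and apply_patch are hypothetical and exist only for this example; they are not TileGym or transformers APIs, and the real integration may work differently.

```python
# Illustrative only: ToyRMSNorm, fast_forward and apply_patch are hypothetical
# names invented for this sketch, not TileGym or transformers APIs.
import math


class ToyRMSNorm:
    """Stand-in for a model layer shipped by a library."""

    def __init__(self, eps=1e-6):
        self.eps = eps

    def forward(self, x):
        # Reference path: x / sqrt(mean(x^2) + eps)
        mean_sq = sum(v * v for v in x) / len(x)
        return [v / math.sqrt(mean_sq + self.eps) for v in x]


def fast_forward(self, x):
    """Stand-in for an optimized kernel; it must match the reference numerically."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + self.eps)
    return [v * inv_rms for v in x]


def apply_patch():
    """Replace the class attribute so every instance uses the new forward."""
    original = ToyRMSNorm.forward
    ToyRMSNorm.forward = fast_forward
    return original  # keep a handle so the patch can be undone


layer = ToyRMSNorm()
before = layer.forward([1.0, 2.0, 3.0])
original_forward = apply_patch()
after = layer.forward([1.0, 2.0, 3.0])  # same object, now routed to fast_forward
print(before, after)
```

The model code that calls layer.forward is unchanged; only the attribute lookup resolves to a different function. Restoring ToyRMSNorm.forward = original_forward undoes the patch.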
The same cuTile code runs on both DGX Spark (sm_121) and B300 (sm_103) - cuTile JIT compiles to the appropriate GPU architecture automatically.
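To see which architecture a given machine would be compiled for, a small stdlib-only helper like the one below can report it. This is a sketch, not part of TileGym: the function names are hypothetical, and it assumes the installed driver's nvidia-smi supports the compute_cap query field. It only reports the target; cuTile's own JIT selection is not involved.

```python
import shutil
import subprocess


def sm_target(compute_cap):
    """Turn a compute capability string such as '12.1' into 'sm_121'."""
    major, minor = compute_cap.strip().split(".")
    return f"sm_{int(major)}{int(minor)}"


def detect_sm_targets():
    """Return one sm_XY string per visible GPU, or [] if detection is not possible."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    targets = []
    for line in result.stdout.splitlines():
        try:
            targets.append(sm_target(line))
        except ValueError:
            # Skip lines that are not of the form "<major>.<minor>".
            continue
    return targets


print(detect_sm_targets())
```

On the systems named above, the expected values are sm_121 for DGX Spark and sm_103 for B300.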
What you'll accomplish
- Run the TileGym benchmark suite on DGX Spark
- Run Qwen2-7B or DeepSeek-V2-Lite inference with cuTile-optimized kernels
- Observe performance scaling between DGX Spark and B300
- Build an FMHA kernel step-by-step from pseudocode to optimized cuTile implementation
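For orientation before the FMHA tutorial, here is a plain-Python sketch of the idea flash-style attention kernels are generally built on: visiting keys and values one tile at a time while keeping a running softmax. It is an assumption-laden illustration for a single query vector, not the tutorial's code and not cuTile; the function names and the tile size are made up for this example.

```python
import math


def naive_attention(q, keys, values):
    """Reference: softmax(q.k / sqrt(d)) weighted sum of values, for one query."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def tiled_attention(q, keys, values, tile=2):
    """Same result, visiting keys/values one tile at a time with a running softmax."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            p = math.exp(s - new_max)
            running_sum += p
            acc = [a + p * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0, 2.0]
keys = [[1.0, 0.0, 1.0], [0.0, 2.0, -1.0], [3.0, 1.0, 0.5], [-1.0, 0.5, 2.0], [0.2, 0.2, 0.2]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(naive_attention(q, keys, values))
print(tiled_attention(q, keys, values, tile=2))
```

The tiled version never holds the full score vector, which is why this formulation suits on-chip memory; the two functions should agree up to floating-point rounding.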
What to know before starting
- Basic familiarity with Docker and command-line tools
- Understanding of GPU compute concepts (TFLOPS, memory bandwidth)
- No CUDA programming experience required
- HuggingFace account with access token (for LLM inference)
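As a quick refresher on the compute concepts listed above, the sketch below turns a matrix multiplication's shape and a measured time into TFLOPS, and estimates its arithmetic intensity (FLOPs per byte moved). The helper names and the 1 ms timing are hypothetical, and the byte count assumes each matrix is read or written exactly once.

```python
def matmul_tflops(m, n, k, seconds):
    """Achieved TFLOPS for an (m x k) @ (k x n) product: 2*m*n*k FLOPs."""
    flops = 2 * m * n * k
    return flops / seconds / 1e12


def matmul_arithmetic_intensity(m, n, k, bytes_per_element=2):
    """FLOPs per byte, assuming A, B, and C each cross the memory bus once."""
    flops = 2 * m * n * k
    bytes_moved = (m * k + k * n + m * n) * bytes_per_element
    return flops / bytes_moved


# Hypothetical measurement: a 4096^3 float16 matmul taking 1 ms.
print(f"{matmul_tflops(4096, 4096, 4096, 1e-3):.1f} TFLOPS")
print(f"{matmul_arithmetic_intensity(4096, 4096, 4096):.1f} FLOPs/byte")
```

High arithmetic intensity means a kernel tends to be limited by compute throughput; low intensity (as in normalization or activation kernels) means memory bandwidth is usually the limit.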
Prerequisites
Hardware Requirements:
- DGX Spark with Ubuntu 24.04 or B300 cloud instance
- Minimum 16GB GPU memory for LLM inference
- At least 50GB available storage space for model downloads
Software Requirements:
- Docker installed and configured (check with docker ps)
- CUDA Toolkit 13.x with Tile IR support
- HuggingFace token for model access (LLM inference only)
- Network access for pulling containers and downloading models
Verify Docker is available:
docker ps
If you get a permission error:
sudo usermod -aG docker $USER
newgrp docker
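The checks above can also be scripted. The following is a minimal sketch that looks for Docker on the PATH, compares free disk space against the 50GB figure, and checks for a HuggingFace token. The function name is hypothetical, and reading the token from the HF_TOKEN environment variable is an assumption; this playbook section does not say how the token is supplied.

```python
import os
import shutil


def check_prerequisites(path=".", min_free_gb=50):
    """Collect simple pass/fail facts about the local environment."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return {
        "docker_on_path": shutil.which("docker") is not None,
        "free_disk_gb": round(free_gb, 1),
        "enough_disk": free_gb >= min_free_gb,
        # Assumption: the token is exported as HF_TOKEN.
        "hf_token_set": bool(os.environ.get("HF_TOKEN")),
    }


for name, value in check_prerequisites().items():
    print(f"{name}: {value}")
```

This does not replace docker ps, which also confirms that the daemon is running and that the current user has permission to reach it.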
Kernel support matrix
| Kernel | Category | Data Types | Description |
|---|---|---|---|
| FMHA | Attention | float16, float8 | Flash Multi-Head Attention |
| MLA | Attention | bfloat16, float8 | Multi-head Latent Attention |
| MLA Decoding | Attention | float16, float8 | MLA for decode phase |
| MatMul | Matrix Ops | float16, float8 | Matrix multiplication |
| BMM | Matrix Ops | float16 | Batched matrix multiplication |
| Group GEMM | Matrix Ops | float16, float8 | Grouped GEMM for MoE |
| RMSNorm | Normalization | float16, bfloat16 | Root mean square normalization |
| RoPE | Positional | float16 | Rotary position embedding |
| SiLU | Activation | float16, float32 | SiLU activation with multiply |
| SwiGLU | Activation | float16, float32 | SwiGLU fused operation |
| Softmax | Activation | float16 | Softmax normalization |
| Dropout | Regularization | float16, float32 | Dropout forward |
Model support for LLM inference
| Model | Supported Kernels | Batch Size | Output Tokens | Notes |
|---|---|---|---|---|
| Qwen2-7B | RoPE, RMSNorm, SwiGLU, FMHA | 16 | 50 | Standard transformer |
| DeepSeek-V2-Lite | RoPE, RMSNorm, SiLU, MLA, MoE | 1 | 100 | MLA attention, MoE layers |
Ancillary files
All required assets can be found in the TileGym repository.
- tests/benchmark/run_all.sh - Run all kernel benchmarks
- modeling/transformers/bench_qwen.sh - Qwen2-7B benchmark script
- modeling/transformers/bench_deepseek.sh - DeepSeek-V2-Lite benchmark script
- modeling/transformers/infer.py - Main inference script with TileGym integration
- assets/fmha_optimization_tutorial.py - FMHA step-by-step optimization tutorial
- assets/fmha_scaling_analysis.py - FMHA scaling analysis across sequence lengths
Time & risk
- Estimated time: 30-45 minutes (including model download for LLM inference)
- Risk level: Low
  - Large downloads may fail due to network issues
  - First run includes JIT compilation overhead
- Rollback: Remove Docker container to undo all changes
- Last Updated: February 2026
  - First Publication
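Because the first run includes JIT compilation overhead, timing it together with later runs would skew any measurement. A minimal sketch of one way to separate the two is shown below; time_with_warmup and fake_kernel are hypothetical names, and the sleep merely simulates a one-time compile. This is not how the TileGym benchmark scripts are implemented.

```python
import statistics
import time


def time_with_warmup(fn, warmup=1, iters=5):
    """Return (total warm-up seconds, median steady-state seconds) for fn.

    For GPU work, fn should block until the kernel has finished, since
    kernel launches are asynchronous.
    """
    warmup_total = 0.0
    for _ in range(warmup):
        start = time.perf_counter()
        fn()
        warmup_total += time.perf_counter() - start
    steady = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        steady.append(time.perf_counter() - start)
    return warmup_total, statistics.median(steady)


calls = {"n": 0}


def fake_kernel():
    calls["n"] += 1
    if calls["n"] == 1:
        time.sleep(0.02)  # pretend this is one-time compilation


first_call, steady_median = time_with_warmup(fake_kernel, warmup=1, iters=5)
print(f"first call: {first_call * 1e3:.2f} ms, steady median: {steady_median * 1e3:.4f} ms")
```

Reporting the median of the steady-state runs keeps a single slow iteration from dominating the result.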