Run cuTile kernel benchmarks, FMHA implementation, and LLM inference on DGX Spark and B300
TileGym is NVIDIA's benchmark suite and integration framework for cuTile kernels - high-performance GPU kernels written using the cuTile Python DSL. cuTile compiles to Tile IR, enabling developers to write efficient kernels without low-level CUDA programming.
This playbook covers three workflows:

- Running the cuTile kernel benchmark suite
- Stepping through the FMHA implementation and optimization tutorial
- Running LLM inference with cuTile kernels integrated via TileGym
The same cuTile code runs on both DGX Spark (sm_121) and B300 (sm_103) - cuTile JIT compiles to the appropriate GPU architecture automatically.
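To make the architecture naming concrete: DGX Spark reports CUDA compute capability 12.1 (sm_121) and B300 reports 10.3 (sm_103). The sketch below shows how a compute capability tuple maps to an sm target string; the helper name is hypothetical and for illustration only, since cuTile performs this selection internally when it JIT-compiles Tile IR.

```python
def sm_arch(major: int, minor: int) -> str:
    """Format a CUDA compute capability as an sm_XY target string.

    Hypothetical helper for illustration; cuTile's JIT does this
    selection automatically when lowering Tile IR for the local GPU.
    """
    return f"sm_{major}{minor}"

# DGX Spark (compute capability 12.1) and B300 (compute capability 10.3):
print(sm_arch(12, 1))  # sm_121
print(sm_arch(10, 3))  # sm_103
```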
Hardware Requirements:

- An NVIDIA DGX Spark (sm_121) or B300 (sm_103) system
Software Requirements:

- Docker
Verify Docker is available:

```bash
docker ps
```

If you get a permission error, add yourself to the `docker` group:

```bash
sudo usermod -aG docker $USER
newgrp docker
```
| Kernel | Category | Data Types | Description |
|---|---|---|---|
| FMHA | Attention | float16, float8 | Flash Multi-Head Attention |
| MLA | Attention | bfloat16, float8 | Multi-head Latent Attention |
| MLA Decoding | Attention | float16, float8 | MLA for decode phase |
| MatMul | Matrix Ops | float16, float8 | Matrix multiplication |
| BMM | Matrix Ops | float16 | Batched matrix multiplication |
| Group GEMM | Matrix Ops | float16, float8 | Grouped GEMM for MoE |
| RMSNorm | Normalization | float16, bfloat16 | Root mean square normalization |
| RoPE | Positional | float16 | Rotary position embedding |
| SiLU | Activation | float16, float32 | SiLU activation with multiply |
| SwiGLU | Activation | float16, float32 | SwiGLU fused operation |
| Softmax | Activation | float16 | Softmax normalization |
| Dropout | Regularization | float16, float32 | Dropout forward |
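Benchmarks for JIT-compiled kernels like these typically run warm-up iterations first so that compilation time is excluded from the measurement. The sketch below shows that general shape with a plain-Python stand-in kernel; the function names and iteration counts are illustrative assumptions, not TileGym's harness (which lives in `tests/benchmark/run_all.sh`).

```python
import math
import time

def bench(kernel, *args, warmup=10, iters=100):
    """Time a callable: warm-up runs first (to exclude JIT compilation
    and cache effects), then return mean seconds per call."""
    for _ in range(warmup):
        kernel(*args)              # trigger JIT / warm caches
    start = time.perf_counter()
    for _ in range(iters):
        kernel(*args)
    return (time.perf_counter() - start) / iters

# Stand-in "kernel": SiLU activation with multiply, on plain lists.
def silu_mul(x, y):
    return [a / (1.0 + math.exp(-a)) * b for a, b in zip(x, y)]

mean_s = bench(silu_mul, [0.5] * 1024, [2.0] * 1024)
print(f"mean latency: {mean_s * 1e6:.1f} us")
```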

| Model | Supported Kernels | Batch Size | Output Tokens | Notes |
|---|---|---|---|---|
| Qwen2-7B | RoPE, RMSNorm, SwiGLU, FMHA | 16 | 50 | Standard transformer |
| DeepSeek-V2-Lite | RoPE, RMSNorm, SiLU, MLA, MoE | 1 | 100 | MLA attention, MoE layers |
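Integration along the lines of the "Supported Kernels" column amounts to swapping selected ops in a model's forward path for optimized implementations with the same contract. The sketch below shows that substitution pattern in plain Python; all names (`baseline_rmsnorm`, `tile_rmsnorm`, `KERNEL_REGISTRY`) are hypothetical and do not reflect TileGym's actual API.

```python
import math

def baseline_rmsnorm(x, weight, eps=1e-6):
    """Reference RMSNorm on plain lists: x / rms(x) * weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]

def tile_rmsnorm(x, weight, eps=1e-6):
    # Stand-in for an optimized cuTile kernel; it must honor the
    # same signature and numerics as the baseline it replaces.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]

KERNEL_REGISTRY = {"rmsnorm": baseline_rmsnorm}

def enable_tile_kernels():
    # The model keeps calling KERNEL_REGISTRY["rmsnorm"]; only the
    # implementation behind the name changes.
    KERNEL_REGISTRY["rmsnorm"] = tile_rmsnorm

enable_tile_kernels()
out = KERNEL_REGISTRY["rmsnorm"]([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```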
All required assets can be found in the TileGym repository.
- `tests/benchmark/run_all.sh` - Run all kernel benchmarks
- `modeling/transformers/bench_qwen.sh` - Qwen2-7B benchmark script
- `modeling/transformers/bench_deepseek.sh` - DeepSeek-V2-Lite benchmark script
- `modeling/transformers/infer.py` - Main inference script with TileGym integration
- `assets/fmha_optimization_tutorial.py` - FMHA step-by-step optimization tutorial
- `assets/fmha_scaling_analysis.py` - FMHA scaling analysis across sequence lengths