Run cuTile kernel benchmarks, FMHA implementation, and LLM inference on DGX Spark and B300
If you haven't already, pull the CUDA container and clone TileGym (see Kernel Benchmarks tab for details).
First, clone TileGym on the host:
mkdir -p ~/TileGym
git clone https://github.com/NVIDIA/TileGym ~/TileGym
Then launch the container with the repository mounted:
docker run --gpus all -it --rm \
-v ~/TileGym:/workspace/TileGym \
-v ~/.cache/huggingface:/root/.cache/huggingface \
nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 \
/bin/bash
NOTE
The -v ~/.cache/huggingface:/root/.cache/huggingface option mounts your HuggingFace cache into the container so models are not re-downloaded on every run.
Install TileGym inside the container:
cd /workspace/TileGym
pip install .
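Optionally, verify the install before moving on. This is a minimal sanity check, assuming pip install . pulled in PyTorch plus the cuda.tile and tilegym packages used later in this guide:
import torch
import cuda.tile as ct  # cuTile Python API (assumed to be installed as a TileGym dependency)
import tilegym

# Confirm the GPU is visible and TileGym resolves to the mounted checkout.
print("GPU:", torch.cuda.get_device_name(0))
print("tilegym imported from:", tilegym.__file__)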
Set your HuggingFace token for accessing gated models:
export HF_TOKEN=<your_huggingface_token>
WARNING
You need a HuggingFace account and an access token. Create one at https://huggingface.co/settings/tokens.
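Alternatively, you can authenticate from Python with huggingface_hub's login() (installed alongside transformers), which caches the token for later runs. A minimal sketch that reuses the exported HF_TOKEN:
import os
from huggingface_hub import login

# Reads the token from the environment; avoid hard-coding or committing it.
login(token=os.environ["HF_TOKEN"])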
Navigate to the transformers benchmark directory:
cd modeling/transformers
Option A: Run Qwen2-7B benchmark
./bench_qwen.sh
Configuration: Model Qwen/Qwen2-7B, Batch size 16, Output length 50 tokens.
Option B: Run DeepSeek-V2-Lite benchmark
./bench_deepseek.sh
Configuration: Model deepseek-ai/DeepSeek-V2-Lite-Chat, Batch size 1, Output length 100 tokens.
Both scripts run two configurations: a naive bfloat16 baseline and the cuTile-accelerated variant.
Sample DGX Spark (GB10) Results for Qwen2-7B:
========================================
Benchmark Results
========================================
Qwen2-7B_naive_bfloat16 | 15.66 tokens/s | 51.10s | 51151.0ms CUDA
Qwen2-7B_cutile_attn | 18.52 tokens/s | 43.20s | 43079.7ms CUDA
========================================
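For reference, a tokens/s figure like the ones above is simply the number of generated tokens divided by wall-clock generation time. The following is a minimal, self-contained sketch of that measurement, not the repo's bench_qwen.sh; the model name, batch size 16, and 50 output tokens mirror the configuration listed above:
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-7B"
tok = AutoTokenizer.from_pretrained(model_id)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

# Batch size 16 and 50 new tokens, matching the Qwen2 benchmark configuration above.
inputs = tok(["Hello"] * 16, return_tensors="pt", padding=True).to("cuda")
torch.cuda.synchronize()
t0 = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=50, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - t0

# Count only the newly generated tokens across the whole batch.
new_tokens = (out.shape[1] - inputs["input_ids"].shape[1]) * out.shape[0]
print(f"{new_tokens / elapsed:.2f} tokens/s over {elapsed:.2f}s")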
cuTile Kernel Breakdown (DGX Spark - Qwen2):
| Kernel | CUDA Time (ms) | Calls |
|---|---|---|
| fmha_kernel | 4185.9 | 28 |
| swiglu_forward_kernel | 2459.8 | 1400 |
| attention_decode_kernel_grouped | 2271.8 | 1372 |
| rms_norm_kernel_static_persistent | 634.7 | 57 |
| rope_kernel | 355.6 | 1400 |
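TileGym's benchmark scripts emit this breakdown for you; if you want a comparable per-kernel view of your own runs, torch.profiler can collect one. A minimal sketch, assuming model and inputs are set up as in the timing snippet above:
import torch
from torch.profiler import profile, ProfilerActivity

# Profile a short generation and rank kernels by accumulated CUDA time.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model.generate(**inputs, max_new_tokens=10, do_sample=False)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))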
TileGym replaces PyTorch model operations with cuTile kernels. The snippet below is taken from TileGym's src/tilegym/transformers/monkey_patch.py and invoked from modeling/transformers/infer.py:
from tilegym.transformers import apply_tilegym_kernel_to_qwen2
apply_tilegym_kernel_to_qwen2(
rope=True, # Replace RoPE with cuTile kernel
rms_norm=True, # Replace RMSNorm with cuTile kernel
swiglu=True, # Replace SwiGLU with cuTile kernel
attn=True, # Replace attention with cuTile FMHA
use_cutile=True # Use cuTile backend (vs Triton)
)
Patched Kernels for Qwen2:
| Kernel | PyTorch Operation | cuTile Replacement |
|---|---|---|
| rms_norm_kernel_static_persistent | nn.RMSNorm | Persistent RMSNorm |
| rope_kernel | Rotary position embedding | Fused RoPE |
| fmha_kernel | F.scaled_dot_product_attention | Flash Attention |
| swiglu_forward_kernel | SiLU + Mul | Fused SwiGLU |
| attention_decode_kernel_grouped | Decode attention | Grouped decode |
Patched Kernels for DeepSeek-V2 (see src/tilegym/transformers/monkey_patch.py):
from tilegym.transformers import apply_tilegym_kernel_to_deepseek_v2
apply_tilegym_kernel_to_deepseek_v2(
rope=True, # Replace RoPE with cuTile kernel
rms_norm=True, # Replace RMSNorm with cuTile kernel
swiglu=True, # Replace SiLU+Mul with cuTile kernel
attn=True, # Replace MLA attention with cuTile
moe=True, # Replace MoE routing with cuTile
use_cutile=True
)
| Kernel | PyTorch Operation | cuTile Replacement |
|---|---|---|
| prefill_mla | MLA prefill attention | Multi-head Latent Attention |
| _mla_decoding_split_kv | MLA decode attention | Split-KV decoding |
| fused_moe_kernel | MoE expert routing | Fused MoE |
| group_gemm_kernel | Expert FFN | Grouped GEMM |
cuTile exposes two complementary performance-tuning mechanisms:
- ct.ByTarget - Select different kernel launch parameters per GPU architecture (sm_<major><minor>). The compiler picks the value matching the current target at JIT time; if no entry matches, the default value is used. See the Performance Tuning and Execution Model pages. Typical parameters to vary per target:
  - num_ctas - Number of Cooperative Thread Arrays (thread blocks) launched per kernel invocation. Tune to the number of SMs on the target GPU.
  - occupancy - Hint for the number of concurrent CTAs the compiler should target per SM. Higher occupancy hides memory latency but increases register/shared-memory pressure. See the Execution Model documentation.
- ct.autotune - Search a list of candidate values at runtime and pick the fastest configuration. Results are reported via cuda.tile.tune.TuningResult / Measurement.

import cuda.tile as ct
@ct.kernel(
# num_ctas: how many thread blocks to launch.
# Use ByTarget to pick an arch-specific value at JIT time.
num_ctas=ct.ByTarget({
"sm_103": 8, # B300 - more SMs, launch more CTAs
"sm_121": 4, # DGX Spark - fewer SMs (48), use fewer CTAs
"default": 1, # Fallback for any other GPU architecture
}),
# occupancy: hint for concurrent CTAs per SM (latency hiding vs. register pressure).
occupancy=ct.ByTarget({
"sm_103": 16, # B300 - high occupancy, plenty of registers/SMEM
"sm_121": 12, # DGX Spark - moderate occupancy
"default": 8, # Conservative fallback
}),
opt_level=3 # Maximum compiler optimization level
)
def optimized_kernel(A, B, C):
    # Same kernel code works on all platforms;
    # ByTarget swaps in the arch-specific launch params automatically.
    ...
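When adapting the num_ctas values above to another machine, it helps to know how many SMs and which compute capability the GPU reports; PyTorch exposes both. A small sketch:
import torch

props = torch.cuda.get_device_properties(0)
# e.g. 48 SMs and sm_121 on DGX Spark (GB10), per the comments in the example above
print("SMs:", props.multi_processor_count)
print("arch: sm_%d%d" % (props.major, props.minor))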
For automatic tuning, use ct.autotune to search over candidate values and pick the fastest configuration at runtime:
@ct.kernel(
# autotune: benchmark each value and pick the fastest.
num_ctas=ct.autotune([1, 2, 4, 8, 16]),
occupancy=ct.autotune([8, 12, 16, 24]),
opt_level=3
)
def autotuned_kernel(A, B, C):
    ...
Repeat Steps 1-3 on B300 hardware. The same code runs without modification - cuTile JIT compiles for sm_103 automatically.
See the Platform Comparison tab for detailed scaling results.