Run cuTile kernel benchmarks, FMHA implementation, and LLM inference on DGX Spark and B300
If you haven't already, pull the CUDA container and clone TileGym (see Kernel Benchmarks tab for details).
First, clone TileGym on the host:
mkdir -p ~/TileGym
git clone https://github.com/NVIDIA/TileGym ~/TileGym
Then launch the container with the repository mounted:
docker run --gpus all -it --rm \
-v ~/TileGym:/workspace/TileGym \
-v ~/.cache/huggingface:/root/.cache/huggingface \
nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 \
/bin/bash
NOTE
The -v ~/.cache/huggingface:/root/.cache/huggingface option mounts your HuggingFace cache into the container so models are not re-downloaded on every run.
Install TileGym inside the container:
cd /workspace/TileGym
pip install .
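Optionally, verify the install before moving on. This is a minimal sanity check, assuming pip install . pulled in PyTorch plus the cuda.tile and tilegym packages used later in this guide:
import torch
import cuda.tile as ct  # cuTile Python API (assumed to be installed as a TileGym dependency)
import tilegym

# Confirm the GPU is visible and TileGym resolves to the mounted checkout.
print("GPU:", torch.cuda.get_device_name(0))
print("tilegym imported from:", tilegym.__file__)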
Set your HuggingFace token for accessing gated models:
export HF_TOKEN=<your_huggingface_token>
WARNING
You need a HuggingFace account and an access token. Create one at https://huggingface.co/settings/tokens.
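Alternatively, you can authenticate from Python with huggingface_hub's login() (installed alongside transformers), which caches the token for later runs. A minimal sketch that reuses the exported HF_TOKEN:
import os
from huggingface_hub import login

# Reads the token from the environment; avoid hard-coding or committing it.
login(token=os.environ["HF_TOKEN"])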
Navigate to the transformers benchmark directory:
cd modeling/transformers
Option A: Run Qwen2-7B benchmark
./bench_qwen.sh
Configuration: Model Qwen/Qwen2-7B, Batch size 16, Output length 50 tokens.
Option B: Run DeepSeek-V2-Lite benchmark
./bench_deepseek.sh
Configuration: Model deepseek-ai/DeepSeek-V2-Lite-Chat, Batch size 1, Output length 100 tokens.
Both scripts run two configurations: a naive bfloat16 baseline and the cuTile-accelerated variant.
Sample DGX Spark (GB10) Results for Qwen2-7B:
========================================
Benchmark Results
========================================
Qwen2-7B_naive_bfloat16 | 15.66 tokens/s | 51.10s | 51151.0ms CUDA
Qwen2-7B_cutile_attn | 18.52 tokens/s | 43.20s | 43079.7ms CUDA
========================================
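For reference, a tokens/s figure like the ones above is simply the number of generated tokens divided by wall-clock generation time. The following is a minimal, self-contained sketch of that measurement, not the repo's bench_qwen.sh; the model name, batch size 16, and 50 output tokens mirror the configuration listed above:
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-7B"
tok = AutoTokenizer.from_pretrained(model_id)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

# Batch size 16 and 50 new tokens, matching the Qwen2 benchmark configuration above.
inputs = tok(["Hello"] * 16, return_tensors="pt", padding=True).to("cuda")
torch.cuda.synchronize()
t0 = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=50, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - t0

# Count only the newly generated tokens across the whole batch.
new_tokens = (out.shape[1] - inputs["input_ids"].shape[1]) * out.shape[0]
print(f"{new_tokens / elapsed:.2f} tokens/s over {elapsed:.2f}s")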
cuTile Kernel Breakdown (DGX Spark - Qwen2):
| Kernel | CUDA Time (ms) | Calls |
|---|---|---|
| fmha_kernel | 4185.9 | 28 |
| swiglu_forward_kernel | 2459.8 | 1400 |
| attention_decode_kernel_grouped | 2271.8 | 1372 |
| rms_norm_kernel_static_persistent | 634.7 | 57 |
| rope_kernel | 355.6 | 1400 |
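TileGym's benchmark scripts emit this breakdown for you; if you want a comparable per-kernel view of your own runs, torch.profiler can collect one. A minimal sketch, assuming model and inputs are set up as in the timing snippet above:
import torch
from torch.profiler import profile, ProfilerActivity

# Profile a short generation and rank kernels by accumulated CUDA time.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model.generate(**inputs, max_new_tokens=10, do_sample=False)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))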
TileGym replaces PyTorch model operations with cuTile kernels. The snippet below is taken from TileGym's src/tilegym/transformers/monkey_patch.py and invoked from modeling/transformers/infer.py:
from tilegym.transformers import apply_tilegym_kernel_to_qwen2
apply_tilegym_kernel_to_qwen2(
rope=True, # Replace RoPE with cuTile kernel
rms_norm=True, # Replace RMSNorm with cuTile kernel
swiglu=True, # Replace SwiGLU with cuTile kernel
attn=True, # Replace attention with cuTile FMHA
use_cutile=True # Use cuTile backend (vs Triton)
)
Patched Kernels for Qwen2:
| Kernel | PyTorch Operation | cuTile Replacement |
|---|---|---|
| rms_norm_kernel_static_persistent | nn.RMSNorm | Persistent RMSNorm |
| rope_kernel | Rotary position embedding | Fused RoPE |
| fmha_kernel | F.scaled_dot_product_attention | Flash Attention |
| swiglu_forward_kernel | SiLU + Mul | Fused SwiGLU |
| attention_decode_kernel_grouped | Decode attention | Grouped decode |
Patched Kernels for DeepSeek-V2 (see src/tilegym/transformers/monkey_patch.py):
from tilegym.transformers import apply_tilegym_kernel_to_deepseek_v2
apply_tilegym_kernel_to_deepseek_v2(
rope=True, # Replace RoPE with cuTile kernel
rms_norm=True, # Replace RMSNorm with cuTile kernel
swiglu=True, # Replace SiLU+Mul with cuTile kernel
attn=True, # Replace MLA attention with cuTile
moe=True, # Replace MoE routing with cuTile
use_cutile=True
)
| Kernel | PyTorch Operation | cuTile Replacement |
|---|---|---|
| prefill_mla | MLA prefill attention | Multi-head Latent Attention |
| _mla_decoding_split_kv | MLA decode attention | Split-KV decoding |
| fused_moe_kernel | MoE expert routing | Fused MoE |
| group_gemm_kernel | Expert FFN | Grouped GEMM |
cuTile exposes two complementary performance-tuning mechanisms:
- ct.ByTarget - Select different kernel launch parameters per GPU architecture (sm_<major><minor>). The compiler picks the value matching the current target at JIT time; if no entry matches, the default value is used. See the Performance Tuning and Execution Model pages. Typical parameters to vary per target:
  - num_ctas - Number of Cooperative Thread Arrays (thread blocks) launched per kernel invocation. Tune to the number of SMs on the target GPU.
  - occupancy - Hint for the number of concurrent CTAs the compiler should target per SM. Higher occupancy hides memory latency but increases register/shared-memory pressure. See the Execution Model documentation.
- ct.autotune - Search a list of candidate values at runtime and pick the fastest configuration. Results are reported via cuda.tile.tune.TuningResult / Measurement.

import cuda.tile as ct
@ct.kernel(
# num_ctas: how many thread blocks to launch.
# Use ByTarget to pick an arch-specific value at JIT time.
num_ctas=ct.ByTarget({
"sm_103": 8, # B300 - more SMs, launch more CTAs
"sm_121": 4, # DGX Spark - fewer SMs (48), use fewer CTAs
"default": 1, # Fallback for any other GPU architecture
}),
# occupancy: hint for concurrent CTAs per SM (latency hiding vs. register pressure).
occupancy=ct.ByTarget({
"sm_103": 16, # B300 - high occupancy, plenty of registers/SMEM
"sm_121": 12, # DGX Spark - moderate occupancy
"default": 8, # Conservative fallback
}),
opt_level=3 # Maximum compiler optimization level
)
def optimized_kernel(A, B, C):
    # Same kernel code works on all platforms;
    # ByTarget swaps in the arch-specific launch params automatically.
    ...
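When adapting the num_ctas values above to another machine, it helps to know how many SMs and which compute capability the GPU reports; PyTorch exposes both. A small sketch:
import torch

props = torch.cuda.get_device_properties(0)
# e.g. 48 SMs and sm_121 on DGX Spark (GB10), per the comments in the example above
print("SMs:", props.multi_processor_count)
print("arch: sm_%d%d" % (props.major, props.minor))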
For automatic tuning, use ct.autotune to search over candidate values and pick the fastest configuration at runtime:
@ct.kernel(
# autotune: benchmark each value and pick the fastest.
num_ctas=ct.autotune([1, 2, 4, 8, 16]),
occupancy=ct.autotune([8, 12, 16, 24]),
opt_level=3
)
def autotuned_kernel(A, B, C):
    ...
Repeat Steps 1-3 on B300 hardware. The same code runs without modification - cuTile JIT compiles for sm_103 automatically.
See the Platform Comparison tab for detailed scaling results.