cuTile Kernels

60 MIN

Run cuTile kernel benchmarks, an FMHA implementation, and LLM inference on DGX Spark and B300

Tags: Benchmarking, Cross-Platform, DeepSeek, Docker, FMHA, Flash Attention, GPU Development, LLM Inference, Qwen2, TileGym, cuTile
View on GitHub
Contents

  • Overview
  • Kernel Benchmarks
  • End-to-End Inference
  • FMHA Implementation
  • Platform Comparison
  • Troubleshooting

Basic idea

TileGym is NVIDIA's benchmark suite and integration framework for cuTile kernels: high-performance GPU kernels written in the cuTile Python DSL. cuTile compiles to Tile IR, which lets developers write efficient kernels without low-level CUDA programming.

This playbook covers three workflows:

  1. Kernel Benchmarks - Run standalone cuTile kernel benchmarks (FMHA, MatMul, RMSNorm, etc.)
  2. End-to-End Inference - Run LLM inference with cuTile-optimized kernels via monkey-patching
  3. FMHA Implementation - Step-by-step tutorial building a Flash Multi-Head Attention kernel from pseudocode to optimized cuTile, with companion scripts to run and benchmark
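
As a quick orientation, the three workflows map onto scripts from the Ancillary files list later on this page. Paths are relative to the repository root and exact invocations may vary, so treat this as a sketch rather than a definitive recipe:

# 1. Standalone kernel benchmarks
bash tests/benchmark/run_all.sh

# 2. End-to-end LLM inference with cuTile kernels patched in
bash modeling/transformers/bench_qwen.sh       # Qwen2-7B
bash modeling/transformers/bench_deepseek.sh   # DeepSeek-V2-Lite

# 3. FMHA tutorial and scaling analysis
python assets/fmha_optimization_tutorial.py
python assets/fmha_scaling_analysis.py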

The same cuTile code runs on both DGX Spark (sm_121) and B300 (sm_103); cuTile JIT-compiles for the appropriate GPU architecture automatically.
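
To see which architecture cuTile will target on your machine, you can query the GPU's compute capability directly (the compute_cap query field is available on recent drivers):

# DGX Spark reports 12.1 (sm_121); B300 reports 10.3 (sm_103)
nvidia-smi --query-gpu=name,compute_cap --format=csv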

What you'll accomplish

  • Run the TileGym benchmark suite on DGX Spark
  • Run Qwen2-7B or DeepSeek-V2-Lite inference with cuTile-optimized kernels
  • Observe performance scaling between DGX Spark and B300
  • Build an FMHA kernel step-by-step from pseudocode to optimized cuTile implementation

What to know before starting

  • Basic familiarity with Docker and command-line tools
  • Understanding of GPU compute concepts (TFLOPS, memory bandwidth)
  • No CUDA programming experience required
  • HuggingFace account with access token (for LLM inference)

Prerequisites

Hardware Requirements:

  • DGX Spark with Ubuntu 24.04 or B300 cloud instance
  • Minimum 16GB GPU memory for LLM inference
  • At least 50GB available storage space for model downloads

Software Requirements:

  • Docker installed and configured (verify with docker ps)
  • CUDA Toolkit 13.x with Tile IR support
  • HuggingFace token for model access (LLM inference only)
  • Network access for pulling containers and downloading models
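
To confirm the toolkit version, a quick check (assuming nvcc is on your PATH):

nvcc --version   # expect a 13.x release for Tile IR support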

Verify Docker is available:

docker ps

If you get a permission error, add your user to the docker group and refresh the session:

sudo usermod -aG docker $USER
newgrp docker
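
As an optional smoke test that the container runtime can reach the GPU (this assumes the NVIDIA Container Toolkit is installed, which injects nvidia-smi into the container):

docker run --rm --gpus all ubuntu nvidia-smi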

Kernel support matrix

| Kernel       | Category       | Data Types         | Description                    |
|--------------|----------------|--------------------|--------------------------------|
| FMHA         | Attention      | float16, float8    | Flash Multi-Head Attention     |
| MLA          | Attention      | bfloat16, float8   | Multi-head Latent Attention    |
| MLA Decoding | Attention      | float16, float8    | MLA for decode phase           |
| MatMul       | Matrix Ops     | float16, float8    | Matrix multiplication          |
| BMM          | Matrix Ops     | float16            | Batched matrix multiplication  |
| Group GEMM   | Matrix Ops     | float16, float8    | Grouped GEMM for MoE           |
| RMSNorm      | Normalization  | float16, bfloat16  | Root mean square normalization |
| RoPE         | Positional     | float16            | Rotary position embedding      |
| SiLU         | Activation     | float16, float32   | SiLU activation with multiply  |
| SwiGLU       | Activation     | float16, float32   | SwiGLU fused operation         |
| Softmax      | Activation     | float16            | Softmax normalization          |
| Dropout      | Regularization | float16, float32   | Dropout forward                |

Model support for LLM inference

| Model            | Supported Kernels            | Batch Size | Output Tokens | Notes                    |
|------------------|------------------------------|------------|---------------|--------------------------|
| Qwen2-7B         | RoPE, RMSNorm, SwiGLU, FMHA  | 16         | 50            | Standard transformer     |
| DeepSeek-V2-Lite | RoPE, RMSNorm, SiLU, MLA, MoE| 1          | 100           | MLA attention, MoE layers|
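
For example, a Qwen2-7B run might look like the following. The assumption that the bench script picks up a Hugging Face token from the HF_TOKEN environment variable is mine; adjust to however you normally authenticate:

# Model download requires a valid HuggingFace access token
export HF_TOKEN=<your-huggingface-access-token>

# Runs inference at the settings in the table above (batch size 16, 50 output tokens)
bash modeling/transformers/bench_qwen.sh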

Ancillary files

All required assets can be found in the TileGym repository; a checkout sketch follows the list.

  • tests/benchmark/run_all.sh - Run all kernel benchmarks
  • modeling/transformers/bench_qwen.sh - Qwen2-7B benchmark script
  • modeling/transformers/bench_deepseek.sh - DeepSeek-V2-Lite benchmark script
  • modeling/transformers/infer.py - Main inference script with TileGym integration
  • assets/fmha_optimization_tutorial.py - FMHA step-by-step optimization tutorial
  • assets/fmha_scaling_analysis.py - FMHA scaling analysis across sequence lengths
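
A minimal checkout sketch (the repository URL is behind the View on GitHub link above; the checkout directory name is an assumption):

git clone <TileGym-repository-URL>               # use the View on GitHub link
cd tilegym                                       # assumed checkout directory name
ls tests/benchmark modeling/transformers assets  # confirm the assets listed above are present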

Time & risk

  • Estimated time: 30-45 minutes (including model download for LLM inference)
  • Risk level: Low
    • Large downloads may fail due to network issues
    • First run includes JIT compilation overhead
  • Rollback: Remove Docker container to undo all changes
  • Last Updated: February 2026
    • First Publication

Resources

  • TileGym Repository
  • cuTile Python Documentation
  • Tile IR Specification
  • DGX Spark Documentation
  • DGX Spark Forum
  • Qwen2 on HuggingFace
  • DeepSeek-V2-Lite on HuggingFace
  • NVIDIA Blog - Tuning Flash Attention in CUDA Tile
  • Flash Attention Paper