---
title: "cuTile Kernels"
publisher: "nvidia"
type: "playbook"
updated: "2026-04-27T17:26:25.603Z"
description: "Run cuTile kernel benchmarks, FMHA implementation, and LLM inference on DGX Spark and B300"
canonical: "https://build.nvidia.com/spark/cutile-kernels.md"
---

# Basic idea

[TileGym](https://github.com/NVIDIA/TileGym) is NVIDIA's benchmark suite and integration framework for cuTile kernels: high-performance GPU kernels written in the cuTile Python DSL. cuTile compiles to Tile IR, so developers can write efficient kernels without low-level CUDA programming.

This playbook covers three workflows:
1. **[Kernel Benchmarks](kernel-benchmarks)** - Run standalone cuTile kernel benchmarks (FMHA, MatMul, RMSNorm, etc.)
2. **[End-to-End Inference](e2e-inference)** - Run LLM inference with cuTile-optimized kernels via monkey-patching
3. **[FMHA Implementation](fmha)** - Step-by-step tutorial building a Flash Multi-Head Attention kernel from pseudocode to optimized cuTile, with companion scripts to run and benchmark

The same cuTile code runs on both DGX Spark (sm_121) and B300 (sm_103). cuTile JIT-compiles each kernel for the GPU architecture it finds at run time, so no per-platform source changes are needed.
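To see which architecture the JIT will target on a given machine, you can translate the GPU's compute capability into the `sm_XYZ` name used above (12.1 becomes `sm_121`, 10.3 becomes `sm_103`). The following is a minimal sketch, not part of TileGym: the helper names `cap_to_sm` and `detect_sm` are hypothetical, and it assumes a driver recent enough for `nvidia-smi` to support the `compute_cap` query field.

```bash
# Convert a CUDA compute capability such as "12.1" into the matching
# architecture name ("sm_121") by dropping the dot.
cap_to_sm() {
    printf 'sm_%s\n' "$(printf '%s' "$1" | tr -d '.')"
}

# Report the architecture of the first visible GPU.
# Prints "unknown" and returns 1 if nvidia-smi is missing or reports nothing.
detect_sm() {
    cap="$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n 1)"
    if [ -z "$cap" ]; then
        echo "unknown"
        return 1
    fi
    cap_to_sm "$cap"
}
```

Calling `detect_sm` on the target machine prints the architecture name for the first GPU, which you can compare against the values listed above.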

# What you'll accomplish

- Run the TileGym benchmark suite on DGX Spark
- Run Qwen2-7B or DeepSeek-V2-Lite inference with cuTile-optimized kernels
- Observe performance scaling between DGX Spark and B300
- Build an FMHA kernel step-by-step from pseudocode to optimized cuTile implementation

# What to know before starting

- Basic familiarity with Docker and command-line tools
- Understanding of GPU compute concepts (TFLOPS, memory bandwidth)
- No CUDA programming experience required
- HuggingFace account with access token (for LLM inference)

# Prerequisites

**Hardware Requirements:**
- DGX Spark with Ubuntu 24.04 or B300 cloud instance
- Minimum 16 GB GPU memory for LLM inference
- At least 50 GB available storage space for model downloads

**Software Requirements:**
- Docker installed and configured: `docker ps`
- CUDA Toolkit 13.x with Tile IR support
- HuggingFace token for model access (LLM inference only)
- Network access for pulling containers and downloading models

Verify Docker is available:
```bash
docker ps
```

If you get a permission error, add your user to the `docker` group and start a shell that picks up the new group membership:
```bash
sudo usermod -aG docker $USER
newgrp docker
```
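The prerequisite checks above can be collected into a few small shell functions. This is a rough sketch under stated assumptions, not a script shipped with the playbook: the function names are hypothetical, the 50 GB threshold comes from the storage requirement listed earlier, and `HF_TOKEN` is the environment variable read by the Hugging Face Hub client (whether the TileGym scripts read it the same way is an assumption).

```bash
# Free space, in whole GiB, on the filesystem that holds the given path.
free_gb() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Succeed if at least min_gb GiB are free at path; warn and fail otherwise.
check_free_space() {
    path="$1"
    min_gb="$2"
    avail="$(free_gb "$path")"
    if [ "$avail" -ge "$min_gb" ]; then
        echo "OK: ${avail} GiB free at ${path}"
    else
        echo "WARN: only ${avail} GiB free at ${path}, ${min_gb} GiB suggested"
        return 1
    fi
}

# Succeed if the Docker daemon is reachable by the current user.
check_docker() {
    if docker ps >/dev/null 2>&1; then
        echo "OK: Docker daemon reachable"
    else
        echo "WARN: 'docker ps' failed; check installation and group membership"
        return 1
    fi
}

# Succeed if a Hugging Face token is exported (needed for LLM inference only).
check_hf_token() {
    if [ -n "${HF_TOKEN:-}" ]; then
        echo "OK: HF_TOKEN is set"
    else
        echo "WARN: HF_TOKEN is not set"
        return 1
    fi
}
```

A typical run would be `check_docker; check_free_space "$HOME" 50; check_hf_token`. Each function prints one status line and returns non-zero on failure, so the calls can also be chained with `&&` when you want the first failure to stop the sequence.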

# Kernel support matrix

| Kernel | Category | Data Types | Description |
|--------|----------|------------|-------------|
| **FMHA** | Attention | float16, float8 | Flash Multi-Head Attention |
| **MLA** | Attention | bfloat16, float8 | Multi-head Latent Attention |
| **MLA Decoding** | Attention | float16, float8 | MLA for decode phase |
| **MatMul** | Matrix Ops | float16, float8 | Matrix multiplication |
| **BMM** | Matrix Ops | float16 | Batched matrix multiplication |
| **Group GEMM** | Matrix Ops | float16, float8 | Grouped GEMM for MoE |
| **RMSNorm** | Normalization | float16, bfloat16 | Root mean square normalization |
| **RoPE** | Positional | float16 | Rotary position embedding |
| **SiLU** | Activation | float16, float32 | SiLU activation with multiply |
| **SwiGLU** | Activation | float16, float32 | SwiGLU fused operation |
| **Softmax** | Activation | float16 | Softmax normalization |
| **Dropout** | Regularization | float16, float32 | Dropout forward |

# Model support for LLM inference

| Model | Supported Kernels | Batch Size | Output Tokens | Notes |
|-------|-------------------|------------|---------------|-------|
| **Qwen2-7B** | RoPE, RMSNorm, SwiGLU, FMHA | 16 | 50 | Standard transformer |
| **DeepSeek-V2-Lite** | RoPE, RMSNorm, SiLU, MLA, MoE | 1 | 100 | MLA attention, MoE layers |

# Ancillary files

All required assets can be found in the [TileGym repository](https://github.com/NVIDIA/TileGym).

- `tests/benchmark/run_all.sh` - Run all kernel benchmarks
- `modeling/transformers/bench_qwen.sh` - Qwen2-7B benchmark script
- `modeling/transformers/bench_deepseek.sh` - DeepSeek-V2-Lite benchmark script
- `modeling/transformers/infer.py` - Main inference script with TileGym integration
- [`assets/fmha_optimization_tutorial.py`](assets/fmha_optimization_tutorial.py) - FMHA step-by-step optimization tutorial
- [`assets/fmha_scaling_analysis.py`](assets/fmha_scaling_analysis.py) - FMHA scaling analysis across sequence lengths
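After cloning the repository, it can be useful to confirm that the scripts listed above are present before starting a long run. The helper below is a hypothetical sketch (the name `check_assets` is not part of TileGym); it only checks the four repository paths named in this section and assumes they are relative to the repository root.

```bash
# Verify that the TileGym scripts named in this playbook exist under the
# given repository root. Prints one line per path and returns non-zero
# if any of them is missing.
check_assets() {
    repo="$1"
    missing=0
    for rel in \
        tests/benchmark/run_all.sh \
        modeling/transformers/bench_qwen.sh \
        modeling/transformers/bench_deepseek.sh \
        modeling/transformers/infer.py
    do
        if [ -f "${repo}/${rel}" ]; then
            echo "found:   ${rel}"
        else
            echo "missing: ${rel}"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}
```

For example, after `git clone https://github.com/NVIDIA/TileGym`, running `check_assets TileGym` lists each script and fails if the checkout does not match the layout described here.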

# Time & risk

* **Estimated time:** 30-45 minutes (including model download for LLM inference)
* **Risk level:** Low
  * Large downloads may fail due to network issues
  * First run includes JIT compilation overhead
* **Rollback:** Remove Docker container to undo all changes
* **Last Updated:** 06/16/2026
  * Upgrade CUDA container to 13.2.0-devel-ubuntu22.04
  * Upgrade Nsight Systems to 2025.1.3
  * Add docker preparation steps for TileGym
  * Pin TileGym to v1.3.0

## More

- [Kernel Benchmarks](/spark/cutile-kernels/kernel-benchmarks.md)
- [End-to-End Inference](/spark/cutile-kernels/e2e-inference.md)
- [FMHA Implementation](/spark/cutile-kernels/fmha.md)
- [Platform Comparison](/spark/cutile-kernels/platform-comparison.md)
- [Troubleshooting](/spark/cutile-kernels/troubleshooting.md)