cuTile Kernels

Estimated time: 60 minutes

Run cuTile kernel benchmarks, FMHA implementation, and LLM inference on DGX Spark and B300


Overview | Kernel Benchmarks | End-to-End Inference | FMHA Implementation | Platform Comparison | Troubleshooting

Step 1
Pull CUDA NGC container with CTK 13.x

docker pull nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04

Launch an interactive session with GPU access:

docker run --gpus all -it --rm \
  -v ~/TileGym:/workspace/TileGym \
  nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 \
  /bin/bash

NOTE

The -v flag mounts the host directory ~/TileGym at /workspace/TileGym inside the container, so the TileGym repository persists after the container exits. The --rm flag removes the container automatically when you exit; omit it if you want to keep the container for later use.

Alternatively, if you are running outside a container, install the Tile IR packages directly:

# Requires root privileges - run with sudo or as root
sudo apt-get install cuda-tile-ir-13-1 cuda-compiler-13-1

Step 2
Clone TileGym repository

git clone https://github.com/NVIDIA/TileGym
cd TileGym
pip install .

Step 3
Run benchmark suite

cd tests/benchmark/
bash run_all.sh

NOTE

The benchmarks run sequentially so that they do not compete for the GPU and skew the timing results. The full suite may take 10-15 minutes to complete.

Step 4
View results

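Results report cuTile throughput for each kernel at each problem size. In the fused attention benchmark, each row is a sequence length (N_CTX) and the CuTile column is throughput in TFLOPS.

The output should look similar to: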

==========================================
Running bench_fused_attention.py...
==========================================
fused-attention-batch4-head32-d128-fwd-causal=True-float16-TFLOPS:
     N_CTX     CuTile
0   1024.0  58.188262
1   2048.0  80.906892
2   4096.0  86.189532
3   8192.0  88.891086
4  16384.0  89.491869
✓ PASSED: bench_fused_attention.py
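
If you want to work with these numbers programmatically, the table can be parsed with a few lines of Python. The following is a minimal sketch, assuming the output keeps the three-column row layout shown above; the function name parse_benchmark_table is hypothetical and not part of TileGym.

```python
import re


def parse_benchmark_table(text):
    """Return {N_CTX: TFLOPS} parsed from rows like '0   1024.0  58.188262'."""
    # Row layout assumed from the sample output: index, N_CTX, throughput.
    row = re.compile(r"^\s*\d+\s+(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)\s*$")
    results = {}
    for line in text.splitlines():
        match = row.match(line)
        if match:
            results[int(float(match.group(1)))] = float(match.group(2))
    return results


# Sample rows copied from the expected output above.
sample = """\
     N_CTX     CuTile
0   1024.0  58.188262
1   2048.0  80.906892
2   4096.0  86.189532
3   8192.0  88.891086
4  16384.0  89.491869
"""

results = parse_benchmark_table(sample)
peak = max(results.values())
for n_ctx, tflops in sorted(results.items()):
    # Show each sequence length relative to the best throughput in this run.
    print(f"N_CTX={n_ctx:>6}  {tflops:7.2f} TFLOPS  ({tflops / peak:.0%} of best in this run)")
```

Header, separator, and PASSED lines do not match the row pattern, so they are skipped.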

Step 5
Run individual benchmarks

To run specific kernel benchmarks:

# Flash Multi-Head Attention
python bench_fused_attention.py

# Matrix Multiplication
python bench_matrix_multiplication.py

# RMSNorm
python bench_rmsnorm.py

# RoPE
python bench_rope.py

# SwiGLU
python bench_swiglu.py
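
To run only a subset of these scripts one after another, a small driver can help. The following is a rough sketch, not the contents of run_all.sh; the function name run_sequentially is hypothetical, and it assumes each script signals failure through a nonzero exit code and is run from tests/benchmark/.

```python
import subprocess
import sys

# Script names taken from the list above; edit to choose a subset.
BENCHMARKS = [
    "bench_fused_attention.py",
    "bench_matrix_multiplication.py",
    "bench_rmsnorm.py",
    "bench_rope.py",
    "bench_swiglu.py",
]


def run_sequentially(scripts, python=sys.executable):
    """Run each script to completion before starting the next.

    Returns {script: True if it exited with code 0, else False}.
    """
    status = {}
    for script in scripts:
        # subprocess.run blocks until the child exits, so runs never overlap.
        completed = subprocess.run([python, script])
        status[script] = completed.returncode == 0
    return status


if __name__ == "__main__":
    for script, passed in run_sequentially(BENCHMARKS).items():
        print(("PASSED" if passed else "FAILED") + ": " + script)
```

Running one script at a time keeps the benchmarks from contending for the GPU, in line with the note in Step 3.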

Step 6
Clean up

Exit the container:

exit

Remove this workflow's containers (if you ran without --rm):

# Preferred: remove only containers from this workflow's image
# (xargs -r skips docker rm when no matching containers exist)
docker ps -a --filter ancestor=nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 --format '{{.ID}}' | xargs -r docker rm

# Alternative: prune all stopped containers (will prompt for confirmation)
# docker container prune

Remove the image (optional):

docker rmi nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04

Step 7
Repeat on B300

Repeat Steps 1-6 on B300 hardware to observe scaling. See the Platform Comparison tab for expected scaling results.
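
Once you have results from both runs, the per-size ratio is a simple way to summarize scaling. The following is a minimal sketch; the function name speedup_by_size is hypothetical, and the second result set is deliberately left empty because this guide's sample output covers only one run.

```python
def speedup_by_size(baseline, other):
    """Return {size: other / baseline} for sizes present in both result sets."""
    return {
        size: other[size] / baseline[size]
        for size in sorted(baseline.keys() & other.keys())
        if baseline[size] > 0
    }


# TFLOPS from the sample fused-attention output in Step 4.
first_run = {
    1024: 58.188262,
    2048: 80.906892,
    4096: 86.189532,
    8192: 88.891086,
    16384: 89.491869,
}

# Fill in with the values you measure on the second platform.
second_run = {}

# Prints nothing until second_run is populated.
for size, ratio in speedup_by_size(first_run, second_run).items():
    print(f"N_CTX={size}: {ratio:.2f}x")
```

Comparing only sizes that appear in both runs avoids misleading ratios when one platform skips a configuration.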

