Run cuTile kernel benchmarks, FMHA implementation, and LLM inference on DGX Spark and B300
docker pull nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04
Launch an interactive session with GPU access:
docker run --gpus all -it --rm \
-v ~/TileGym:/workspace/TileGym \
nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 \
/bin/bash
NOTE
The -v flag mounts a local directory to persist the TileGym repository. The --rm flag automatically removes the container when you exit; omit it if you want to keep the container for later use.
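If you omit --rm, the stopped container can be resumed later instead of starting fresh. A sketch, assuming the container was created with a `--name` flag such as `tilegym-dev` (a hypothetical name, not used in the `docker run` command above):

```shell
# Assumes `docker run` was invoked with `--name tilegym-dev` and without `--rm`
docker start tilegym-dev                  # restart the stopped container
docker exec -it tilegym-dev /bin/bash     # open a new interactive shell inside it
```

Without a `--name`, substitute the container ID shown by `docker ps -a`.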
Alternatively, if you are running outside a container, install Tile IR directly on the host:
# Requires root privileges - run with sudo or as root
sudo apt-get update
sudo apt-get install cuda-tile-ir-13-1 cuda-compiler-13-1
git clone https://github.com/NVIDIA/TileGym
cd TileGym
pip install .
cd tests/benchmark/
bash run_all.sh
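To keep the timing results for later comparison (for example, against a run on the other platform), you can tee the output into a log file; the filename pattern below is just a suggestion:

```shell
# Save the full benchmark output while still showing it on the console
LOG="tilegym_$(hostname)_$(date +%Y%m%d).log"
bash run_all.sh 2>&1 | tee "$LOG"
```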
NOTE
The benchmarks run sequentially to ensure accurate timing results; completing all kernels may take 10-15 minutes.
Results show cuTile performance for each kernel and sequence length.
Expected output should look like:
==========================================
Running bench_fused_attention.py...
==========================================
fused-attention-batch4-head32-d128-fwd-causal=True-float16-TFLOPS:
N_CTX CuTile
0 1024.0 58.188262
1 2048.0 80.906892
2 4096.0 86.189532
3 8192.0 88.891086
4 16384.0 89.491869
✓ PASSED: bench_fused_attention.py
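Each per-kernel table uses the same layout (row index, N_CTX, then TFLOPS), so the numbers are easy to pull out of a saved log with standard tools. A minimal sketch, using a couple of rows copied from the output above as a stand-in for a real log:

```shell
# Sample rows copied from the output above, saved as an illustrative log
cat > results.log <<'EOF'
   N_CTX     CuTile
0  1024.0  58.188262
1  2048.0  80.906892
EOF

# Print "N_CTX TFLOPS" pairs from rows whose first field is a bare row index
awk '$1 ~ /^[0-9]+$/ { print $2, $3 }' results.log
```

The header line is skipped automatically because its first field is not numeric.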
To run specific kernel benchmarks:
# Flash Multi-Head Attention
python bench_fused_attention.py
# Matrix Multiplication
python bench_matrix_multiplication.py
# RMSNorm
python bench_rmsnorm.py
# RoPE
python bench_rope.py
# SwiGLU
python bench_swiglu.py
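The individual scripts above can also be driven from a small loop that records which benchmarks fail rather than stopping at the first error (a sketch; the script names are taken from the list above):

```shell
# Run each kernel benchmark in turn, collecting failures
failed=""
for bench in bench_fused_attention.py bench_matrix_multiplication.py \
             bench_rmsnorm.py bench_rope.py bench_swiglu.py; do
    echo "=== Running $bench ==="
    python "$bench" || failed="$failed $bench"
done
[ -z "$failed" ] && echo "All benchmarks passed" || echo "Failed:$failed"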
Exit the container:
exit
Remove this workflow's containers (if you ran without --rm):
# Preferred: remove only containers from this workflow's image
docker rm $(docker ps -a --filter ancestor=nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 --format '{{.ID}}')
# Alternative: prune all stopped containers (will prompt for confirmation)
# docker container prune
Remove the image (optional):
docker rmi nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04
Repeat Steps 1-6 on B300 hardware to observe scaling. See the Platform Comparison tab for expected scaling results.
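If you saved the run_all.sh output on each machine, the TFLOPS columns can be lined up for a quick side-by-side view. A sketch, assuming logs named spark.log and b300.log (placeholder names; the values below are made up purely to illustrate the table layout):

```shell
# Placeholder logs in the layout shown earlier -- substitute your real saved logs.
# The TFLOPS values here are invented for illustration only.
cat > spark.log <<'EOF'
   N_CTX     CuTile
0  1024.0  58.188262
1  2048.0  80.906892
EOF
cat > b300.log <<'EOF'
   N_CTX     CuTile
0  1024.0  120.500000
1  2048.0  190.300000
EOF

# Print "N_CTX  spark  b300", matching rows on the N_CTX column
awk '$1 ~ /^[0-9]+$/ {
    if (FILENAME == "spark.log") spark[$2] = $3
    else printf "%s  %s  %s\n", $2, spark[$2], $3
}' spark.log b300.log
```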