Run cuTile kernel benchmarks, FMHA implementation, and LLM inference on DGX Spark and B300
docker pull nvcr.io/nvidia/cuda:13.2.0-devel-ubuntu22.04
Launch an interactive session with GPU access:
docker run --gpus all -it --rm \
-v ~/TileGym:/workspace/TileGym \
nvcr.io/nvidia/cuda:13.2.0-devel-ubuntu22.04 \
/bin/bash
NOTE
The -v flag mounts a local directory to persist the TileGym repository. The --rm flag automatically removes the container when you exit; omit it if you want to keep the container for later use.
Prepare the docker for installing TileGym.
apt-get update && apt-get install -y --no-install-recommends \
python3-pip python3-dev python-is-python3 \
git wget curl build-essential nsight-systems-2025.1.3
update-alternatives --install /usr/bin/nsys nsys /opt/nvidia/nsight-systems/2025.1.3/bin/nsys 100 && hash -r
python -m pip install --upgrade pip setuptools wheel
pip install --no-cache-dir --pre "torch==2.9.1" --index-url https://download.pytorch.org/whl/cu130
pip install --no-cache-dir --no-deps accelerate==1.13.0 && \
pip install --no-cache-dir sentencepiece protobuf
git clone https://github.com/NVIDIA/TileGym
cd TileGym
git checkout v1.3.0
pip install .
To run specific kernel benchmarks:
cd tests/benchmark/
# Flash Multi-Head Attention
python bench_fused_attention.py
# Matrix Multiplication
python bench_matrix_multiplication.py
# RMSNorm
python bench_rmsnorm.py
# RoPE
python bench_rope.py
# SwiGLU
python bench_swiglu.py
Results show cuTile performance for each kernel and sequence length.
Expected output should look like:
==========================================
Running bench_fused_attention.py...
==========================================
fused-attention-batch4-head32-d128-fwd-causal=True-float16-TFLOPS:
N_CTX CuTile
0 1024.0 58.188262
1 2048.0 80.906892
2 4096.0 86.189532
3 8192.0 88.891086
4 16384.0 89.491869
✓ PASSED: bench_fused_attention.py
cd tests/benchmark/
bash run_all.sh
NOTE
NOT RECOMMENDED: The benchmark runs sequentially to ensure accurate timing results. This may take 40-60 minutes to complete all kernels.
Exit the container:
exit
Remove this workflow's containers (if you ran without --rm):
# Preferred: remove only containers from this workflow's image
docker ps -a --filter ancestor=nvcr.io/nvidia/cuda:13.2.0-devel-ubuntu22.04 -q | xargs -r docker rm
# Alternative: prune all stopped containers (will prompt for confirmation)
# docker container prune
Remove the image (optional):
docker rmi nvcr.io/nvidia/cuda:13.2.0-devel-ubuntu22.04
Repeat Steps 1-6 on B300 hardware to observe scaling. See the Platform Comparison tab for expected scaling results.