cuTile Kernels

60 MIN

Run cuTile kernel benchmarks, FMHA implementation, and LLM inference on DGX Spark and B300

Benchmarking · Cross-Platform · DeepSeek · Docker · FMHA · Flash Attention · GPU Development · LLM Inference · Qwen2 · TileGym · cuTile
View on GitHub
Overview · Kernel Benchmarks · End-to-End Inference · FMHA Implementation · Platform Comparison · Troubleshooting

Step 1
Pull the CUDA NGC container with CUDA Toolkit (CTK) 13.x

docker pull nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04

Launch an interactive session with GPU access:

docker run --gpus all -it --rm \
  -v ~/TileGym:/workspace/TileGym \
  nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 \
  /bin/bash

NOTE

The -v flag mounts a local directory to persist the TileGym repository. The --rm flag automatically removes the container when you exit; omit it if you want to keep the container for later use.

Alternatively, if you are running outside a container, install Tile IR directly:

# Requires root privileges
sudo apt-get update
sudo apt-get install cuda-tile-ir-13-1 cuda-compiler-13-1

Step 2
Clone TileGym repository

git clone https://github.com/NVIDIA/TileGym
cd TileGym
pip install .
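
To confirm the package landed in your environment, a quick import check helps. Note that the Python module name `tilegym` is an assumption based on the repository name; check the repo's `setup.py`/`pyproject.toml` for the actual name:

```python
import importlib.util

def check_import(module_name: str) -> bool:
    """Return True if the named module is importable from the current environment."""
    return importlib.util.find_spec(module_name) is not None

# Module name "tilegym" is assumed from the repo name; verify against the package metadata.
if check_import("tilegym"):
    print("TileGym import OK")
else:
    print("TileGym not found - re-run 'pip install .' inside the repo")
```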

Step 3
Run benchmark suite

cd tests/benchmark/
bash run_all.sh

NOTE

The benchmarks run sequentially to ensure accurate timing results; completing all kernels may take 10-15 minutes.

Step 4
View results

The results report cuTile throughput (TFLOPS) for each kernel at each sequence length.

The expected output looks like:

==========================================
Running bench_fused_attention.py...
==========================================
fused-attention-batch4-head32-d128-fwd-causal=True-float16-TFLOPS:
     N_CTX     CuTile
0   1024.0  58.188262
1   2048.0  80.906892
2   4096.0  86.189532
3   8192.0  88.891086
4  16384.0  89.491869
✓ PASSED: bench_fused_attention.py
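
The TFLOPS column follows the usual fused-attention accounting: two matmuls (QK^T and PV) at 2 FLOPs per multiply-accumulate, with causal masking skipping roughly half the work. A sketch of that arithmetic, with conventions assumed from common flash-attention benchmarks rather than taken from TileGym's own scripts:

```python
def attention_tflops(batch, heads, n_ctx, head_dim, ms, causal=True):
    """Estimate achieved TFLOPS for a fused-attention forward pass.

    Counts two matmuls (Q @ K^T and P @ V) at 2 FLOPs per multiply-accumulate;
    causal masking skips roughly half the work.
    """
    flops = 4 * batch * heads * n_ctx**2 * head_dim  # 2 matmuls x 2 FLOPs/MAC
    if causal:
        flops *= 0.5
    return flops * 1e-12 / (ms * 1e-3)  # FLOPs over runtime in ms -> TFLOPS

# e.g. the N_CTX=1024 row above (~58 TFLOPS) corresponds to a kernel time
# of roughly 0.6 ms at batch=4, heads=32, head_dim=128:
print(attention_tflops(batch=4, heads=32, n_ctx=1024, head_dim=128, ms=0.59))
```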

Step 5
Run individual benchmarks

To run specific kernel benchmarks:

# Flash Multi-Head Attention
python bench_fused_attention.py

# Matrix Multiplication
python bench_matrix_multiplication.py

# RMSNorm
python bench_rmsnorm.py

# RoPE
python bench_rope.py

# SwiGLU
python bench_swiglu.py
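
If you want to script a subset instead of using run_all.sh, a small sequential runner along these lines works. The script names come from the list above; the helper itself is illustrative, not part of TileGym:

```python
import subprocess
import sys

def run_benchmarks(commands):
    """Run each command one at a time (matching the suite's sequential timing
    discipline) and report pass/fail per command."""
    results = {}
    for argv in commands:
        proc = subprocess.run(argv)
        results[" ".join(argv)] = (proc.returncode == 0)
    return results

if __name__ == "__main__":
    scripts = ["bench_fused_attention.py", "bench_rmsnorm.py"]  # pick any subset
    status = run_benchmarks([[sys.executable, s] for s in scripts])
    for name, ok in status.items():
        print(("PASSED" if ok else "FAILED"), name)
```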

Step 6
Clean up

Exit the container:

exit

Remove this workflow's containers (if you ran without --rm):

# Preferred: remove only containers from this workflow's image
docker rm $(docker ps -a --filter ancestor=nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 --format '{{.ID}}')

# Alternative: prune all stopped containers (will prompt for confirmation)
# docker container prune

Remove the image (optional):

docker rmi nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04

Step 7
Repeat on B300

Repeat Steps 1-6 on B300 hardware to observe scaling. See the Platform Comparison tab for expected scaling results.
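
Once you have TFLOPS numbers from both machines, a per-kernel speedup table is a one-liner to compute. The figures below are placeholders for illustration, not measured results:

```python
def speedups(spark_tflops, b300_tflops):
    """Compute B300-over-Spark speedup for each key present in both result dicts."""
    return {k: b300_tflops[k] / spark_tflops[k]
            for k in spark_tflops if k in b300_tflops}

# Placeholder numbers only - substitute your own measurements from Steps 4-5.
spark = {("fmha", 1024): 58.2, ("fmha", 16384): 89.5}
b300 = {("fmha", 1024): 180.0, ("fmha", 16384): 610.0}
for key, x in speedups(spark, b300).items():
    print(key, f"{x:.1f}x")
```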

Resources

  • TileGym Repository
  • cuTile Python Documentation
  • Tile IR Specification
  • DGX Spark Documentation
  • DGX Spark Forum
  • Qwen2 on HuggingFace
  • DeepSeek-V2-Lite on HuggingFace
  • NVIDIA Blog - Tuning Flash Attention in CUDA Tile
  • Flash Attention Paper