
cuTile Kernels

60 MIN

Run cuTile kernel benchmarks, FMHA implementation, and LLM inference on DGX Spark and B300

Tags: Benchmarking · Cross-Platform · DeepSeek · Docker · FMHA · Flash Attention · GPU Development · LLM Inference · Qwen2 · TileGym · cuTile
View on GitHub
Tabs: Overview · Kernel Benchmarks · End-to-End Inference · FMHA Implementation · Platform Comparison · Troubleshooting

Step 1
Set up environment

If you haven't already, pull the CUDA container and clone TileGym (see the Kernel Benchmarks tab for details).

First, clone TileGym on the host:

mkdir -p ~/TileGym
git clone https://github.com/NVIDIA/TileGym ~/TileGym

Then launch the container with the repository mounted:

docker run --gpus all -it --rm \
  -v ~/TileGym:/workspace/TileGym \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 \
  /bin/bash

NOTE

The -v ~/.cache/huggingface:/root/.cache/huggingface flag mounts your HuggingFace cache into the container so models don't have to be re-downloaded on each run.

Install TileGym inside the container:

cd /workspace/TileGym
pip install .

Set your HuggingFace token for accessing gated models:

export HF_TOKEN=<your_huggingface_token>

WARNING

Accessing gated models requires a HuggingFace account and access token. Create one at https://huggingface.co/settings/tokens.
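
Optionally, run a quick sanity check before benchmarking. This is a minimal sketch; it assumes PyTorch is present in the container (installed as a TileGym dependency, or via pip install torch):

import torch
import tilegym  # installed by `pip install .` above

print("CUDA available:", torch.cuda.is_available())
print("Device:", torch.cuda.get_device_name(0))  # e.g. a GB10 on DGX Spark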

Step 2
Run inference benchmark

Navigate to the transformers benchmark directory:

cd modeling/transformers

Option A: Run Qwen2-7B benchmark

./bench_qwen.sh

Configuration: Model Qwen/Qwen2-7B, Batch size 16, Output length 50 tokens.

Option B: Run DeepSeek-V2-Lite benchmark

./bench_deepseek.sh

Configuration: Model deepseek-ai/DeepSeek-V2-Lite-Chat, Batch size 1, Output length 100 tokens.

Both scripts run two configurations:

  1. PyTorch baseline - Standard HuggingFace inference
  2. TileGym cuTile - With cuTile kernel replacements
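
In outline, each script loads the model, times a fixed-length generate() call, then repeats with TileGym's patch applied. Below is a minimal sketch of that flow, assuming the standard transformers API and batch size 1 for simplicity (bench_qwen.sh uses batch 16; the actual logic lives in modeling/transformers/infer.py and its wrapper scripts):

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_and_bench(label, patched, model_id="Qwen/Qwen2-7B", new_tokens=50):
    if patched:
        # Apply the patch before building the model so the replaced
        # ops are picked up when the layers are constructed.
        from tilegym.transformers import apply_tilegym_kernel_to_qwen2
        apply_tilegym_kernel_to_qwen2(rope=True, rms_norm=True, swiglu=True,
                                      attn=True, use_cutile=True)
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16).to("cuda")
    inputs = tok("Hello", return_tensors="pt").to("cuda")

    torch.cuda.synchronize()
    t0 = time.time()
    model.generate(**inputs, max_new_tokens=new_tokens)
    torch.cuda.synchronize()
    print(f"{label}: {new_tokens / (time.time() - t0):.2f} tokens/s")

load_and_bench("naive_bfloat16", patched=False)  # PyTorch baseline
load_and_bench("cutile_attn", patched=True)      # cuTile kernel replacements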

Step 3
View results

Sample DGX Spark (GB10) Results for Qwen2-7B:

========================================
  Benchmark Results
========================================
Qwen2-7B_naive_bfloat16    |  15.66 tokens/s |  51.10s |  51151.0ms CUDA
Qwen2-7B_cutile_attn       |  18.52 tokens/s |  43.20s |  43079.7ms CUDA
========================================
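
The tokens/s column is aggregate throughput across the batch: with batch size 16 and 50 output tokens per sequence, each run generates 16 × 50 = 800 tokens, so 800 / 51.10 s ≈ 15.66 tokens/s for the baseline and 800 / 43.20 s ≈ 18.52 tokens/s with cuTile, roughly a 1.18x speedup.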

cuTile Kernel Breakdown (DGX Spark - Qwen2):

Kernel                            | CUDA Time (ms) | Calls
fmha_kernel                       |         4185.9 |    28
swiglu_forward_kernel             |         2459.8 |  1400
attention_decode_kernel_grouped   |         2271.8 |  1372
rms_norm_kernel_static_persistent |          634.7 |    57
rope_kernel                       |          355.6 |  1400
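
A per-kernel breakdown like the one above can be reproduced with torch.profiler (one way to collect such numbers; the benchmark scripts may gather theirs differently):

import torch
from torch.profiler import profile, ProfilerActivity
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-7B", torch_dtype=torch.bfloat16).to("cuda")
inputs = tok("Hello", return_tensors="pt").to("cuda")

# Record CUDA kernel activity during one generation pass.
with profile(activities=[ProfilerActivity.CUDA]) as prof:
    model.generate(**inputs, max_new_tokens=50)

# Aggregate time per kernel, heaviest first.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))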

Step 4
How TileGym monkey-patching works

TileGym replaces PyTorch model operations with cuTile kernels. The snippet below is taken from TileGym's src/tilegym/transformers/monkey_patch.py and invoked from modeling/transformers/infer.py:

from tilegym.transformers import apply_tilegym_kernel_to_qwen2

apply_tilegym_kernel_to_qwen2(
    rope=True,      # Replace RoPE with cuTile kernel
    rms_norm=True,  # Replace RMSNorm with cuTile kernel  
    swiglu=True,    # Replace SwiGLU with cuTile kernel
    attn=True,      # Replace attention with cuTile FMHA
    use_cutile=True # Use cuTile backend (vs Triton)
)

Patched Kernels for Qwen2:

Kernel                            | PyTorch Operation              | cuTile Replacement
rms_norm_kernel_static_persistent | nn.RMSNorm                     | Persistent RMSNorm
rope_kernel                       | Rotary position embedding      | Fused RoPE
fmha_kernel                       | F.scaled_dot_product_attention | Flash Attention
swiglu_forward_kernel             | SiLU + Mul                     | Fused SwiGLU
attention_decode_kernel_grouped   | Decode attention               | Grouped decode

Patched Kernels for DeepSeek-V2: (see src/tilegym/transformers/monkey_patch.py)

from tilegym.transformers import apply_tilegym_kernel_to_deepseek_v2

apply_tilegym_kernel_to_deepseek_v2(
    rope=True,      # Replace RoPE with cuTile kernel
    rms_norm=True,  # Replace RMSNorm with cuTile kernel  
    swiglu=True,    # Replace SiLU+Mul with cuTile kernel
    attn=True,      # Replace MLA attention with cuTile
    moe=True,       # Replace MoE routing with cuTile
    use_cutile=True
)

Kernel                 | PyTorch Operation     | cuTile Replacement
prefill_mla            | MLA prefill attention | Multi-head Latent Attention
_mla_decoding_split_kv | MLA decode attention  | Split-KV decoding
fused_moe_kernel       | MoE expert routing    | Fused MoE
group_gemm_kernel      | Expert FFN            | Grouped GEMM
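
Under the hood, monkey-patching is just reassigning a module's forward method (or a functional op) so that it calls a cuTile kernel instead. The sketch below illustrates the general pattern only; it is not TileGym's actual code, and cutile_rms_norm is a hypothetical kernel wrapper (the real implementation is in src/tilegym/transformers/monkey_patch.py):

import torch.nn as nn

def patch_rms_norm(cutile_rms_norm):
    # cutile_rms_norm(x, weight, eps) is a hypothetical cuTile-backed
    # function standing in for the real kernel wrapper.
    original = nn.RMSNorm.forward

    def forward(self, x):
        return cutile_rms_norm(x, self.weight, self.eps)

    nn.RMSNorm.forward = forward
    return original  # keep a handle so the patch can be reverted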

Step 5
Platform-specific tuning (Advanced)

cuTile exposes two complementary performance-tuning mechanisms, ct.ByTarget and ct.autotune, which are applied to kernel launch parameters such as num_ctas and occupancy:

  • ct.ByTarget - Select different kernel launch parameters per GPU architecture (sm_<major><minor>). The compiler picks the value matching the current target at JIT time; if no entry matches, the default value is used. See the Performance Tuning and Execution Model pages.
  • num_ctas - Number of Cooperative Thread Arrays (thread blocks) launched per kernel invocation. Tune to the number of SMs on the target GPU.
  • occupancy - Hint for the number of concurrent CTAs the compiler should target per SM. Higher occupancy hides memory latency but increases register/shared-memory pressure. See the Execution Model documentation.
  • ct.autotune - Search a list of candidate values at runtime and pick the fastest configuration. Results are reported via cuda.tile.tune.TuningResult / Measurement.

The example below uses ByTarget to select arch-specific values for num_ctas and occupancy:
import cuda.tile as ct

@ct.kernel(
    # num_ctas: how many thread blocks to launch.
    # Use ByTarget to pick an arch-specific value at JIT time.
    num_ctas=ct.ByTarget({
        "sm_103": 8,   # B300 - more SMs, launch more CTAs
        "sm_121": 4,   # DGX Spark - fewer SMs (48), use fewer CTAs
        "default": 1,  # Fallback for any other GPU architecture
    }),
    # occupancy: hint for concurrent CTAs per SM (latency hiding vs. register pressure).
    occupancy=ct.ByTarget({
        "sm_103": 16,  # B300 - high occupancy, plenty of registers/SMEM
        "sm_121": 12,  # DGX Spark - moderate occupancy
        "default": 8,  # Conservative fallback
    }),
    opt_level=3       # Maximum compiler optimization level
)
def optimized_kernel(A, B, C):
    # Same kernel code works on all platforms;
    # ByTarget swaps in the arch-specific launch params automatically.
    ...

For automatic tuning, use ct.autotune to search over candidate values and pick the fastest configuration at runtime:

@ct.kernel(
    # autotune: benchmark each value and pick the fastest.
    num_ctas=ct.autotune([1, 2, 4, 8, 16]),
    occupancy=ct.autotune([8, 12, 16, 24]),
    opt_level=3
)
def autotuned_kernel(A, B, C):
    ...
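
Because ct.autotune benchmarks each candidate configuration at runtime, expect extra overhead on the first invocations of an autotuned kernel while the search runs; the fastest configuration found is then used for subsequent launches, with the measurements surfaced through the TuningResult / Measurement objects noted above.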

Step 6
Repeat on B300

Repeat Steps 1-3 on B300 hardware. The same code runs without modification; cuTile JIT-compiles for sm_103 automatically.
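
To confirm which architecture cuTile will JIT-compile for on a given machine, you can query the device's compute capability (a quick check, assuming PyTorch is available):

import torch

major, minor = torch.cuda.get_device_capability()
print(f"JIT target: sm_{major}{minor}")  # sm_121 on DGX Spark (GB10), sm_103 on B300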

See the Platform Comparison tab for detailed scaling results.

Resources

  • TileGym Repository
  • cuTile Python Documentation
  • Tile IR Specification
  • DGX Spark Documentation
  • DGX Spark Forum
  • Qwen2 on HuggingFace
  • DeepSeek-V2-Lite on HuggingFace
  • NVIDIA Blog - Tuning Flash Attention in CUDA Tile
  • Flash Attention Paper