cuTile Kernels

60 MIN

Run cuTile kernel benchmarks, an FMHA implementation, and LLM inference on DGX Spark and B300

Tags: Benchmarking, Cross-Platform, DeepSeek, Docker, FMHA, Flash Attention, GPU Development, LLM Inference, Qwen2, TileGym, cuTile
View on GitHub
Contents

  • Overview
  • Kernel Benchmarks
  • End-to-End Inference
  • FMHA Implementation
  • Platform Comparison
  • Troubleshooting

Basic idea

TileGym is NVIDIA's benchmark suite and integration framework for cuTile kernels: high-performance GPU kernels written in the cuTile Python DSL. cuTile compiles to Tile IR, which lets developers write efficient kernels without low-level CUDA programming.

This playbook covers three workflows:

  1. Kernel Benchmarks - Run standalone cuTile kernel benchmarks (FMHA, MatMul, RMSNorm, etc.)
  2. End-to-End Inference - Run LLM inference with cuTile-optimized kernels via monkey-patching
  3. FMHA Implementation - Step-by-step tutorial building a Flash Multi-Head Attention kernel from pseudocode to optimized cuTile, with companion scripts to run and benchmark
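
As a quick orientation, the three workflows map onto scripts from the Ancillary files list later on this page. Paths are relative to the repository root and exact invocations may vary, so treat this as a sketch rather than a definitive recipe:

# 1. Standalone kernel benchmarks
bash tests/benchmark/run_all.sh

# 2. End-to-end LLM inference with cuTile kernels patched in
bash modeling/transformers/bench_qwen.sh       # Qwen2-7B
bash modeling/transformers/bench_deepseek.sh   # DeepSeek-V2-Lite

# 3. FMHA tutorial and scaling analysis
python assets/fmha_optimization_tutorial.py
python assets/fmha_scaling_analysis.py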

The same cuTile code runs on both DGX Spark (sm_121) and B300 (sm_103); cuTile JIT-compiles for the appropriate GPU architecture automatically.
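
To see which architecture cuTile will target on your machine, you can query the GPU's compute capability directly (the compute_cap query field is available on recent drivers):

# DGX Spark reports 12.1 (sm_121); B300 reports 10.3 (sm_103)
nvidia-smi --query-gpu=name,compute_cap --format=csv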

What you'll accomplish

  • Run the TileGym benchmark suite on DGX Spark
  • Run Qwen2-7B or DeepSeek-V2-Lite inference with cuTile-optimized kernels
  • Observe performance scaling between DGX Spark and B300
  • Build an FMHA kernel step-by-step from pseudocode to optimized cuTile implementation

What to know before starting

  • Basic familiarity with Docker and command-line tools
  • Understanding of GPU compute concepts (TFLOPS, memory bandwidth)
  • No CUDA programming experience required
  • HuggingFace account with access token (for LLM inference)

Prerequisites

Hardware Requirements:

  • DGX Spark with Ubuntu 24.04 or B300 cloud instance
  • Minimum 16GB GPU memory for LLM inference
  • At least 50GB available storage space for model downloads

Software Requirements:

  • Docker installed and configured (verify with docker ps)
  • CUDA Toolkit 13.x with Tile IR support
  • HuggingFace token for model access (LLM inference only)
  • Network access for pulling containers and downloading models
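
To confirm the toolkit version, a quick check (assuming nvcc is on your PATH):

nvcc --version   # expect a 13.x release for Tile IR support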

Verify Docker is available:

docker ps

If you get a permission error, add your user to the docker group and refresh the session:

sudo usermod -aG docker $USER
newgrp docker
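
As an optional smoke test that the container runtime can reach the GPU (this assumes the NVIDIA Container Toolkit is installed, which injects nvidia-smi into the container):

docker run --rm --gpus all ubuntu nvidia-smi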

Kernel support matrix

| Kernel       | Category       | Data Types         | Description                    |
|--------------|----------------|--------------------|--------------------------------|
| FMHA         | Attention      | float16, float8    | Flash Multi-Head Attention     |
| MLA          | Attention      | bfloat16, float8   | Multi-head Latent Attention    |
| MLA Decoding | Attention      | float16, float8    | MLA for decode phase           |
| MatMul       | Matrix Ops     | float16, float8    | Matrix multiplication          |
| BMM          | Matrix Ops     | float16            | Batched matrix multiplication  |
| Group GEMM   | Matrix Ops     | float16, float8    | Grouped GEMM for MoE           |
| RMSNorm      | Normalization  | float16, bfloat16  | Root mean square normalization |
| RoPE         | Positional     | float16            | Rotary position embedding      |
| SiLU         | Activation     | float16, float32   | SiLU activation with multiply  |
| SwiGLU       | Activation     | float16, float32   | SwiGLU fused operation         |
| Softmax      | Activation     | float16            | Softmax normalization          |
| Dropout      | Regularization | float16, float32   | Dropout forward                |

Model support for LLM inference

| Model            | Supported Kernels            | Batch Size | Output Tokens | Notes                    |
|------------------|------------------------------|------------|---------------|--------------------------|
| Qwen2-7B         | RoPE, RMSNorm, SwiGLU, FMHA  | 16         | 50            | Standard transformer     |
| DeepSeek-V2-Lite | RoPE, RMSNorm, SiLU, MLA, MoE| 1          | 100           | MLA attention, MoE layers|
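
For example, a Qwen2-7B run might look like the following. The assumption that the bench script picks up a Hugging Face token from the HF_TOKEN environment variable is mine; adjust to however you normally authenticate:

# Model download requires a valid HuggingFace access token
export HF_TOKEN=<your-huggingface-access-token>

# Runs inference at the settings in the table above (batch size 16, 50 output tokens)
bash modeling/transformers/bench_qwen.sh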

Ancillary files

All required assets can be found in the TileGym repository; a checkout sketch follows the list.

  • tests/benchmark/run_all.sh - Run all kernel benchmarks
  • modeling/transformers/bench_qwen.sh - Qwen2-7B benchmark script
  • modeling/transformers/bench_deepseek.sh - DeepSeek-V2-Lite benchmark script
  • modeling/transformers/infer.py - Main inference script with TileGym integration
  • assets/fmha_optimization_tutorial.py - FMHA step-by-step optimization tutorial
  • assets/fmha_scaling_analysis.py - FMHA scaling analysis across sequence lengths
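
A minimal checkout sketch (the repository URL is behind the View on GitHub link above; the checkout directory name is an assumption):

git clone <TileGym-repository-URL>               # use the View on GitHub link
cd tilegym                                       # assumed checkout directory name
ls tests/benchmark modeling/transformers assets  # confirm the assets listed above are present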

Time & risk

  • Estimated time: 30-45 minutes (including model download for LLM inference)
  • Risk level: Low
    • Large downloads may fail due to network issues
    • First run includes JIT compilation overhead
  • Rollback: Remove Docker container to undo all changes
  • Last Updated: February 2026
    • First Publication

Resources

  • TileGym Repository
  • cuTile Python Documentation
  • Tile IR Specification
  • DGX Spark Documentation
  • DGX Spark Forum
  • Qwen2 on HuggingFace
  • DeepSeek-V2-Lite on HuggingFace
  • NVIDIA Blog - Tuning Flash Attention in CUDA Tile
  • Flash Attention Paper