cuTile Kernels

60 MIN

Run cuTile kernel benchmarks, FMHA implementation, and LLM inference on DGX Spark and B300

Tags: Benchmarking, Cross-Platform, DeepSeek, Docker, FMHA, Flash Attention, GPU Development, LLM Inference, Qwen2, TileGym, cuTile
View on GitHub

DGX Spark vs B300 Performance Comparison

This page summarizes performance scaling between DGX Spark (GB10) and B300 for both kernel benchmarks and end-to-end LLM inference.

Kernel Benchmark Scaling

Use the ratios below as a reference for how kernel performance scales from DGX Spark (GB10) to B300.

| Kernel | Metric | B300 / GB10 |
| --- | --- | --- |
| FMHA (causal, 8192) | TFLOPS | 13.7x |
| FMHA (non-causal, 8192) | TFLOPS | 15.1x |
| MatMul (8192) | TFLOPS | 18.9x |
| BMM (batch 8, 4096) | TFLOPS | 19.4x |
| Group GEMM (4096) | TFLOPS | 23.9x |
| RMSNorm (4096) | GB/s | 33.1x |
| RoPE (16384) | GB/s | 22.8x |

Key Observations:

  • Compute-heavy kernels typically scale 14-24x from GB10 to B300
  • Memory-bound kernels can scale 20-33x, reflecting B300's HBM3e bandwidth advantage over GB10's LPDDR5x
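The memory-bound ceiling follows directly from the peak-bandwidth gap listed in the Platform Specifications table below. A back-of-envelope sketch (not a TileGym utility; the numbers are copied from that table):

```python
# Peak memory bandwidth from the Platform Specifications table.
gb10_bw_gbs = 273      # DGX Spark (GB10), LPDDR5x, GB/s
b300_bw_gbs = 8_000    # B300, HBM3e (8 TB/s), GB/s

bw_ratio = b300_bw_gbs / gb10_bw_gbs
print(f"Peak bandwidth ratio: {bw_ratio:.1f}x")  # ~29.3x
```

The observed 20-33x range for memory-bound kernels brackets this ~29x peak-bandwidth ratio, which is what you would expect when those kernels are limited by DRAM traffic on both platforms.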

Qwen2-7B Performance

End-to-End Throughput

| Configuration | DGX Spark | B300 | Platform Speedup |
| --- | --- | --- | --- |
| cuTile | 18.52 tok/s | 257.33 tok/s | 13.9x |

CUDA Kernel Time

| Configuration | DGX Spark | B300 | Platform Speedup |
| --- | --- | --- | --- |
| cuTile | 43,080 ms | 2,954 ms | 14.6x |
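As a sanity check, the speedup columns in the two tables above follow directly from the raw numbers:

```python
# Throughput and total CUDA kernel time, copied from the tables above.
spark_tok_s, b300_tok_s = 18.52, 257.33
spark_ms, b300_ms = 43_080, 2_954

print(f"Throughput speedup:  {b300_tok_s / spark_tok_s:.1f}x")  # 13.9x
print(f"Kernel-time speedup: {spark_ms / b300_ms:.1f}x")        # 14.6x
```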

cuTile Kernel Breakdown

DGX Spark (GB10):

| Kernel | CUDA Time (ms) | Calls |
| --- | --- | --- |
| fmha_kernel | 4,185.9 | 28 |
| swiglu_forward_kernel | 2,459.8 | 1,400 |
| attention_decode_kernel_grouped | 2,271.8 | 1,372 |
| rms_norm_kernel_static_persistent | 634.7 | 57 |
| rope_kernel | 355.6 | 1,400 |

B300:

| Kernel | CUDA Time (ms) | Speedup vs Spark |
| --- | --- | --- |
| fmha_kernel | 337.9 | 12.4x |
| swiglu_forward_kernel | 226.3 | 10.9x |
| attention_decode_kernel_grouped | 111.0 | 20.5x |
| rms_norm_kernel_static_persistent | 29.7 | 21.4x |
| rope_kernel | 16.7 | 21.3x |
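The per-kernel speedup column can be cross-checked against the GB10 breakdown. A minimal sketch, using the timings from the two kernel-breakdown tables above:

```python
# CUDA times in ms, copied from the GB10 and B300 breakdown tables.
gb10 = {
    "fmha_kernel": 4185.9,
    "swiglu_forward_kernel": 2459.8,
    "attention_decode_kernel_grouped": 2271.8,
    "rms_norm_kernel_static_persistent": 634.7,
    "rope_kernel": 355.6,
}
b300 = {
    "fmha_kernel": 337.9,
    "swiglu_forward_kernel": 226.3,
    "attention_decode_kernel_grouped": 111.0,
    "rms_norm_kernel_static_persistent": 29.7,
    "rope_kernel": 16.7,
}

for name, t_spark in gb10.items():
    print(f"{name}: {t_spark / b300[name]:.1f}x")
# fmha_kernel: 12.4x ... rope_kernel: 21.3x
```

Note that the attention-decode, RMSNorm, and RoPE kernels (memory-bound) land in the ~20x range, while the compute-heavy FMHA and SwiGLU kernels sit near 11-12x, matching the kernel-benchmark observations above.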

Same code, different architectures: cuTile JIT-compiles the same kernels for sm_121 (Spark) and sm_103 (B300).
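To see which SM target the JIT will pick on a given machine, you can format the device's compute capability the same way. The `sm_arch` helper below is a hypothetical illustration, not part of cuTile or TileGym:

```python
def sm_arch(capability):
    """Format a (major, minor) CUDA compute capability as an SM arch string."""
    major, minor = capability
    return f"sm_{major}{minor}"

print(sm_arch((12, 1)))  # sm_121 (DGX Spark / GB10)
print(sm_arch((10, 3)))  # sm_103 (B300)
```

On a live machine with a CUDA-enabled PyTorch build, `sm_arch(torch.cuda.get_device_capability())` would report the local target.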

Platform Specifications

| Specification | DGX Spark (GB10) | B300 |
| --- | --- | --- |
| Compute Capability | sm_121 (12.1) | sm_103 (10.3) |
| SMs | 48 | 132 |
| Memory | 128 GB LPDDR5x | 192 GB HBM3e |
| Memory Bandwidth | 273 GB/s | 8 TB/s |

Resources

  • TileGym Repository
  • cuTile Python Documentation
  • Tile IR Specification
  • DGX Spark Documentation
  • DGX Spark Forum
  • Qwen2 on HuggingFace
  • DeepSeek-V2-Lite on HuggingFace
  • NVIDIA Blog - Tuning Flash Attention in CUDA Tile
  • Flash Attention Paper