cuTile Kernels

60 MIN

Run cuTile kernel benchmarks, an FMHA implementation, and LLM inference on DGX Spark and B300

DGX Spark vs B300 Performance Comparison

This page summarizes how performance scales from DGX Spark (GB10) to B300, for both kernel benchmarks and end-to-end LLM inference. Use the ratios below as a reference for that scaling.

Key Observations:

Configuration	DGX Spark	B300	Platform Speedup
cuTile	18.52 tok/s	257.33 tok/s	13.9x

Configuration	DGX Spark	B300	Platform Speedup
cuTile	43,080 ms	2,954 ms	14.6x
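The Platform Speedup values in the two tables above can be reproduced with simple ratios. The sketch below assumes the first table reports throughput (higher is better) and the second reports elapsed time (lower is better), so the ratio is inverted between them. The function names are illustrative and not part of cuTile.

```python
def speedup_from_throughput(baseline_tok_s: float, target_tok_s: float) -> float:
    """Speedup when the metric is throughput (higher is better)."""
    return target_tok_s / baseline_tok_s


def speedup_from_time(baseline_ms: float, target_ms: float) -> float:
    """Speedup when the metric is elapsed time (lower is better)."""
    return baseline_ms / target_ms


# Values copied from the tables above (DGX Spark is the baseline).
throughput_speedup = speedup_from_throughput(18.52, 257.33)
time_speedup = speedup_from_time(43_080, 2_954)

print(f"{throughput_speedup:.1f}x")  # 13.9x
print(f"{time_speedup:.1f}x")        # 14.6x
```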

DGX Spark (GB10):

Kernel	CUDA Time (ms)	Calls
`fmha_kernel`	4,185.9	28
`swiglu_forward_kernel`	2,459.8	1,400
`attention_decode_kernel_grouped`	2,271.8	1,372
`rms_norm_kernel_static_persistent`	634.7	57
`rope_kernel`	355.6	1,400
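Because call counts differ by up to 50x between kernels, average time per call can be a useful complement to the totals. The sketch below assumes the CUDA Time column is the total summed over all calls, which is how profilers typically report it but is not stated on this page.

```python
# (total CUDA time in ms, number of calls) on DGX Spark, from the table above.
spark_kernels = {
    "fmha_kernel": (4185.9, 28),
    "swiglu_forward_kernel": (2459.8, 1400),
    "attention_decode_kernel_grouped": (2271.8, 1372),
    "rms_norm_kernel_static_persistent": (634.7, 57),
    "rope_kernel": (355.6, 1400),
}

# Assumption: total time divided by call count gives the mean time per launch.
avg_ms = {name: total / calls for name, (total, calls) in spark_kernels.items()}

for name, ms in sorted(avg_ms.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {ms:.3f} ms/call")
```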

B300:

Kernel	CUDA Time (ms)	Speedup vs Spark
`fmha_kernel`	337.9	12.4x
`swiglu_forward_kernel`	226.3	10.9x
`attention_decode_kernel_grouped`	111.0	20.5x
`rms_norm_kernel_static_persistent`	29.7	21.4x
`rope_kernel`	16.7	21.3x
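The Speedup vs Spark column can be recomputed from the two CUDA Time columns. This is a minimal sketch that divides each kernel's DGX Spark time by its B300 time; the dictionary names are illustrative.

```python
# Total CUDA time in ms per kernel, copied from the two tables above.
spark_ms = {
    "fmha_kernel": 4185.9,
    "swiglu_forward_kernel": 2459.8,
    "attention_decode_kernel_grouped": 2271.8,
    "rms_norm_kernel_static_persistent": 634.7,
    "rope_kernel": 355.6,
}
b300_ms = {
    "fmha_kernel": 337.9,
    "swiglu_forward_kernel": 226.3,
    "attention_decode_kernel_grouped": 111.0,
    "rms_norm_kernel_static_persistent": 29.7,
    "rope_kernel": 16.7,
}

# Time is lower-is-better, so speedup = baseline time / target time.
kernel_speedup = {name: spark_ms[name] / b300_ms[name] for name in spark_ms}

for name, ratio in sorted(kernel_speedup.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {ratio:.1f}x")
```

Sorting by ratio makes it easy to see which kernels scale above or below the end-to-end platform speedup.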

Same code, different architectures: cuTile JIT-compiles the same kernels for sm_121 (Spark) and sm_103 (B300).
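The sm_ names follow the usual CUDA convention of concatenating the major and minor compute capability numbers. The helper below is a hypothetical illustration of that naming only; it is not part of cuTile and does not show how cuTile selects its JIT target. The compute capabilities (12, 1) and (10, 3) are inferred from the sm_ names above. On a real machine the tuple could come from a call such as `torch.cuda.get_device_capability()`, assuming PyTorch is installed.

```python
def sm_target(major: int, minor: int) -> str:
    """Format a (major, minor) compute capability as an sm_XY architecture name."""
    return f"sm_{major}{minor}"


# Compute capabilities inferred from the architecture names on this page.
platforms = {
    "DGX Spark (GB10)": (12, 1),
    "B300": (10, 3),
}

targets = {name: sm_target(*cc) for name, cc in platforms.items()}

for name, target in targets.items():
    print(f"{name}: {target}")
```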