Run cuTile kernel benchmarks, FMHA implementation, and LLM inference on DGX Spark and B300
This page summarizes performance scaling between DGX Spark (GB10) and B300 for both kernel benchmarks and end-to-end LLM inference.
Use the ratios below as a reference for how kernel performance scales from DGX Spark (GB10) to B300.
| Kernel | Metric | B300 / GB10 |
|---|---|---|
| FMHA (causal, 8192) | TFLOPS | 13.7x |
| FMHA (non-causal, 8192) | TFLOPS | 15.1x |
| MatMul (8192) | TFLOPS | 18.9x |
| BMM (batch8, 4096) | TFLOPS | 19.4x |
| Group GEMM (4096) | TFLOPS | 23.9x |
| RMSNorm (4096) | GB/s | 33.1x |
| RoPE (16384) | GB/s | 22.8x |
Key Observations:
Inference throughput (higher is better):
| Configuration | DGX Spark | B300 | Platform Speedup |
|---|---|---|---|
| cuTile | 18.52 tok/s | 257.33 tok/s | 13.9x |
Inference time (lower is better):
| Configuration | DGX Spark | B300 | Platform Speedup |
|---|---|---|---|
| cuTile | 43,080 ms | 2,954 ms | 14.6x |
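The two platform speedups above are computed in opposite directions: tok/s is higher-is-better, while ms is lower-is-better. A minimal sketch of that arithmetic, using only the measurements from the tables; the helper name `speedup` is hypothetical and not part of cuTile:

```python
def speedup(spark: float, b300: float, higher_is_better: bool) -> float:
    """Return the B300-over-Spark speedup for one pair of measurements."""
    # For rates (tok/s) the faster platform has the larger number;
    # for durations (ms) it has the smaller one, so the ratio is flipped.
    return b300 / spark if higher_is_better else spark / b300


# Values copied from the two tables above.
throughput = speedup(18.52, 257.33, higher_is_better=True)   # tok/s
elapsed = speedup(43_080, 2_954, higher_is_better=False)     # ms

print(f"throughput speedup: {throughput:.1f}x")
print(f"time speedup:       {elapsed:.1f}x")
```

Rounded to one decimal place, both results should match the Platform Speedup column.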
DGX Spark (GB10):
| Kernel | CUDA Time (ms) | Calls |
|---|---|---|
| fmha_kernel | 4,185.9 | 28 |
| swiglu_forward_kernel | 2,459.8 | 1,400 |
| attention_decode_kernel_grouped | 2,271.8 | 1,372 |
| rms_norm_kernel_static_persistent | 634.7 | 57 |
| rope_kernel | 355.6 | 1,400 |
B300:
| Kernel | CUDA Time (ms) | Speedup vs Spark |
|---|---|---|
| fmha_kernel | 337.9 | 12.4x |
| swiglu_forward_kernel | 226.3 | 10.9x |
| attention_decode_kernel_grouped | 111.0 | 20.5x |
| rms_norm_kernel_static_persistent | 29.7 | 21.4x |
| rope_kernel | 16.7 | 21.3x |
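The per-kernel speedups can be recomputed from the two profiles. The sketch below also derives an aggregate ratio over the five listed kernels only; that aggregate is an illustration added here, not a figure from the original measurements, and it ignores any kernels not shown in the tables.

```python
# CUDA time in ms per kernel, copied from the two profile tables above.
spark_ms = {
    "fmha_kernel": 4185.9,
    "swiglu_forward_kernel": 2459.8,
    "attention_decode_kernel_grouped": 2271.8,
    "rms_norm_kernel_static_persistent": 634.7,
    "rope_kernel": 355.6,
}
b300_ms = {
    "fmha_kernel": 337.9,
    "swiglu_forward_kernel": 226.3,
    "attention_decode_kernel_grouped": 111.0,
    "rms_norm_kernel_static_persistent": 29.7,
    "rope_kernel": 16.7,
}

# Speedup of each kernel: time on Spark divided by time on B300.
per_kernel = {name: spark_ms[name] / b300_ms[name] for name in spark_ms}

# Ratio of summed times across the listed kernels (not the whole run).
spark_total = sum(spark_ms.values())
aggregate = spark_total / sum(b300_ms.values())

for name, ratio in per_kernel.items():
    share = spark_ms[name] / spark_total
    print(f"{name:<36} {ratio:5.1f}x  ({share:.0%} of listed Spark time)")
print(f"aggregate over listed kernels: {aggregate:.1f}x")
```

Because `fmha_kernel` and `swiglu_forward_kernel` account for most of the listed time on DGX Spark, their lower speedups pull the aggregate toward the low end of the per-kernel range.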
Same code, different architectures: cuTile JIT-compiles the same kernels for sm_121 (DGX Spark) and sm_103 (B300).
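The `sm_XY` names are the CUDA compute capability written without the dot. As a small illustration of that naming only (it does not show how cuTile itself selects a compile target), the hypothetical helper `arch_name` below formats a `(major, minor)` pair:

```python
def arch_name(major: int, minor: int) -> str:
    """Format a CUDA compute capability as an sm_XY architecture name."""
    return f"sm_{major}{minor}"


# With PyTorch installed, torch.cuda.get_device_capability() returns such a
# (major, minor) tuple for the current GPU. The values here are hard-coded
# from this page so the example runs without a GPU.
platforms = {"DGX Spark (GB10)": (12, 1), "B300": (10, 3)}

for platform, capability in platforms.items():
    print(f"{platform}: {arch_name(*capability)}")
```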
| Specification | DGX Spark (GB10) | B300 |
|---|---|---|
| Compute Capability | sm_121 (12.1) | sm_103 (10.3) |
| SMs | 48 | 132 |
| Memory | 128 GB LPDDR5x | 192 GB HBM3e |
| Memory Bandwidth | 273 GB/s | 8 TB/s |
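The specification ratios give some context for the kernels reported in GB/s. The sketch below compares the peak memory bandwidth ratio with the measured RMSNorm and RoPE ratios from the kernel table, assuming 1 TB/s = 1000 GB/s. Treat the peak ratio as a rough reference point only, since measured ratios depend on achieved rather than peak bandwidth and on cache behavior.

```python
# Hardware figures copied from the specification table above.
spark = {"sms": 48, "bandwidth_gbs": 273}
b300 = {"sms": 132, "bandwidth_gbs": 8_000}  # 8 TB/s, assuming 1 TB = 1000 GB

bandwidth_ratio = b300["bandwidth_gbs"] / spark["bandwidth_gbs"]
sm_ratio = b300["sms"] / spark["sms"]

print(f"peak bandwidth ratio: {bandwidth_ratio:.1f}x")
print(f"SM count ratio:       {sm_ratio:.2f}x")

# Measured B300 / GB10 ratios for the kernels reported in GB/s.
measured = {"RMSNorm (4096)": 33.1, "RoPE (16384)": 22.8}
for kernel, ratio in measured.items():
    relative = ratio / bandwidth_ratio
    print(f"{kernel}: measured {ratio}x, {relative:.2f} of the peak bandwidth ratio")
```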