cuTile Kernels
60 MIN
Run cuTile kernel benchmarks, an FMHA implementation, and LLM inference on DGX Spark and B300
DGX Spark vs B300 Performance Comparison
This page summarizes performance scaling between DGX Spark (GB10) and B300 for both kernel benchmarks and end-to-end LLM inference.
Kernel Benchmark Scaling
Use the ratios below as a reference for how kernel performance scales from DGX Spark (GB10) to B300.
| Kernel | Metric | B300 / GB10 |
|---|---|---|
| FMHA (causal, 8192) | TFLOPS | 13.7x |
| FMHA (non-causal, 8192) | TFLOPS | 15.1x |
| MatMul (8192) | TFLOPS | 18.9x |
| BMM (batch8, 4096) | TFLOPS | 19.4x |
| Group GEMM (4096) | TFLOPS | 23.9x |
| RMSNorm (4096) | GB/s | 33.1x |
| RoPE (16384) | GB/s | 22.8x |
Key Observations:
- Compute-heavy kernels (FMHA, MatMul, BMM, Group GEMM) scale roughly 14-24x from GB10 to B300
- Memory-bound kernels (RoPE, RMSNorm) scale roughly 23-33x, reflecting the bandwidth advantage of B300's HBM3e over GB10's LPDDR5x
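The table reports TFLOPS for compute-heavy kernels and GB/s for memory-bound ones. As a minimal sketch of how such metrics are commonly derived from a measured kernel time, the helpers below assume the usual 2·n³ operation count for a square matmul and a caller-supplied byte count. The function names and the timings in the example are hypothetical and are not taken from the benchmark scripts.

```python
def matmul_tflops(n: int, seconds: float) -> float:
    """TFLOPS for an n x n by n x n matmul, counting 2*n^3 floating-point ops."""
    return 2 * n**3 / seconds / 1e12


def bandwidth_gbs(bytes_moved: float, seconds: float) -> float:
    """Achieved bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return bytes_moved / seconds / 1e9


def scaling_ratio(b300_value: float, gb10_value: float) -> float:
    """B300 / GB10 ratio for a higher-is-better metric such as TFLOPS or GB/s."""
    return b300_value / gb10_value


# Hypothetical timings, for illustration only (not measured values).
example_tflops = matmul_tflops(8192, 0.010)   # 8192^3 matmul in 10 ms
example_gbs = bandwidth_gbs(2e9, 0.010)       # 2 GB moved in 10 ms
print(f"{example_tflops:.2f} TFLOPS, {example_gbs:.1f} GB/s")
```

Because both metrics are higher-is-better, the "B300 / GB10" column is a plain ratio of the B300 value to the GB10 value.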
Qwen2-7B Performance
End-to-End Throughput
| Configuration | DGX Spark | B300 | Platform Speedup |
|---|---|---|---|
| cuTile | 18.52 tok/s | 257.33 tok/s | 13.9x |
CUDA Kernel Time
| Configuration | DGX Spark | B300 | Platform Speedup |
|---|---|---|---|
| cuTile | 43,080 ms | 2,954 ms | 14.6x |
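The two "Platform Speedup" columns are computed in opposite directions: throughput is higher-is-better, while CUDA kernel time is lower-is-better. A short sketch that reproduces both figures from the table values (the function names are illustrative):

```python
def throughput_speedup(spark_tok_s: float, b300_tok_s: float) -> float:
    """Speedup for a higher-is-better metric: B300 / Spark."""
    return b300_tok_s / spark_tok_s


def time_speedup(spark_ms: float, b300_ms: float) -> float:
    """Speedup for a lower-is-better metric: Spark / B300."""
    return spark_ms / b300_ms


# Values copied from the End-to-End Throughput and CUDA Kernel Time tables.
e2e = throughput_speedup(18.52, 257.33)
cuda = time_speedup(43_080, 2_954)
print(f"end-to-end: {e2e:.1f}x, CUDA kernel time: {cuda:.1f}x")
```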
cuTile Kernel Breakdown
DGX Spark (GB10):
| Kernel | CUDA Time (ms) | Calls |
|---|---|---|
| fmha_kernel | 4,185.9 | 28 |
| swiglu_forward_kernel | 2,459.8 | 1,400 |
| attention_decode_kernel_grouped | 2,271.8 | 1,372 |
| rms_norm_kernel_static_persistent | 634.7 | 57 |
| rope_kernel | 355.6 | 1,400 |
B300:
| Kernel | CUDA Time (ms) | Speedup vs Spark |
|---|---|---|
| fmha_kernel | 337.9 | 12.4x |
| swiglu_forward_kernel | 226.3 | 10.9x |
| attention_decode_kernel_grouped | 111.0 | 20.5x |
| rms_norm_kernel_static_persistent | 29.7 | 21.4x |
| rope_kernel | 16.7 | 21.3x |
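The per-kernel speedups can be recomputed directly from the two breakdown tables. The sketch below copies the CUDA times from those tables, derives each speedup as Spark time divided by B300 time, and sums the listed kernels on each platform. The dictionary layout is just one convenient way to hold the numbers.

```python
# CUDA time in ms, copied from the DGX Spark and B300 breakdown tables.
spark_ms = {
    "fmha_kernel": 4185.9,
    "swiglu_forward_kernel": 2459.8,
    "attention_decode_kernel_grouped": 2271.8,
    "rms_norm_kernel_static_persistent": 634.7,
    "rope_kernel": 355.6,
}
b300_ms = {
    "fmha_kernel": 337.9,
    "swiglu_forward_kernel": 226.3,
    "attention_decode_kernel_grouped": 111.0,
    "rms_norm_kernel_static_persistent": 29.7,
    "rope_kernel": 16.7,
}

# Lower time is better, so speedup = Spark time / B300 time.
speedups = {name: spark_ms[name] / b300_ms[name] for name in spark_ms}
for name, value in speedups.items():
    print(f"{name:36s} {value:5.1f}x")

# Totals over the five listed kernels only.
spark_total = sum(spark_ms.values())
b300_total = sum(b300_ms.values())
print(f"listed kernels: {spark_total:.1f} ms vs {b300_total:.1f} ms")
```

The five listed kernels sum to about 9,908 ms on DGX Spark and about 722 ms on B300, which is well below the total CUDA kernel times reported earlier (43,080 ms and 2,954 ms), so the breakdown covers only part of the total CUDA time.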
Same code, different architectures: cuTile JIT-compiles the same kernels for sm_121 (DGX Spark) and sm_103 (B300).
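The architecture names follow from the device's compute capability, written as `sm_` plus the major and minor version digits. The helper below is a hypothetical illustration of that naming only; how cuTile selects its compilation target internally is not described here.

```python
def sm_arch(major: int, minor: int) -> str:
    """Format a compute capability (major, minor) as an sm_XY architecture name."""
    return f"sm_{major}{minor}"


# Compute capabilities from the Platform Specifications table.
spark_arch = sm_arch(12, 1)   # DGX Spark (GB10), compute capability 12.1
b300_arch = sm_arch(10, 3)    # B300, compute capability 10.3
print(spark_arch, b300_arch)
```

If PyTorch is installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple for the current device, which can be passed to this helper as `sm_arch(*torch.cuda.get_device_capability())`.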
Platform Specifications
| Specification | DGX Spark (GB10) | B300 |
|---|---|---|
| Compute Capability | sm_121 (12.1) | sm_103 (10.3) |
| SMs | 48 | 132 |
| Memory | 128 GB LPDDR5x | 192 GB HBM3e |
| Memory Bandwidth | 273 GB/s | 8 TB/s |
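As a rough sanity check on the memory-bound results, the peak ratios implied by this table can be computed and set against the measured RoPE and RMSNorm scaling. This is a back-of-the-envelope sketch that assumes 1 TB/s = 1000 GB/s and treats the listed bandwidths as peak figures.

```python
# Peak figures from the Platform Specifications table.
gb10_bw_gbs = 273.0
b300_bw_gbs = 8.0 * 1000.0    # 8 TB/s, assuming 1 TB = 1000 GB
gb10_sms, b300_sms = 48, 132

peak_bw_ratio = b300_bw_gbs / gb10_bw_gbs
sm_ratio = b300_sms / gb10_sms

# Measured B300 / GB10 scaling from the kernel benchmark table.
measured = {"RoPE (16384)": 22.8, "RMSNorm (4096)": 33.1}

print(f"peak bandwidth ratio: {peak_bw_ratio:.1f}x, SM count ratio: {sm_ratio:.2f}x")
for kernel, ratio in measured.items():
    print(f"{kernel}: measured {ratio}x = {ratio / peak_bw_ratio:.2f} of peak ratio")
```

The peak bandwidth ratio is about 29.3x, so the RoPE result sits below it and the RMSNorm result slightly above it. The measured ratios compare achieved rather than peak bandwidth, so one possible reading of a ratio above 29.3x is that the kernel reaches a smaller fraction of peak bandwidth on GB10 than on B300.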