
cuTile Kernels

60 MIN

Run cuTile kernel benchmarks, FMHA implementation, and LLM inference on DGX Spark and B300

Benchmarking, Cross-Platform, DeepSeek, Docker, FMHA, Flash Attention, GPU Development, LLM Inference, Qwen2, TileGym, cuTile
View on GitHub
Sections: Overview · Kernel Benchmarks · End-to-End Inference · FMHA Implementation · Platform Comparison · Troubleshooting
| Symptom | Cause | Fix |
| --- | --- | --- |
| `docker: permission denied` | User not in docker group | `sudo usermod -aG docker $USER && newgrp docker` |
| `401 Client Error: Unauthorized` | Missing HuggingFace token | `export HF_TOKEN=<your_token>` |
| `ModuleNotFoundError: tilegym` | TileGym not installed | `cd TileGym && pip install .` |
| `RuntimeError: CUDA out of memory` | Model too large | Reduce batch size or use a smaller model |
| Killed during model load | Out of system memory | Clear cache: `sync; echo 3 > /proc/sys/vm/drop_caches` |
| Slow first run | JIT compilation | Normal: cuTile compiles kernels on the first run |
| `FileNotFoundError: input_prompt_small.txt` | Missing input file | Run from the `modeling/transformers` directory |
| `torch.cuda.OutOfMemoryError` | Insufficient GPU memory | Reduce the `--batch_size` parameter |
| `ImportError: cuda.tile` | Missing Tile IR | Install: `apt-get install cuda-tile-ir-13-1` |
| Benchmark hangs | GPU busy or locked | Check `nvidia-smi` for other processes |

NOTE

DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. Because many applications are still being updated to take advantage of UMA, you may encounter memory issues even when your working set fits within DGX Spark's memory capacity. If that happens, manually flush the buffer cache with:

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
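Before flushing, it can help to confirm that reclaimable cache (rather than genuine exhaustion) is the problem. A minimal, Linux-only sketch that parses `/proc/meminfo` — the `meminfo` helper below is illustrative, not part of any DGX Spark tooling:

```python
def meminfo(path="/proc/meminfo"):
    """Parse /proc/meminfo into a {field: kB} dict (hypothetical helper, Linux only)."""
    fields = {}
    with open(path) as f:
        for line in f:
            key, rest = line.split(":", 1)
            fields[key] = int(rest.split()[0])  # values are reported in kB
    return fields

if __name__ == "__main__":
    m = meminfo()
    # Buffers/Cached pages are reclaimable; low MemAvailable despite large
    # Cached usually means the buffer cache is worth flushing before rerunning.
    print(f"MemAvailable: {m['MemAvailable'] // 1024} MiB, "
          f"Cached: {m['Cached'] // 1024} MiB")
```

If `MemAvailable` is large, an out-of-memory kill likely has a different cause (for example, an oversized model or batch size).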

TIP

The first run of cuTile kernels includes JIT compilation overhead. Subsequent runs are faster because the compiled kernels are cached.
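Because first-run timings include one-time JIT compilation, benchmark harnesses typically discard a few warm-up iterations before measuring. A framework-agnostic sketch — the `bench` helper and its parameters are illustrative, not part of TileGym:

```python
import time

def bench(fn, warmup=2, iters=10):
    """Time fn, discarding warm-up runs that absorb one-time JIT compilation."""
    for _ in range(warmup):
        fn()  # compiled kernels get cached during these runs
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters  # mean seconds per call

if __name__ == "__main__":
    avg = bench(lambda: sum(i * i for i in range(10_000)))
    print(f"{avg * 1e6:.1f} us per call")
```

The same pattern applies when comparing cuTile kernels across DGX Spark and B300: compare only the post-warm-up averages, since compilation cost is paid once per platform.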

For the latest known issues, please review the DGX Spark User Guide.

Resources

  • TileGym Repository
  • cuTile Python Documentation
  • Tile IR Specification
  • DGX Spark Documentation
  • DGX Spark Forum
  • Qwen2 on HuggingFace
  • DeepSeek-V2-Lite on HuggingFace
  • NVIDIA Blog - Tuning Flash Attention in CUDA Tile
  • Flash Attention Paper

Copyright Ā© 2026 NVIDIA Corporation