
cuTile Kernels

60 MIN

Run cuTile kernel benchmarks, FMHA implementation, and LLM inference on DGX Spark and B300

Benchmarking, Cross-Platform, DeepSeek, Docker, FMHA, Flash Attention, GPU Development, LLM Inference, Qwen2, TileGym, cuTile
View on GitHub
Sections: Overview · Kernel Benchmarks · End-to-End Inference · FMHA Implementation · Platform Comparison · Troubleshooting
| Symptom | Cause | Fix |
| --- | --- | --- |
| `docker: permission denied` | User not in docker group | `sudo usermod -aG docker $USER && newgrp docker` |
| `401 Client Error: Unauthorized` | Missing HuggingFace token | `export HF_TOKEN=<your_token>` |
| `ModuleNotFoundError: tilegym` | TileGym not installed | `cd TileGym && pip install .` |
| `RuntimeError: CUDA out of memory` | Model too large | Reduce batch size or use a smaller model |
| Killed during model load | Out of system memory | Clear cache: `sync; echo 3 > /proc/sys/vm/drop_caches` |
| Slow first run | JIT compilation | Normal: cuTile compiles kernels on the first run |
| `FileNotFoundError: input_prompt_small.txt` | Missing input file | Run from the `modeling/transformers` directory |
| `torch.cuda.OutOfMemoryError` | Insufficient GPU memory | Reduce the `--batch_size` parameter |
| `ImportError: cuda.tile` | Missing Tile IR | Install: `apt-get install cuda-tile-ir-13-1` |
| Benchmark hangs | GPU busy or locked | Check `nvidia-smi` for other processes |

NOTE

DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. Because many applications are still being updated to take advantage of UMA, you may encounter memory issues even when your working set fits within DGX Spark's memory capacity. If that happens, manually flush the buffer cache with:

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
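Before flushing, it can help to confirm that reclaimable cache (rather than genuine exhaustion) is the problem. A minimal, Linux-only sketch that parses `/proc/meminfo` — the `meminfo` helper below is illustrative, not part of any DGX Spark tooling:

```python
def meminfo(path="/proc/meminfo"):
    """Parse /proc/meminfo into a {field: kB} dict (hypothetical helper, Linux only)."""
    fields = {}
    with open(path) as f:
        for line in f:
            key, rest = line.split(":", 1)
            fields[key] = int(rest.split()[0])  # values are reported in kB
    return fields

if __name__ == "__main__":
    m = meminfo()
    # Buffers/Cached pages are reclaimable; low MemAvailable despite large
    # Cached usually means the buffer cache is worth flushing before rerunning.
    print(f"MemAvailable: {m['MemAvailable'] // 1024} MiB, "
          f"Cached: {m['Cached'] // 1024} MiB")
```

If `MemAvailable` is large, an out-of-memory kill likely has a different cause (for example, an oversized model or batch size).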

TIP

The first run of cuTile kernels includes JIT compilation overhead. Subsequent runs are faster because the compiled kernels are cached.
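Because first-run timings include one-time JIT compilation, benchmark harnesses typically discard a few warm-up iterations before measuring. A framework-agnostic sketch — the `bench` helper and its parameters are illustrative, not part of TileGym:

```python
import time

def bench(fn, warmup=2, iters=10):
    """Time fn, discarding warm-up runs that absorb one-time JIT compilation."""
    for _ in range(warmup):
        fn()  # compiled kernels get cached during these runs
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters  # mean seconds per call

if __name__ == "__main__":
    avg = bench(lambda: sum(i * i for i in range(10_000)))
    print(f"{avg * 1e6:.1f} us per call")
```

The same pattern applies when comparing cuTile kernels across DGX Spark and B300: compare only the post-warm-up averages, since compilation cost is paid once per platform.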

For the latest known issues, please review the DGX Spark User Guide.

Resources

  • TileGym Repository
  • cuTile Python Documentation
  • Tile IR Specification
  • DGX Spark Documentation
  • DGX Spark Forum
  • Qwen2 on HuggingFace
  • DeepSeek-V2-Lite on HuggingFace
  • NVIDIA Blog - Tuning Flash Attention in CUDA Tile
  • Flash Attention Paper

Copyright Ā© 2026 NVIDIA Corporation