Speculative Decoding

30 MIN

Learn how to set up speculative decoding for fast inference on DGX Spark.

| Symptom | Cause | Fix |
| --- | --- | --- |
| "CUDA out of memory" error | Insufficient GPU memory | Reduce `kv_cache_free_gpu_memory_fraction` to 0.9, or use a device with more VRAM |
| Container fails to start | Docker GPU support issues | Verify that `nvidia-docker` is installed and that the `--gpus=all` flag is supported |
| Model download fails | Network or authentication issues | Check HuggingFace authentication and network connectivity |
| Cannot access gated repo at URL | Some HuggingFace models have restricted access | Regenerate your HuggingFace token and request access to the gated model in your web browser |
| Server doesn't respond | Port conflict or firewall | Check that port 8000 is available and not blocked |
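For the last two rows, here is a quick sketch of checks you can run from the host. Port 8000 and the `/v1/models` path are assumptions based on the table above and the common OpenAI-compatible serving layout; adjust them to your setup.

```shell
#!/usr/bin/env bash
# Probe whether port 8000 is already taken, using bash's /dev/tcp
# pseudo-device: a refused connection means the port is free to bind.
if (exec 3<>/dev/tcp/127.0.0.1/8000) 2>/dev/null; then
  echo "port 8000 is in use"
  exec 3>&-   # close the probe socket
else
  echo "port 8000 is free"
fi

# If the server should already be up, ask it for its model list.
# The /v1/models path assumes an OpenAI-compatible endpoint.
curl -sf http://localhost:8000/v1/models \
  || echo "server not responding on port 8000"
```

If the port is in use by another process, either stop that process or start the server on a different port.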

NOTE

DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. Because many applications are still being updated to take advantage of UMA, you may encounter memory issues even when you are within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:

```shell
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
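To gauge whether flushing will help, you can first check how much memory the page cache is holding. A minimal sketch using standard Linux tools:

```shell
# Show overall memory usage; the "buff/cache" column is the memory
# that the drop_caches command above reclaims.
free -h

# Pull out just the buff/cache figure (6th column of the Mem: row).
free | awk '/^Mem:/ {print "buff/cache (kB):", $6}'
```

Run the same commands again after flushing to confirm the cache was released.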