TRT LLM for Inference

Estimated time: 1 hour

Install and configure TRT LLM to run on a single Spark or on two Sparks

Common issues when running on a single Spark

| Symptom | Cause | Fix |
| --- | --- | --- |
| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your HuggingFace token and request access to the gated model in your web browser |
| OOM during weight loading (e.g., Nemotron Super 49B) | Parallel weight-loading memory pressure | export TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=1 |
| "CUDA out of memory" | GPU VRAM insufficient for the model | Lower free_gpu_memory_fraction below 0.9, reduce the batch size, or use a smaller model |
| "Model not found" error | HF_TOKEN invalid or model inaccessible | Verify token and model permissions |
| Container pull timeout | Network connectivity issues | Retry the pull or use a local mirror |
| import tensorrt_llm fails | Container runtime issues | Restart the Docker daemon and retry |
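
If several of these apply at once, a minimal shell sketch that applies the environment-variable fixes up front may help; the token value is a placeholder, and the import check assumes you are running inside the TRT LLM container:

```bash
# Apply the common single-Spark fixes before launching.
# <TOKEN> is a placeholder; substitute your own HuggingFace token.
export HF_TOKEN=<TOKEN>
export TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=1   # avoid OOM during parallel weight loading

# Inside the container, confirm the TRT LLM Python package imports cleanly.
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```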

Common issues when running on two Sparks

| Symptom | Cause | Fix |
| --- | --- | --- |
| MPI hostname test returns a single hostname | Network connectivity issues | Verify both nodes are on reachable IP addresses |
| "Permission denied" on HuggingFace download | Invalid or missing HF_TOKEN | Set a valid token: export HF_TOKEN=<TOKEN> |
| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your HuggingFace token and request access to the gated model in your web browser |
| "CUDA out of memory" errors | Insufficient GPU memory | Reduce --max_batch_size or --max_num_tokens |
| Container exits immediately | Missing entrypoint script | Ensure the trtllm-mn-entrypoint.sh download succeeded and the script is executable; also confirm the container is not already running on the node. If port 2233 is already in use, the entrypoint script will not start |
| "Error response from daemon: error while validating Root CA Certificate" | System clock out of sync or expired certificates | Sync the system time with an NTP server: sudo timedatectl set-ntp true |
| "invalid mount config for type 'bind'" | Missing or non-executable entrypoint script | Run docker inspect <container_id> to see the full error message; verify trtllm-mn-entrypoint.sh exists in your home directory on both nodes (ls -la $HOME/trtllm-mn-entrypoint.sh) and is executable (chmod +x $HOME/trtllm-mn-entrypoint.sh) |
| "task: non-zero exit (255)" | Container exited with error code 255 | Run docker ps -a --filter "name=trtllm-multinode_trtllm" to get the container ID, then docker logs <container_id> for the detailed error messages |
| Docker state stuck in "Pending" with "no suitable node (insufficien...)" | Docker daemon not properly configured for GPU access | Verify steps 2-4 completed successfully and check that /etc/docker/daemon.json contains the correct GPU configuration |
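
A quick diagnostic pass for the two-node issues above might look like the following sketch; the node IPs and container ID are placeholders, and the exact MPI test may differ from the one in your playbook:

```bash
# Check the entrypoint script on BOTH nodes: it must exist and be executable.
ls -la $HOME/trtllm-mn-entrypoint.sh
chmod +x $HOME/trtllm-mn-entrypoint.sh

# MPI connectivity test: should print one hostname per node, i.e. two
# distinct names. <node1-ip> and <node2-ip> are placeholders.
mpirun -H <node1-ip>,<node2-ip> hostname

# If a container exited, find its ID and read the detailed error from its logs.
docker ps -a --filter "name=trtllm-multinode_trtllm"
docker logs <container_id>
```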

NOTE

DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. Because many applications are still being updated to take advantage of UMA, you may encounter memory issues even when you are within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
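
To confirm the flush actually released memory, you can compare the buff/cache column of free before and after; this is a standard Linux check, not specific to TRT LLM:

```bash
free -h   # note the buff/cache column before the flush
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
free -h   # buff/cache should be noticeably smaller afterwards
```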