TRT LLM for Inference
1 HR
Install and configure TRT LLM to run on a single Spark or on two Sparks
Common issues when running on a single Spark
| Symptom | Cause | Fix |
|---|---|---|
| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your HuggingFace token and request access to the gated model in your web browser |
| OOM during weight loading (e.g., Nemotron Super 49B) | Parallel weight-loading memory pressure | `export TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=1` |
| "CUDA out of memory" | GPU VRAM insufficient for model | Reduce free_gpu_memory_fraction: 0.9 or batch size or use smaller model |
| "Model not found" error | HF_TOKEN invalid or model inaccessible | Verify token and model permissions |
| Container pull timeout | Network connectivity issues | Retry the pull or use a local mirror |
| `import tensorrt_llm` fails | Container runtime issues | Restart the Docker daemon and retry |
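Several of these fixes can be applied up front before launching the server. The sketch below is one way to do that with the `trtllm-serve` entrypoint; the model name, batch size, and options-file path are illustrative, and passing a lowered `free_gpu_memory_fraction` through an `--extra_llm_api_options` YAML is an assumption about how your launch is configured, not a prescribed setup.

```bash
# Minimal single-Spark launch sketch applying the fixes above.
# Assumptions: trtllm-serve is on PATH inside the container; the model name,
# batch size, and options-file path are illustrative placeholders.

export HF_TOKEN=<TOKEN>                             # valid HuggingFace token with access to the model
export TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=1   # avoid parallel weight-loading memory pressure

# Sanity check that TRT-LLM imports cleanly before serving.
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"

# Leave extra memory headroom by lowering the KV-cache fraction (assumed YAML layout).
cat > /tmp/llm_options.yaml <<'EOF'
kv_cache_config:
  free_gpu_memory_fraction: 0.8
EOF

trtllm-serve meta-llama/Llama-3.1-8B-Instruct \
  --max_batch_size 8 \
  --extra_llm_api_options /tmp/llm_options.yaml
```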
Common issues when running on two Sparks
| Symptom | Cause | Fix |
|---|---|---|
| MPI hostname test returns a single hostname | Network connectivity issues | Verify that both nodes have reachable IP addresses (see the diagnostic sketch after this table) |
| "Permission denied" on HuggingFace download | Invalid or missing HF_TOKEN | Set valid token: export HF_TOKEN=<TOKEN> |
| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your HuggingFace token and request access to the gated model in your web browser |
| "CUDA out of memory" errors | Insufficient GPU memory | Reduce --max_batch_size or --max_num_tokens |
| Container exits immediately | Missing entrypoint script | Ensure the `trtllm-mn-entrypoint.sh` download succeeded and the script has executable permissions. Also ensure the container is not already running on your node; if port 2233 is already in use, the entrypoint script will not start. |
| Error response from daemon: error while validating Root CA Certificate | System clock out of sync or expired certificates | Sync the system clock with an NTP server: `sudo timedatectl set-ntp true` |
| "invalid mount config for type 'bind'" | Missing or non-executable entrypoint script | Run docker inspect <container_id> to see full error message. Verify trtllm-mn-entrypoint.sh exists on both nodes in your home directory (ls -la $HOME/trtllm-mn-entrypoint.sh) and has executable permissions (chmod +x $HOME/trtllm-mn-entrypoint.sh) |
| "task: non-zero exit (255)" | Container exit with error code 255 | Check container logs with docker ps -a --filter "name=trtllm-multinode_trtllm" to get container ID, then docker logs <container_id> to see detailed error messages |
| Docker state stuck in "Pending" with "no suitable node (insufficien...)" | Docker daemon not properly configured for GPU access | Verify that steps 2-4 completed successfully and that `/etc/docker/daemon.json` contains the correct GPU configuration |
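Most rows in this table can be checked with a short diagnostic pass on each node. The following is a sketch rather than an exhaustive procedure; the node IP addresses and container ID are placeholders you must fill in.

```bash
# Two-Spark diagnostic sketch; replace <node1-ip>, <node2-ip>, and <container_id>.

# 1. MPI connectivity: should print two distinct hostnames, one per node.
mpirun -np 2 --host <node1-ip>,<node2-ip> hostname

# 2. Entrypoint script: must exist in $HOME on both nodes and be executable.
ls -la $HOME/trtllm-mn-entrypoint.sh
chmod +x $HOME/trtllm-mn-entrypoint.sh

# 3. Port 2233 must be free, or the entrypoint script will not start.
ss -tlnp | grep 2233 || echo "port 2233 is free"

# 4. Container status and logs for non-zero-exit failures.
docker ps -a --filter "name=trtllm-multinode_trtllm"
docker logs <container_id>

# 5. Root CA certificate validation errors: re-sync the system clock.
sudo timedatectl set-ntp true
```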
NOTE
DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. Because many applications are still being updated to take advantage of UMA, you may encounter memory issues even when you are within DGX Spark's memory capacity. If that happens, manually flush the buffer cache:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
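To confirm the flush worked, compare the buff/cache column of `free -h` before and after:

```bash
free -h                                                # note the buff/cache column
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
free -h                                                # buff/cache should now be much smaller
```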