Deploy a NIM on DGX Spark
Check that your system meets the basic requirements for running GPU-enabled containers.
nvidia-smi
docker --version
docker run --rm --gpus all nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04 nvidia-smi
If you see a permission denied error (something like "permission denied while trying to connect to the Docker daemon socket"), add your user to the docker group so that you don't need to run Docker commands with sudo.
sudo usermod -aG docker $USER
newgrp docker
Set up access to NVIDIA's container registry using your NGC API key.
export NGC_API_KEY="<YOUR_NGC_API_KEY>"
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
Choose a specific LLM NIM from NGC and set up local caching for model assets.
export CONTAINER_NAME="nim-llm-demo"
export IMG_NAME="nvcr.io/nim/meta/llama-3.1-8b-instruct-dgx-spark:latest"
export LOCAL_NIM_CACHE=~/.cache/nim
export LOCAL_NIM_WORKSPACE=~/.local/share/nim/workspace
mkdir -p "$LOCAL_NIM_WORKSPACE"
chmod -R a+w "$LOCAL_NIM_WORKSPACE"
mkdir -p "$LOCAL_NIM_CACHE"
chmod -R a+w "$LOCAL_NIM_CACHE"
Start the containerized LLM service with GPU acceleration and proper resource allocation.
docker run -it --rm --name=$CONTAINER_NAME \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY="$NGC_API_KEY" \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -v "$LOCAL_NIM_WORKSPACE:/opt/nim/workspace" \
  -p 8000:8000 \
  $IMG_NAME
The container will download the model on first run and may take several minutes to start. Look for startup messages indicating the service is ready.
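While the container starts, you can follow its logs from a second terminal with docker logs -f $CONTAINER_NAME. To detect readiness programmatically, poll the health endpoint. This is a minimal sketch that assumes the image exposes the standard NIM readiness route at /v1/health/ready on port 8000; adjust the path if your image documents a different one.

# Poll the readiness endpoint until the service responds with success.
until curl -sf http://localhost:8000/v1/health/ready > /dev/null; do
  echo "Waiting for the NIM to become ready..."
  sleep 10
done
echo "NIM is ready to accept requests."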
Test the deployed service with a basic completion request to verify functionality. Run the following curl command in a new terminal.
curl -X 'POST' \
  'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [
      {
        "role": "system",
        "content": "detailed thinking on"
      },
      {
        "role": "user",
        "content": "Can you write me a song?"
      }
    ],
    "top_p": 1,
    "n": 1,
    "max_tokens": 15,
    "frequency_penalty": 1.0,
    "stop": ["hello"]
  }'
The expected output is a JSON response in the OpenAI chat-completions format: a choices array whose first entry holds a message object with the generated text in its content field.
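If you have jq installed, you can print only the generated text from the response. This is an optional convenience sketch that reuses the same endpoint and model name as above; the jq path follows the OpenAI chat-completions response layout.

# Send a simple request and print only the generated text.
curl -s 'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Can you write me a song?"}],
    "max_tokens": 64
  }' | jq -r '.choices[0].message.content'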
Remove the running container and optionally clean up cached model files.
WARNING
Removing cached models means they will be downloaded again the next time the container starts.
docker stop $CONTAINER_NAME
Because the container was started with --rm, Docker removes it automatically once it stops, so a separate docker rm is not needed.
To remove cached models and free disk space:
rm -rf "$LOCAL_NIM_CACHE"
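If you want to see how much disk space this reclaims, check the cache size before deleting it:

du -sh "$LOCAL_NIM_CACHE"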
With a working NIM deployment, you can test the integration with your preferred HTTP client or SDK and begin building applications.
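Because the service speaks an OpenAI-compatible API (the same /v1/chat/completions route used above), most OpenAI-style clients and SDKs can be pointed at it by setting their base URL to http://localhost:8000/v1. For a quick sanity check of what the server accepts, you can list the available model names; this sketch assumes the server also exposes the conventional OpenAI-style /v1/models route.

# List the model names the server is serving (useful for the "model" field above).
curl -s http://localhost:8000/v1/models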