Install and Use vLLM for Inference

30 MIN

Use a container or build vLLM from source for Spark

Pull vLLM container image

Find the latest container build at https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?version=25.09-py3, then pull it:

docker pull nvcr.io/nvidia/vllm:25.09-py3
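
Once the pull finishes, you can confirm the image is available locally with a standard Docker listing (no vLLM-specific tooling involved):

docker images nvcr.io/nvidia/vllm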

Test vLLM in container

Launch the container and start the vLLM server with a test model to verify basic functionality.

docker run -it --gpus all -p 8000:8000 \
nvcr.io/nvidia/vllm:25.09-py3 \
vllm serve "Qwen/Qwen2.5-Math-1.5B-Instruct"

Expected output should include:

  • Model loading confirmation
  • Server startup on port 8000
  • GPU memory allocation details
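
If you expect to relaunch the container, the following variant is a sketch that persists downloaded weights between runs. It assumes the Hugging Face cache inside the container sits at the default /root/.cache/huggingface, and it adds --ipc=host so the container has enough shared memory (vLLM's Docker guidance suggests this or a larger --shm-size):

docker run -it --gpus all --ipc=host -p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
nvcr.io/nvidia/vllm:25.09-py3 \
vllm serve "Qwen/Qwen2.5-Math-1.5B-Instruct"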

In another terminal, test the server:

curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "Qwen/Qwen2.5-Math-1.5B-Instruct",
    "messages": [{"role": "user", "content": "12*17"}],
    "max_tokens": 500
}'

The expected response should contain "content": "204" (the model may also include its working in the message content).
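
You can also list the models the server exposes; the /v1/models endpoint is part of the OpenAI-compatible API that vLLM implements:

curl http://localhost:8000/v1/models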

Cleanup and rollback

For the container approach (non-destructive), stop any running vLLM containers, then remove them and the image:

docker stop $(docker ps -q --filter ancestor=nvcr.io/nvidia/vllm:25.09-py3)
docker rm $(docker ps -aq --filter ancestor=nvcr.io/nvidia/vllm:25.09-py3)
docker rmi nvcr.io/nvidia/vllm:25.09-py3

To remove CUDA 12.9 (only needed if you installed it for the source build):

sudo /usr/local/cuda-12.9/bin/cuda-uninstaller

Next steps

  • Production deployment: Configure vLLM with your specific model requirements
  • Performance tuning: Adjust batch sizes and memory settings for your workload (see the sketch after this list)
  • Monitoring: Set up logging and metrics collection for production use
  • Model management: Explore additional model formats and quantization options
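
As a starting point for performance tuning, vLLM's serve command exposes flags such as --gpu-memory-utilization, --max-model-len, and --max-num-seqs. The launch below is a sketch; the values are placeholders to adjust for your GPU and workload, not recommendations:

docker run -it --gpus all -p 8000:8000 \
nvcr.io/nvidia/vllm:25.09-py3 \
vllm serve "Qwen/Qwen2.5-Math-1.5B-Instruct" \
--gpu-memory-utilization 0.90 \
--max-model-len 4096 \
--max-num-seqs 64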