To manage containers without sudo, your user must belong to the docker group. If you skip this step, run every Docker command below with sudo.
Open a new terminal and test Docker access. In the terminal, run:
docker ps
If you see a permission denied error (something like permission denied while trying to connect to the Docker daemon socket), add your user to the docker group so that you don't need to run the command with sudo.
sudo usermod -aG docker $USER
newgrp docker
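Note that newgrp only applies to the current shell; new terminals pick up the group after you log out and back in. As a quick sanity check (the check_docker_group helper is an illustrative name, not part of Docker):

```shell
# check_docker_group GROUPS: report whether "docker" appears in a
# space-separated group list such as the output of `id -nG`
check_docker_group() {
  if printf '%s' "$1" | tr ' ' '\n' | grep -qx docker; then
    echo "docker group OK"
  else
    echo "not in docker group"
  fi
}

# Check the live system's group list for the current user
check_docker_group "$(id -nG "${USER:-$(whoami)}")"
```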
Create a local output directory where the quantized model files will be stored. This directory will be mounted into the container to persist results after the container exits.
mkdir -p ./output_models
chmod 755 ./output_models
Ensure you have access to the DeepSeek model by setting your Hugging Face authentication token.
# Export your Hugging Face token as an environment variable
# Get your token from: https://huggingface.co/settings/tokens
export HF_TOKEN="your_token_here"
The token will be automatically used by the container for model downloads.
If your system has multiple GPUs, you need to identify the device ID of your GB300 GPU. Run nvidia-smi to list all available GPUs:
nvidia-smi
Example output on a system with multiple GPUs:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.35 Driver Version: 590.35 CUDA Version: 13.1 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
|=========================================+========================+======================|
| 0 NVIDIA RTX 6000 On | 00000004:01:00.0 Off | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GB300 On | 00000009:06:00.0 Off | 0 |
+-----------------------------------------+------------------------+----------------------+
In this example, the GB300 is device 1. Note this number for use in Docker commands.
NOTE
The examples below assume the GB300 is device 1. If your GPU has a different ID, adjust the --gpus "device=X" parameter in the Docker commands accordingly.
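The device ID can also be looked up programmatically from nvidia-smi's CSV query output. A sketch, assuming the GPU's marketing name contains "GB300" (gpu_id_for is an illustrative helper name):

```shell
# gpu_id_for NAME CSV: print the index of the first GPU whose name matches NAME,
# given "index, name" lines as produced by:
#   nvidia-smi --query-gpu=index,name --format=csv,noheader
gpu_id_for() {
  printf '%s\n' "$2" | awk -F', ' -v name="$1" '$2 ~ name { print $1; exit }'
}

# Typical usage on a live system:
# gpu_id_for GB300 "$(nvidia-smi --query-gpu=index,name --format=csv,noheader)"
```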
Launch the vLLM container with GPU access, IPC settings optimized for multi-GPU workloads, and volume mounts for model caching and output persistence.
docker run --rm -it --gpus "device=1" --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
-v "$(pwd)/output_models:/workspace/output_models" \
-v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
-e HF_TOKEN="$HF_TOKEN" \
nvcr.io/nvidia/vllm:25.12.post1-py3 \
bash -c "
git clone -b 0.41.0 --single-branch https://github.com/NVIDIA/TensorRT-Model-Optimizer.git /app/TensorRT-Model-Optimizer && \
cd /app/TensorRT-Model-Optimizer && pip install -e '.[dev]' && \
export ROOT_SAVE_PATH='/workspace/output_models' && \
/app/TensorRT-Model-Optimizer/examples/llm_ptq/scripts/huggingface_example.sh \
--model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--quant nvfp4 \
--tasks quant
"
NOTE
You may see a No module named 'mpi4py' error and a pynvml.NVMLError_NotSupported: Not Supported message in the logs. Both are expected in some environments and do not affect the quantization results; the pynvml error will be fixed in an upcoming release.
The quantization process displays progress information as it runs.
After the container completes, verify that the quantized model files were created successfully.
# Check output directory contents
ls -la ./output_models/
# Verify model files are present
find ./output_models/ -name "*.bin" -o -name "*.safetensors" -o -name "config.json"
You should see model weight files, configuration files, and tokenizer files in the output directory.
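The manual checks above can be wrapped in a small helper. A sketch; verify_model_dir is a hypothetical name, and the exact file set may vary with the export format:

```shell
# verify_model_dir DIR: check that a quantized export contains the expected files
verify_model_dir() {
  dir="$1"
  ok=1
  [ -f "$dir/config.json" ] || { echo "missing config.json"; ok=0; }
  # weights may be exported as .safetensors or .bin
  found=0
  for f in "$dir"/*.safetensors "$dir"/*.bin; do
    [ -f "$f" ] && found=1
  done
  [ "$found" -eq 1 ] || { echo "missing weight files"; ok=0; }
  if [ "$ok" -eq 1 ]; then
    echo "model directory looks complete"
  fi
}

# Typical usage:
# verify_model_dir ./output_models/saved_models_DeepSeek-R1-Distill-Llama-8B_nvfp4
```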
Now verify the quantized model can be loaded properly using a simple test:
# Set path to quantized model directory
export MODEL_PATH="$(pwd)/output_models/saved_models_DeepSeek-R1-Distill-Llama-8B_nvfp4/"
docker run \
-e HF_TOKEN="$HF_TOKEN" \
-v "$MODEL_PATH:/workspace/model" \
--rm -it --ulimit memlock=-1 --ulimit stack=67108864 \
--gpus "device=1" --ipc=host --network host \
nvcr.io/nvidia/vllm:25.12.post1-py3 \
vllm serve /workspace/model \
--max-model-len 4096 \
--port 8000
NOTE
This starts a vLLM server on port 8000. When you are done validating, stop it with Ctrl+C (or exit the container) before starting Step 8, which also uses port 8000.
Start the vLLM OpenAI-compatible API server with the quantized model. First, set the path to your quantized model:
# Set path to quantized model directory
export MODEL_PATH="$(pwd)/output_models/saved_models_DeepSeek-R1-Distill-Llama-8B_nvfp4/"
docker run \
-e HF_TOKEN="$HF_TOKEN" \
-v "$MODEL_PATH:/workspace/model" \
--rm -it --ulimit memlock=-1 --ulimit stack=67108864 \
--gpus "device=1" --ipc=host --network host \
nvcr.io/nvidia/vllm:25.12.post1-py3 \
vllm serve /workspace/model \
--backend pytorch \
--max-num-seqs 4 \
--max-model-len 8192 \
--port 8000
When serving from a local path, vLLM exposes the model name as the path's last component (here, model). Run the following to test the server (use the same model name vLLM reports, e.g. from curl http://localhost:8000/v1/models):
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "model",
"messages": [{"role": "user", "content": "What is artificial intelligence?"}],
"max_tokens": 100,
"temperature": 0.7,
"stream": false
}'
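For repeated testing with different prompts, the request body can be generated by a small helper. A sketch; build_chat_payload is a hypothetical name, and a JSON-aware tool such as jq is preferable for prompts containing quotes or other characters that need escaping:

```shell
# build_chat_payload MODEL PROMPT: print the JSON body for /v1/chat/completions
# (assumes the prompt contains no characters that need JSON escaping)
build_chat_payload() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "max_tokens": 100, "temperature": 0.7, "stream": false}' "$1" "$2"
}

# Usage, assuming the server from the previous step is listening on port 8000:
# curl -s -X POST http://localhost:8000/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d "$(build_chat_payload model 'What is artificial intelligence?')"
```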
Try changing knobs such as --max-model-len to find the right serving configuration for your use case.
To clean up the environment and remove generated files:
WARNING
This will permanently delete all quantized model files and cached data.
# Remove output directory and all quantized models
rm -rf ./output_models
# Remove Hugging Face cache (optional)
rm -rf ~/.cache/huggingface
# Remove Docker image (optional)
docker rmi nvcr.io/nvidia/vllm:25.12.post1-py3
The quantized model is now ready for deployment. Common next steps include: