NVFP4 Quantization

Step 1
Configure Docker permissions

To easily manage containers without sudo, you must be in the docker group. If you choose to skip this step, you will need to run Docker commands with sudo.

Open a new terminal and test Docker access. In the terminal, run:

docker ps

If you see a permission denied error (something like permission denied while trying to connect to the Docker daemon socket), add your user to the docker group so that you don't need to run the command with sudo.

sudo usermod -aG docker $USER
newgrp docker

Step 2
Prepare the environment

Create a local output directory where the quantized model files will be stored. This directory will be mounted into the container to persist results after the container exits.

mkdir -p ./output_models
chmod 755 ./output_models

Step 3
Authenticate with Hugging Face

Ensure you have access to the DeepSeek model by setting your Hugging Face authentication token.

# Export your Hugging Face token as an environment variable
# Get your token from: https://huggingface.co/settings/tokens
export HF_TOKEN="your_token_here"

The token will be automatically used by the container for model downloads.

Step 4
Identify your GB300 GPU

If your system has multiple GPUs, you need to identify the device ID of your GB300 GPU. Run nvidia-smi to list all available GPUs:

nvidia-smi

Example output on a system with multiple GPUs:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.35                 Driver Version: 590.35         CUDA Version: 13.1     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
|=========================================+========================+======================|
|   0  NVIDIA RTX 6000                On  |   00000004:01:00.0 Off |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GB300                   On  |   00000009:06:00.0 Off |                    0 |
+-----------------------------------------+------------------------+----------------------+

In this example, the GB300 is device 1. Export this number as a variable for use in Docker commands:

export GPU_ID=1  # Replace with your GB300 device number

NOTE

On a single-GPU DGX Station (GB300 only), the device ID is 0. Set GPU_ID=0 in that case.

Step 5
Run the quantization process using TensorRT Model Optimizer

Launch the vLLM container with GPU access, IPC settings optimized for multi-GPU workloads, and volume mounts for model caching and output persistence.

docker run --rm -it --gpus "device=$GPU_ID" --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -v "./output_models:/workspace/output_models" \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  -e HF_TOKEN=$HF_TOKEN \
  nvcr.io/nvidia/vllm:25.12.post1-py3 \
  bash -c "
    git clone -b 0.41.0 --single-branch https://github.com/NVIDIA/TensorRT-Model-Optimizer.git /app/TensorRT-Model-Optimizer && \
    cd /app/TensorRT-Model-Optimizer && pip install -e '.[dev]' && \
    export ROOT_SAVE_PATH='/workspace/output_models' && \
    /app/TensorRT-Model-Optimizer/examples/llm_ptq/scripts/huggingface_example.sh \
    --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --quant nvfp4 \
    --tasks quant
  "

NOTE

You can safely ignore the No module named 'mpi4py' error. It does not affect the quantization process.
You may encounter this pynvml.NVMLError_NotSupported: Not Supported. This is expected in some environments, does not affect results, and will be fixed in an upcoming release.
If your model is too large, you may encounter an out of memory error. You can try quantizing a smaller model instead.

This command:

Runs the container with access to the specified GPU and optimized shared memory settings
Mounts your output directory to persist quantized model files
Mounts your Hugging Face cache to avoid re-downloading the model
Clones and installs the TensorRT Model Optimizer from source
Executes the quantization script with NVFP4 quantization parameters

Step 6
Monitor the quantization process

The quantization process will display progress information including:

Model download progress from Hugging Face
Quantization calibration steps
Model export and validation phases

Step 7
Validate the quantized model

After the container completes, verify that the quantized model files were created successfully.

# Check output directory contents
ls -la ./output_models/

# Verify model files are present (weights, config, tokenizer, and NVFP4 quant metadata)
find ./output_models/ \( -name "*.bin" -o -name "*.safetensors" -o -name "*.json" -o -name "*.jinja" \)

You should see model weight files, configuration files, and tokenizer files in the output directory.

Now verify the quantized model can be loaded properly using a simple test:

# Set path to quantized model directory
export MODEL_PATH="./output_models/saved_models_DeepSeek-R1-Distill-Llama-8B_nvfp4/"

docker run \
  -e HF_TOKEN=$HF_TOKEN \
  -v "$MODEL_PATH:/workspace/model" \
  --rm -it --ulimit memlock=-1 --ulimit stack=67108864 \
  --gpus "device=$GPU_ID" --ipc=host --network host \
  nvcr.io/nvidia/vllm:25.12.post1-py3 \
  vllm serve /workspace/model \
    --max-model-len 4096 \
    --port 8000

NOTE

This starts a vLLM server on port 8000. When you are done validating, stop it with Ctrl+C (or exit the container) before starting Step 8, which also uses port 8000.

Step 8
Serve the model with OpenAI-compatible API

Start the vLLM OpenAI-compatible API server with the quantized model. First, set the path to your quantized model:

# Set path to quantized model directory
export MODEL_PATH="./output_models/saved_models_DeepSeek-R1-Distill-Llama-8B_nvfp4/"

docker run \
  -e HF_TOKEN=$HF_TOKEN \
  -v "$MODEL_PATH:/workspace/model" \
  --rm -it --ulimit memlock=-1 --ulimit stack=67108864 \
  --gpus "device=$GPU_ID" --ipc=host --network host \
  nvcr.io/nvidia/vllm:25.12.post1-py3 \
  vllm serve /workspace/model \
    --served-model-name DeepSeek-R1-Distill-Llama-8B-NVFP4 \
    --max-num-seqs 4 \
    --max-model-len 8192 \
    --port 8000

The --served-model-name flag sets the model ID the API responds to (DeepSeek-R1-Distill-Llama-8B-NVFP4). Without it, vLLM defaults the ID to the --model argument — here the mount path /workspace/model — which is why the request must use the served name, not the original Hugging Face ID. You can confirm the active name with curl http://localhost:8000/v1/models. Run the following to test the server:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-R1-Distill-Llama-8B-NVFP4",
    "messages": [{"role": "user", "content": "What is artificial intelligence?"}],
    "max_tokens": 100,
    "temperature": 0.7,
    "stream": false
  }'

Try changing knobs such as --max-model-len to find the right serving configuration for your use case.

Step 9
Cleanup and rollback

To clean up the environment and remove generated files:

WARNING

This will permanently delete all quantized model files and cached data.

NOTE

The Step 5 container writes ./output_models/ as root, so a plain rm -rf from your (non-root) shell user will fail with Permission denied. Use sudo to remove it.

# Remove output directory and all quantized models (root-owned; written by the Step 5 container)
sudo rm -rf ./output_models

# Remove Hugging Face cache (optional)
rm -rf ~/.cache/huggingface

# Remove Docker image (optional)
docker rmi nvcr.io/nvidia/vllm:25.12.post1-py3

Step 10
Next steps

The quantized model is now ready for deployment. Common next steps include:

Benchmarking inference performance compared to the original model.
Integrating the quantized model into your inference pipeline.
Deploying to NVIDIA Triton Inference Server for production serving.
Running additional validation tests on your specific use cases.

Step 1Configure Docker permissions

Step 2Prepare the environment

Step 3Authenticate with Hugging Face

Step 4Identify your GB300 GPU

Step 5Run the quantization process using TensorRT Model Optimizer

Step 6Monitor the quantization process

Step 7Validate the quantized model

Step 8Serve the model with OpenAI-compatible API

Step 9Cleanup and rollback

Step 10Next steps

Resources

Step 1
Configure Docker permissions

Step 2
Prepare the environment

Step 3
Authenticate with Hugging Face

Step 4
Identify your GB300 GPU

Step 5
Run the quantization process using TensorRT Model Optimizer

Step 6
Monitor the quantization process

Step 7
Validate the quantized model

Step 8
Serve the model with OpenAI-compatible API

Step 9
Cleanup and rollback

Step 10
Next steps