NVFP4 Quantization

1 HR

Quantize a model to NVFP4 to run on Spark using TensorRT Model Optimizer

DGX Spark
View on GitHub
Overview

Basic idea

NVFP4 is a 4-bit floating-point format introduced with NVIDIA Blackwell GPUs that reduces memory bandwidth and storage requirements for inference workloads while maintaining model accuracy. Unlike uniform INT4 quantization, NVFP4 retains floating-point semantics, pairing compact 4-bit elements with a shared per-block scale, which allows higher dynamic range and more stable behavior under quantization. NVIDIA Blackwell Tensor Cores natively support mixed-precision execution across FP16, FP8, and FP4, so models can use FP4 for weights and activations while accumulating in higher precision (typically FP16). This design minimizes quantization error during matrix multiplications and supports efficient conversion pipelines in TensorRT-LLM for fine-grained, layer-wise quantization.

Immediate benefits are:

  • Cut memory use ~3.5x vs FP16 and ~1.8x vs FP8
  • Maintain accuracy close to FP8 (usually <1% loss)
  • Improve speed and energy efficiency for inference
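To make the ~3.5x figure concrete, here is back-of-envelope weight-only memory math for an 8B-parameter model. The 16-element block size with one FP8 scale per block reflects NVFP4's micro-block layout; real deployments add some overhead (activations, KV cache), so treat this as an estimate.

```shell
# Weights-only memory estimate for an 8B-parameter model.
params=8000000000
# FP16: 2 bytes per parameter.
fp16_gb=$(awk -v p="$params" 'BEGIN { printf "%.2f", p * 2 / 1e9 }')
# NVFP4: 4 bits (0.5 byte) per parameter plus one FP8 (1-byte) scale
# shared by every 16-element block.
nvfp4_gb=$(awk -v p="$params" 'BEGIN { printf "%.2f", p * (0.5 + 1.0/16) / 1e9 }')
echo "FP16: ${fp16_gb} GB, NVFP4: ${nvfp4_gb} GB"
# Prints: FP16: 16.00 GB, NVFP4: 4.50 GB
```

That works out to roughly 3.5x smaller than FP16, matching the figure above.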

What you'll accomplish

You'll quantize the DeepSeek-R1-Distill-Llama-8B model using NVIDIA's TensorRT Model Optimizer inside a TensorRT-LLM container, producing an NVFP4 quantized model for deployment on NVIDIA DGX Spark.

The examples use NVIDIA FP4 quantized models, which reduce model size by approximately 2x by lowering the precision of model layers. This quantization approach aims to preserve accuracy while providing significant throughput improvements. However, quantization can impact model accuracy; we recommend running evaluations to verify that the quantized model maintains acceptable performance for your use case.
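The workflow described above can be sketched as a single script. The `hf_ptq.py` path and its flag names are assumptions based on the TensorRT Model Optimizer `llm_ptq` examples, not verified against this container image; check the Model Optimizer documentation and the Instructions tab for the exact invocation before running.

```shell
# Hypothetical end-to-end sketch; the script path inside the container and
# the flag names are assumptions from the Model Optimizer llm_ptq examples.
cat > quantize_nvfp4.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
mkdir -p output
docker run --rm --gpus all \
  -e HF_TOKEN \
  -v "$PWD/output:/output" \
  nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
  python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --qformat nvfp4 \
    --export_path /output/DeepSeek-R1-Distill-Llama-8B-NVFP4
EOF
bash -n quantize_nvfp4.sh  # syntax check only; the real run needs the Spark GPU
```

Passing `-e HF_TOKEN` forwards your Hugging Face token into the container so the gated model download can authenticate.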

What to know before starting

  • Working with Docker containers and GPU-accelerated workloads
  • Understanding of model quantization concepts and their impact on inference performance
  • Experience with NVIDIA TensorRT and CUDA toolkit environments
  • Familiarity with Hugging Face model repositories and authentication

Prerequisites

  • NVIDIA Spark device with Blackwell architecture GPU
  • Docker installed with GPU support
  • NVIDIA Container Toolkit configured
  • Available storage for model files and outputs
  • Hugging Face account with access to the target model

Verify your setup:

# Check Docker GPU access
docker run --rm --gpus all nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev nvidia-smi

# Verify sufficient disk space
df -h .

Time & risk

  • Estimated duration: 45-90 minutes depending on network speed and model size
  • Risks:
    • Model download may fail due to network issues or Hugging Face authentication problems
    • Quantization process is memory-intensive and may fail on systems with insufficient GPU memory
    • Output files are large (several GB) and require adequate storage space
  • Rollback: Remove the output directory and any pulled Docker images to restore original state.
  • Last Updated: 12/15/2025
    • Fix broken client CURL request in Step 8
    • Update ModelOptimizer project name
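The rollback step above can be scripted as follows; `OUTPUT_DIR` is a placeholder (an assumption, point it at wherever you wrote the quantized model), and the image tag matches the one pulled in the setup check.

```shell
# Rollback sketch: remove quantization outputs and the pulled container image.
# OUTPUT_DIR is a placeholder; set it to your actual output path.
OUTPUT_DIR="${OUTPUT_DIR:-./output}"
rm -rf "$OUTPUT_DIR"
# Ignore errors if Docker is unavailable or the image was never pulled.
docker rmi nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev 2>/dev/null || true
echo "Rollback complete: removed $OUTPUT_DIR"
```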

Resources

  • DGX Spark Documentation
  • DGX Spark Forum
  • TensorRT Model Optimizer Documentation
  • TensorRT-LLM Documentation

Copyright © 2026 NVIDIA Corporation