
Bring LLMs to NIM

Use NIM to deploy a broad range of LLMs from Hugging Face.

Deploying large language models (LLMs) at scale — where real users engage with AI to drive business outcomes — requires speed, flexibility and reliability. Yet as more LLMs and variants emerge, each with its own architecture and serving requirements, deployment becomes increasingly complex. Different inference frameworks can offer unique performance benefits and optimization opportunities, but managing them adds overhead.

This developer example shows how to use a single NIM container to deploy a variety of LLMs for high-performance inference on NVIDIA-accelerated infrastructure with a simple, unified workflow. Examples include using NIM to deploy LLMs hosted on Hugging Face or on your local file system, exploring the available inference backend options per model, and deploying models in different quantization formats.
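
As a rough sketch of the launch step, the snippet below starts the NIM container against a Hugging Face model from Python using the Docker CLI. The container image name, the NIM_MODEL_NAME and NGC_API_KEY environment variables, and the model identifier are assumptions based on common NIM conventions; substitute the exact values from the NIM documentation and the NGC catalog entry for the LLM NIM container.

```python
# Minimal sketch: launch the LLM NIM container against a Hugging Face model.
# The image tag, environment variable names, and model id below are assumptions
# used for illustration; verify them against the NIM documentation.
import os
import subprocess

NIM_IMAGE = "nvcr.io/nim/nvidia/llm-nim:latest"   # placeholder image name
MODEL = "hf://meta-llama/Llama-3.1-8B-Instruct"   # placeholder Hugging Face model
CACHE_DIR = os.path.expanduser("~/.cache/nim")    # local cache for downloaded weights

os.makedirs(CACHE_DIR, exist_ok=True)

cmd = [
    "docker", "run", "--rm", "--gpus", "all",
    "-e", f"NGC_API_KEY={os.environ['NGC_API_KEY']}",    # NGC credentials (must be set)
    "-e", f"HF_TOKEN={os.environ.get('HF_TOKEN', '')}",  # needed for gated Hugging Face repos
    "-e", f"NIM_MODEL_NAME={MODEL}",                     # point the container at your LLM
    "-v", f"{CACHE_DIR}:/opt/nim/.cache",                # persist weights between runs
    "-p", "8000:8000",                                   # OpenAI-compatible API port
    NIM_IMAGE,
]
subprocess.run(cmd, check=True)
```

To serve a checkpoint from the local file system instead, mount its directory into the container and point the model variable at that path rather than at a Hugging Face repository.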

NIM supports a broad range of LLMs through the NVIDIA TensorRT-LLM, vLLM, and SGLang backends, including popular open LLMs and specialized variants on Hugging Face. For more details on supported LLM architectures, see the documentation.

Architecture Diagram

What’s Included in the Blueprint

Key Features

The developer example includes an architecture diagram, an NVIDIA Brev launchable with a Jupyter notebook for rapid exploration and experimentation, and source code for local deployment. The NIM container supports the following key features:

  • Unified Workflow: Deploy a broad range of LLM architectures and variants with a consistent workflow by pointing the NIM container at your LLM. The container accepts models in Hugging Face format as well as NVIDIA TensorRT™-LLM model checkpoints and engines (see the client sketch after this list).
  • Performance: Ensure high-performance inference with multiple backends, including NVIDIA TensorRT-LLM, vLLM, and SGLang. Leverage smart defaults for optimized latency and throughput without configuration, or adjust simple options for further tuning.
  • Portable: Deploy on your choice of NVIDIA-accelerated infrastructure: workstation, data center, or cloud.
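
Once the container reports that it is ready, it serves the model through an OpenAI-compatible API, so standard OpenAI clients can talk to it. The sketch below assumes the endpoint is listening on localhost:8000 and that the served model id matches the Hugging Face name used at launch; both are assumptions to adjust for your deployment.

```python
# Minimal client sketch against a running NIM endpoint (OpenAI-compatible API).
# Assumes the container from the launch example is serving on localhost:8000.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # NIM's OpenAI-compatible endpoint
    api_key="not-used",                   # local deployments typically ignore the key
)

# List the model(s) the container is serving to confirm the deployment.
for model in client.models.list():
    print(model.id)

# Send a chat completion request to the deployed LLM.
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder; use the id printed above
    messages=[{"role": "user", "content": "Summarize what NVIDIA NIM does in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

Because the serving interface stays the same across backends and model formats, switching models changes only the launch configuration, not the client code.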

Software Used in This Blueprint

NVIDIA NIM™ microservices

NVIDIA LLM NIM microservice container (coming soon)


Minimum System Requirements

Hardware Requirements

  • NVIDIA GPU(s) with appropriate drivers (CUDA 12.1+). The examples assume at least 80 GB of GPU memory (a quick check sketch follows).
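
As a convenience, and not part of the blueprint itself, the short check below uses nvidia-smi (which ships with the NVIDIA driver) to confirm that each visible GPU meets the 80 GB memory assumption stated above.

```python
# Quick check that each visible GPU meets the ~80 GB memory assumption.
import subprocess

output = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"],
    text=True,
)
for line in output.strip().splitlines():
    name, mem_mib = [field.strip() for field in line.split(",")]
    mem_gb = int(mem_mib) / 1024
    status = "OK" if mem_gb >= 80 else "below the 80 GB assumption"
    print(f"{name}: {mem_gb:.0f} GB ({status})")
```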

Software Requirements

Ethical Considerations

NVIDIA believes trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure the models meet requirements for the relevant industry and use case and address unforeseen product misuse. For more detailed information on ethical considerations for the models, please see the Model Card++ Explainability, Bias, Safety and Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI concerns here.

Terms of Use

GOVERNING TERMS: The software is governed by the NVIDIA Software License Agreement and Product-Specific Terms for NVIDIA AI Products.