Copyright © 2026 NVIDIA Corporation

LLM Router

Route LLM requests to the best model for the task at hand.


AI systems often face a trade-off between accuracy, latency, and cost. Complex reasoning or multimodal queries need powerful models, but routing every request through the same large model wastes compute and increases response times. Simpler queries don’t need that level of reasoning or visual understanding.

This developer example makes model selection dynamic and data-driven. It supports both text and image inputs and offers two main strategies:

  • Intent-based routing that uses smaller language models to interpret query semantics.
  • Auto-routing that leverages CLIP embeddings and trained neural networks to optimize routing based on patterns in real data.

By evaluating each request’s complexity, modality, and intent in real time, the router can send lightweight queries to fast, efficient models and reserve high-capacity models for tasks that actually need them. The result is a system that maintains strong performance while reducing unnecessary compute costs.
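The intent-based strategy can be pictured as a lookup from a classified intent (plus modality) to a model in the configured pool. The following is a minimal illustrative sketch only: the intent labels and the intent-to-model mapping are assumptions for illustration, not the blueprint's actual routing policy, though the model names come from the list below.

```python
# Hypothetical intent-to-model mapping; the blueprint's real policy is
# learned/configured, and these labels are illustrative only.
INTENT_TO_MODEL = {
    "chitchat": "nvidia/nemotron-nano-9b",          # lightweight text queries
    "code": "meta/llama-3.1-70b-instruct",          # heavier reasoning
    "image": "nvidia/nemotron-nano-12b-vl",         # multimodal input
    "complex_reasoning": "deepseek-ai/deepseek-r1", # hardest tasks
}

def route(intent: str, has_image: bool) -> str:
    """Return the model name a request should be sent to."""
    if has_image:
        # Any request carrying an image goes to the vision-language model.
        return INTENT_TO_MODEL["image"]
    # Unknown intents fall back to the cheapest text model.
    return INTENT_TO_MODEL.get(intent, INTENT_TO_MODEL["chitchat"])

print(route("code", has_image=False))    # meta/llama-3.1-70b-instruct
print(route("unknown", has_image=True))  # nvidia/nemotron-nano-12b-vl
```

In the actual blueprint the `intent` input would come from a small classifier model rather than being passed in by hand.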

Architecture Diagram

What’s Included in the Blueprint

Key Features

This developer example includes architectural diagrams, Docker-based deployment configurations, Jupyter notebooks for exploration and training, and complete source code for local deployment and customization. The LLM Router example supports the following key features and components:

  • Multimodal Router Backend: Built with NVIDIA NeMo Agent Toolkit with FastAPI, supporting both text and image inputs through OpenAI-compatible chat completions API.
  • Two Routing Strategies: Intent-based routing using Qwen3-1.7B for semantic classification, and auto-routing using CLIP embeddings with trained neural network classifiers.
  • Model Recommendation Engine: Returns optimal model names rather than proxying requests, providing flexible integration patterns.
  • Interactive Demo Application: Gradio-based web interface demonstrating end-to-end routing and model calling workflows.
  • Training Pipeline: Complete notebooks and scripts for training custom neural network routers on your specific data and requirements.
  • Docker Compose Profiles: Simplified deployment with separate profiles for intent-based and neural network routing strategies.
  • Flexible Model Integration: Pre-configured for NVIDIA Build API, Azure OpenAI, and standard OpenAI endpoints with easy customization for other providers.
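Because the recommendation engine returns a model name rather than proxying the request, a client makes two calls: one to the router, then one to the recommended model. The sketch below shows that two-step flow; the router URL, endpoint path, and response shape are assumptions for illustration, not the blueprint's documented API.

```python
# Hypothetical two-step client flow: ask the router which model to use,
# then send the real chat completion to that model yourself.
# Endpoint paths and payload/response shapes are assumed, not documented.
import json
from urllib import request

ROUTER_URL = "http://localhost:8000/v1/route"  # assumed local router endpoint

def build_route_request(prompt: str, has_image: bool = False) -> dict:
    """Build the payload the router would classify (shape assumed)."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "modality": "image" if has_image else "text",
    }

def recommend_model(prompt: str) -> str:
    """Ask the router for a model name (assumes a {'model': ...} response)."""
    req = request.Request(
        ROUTER_URL,
        data=json.dumps(build_route_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["model"]

# The returned name would then be placed in the "model" field of a normal
# OpenAI-compatible chat completions request to the chosen provider.
payload = build_route_request("Summarize this contract clause.")
print(payload["modality"])
```

The benefit of this pattern is that the router stays out of the data path: large responses stream directly from the serving endpoint to the client.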

Software Used in This Blueprint

NVIDIA NIM™ microservices and Nemotron Models

  • Llama 3.1 8B Instruct
  • Llama 3.1 70B Instruct
  • Mixtral 8x22B Instruct
  • DeepSeek R1
  • Nemotron Nano 12B VL - Multimodal reasoning and image understanding
  • Nemotron Nano 9B - Efficient text processing and conversation

External Models

  • Qwen3-1.7B (served with vLLM) - Intent classification for routing decisions
  • GPT-5 Chat (via Azure OpenAI or OpenAI API) - Complex reasoning and sophisticated analysis
  • CLIP - Multimodal embeddings for neural network routing
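For the auto-routing strategy, a CLIP embedding of the request is fed to a trained routing head that scores each candidate model. The sketch below shows the shape of that idea with a random stand-in linear layer; real weights would come from the blueprint's training notebooks, and the embedding dimension here is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

MODELS = [
    "nvidia/nemotron-nano-9b",
    "meta/llama-3.1-70b-instruct",
    "nvidia/nemotron-nano-12b-vl",
]

# Stand-in for a trained routing head: one linear layer over a CLIP
# embedding. The blueprint trains this on real routing data; these random
# weights only demonstrate the mechanics.
EMBED_DIM = 512
W = rng.normal(size=(len(MODELS), EMBED_DIM))

def route_from_embedding(embedding: np.ndarray) -> str:
    """Pick the model whose routing-head logit is highest."""
    logits = W @ embedding
    return MODELS[int(np.argmax(logits))]

fake_clip_embedding = rng.normal(size=EMBED_DIM)  # stand-in for CLIP output
print(route_from_embedding(fake_clip_embedding))
```

In the deployed blueprint the embedding comes from a CLIP server (CLIP-as-Service), which lets text and image requests share one routing head.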

Infrastructure

  • NVIDIA Triton Inference Server
  • NVIDIA NeMo Agent Toolkit - Router backend framework
  • vLLM - High-performance LLM serving for Qwen models
  • CLIP-as-Service - CLIP embedding server for neural network routing

Minimum System Requirements

Hardware Requirements

  • Any NVIDIA GPU with an architecture newer than Volta™ (V100), such as Turing™ (T4), Ampere™ (A100, RTX 30 series), Hopper™ (H100), or later.
  • Minimum 16GB GPU memory for Qwen3-1.7B model serving
  • Additional 8GB GPU memory if using neural network routing with CLIP

Software Requirements

  • Linux (Ubuntu 22.04 or later recommended) or macOS
  • Git LFS
  • Docker
  • Docker Compose
  • NVIDIA API key from build.nvidia.com (see instructions)
  • Python 3.12+ and uv package manager (for local development)
  • Azure OpenAI API access or standard OpenAI API key for GPT-5 Chat model
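With the prerequisites above in place, bring-up via the Docker Compose profiles mentioned earlier might look like the fragment below. The profile names are assumptions; check the repository's compose file for the actual ones.

```shell
# API key from build.nvidia.com (placeholder value shown)
export NVIDIA_API_KEY=nvapi-...

# Intent-based routing stack (profile name assumed)
docker compose --profile intent up -d

# or: CLIP + neural network routing stack (profile name assumed)
docker compose --profile neural up -d
```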

Ethical Considerations

NVIDIA believes trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure the models meet requirements for the relevant industry and use case and address unforeseen product misuse. For more detailed information on ethical considerations for the models, please see the Model Card++ Explainability, Bias, Safety and Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI concerns here.

License

Use of the models in this blueprint is governed by the NVIDIA AI Foundation Models Community License.

Terms of Use

GOVERNING TERMS: The software is governed by the NVIDIA Software License Agreement and Product-Specific Terms for NVIDIA AI Products. Use of the Complexity and Task Qualifier model is governed by the NVIDIA Open Model License Agreement. Additional Information: MIT License.

Meta Llama 3.1 8B, Llama 3.1 70B Instruct

GOVERNING TERMS: The NIM container is governed by the NVIDIA Software License Agreement and the Product Specific Terms for AI Products;

Mixtral 8x22B Instruct

GOVERNING TERMS: The NIM container is governed by the NVIDIA Software License Agreement and the Product Specific Terms for AI Products;

DeepSeek R1

GOVERNING TERMS: The NIM container is governed by the NVIDIA Software License Agreement and the Product Specific Terms for AI Products;

Use of these models is governed by the NVIDIA AI Foundation Models Community License Agreement. ADDITIONAL INFORMATION: Llama 3.1 Community License Agreement, Built with Llama.