
Refine AI Agents through Continuous Model Distillation with Data Flywheels

Build a data flywheel with NVIDIA NeMo microservices that continuously optimizes AI agents for latency and cost while maintaining accuracy targets.

NVIDIA Data Flywheel Blueprint

Deploying AI agents at scale introduces significant challenges, including high compute costs and latency bottlenecks—especially in performance-critical environments. Balancing model accuracy with efficiency often requires complex workflows and ongoing manual intervention.

The NVIDIA Data Flywheel Blueprint provides a systematic, automated solution to refine and redeploy optimized models that maintain accuracy targets while lowering resource demands. This blueprint establishes a self-reinforcing data flywheel, using production traffic logs and institutional knowledge to continuously improve model efficiency and accuracy.

Architecture Diagram

What's Included in the Blueprint

This blueprint automates continuous optimization of AI agents using NVIDIA NeMo microservices for data curation, customization, and evaluation. It systematically evaluates multiple models and automatically surfaces the most efficient option that meets defined latency, cost, and accuracy criteria. The architecture is adaptable to a wide variety of reasoning and task-specific use cases.
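The model-selection step described above can be sketched as a simple filter-and-rank pass: keep only candidates that meet the accuracy and latency targets, then surface the cheapest one. This is a minimal illustration of the idea, not the blueprint's actual implementation; the model entries, field names, and thresholds are hypothetical.

```python
# Hypothetical results from one flywheel evaluation run.
# Field names and values are illustrative, not the blueprint's schema.
candidates = [
    {"model": "large-70b", "accuracy": 0.95, "p95_latency_ms": 900, "cost_per_1k_tokens": 0.60},
    {"model": "medium-8b", "accuracy": 0.93, "p95_latency_ms": 250, "cost_per_1k_tokens": 0.08},
    {"model": "small-1b",  "accuracy": 0.81, "p95_latency_ms": 90,  "cost_per_1k_tokens": 0.02},
]

def surface_best(candidates, min_accuracy=0.90, max_latency_ms=500):
    """Keep models meeting accuracy and latency targets; pick the cheapest."""
    eligible = [
        c for c in candidates
        if c["accuracy"] >= min_accuracy and c["p95_latency_ms"] <= max_latency_ms
    ]
    return min(eligible, key=lambda c: c["cost_per_1k_tokens"], default=None)

best = surface_best(candidates)
print(best["model"])  # medium-8b: the cheapest model that meets both targets
```

Here the large model fails the latency target and the small model fails the accuracy target, so the mid-sized model is surfaced as the most efficient deployable option.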

Key Benefits

  • Reduce Latency and Cost: Identifies smaller models that are empirically equivalent to larger ones, enabling deployment of more efficient NIM microservices while maintaining performance.
  • Continuous Improvement Loop: Enables ongoing evaluation without retraining or relabeling—true "flywheel" behavior that runs indefinitely as new traffic flows in.
  • Data-Driven Decisions: Provides real comparisons across models using real traffic, backed by evaluator scores.
  • Standardized Optimization: Any application can opt into the flywheel with minimal effort, making it a foundational component for a wide variety of use cases.

Key Features

  • Production Data Pipeline: Collects real-world data from AI agent interactions and curates datasets from configurable log stores for evaluation, in-context learning, and fine-tuning.
  • Automated Model Experimentation: Leverages a deployment manager to dynamically spin up candidate NIMs from a model registry—including smaller or fine-tuned variants—and run experiments such as in-context learning and LoRA-based fine-tuning.
  • Semi-autonomous Operation: Operates without requiring any labeled data or human-in-the-loop curation.
  • Evaluation with NVIDIA NeMo Evaluator: Evaluates models using custom metrics and task-specific benchmarks (e.g., tool-calling accuracy), leveraging large language models (LLMs) as automated judges to reduce the need for human intervention.
  • LoRA-SFT Fine-Tuning with NVIDIA NeMo Customizer: Runs parameter-efficient fine-tuning on models using real-world data.
  • REST API Service: Provides intuitive REST APIs for seamless integration into existing systems and workflows, running continuously as a FastAPI-based service that orchestrates underlying NeMo microservices.
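Opting a workload into the flywheel through the REST API amounts to submitting a small JSON payload to the running service. The sketch below shows only payload construction; the endpoint path and field names are assumptions for illustration, not the blueprint's documented API schema.

```python
import json

# Hypothetical job payload -- field names here are illustrative assumptions,
# not the blueprint's documented schema.
job = {
    "workload_id": "chat-agent-tool-calls",  # tag identifying the agent workload
    "client_id": "support-bot",              # application opting into the flywheel
}
payload = json.dumps(job)

# A real submission would then POST this to the FastAPI service on port 8000,
# e.g. (hypothetical endpoint, not executed here):
#   requests.post("http://localhost:8000/api/jobs", data=payload,
#                 headers={"Content-Type": "application/json"})
print(payload)
```

Because the service runs continuously, submitting such a job is the only integration step an application needs; the orchestration of NeMo microservices happens behind the API.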

Minimum System Requirements

Hardware Requirements

  • With Self-hosted LLM Judge: 6× NVIDIA H100 or A100 GPUs
  • With Remote LLM Judge: 2× NVIDIA H100 or A100 GPUs
  • Minimum Memory: 1GB (512MB reserved for Elasticsearch)
  • Storage: Varies based on log volume and model size
  • Network: Ports 8000 (API), 9200 (Elasticsearch), 27017 (MongoDB), 6379 (Redis)

OS Requirements

  • Ubuntu 22.04

Software Dependencies

  • Elasticsearch 8.12.2
  • MongoDB 7.0
  • Redis 7.2
  • FastAPI (API server)
  • Celery (task processing)
  • Python 3.11
  • Docker Compose
  • Docker Engine
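The containerized dependencies above can be brought up together with Docker Compose. The fragment below is a minimal sketch based only on the versions and ports listed in this document; the service names, images, and the Elasticsearch settings are assumptions, and the actual blueprint ships its own compose configuration.

```yaml
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.12.2
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - ES_JAVA_OPTS=-Xms512m -Xmx512m  # matches the 512MB reservation noted above
    ports:
      - "9200:9200"
  mongodb:
    image: mongo:7.0
    ports:
      - "27017:27017"
  redis:
    image: redis:7.2
    ports:
      - "6379:6379"
```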

Software Used in This Blueprint

NVIDIA Technology

  • NVIDIA NeMo microservices (NeMo Customizer, NeMo Evaluator)
  • NVIDIA NIM microservices

3rd Party Software

  • Elasticsearch
  • MongoDB
  • Redis
  • FastAPI
  • Celery

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure the models meet requirements for the relevant industry and use case and address unforeseen product misuse. Please report security vulnerabilities or NVIDIA AI concerns here.

Terms of Use

GOVERNING TERMS: This service is governed by the NVIDIA API Trial Terms of Service.