nvidia

Cosmos Dataset Search

Accelerate post-training of end-to-end autonomous vehicle stacks with vector search and retrieval for large video datasets.

Autonomous vehicle (AV) development is a continuous process that requires repeated post-training to refine the end-to-end stack and add new capabilities. Post-training for a specific task requires a dataset targeted to that scenario, such as a highway curve in low-light conditions. To curate such datasets, developers must search through petabytes of multimodal training data, a labor-intensive process that depends on detailed annotations and high recall.

NVIDIA Cosmos Dataset Search is a vector search workflow that accelerates the data labeling and processing pipeline for AV developers. It uses the Cosmos Embed1 NIM to enable semantic search, bypassing the need for human annotation and improving recall. It connects to NVIDIA Cosmos Curator to refine datasets and retrieve the queried data.
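To illustrate the idea behind semantic vector search, here is a minimal sketch in plain Python. It ranks stored clip embeddings by cosine similarity to a query embedding. The clip names and three-dimensional vectors are hypothetical toy values, not output of the Cosmos Embed1 model, and the blueprint's actual index and similarity metric may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query, index, k):
    """Return the k (clip_id, score) pairs most similar to the query."""
    scored = [(clip_id, cosine_similarity(query, vec))
              for clip_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (hypothetical values chosen for illustration only).
toy_index = {
    "highway_curve_night": [0.9, 0.1, 0.8],
    "urban_intersection_day": [0.1, 0.9, 0.1],
    "highway_straight_day": [0.8, 0.2, 0.1],
}

# Stands in for the embedding of a text query such as
# "highway curve in low-light conditions".
toy_query = [1.0, 0.0, 0.9]

results = top_k(toy_query, toy_index, k=2)
for clip_id, score in results:
    print(f"{clip_id}: {score:.3f}")
```

Because text and video are embedded into the same vector space, a query in either modality can be compared against the same stored vectors, which is what removes the dependence on per-clip annotations.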

By enabling rapid, precise discovery of targeted scenarios, AV developers can:

  • Accelerate model post-training by supplying rare or problematic cases for retraining and validation.
  • Shorten feedback loops for error diagnosis and model improvement, driving faster innovation and safer deployments.
  • Reduce manual data triage, transforming vast data lakes into actionable resources for both engineers and machine learning practitioners.
  • Enhance overall safety and robustness by making continuous, data-driven scenario mining a core part of the development lifecycle.

Experience Walkthrough

The overall experience is divided into the following parts:

  1. Select data pipeline and collection
  2. Specify query type (text, video, embedding)
  3. Input query and choose number of desired results
  4. Click “Run” to view results
  5. Click “Download Now” to retrieve the blueprint
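The inputs gathered in steps 1 through 3 can be thought of as a single search request. The sketch below models that request as a small validated data structure. The class name `SearchRequest`, its field names, and the default of 10 results are assumptions made for illustration; they are not the blueprint's actual API.

```python
from dataclasses import dataclass

# The three query types listed in step 2 of the walkthrough.
QUERY_TYPES = ("text", "video", "embedding")


@dataclass(frozen=True)
class SearchRequest:
    """Hypothetical container for the inputs chosen in the walkthrough."""
    pipeline: str          # step 1: data pipeline
    collection: str        # step 1: collection to search
    query_type: str        # step 2: one of QUERY_TYPES
    query: str             # step 3: text, video reference, or serialized embedding
    num_results: int = 10  # step 3: assumed default, not from the blueprint

    def __post_init__(self):
        if self.query_type not in QUERY_TYPES:
            raise ValueError(
                f"query_type must be one of {QUERY_TYPES}, got {self.query_type!r}"
            )
        if not self.query:
            raise ValueError("query must not be empty")
        if self.num_results < 1:
            raise ValueError("num_results must be at least 1")


request = SearchRequest(
    pipeline="example-pipeline",      # hypothetical name
    collection="example-collection",  # hypothetical name
    query_type="text",
    query="highway curve in low-light conditions",
    num_results=5,
)
print(request)
```

Validating the request up front means an unsupported query type or an empty query fails immediately instead of after a round trip to the search service.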

Architecture Diagram

What’s Included in the Blueprint

Scripts

  • CLI (client.py) for creating a collection and ingesting data
  • Docker Compose script for standalone deployment
  • Helm chart for Kubernetes deployment

Containers

  • Cosmos Embed1 NIM container
  • Visual Search service container
  • UI service container

Minimum System Requirements

Hardware

Standalone Deployment

  • H200, H100, H20
  • A100
  • L40S, L4, L20

Kubernetes

  • Inference:

    • H200, H100, H20
    • A100
    • L40S, L4, L20
  • Indexing:

    • CUDA version 11+
    • H200, H100, H20
    • A100
    • L40S, L4, L20
  • Search:

    • GPU not required

Note: B100, GB200, RTX 6000 are not yet supported by the blueprint.

For the most up-to-date information, refer to the Blueprint Docs page.

Software

  • Operating System: Ubuntu 20.04 or newer
  • NVIDIA Driver version: 535 or newer
  • NVIDIA CUDA® version: 12.4 or newer
  • NVIDIA Container Toolkit version: 1.15.0 or newer
  • Docker version: 26 or newer
  • Python version: 3.11 or newer
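A quick way to compare an environment against the minimums above is to parse version strings into integer tuples. The sketch below does this with the standard library only. The dictionary keys and helper names are hypothetical, and apart from Python, the installed versions must be collected separately (for example from each tool's own version output) and passed in as strings.

```python
import re
import sys

# Minimum versions from the software requirements list above.
MINIMUMS = {
    "ubuntu": (20, 4),
    "nvidia_driver": (535,),
    "cuda": (12, 4),
    "container_toolkit": (1, 15, 0),
    "docker": (26,),
    "python": (3, 11),
}


def parse_version(text):
    """Turn '12.4' into (12, 4), stopping at the first non-numeric part."""
    parts = []
    for piece in text.strip().split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(found, minimum):
    """Compare two version tuples, padding the shorter one with zeros."""
    width = max(len(found), len(minimum))
    pad = lambda version: version + (0,) * (width - len(version))
    return pad(found) >= pad(minimum)


def check_all(found_versions):
    """Map each known component to True/False; missing ones count as False."""
    report = {}
    for name, minimum in MINIMUMS.items():
        text = found_versions.get(name)
        report[name] = text is not None and meets_minimum(parse_version(text), minimum)
    return report


# Only the Python version is detected here; the other values are
# hypothetical examples of what a user might supply.
example = {
    "ubuntu": "22.04",
    "nvidia_driver": "535.104.05",
    "cuda": "12.2",
    "container_toolkit": "1.15.0",
    "docker": "26.1.4",
    "python": ".".join(str(n) for n in sys.version_info[:3]),
}
for component, ok in check_all(example).items():
    print(f"{component}: {'ok' if ok else 'below minimum or unknown'}")
```

Tuple comparison avoids the usual pitfall of comparing version strings lexically, where "9" would sort after "12".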

Terms of Use

Use of the software is governed by the NVIDIA Software License Agreement and the Product-Specific Terms for NVIDIA AI Products. Use of the NVIDIA Cosmos-Embed1-224p model is governed by the NVIDIA Community Model License. Use of the dataset is governed by the NVIDIA Autonomous Vehicle Dataset License Agreement. ADDITIONAL INFORMATION: For the Cosmos-Embed1-224p model, Apache 2.0 and MIT License.