NVIDIA Cosmos

Cosmos is NVIDIA’s World Foundation Model Development Platform, which provides the tools to either fine-tune existing models or train new models from scratch.

Cosmos Model Family

Cosmos World Foundation Models (WFMs) are a family of highly performant pre-trained models purpose-built for generating physics-aware videos used for training robots. With Cosmos, developers can simulate a world in which robots function and train them to act and react responsibly before they are deployed in the real world.

The Cosmos WFM family currently comprises four main components: NeMo Curator, Cosmos Tokenizer, Cosmos Guardrail, and the Cosmos World Foundation Model itself. NeMo Curator is a video curation pipeline that takes raw video frames, splits them into meaningful segments, and annotates them with semantic tags, object labels, and scene descriptions. The annotated segments are then fed into the Cosmos Tokenizer, which produces a sequence of tokens. This step reduces data dimensionality, enabling the Cosmos World Foundation Model to handle large or complex inputs effectively during training. Cosmos WFM then consumes the curated and annotated video segments and learns the underlying physics and visual dynamics from real-world data. When queried, Cosmos WFM outputs new token sequences that are then decoded back into high-resolution, physically realistic synthetic videos.

Cosmos WFMs are pretrained on large-scale video datasets to expose them to a broad range of visual experiences, enabling them to serve as generalists. To construct a specialized WFM, developers are expected to fine-tune Cosmos WFM using additional data collected from a specific use case. This additional data helps adapt Cosmos WFM to the intended use case so that it performs well under real-world conditions.

Governing Terms/Terms of Use

All Cosmos WFMs are deployed globally and are covered under NVIDIA’s Open Model License Agreement. This license agreement confirms that:

Models are commercially usable.
You are free to create and distribute derivative models.
NVIDIA does not claim ownership of any outputs generated using the models or derivative models.

If users bypass, disable, reduce the efficacy of, or circumvent any technical limitation, safety guardrail or associated safety guardrail hyperparameter, encryption, security, digital rights management, or authentication mechanism contained in the Model, the user’s rights under the NVIDIA Open Model License Agreement will automatically terminate. If users are interested in a custom license, they may contact cosmos-license@nvidia.com.

Specific Risk Areas and Mitigations

WFMs can produce unrealistic outputs, generate unsafe content, or inadvertently amplify societal biases reflected in their training data. Collectively, these risks underscore the need for technical mitigations and careful evaluation before leveraging Cosmos WFM in real-world applications.

Cosmos Guardrail

To support the safe use of our world foundation models, we developed a comprehensive guardrail system. Cosmos Guardrail consists of two stages: pre-Guard and post-Guard. The pre-Guard stage blocks harmful prompts using two components: a blocklist filter that performs a lemmatized, whole-word keyword search, and Aegis-AI-Content-Safety-LlamaGuard-LLM-Defensive-1.0, a fine-tuned version of Llama-Guard trained on NVIDIA’s Aegis Content Safety Dataset. It then further sanitizes the user prompt by processing it through the Cosmos Text2World Prompt Upsampler. The post-Guard stage blocks harmful visual outputs using a video content safety classifier and a face blur filter.

Cosmos pre-Guard first uses a simple blocklist-based checker for unsafe keyword detection. It is designed to block explicitly harmful generations by searching the prompt for keywords from a hard-coded blocklist containing a large corpus of explicit and objectionable words. Input words are lemmatized using WordNetLemmatizer, a tool that uses a lexical database of the English language to extract the root word from its variants. These lemmatized words are then compared to the words in the hard-coded blocklist, and the entire prompt is rejected if any blocklisted word is found.
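The blocklist check described above can be illustrated with a minimal sketch. This is not the Cosmos implementation: the blocklist contents, the function names, and the crude suffix-stripping lemmatizer are all placeholders chosen so that the example runs without external data.

```python
import re
from typing import Callable, Set

# Hypothetical stand-in blocklist; the real blocklist is not reproduced here.
BLOCKLIST: Set[str] = {"badword", "slur"}


def naive_lemmatize(word: str) -> str:
    """Crude placeholder for a real lemmatizer: strips a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def is_blocked(
    prompt: str,
    blocklist: Set[str] = BLOCKLIST,
    lemmatize: Callable[[str], str] = naive_lemmatize,
) -> bool:
    """Return True if any whole word, raw or lemmatized, is on the blocklist."""
    words = re.findall(r"[a-z']+", prompt.lower())
    return any(w in blocklist or lemmatize(w) in blocklist for w in words)
```

Because matching is done on whole words, a word that merely contains a blocklisted string (for example "badwordish") is not rejected, while an inflected form such as "badwords" is. With NLTK and its WordNet data installed, `WordNetLemmatizer().lemmatize` could be passed as the `lemmatize` argument in place of the placeholder.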

As the second line of defense, Cosmos pre-Guard uses Aegis-AI-Content-Safety-LlamaGuard-LLM-Defensive-1.0 to detect unsafe content in semantically complex prompts. Aegis classifies prompts into 13 critical safety risk categories, including violence, sexual, criminal planning, weapons, substance abuse, suicide, child sexual abuse material, hatred, harassment, threat, and profanity. If the input prompt is categorized as unsafe by this prompt filter, the video is not generated and an error message is displayed. Any prompt that does not fall into these categories is considered safe from the prompt-filtering standpoint.

Prior to passing the prompt to the world generation models, the prompt is further augmented and indirectly sanitized via the Cosmos Text2World Prompt Upsampler. This is a bespoke model that not only compensates for the lack of specificity in the prompt but also steers clear of objectionable denotations or connotations.
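The order of the pre-Guard steps (blocklist check, then safety classifier, then prompt upsampling) can be sketched as a small pipeline. The three callables are hypothetical stand-ins for the blocklist filter, the Aegis classifier, and the Prompt Upsampler; their signatures are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PreGuardResult:
    allowed: bool
    prompt: Optional[str]  # upsampled prompt when allowed, otherwise None
    reason: str            # "ok", "blocklist", or "classifier:<category>"


def pre_guard(
    prompt: str,
    keyword_check: Callable[[str], bool],                # True if a blocklisted word is found
    safety_classifier: Callable[[str], Optional[str]],   # unsafe category name, or None if safe
    upsampler: Callable[[str], str],                     # returns a more detailed prompt
) -> PreGuardResult:
    # Stage 1: cheap keyword check rejects explicit prompts immediately.
    if keyword_check(prompt):
        return PreGuardResult(False, None, "blocklist")
    # Stage 2: classifier catches unsafe prompts the keyword check misses.
    category = safety_classifier(prompt)
    if category is not None:
        return PreGuardResult(False, None, f"classifier:{category}")
    # Stage 3: only prompts that pass both filters are upsampled.
    return PreGuardResult(True, upsampler(prompt), "ok")
```

Running the cheap keyword check first means the heavier classifier is only invoked on prompts that are not already rejected.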

Cosmos post-Guard is a vision-domain guardrail that is activated after the world content has been generated. It comprises a video content safety filter and a face blur filter. The video content safety filter has been trained on carefully curated datasets and evaluated on human-annotated datasets created by the Cosmos Red Team. To calibrate model outputs for the intended use case in the robotics and autonomous vehicle domains, we also automatically detect and blur all faces. We use RetinaFace, a state-of-the-art face detection model, to identify facial regions with high confidence scores. For any generated face region larger than 20 × 20 pixels, we apply pixelation to obscure features while preserving the overall scene composition. Blurring all generated human faces in the video also reduces potential biases based on age, gender, race, and ethnicity in the output video.
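The pixelation step can be illustrated with a dependency-free sketch on a single grayscale frame. The face boxes are assumed to come from a detector such as RetinaFace, which is not called here; the box format, the block size, and the reading of "larger than 20 × 20" as both width and height exceeding 20 pixels are assumptions.

```python
from typing import List, Tuple

Frame = List[List[int]]          # grayscale frame as rows of pixel values
Box = Tuple[int, int, int, int]  # (x, y, width, height); hypothetical detector output

MIN_FACE_SIZE = 20  # from the text: only regions larger than 20 x 20 pixels are pixelated


def pixelate_region(frame: Frame, box: Box, block: int = 8) -> None:
    """Replace each block-sized tile inside the box with its mean value (in place)."""
    x0, y0, w, h = box
    for by in range(y0, y0 + h, block):
        for bx in range(x0, x0 + w, block):
            ys = range(by, min(by + block, y0 + h))
            xs = range(bx, min(bx + block, x0 + w))
            mean = sum(frame[y][x] for y in ys for x in xs) // (len(ys) * len(xs))
            for y in ys:
                for x in xs:
                    frame[y][x] = mean


def blur_faces(frame: Frame, boxes: List[Box], block: int = 8) -> Frame:
    """Return a copy of the frame with every sufficiently large face box pixelated."""
    out = [row[:] for row in frame]
    for box in boxes:
        _, _, w, h = box
        if w > MIN_FACE_SIZE and h > MIN_FACE_SIZE:
            pixelate_region(out, box, block)
    return out
```

Pixels outside the detected boxes are untouched, which is what keeps the overall scene composition intact.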

Balanced Datasets

Cosmos WFM is trained using both proprietary and publicly available video datasets. We curated about 100M video clips, each 2 to 60 seconds long, from a collection of 20M hours of video. For each clip, we use a VLM (the 13B-parameter VILA model) to generate one video caption per 256 frames. Because our goal is to create a WFM that can generate physically realistic videos, we use the video captions to curate the training dataset so that it covers a range of physical applications:

Driving (11%)
Hand motion and object manipulation (16%)
Human motion and activity (10%)
Spatial awareness and navigation (16%)
First person point-of-view (8%)
Nature dynamics (20%)
Dynamic camera movements (8%)
Synthetically rendered (4%)
Others (7%)
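As a quick sanity check, the category shares above sum to 100%. The sketch below records the mix and converts it into expected per-category counts for a sample of a given size; treating the percentages as shares of clip count is an assumption, and the function name is illustrative.

```python
from typing import Dict

# Category shares as listed above, in percent.
DATASET_MIX: Dict[str, int] = {
    "Driving": 11,
    "Hand motion and object manipulation": 16,
    "Human motion and activity": 10,
    "Spatial awareness and navigation": 16,
    "First person point-of-view": 8,
    "Nature dynamics": 20,
    "Dynamic camera movements": 8,
    "Synthetically rendered": 4,
    "Others": 7,
}


def expected_counts(total: int, mix: Dict[str, int] = DATASET_MIX) -> Dict[str, int]:
    """Split `total` clips across categories in proportion to the mix (rounded down)."""
    if sum(mix.values()) != 100:
        raise ValueError("category shares must sum to 100%")
    return {name: total * pct // 100 for name, pct in mix.items()}
```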

To keep the dataset well distributed, we employ a taxonomy-based classifier to label video types and prune those that introduce unrealistic behaviors, such as purely animated content or abstract patterns. Certain categories relevant to world foundation models (like human actions and interactions) are upsampled, while less critical ones (such as landscapes) are downsampled.

A significant amount of the initial video data is either semantically redundant or contains visual effects that may induce unwanted artifacts in the generated videos if not appropriately handled. We therefore designed a sequence of data processing steps to find the most valuable parts of the raw videos for training. Shot boundary detection identifies where one shot ends and another begins, after which all footage is re-encoded into a uniform, high-quality MP4 format to ensure consistent loading and reduce codec discrepancies. The resulting video segments undergo several filtering processes. Motion filtering removes clips that are static or excessively shaky, and tags the remaining clips with camera motion types to enhance training signals. Visual quality filtering uses a video assessment model trained on DOVER to discard the bottom 15% in perceptual quality and applies an image aesthetic model to exclude footage that is aesthetically poor. A deduplication step uses InternVideo2 embeddings to identify near-duplicate content and preserves the highest-resolution version for minimal quality loss.
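Two of the filtering steps above, discarding the bottom 15% by perceptual quality and deduplicating near-identical clips while keeping the highest-resolution copy, can be sketched as follows. The `Clip` fields, the cosine-similarity threshold, and the greedy strategy are assumptions for illustration; the quality scores and embeddings are assumed to have been computed by external models that are not called here.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Clip:
    clip_id: str
    quality: float              # perceptual quality score, higher is better
    height: int                 # vertical resolution in pixels
    embedding: Sequence[float]  # precomputed video embedding


def drop_bottom_percent(clips: List[Clip], percent: int = 15) -> List[Clip]:
    """Discard the lowest-scoring `percent` of clips by quality."""
    ranked = sorted(clips, key=lambda c: c.quality)
    n_drop = len(ranked) * percent // 100
    return ranked[n_drop:]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(clips: List[Clip], threshold: float = 0.95) -> List[Clip]:
    """Greedy dedup: visit clips from highest to lowest resolution and keep a
    clip only if it is not a near-duplicate of one that was already kept."""
    kept: List[Clip] = []
    for clip in sorted(clips, key=lambda c: c.height, reverse=True):
        if all(cosine(clip.embedding, k.embedding) < threshold for k in kept):
            kept.append(clip)
    return kept
```

Visiting clips in descending resolution order is what guarantees that the surviving member of each near-duplicate group is the highest-resolution one. The pairwise comparison is quadratic in the number of clips, so a dataset of this scale would need an approximate nearest-neighbor index or clustering instead.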

Evaluation Methods

We employ a dedicated red team to actively probe the system using both standard and adversarial examples that are collected in an internal attack prompt dataset. The resulting video outputs are annotated by a team of expert annotators, specially trained for this task, who rate each generated video on a scale of 1-5 across multiple categories of harm related to the safety taxonomy. These annotations also specify the start and end frames where the unsafe content is detected, which makes the annotations more precise. The red team also probes each guardrail component independently with targeted examples to identify weaknesses and improve performance in edge cases. As of the date of publication, the red team has tested and annotated over 10,000 distinct prompt-video pairs that were carefully crafted to cover a broad range of unsafe content. We separate our safety testing into four categories:

Targeted Unsafe Testing

Targeted unsafe testing involves generating a corpus of manually curated unsafe prompts. These are intended to emulate common unsafe interactions that are performed by non-technical users of the system with basic or limited knowledge of multimodal AI attack vectors. These have a high likelihood of being caught by prompt filters, e.g. “Video of a naked person”.

Adversarial Attack Testing

Adversarial attack testing involves generating a corpus of unsafe prompts following the styles of AI attacks published in the literature. This type of testing also leverages some (not all) of the prompts from content safety datasets like Aegis and automation tooling like Garak.

Prompt Upsampler Toxicity

Prompt Upsampler Toxicity refers to a phenomenon where automated methods are used to “upsample” or expand a prompt by adding detail or context and inadvertently introduce unsafe content. Monitoring and mitigating Prompt Upsampler Toxicity ensures that content moderation systems remain effective throughout the entire generation pipeline, preserving user trust and upholding ethical standards.

Accidental Mishap Testing

Accidental mishap testing involves emulating the experience of a user prompting the model with a benign prompt, and getting unsafe content in return. This is the hardest category to test, since it does not have a fixed method or protocol for generation.
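The red-team annotations described under Evaluation Methods (a 1-5 rating per harm category plus the start and end frames of the unsafe content) might be represented by a record like the one below. The field names and the inclusive frame range are assumptions, and the direction of the rating scale is deliberately left unspecified because the text does not state it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HarmAnnotation:
    pair_id: str      # hypothetical identifier of the prompt-video pair
    category: str     # harm category from the safety taxonomy
    rating: int       # annotator rating on the 1-5 scale
    start_frame: int  # first frame where unsafe content is detected
    end_frame: int    # last frame where unsafe content is detected (inclusive)

    def __post_init__(self) -> None:
        if not 1 <= self.rating <= 5:
            raise ValueError("rating must be between 1 and 5")
        if not 0 <= self.start_frame <= self.end_frame:
            raise ValueError("frames must satisfy 0 <= start_frame <= end_frame")

    @property
    def num_flagged_frames(self) -> int:
        """Number of frames covered by the annotation."""
        return self.end_frame - self.start_frame + 1
```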

Deployment

Cosmos WFM is released under an open, permissive NVIDIA license, allowing users to download the model weights and run the model on their own hardware. This means that developers can integrate the WFM into their existing workflows without depending on external APIs. They can also tailor the model to specific domain needs by retraining or fine-tuning it with their private data. This approach fosters innovation, especially for under-resourced stakeholders that cannot rely on paid services.

Once the model is downloaded, NVIDIA has less visibility into how or where Cosmos is deployed, reducing opportunities to enforce content policies or guardrails. Downloadable models grant complete control to users, but also transfer to them the responsibility for preventing misuse and implementing safety mechanisms such as watermarking and content moderation. Watermarking in the context of WFMs is crucial for traceability and for making users aware that generated content might not be authentic. Watermarks allow viewers and downstream users to identify AI-generated or AI-manipulated videos, helping prevent misinformation and misuse. Even though watermarking is typically the responsibility of the user, we encourage downstream users of Cosmos WFM to use open-source watermarking libraries. NVIDIA has actively promoted watermarking and has worked in consortiums and standards bodies to define common protocols for watermarking synthetic media.

Cosmos WFM is also hosted by NVIDIA at the NVIDIA API Catalog (build.nvidia.com) and accessible via a web-based user interface. In this case, NVIDIA manages infrastructure, updates, and safety features. End users with minimal machine learning expertise can harness powerful WFMs without worrying about infrastructure or setup. Hosted models give NVIDIA more oversight and moderation capabilities, for example:

Know Your Customer (KYC) and account verification ensures that users are who they claim to be, discouraging malicious actors and fostering accountability.
Usage monitoring securely records user activity and flags suspicious patterns, enabling traceability and compliance checks while helping identify harmful behavior.
Rate limiting prevents spamming and large-scale misuse, balancing computational resources and protecting against abuse or overwhelming the system.
Human review protocols provide an escalation path for questionable outputs or flagged user accounts. A dedicated moderation team makes the final decisions on content removals, user bans, or investigations.
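Of the controls listed above, rate limiting is the easiest to illustrate in code. The token-bucket sketch below is a generic technique, not a description of how NVIDIA's hosted service is implemented; the class name and parameters are illustrative.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.clock = clock
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and consume tokens if the request fits the current budget."""
        now = self.clock()
        # Refill in proportion to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A service would typically keep one bucket per user account or API key, so that a single heavy user cannot exhaust capacity shared with others. The injectable `clock` argument exists only to make the behavior easy to test.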

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When the model is downloaded or used in accordance with our terms of service, developers should work with their internal supporting team to ensure this system meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Getting Help/Support

Please report security vulnerabilities or NVIDIA AI Concerns here.

NVIDIA

cosmos3-nano-reasoner