nvidia/cosmos-1.0-autoregressive-5b
Generates future frames of a physics-aware world state from just an image or short video prompt, for physical AI development.
Cosmos-1.0-Autoregressive: A Suite of Autoregressive-based World Foundation Models
Model Overview
Description:
Cosmos World Foundation Models: A family of highly performant pre-trained world foundation models purpose-built for generating physics-aware videos and world states for physical AI development.
The Cosmos autoregressive models are a collection of pre-trained world foundation models ideal for predicting and rapidly generating video sequences from video or image inputs for physical AI. They can serve as building blocks for applications and research related to world generation. The models are ready for commercial use under the NVIDIA Open Model License agreement.
Model Developer: NVIDIA
Model Versions
In the Cosmos 1.0 release, the Cosmos Autoregressive WFM family includes the following models:
- Cosmos-1.0-Autoregressive-4B
- Given a 9-frame input video, predicts the future 24 frames.
- Given an image as the first frame, predicts the future 32 frames.
- Cosmos-1.0-Autoregressive-5B-Video2World
- Given text description and a 9-frame input video, predicts the future 24 frames.
- Given text description and an image as the first frame, predicts the future 32 frames.
- Cosmos-1.0-Autoregressive-12B
- Given a 9-frame input video, predicts the future 24 frames.
- Given an image as the first frame, predicts the future 32 frames.
- Cosmos-1.0-Autoregressive-13B-Video2World
- Given text description and a 9-frame input video, predicts the future 24 frames.
- Given text description and an image as the first frame, predicts the future 32 frames.
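The frame counts above follow one rule: every model extends its conditioning input to a 33-frame clip, so the number of predicted frames depends only on how many frames you supply. A minimal sketch of that arithmetic (the function name is illustrative, not part of the Cosmos API):

```python
TOTAL_FRAMES = 33  # every Cosmos-1.0 autoregressive model targets a 33-frame clip

def frames_to_generate(num_input_frames: int) -> int:
    """Return how many future frames the model predicts.

    An image counts as 1 conditioning frame (-> 32 generated frames);
    a 9-frame video leaves 24 frames to generate.
    """
    if num_input_frames not in (1, 9):
        raise ValueError("Cosmos-1.0 conditions on 1 (image) or 9 (video) frames")
    return TOTAL_FRAMES - num_input_frames
```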
License:
This model is released under the NVIDIA Open Model License. For a custom license, please contact cosmos-license@nvidia.com.
Under the NVIDIA Open Model License, NVIDIA confirms:
- Models are commercially usable.
- You are free to create and distribute Derivative Models.
- NVIDIA does not claim ownership of any outputs generated using the Models or Derivative Models.
Important Note: If you bypass, disable, reduce the efficacy of, or circumvent any technical limitation, safety guardrail or associated safety guardrail hyperparameter, encryption, security, digital rights management, or authentication mechanism contained in the Model, your rights under NVIDIA Open Model License Agreement will automatically terminate.
- Cosmos-1.0-Guardrail is the safety guardrail for this model.
Model Architecture:
Cosmos-1.0-Autoregressive-5B-Video2World is an autoregressive transformer model designed for world generation. The network is composed of interleaved self-attention, cross-attention, and feedforward layers as its building blocks. The cross-attention layers allow the model to condition on input text throughout the decoding process.
Input/Output Specifications
Input
- Input Type(s): Text+Image, Text+Video
- Input Format(s):
- Text: String
- Image: jpg, png, jpeg, webp
- Video: mp4
- Input Parameters:
- Text: One-dimensional (1D)
- Image: Two-dimensional (2D)
- Video: Three-dimensional (3D)
- Other Properties Related to Input:
- The input string should contain fewer than 300 words and should provide descriptive content for world generation, such as a scene description, key objects or characters, background, and any specific actions or motions to be depicted within the 1-second duration.
- The input image and video should be of 1024x640 resolution.
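A caller might check those constraints before invoking the model. A hedged sketch using only the standard library; `validate_prompt` and `validate_resolution` are illustrative names, not Cosmos APIs, and the limits are taken directly from the specification above:

```python
EXPECTED_RESOLUTION = (1024, 640)  # (width, height) required by the model
MAX_PROMPT_WORDS = 300             # prompt must contain fewer than 300 words

def validate_prompt(prompt: str) -> None:
    """Raise if the text prompt exceeds the documented word limit."""
    n_words = len(prompt.split())
    if n_words >= MAX_PROMPT_WORDS:
        raise ValueError(
            f"prompt has {n_words} words; must be fewer than {MAX_PROMPT_WORDS}"
        )

def validate_resolution(width: int, height: int) -> None:
    """Raise if the input image/video is not 1024x640."""
    if (width, height) != EXPECTED_RESOLUTION:
        raise ValueError(
            f"input must be {EXPECTED_RESOLUTION[0]}x{EXPECTED_RESOLUTION[1]}, "
            f"got {width}x{height}"
        )
```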
Output
- Output Type(s): Video
- Output Format(s): mp4
- Output Parameters: Three-dimensional (3D)
- Other Properties Related to Output:
- For text+image input, the generated video will be a 32-frame clip with a resolution of 1024x640 pixels, conditioned on the input image as the first video frame.
- For text+video input, the generated video will be a 24-frame clip with a resolution of 1024x640 pixels, conditioned on the first 9 frames of the input video.
- The content of the video will visualize the input text description as a short animated scene, capturing the main elements mentioned in the input.
Software Integration:
Runtime Engine(s):
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Blackwell
- NVIDIA Hopper
- NVIDIA Ampere
Note: We have only tested inference with BF16 precision.
Operating System(s):
- Linux (We have not tested on other operating systems.)
Usage
- See Cosmos for details.
Evaluation
Please see our technical paper for detailed evaluations.
Inference Time and GPU Memory Usage
These numbers may vary based on system specifications and are provided for reference only.
| Offloading Strategy | Cosmos-1.0-Autoregressive-5B-Video2World | Cosmos-1.0-Autoregressive-13B-Video2World |
|---|---|---|
| No offloading | 66.2 GB | > 80 GB |
| Guardrails | 58.7 GB | 76.6 GB |
| Guardrails & T5 encoder | 41.3 GB | 58.0 GB |
| Guardrails & T5 encoder & Diffusion decoder | 29.0 GB | 46.9 GB |
| Guardrails & T5 encoder & Diffusion decoder & Tokenizer | 28.8 GB | 46.7 GB |
| Guardrails & T5 encoder & Diffusion decoder & Tokenizer & AR model | 21.1 GB | 30.9 GB |
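Since more offloading trades speed for memory, a deployment script could pick the least-aggressive strategy that fits on the available GPU. A sketch under the assumption that the 5B figures above are peak usage; the strategy labels and `pick_strategy` helper are our own, not part of the Cosmos tooling:

```python
# Peak GPU memory (GB) for Cosmos-1.0-Autoregressive-5B-Video2World,
# copied from the table above, ordered from least to most offloading.
MEMORY_5B_GB = [
    ("no offloading", 66.2),
    ("guardrails", 58.7),
    ("guardrails + T5", 41.3),
    ("guardrails + T5 + diffusion decoder", 29.0),
    ("guardrails + T5 + diffusion decoder + tokenizer", 28.8),
    ("everything incl. AR model", 21.1),
]

def pick_strategy(available_gb: float, table=MEMORY_5B_GB) -> str:
    """Return the least-aggressive offloading strategy that fits in memory."""
    for name, peak_gb in table:
        if peak_gb <= available_gb:
            return name
    raise RuntimeError(f"{available_gb} GB is below the minimum footprint")
```

For example, an 80 GB H100 needs no offloading for the 5B model, while a 24 GB GPU must offload everything.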
End-to-end inference runtime on one H100 GPU (after model initialization), with no offloading for the 5B model and guardrail offloading for the 13B model:
| Cosmos-1.0-Autoregressive-5B-Video2World | Cosmos-1.0-Autoregressive-13B-Video2World |
|---|---|
| ~73 seconds | ~150 seconds |
Failure Analysis
Our models support video extension up to 33 frames. Starting from either a single image or a 9-frame video input, they generate the remaining frames to reach the 33-frame length (32 or 24 generated frames, respectively).
We have evaluated all eight possible configurations (4 models × 2 vision input types: image or video) using 100 test videos from physical AI domains. Below are the failure rates for each configuration:
| Model | Image input | Video input (9 frames) |
|---|---|---|
| Cosmos-1.0-Autoregressive-4B | 15% | 1% |
| Cosmos-1.0-Autoregressive-5B-Video2World | 7% | 2% |
| Cosmos-1.0-Autoregressive-12B | 2% | 1% |
| Cosmos-1.0-Autoregressive-13B-Video2World | 3% | 0% |
We define failure cases as videos with severe distortions, such as:
- Sudden appearance of large unexpected objects
- Video degrading to a single solid color
Note that the following are not considered failures in our analysis:
- Static video frames
- Minor object distortions or artifacts
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloading or using this model in accordance with our terms of service, developers should work with their internal model team to ensure it meets the requirements of the relevant industry and use case and addresses unforeseen product misuse.
For more detailed information on ethical considerations for this model, please see the subcards of Explainability, Bias, Safety & Security, and Privacy below. Please report security vulnerabilities or NVIDIA AI Concerns here.