
Generates physics-aware video world states for physical AI development using text prompts and multiple spatial control inputs derived from real-world data or simulation.
Cosmos-Transfer2.5: A family of highly performant pre-trained world foundation models purpose-built for generating physics-aware images, videos, and world states aligned with the input control conditions.
Cosmos-Transfer2.5 diffusion models are a collection of diffusion-based world foundation models that generate dynamic, high-quality images and videos from text, image, or control video inputs. They can serve as building blocks for applications and research related to world generation. This model is ready for commercial and non-commercial use.
Model Developer: NVIDIA
The Cosmos-Transfer2.5 diffusion-based model family includes the following models:
The model produces 720p video at 16 FPS.
The model was trained on 720p video at 10 FPS.
The trial service is governed by the NVIDIA API Trial Terms of Service; use of the model is governed by the NVIDIA Open Model License. Additional information: Apache License 2.0.
Global
Physical AI: encompassing robotics, autonomous vehicles (AV), and more.
Github [10/06/2025] via https://github.com/nvidia-cosmos/cosmos-transfer2.5
HuggingFace [10/06/2025] via https://huggingface.co/collections/nvidia/cosmos-transfer25-6864569b8acaf966a107bfe3
Cosmos-Transfer2.5-2B is a diffusion transformer model designed for video denoising in the latent space, modulated by multiple control branches.
The diffusion transformer network ("the base model") is composed of interleaved self-attention, cross-attention, and feedforward layers as its building blocks. The cross-attention layers allow the model to condition on the input text throughout the denoising process. Before each layer, adaptive layer normalization is applied to embed the denoising timestep information. When an image or video is provided as input, its latent frames are concatenated with the generated frames along the temporal dimension. Augmented noise is added to the conditional latent frames to bridge the gap between training and inference.
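To make the block structure above concrete, here is a minimal NumPy sketch of one such transformer block: adaptive layer normalization modulated by the denoising-time embedding is applied before each sub-layer, self-attention mixes video tokens, and cross-attention conditions on text tokens. All function names, parameter shapes, and the single-head attention are illustrative assumptions for exposition, not the released implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ada_ln(x, t_emb, w_scale, w_shift):
    # Adaptive layer norm: per-channel scale/shift are predicted
    # from the denoising-timestep embedding t_emb (hypothetical shapes).
    scale = t_emb @ w_scale
    shift = t_emb @ w_shift
    return layer_norm(x) * (1 + scale) + shift

def attention(q, k, v):
    # Plain single-head scaled dot-product attention.
    logits = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(logits - logits.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)
    return w @ v

def dit_block(x, text, t_emb, p):
    # Self-attention over video latent tokens (adaLN applied before the layer).
    h = ada_ln(x, t_emb, p["s1"], p["b1"])
    x = x + attention(h @ p["wq"], h @ p["wk"], h @ p["wv"])
    # Cross-attention: queries from video tokens, keys/values from text tokens,
    # which is how text conditioning enters every denoising step.
    h = ada_ln(x, t_emb, p["s2"], p["b2"])
    x = x + attention(h @ p["wq2"], text @ p["wk2"], text @ p["wv2"])
    # Position-wise feedforward.
    h = ada_ln(x, t_emb, p["s3"], p["b3"])
    return x + np.maximum(h @ p["w1"], 0) @ p["w2"]
```

In the full model, many such blocks are stacked, attention is multi-head, and tokens span space and time; the sketch only shows where the timestep and text conditioning attach.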
The control branch is formed by replicating a few transformer blocks of the base model. It processes the control input video to extract control signals, which are injected into the corresponding transformer blocks of the base model, guiding the denoising process with structured control. When multiple control input videos are provided, each is processed by a dedicated control branch to extract modality-specific control signals. These control signals are then combined using spatial-temporal weight maps and injected into the corresponding transformer blocks of the base model.
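The multi-control fusion step can be sketched as a per-location weighted combination of modality-specific signals added residually to the base model's hidden states. This is a minimal NumPy illustration under assumed conventions (dense `(T, H, W, C)` hidden states, normalized weights, residual injection); the function name and normalization scheme are hypothetical, not the actual Cosmos-Transfer2.5 code.

```python
import numpy as np

def fuse_control_signals(base_hidden, control_signals, weight_maps):
    """Combine per-modality control signals with spatial-temporal weight
    maps and inject the result into a base-model block's hidden states.

    base_hidden:     (T, H, W, C) hidden states of a base-model block
    control_signals: dict name -> (T, H, W, C) output of that control branch
    weight_maps:     dict name -> (T, H, W) per-location modality weight
    """
    fused = np.zeros_like(base_hidden)
    total = np.zeros(base_hidden.shape[:3] + (1,))
    for name, signal in control_signals.items():
        w = weight_maps[name][..., None]  # broadcast weight over channels
        fused += w * signal
        total += w
    # Normalize where at least one modality is active, then add residually.
    fused = np.where(total > 0, fused / np.maximum(total, 1e-8), 0.0)
    return base_hidden + fused
```

With a single modality weighted 1 everywhere this reduces to plain residual injection; spatially varying weights let different modalities (e.g., depth vs. segmentation) dominate in different regions and frames.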
This model was developed based on: Cosmos-Predict2.5
Number of model parameters: 2,358,047,744
Input
Output
The video content visualizes the input text description as a short animated scene, capturing key elements within the specified time constraints.
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Runtime Engine(s):
Supported Hardware Microarchitecture Compatibility:
Note: Only BF16 precision is tested. Other precisions like FP16 or FP32 are not officially supported.
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
Data Modality
Data Collection Method by dataset
Labeling Method by dataset
Data Collection Method by dataset
Labeling Method by dataset
Please see our technical paper for detailed evaluations of the base model. The control models are built upon the base foundation model.
Data Collection Method:
Labeling Method:
System Requirements and Performance: This model requires 65.4 GB of GPU VRAM.
The following table shows generation times across different NVIDIA GPU hardware for single-GPU inference:
| GPU Hardware | Cosmos-Transfer2.5-2B (Segmentation) |
|---|---|
| NVIDIA B200 | 285.83 sec |
| NVIDIA H100 NVL | 719.4 sec |
| NVIDIA H100 PCIe | 870.3 sec |
| NVIDIA H20 | 2326.6 sec |
Operating System(s):
Despite various improvements in world generation for Physical AI, Cosmos-Transfer2.5 models still face technical and application limitations for world-to-world generation. In particular, they struggle to generate long, high-resolution videos without artifacts. Common issues include temporal inconsistency, camera and object motion instability, and imprecise interactions. The models may inaccurately represent 3D space, 4D space-time, or physical laws in the generated videos, leading to artifacts such as disappearing or morphing objects, unrealistic interactions, and implausible motions. As a result, applying these models for applications that require simulating physical law-grounded environments or complex multi-agent dynamics remains challenging.
Acceleration Engine: PyTorch, Transformer Engine
Test Hardware: H100, A100, GB200
NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When the model is downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure it meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please make sure you have proper rights and permissions for all input image and video content; if the input includes people, personal health information, or intellectual property, the generated image or video will not blur or preserve the proportions of the subjects included.
Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.
For more detailed information on ethical considerations for this model, please see the subcards of Explainability, Bias, Safety & Security, and Privacy below. Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.