Cosmos 3: Omnimodal World Models for Physical AI



NVIDIA reports that on 5 June 2026 it released Cosmos 3, an omnimodal foundation model that unifies language, image, video, audio and action in a single Mixture‑of‑Transformers (MoT) architecture. The release is open source under the Linux Foundation’s OpenMDW‑1.1 license and ships with code on GitHub, checkpoints on Hugging Face, synthetic datasets, and an evaluation benchmark that NVIDIA hosts at its Cosmos Lab site.

Unified Mixture‑of‑Transformers Architecture

Dual‑Tower Design and Joint Attention

The paper notes that each decoder layer contains two independent parameter sets: a reasoner tower that processes the autoregressive (AR) subsequence of language, image and video tokens; and a generator tower that handles the diffusion (DM) subsequence consisting of noisy latent tokens for images, videos, audio and action. The towers interact through dual‑stream joint attention, where DM tokens can attend bidirectionally to all AR tokens while AR tokens remain causally self‑attentive. This design preserves autoregressive language generation performance inherited from the pre‑trained vision‑language model while enabling high‑fidelity diffusion across modalities.
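The attention pattern described above can be illustrated with a small boolean mask. This is a minimal sketch, not NVIDIA's implementation: the function name `joint_attention_mask` is hypothetical, tokens are assumed to be ordered with the AR subsequence first, and full DM-to-DM attention is an assumption that the summary does not state explicitly.

```python
def joint_attention_mask(num_ar: int, num_dm: int) -> list[list[bool]]:
    """Return mask[q][k], True when query token q may attend to key token k.

    Assumed layout (hypothetical): indices [0, num_ar) are AR tokens,
    indices [num_ar, num_ar + num_dm) are DM tokens.
    """
    total = num_ar + num_dm
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if q < num_ar:
                # AR query: causal attention over earlier AR tokens only,
                # so the language path never sees noisy DM latents.
                mask[q][k] = k <= q
            else:
                # DM query: sees every AR token; attention among DM tokens
                # is assumed to be fully bidirectional as well.
                mask[q][k] = True
    return mask


if __name__ == "__main__":
    for row in joint_attention_mask(num_ar=3, num_dm=2):
        print("".join("x" if allowed else "." for allowed in row))
```

Because AR rows never reference DM columns, the reasoner's outputs are unaffected by the diffusion tokens in this sketch, which is one way to read the claim that autoregressive performance is preserved.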

Position Embedding with Absolute Temporal Modulation

The architecture adopts a 3‑D multimodal RoPE (MRoPE) scheme that assigns each token a temporal, height and width coordinate. For AR tokens the coordinates follow the standard MRoPE pattern; for DM tokens the system introduces an absolute temporal axis that maps physical time across modalities. Frame rates are modulated by a base temporal step derived from a 24‑fps reference, so that a 30‑fps clip and a 16‑fps clip occupy proportionally scaled positions along the timeline. A fixed temporal gap of 15,000 units is inserted between the AR and DM subsequences to prevent checkerboard artifacts during image generation.
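One plausible reading of the frame-rate modulation is sketched below. The exact formula is not given in this summary, so the step size `REFERENCE_FPS / fps`, the function name `temporal_positions` and the `ar_end` parameter are all assumptions made for illustration; only the 24 fps reference and the 15,000-unit gap come from the text.

```python
REFERENCE_FPS = 24.0         # reference frame rate named in the text
AR_DM_TEMPORAL_GAP = 15_000  # fixed gap between AR and DM subsequences


def temporal_positions(num_frames: int, fps: float, ar_end: float = 0.0) -> list[float]:
    """Temporal coordinates for DM frames on an absolute time axis.

    Assumption: one frame advances the axis by REFERENCE_FPS / fps units,
    so equal wall-clock durations map to equal spans at any frame rate.
    `ar_end` is a hypothetical name for the last AR temporal coordinate.
    """
    step = REFERENCE_FPS / fps
    start = ar_end + AR_DM_TEMPORAL_GAP
    return [start + i * step for i in range(num_frames)]


if __name__ == "__main__":
    one_second_30 = temporal_positions(num_frames=31, fps=30.0)
    one_second_16 = temporal_positions(num_frames=17, fps=16.0)
    # Under this assumption both clips reach the same coordinate after 1 s.
    print(one_second_30[-1], one_second_16[-1])
```

Under this assumption a 30‑fps clip advances 0.8 units per frame and a 16‑fps clip advances 1.5 units per frame, so frames captured at the same physical time share a temporal coordinate.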

Action Representation Across Modalities

The authors describe a unified action vector that aggregates up to three components: ego pose (relative camera or body frame), effector pose (end‑effector or hand configuration) and grasp state. Each component is encoded as a 3‑D translation plus a 6‑D rotation representation, with pseudo‑actions derived from differences between consecutive SE(3) poses to avoid embodiment‑specific controller details. Domain‑aware input and output projection layers map each embodiment’s native action space into the shared latent action dimension; during training these projections are optimized jointly with the MoT backbone.
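To make the pseudo-action idea concrete, the sketch below computes the relative transform between two consecutive poses and flattens it into a 9‑value vector. It rests on assumptions: the 6‑D rotation is taken to be the first two columns of the relative rotation matrix (a common continuous representation), the relative pose is expressed in the earlier pose's frame, and the ordering of the output vector is chosen for illustration. The function names are hypothetical.

```python
Matrix = list[list[float]]
Vector = list[float]


def transpose(m: Matrix) -> Matrix:
    return [list(row) for row in zip(*m)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


def pseudo_action(rot_a: Matrix, trans_a: Vector,
                  rot_b: Matrix, trans_b: Vector) -> Vector:
    """Relative SE(3) motion from pose A to pose B as [translation(3), rotation(6)].

    The inverse of a rotation matrix is its transpose, so the relative
    pose inv(T_a) @ T_b has rotation R_a^T R_b and translation R_a^T (t_b - t_a).
    """
    rot_a_inv = transpose(rot_a)
    rel_rot = matmul(rot_a_inv, rot_b)
    rel_trans = matvec(rot_a_inv, [b - a for a, b in zip(trans_a, trans_b)])
    # Assumed 6-D rotation: first two columns of the relative rotation matrix.
    col0 = [rel_rot[i][0] for i in range(3)]
    col1 = [rel_rot[i][1] for i in range(3)]
    return rel_trans + col0 + col1


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(pseudo_action(identity, [0.0, 0.0, 0.0], identity, [1.0, 2.0, 3.0]))
```

Because the result depends only on the change in pose, the same vector describes a motion regardless of which robot, vehicle or camera produced it, which matches the stated goal of avoiding embodiment-specific controller details.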

Training Regimen for Reasoning and Generation

Reasoner Pre‑training on Vision‑Language Data

The team pre‑trains Cosmos 3’s reasoner on 22 million paired image-text and video-text samples, curated by a two‑stage filtering pipeline. First, semantic deduplication clusters multimodal conversations by joint embeddings; near duplicates with cosine similarity above 0.95 are discarded. Second, an AI judge, built from the Gemma‑4‑31B‑it vision‑language model, scores each sample on faithfulness, completeness and correctness, retaining only those scoring at least 2 in all three dimensions. The resulting corpus contains 42.9% OCR, 16.5% 2D grounding, 11.3% visual QA and smaller shares of captioning, reasoning and instruction‑following data.
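The two filtering stages can be sketched in a few lines. This is a simplified illustration, not the reported pipeline: it replaces the clustering step with a greedy pairwise comparison, and the sample layout (`id`, `embedding`, `scores` keys) and function names are hypothetical. Only the 0.95 similarity threshold and the minimum score of 2 on each of the three dimensions come from the text.

```python
import math

JUDGE_DIMENSIONS = ("faithfulness", "completeness", "correctness")


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def deduplicate(samples: list[dict], threshold: float = 0.95) -> list[dict]:
    """Greedy stand-in for cluster-based dedup: keep a sample only if it is
    not a near duplicate (similarity above threshold) of any kept sample."""
    kept: list[dict] = []
    for sample in samples:
        if all(cosine_similarity(sample["embedding"], other["embedding"]) <= threshold
               for other in kept):
            kept.append(sample)
    return kept


def passes_judge(scores: dict, minimum: int = 2) -> bool:
    """Retain a sample only if every judged dimension meets the minimum."""
    return all(scores[dim] >= minimum for dim in JUDGE_DIMENSIONS)


def curate(samples: list[dict]) -> list[dict]:
    return [s for s in deduplicate(samples) if passes_judge(s["scores"])]


if __name__ == "__main__":
    good = {"faithfulness": 2, "completeness": 3, "correctness": 2}
    weak = {"faithfulness": 3, "completeness": 1, "correctness": 3}
    data = [
        {"id": "a", "embedding": [1.0, 0.0], "scores": good},
        {"id": "b", "embedding": [0.999, 0.01], "scores": good},  # near duplicate of a
        {"id": "c", "embedding": [0.0, 1.0], "scores": weak},     # fails completeness
        {"id": "d", "embedding": [1.0, 1.0], "scores": good},
    ]
    print([s["id"] for s in curate(data)])
```

The greedy pass compares every sample against every kept sample, which would not scale to tens of millions of items; clustering by embedding first, as the text describes, limits comparisons to samples within the same cluster.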

Generator Multi‑Stage Curriculum

Cosmos 3’s generator is trained in a progressive curriculum. In the pre‑training stage it learns to reconstruct images, videos and audio from noisy latent tokens: 767 million images and 347 million video clips are sampled across 256p, 480p and 720p resolutions with a fixed token budget of 74,000 per sequence. Mid‑training introduces action tokens drawn from four physical‑AI pillars (egocentric motion, autonomous vehicles, robotics and camera control) and adds control‑conditioned generation (“video transfer”) that maps edge or depth controls to RGB output. Post‑training specializes the model for specific tasks: Cosmos3‑Super‑Text2Image, Cosmos3‑Super‑Image2Video, and Cosmos3‑Nano‑Policy‑DROID are fine‑tuned on compact curated datasets derived from synthetic and real sources.

Synthetic Datasets for Physical AI

The authors release five synthetic data collections under the SDG umbrella: SDG‑PhyxSim (rigid‑body collisions and fluid dynamics), SDG‑RobotSim (manipulation across six to eight robot embodiments), SDG‑DriveSim (autonomous driving scenarios), SDG‑SynHuman (digital human interactions) and SDG‑Warehouse (human‑forklift safety). These datasets provide high‑frequency control signals and rare scene compositions that are underrepresented in the web‑scale pre‑training corpus, allowing Cosmos 3 to learn physically grounded dynamics and robust action priors.

State‑of‑the‑Art Performance

Benchmark Scores Across Capabilities

NVIDIA reports that Cosmos3‑Super outperforms specialized baselines across a broad suite of tasks: it scores 91.36 on text‑to‑image generation, tops Artificial Analysis’s open‑source leaderboard, and posts the best policy score on the RoboArena benchmark for embodied robots. Cosmos3‑Nano, adapted from the Qwen3‑VL‑8B architecture, matches or exceeds comparable open models in visual reasoning and video generation at a smaller scale than the Super variant.

Open‑Source Release and Community Impact

The codebase is available at https://github.com/nvidia/cosmos, with checkpoints on Hugging Face under the names Cosmos3‑Super, Cosmos3‑Nano, Cosmos3‑Super‑Text2Image, Cosmos3‑Super‑Image2Video and Cosmos3‑Nano‑Policy‑DROID. Synthetic datasets and evaluation benchmarks are hosted as Hugging Face datasets (SDG‑PhyxSim, SDG‑RobotSim, etc.), and the project website provides documentation for reproducing the training pipeline. The OpenMDW‑1.1 license permits redistribution but requires attribution; NVIDIA has made all checkpoints open to community research.

Limitations and Evidence Gaps

The evaluation focuses largely on simulated data; real‑world robotic deployment results have not been reported yet, leaving a gap in evidence for high‑speed control loops or safety-critical interactions. Action prediction remains confined to discrete simulation actions, without continuous low‑level controller integration. Moreover, the model’s performance on very long video horizons (hundreds of seconds) has not been benchmarked beyond the 400‑frame limit used during training.

The Cosmos 3 Model Family and Open Questions

Beyond the Super and Nano variants, the family includes Cosmos3‑Edge, a 2-billion-parameter dense transformer trained from scratch rather than initialized from Qwen3‑VL weights, giving deployments a lighter option that keeps the reasoner-generator design. The open question the report itself frames is whether Cosmos 3’s unified action representation can bridge high‑level policy generation and low‑level hardware execution on real robots, something that remains unresolved until real‑world trials are reported.

Author

Cem Gülbal

IT Operations and System Monitoring Lead

Cem Gülbal is an Istanbul-based IT operations and system monitoring lead with more than 15 years of professional experience. He has worked across enterprise technology platforms including Jira, Grafana, Datadog and the Atlassian ecosystem. At Talk Tender, he writes research-driven analysis on artificial intelligence, quantum technology, robotics, space, science and cybersecurity, with a focus on how emerging technologies may shape work, society and everyday life.
