Robbyant Open Sources LingBot World: a Real Time World Model for Interactive Simulation and Embodied AI

Ant Group’s embodied AI unit, Robbyant, has open sourced LingBot-World, a large-scale world model that turns video generation into an interactive simulator for embodied agents, autonomous driving, and games. LingBot-World stands out for creating controllable environments with high visual fidelity, dynamic elements, and long temporal horizons, all while remaining responsive enough for real-time control.

### From Text-to-Video to Text-to-World

Traditional text-to-video models often produce short clips that look realistic but offer no interactivity. LingBot-World takes a different approach by functioning as an action-conditioned world model: it learns the transition dynamics of a virtual world, so keyboard and mouse inputs, along with camera movements, drive the evolution of future frames.

The model is trained to predict the conditional distribution of future video tokens given past frames, language prompts, and discrete actions. During training it predicts sequences of up to approximately 60 seconds, while during inference it can generate coherent video streams extending to around 10 minutes while keeping scene layout and structure stable.
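In rough notation (ours, not taken from the paper), this amounts to modeling the distribution of the next stretch of video conditioned on history, prompt, and actions:

```latex
% x_{1:t}: past video tokens, c: language prompt,
% a_{1:t+H}: discrete actions (keyboard, mouse, camera), H: prediction horizon
p_\theta\left(x_{t+1:t+H} \mid x_{1:t},\, c,\, a_{1:t+H}\right)
```

During training the horizon covers up to roughly 60 seconds of video; at inference the same conditional is rolled out repeatedly to reach the multi-minute streams described above.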

### Data Engine: From Web Video to Interactive Trajectories

A key component of LingBot-World is its unified data engine, which provides comprehensive supervision on how actions impact the world across diverse real scenes. The data acquisition process combines three main sources:
– Large-scale web videos featuring humans, animals, and vehicles from various perspectives
– Game data that pairs RGB frames with user controls like W, A, S, D, and camera parameters
– Synthetic trajectories rendered in Unreal Engine, including clean frames, camera details, and object layouts

Following data collection, a profiling stage standardizes the diverse corpus, filtering for resolution and duration, segmenting videos into clips, and estimating missing camera parameters using geometry and pose models. A vision language model then scores clips based on quality, motion magnitude, and view type, selecting a curated subset.
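A minimal sketch of how such a filter-then-score pipeline might look; the function names, thresholds, and scoring interface below are our assumptions, not taken from the release:

```python
from dataclasses import dataclass

@dataclass
class Clip:
    path: str
    height: int
    duration_s: float
    vlm_quality: float = 0.0   # filled in by the vision-language scorer
    motion_score: float = 0.0  # estimated motion magnitude

def passes_basic_filters(clip: Clip, min_height: int = 480,
                         min_s: float = 2.0, max_s: float = 60.0) -> bool:
    """Cheap resolution and duration gates applied before any expensive scoring."""
    return clip.height >= min_height and min_s <= clip.duration_s <= max_s

def curate(clips: list[Clip], score_fn, quality_threshold: float = 0.7) -> list[Clip]:
    """Keep clips that pass the cheap filters and then the VLM quality gate."""
    kept = []
    for clip in filter(passes_basic_filters, clips):
        clip.vlm_quality, clip.motion_score = score_fn(clip)  # VLM rates quality, motion, view type
        if clip.vlm_quality >= quality_threshold:
            kept.append(clip)
    return kept
```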

Additionally, a hierarchical captioning module provides three levels of text supervision (a schematic record is sketched after this list):
– Narrative captions for entire trajectories, encompassing camera motion
– Scene static captions describing environment layout without motion
– Dense temporal captions focusing on local dynamics within short time windows
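The field names and example content below are hypothetical, purely to make the three levels concrete:

```python
from dataclasses import dataclass

@dataclass
class HierarchicalCaptions:
    """Hypothetical container mirroring the three caption levels above."""
    narrative: str                                   # whole trajectory, including camera motion
    scene_static: str                                # environment layout only, no motion
    dense_temporal: list[tuple[float, float, str]]   # (start_s, end_s, local dynamics)

example = HierarchicalCaptions(
    narrative="The camera pans left across a harbor while a ferry departs.",
    scene_static="A harbor with docked boats, a pier, and warehouses in the background.",
    dense_temporal=[(0.0, 3.0, "Ferry pulls away from the pier."),
                    (3.0, 6.0, "Camera pans left toward the warehouses.")],
)
```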

This separation enables the model to distinguish static structure from motion patterns, which is crucial for maintaining long-term consistency.

### Architecture: MoE Video Backbone and Action Conditioning

LingBot-World builds upon Wan2.2, a 14 billion parameter image-to-video diffusion transformer known for strong open-domain video priors. The system extends Wan2.2 into a mixture-of-experts DiT with two experts of approximately 14 billion parameters each. Although the total parameter count reaches 28 billion, only one expert is active at each denoising step, so inference cost stays close to that of a dense 14 billion parameter model while capacity expands.
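Conceptually, the two-expert design is a hard switch on the denoising timestep: one expert covers one noise regime, the other covers the rest, and only the selected branch runs. The sketch below is illustrative; the switching rule, module names, and dimensions are our assumptions, not the released configuration:

```python
import torch
import torch.nn as nn

class TwoExpertDiT(nn.Module):
    """Minimal sketch: two same-sized denoiser experts, only one active per step."""

    def __init__(self, high_noise_expert: nn.Module, low_noise_expert: nn.Module,
                 switch_t: float = 0.5):
        super().__init__()
        self.high_noise_expert = high_noise_expert
        self.low_noise_expert = low_noise_expert
        self.switch_t = switch_t  # illustrative boundary in normalized timestep space [0, 1]

    def forward(self, latents: torch.Tensor, t: float, cond: torch.Tensor) -> torch.Tensor:
        # Total parameters equal the sum of both experts, but each forward pass
        # costs as much as one dense expert, since only one branch executes.
        expert = self.high_noise_expert if t >= self.switch_t else self.low_noise_expert
        return expert(latents, t, cond)

# Toy usage with dummy experts that ignore t and cond:
class DummyExpert(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, latents, t, cond):
        return self.proj(latents)

model = TwoExpertDiT(DummyExpert(), DummyExpert())
out = model(torch.randn(1, 16, 64), t=0.8, cond=torch.zeros(1, 16, 64))
```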

A training curriculum gradually extends training sequences from 5 seconds to 60 seconds, increasing the proportion of high-noise timesteps to stabilize global layouts over long contexts and prevent mode collapse during long rollouts.
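A toy version of such a curriculum could look like the stages below; the lengths and high-noise sampling probabilities are illustrative, not the published schedule:

```python
import random

# (training sequence length in seconds, probability of sampling a high-noise timestep)
CURRICULUM = [
    (5, 0.3),
    (15, 0.45),
    (30, 0.6),
    (60, 0.7),
]

def sample_training_config(stage: int) -> tuple[int, bool]:
    """Pick the clip length for this stage and whether the step trains on a high-noise timestep."""
    seconds, p_high_noise = CURRICULUM[stage]
    return seconds, random.random() < p_high_noise
```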

To enable interactivity, actions are directly injected into the transformer blocks. Camera rotations are encoded using Plücker embeddings, while keyboard actions are represented as multi-hot vectors over keys like W, A, S, D. These encodings are fused and passed through adaptive layer normalization modules to modulate hidden states in the DiT. Only the action adapter layers are fine-tuned, keeping the main video backbone frozen to retain visual quality from pre-training while learning action responsiveness from a smaller interactive dataset.
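A minimal sketch of this kind of action adapter, assuming a standard Plücker ray parameterization and a simple AdaLN-style scale-and-shift; the module names, dimensions, and key vocabulary are our assumptions:

```python
import torch
import torch.nn as nn

KEYS = ["W", "A", "S", "D"]  # illustrative key vocabulary

def keys_to_multi_hot(pressed: set[str]) -> torch.Tensor:
    """Multi-hot vector over the key vocabulary, e.g. {'W', 'D'} -> [1, 0, 0, 1]."""
    return torch.tensor([1.0 if k in pressed else 0.0 for k in KEYS])

def plucker_embedding(origins: torch.Tensor, directions: torch.Tensor) -> torch.Tensor:
    """6D Plücker ray coordinates (direction, moment = origin x direction); inputs are (..., 3)."""
    moments = torch.cross(origins, directions, dim=-1)
    return torch.cat([directions, moments], dim=-1)  # (..., 6)

class ActionAdaLN(nn.Module):
    """Hypothetical adapter: fuse camera and keyboard signals, then modulate DiT hidden
    states with adaptive layer norm. Only adapter modules like this would be fine-tuned."""

    def __init__(self, hidden_dim: int, cam_dim: int = 6, key_dim: int = len(KEYS)):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(cam_dim + key_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, 2 * hidden_dim),  # -> scale and shift
        )
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)

    def forward(self, hidden: torch.Tensor, cam_feat: torch.Tensor,
                key_feat: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, tokens, dim); cam_feat: (batch, 6); key_feat: (batch, len(KEYS))
        scale, shift = self.fuse(torch.cat([cam_feat, key_feat], dim=-1)).chunk(2, dim=-1)
        return self.norm(hidden) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

# Toy usage:
adapter = ActionAdaLN(hidden_dim=64)
hidden = torch.randn(2, 16, 64)
cam = plucker_embedding(torch.zeros(2, 3), torch.tensor([[0.0, 0.0, 1.0]] * 2))
keys = keys_to_multi_hot({"W", "D"}).expand(2, -1)
out = adapter(hidden, cam, keys)  # same shape as hidden
```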

Training involves both image-to-video and video-to-video continuation tasks, allowing the model to synthesize future frames from a single image or extend partial clips seamlessly. This results in an internal transition function capable of starting from arbitrary time points.

### LingBot-World-Fast: Distillation for Real-Time Use

The mid-trained model, LingBot-World Base, relies on multi-step diffusion and full temporal attention, both of which are costly for real-time interaction. To address this, the Robbyant team introduces LingBot-World-Fast, an accelerated variant.

The fast model is initialized from the high-noise expert and replaces full temporal attention with block causal attention. Within each temporal block, attention is bidirectional, while across blocks, it remains causal. This design supports key-value caching, enabling the model to stream frames autoregressively at lower cost.
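The attention pattern can be captured by a simple boolean mask: frames attend bidirectionally inside their own temporal block and only backward across blocks. A minimal sketch:

```python
import torch

def block_causal_mask(num_frames: int, block_size: int) -> torch.Tensor:
    """mask[q, k] = True means query frame q may attend key frame k. Frames in the
    same temporal block see each other; across blocks only earlier blocks are visible,
    which is what allows key-value caching during streaming."""
    block_ids = torch.arange(num_frames) // block_size
    return block_ids.unsqueeze(1) >= block_ids.unsqueeze(0)

print(block_causal_mask(num_frames=6, block_size=3).int())
```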

Distillation in LingBot-World-Fast uses a diffusion forcing strategy, training the student on a small set of target timesteps, including timestep 0, so that it is exposed to both noisy and clean latents. Distribution Matching Distillation is combined with an adversarial discriminator head: the adversarial loss updates only the discriminator, while the student network is updated with the distillation loss. This keeps training stable while preserving action following and temporal coherence.
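A heavily simplified sketch of such a training step is shown below. For illustration the distribution-matching term is replaced by a plain regression to the teacher output, and all names, losses, and the timestep set are placeholders rather than the released training code:

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, discriminator, latents, cond,
                      opt_student, opt_disc, target_timesteps=(0, 250, 500, 750)):
    """Illustrative diffusion-forcing style step with separate student and discriminator updates."""
    # Sample one of a small set of target timesteps (including t = 0), so the student
    # sees both noisy and clean latents over the course of training.
    t = torch.tensor(target_timesteps)[torch.randint(len(target_timesteps), (1,))]

    student_out = student(latents, t, cond)
    with torch.no_grad():
        teacher_out = teacher(latents, t, cond)

    # Discriminator update: only the adversarial loss flows into the discriminator.
    adv_loss = (F.softplus(-discriminator(teacher_out)).mean()
                + F.softplus(discriminator(student_out.detach())).mean())
    opt_disc.zero_grad(); adv_loss.backward(); opt_disc.step()

    # Student update: distillation loss only (a stand-in for the distribution-matching term).
    distill_loss = F.mse_loss(student_out, teacher_out)
    opt_student.zero_grad(); distill_loss.backward(); opt_student.step()
    return distill_loss.item(), adv_loss.item()
```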

In experiments, LingBot-World-Fast generates 480p video at 16 frames per second on a single GPU node, keeping end-to-end interaction latency under 1 second for real-time control.
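As a quick back-of-the-envelope check on those numbers:

```python
fps = 16                      # reported 480p throughput on one GPU node
frame_budget_ms = 1000 / fps  # ~62.5 ms of generation time per streamed frame
print(f"{frame_budget_ms:.1f} ms per frame, well inside a 1 second interaction budget")
```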

### Emergent Memory and Long Horizon Behavior

One of the standout features of LingBot-World is its emergent memory capability. The model can maintain global consistency without explicit 3D representations like Gaussian splatting. When the camera moves away from a landmark and returns after approximately 60 seconds, the structure reappears with consistent geometry. Similarly, when a car exits and re-enters the frame, it appears at a physically plausible location, rather than being frozen or reset.

Moreover, LingBot-World can sustain ultra-long rollouts, generating coherent video for up to 10 minutes with stable layout and narrative structure.

### VBench Results and Comparison to Other World Models

To quantitatively evaluate LingBot-World, the research team employed VBench on a curated set of 100 generated videos, each exceeding 30 seconds in length. The model was compared against two recent world models, Yume-1.5 and HY-World-1.5.

On VBench, LingBot-World outperformed the baselines, reporting higher scores in imaging quality, aesthetic quality, and dynamic degree. Notably, the dynamic degree margin was significantly larger, indicating richer scene transitions and more complex motion responding to user inputs. Motion smoothness and temporal flicker were comparable to the best baseline, with LingBot-World achieving the best overall consistency metric among the three models.

Additionally, a comparison with other interactive systems such as Matrix-Game-2.0, Mirage-2, and Genie-3 highlighted LingBot-World as one of the few fully open-sourced world models combining general domain coverage, long generative horizons, high dynamic degree, 720p resolution, and real-time capabilities.

### Applications: Promptable Worlds, Agents, and 3D Reconstruction

Beyond video synthesis, LingBot-World serves as a versatile testbed for embodied AI applications. It supports promptable world events, allowing text instructions to alter weather, lighting, style, or introduce local events like fireworks or moving animals over time, all while maintaining spatial structure.
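A hypothetical schedule for such promptable edits might look like this; the timings and prompts are invented for illustration:

```python
# (trigger time in seconds, text instruction applied from that point on)
event_schedule = [
    (0.0, "sunny afternoon, clear sky"),
    (20.0, "clouds roll in and light rain begins"),
    (45.0, "fireworks burst above the skyline"),
]
```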

LingBot-World can also be used to train downstream action agents, for example a small vision-language-action model based on Qwen3-VL-2B that predicts control policies from images. Because the generated video streams are geometrically consistent, they can additionally serve as inputs for 3D reconstruction pipelines, producing stable point clouds for indoor, outdoor, and synthetic scenes.
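As a rough sketch of what such a downstream policy head could look like (the feature dimension, action vocabulary, and module names are our assumptions, not the released agent):

```python
import torch
import torch.nn as nn

ACTIONS = ["W", "A", "S", "D", "NOOP"]  # illustrative discrete control vocabulary

class TinyPolicyHead(nn.Module):
    """Hypothetical head on top of frozen image features (e.g. from a small VLM),
    trained on frames rolled out by the world model."""

    def __init__(self, feat_dim: int = 768, num_actions: int = len(ACTIONS)):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim, 256), nn.GELU(),
                                  nn.Linear(256, num_actions))

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        return self.head(image_features)  # logits over keyboard actions

policy = TinyPolicyHead()
logits = policy(torch.randn(1, 768))
print(ACTIONS[logits.argmax(dim=-1).item()])
```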

### Key Takeaways

In summary, LingBot-World is an action-conditioned world model that expands text-to-video generation into text-to-world simulation. Key highlights include:
– Controllable environments with high visual fidelity and long temporal horizons
– Unified data engine combining web videos, game data, and Unreal Engine trajectories
– Architecture based on a 28B parameter mixture of experts diffusion transformer
– LingBot-World-Fast variant for real-time interaction
– Emergent memory and stable long-range structure
– Superior performance compared to other world models on VBench

For further details, refer to the Paper, Repo, Project page, and Model Weights.
