Recent generative video world models aim to simulate visual environment evolution, allowing an observer to interactively explore the scene via camera control. However, they implicitly assume that the world only evolves within the observer's field of view. Once an object leaves the observer's view, its state is "frozen" in memory, and revisiting the same region later often fails to reflect events that should have occurred in the meantime.
In this work, we identify and formalize this overlooked limitation as the "out-of-sight dynamics" problem, which impedes video world models from representing a continuously evolving world. To address this issue, we propose LiveWorld, a novel framework that extends video world models to support persistent world evolution. Instead of treating the world as static observational memory, LiveWorld models a persistent global state composed of a static 3D background and dynamic entities that continue evolving even when unobserved.
To maintain these unseen dynamics, LiveWorld introduces a monitor-based mechanism that autonomously simulates the temporal progression of active entities and synchronizes their evolved states upon revisiting, ensuring spatially coherent rendering. For evaluation, we further introduce LiveBench, a dedicated benchmark for the task of maintaining out-of-sight dynamics. Extensive experiments show that LiveWorld enables persistent event evolution and long-term scene consistency, bridging the gap between existing 2D observation-based memory and true 4D dynamic world simulation.
LiveWorld overview. Our system explicitly decouples world modeling into three processes. (1) Static Accumulation (Blue): Temporally-invariant backgrounds are fused into a static 3D point cloud via SLAM. (2) Dynamic Evolution (Green): Stationary monitors use the Evolution Engine to fast-forward the out-of-sight progression of active entities, lifting them into 4D point clouds. (3) State-aware Rendering (Purple): Both representations are projected onto the target camera trajectory. This geometric projection, alongside appearance references, guides the renderer to synthesize coherent observations reflecting the elapsed dynamics.
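The geometric projection in step (3) can be illustrated with a standard pinhole-camera model: fused static and dynamic points are transformed into the target camera's frame and projected to pixels. This is a minimal sketch; the function and variable names are illustrative assumptions, not LiveWorld's actual interface.

```python
import numpy as np

def project_points(points_world, K, R, t):
    """Project Nx3 world-frame points through a pinhole camera.

    K: 3x3 intrinsics; R: 3x3 world-to-camera rotation; t: 3-vector translation.
    Returns Nx2 pixel coordinates and a mask of points in front of the camera.
    """
    cam = points_world @ R.T + t      # world -> camera frame
    in_front = cam[:, 2] > 1e-6       # keep only points with positive depth
    proj = cam @ K.T                  # apply intrinsics
    px = proj[:, :2] / proj[:, 2:3]   # perspective divide
    return px, in_front

# Toy target camera: identity pose looking down +Z, principal point (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.zeros(3)

static_pts = np.array([[0.0, 0.0, 2.0]])   # e.g. from the static 3D point cloud
dynamic_pts = np.array([[0.5, 0.0, 2.0]])  # e.g. from the evolved dynamic entities
fused = np.vstack([static_pts, dynamic_pts])

px, mask = project_points(fused, K, R, t)
print(px)  # point on the optical axis lands at the principal point (320, 240)
```

In the actual system, the resulting 2D projections serve as geometric guidance for the renderer rather than as the final output.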
World State Formulation. We approximate the intractable 4D world state by decoupling it into two trackable representations: a temporally-invariant static 3D environment via T-axis projection, and 2D video sequences of dynamic entities via Z-axis projection. This structured approximation enables tractable world modeling while preserving both static and dynamic scene components.
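Under hypothetical notation (not taken from the paper), the decomposition above can be written as:

```latex
% W: full 4D world state; S: temporally-invariant static background obtained
% by projecting along the time axis; D_i: 2D video sequence of the i-th
% dynamic entity obtained by projecting along the depth (Z) axis.
W(x, y, z, t) \;\approx\;
  \underbrace{S(x, y, z)}_{\text{T-axis projection}}
  \;\cup\;
  \underbrace{\{\, D_i(u, v, t) \,\}_{i=1}^{N}}_{\text{Z-axis projection}}
```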
Given one or more preceding frames from the previous round, we first detect whether the scene visited by the observer contains active dynamic entities, using off-the-shelf VLMs and segmentation models. Upon a positive detection, we then check whether the entity and scene are already registered with an existing monitor. If not, a new monitor is registered to autonomously track and evolve the entity's dynamics.
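The detect-then-register flow above can be sketched as follows. The detection step stands in for the off-the-shelf VLM and segmentation models; all class and function names here are illustrative assumptions, not LiveWorld's actual interface.

```python
from dataclasses import dataclass, field

@dataclass
class Monitor:
    """Tracks one active entity in one scene while it is out of sight."""
    entity_id: str
    scene_id: str
    clock: float = 0.0  # elapsed out-of-sight time to fast-forward on revisit

@dataclass
class MonitorRegistry:
    monitors: dict = field(default_factory=dict)

    def ensure_registered(self, scene_id, detections):
        """Register a monitor for each newly detected active entity;
        entities already covered by an existing monitor are skipped."""
        newly_registered = []
        for entity_id in detections:
            key = (scene_id, entity_id)
            if key not in self.monitors:
                self.monitors[key] = Monitor(entity_id, scene_id)
                newly_registered.append(entity_id)
        return newly_registered

def detect_entities(frames):
    """Placeholder for VLM + segmentation detection on preceding frames."""
    return [f["entity"] for f in frames if f.get("active")]

registry = MonitorRegistry()
frames = [{"entity": "dog", "active": True},
          {"entity": "tree", "active": False}]
print(registry.ensure_registered("kitchen", detect_entities(frames)))  # ['dog']
# Revisiting the same scene registers nothing new:
print(registry.ensure_registered("kitchen", detect_entities(frames)))  # []
```

The registry keying on (scene, entity) pairs mirrors the validation step: a monitor is created only when no existing monitor already covers that entity in that scene.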