Unifying Object-Centric World Models and Diffusion Policy: A Hierarchical Framework for Multi-Stage Robotic Tasks

Raktim Gautam Goswami¹, Prashanth Krishnamurthy¹, Yann LeCun²,³, Farshad Khorrami¹
¹Tandon School of Engineering, New York University; ²Courant Institute of Mathematical Sciences, New York University; ³AMILabs
Overview of WorldDP, a hierarchical framework for multi-stage robotic manipulation. (a) At test time, we use our object-centric world model within a particle filter to optimize latent action sequences, employing a structured, object-based loss to find optimal subgoals. A Diffusion Policy (DP) then sequentially tracks and executes these subgoals to solve the task.

Abstract

Visual world models have shown great potential in learning complex system dynamics. Recent advancements leverage these models as transition functions within Model Predictive Control (MPC) frameworks to solve various control tasks. When applied to robotics, however, they are limited to single-stage tasks such as reaching or grasping, and struggle with multi-stage ones that demand complex sequential planning. In this work, we introduce WorldDP, a world model framework designed for multi-stage robotic manipulation. Our hierarchical approach uses a high-level world model as a transition function to optimize for feasible subgoals at runtime, which are subsequently reached by a low-level Diffusion Policy. To further aid in learning dynamics and planning, we incorporate object-centric representations that decouple environmental entities and enable us to plan sequentially with respect to each. Evaluated across several robotics benchmarks, WorldDP consistently outperforms existing baselines, validating that coupling the world model’s physically grounded planning with Diffusion Policy’s efficient execution yields superior multi-stage performance.

Method

Object-Centric Encoder

Object-Centric Encoder Training with SAM2 Guidance. The input image is mapped to patch-level features using a frozen DINOv2 backbone. Initialized slot tokens and patch features are processed by the Slot Corrector to yield latent slots, which a Slot Decoder maps to per-slot reconstructions and masks. The model is optimized with a reconstruction loss and a mask segmentation loss against SAM2-generated reference masks.

We adopt an object-centric state representation that decouples environment entities (the agent, objects, and background). The representation is learned on top of frozen DINOv2 patch features, with privileged SAM2 guidance during training. This facilitates better attention to relevant entities and enables individual planning for each entity during MPC.
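As a rough illustration of the two training terms named in the caption above, the sketch below combines a reconstruction loss with a mask segmentation loss. The specific choices here are assumptions, not details from the source: mean squared error for reconstruction, cross-entropy for the masks, the weight `lambda_mask`, and a fixed correspondence between slot index and reference entity. All function names are hypothetical.

```python
import math


def combine_slots(slot_recons, slot_masks):
    """Blend per-slot reconstructions using per-slot mask weights.

    slot_recons[k][p] is slot k's reconstructed feature vector at patch p.
    slot_masks[k][p] is slot k's mask weight at patch p (sums to 1 over k).
    """
    num_slots = len(slot_recons)
    num_patches = len(slot_recons[0])
    dim = len(slot_recons[0][0])
    return [
        [
            sum(slot_masks[k][p] * slot_recons[k][p][d] for k in range(num_slots))
            for d in range(dim)
        ]
        for p in range(num_patches)
    ]


def reconstruction_loss(recon, target):
    """Mean squared error over all patches and feature dimensions (assumed form)."""
    sq = [(r - t) ** 2 for rp, tp in zip(recon, target) for r, t in zip(rp, tp)]
    return sum(sq) / len(sq)


def mask_loss(slot_masks, ref_labels, eps=1e-8):
    """Cross-entropy against reference labels (assumed form).

    ref_labels[p] is the index of the reference entity covering patch p,
    e.g. derived from a SAM2 mask. Slot k is assumed to correspond to entity k.
    """
    nll = [-math.log(slot_masks[k][p] + eps) for p, k in enumerate(ref_labels)]
    return sum(nll) / len(nll)


def encoder_loss(slot_recons, slot_masks, target, ref_labels, lambda_mask=1.0):
    """Total loss: reconstruction term plus weighted mask term (hypothetical weighting)."""
    recon = combine_slots(slot_recons, slot_masks)
    return reconstruction_loss(recon, target) + lambda_mask * mask_loss(
        slot_masks, ref_labels
    )
```

With masks that agree with the reference labels and reconstructions that match the target, both terms are near zero; swapping the labels makes the mask term dominate.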

Conditional Diffusion Transformer Dynamics Model

WorldDP's dynamics model.

A Conditional Diffusion Transformer (CDiT) takes the object-centric states and a compressed latent action embedding and predicts the entity states at a future timestep. The action sequence is compressed into the latent action embedding by a transformer-based action encoder, and this embedding conditions the CDiT through Adaptive Layer Normalization (AdaLN) layers. During planning, the model operates autoregressively, iteratively feeding back predicted states and action chunks to roll out future states.
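To make the AdaLN conditioning concrete, here is a minimal sketch of the commonly used form: the token is layer-normalized without learned affine parameters, then scaled and shifted by values regressed from the conditioning vector. Whether WorldDP uses exactly this variant (for instance, with additional gating) is not stated in the source, and the weight names below are hypothetical.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weights, bias, x):
    """Dense layer: weights is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def adaln(token, cond, w_scale, b_scale, w_shift, b_shift):
    """Adaptive LayerNorm: modulate the normalized token with the conditioning.

    cond stands in for the latent action embedding; scale and shift are
    regressed from it, so the same token is transformed differently
    depending on the action.
    """
    scale = linear(w_scale, b_scale, cond)
    shift = linear(w_shift, b_shift, cond)
    normed = layer_norm(token)
    return [n * (1.0 + s) + t for n, s, t in zip(normed, scale, shift)]
```

When the scale and shift projections output zeros, the layer reduces to a plain layer norm, which is why such projections are often initialized at zero.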

Hierarchical Task Execution

Pseudo-code of WorldDP's hierarchical planning. Given the current observation and target goal, a particle filter optimizes latent action sequences through the world model to generate optimal subgoals. These subgoals are tracked and executed via the low-level diffusion policy.

def WorldDPPlanning(obs, goal, action_means, **kwargs):
    """
    obs: environment observation
    goal: goal image
    action_means: initialized means for the particle filter
    kwargs: planning parameters (total_iters, total_samples, top_m,
            sigma, lambda_plan)

    The helpers sample_around_mean, get_costs, top_m_indices, diff_policy
    and the env object are assumed to be defined elsewhere.
    """
    # Unpack planning hyperparameters
    total_iters = kwargs["total_iters"]
    total_samples = kwargs["total_samples"]
    top_m = kwargs["top_m"]
    sigma = kwargs["sigma"]
    lambda_plan = kwargs["lambda_plan"]

    # 1. World Model Subgoal Planning via Particle Filtering
    for iter_idx in range(total_iters):
        # A. Generate action particles around the current means
        action_particles = sample_around_mean(action_means, sigma, total_samples)

        # B. Evaluate particles through autoregressive rollouts
        total_costs = []
        for action in action_particles:
            _, object_cost, contact_cost = get_costs(obs, action, goal)
            total_costs.append(object_cost + lambda_plan * contact_cost)

        # C. Select top-M candidates as means for next iteration
        #    (top_m_indices is assumed to return indices ordered best-first)
        top_m_idx = top_m_indices(total_costs, top_m)
        action_means = [action_particles[i] for i in top_m_idx]

    # 2. Extract optimal subgoal sequence from highest-ranking particle
    optimal_subgoals, _, _ = get_costs(obs, action_means[0], goal)

    # 3. Goal-Conditioned Diffusion Policy to reach subgoals
    curr_obs = env.get_info()
    is_success = False  # defined even if no subgoals are returned
    for subgoal in optimal_subgoals:
        low_level_actions = diff_policy(curr_obs, subgoal)
        curr_obs, is_success = env.step(low_level_actions)
    return is_success
Task Execution Example. Given the initial and target states, our framework decomposes the task into sequential planning phases. Step 1 optimizes subgoals (1A, 1B) for the buttons, followed by Step 2 optimizing subgoals (2A, 2B) for the drawer and window. A low-level Diffusion Policy (DP) executes local trajectories to bridge these subgoals (represented by orange arrows).

Task execution is formulated as a hierarchical two-tier MPC framework. The upper tier uses the world model to plan high-level subgoals from a target goal image, optimizing latent action sequences with a Particle Filter against an object-centric cost augmented by a contact-prediction term. The lower tier employs a goal-conditioned Diffusion Policy to realize these subgoals. This policy is robust for short-horizon control, fast to execute, and able to compensate for suboptimal subgoals.
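The upper-tier search can be tried end to end on a toy problem. The sketch below is only an illustration under loudly stated assumptions: the "world model" is a 2-D point mass whose actions are displacements, `object_cost` is the squared distance of the final state to the goal, and `contact_cost` is the squared distance of the closest predicted state to a hypothetical `contact_point`. None of these stand-ins, nor the default hyperparameters, come from the source; only the sample, evaluate, and keep-top-M loop mirrors the pseudo-code.

```python
import random


def rollout(state, actions):
    """Toy transition function: each action is a 2-D displacement."""
    x, y = state
    states = []
    for dx, dy in actions:
        x, y = x + dx, y + dy
        states.append((x, y))
    return states


def get_costs(state, actions, goal, contact_point):
    """Return predicted states plus toy object and contact costs."""
    states = rollout(state, actions)
    fx, fy = states[-1]
    object_cost = (fx - goal[0]) ** 2 + (fy - goal[1]) ** 2
    contact_cost = min(
        (x - contact_point[0]) ** 2 + (y - contact_point[1]) ** 2 for x, y in states
    )
    return states, object_cost, contact_cost


def plan_subgoals(state, goal, contact_point, horizon=3, total_iters=30,
                  total_samples=64, top_m=8, sigma=0.3, lambda_plan=0.1, seed=0):
    """Sample action sequences around the current means and keep the top-M."""
    rng = random.Random(seed)
    means = [[(0.0, 0.0)] * horizon]
    for _ in range(total_iters):
        # Keep the current means so the best cost never gets worse.
        particles = list(means)
        while len(particles) < total_samples:
            mean = rng.choice(means)
            particles.append(
                [(ax + rng.gauss(0.0, sigma), ay + rng.gauss(0.0, sigma))
                 for ax, ay in mean]
            )
        scored = []
        for particle in particles:
            _, object_cost, contact_cost = get_costs(
                state, particle, goal, contact_point
            )
            scored.append((object_cost + lambda_plan * contact_cost, particle))
        scored.sort(key=lambda item: item[0])  # lowest cost first
        means = [particle for _, particle in scored[:top_m]]
    subgoals, object_cost, contact_cost = get_costs(
        state, means[0], goal, contact_point
    )
    return subgoals, object_cost + lambda_plan * contact_cost


subgoals, cost = plan_subgoals((0.0, 0.0), (1.0, 1.0), (0.5, 0.5))
```

In this toy setting the predicted states of the best particle play the role of subgoals; in the framework described above they would be handed one by one to the low-level policy.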

Robotic Tasks

Example Start and Goal images from each of the robotic tasks.

Task Execution

WorldDP decomposes each multi-stage task into reachable subgoals executed by the low-level diffusion policy. Each clip shows the Start state, Ours (WorldDP execution), and the target Goal state.


World Model Rollouts

Open-Loop Trajectory Rollouts. Given an initial state and an action sequence, the world model generates a predicted trajectory. Comparing it with the ground truth demonstrates the model's accurate simulation capabilities. Latent states are decoded into images for visualization.
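An open-loop rollout of this kind can be sketched as a loop that feeds each prediction back into the model, followed by a per-step comparison with the reference trajectory. The names `open_loop_rollout`, `per_step_error`, and `step_fn` are hypothetical, and the toy `step_fn` used in the usage lines is a stand-in for a learned dynamics model.

```python
def open_loop_rollout(step_fn, init_state, action_chunks):
    """Roll a one-step model forward without observing the real environment.

    step_fn(state, chunk) returns the predicted next state; each prediction
    is fed back as the input for the following action chunk.
    """
    states = [init_state]
    for chunk in action_chunks:
        states.append(step_fn(states[-1], chunk))
    return states[1:]


def per_step_error(predicted, ground_truth):
    """Mean squared error between predicted and reference states at each step."""
    return [
        sum((p - g) ** 2 for p, g in zip(pred, ref)) / len(pred)
        for pred, ref in zip(predicted, ground_truth)
    ]


def toy_step(state, chunk):
    """Toy dynamics: every state dimension moves by the sum of the chunk."""
    return [value + sum(chunk) for value in state]


predicted = open_loop_rollout(toy_step, [0.0, 1.0], [[1.0, 1.0], [2.0]])
errors = per_step_error(predicted, [[2.0, 3.0], [4.0, 6.0]])
```

Because no real observation is used after the initial state, prediction errors can compound over the horizon, which is what a side-by-side comparison with ground truth makes visible.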

Cube-Single

Ground Truth: Cube-Single rollout
Prediction: Cube-Single rollout

Cube-Triple

Ground Truth: Cube-Triple rollout
Prediction: Cube-Triple rollout

Scene-Single

Ground Truth: Scene-Single rollout
Prediction: Scene-Single rollout

Results

Success rate (%) across multi-stage manipulation tasks. WorldDP consistently outperforms existing world-model and diffusion-based baselines. Best results per column are highlighted.

Cube-Triple & Scene-Single-Composite

For Cube-Triple, the 1-Cube, 2-Cubes, and 3-Cubes columns report success on at least one, at least two, and all three cubes, respectively.

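| Method | Cube-Triple: 1-Cube | Cube-Triple: 2-Cubes | Cube-Triple: 3-Cubes | Scene-Single-Composite: Button Press | Scene-Single-Composite: Full Task |
| --- | --- | --- | --- | --- | --- |
| DINO-WM | 62 | 10 | 0 | 8 | 0 |
| LeWM | 74 | 8 | 0 | 4 | 0 |
| HECRL* | 78 | 34 | 12 | 60 | 18 |
| DP40 | 42 | 12 | 0 | 64 | 0 |
| DP100 | 90 | 32 | 4 | 78 | 14 |
| WorldDP (Ours) | 100 | 72 | 30 | 60 | 20 |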
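The cumulative Cube-Triple columns can be computed from per-episode counts of solved cubes. The short sketch below uses made-up episode counts purely for illustration; the function name is hypothetical and the numbers are not taken from the experiments.

```python
def cumulative_success_rates(cubes_solved_per_episode, num_cubes=3):
    """Percentage of episodes with at least k cubes solved, for k = 1..num_cubes."""
    total = len(cubes_solved_per_episode)
    return [
        100.0 * sum(solved >= k for solved in cubes_solved_per_episode) / total
        for k in range(1, num_cubes + 1)
    ]


# Hypothetical episodes: number of cubes solved in each of four trials.
rates = cumulative_success_rates([3, 1, 0, 2])
```

Since an episode that solves all three cubes also counts toward the first two columns, the rates can only stay equal or decrease from left to right.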

Cube-Single & Scene-Single-Direct

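| Method | Cube-Single | Scene-Single-Direct: Button | Scene-Single-Direct: Drawer | Scene-Single-Direct: Window | Scene-Single-Direct: Average | Both-Task Average |
| --- | --- | --- | --- | --- | --- | --- |
| DINO-WM | 0 | 25 | 23.53 | 5.88 | 18 | 9 |
| LeWM | 0 | 12.5 | 11.76 | 23.53 | 16 | 8 |
| HECRL* | 98 | 31.25 | 52.94 | 5.88 | 30 | 63 |
| DP40 | 0 | 87.5 | 35.29 | 11.76 | 44 | 32 |
| DP100 | 98 | 50 | 23.53 | 5.88 | 26 | 63 |
| WorldDP (Ours) | 72 | 81.25 | 76.47 | 64.71 | 74 | 74.5 |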

Object-Centric Encoding

Object-Centric Encoding Visualization. Left: image; Right: predicted masks from the OCE embeddings. Because the OCE operates on patch-level DINOv2 features, masks are patch-level, yielding coarser, block-like boundaries that encapsulate each object of interest.

Panels: Cube-Single, Cube-Triple, Scene-Single.
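The block-like boundaries follow directly from how a patch-level mask is mapped back to pixels. A minimal sketch, assuming nearest-neighbor upsampling (the source does not state which upsampling is used for the visualization): every patch label is repeated over a `patch_size` by `patch_size` block. The function name is hypothetical, and the example uses a patch size of 2 for brevity, whereas DINOv2 models use 14-pixel patches.

```python
def upsample_patch_mask(patch_mask, patch_size):
    """Expand a grid of per-patch labels to pixel resolution.

    Each label is copied over a patch_size x patch_size block, so object
    boundaries can only fall on patch edges.
    """
    return [
        [label for label in row for _ in range(patch_size)]
        for row in patch_mask
        for _ in range(patch_size)
    ]


# A 2x2 grid of patch labels becomes a 4x4 pixel mask with blocky regions.
pixel_mask = upsample_patch_mask([[0, 1], [1, 0]], patch_size=2)
```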

Citation

@article{goswami2026worlddp,
       author = {Goswami, Raktim Gautam and Krishnamurthy, Prashanth and LeCun, Yann and Khorrami, Farshad},
        title = "{Unifying Object-Centric World Models and Diffusion Policy: A Hierarchical Framework for Multi-Stage Robotic Tasks}",
      journal = {arXiv preprint arXiv:2606.08775},
         year = {2026},
}