World Models Can Leverage Human Videos for Dexterous Manipulation

Raktim Gautam Goswami1,2, Amir Bar1, David Fan1, Tsung-Yen Yang1, Gaoyue Zhou1,2,
Prashanth Krishnamurthy2, Michael Rabbat1, Farshad Khorrami2, Yann LeCun1,2


Abstract

Dexterous manipulation is challenging because it requires understanding how subtle hand motion influences the environment through contact with objects. We introduce DexWM, a Dexterous Manipulation World Model that predicts the next latent state of the environment conditioned on past states and dexterous actions. To overcome the scarcity of dexterous manipulation datasets, DexWM is trained on over 900 hours of human and non-dexterous robot videos. To enable fine-grained dexterity, we find that predicting visual features alone is insufficient; therefore, we introduce an auxiliary hand consistency loss that enforces accurate hand configurations. DexWM outperforms prior world models conditioned on text, navigation, and full-body actions, achieving more accurate predictions of future states. DexWM also demonstrates strong zero-shot generalization to unseen manipulation skills when deployed on a Franka Panda arm equipped with an Allegro hand, outperforming Diffusion Policy by over 50% on average in grasping, placing, and reaching tasks.

DexWM: Dexterous Manipulation World Model


  • Learns Dexterous Manipulation dynamics from human videos
  • Fine-grained action space: Actions represented as differences in 3D hand keypoints and camera poses
  • Hand Consistency Loss to enable fine-grained dexterity (see the sketch below)
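
A minimal sketch of how such a hand consistency term can augment the latent prediction objective, assuming a predictor network that outputs the next latent state and a small hand-decoder head that regresses 3D hand keypoints from it; the names predictor, hand_decoder, and the weight lambda_hand are illustrative, not the paper's exact implementation.

import torch.nn.functional as F

def dexwm_loss(predictor, hand_decoder, z_past, action, z_next, hand_kpts_next,
               lambda_hand=1.0):
    """Sketch of a training objective with an auxiliary hand consistency term.

    z_past:         past latent states, e.g. (B, T, D)
    action:         dexterous action (keypoint deltas + camera motion), (B, A)
    z_next:         target next latent state from the visual encoder, (B, D)
    hand_kpts_next: ground-truth 3D hand keypoints at the next frame, (B, K, 3)
    """
    z_pred = predictor(z_past, action)          # predicted next latent state

    # Visual prediction loss: match the predicted latent to the encoder's latent.
    loss_latent = F.smooth_l1_loss(z_pred, z_next)

    # Hand consistency loss: decode hand keypoints from the predicted latent and
    # penalize deviations from the ground-truth hand configuration.
    kpts_pred = hand_decoder(z_pred)            # (B, K, 3)
    loss_hand = F.mse_loss(kpts_pred, hand_kpts_next)

    return loss_latent + lambda_hand * loss_hand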

Action Representation


Hand actions are represented as differences in 3D hand keypoints between frames (e.g., H_k^j - H_k^i for keypoint k at frames i and j), providing a unified representation of dexterous actions in DexWM. This is supplemented with camera motion, which captures the agent's movement.
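
As a concrete illustration, a dexterous action between two frames could be assembled roughly as below, assuming per-frame 3D hand keypoints and camera extrinsics are available; the pose parameterization and lack of normalization are simplifying assumptions, not the exact formulation used in DexWM.

import numpy as np

def make_action(hand_kpts_i, hand_kpts_j, cam_T_i, cam_T_j):
    """Build a dexterous action from frame i to frame j.

    hand_kpts_*: (K, 3) 3D hand keypoints per frame
    cam_T_*:     (4, 4) camera-to-world extrinsics per frame
    Returns a flat vector: per-keypoint deltas + relative camera translation/rotation.
    """
    kpt_delta = (hand_kpts_j - hand_kpts_i).reshape(-1)   # (K*3,) hand motion
    T_rel = np.linalg.inv(cam_T_i) @ cam_T_j              # camera motion i -> j
    cam_trans = T_rel[:3, 3]                              # (3,) translation
    cam_rot = T_rel[:3, :3].reshape(-1)                   # (9,) flattened rotation
    return np.concatenate([kpt_delta, cam_trans, cam_rot])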


For the DROID dataset, which uses parallel-jaw grippers, the gripper is approximated as a dexterous hand by placing dummy keypoints (shown as green points above) on concentric circles centered at the end-effector. The radii of these circles vary with the gripper's open/close state, mimicking finger spread.
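
A minimal sketch of this approximation, assuming the end-effector pose and a scalar gripper opening in [0, 1] are available; the number of circles, points per circle, and radius range are illustrative choices rather than the exact values used for DROID.

import numpy as np

def dummy_hand_keypoints(ee_pos, ee_rot, gripper_open,
                         n_circles=2, pts_per_circle=5, r_min=0.01, r_max=0.05):
    """Approximate a dexterous hand for a parallel-jaw gripper with dummy keypoints.

    ee_pos:       (3,) end-effector position
    ee_rot:       (3, 3) end-effector orientation
    gripper_open: scalar in [0, 1]; wider opening spreads the dummy "fingers" further
    Returns (n_circles * pts_per_circle, 3) keypoints on concentric circles
    centered at the end-effector, in the gripper's local x-y plane.
    """
    keypoints = []
    for c in range(1, n_circles + 1):
        # Radius grows with the gripper opening, mimicking finger spread.
        radius = (r_min + (r_max - r_min) * gripper_open) * c / n_circles
        for k in range(pts_per_circle):
            theta = 2.0 * np.pi * k / pts_per_circle
            local = np.array([radius * np.cos(theta), radius * np.sin(theta), 0.0])
            keypoints.append(ee_pos + ee_rot @ local)
    return np.stack(keypoints)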

Open-Loop Rollouts

Given the initial state and a dexterous action sequence, DexWM predicts future latent states autoregressively. Latent states are decoded into images for visualization.
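
A minimal sketch of such an open-loop rollout, assuming an encoder that maps the initial image to a latent state and a predictor that steps the latent forward one action at a time (conditioning on a longer history is omitted for brevity); decoding the latents into images for visualization is left out.

import torch

@torch.no_grad()
def open_loop_rollout(encoder, predictor, init_image, actions):
    """Autoregressively roll out latent states for a given action sequence.

    init_image: (1, C, H, W) initial observation
    actions:    (T, A) dexterous action sequence
    Returns the predicted latent states [z_1, ..., z_T].
    """
    z = encoder(init_image)                  # initial latent state z_0
    states = []
    for a in actions:
        z = predictor(z, a.unsqueeze(0))     # next latent, conditioned on the action
        states.append(z)                     # feed the prediction back in (open loop)
    return states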

Videos: six open-loop rollout examples, each showing DexWM's predicted frames alongside the ground truth.

Simulating Counterfactual Actions

Starting from the same initial state, DexWM predicts future states given different atomic actions for controlling the right hand.
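
A minimal sketch of how an atomic hand action could be constructed, assuming the hand action is a rigid per-keypoint 3D displacement with the camera held fixed; the axis conventions, keypoint count, and step size are assumptions for illustration only.

import numpy as np

def atomic_hand_action(direction, n_keypoints=21, step=0.02):
    """Rigidly translate all right-hand keypoints in one direction (camera fixed).

    direction: 'right', 'up', or 'forward' in the camera frame (assumed axes)
    Returns (n_keypoints, 3) keypoint deltas; apply repeatedly to move further.
    """
    axes = {
        "right":   np.array([1.0, 0.0, 0.0]),
        "up":      np.array([0.0, -1.0, 0.0]),   # image y points down (assumption)
        "forward": np.array([0.0, 0.0, 1.0]),
    }
    delta = step * axes[direction]
    return np.tile(delta, (n_keypoints, 1))      # same displacement for every keypoint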

Move Right Hand to the right:


Move Right Hand up:


Move Right Hand forward:


Action Transfer from Reference Sequence

Transferring actions from a reference sequence to a new environment using DexWM and PEVA*.

Reference Trajectory:


PEVA*:


DexWM:


Robot Manipulation Tasks (Simulation)

Given a start image and a goal image, DexWM plans the trajectory in an MPC framework, using the Cross-Entropy Method to find the optimal actions. The test tasks below are unseen during training.
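
A minimal sketch of Cross-Entropy Method planning against the world model, assuming the planning cost is the distance between the predicted final latent state and the encoded goal image; the population size, elite count, and iteration count are illustrative, not the exact planner configuration.

import torch

@torch.no_grad()
def cem_plan(encoder, predictor, start_image, goal_image, horizon, action_dim,
             n_samples=256, n_elites=32, n_iters=5):
    """Plan an action sequence with the Cross-Entropy Method in latent space."""
    z0 = encoder(start_image)                    # start latent
    z_goal = encoder(goal_image)                 # goal latent
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)

    for _ in range(n_iters):
        # Sample candidate action sequences around the current distribution.
        actions = mean + std * torch.randn(n_samples, horizon, action_dim)
        costs = []
        for seq in actions:
            z = z0
            for a in seq:                        # roll out the world model
                z = predictor(z, a.unsqueeze(0))
            costs.append(torch.norm(z - z_goal)) # distance to the goal latent
        costs = torch.stack(costs)

        # Refit the sampling distribution to the lowest-cost (elite) sequences.
        elite_idx = costs.topk(n_elites, largest=False).indices
        elites = actions[elite_idx]
        mean, std = elites.mean(dim=0), elites.std(dim=0) + 1e-6

    return mean                                  # planned action sequence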

Task: Reach

Task: Grasp

Task: Place

Real-World Manipulation Task

Citation (BibTeX)

@misc{goswami2025worldmodelsleveragehuman,
  title={World Models Can Leverage Human Videos for Dexterous Manipulation},
  author={Raktim Gautam Goswami and Amir Bar and David Fan and Tsung-Yen Yang and Gaoyue Zhou and Prashanth Krishnamurthy and Michael Rabbat and Farshad Khorrami and Yann LeCun},
  year={2025},
  eprint={2512.13644},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2512.13644},
}