Dexterous manipulation is challenging because it requires understanding how subtle hand motion influences the environment through contact with objects. We introduce DexWM, a Dexterous Manipulation World Model that predicts the next latent state of the environment conditioned on past states and dexterous actions. To overcome the scarcity of dexterous manipulation datasets, DexWM is trained on over 900 hours of human and non-dexterous robot videos. To enable fine-grained dexterity, we find that predicting visual features alone is insufficient; therefore, we introduce an auxiliary hand consistency loss that enforces accurate hand configurations. DexWM outperforms prior world models conditioned on text, navigation, and full-body actions, achieving more accurate predictions of future states. DexWM also demonstrates strong zero-shot generalization to unseen manipulation skills when deployed on a Franka Panda arm equipped with an Allegro Hand, outperforming Diffusion Policy by over 50% on average in grasping, placing, and reaching tasks.
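A minimal sketch of how a training objective might combine the two signals described above. The module names, choice of MSE, and the weighting `LAMBDA_HAND` are illustrative assumptions, not the paper's exact formulation:

```python
import torch.nn.functional as F

LAMBDA_HAND = 1.0  # assumed weighting between the two loss terms

def dexwm_loss(pred_latent, target_latent, pred_hand_kpts, target_hand_kpts):
    """Visual feature prediction loss plus the auxiliary hand consistency term."""
    feature_loss = F.mse_loss(pred_latent, target_latent)     # next-state feature prediction
    hand_loss = F.mse_loss(pred_hand_kpts, target_hand_kpts)  # 3D hand keypoint consistency
    return feature_loss + LAMBDA_HAND * hand_loss
```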
Hand actions are represented as differences in 3D keypoints between frames (e.g., $H_{k_j} - H_{k_i}$), providing a unified representation of dexterous actions in DexWM. This is supplemented with camera motion, which captures the agent’s movement.
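A minimal sketch of this action representation, assuming keypoints are stored as (K, 3) arrays and camera motion is parameterized as a simple pose difference (the paper's exact camera parameterization may differ):

```python
import numpy as np

def hand_action(kpts_next, kpts_curr, cam_next, cam_curr):
    """Dexterous action = per-keypoint 3D displacement between frames,
    supplemented with camera motion.

    kpts_*: (K, 3) arrays of 3D hand keypoints.
    cam_*:  camera pose parameters (parameterization assumed here).
    """
    keypoint_delta = kpts_next - kpts_curr  # H_{k_j} - H_{k_i}
    camera_delta = cam_next - cam_curr      # agent (ego) motion
    return np.concatenate([keypoint_delta.ravel(), camera_delta.ravel()])
```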
For the DROID dataset, which uses parallel-jaw grippers, the gripper is approximated as a dexterous hand by placing dummy keypoints (represented by green points above) on concentric circles centered at the end-effector. The radii of these circles vary with the gripper’s open/close state, mimicking finger spread.
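A sketch of how such dummy keypoints could be generated. The ring counts, radii, and the assumption that the circles lie in the end-effector's local plane are illustrative:

```python
import numpy as np

def gripper_to_dummy_keypoints(ee_pos, openness, n_rings=2, pts_per_ring=5,
                               r_min=0.01, r_max=0.05):
    """Place dummy 'finger' keypoints on concentric circles around the end-effector.

    openness in [0, 1]: 0 = closed, 1 = fully open; ring radii scale with
    openness to mimic finger spread. All sizes/counts here are illustrative.
    """
    keypoints = []
    for ring in range(1, n_rings + 1):
        radius = (r_min + (r_max - r_min) * openness) * ring / n_rings
        for j in range(pts_per_ring):
            theta = 2 * np.pi * j / pts_per_ring
            offset = radius * np.array([np.cos(theta), np.sin(theta), 0.0])
            keypoints.append(np.asarray(ee_pos) + offset)
    return np.stack(keypoints)  # (n_rings * pts_per_ring, 3)
```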
Given the initial state and a dexterous action sequence, DexWM predicts future latent states autoregressively. Latent states are decoded into images for visualization.
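A minimal sketch of such an autoregressive rollout, with `world_model`, `encoder`, and `decoder` standing in for the learned modules and the context length as an assumption:

```python
def rollout(world_model, encoder, decoder, init_image, actions, context=4):
    """Autoregressive rollout: encode the initial frame once, then feed each
    predicted latent state back as conditioning for the next step."""
    history = [encoder(init_image)]                            # latent state history
    frames = []
    for action in actions:
        next_state = world_model(history[-context:], action)  # condition on past states
        history.append(next_state)
        frames.append(decoder(next_state))                    # decode only for visualization
    return history[1:], frames
```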
Starting from the same initial state, DexWM predicts future states given different atomic actions for controlling the right hand.
Move Right Hand to the right:
Move Right Hand up:
Move Right Hand forward:
Transferring actions from a reference sequence to a new environment using DexWM and PEVA*.
Reference Trajectory:
PEVA*:
DexWM:
Given start and goal images, DexWM plans trajectories within an MPC framework, finding optimal actions with the Cross-Entropy Method. The test tasks below are unseen during training.
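A minimal sketch of CEM-based planning in latent space. The cost function (distance between the predicted and goal latents), sample counts, and other hyperparameters are illustrative assumptions:

```python
import numpy as np

def cem_plan(world_model, encoder, start_img, goal_img, horizon, act_dim,
             n_samples=256, n_elites=32, iters=5):
    """Cross-Entropy Method over action sequences, used inside an MPC loop."""
    goal = encoder(goal_img)
    mean = np.zeros((horizon, act_dim))
    std = np.ones((horizon, act_dim))
    for _ in range(iters):
        # Sample candidate action sequences from the current Gaussian.
        samples = mean + std * np.random.randn(n_samples, horizon, act_dim)
        costs = []
        for seq in samples:
            state = encoder(start_img)
            for action in seq:
                state = world_model(state, action)       # roll out in latent space
            costs.append(np.linalg.norm(state - goal))   # goal-reaching cost
        elites = samples[np.argsort(costs)[:n_elites]]   # keep best sequences
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean[0]  # MPC: execute the first action, then replan
```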
@misc{goswami2025worldmodelsleveragehuman,
      title={World Models Can Leverage Human Videos for Dexterous Manipulation},
      author={Raktim Gautam Goswami and Amir Bar and David Fan and Tsung-Yen Yang and Gaoyue Zhou and Prashanth Krishnamurthy and Michael Rabbat and Farshad Khorrami and Yann LeCun},
      year={2025},
      eprint={2512.13644},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2512.13644},
}