Modeling dexterous hand-object interactions is challenging: it requires understanding how subtle finger motions influence the environment through contact with objects. While recent world models address interaction modeling, they typically rely on coarse action spaces that fail to capture fine-grained dexterity. We therefore introduce DexWM, a Dexterous Interaction World Model that predicts future latent states of the environment conditioned on past states and dexterous actions. To overcome the scarcity of finely annotated dexterous datasets, DexWM represents actions using finger keypoints extracted from egocentric videos, enabling training on over 900 hours of human and non-dexterous robot data. Further, to accurately model dexterity, we find that predicting visual features alone is insufficient; we therefore incorporate an auxiliary hand consistency loss that enforces accurate hand configurations. DexWM outperforms prior world models conditioned on text, navigation, or full-body actions in future-state prediction and demonstrates strong zero-shot transfer to unseen skills on a Franka Panda arm equipped with an Allegro hand, surpassing Diffusion Policy by over 50% on average across grasping, placing, and reaching tasks.
Hand actions are represented as differences in 3D keypoints between frames (e.g., H_k^j − H_k^i for keypoint k at frames j and i), providing a unified representation of dexterous actions in DexWM. This is supplemented with camera motion, which captures the agent's movement.
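The keypoint-difference action can be sketched as follows. This is a minimal illustration, not the paper's implementation: the keypoint count, the 6-D camera-motion vector, and the flat concatenation are all assumptions.

```python
import numpy as np

def keypoint_action(H_i, H_j, cam_i, cam_j):
    """Hypothetical sketch: a dexterous action between frames i and j is the
    per-keypoint 3D displacement H_j - H_i, concatenated with the change in
    camera pose (here a 6-D vector) capturing the agent's own motion."""
    delta_hand = (H_j - H_i).reshape(-1)   # (K*3,) keypoint displacements
    delta_cam = cam_j - cam_i              # (6,) camera motion
    return np.concatenate([delta_hand, delta_cam])

# Example: 21 hand keypoints in 3D, camera pose as a 6-D vector
H_i = np.zeros((21, 3))
H_j = np.full((21, 3), 0.01)
a = keypoint_action(H_i, H_j, np.zeros(6), np.full(6, 0.1))
```

Because the action is a relative quantity, the same representation applies to human hands in egocentric video and to robot end-effectors, which is what allows the unified training corpus.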
For the DROID dataset, which uses parallel-jaw grippers, the grippers are approximated as dexterous hands by placing dummy keypoints (represented by green points above) on concentric circles centered at the end-effector. The radii of these circles vary with the gripper's open/close state, mimicking finger spread.
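The dummy-keypoint construction can be sketched as below. The number of points per circle, the base radii, and the circles lying in a plane at the end-effector's height are illustrative assumptions, not values from the paper.

```python
import numpy as np

def dummy_keypoints(ee_pos, openness, n_points=8, radii=(0.02, 0.04)):
    """Hypothetical sketch: approximate a parallel-jaw gripper with dummy
    'finger' keypoints placed on concentric circles centered at the
    end-effector position ee_pos. The radii scale with the gripper's
    open/close state (openness in [0, 1]), mimicking finger spread."""
    angles = np.linspace(0, 2 * np.pi, n_points, endpoint=False)
    circles = []
    for r in radii:
        r_eff = r * openness  # circle shrinks as the gripper closes
        circle = np.stack([ee_pos[0] + r_eff * np.cos(angles),
                           ee_pos[1] + r_eff * np.sin(angles),
                           np.full(n_points, ee_pos[2])], axis=1)
        circles.append(circle)
    return np.concatenate(circles, axis=0)  # (len(radii) * n_points, 3)

kp_open = dummy_keypoints(np.array([0.5, 0.0, 0.3]), openness=1.0)
kp_closed = dummy_keypoints(np.array([0.5, 0.0, 0.3]), openness=0.0)
```

With `openness=0.0` all dummy keypoints collapse onto the end-effector, so the open/close state is encoded purely through keypoint geometry, matching the hand-keypoint action space.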
Given the initial state and a dexterous action sequence, DexWM predicts future latent states autoregressively. Latent states are decoded into images for visualization.
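The autoregressive rollout loop can be sketched as follows. The linear `predict_next` is a stand-in for DexWM's learned neural predictor, and the latent and action dimensions are arbitrary; only the feed-the-prediction-back-in structure is the point.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(16 + 4, 16))  # toy stand-in for learned weights

def predict_next(z, a):
    """Stand-in for the learned world model: maps (latent state, action)
    to the next latent state. DexWM uses a neural network here; this
    linear map only illustrates the interface."""
    return np.tanh(np.concatenate([z, a]) @ W)

def rollout(z0, actions):
    """Autoregressive rollout: each predicted latent state is fed back
    as the conditioning state for the next prediction step."""
    states, z = [z0], z0
    for a in actions:
        z = predict_next(z, a)
        states.append(z)
    return states

traj = rollout(np.zeros(16), [np.full(4, 0.1)] * 5)
```

Decoding each latent in `traj` to an image (as the page's visualizations do) is a separate step; the rollout itself stays entirely in latent space.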
Starting from the same initial state, DexWM predicts future states given different atomic actions for controlling the right hand.
Move Right Hand to the right:
Move Right Hand up:
Move Right Hand forward:
Transferring actions from a reference sequence to a new environment using DexWM and PEVA*.
Reference Trajectory:
PEVA*:
DexWM:
Given start and goal images, DexWM plans a trajectory within an MPC framework, finding optimal actions with the Cross-Entropy Method. The test tasks below are unseen during training.
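The Cross-Entropy Method planner can be sketched as below. All hyperparameters (population size, elite count, horizon, iterations) are illustrative assumptions, and the toy `step` function stands in for DexWM's latent predictor.

```python
import numpy as np

def cem_plan(z0, z_goal, step, horizon=5, act_dim=4,
             n_samples=64, n_elites=8, n_iters=20, seed=0):
    """Minimal Cross-Entropy Method sketch: sample action sequences from a
    Gaussian, roll each out with the world model `step`, score by squared
    distance between the final latent and the goal latent, then refit the
    Gaussian to the lowest-cost (elite) sequences."""
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, act_dim))
    sigma = np.ones((horizon, act_dim))
    for _ in range(n_iters):
        acts = rng.normal(mu, sigma, size=(n_samples, horizon, act_dim))
        costs = []
        for seq in acts:
            z = z0
            for a in seq:
                z = step(z, a)  # world-model rollout
            costs.append(np.sum((z - z_goal) ** 2))
        elites = acts[np.argsort(costs)[:n_elites]]
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu  # planned action sequence

# Toy world model: the latent simply moves by the action
plan = cem_plan(np.zeros(4), np.ones(4), step=lambda z, a: z + a)
```

In an MPC loop, only the first action of `plan` would be executed before replanning from the newly observed state.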
@misc{goswami2026dexwm,
      title={World Models for Learning Dexterous Hand-Object Interactions from Human Videos},
      author={Raktim Gautam Goswami and Amir Bar and David Fan and Tsung-Yen Yang and Gaoyue Zhou and Prashanth Krishnamurthy and Michael Rabbat and Farshad Khorrami and Yann LeCun},
      year={2026},
      eprint={2512.13644},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2512.13644},
}