Modeling dexterous hand-object interactions is challenging: it requires understanding how subtle finger motions influence the environment through contact with objects. While recent world models address interaction modeling, they typically rely on coarse action spaces that fail to capture fine-grained dexterity. We therefore introduce DexWM, a Dexterous Interaction World Model that predicts future latent states of the environment conditioned on past states and dexterous actions. To overcome the scarcity of finely annotated dexterous datasets, DexWM represents actions using finger keypoints extracted from egocentric videos, enabling training on over 900 hours of human and non-dexterous robot data. Further, to accurately model dexterity, we find that predicting visual features alone is insufficient; we therefore incorporate an auxiliary hand consistency loss that enforces accurate hand configurations. DexWM outperforms prior world models conditioned on text, navigation, or full-body actions in future-state prediction and demonstrates strong zero-shot transfer to unseen skills on a Franka Panda arm equipped with an Allegro hand, surpassing Diffusion Policy by over 50% on average across grasping, placing, and reaching tasks.
Hand actions are represented as differences in 3D keypoints between frames (e.g., H_k^j − H_k^i for keypoint k at frames j and i), providing a unified representation of dexterous actions in DexWM. This is supplemented with camera motion, which captures the agent's movement.
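The keypoint-difference action can be sketched as follows. This is a minimal illustration, not the paper's implementation: the keypoint count, the 6-D camera-motion vector, and the flat concatenation are all assumptions.

```python
import numpy as np

def keypoint_action(H_i, H_j, cam_i, cam_j):
    """Hypothetical sketch: a dexterous action between frames i and j is the
    per-keypoint 3D displacement H_j - H_i, concatenated with the change in
    camera pose (here a 6-D vector) capturing the agent's own motion."""
    delta_hand = (H_j - H_i).reshape(-1)   # (K*3,) keypoint displacements
    delta_cam = cam_j - cam_i              # (6,) camera motion
    return np.concatenate([delta_hand, delta_cam])

# Example: 21 hand keypoints in 3D, camera pose as a 6-D vector
H_i = np.zeros((21, 3))
H_j = np.full((21, 3), 0.01)
a = keypoint_action(H_i, H_j, np.zeros(6), np.full(6, 0.1))
```

Because the action is a relative quantity, the same representation applies to human hands in egocentric video and to robot end-effectors, which is what allows the unified training corpus.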
For the DROID dataset, which uses parallel-jaw grippers, the grippers are approximated as dexterous hands by placing dummy keypoints (represented by green points above) on concentric circles centered at the end-effector. The radii of these circles vary with the gripper's open/close state, mimicking finger spread.
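The dummy-keypoint construction can be sketched as below. The number of points per circle, the base radii, and the circles lying in a plane at the end-effector's height are illustrative assumptions, not values from the paper.

```python
import numpy as np

def dummy_keypoints(ee_pos, openness, n_points=8, radii=(0.02, 0.04)):
    """Hypothetical sketch: approximate a parallel-jaw gripper with dummy
    'finger' keypoints placed on concentric circles centered at the
    end-effector position ee_pos. The radii scale with the gripper's
    open/close state (openness in [0, 1]), mimicking finger spread."""
    angles = np.linspace(0, 2 * np.pi, n_points, endpoint=False)
    circles = []
    for r in radii:
        r_eff = r * openness  # circle shrinks as the gripper closes
        circle = np.stack([ee_pos[0] + r_eff * np.cos(angles),
                           ee_pos[1] + r_eff * np.sin(angles),
                           np.full(n_points, ee_pos[2])], axis=1)
        circles.append(circle)
    return np.concatenate(circles, axis=0)  # (len(radii) * n_points, 3)

kp_open = dummy_keypoints(np.array([0.5, 0.0, 0.3]), openness=1.0)
kp_closed = dummy_keypoints(np.array([0.5, 0.0, 0.3]), openness=0.0)
```

With `openness=0.0` all dummy keypoints collapse onto the end-effector, so the open/close state is encoded purely through keypoint geometry, matching the hand-keypoint action space.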
Given the initial state and a dexterous action sequence, DexWM predicts future latent states autoregressively. Latent states are decoded into images for visualization.
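The autoregressive rollout loop can be sketched as follows. The linear `predict_next` is a stand-in for DexWM's learned neural predictor, and the latent and action dimensions are arbitrary; only the feed-the-prediction-back-in structure is the point.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(16 + 4, 16))  # toy stand-in for learned weights

def predict_next(z, a):
    """Stand-in for the learned world model: maps (latent state, action)
    to the next latent state. DexWM uses a neural network here; this
    linear map only illustrates the interface."""
    return np.tanh(np.concatenate([z, a]) @ W)

def rollout(z0, actions):
    """Autoregressive rollout: each predicted latent state is fed back
    as the conditioning state for the next prediction step."""
    states, z = [z0], z0
    for a in actions:
        z = predict_next(z, a)
        states.append(z)
    return states

traj = rollout(np.zeros(16), [np.full(4, 0.1)] * 5)
```

Decoding each latent in `traj` to an image (as the page's visualizations do) is a separate step; the rollout itself stays entirely in latent space.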
Starting from the same initial state, DexWM predicts future states given different atomic actions for controlling the right hand.
Move Right Hand to the right:
Move Right Hand up:
Move Right Hand forward:
Transferring actions from a reference sequence to a new environment using DexWM and PEVA*.
Reference Trajectory:
PEVA*:
DexWM:
Given start and goal images, DexWM plans a trajectory within an MPC framework, finding optimal actions with the Cross-Entropy Method. The test tasks below are unseen during training.
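The Cross-Entropy Method planner can be sketched as below. All hyperparameters (population size, elite count, horizon, iterations) are illustrative assumptions, and the toy `step` function stands in for DexWM's latent predictor.

```python
import numpy as np

def cem_plan(z0, z_goal, step, horizon=5, act_dim=4,
             n_samples=64, n_elites=8, n_iters=20, seed=0):
    """Minimal Cross-Entropy Method sketch: sample action sequences from a
    Gaussian, roll each out with the world model `step`, score by squared
    distance between the final latent and the goal latent, then refit the
    Gaussian to the lowest-cost (elite) sequences."""
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, act_dim))
    sigma = np.ones((horizon, act_dim))
    for _ in range(n_iters):
        acts = rng.normal(mu, sigma, size=(n_samples, horizon, act_dim))
        costs = []
        for seq in acts:
            z = z0
            for a in seq:
                z = step(z, a)  # world-model rollout
            costs.append(np.sum((z - z_goal) ** 2))
        elites = acts[np.argsort(costs)[:n_elites]]
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu  # planned action sequence

# Toy world model: the latent simply moves by the action
plan = cem_plan(np.zeros(4), np.ones(4), step=lambda z, a: z + a)
```

In an MPC loop, only the first action of `plan` would be executed before replanning from the newly observed state.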
@misc{goswami2026dexwm,
      title={World Models for Learning Dexterous Hand-Object Interactions from Human Videos},
      author={Raktim Gautam Goswami and Amir Bar and David Fan and Tsung-Yen Yang and Gaoyue Zhou and Prashanth Krishnamurthy and Michael Rabbat and Farshad Khorrami and Yann LeCun},
      year={2026},
      eprint={2512.13644},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2512.13644},
}