Bimanual manipulation is a longstanding challenge in robotics due to the large number of degrees of freedom and the strict spatial and temporal synchronization required to generate meaningful behavior. Humans learn bimanual manipulation skills by watching other humans and by refining their abilities through play. In this work, we aim to enable robots to learn bimanual manipulation behaviors from human video demonstrations and fine-tune them through interaction. Inspired by seminal work in psychology and biomechanics, we propose modeling the interaction between the two hands as a serial kinematic linkage, specifically as a screw motion, which we use to define a new action space for bimanual manipulation: screw actions. We introduce ScrewMimic, a framework that leverages this novel action representation to facilitate learning from human demonstration and self-supervised policy fine-tuning. Our experiments demonstrate that ScrewMimic learns several complex bimanual behaviors from a single human video demonstration, and that it outperforms baselines that interpret demonstrations and fine-tune directly in the original space of motion of both arms.
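To make the screw-action idea concrete, here is a minimal sketch (not the authors' code) of the standard screw parameterization from screw theory: a unit axis direction, a point on the axis, and a pitch, together with a rotation magnitude, define a rigid transform via the SE(3) exponential map. All function and variable names are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import expm

def skew(w):
    """3x3 skew-symmetric matrix such that skew(w) @ x == np.cross(w, x)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def screw_to_transform(s_hat, q, h, theta):
    """Rigid transform from rotating `theta` rad about the screw axis with
    unit direction `s_hat` through point `q`, translating h*theta along it."""
    s_hat = s_hat / np.linalg.norm(s_hat)
    w = s_hat                              # angular part of the twist
    v = -np.cross(s_hat, q) + h * s_hat    # linear part of the twist
    xi = np.zeros((4, 4))                  # twist in matrix (se(3)) form
    xi[:3, :3] = skew(w)
    xi[:3, 3] = v
    return expm(xi * theta)                # SE(3) exponential map

# Example: a pure rotation (pitch h=0) of 90 degrees about a vertical axis
# through (0.3, 0, 0), e.g. one hand twisting a cap held by the other hand.
T = screw_to_transform(np.array([0.0, 0.0, 1.0]),
                       np.array([0.3, 0.0, 0.0]), h=0.0, theta=np.pi / 2)
```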
ScrewMimic takes as input an RGB-D video of a human performing a bimanual task and uses off-the-shelf hand-tracking (HT) models to extract a trajectory of wrist poses (τ_h^l, τ_h^r) and grasp contact points (g_h^l, g_h^r). ScrewMimic interprets τ_h^l and τ_h^r as a screw motion between the two hands and estimates the screw axis parameters S_h. Next, it applies geometric augmentations to the 3D object point cloud to train a PointNet model that estimates screw actions for novel object views. Finally, the trained model generates an initial hypothesis that the robot executes and iteratively refines using an autonomously generated reward signal; each successful execution is then used to further improve the prediction model.
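As a sketch of the step that interprets τ_h^l and τ_h^r as a screw motion, one can express the right-wrist pose in the left-wrist frame and recover a screw axis from the SE(3) logarithm of the relative motion between two time steps. This is a hedged illustration under standard screw theory, not the paper's implementation; it assumes a rotation-dominant motion (θ > 0).

```python
import numpy as np
from scipy.linalg import logm

def relative_motion(T_l0, T_r0, T_l1, T_r1):
    """Motion of the right wrist expressed in the left-wrist frame,
    between two time steps (4x4 homogeneous transforms)."""
    T_rel0 = np.linalg.inv(T_l0) @ T_r0
    T_rel1 = np.linalg.inv(T_l1) @ T_r1
    return T_rel1 @ np.linalg.inv(T_rel0)

def motion_to_screw(T):
    """Recover a unit screw axis (w, v) and magnitude theta from a rigid
    motion via the matrix log (valid for rotation angles below pi)."""
    xi = logm(T).real                       # twist in matrix (se(3)) form
    w_theta = np.array([xi[2, 1], xi[0, 2], xi[1, 0]])
    v_theta = xi[:3, 3]
    theta = np.linalg.norm(w_theta)         # assumes theta > 0
    return w_theta / theta, v_theta / theta, theta
```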
We evaluate ScrewMimic on novel objects; note that the screw action prediction model has not been trained on these objects. As the following videos show, ScrewMimic is able to fine-tune the screw action to successfully manipulate novel objects. In the paper, we further show that if we re-train the screw axis prediction model with the corrected screw action, ScrewMimic can complete the task with the new object almost zero-shot. This indicates that ScrewMimic helps create a self-learning loop in which the robot can continually expand its manipulation capabilities to new objects.
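The fine-tuning loop described above (execute, score with an autonomous reward, refine, and feed successes back into training) could be instantiated as a simple hill-climbing search over screw parameters. The following is a hypothetical sketch: `execute`, `reward`, and `perturb` are placeholder callables standing in for robot execution, the autonomously generated reward signal, and parameter perturbation, and none of these names come from the paper's code.

```python
def fine_tune(screw_action, execute, reward, perturb, n_iters=20):
    """Iteratively refine a predicted screw action using an autonomous reward.

    A schematic, one plausible instantiation of self-supervised refinement:
    keep the best-scoring candidate seen so far.
    """
    best_action = screw_action
    best_r = reward(execute(best_action))
    for _ in range(n_iters):
        candidate = perturb(best_action)     # e.g. small noise on axis/pitch
        r = reward(execute(candidate))
        if r > best_r:                       # keep the improvement
            best_action, best_r = candidate, r
    return best_action  # a successful action can be added to training data
```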
The following video shows the stir task with a non-zero left-hand motion. Since ScrewMimic defines the screw axis of manipulation as the relative motion between the two hands, the same screw action can be used even in the presence of absolute motion of one of the hands (here, the left hand).
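This invariance follows directly from expressing one wrist pose in the other's frame: applying any world-frame motion G to both wrists leaves the relative pose, and hence the screw action, unchanged. A minimal numeric check (the helper below is illustrative, not from the paper):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def random_se3(rng):
    """Random rigid transform, for the demonstration only."""
    T = np.eye(4)
    T[:3, :3] = Rotation.from_rotvec(rng.standard_normal(3)).as_matrix()
    T[:3, 3] = rng.standard_normal(3)
    return T

rng = np.random.default_rng(0)
T_l, T_r, G = random_se3(rng), random_se3(rng), random_se3(rng)
T_rel = np.linalg.inv(T_l) @ T_r
T_rel_moved = np.linalg.inv(G @ T_l) @ (G @ T_r)   # both hands moved by G
assert np.allclose(T_rel, T_rel_moved)             # relative pose unchanged
```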
@inproceedings{bahety2024screwmimic,
title={ScrewMimic: Bimanual Imitation from Human Videos with Screw Space Projection},
author={Bahety, Arpit and Mandikal, Priyanka and Abbatematteo, Ben and Mart{\'\i}n-Mart{\'\i}n, Roberto},
booktitle={Robotics: Science and Systems (RSS)},
year={2024}
}