For robots to become efficient helpers in the home, they must learn to perform new mobile manipulation tasks simply by watching humans perform them. Learning from a single human video demonstration is challenging: the robot must first extract from the demo what needs to be done and how, translate the strategy from a third-person to a first-person perspective, and then adapt it to succeed with its own morphology. Furthermore, to mitigate the dependency on costly human monitoring, this learning process should be performed safely and autonomously. We present SafeMimic, a framework to learn new mobile manipulation skills safely and autonomously from a single third-person human video. Given an initial human video demonstration of a multi-step mobile manipulation task, SafeMimic first parses the video into segments, inferring both the semantic changes the human caused and the motions they executed to achieve them, and translating them to an egocentric reference frame. It then adapts the behavior to the robot's own morphology by sampling candidate actions around the human ones and verifying them for safety before execution in a receding-horizon fashion, using an ensemble of safety Q-functions trained in simulation. When safe forward progression is not possible, SafeMimic backtracks to previous states and attempts a different sequence of actions, adapting both the trajectory and the grasping modes when required by its morphology. As a result, SafeMimic yields a strategy that succeeds at the demonstrated behavior and learns task-specific actions that reduce exploration in future attempts. Our experiments show that our method allows robots to safely and efficiently learn multi-step mobile manipulation behaviors from a single human demonstration, from different users, and in different environments, with improvements over state-of-the-art baselines across seven tasks.
SafeMimic consists of three steps. The first step extracts the semantic segments (what the human did) and the human actions (how the human did it) from a single third-person video. The second step enables the robot to safely and autonomously adapt the human actions to its own embodiment. Finally, in the third step the robot uses a policy memory module to learn from successful trajectories, reducing exploration in future trials. A sketch of how the three steps fit together is given below.
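The following pseudocode is a minimal sketch of how these steps could compose into one loop. It is illustrative only, not the released implementation: every helper here (parse_video, sample_around, the robot API, and the memory and Q-ensemble objects) is a hypothetical stand-in for the components described above.

```python
# Illustrative sketch of the SafeMimic loop. All helpers are hypothetical
# stand-ins for the components described in the text.

def safemimic(video, robot, q_ensemble, memory, n_candidates=16):
    # Step 1: parse the demo into segments, each pairing a semantic
    # subgoal with egocentric reference actions extracted from the human.
    segments = parse_video(video)

    trajectory = []  # successful (state, action) pairs, stored for reuse
    for seg in segments:
        while not seg.subgoal_reached(robot.state()):
            # Step 3 feeds back into Step 2: reuse remembered actions when
            # available; otherwise sample candidates around the human action.
            candidates = memory.lookup(seg, robot.state()) or \
                sample_around(seg.reference_action(robot.state()), n_candidates)

            # Step 2: verify each candidate with the safety Q-ensemble
            # before executing it, in a receding-horizon fashion.
            safe = [a for a in candidates
                    if q_ensemble.is_safe(robot.state(), a)]
            if safe:
                action = min(safe, key=seg.deviation_from_reference)
                robot.execute(action)
                trajectory.append((robot.state(), action))
            else:
                # No safe forward progress: backtrack to a previous state
                # and try a different action sequence (possibly a
                # different grasping mode).
                robot.backtrack(trajectory)

    memory.add(segments, trajectory)  # reduce exploration in future trials
```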
Training Safety Q-functions in the real world would be dangerous, as the robot would need to experience which actions lead to unsafe states, and when. Therefore, we pretrain an ensemble of Safety Q-functions in simulation, one for each type of unsafe transition. The safety Q-value is 1 if taking action a from state s is unsafe, and 0 otherwise (left video). The ensemble is pretrained in domain-randomized environments in the OmniGibson simulator across different scenarios, including articulated-object interaction, rigid-body pick-and-place, and base navigation, by sampling random and noise-corrupted task-related actions (right video).
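As a concrete illustration, below is a minimal PyTorch sketch of such an ensemble. The network architecture, the listed unsafe-transition types, the state and action dimensions, and the all-members-must-agree verification rule are all illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SafetyQ(nn.Module):
    """Binary safety Q-function: predicts the probability that the
    transition (state, action) leads to one type of unsafe state."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        logits = self.net(torch.cat([state, action], dim=-1))
        return torch.sigmoid(logits).squeeze(-1)  # unsafe probability in [0, 1]

# One Q-function per unsafe-transition type; these type names and the
# state/action dimensions are hypothetical placeholders.
UNSAFE_TYPES = ["collision", "arm_limit_violation", "object_drop"]
ensemble = {t: SafetyQ(state_dim=32, action_dim=7) for t in UNSAFE_TYPES}

def is_safe(state, action, threshold=0.5):
    """A candidate action is verified only if every ensemble member
    predicts it is safe (conservative aggregation)."""
    with torch.no_grad():
        risks = torch.stack([q(state, action) for q in ensemble.values()])
    return bool((risks < threshold).all())

# Each member is trained with binary cross-entropy on simulator rollouts:
# the label is 1 if executing the action from the state produced that
# member's type of unsafe transition, and 0 otherwise.
bce = nn.BCELoss()
```

At verification time, the robot would query is_safe for each sampled candidate and execute only actions that pass; requiring agreement from every member keeps the check conservative when members disagree.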
All robot videos are played at 40× speed.
@inproceedings{bahety2025safemimic,
title={SafeMimic: Towards Safe and Autonomous Human-to-Robot Imitation for Mobile Manipulation},
author={Bahety, Arpit and Balaji, Arnav and Abbatematteo, Ben and Mart{\'\i}n-Mart{\'\i}n, Roberto},
booktitle={Robotics: Science and Systems (RSS)},
year={2025}
}