Robotics has amassed ever larger and more diverse datasets for training generalist robot policies via imitation. However, while imitation learning (IL) is a powerful paradigm for teaching robots complex tasks, the performance of IL algorithms is highly sensitive to the data they are trained on. In this work, we introduce DataMIL, a novel data selection method that leverages datamodels to select high-quality data for IL. DataMIL selects data based on its relevance to the task at hand, ensuring that the policy learns from the most informative examples. We demonstrate the effectiveness of DataMIL on 60+ simulation and real-world tasks, most notably selecting relevant data from the Open X-Embodiment datasets, and show significant improvements in performance compared to existing data selection methods.
Datamodels are a framework that tries to answer the question: how would the output of a model change if we had trained it on a different dataset? In other words, datamodels provide a way to directly measure how the presence or absence of each training sample would affect the model's output, without actually retraining the model on the new dataset. While there are several ways of estimating datamodels, we focus on regression- and metagradient-based datamodels, which assign a scalar influence score to each training sample based on how much it affects the model's output. These scores can then be used to select the most relevant samples, or to filter out the most harmful ones, for a given task.
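As a concrete illustration of the regression-based estimator, the sketch below (with hypothetical function and variable names, not the paper's actual implementation) trains on many random subsets of the prior dataset, records a target metric for each run, and fits a ridge regression from subset-inclusion masks to that metric; the fitted weights then act as per-sample influence scores.

```python
import numpy as np

def estimate_datamodel(masks, metrics, l2=1e-3):
    """Regression-based datamodel estimation (illustrative sketch).

    masks:   (num_runs, num_samples) binary matrix; masks[r, i] = 1 if
             training sample i was included in run r's training subset.
    metrics: (num_runs,) target metric recorded after each run, e.g.
             validation loss on a few target-task demos.
    Returns a per-sample influence score; higher means that including
    the sample tends to reduce the target loss (i.e. it is helpful).
    """
    X = masks.astype(np.float64)
    y = np.asarray(metrics, dtype=np.float64)
    # Ridge regression in closed form: w = (X^T X + l2*I)^(-1) X^T y
    d = X.shape[1]
    w = np.linalg.solve(X.T @ X + l2 * np.eye(d), X.T @ y)
    # Negate so that helpful samples (those that lower the loss) get
    # positive scores.
    return -w
```

The metagradient-based variant arrives at similar scores by differentiating through the training procedure instead of retraining on many random subsets.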
Datamodels have found several applications in NLP and computer vision, but they need several key modifications before they can be applied to robotics. DataMIL (Datamodels for Imitation Learning) provides a recipe for adapting datamodels to robotics datasets, replacing costly rollouts with a tractable optimization objective and adding several modifications that reduce estimation noise and improve the quality of the selected data:
Validation loss over a few target-task demos as a proxy for costly rollouts (see the sketch after this list)
Temporal clustering to reduce estimation noise
Minimizing distribution shift by co-training with target data
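The first two modifications can be sketched roughly as follows; the names and interfaces here are assumptions for illustration, not the paper's actual API. The proxy metric is simply the policy's imitation loss averaged over a handful of target-task demos, and temporal clustering aggregates per-timestep influence scores over contiguous chunks of a trajectory so that selection operates on clusters rather than noisy individual timesteps.

```python
import torch

@torch.no_grad()
def target_proxy_metric(policy, target_demos, loss_fn):
    """Proxy for rollout success: average imitation loss of the policy
    on a handful of held-out target-task demonstrations."""
    policy.eval()
    losses = [loss_fn(policy(obs), action).item() for obs, action in target_demos]
    return sum(losses) / len(losses)

def cluster_scores(sample_scores, cluster_ids):
    """Temporal clustering: average per-timestep influence scores over
    contiguous chunks of a trajectory, yielding one score per cluster."""
    groups = {}
    for score, cid in zip(sample_scores, cluster_ids):
        groups.setdefault(cid, []).append(score)
    return {cid: sum(s) / len(s) for cid, s in groups.items()}
```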
Data selection with DataMIL is a two-step process:
1. Estimate datamodels: assign each sample (or temporal cluster) in the prior dataset a scalar influence score that predicts how its inclusion affects the proxy validation loss on the target demos.
2. Select and train: keep the highest-scoring data from the prior dataset and train the policy on it, co-training with the target demos.
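A minimal sketch of the selection step, again with hypothetical names: rank clusters of the prior dataset by their estimated influence scores, keep the top fraction, and combine the selection with the target demos for final policy training.

```python
def select_and_build_training_set(scores, prior_clusters, target_demos, keep_fraction=0.1):
    """Step 2: rank prior-dataset clusters by influence score, keep the
    top fraction, and co-train on the selection plus the target demos.

    scores:         {cluster_id: influence score}, higher = more helpful.
    prior_clusters: {cluster_id: list of (obs, action) samples}.
    target_demos:   list of (obs, action) samples from the target task.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_keep = max(1, int(keep_fraction * len(ranked)))
    selected = [s for cid in ranked[:n_keep] for s in prior_clusters[cid]]
    # Co-training with the target demos keeps the final policy close to
    # the target-task distribution.
    return selected + list(target_demos)
```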
We evaluate DataMIL on over 60 tasks spanning both simulation and real-world settings. In the real world, we use the Open X-Embodiment dataset as the prior dataset and test on four target tasks (shown above). As illustrated in the results below, DataMIL consistently outperforms existing data selection methods across tasks with diverse characteristics. Most notably, it successfully selects relevant data for a completely new embodiment, Tiago, by identifying useful samples from datasets collected with different robots, such as the Google Robot and WidowX. We also extend our evaluation to a multi-task setting, where the target set includes multiple tasks, and show that DataMIL can effectively retrieve data that supports all of them. For a detailed analysis, please refer to the full paper.
We qualitatively identify three interesting insights about the data selected by DataMIL: