DataMIL

Selecting Data for Robot Imitation Learning with Datamodels

* Equal Contribution, Equal Advising
1The University of Texas at Austin, 2MIT, 3Stanford University

TL;DR

Robotics has amassed ever larger and more diverse datasets for training generalist robot policies via imitation. However, while imitation learning (IL) is a powerful paradigm for teaching robots complex tasks, the performance of IL algorithms is highly sensitive to the data they are trained on. In this work, we introduce DataMIL, a novel data selection method that leverages datamodels to select high-quality datasets for IL. DataMIL selects data based on its relevance to the task at hand, ensuring that the policy learns from the most informative examples. We demonstrate the effectiveness of DataMIL on 60+ simulation and real-world tasks, most notably selecting relevant data from the Open X-Embodiment datasets, and show significant performance improvements over existing data selection methods.

What are datamodels?

Datamodels are a framework that tries to answer the question: how would the output of a model change if we had trained it on a different dataset? In other words, datamodels provide a way to directly measure how the presence or absence of each training sample would affect the model's output, without actually retraining the model on the new dataset. While there are several ways of estimating datamodels, we focus on regression- and metagradient-based datamodels, which assign a scalar influence score to each training sample based on how much it affects the model's output. These scores can then be used to select the most relevant samples, or to filter out the most harmful ones, for a given task.
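To make this concrete, below is a minimal sketch of regression-based datamodel estimation: train models on many random subsets of the data, record the output metric for each subset, and fit a linear model from the inclusion mask to that metric. This is an illustration of the general technique, not the paper's implementation; the `train_and_eval` callable and all hyperparameters are assumptions.

```python
import numpy as np

def estimate_datamodel(train_and_eval, n_samples, n_subsets=1000, keep_frac=0.5):
    """Fit a linear datamodel over n_samples training points.

    train_and_eval (assumed): trains a model on the subset of samples where
    mask == 1 and returns a scalar target metric (e.g., validation loss).
    """
    # Draw random inclusion masks and measure the outcome for each subset.
    masks = (np.random.rand(n_subsets, n_samples) < keep_frac).astype(np.float64)
    outcomes = np.array([train_and_eval(m) for m in masks])

    # Least-squares fit: outcome ~ masks @ theta + bias.
    # theta[i] is the estimated influence of sample i on the metric.
    X = np.hstack([masks, np.ones((n_subsets, 1))])
    coef, *_ = np.linalg.lstsq(X, outcomes, rcond=None)
    theta, bias = coef[:-1], coef[-1]
    return theta
```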

DataMIL: DataModels for Imitation Learning

Datamodels have found several applications in NLP and computer vision, but they need several key modifications before they can be applied to robotics. DataMIL (Datamodels for Imitation Learning) provides a recipe for adapting datamodels to robotics datasets: a tractable optimization objective in place of costly rollouts, plus several modifications that reduce estimation noise and improve the quality of the selected data.

1. Proxy Metric: validation loss over a few target-task demos as a proxy for costly rollouts (a minimal sketch follows this list).

2. Clustering: temporal clustering of training samples to reduce estimation noise.

3. Co-Training with Target: minimizing distribution shift by co-training with the target data.

Summary: Data Selection with DataMIL

Data selection with DataMIL is a two-step process:

  1. Estimate datamodels. We first cluster the training samples into trajectories or sub-trajectories, then estimate datamodels on them using our proposed target metric as the proxy.
  2. Select data for policy training. The datamodels assign a scalar score (influence) to each training sample, indicating how positively or negatively it influences our target metric (in our case, the validation loss over a few target-task demos). Using these scores, we select the top x% of samples to train our policy on (see the sketch after this list).

Finally, we employ a co-training recipe and train our final policy by sampling uniformly from the selected data and the target-task dataset.
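A hedged sketch of this selection-plus-co-training recipe using standard PyTorch utilities is below. The sign convention (more negative score means more helpful, since the metric is a loss), the 50/50 sampling split, and the dataset interfaces are our assumptions for illustration.

```python
import numpy as np
import torch
from torch.utils.data import ConcatDataset, DataLoader, Subset, WeightedRandomSampler

def make_cotrain_loader(scores, prior_dataset, target_dataset,
                        top_frac=0.10, batch_size=64):
    """Keep the top-x% of prior samples by datamodel score, then build a
    loader that draws from selected-prior and target data uniformly."""
    # Assumption: scores estimate the change in target validation loss when a
    # sample is included, so the most negative scores mark the most helpful data.
    k = max(1, int(top_frac * len(scores)))
    selected_idx = np.argsort(scores)[:k]
    selected = Subset(prior_dataset, selected_idx.tolist())

    combined = ConcatDataset([selected, target_dataset])
    # Put equal total probability mass on each source: a uniform co-training mix.
    weights = torch.cat([
        torch.full((len(selected),), 0.5 / len(selected)),
        torch.full((len(target_dataset),), 0.5 / len(target_dataset)),
    ])
    sampler = WeightedRandomSampler(weights, num_samples=len(combined),
                                    replacement=True)
    return DataLoader(combined, batch_size=batch_size, sampler=sampler)
```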


OXE Results

We evaluate DataMIL on over 60 tasks spanning both simulation and real-world settings. In the real world, we use the Open X-Embodiment dataset as the prior and test on four target tasks (shown above). As illustrated in the results below, DataMIL consistently outperforms existing data selection methods across tasks with diverse characteristics. Most notably, it successfully selects relevant data for a completely new embodiment, Tiago, by identifying useful samples from datasets collected with different robots, such as the Google Robot and WidowX. We also extend our evaluation to a multitask setting, where the target set includes multiple tasks, and show that DataMIL can effectively retrieve data that supports all of them. For a detailed analysis, please refer to the full paper.


What data does DataMIL select?

We qualitatively find 3 interesting insights about the data selected by DataMIL:

  1. Distribution of selected data. Below we show the distribution of datasets selected by DataMIL and some representative baselines on the Tiago-Sink and Franka-Pouch tasks. We find that the data selected by DataMIL typically spans several datasets, while similarity-based baselines select most of their data from a single dataset. We hypothesize that because no prior data exactly matches the target task, the selected data must be not only relevant but also general, enabling positive transfer of capabilities without overfitting the policy to a single type of domain.

Selected dataset distribution for the Tiago-Sink task (panels: DataMIL, Action Retrieval, Behavior Retrieval)

  2. Type of embodiment selected. DataMIL is able to select useful data for a completely new embodiment in the Tiago-Sink task. Even though the selected samples look visually quite different, drawn from datasets such as RT-1, BC-Z, and Bridge, they still capture the essence of the target task: robots operating on a tabletop from an ego-centric perspective. For baselines, even when the target embodiment is present in the prior data (e.g., the Franka-Pouch task), the selected data often comes from other embodiments, possibly because these methods put more weight on the scene and distractors when computing similarity. In contrast, DataMIL selects data from the correct embodiment when it is present in the prior dataset.

Selected dataset distribution for the Franka-Pouch task (panels: DataMIL, Flow Retrieval, Behavior Retrieval)

  3. Top and bottom samples. Interestingly, when we visually inspect the data selected by DataMIL (shown below), we find that the highest- and lowest-ranked samples typically look alike. This is in line with observations in computer vision, where the most useful data often looks very similar to the most harmful, albeit with different labels. Similarly, in robotics, similar states can have very different action distributions; while some of these actions may help reduce the policy loss on the target data, others may lead to large deviations, making them harmful for final policy learning.

Top- and bottom-ranked samples for the Franka-Ball, Franka-Pouch, Tiago-Sink, and Droid-Multitask tasks

BibTeX