The main challenge in learning image-conditioned robotic policies is acquiring a visual representation conducive to low-level control. Due to the high dimensionality of the image space, learning a good visual representation requires a considerable amount of visual data. However, when learning in the real world, data is expensive. Sim2Real is a promising paradigm for overcoming data scarcity in the real-world target domain by using a simulator to collect large amounts of cheap data closely related to the target task. However, it is difficult to transfer an image-conditioned policy from sim to real when the domains are very visually dissimilar. To bridge the sim2real visual gap, we propose using natural language descriptions of images as a unifying signal across domains that captures the underlying task-relevant semantics. Our key insight is that if two image observations from different domains are labeled with similar language, the policy should predict similar action distributions for both images. We demonstrate that training the image encoder to predict the language description or the distance between descriptions of a sim or real image serves as a useful, data-efficient pretraining step that helps learn a domain-invariant image representation. We can then use this image encoder as the backbone of an imitation learning (IL) policy trained simultaneously on a large amount of simulated and a handful of real demonstrations. Our approach outperforms widely used prior sim2real methods and strong vision-language pretraining baselines like CLIP and R3M by 25 to 40%.
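As a rough illustration of the "distance between descriptions" idea, the sketch below penalizes an image encoder when the distance between two image embeddings disagrees with the distance between the embeddings of their language descriptions. This is one plausible reading, not the exact objective used in this work: the cosine distance, the squared-error form, and the toy two-dimensional embeddings are all assumptions made for illustration.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def language_distance_loss(img_a, img_b, lang_a, lang_b):
    """Squared gap between image-embedding distance and language-embedding distance.

    Assumed loss form (hypothetical): small when images whose descriptions are
    close in language space are also close in image-embedding space.
    """
    return (cosine_distance(img_a, img_b) - cosine_distance(lang_a, lang_b)) ** 2

# Hypothetical toy embeddings: a sim image and a real image that share one description.
sim_img = [1.0, 0.0]
real_img = [0.0, 1.0]
shared_lang = [0.5, 0.5]

# Image embeddings far apart despite identical language: large loss.
loss_far = language_distance_loss(sim_img, real_img, shared_lang, shared_lang)
# Image embeddings identical and language identical: near-zero loss.
loss_near = language_distance_loss(sim_img, sim_img, shared_lang, shared_lang)
```

Minimizing a loss of this shape over many sim and real pairs would push images with the same description toward the same region of embedding space, which is the domain-invariance effect described above.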
We annotate trajectories with language labels. Our scripted policy automatically applies labels to each demonstration according to the progress and current stage of the trajectory (see video below). We also show that we can label previously collected trajectories using off-the-shelf vision-language models.
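A minimal sketch of stage-based labeling is shown here, assuming the scripted policy exposes a stage name for every timestep. The stage names, the template strings, and the `label_trajectory` helper are hypothetical and are not taken from the released code.

```python
# Hypothetical stage names and description templates (illustrative only).
STAGE_TEMPLATES = {
    "reach": "gripper open, reaching for the {obj}",
    "grasp": "gripper closing around the {obj}",
    "move": "gripper carrying the {obj} toward the {target}",
    "place": "gripper releasing the {obj} on the {target}",
}

def label_trajectory(stages, obj, target):
    """Return one language description per timestep, chosen by that step's stage."""
    return [STAGE_TEMPLATES[s].format(obj=obj, target=target) for s in stages]

# Example: a short trajectory whose first two frames are in the same stage.
labels = label_trajectory(["reach", "reach", "grasp", "move", "place"], "carrot", "plate")
```

Frames in the same stage receive the same description, so the number of unique labels per trajectory equals the number of stages the scripted policy passes through.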
We evaluate against three classes of baselines: no pretraining, vision-language pretrained baselines (R3M and CLIP), and other popular sim2real methods (MMD, Domain Randomization, Automatic Domain Randomization with Random Network Adversary (ADR+RNA)).
Our method outperforms all baselines. Despite being pretrained only on a few hundred trajectories of image-language pairs, our method outperforms CLIP and R3M, which were trained on internet-scale data.
To quantify the impact of natural language on our pretraining approach, we use an alternate pretraining objective of classifying the stage of the trajectory rather than regressing to the language annotation associated with that stage of the trajectory. We find that language provides a measurable benefit of roughly 10-20% across our three task suites, especially in multi-step pick-and-place, perhaps because pretraining with language leverages similarities in language descriptions between the first and second steps of the pick-and-place task.
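The difference between the two pretraining targets can be sketched with toy vectors. In the block below, one-hot stage labels treat every pair of stages as equally unrelated, while language targets for two steps that share wording remain close. The bag-of-words embedding is only a stand-in for a real language model, and the two step descriptions are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def bag_of_words(sentence, vocab):
    """Word-count vector over vocab; a crude stand-in for a language embedding."""
    words = sentence.split()
    return [float(words.count(w)) for w in vocab]

def one_hot(index, size):
    """Stage-classification target: a single 1.0 at the stage index."""
    return [1.0 if i == index else 0.0 for i in range(size)]

# Hypothetical descriptions for two steps of a multi-step pick-and-place task.
step1 = "gripper reaching for the carrot"
step2 = "gripper reaching for the bowl"
vocab = sorted(set((step1 + " " + step2).split()))

lang_sim = cosine(bag_of_words(step1, vocab), bag_of_words(step2, vocab))  # about 0.8
stage_sim = cosine(one_hot(0, 2), one_hot(1, 2))  # 0.0
```

Under this toy setup, the language targets carry the information that the two steps are related, whereas the stage-class targets do not, which is consistent with the explanation offered above for the multi-step pick-and-place results.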
To examine the impact of decreasing language granularity on sim2real performance, we experiment with varying numbers of unique annotations on each trajectory. In the extreme case, the entire trajectory has only a single stage, which means that all images across all trajectories of a task have the same language description embedding. In general, decreasing language granularity hurts performance slightly. Still, our method is robust to lower granularity, which matches our hypothesis that our pretraining approach provides significant performance gains simply by pushing sim and real images into a similar embedding distribution even if the language granularity is extremely coarse.
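One simple way to emulate coarser language granularity is to merge fine-grained stages into fewer contiguous groups, each of which would then share a single description. The helper below is a hypothetical sketch of that merging step and does not necessarily match how the annotations were coarsened in these experiments.

```python
def coarsen_stages(stage_ids, num_stages, num_groups):
    """Map fine stage ids in [0, num_stages) onto num_groups contiguous bins."""
    return [s * num_groups // num_stages for s in stage_ids]

fine = [0, 1, 2, 3, 4, 5]                # six fine-grained stages
two_groups = coarsen_stages(fine, 6, 2)  # [0, 0, 0, 1, 1, 1]
one_group = coarsen_stages(fine, 6, 1)   # [0, 0, 0, 0, 0, 0]
```

With `num_groups=1`, every frame of every trajectory maps to the same group, corresponding to the extreme case in which all images of a task share one language description embedding.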
@inproceedings{yu2024lang4sim2real,
      title={Natural Language Can Help Bridge the Sim2Real Gap},
      author={Yu, Albert and Foote, Adeline and Mooney, Raymond and Martín-Martín, Roberto},
      booktitle={Robotics: Science and Systems (RSS), 2024},
      year={2024}
}