COLLAGE: Adaptive Fusion-based Retrieval for Augmented Policy Learning

The University of Texas at Austin

Conference on Robot Learning (CoRL), 2025

Policy Rollouts with COLLAGE

  • Pen-Cup

  • Chips-Box

  • Pick-Umbrella

  • Stir-Spatula

  • Stack-Lego

  • Scrub-Plate

Each policy is trained with 5 target demonstrations and data retrieved using COLLAGE.

TL;DR: Few-shot imitation learning by retrieving sub-trajectories using diverse similarity metrics and adaptively weighting the retrieved data based on how well it aligns with target demonstrations under a learned policy.

Abstract

In this work, we study the problem of data retrieval for few-shot imitation learning: selecting data from a large dataset to train a performant policy for a specific task, given only a few target demonstrations. Prior methods retrieve data using a single-feature distance heuristic, assuming that the best demonstrations are those that most closely resemble the target examples in visual, semantic, or motion space. However, this approach captures only a subset of the relevant information and is prone to introducing detrimental demonstrations, such as those from unrelated tasks with similar scene layouts or tasks with similar motion but divergent goals.

We present COLLAGE, a method for COLLective data AGgrEgation in few-shot imitation learning that uses an adaptive late fusion mechanism to guide the selection of relevant demonstrations based on a task-specific combination of multiple cues. COLLAGE assigns weights to subsets of the dataset pre-selected using single-feature retrieval (e.g., appearance, shape, or language similarity), based on how well a policy trained on each subset predicts the few target demonstrations. These weights are then used during training to importance-sample data across the retrieved subsets. This strategy is general, feature-agnostic, and flexible, enabling COLLAGE to leverage complementary information and outperform both single-modality and multitask baselines. In extensive experiments, COLLAGE improves average performance by 5.1% in simulation (LIBERO-10) and by 16.6% on real-world tasks drawn from the DROID dataset.
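
The weighting step is compact enough to sketch in a few lines. Below is a minimal, illustrative version that assumes the weights come from a temperature softmax over each reference policy's mean log-likelihood on the target demonstrations; the function name, example values, and the exact normalization are our assumptions, not the paper's code.

```python
import numpy as np

def modality_weights(per_modality_loglikes, temperature=1.0):
    """Map each modality's log-likelihoods of the target demos to an
    importance weight via a temperature softmax (illustrative choice)."""
    scores = np.array([np.mean(ll) for ll in per_modality_loglikes])
    z = (scores - scores.max()) / temperature  # subtract max for numerical stability
    w = np.exp(z)
    return w / w.sum()

# Hypothetical mean log-likelihoods of the 5 target demos under each
# reference policy (visual, motion, shape, language).
loglikes = [
    [-1.2, -0.9, -1.1, -1.0, -1.3],   # visual
    [-2.5, -2.2, -2.8, -2.4, -2.6],   # motion
    [-1.0, -1.1, -0.8, -0.9, -1.2],   # shape
    [-3.1, -2.9, -3.4, -3.0, -3.2],   # language
]
print(modality_weights(loglikes))  # higher likelihood -> larger weight
```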

Video

Method Overview

Overview of COLLAGE. (1) Given a set of target demonstrations, each modality (e.g., visual, motion, shape, or language) retrieves a set of similar (sub-)trajectories from a prior dataset of diverse demonstrations. (2) We train a reference policy on each modality's retrieved (sub-)trajectories and compute the log-likelihood of the target trajectories under it, i.e., how well the policy predicts the target actions at each target state. These log-likelihoods determine the importance weight assigned to each modality. (3) We train the final policy on all retrieved data, sampling more frequently from modalities with higher weights.
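
Step 3 amounts to importance sampling across the retrieved subsets. Here is a minimal PyTorch sketch, assuming each modality's retrieved data is wrapped in a `Dataset`; `weighted_retrieval_loader`, `subsets`, and `weights` are illustrative names rather than the released code.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def weighted_retrieval_loader(subsets, weights, batch_size=64):
    """Sample training batches across the retrieved subsets in proportion
    to their modality weights. Each sample's probability is its subset's
    weight divided by the subset size, so subsets (not individual
    samples) are drawn according to `weights`."""
    per_sample = []
    for subset, w in zip(subsets, weights):
        per_sample.extend([w / len(subset)] * len(subset))
    sampler = WeightedRandomSampler(per_sample,
                                    num_samples=len(per_sample),
                                    replacement=True)
    return DataLoader(ConcatDataset(subsets), batch_size=batch_size,
                      sampler=sampler)
```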

Experiments

Simulated Experiments

Simulated experiment results
We evaluate COLLAGE on all tasks from the LIBERO-10 benchmark, using 5 target demonstrations per task and 4,500 demonstrations from LIBERO-90 as the prior dataset. COLLAGE achieves a 10.75% relative performance improvement over the prior state-of-the-art retrieval method STRAP (ICLR ’25). It also outperforms the single-modality retrieval baselines, including POINTNET (Shape) by 14.25% and LANG (Language) by 25.31%. Note: all methods achieve a 0% success rate on the Moka-Moka task.

Real-World Experiments

Real-world experiment results
Real-world evaluation on six manipulation tasks using the DROID dataset. For each task, we use only 5 target demonstrations and retrieve from a pool of 30k successful episodes. COLLAGE achieves an average success rate of 6.83/15, representing a 58% relative performance improvement over STRAP (4.33/15) and a 64% improvement over LANG (4.16/15). Policies trained solely on the 5 in-domain demonstrations (no retrieval) achieve only 1.00/15 success on average. In contrast, COLLAGE effectively leverages relevant demonstrations from DROID to significantly boost policy performance.

Importance Weights Predicted by COLLAGE

Modality Weights Pie Chart
Importance weights assigned by COLLAGE to different modalities used in our framework (Visual, Motion, Shape, Language).

Qualitative Results

Below, we show demonstration segments from LIBERO-10 alongside the sub-trajectories retrieved from LIBERO-90 by the visual, motion, and shape modalities, respectively. The visual modality retrieves segments that best match in appearance using DINO features; the motion modality finds segments sharing similar motion patterns using optical flow features; and the shape modality selects segments with comparable geometry using PointNet features. A retrieval sketch follows the paragraph below.
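
For intuition, each single-modality retrieval reduces to a nearest-neighbor search in one feature space. The sketch below assumes each sub-trajectory is summarized by a single feature vector and uses cosine similarity as the metric; the actual system may match sub-sequences differently (e.g., with subsequence dynamic time warping, as in STRAP), and `retrieve_topk` is a hypothetical helper.

```python
import numpy as np

def retrieve_topk(target_feats, prior_feats, k=100):
    """Return indices of the k prior sub-trajectories whose features are
    closest (by cosine similarity) to any target segment. The features
    could come from DINO, optical flow, or PointNet encoders; one call
    per modality yields one retrieved subset."""
    t = np.asarray(target_feats, dtype=float)
    p = np.asarray(prior_feats, dtype=float)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    sims = p @ t.T                 # (num_prior, num_target) similarities
    best = sims.max(axis=1)        # each prior segment's best target match
    return np.argsort(-best)[:k]   # indices of the top-k prior segments
```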

Click the ← and → arrows to navigate through all tasks.

Task Instruction: "Pick up the Book and Place it in the Back Compartment of the Caddy"

Task Instruction: "Put the Black Bowl in the Bottom Drawer of the Cabinet and Close it"

Task Instruction: "Put Both the Alphabet Soup and the Tomato Sauce in the Basket"

Task Instruction: "Put the Yellow and White Mug in the Microwave and Close it"

Task Instruction: "Put the White Mug on the Left Plate and Put the Yellow and White Mug on the Right Plate"

Task Instruction: "Put the White Mug on the Plate and Put the Chocolate Pudding to the Right of the Plate"

Task Instruction: "Put Both the Alphabet Soup and the Cream Cheese Box in the Basket"