HALO — Memory Retrieval in Visuomotor Policies for Long-Horizon Robot Control

Numerical + Event-Time Info. Cooking Popcorn: HALO adds the right number of scoops, then turns off the stove 4.5 minutes after switching it on

Event-Time Info. Boiling Spaghetti: HALO boils the spaghetti for 8 minutes before turning off the stove

Event-Time Info. Toasting Bread: HALO toasts the bread for 2 minutes before turning off the stove

Relational Info. Return to Same Container: HALO returns the bowl to the original plate

Numerical Info. Store N Objects: HALO stores exactly 2 packets and closes the door

Event-Time Info. Heat Stove for T Minutes: HALO turns off the stove after 30 seconds

Numerical Info. + Human-Robot Collab. Store N Objects: Human places 1 cup; HALO places the remaining 2

Relational Info. Return to Same Container: HALO returns the bowl to the original plate

Numerical Info. Store N Objects: HALO stores exactly 2 packets and closes the door

Event-Time Info. Heat Stove for T Minutes: HALO turns off the stove after 30 seconds

Numerical Info. + Human-Robot Collab. Store N Objects: Human places 1 cup; HALO places the remaining 2

Relational Info. Return to Same Container: HALO returns the bowl to the original plate

Numerical Info. Store N Objects: HALO stores exactly 2 packets and closes the door

Spatial Info. Retrieve Object: HALO retrieves the granola bar from its last seen location

Event-Time Info. Heat Stove for T Minutes: HALO turns off the stove after 1 minute

Numerical Info. + Human-Robot Collab. Store N Objects: Human places 1 cup; HALO places the remaining 2

Relational Info. Return to Same Container: HALO returns the bowl to the original plate after recovering from a failed grasp

Event-Time Info. Heat Stove for T Minutes: HALO turns off the stove after 1 minute

Numerical Info. + Human-Robot Collab. Store N Objects: Human places 1 cup; HALO places the remaining 2

Event-Time Info. Heat Stove for T Minutes: HALO turns off the stove after 90 seconds

Numerical Info. + Human-Robot Collab. Store N Objects: Human places 2 cups; HALO places the remaining 1

Event-Time Info. Heat Stove for T Minutes: HALO turns off the stove after 6 minutes

Numerical Info. + Human-Robot Collab. Store N Objects: Human places 2 cups; HALO places the remaining 1

Event-Time Info. Heat Stove for T Minutes: HALO turns off the stove after 8 minutes

Numerical Info. + Human-Robot Collab. Store N Objects: Human places 2 cups; HALO places the remaining 1

Abstract

HALO is a visuomotor policy for long-horizon robot control that learns to retrieve information from its past interaction (including past image observations, proprioceptives, and actions) for predicting the current action. These past interactions are stored in the long context of a transformer-based policy. However, long-context policies trained by imitation learning struggle with spurious correlations from history and compounding prediction errors. HALO addresses these challenges by distilling VLM priors through video question-answering supervision, reducing spurious correlations, and applying Top-K sparse attention to reduce the impact of accumulated errors in the context. Notably, HALO outperforms both text-based memory summarization and hand-engineered feature stores, as retaining raw observations and actions lets the policy exploit fine-grained details that VLMs discard during summarization and that human designers overlook when hand-crafting features.

Read full abstract

General-purpose robots operating in partially observable environments, such as homes, require memory to support autonomy. They must recall diverse information from the past, such as where objects were placed, which tasks a human partner has completed, and when an appliance was turned on, to accomplish a wide range of tasks. Achieving this versatility requires a memory retrieval mechanism that generalizes well across tasks. However, hand-designed or heuristic-based methods rely on task-specific assumptions that may not transfer to different settings.

Transformer architectures that use attention over long contexts for memory retrieval provide a promising alternative, as they learn retrieval from data without task-specific assumptions. However, directly incorporating long-context transformer architecture into imitation learning from offline data introduces two key challenges: (1) the policy may learn spurious correlations between information from the past and predicted actions, and (2) errors accumulate over time in the memory due to prediction inaccuracies and their compounding interactions with the environment, leading to model drift and cascading failures in long-horizon control.

To address both challenges, we introduce HALO, a visuomotor policy with an attention-based memory retrieval mechanism for long-horizon control. To suppress spurious correlations, HALO leverages vision-language model (VLM) priors to steer retrieval toward task-relevant information. Concretely, it generates task-relevant, memory-dependent question–answer pairs from demonstration trajectories and trains the policy jointly with a video question-answering objective, transferring VLM priors to the visuomotor policy. To reduce the impact of accumulated errors in memory during closed-loop control, HALO uses sparse attention that restricts retrieval to only the most relevant parts of the history. Together, these components enable more reliable long-horizon control by guiding the policy to retrieve task-relevant information from up to eight minutes of past experience.

Problem Overview

Long-horizon manipulation requires reasoning over context that is no longer visible: where an object was placed, how many items were stored, or when a stove was activated. We study tasks spanning up to eight minutes, where the policy must retrieve from its history of past observations and actions to correctly make decisions. Below, we provide a motivating example of such long-horizon tasks in mobile manipulation.

Overview of the long-horizon robot control problem: the robot must retrieve task-relevant information from a long history of past observations

Method

HALO is a visuomotor policy with an attention-based memory retrieval mechanism for long-horizon control. Two challenges arise when applying long-context transformers to imitation learning: policies may learn spurious correlations from irrelevant history, and prediction errors accumulate over time during closed-loop execution. HALO addresses both with the two components below.

Distilling Vision-Language Model Priors via Video Question-Answering

HALO generates task-relevant question–answer pairs from demonstration trajectories using a VLM pipeline. The policy is then co-trained with a video QA objective alongside the imitation learning objective. This induces priors over retrieving task-relevant information from history, reducing spurious correlations during closed-loop control.

VQA Generation

Input trajectory, task instruction

↓

Summarize object detector + VLM

↓

Output text description of scene & activities

↓

Generate sample a frame number, LLM generates Q&A

↓

Output

Q: "Where was the granola bar at t=8?"
A: "Right side of the countertop"

↓

Filter score relevance + correctness

↓

Output curated Q&A dataset

Joint Training

for each step:

  # imitation: predict expert action
  L_IL = -log P(expert action |
               trajectory, task)

  # VQA: answer from history
  L_VQA = -log P(answer |
                trajectory, question)

  # joint update
  L = L_IL + λ · L_VQA
  # update policy to minimize L

Reducing Model Drift with Top-K Attention Sparsification

During inference, the policy attends over all past observations and actions in its history context. This full-context attention introduces noise from accumulated errors in the history. To reduce the impact of this noise, we sparsify the attention to retrieve only the most informative pieces of information from history. A straight-through estimator keeps the discrete selection differentiable during training.

Training

for i in 0..t-1:
  score[i] = dot(query, key[i])

# hard top-k mask
top_idx  = argsort(score)[-k:]
mask     = zeros(t); mask[top_idx] = 1

# straight-through estimator:
# forward uses hard mask,
# backward grad flows through score
mask_st  = mask.detach() + score
         - score.detach()

context  = softmax(mask_st) @ val

Evaluation

for i in 0..t-1:
  score[i] = dot(query, key[i])

# hard Top-K (no gradient needed)
top_idx = argsort(score)[-k:]
attn    = softmax(score[top_idx])
context = attn @ val[top_idx]

HALO method overview: VQA-based VLM prior distillation and Top-K sparse attention — HALO overview. VLM-generated question–answer pairs supervise attention toward task-relevant history frames (L). Top-K sparse attention restricts retrieval to the most relevant context, reducing error propagation during closed-loop control (R).

Results

Real-World Experiments

Robot rollouts on a physical robot across five tasks (20 rollouts per task). HALO improves on every task, reaching 55% average success versus 36% for the Standard Transformer baseline.

Relational Info. Return to Same Container

HALO returns the bowl to the original plate

HALO returns the bowl to the original plate after recovering from a failed grasp

Numerical Info. Store N Objects

HALO stores exactly 2 packets and closes the door

Event-Time Info. Heat Stove for T Minutes

HALO turns off the stove after 30 seconds

HALO turns off the stove after 1 minute

HALO turns off the stove after 90 seconds

HALO turns off the stove after 6 minutes

HALO turns off the stove after 8 minutes

Numerical Info. + Human-Robot Collab. Store N Objects

Human places 1 cup; HALO places the remaining 2

Human places 2 cups; HALO places the remaining 1

Spatial Info. Retrieve Object

HALO retrieves the granola bar from its last seen location

Simulation Benchmark

Average task success rate across four simulation tasks (50 rollouts each). HALO achieves 41% average success, outperforming all baselines.

Retrieve Object: oil bottle retrieved from its last seen location

Return to Same Container: bowl returned to the original plate

Store N Objects: bread items stored and microwave door closed

Heat Stove: stove turned off after the specified duration

Key Takeaways

Text summaries lose granular detail. ReMemBer (−23 pts) stores VLM text descriptions of observations, but these discard fine-grained details such as precise event timing and low-level action information to retrace the path where the object was found, which may be critical for low-level control.
Learning retrieval beats engineering it. Hand-Designed Features achieves competitive spatial performance, showing that spatial cues are well-captured even without learning, yet HALO matches this without any manually designed rules. Surprisingly, HALO outperforms hand-designed features overall, as the policy can exploit additional information missed by human designers, such as neighboring visual frames that provide implicit cues about the environment.
Selective retrieval outperforms compression. Token Merging and Scene Memory Transformer compress history, discarding fine-grained detail. HALO retains the full context and retrieves selectively, providing flexibility to recollect any information but reducing the impact of noise at the same time.
No single method wins every task; HALO wins overall. Each baseline excels in its specialty (Hand-Designed on spatial, SAM2Act++ on counting) but fails elsewhere. HALO's 41% average comes from competitive performance across all four memory types.

Failure Cases

We categorize failures into two categories: manipulation and memory failure. We find that HALO reduces manipulation failures by an absolute 8% and memory failures by 25% over the Standard Transformer baseline in the Retrieve Object task. Below we show the failure modes of HALO. We hope it inspires future work.

Manipulation Failures Grasp / Motion Error