HALO is a visuomotor policy for long-horizon robot control that learns to retrieve information from its past interactions (including past image observations, proprioceptive states, and actions) to predict the current action. These past interactions are stored in the long context of a transformer-based policy. However, long-context policies trained by imitation learning struggle with spurious correlations from history and compounding prediction errors. HALO addresses these challenges in two ways: it distills vision-language model (VLM) priors through video question-answering supervision to reduce spurious correlations, and it applies Top-K sparse attention to reduce the impact of accumulated errors in the context. Notably, HALO outperforms both text-based memory summarization and hand-engineered feature stores, as retaining raw observations and actions lets the policy exploit fine-grained details that VLMs discard during summarization and that human designers overlook when hand-crafting features.
General-purpose robots operating in partially observable environments, such as homes, require memory to support autonomy. They must recall diverse information from the past, such as where objects were placed, which tasks a human partner has completed, and when an appliance was turned on, to accomplish a wide range of tasks. Achieving this versatility requires a memory retrieval mechanism that generalizes well across tasks. However, hand-designed or heuristic-based methods rely on task-specific assumptions that may not transfer to different settings.
Transformer architectures that use attention over long contexts for memory retrieval provide a promising alternative, as they learn retrieval from data without task-specific assumptions. However, directly incorporating a long-context transformer architecture into imitation learning from offline data introduces two key challenges: (1) the policy may learn spurious correlations between information from the past and predicted actions, and (2) errors accumulate in the memory over time, because prediction inaccuracies compound through interaction with the environment, leading to model drift and cascading failures in long-horizon control.
To address both challenges, we introduce HALO, a visuomotor policy with an attention-based memory retrieval mechanism for long-horizon control. To suppress spurious correlations, HALO leverages vision-language model (VLM) priors to steer retrieval toward task-relevant information. Concretely, it generates task-relevant, memory-dependent question–answer pairs from demonstration trajectories and trains the policy jointly with a video question-answering objective, transferring VLM priors to the visuomotor policy. To reduce the impact of accumulated errors in memory during closed-loop control, HALO uses sparse attention that restricts retrieval to only the most relevant parts of the history. Together, these components enable more reliable long-horizon control by guiding the policy to retrieve task-relevant information from up to eight minutes of past experience.
Long-horizon manipulation requires reasoning over context that is no longer visible: where an object was placed, how many items were stored, or when a stove was activated. We study tasks spanning up to eight minutes, where the policy must retrieve information from its history of past observations and actions to make correct decisions. Below, we provide a motivating example of such long-horizon tasks in mobile manipulation.
HALO is a visuomotor policy with an attention-based memory retrieval mechanism for long-horizon control. Two challenges arise when applying long-context transformers to imitation learning: policies may learn spurious correlations from irrelevant history, and prediction errors accumulate over time during closed-loop execution. HALO addresses both with the two components below.
HALO generates task-relevant question–answer pairs from demonstration trajectories using a VLM pipeline. The policy is then co-trained with a video QA objective alongside the imitation learning objective. This induces priors over retrieving task-relevant information from history, reducing spurious correlations during closed-loop control.
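As a concrete illustration of the co-training objective, the sketch below computes the joint loss for a single step in plain Python. It is not the authors' implementation: it assumes, purely for illustration, that the action and the answer are each a single categorical choice scored by logits, and the names `cross_entropy`, `joint_loss`, and the default weight `lam` are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of index `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def joint_loss(action_logits, expert_action, answer_logits, answer, lam=1.0):
    """Return (L, L_IL, L_VQA) with L = L_IL + lam * L_VQA.

    Assumption for this sketch: actions and answers are categorical.
    `lam` is a hypothetical weighting value, not one reported in the source.
    """
    l_il = cross_entropy(action_logits, expert_action)  # imitation term
    l_vqa = cross_entropy(answer_logits, answer)        # video QA term
    return l_il + lam * l_vqa, l_il, l_vqa


# Toy usage: two candidate actions, four candidate answers, uniform logits.
total, l_il, l_vqa = joint_loss([0.0, 0.0], 0, [0.0, 0.0, 0.0, 0.0], 1, lam=0.5)
print(total, l_il, l_vqa)
```

Both terms are computed from the same policy, so minimizing their weighted sum pushes gradients from the question-answering task into the parameters that also drive action prediction.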
VQA Generation
Joint Training
for each step:
    # imitation: predict expert action
    L_IL = -log P(expert action | trajectory, task)
    # VQA: answer from history
    L_VQA = -log P(answer | trajectory, question)
    # joint update
    L = L_IL + λ · L_VQA
    # update policy to minimize L

During inference, the policy attends over all past observations and actions in its history context. This full-context attention introduces noise from accumulated errors in the history. To reduce the impact of this noise, we sparsify the attention to retrieve only the most informative pieces of information from history. A straight-through estimator keeps the discrete selection differentiable during training.
Training
for i in 0..t-1:
    score[i] = dot(query, key[i])
# hard top-k mask
top_idx = argsort(score)[-k:]
mask = zeros(t); mask[top_idx] = 1
# straight-through estimator:
# forward uses hard mask,
# backward grad flows through score
mask_st = mask.detach() + score - score.detach()
context = softmax(mask_st) @ val

Evaluation
for i in 0..t-1:
    score[i] = dot(query, key[i])
# hard Top-K (no gradient needed)
top_idx = argsort(score)[-k:]
attn = softmax(score[top_idx])
context = attn @ val[top_idx]
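The evaluation-time retrieval can be made concrete with a small, dependency-free Python sketch. This is a minimal single-query illustration under stated assumptions, not the authors' implementation: the function name `topk_sparse_attention`, the list-based vectors, and breaking ties in favor of the earliest history index are all choices made for this example. Scores are unscaled dot products, and the straight-through training path is not covered because it needs an autodiff framework.

```python
import math


def topk_sparse_attention(query, keys, values, k):
    """Attend over only the k highest-scoring history entries.

    query:  list of floats (one query vector)
    keys:   list of key vectors, one per history step
    values: list of value vectors, one per history step
    Returns (context, top_idx, weights).
    """
    if not keys or len(keys) != len(values):
        raise ValueError("keys and values must be non-empty and equal length")
    k = max(1, min(k, len(keys)))

    # Dot-product relevance score for every history step.
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]

    # Hard Top-K selection; the stable sort keeps earlier steps first on ties.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_idx = order[:k]

    # Softmax restricted to the selected steps (max-subtracted for stability).
    m = max(scores[i] for i in top_idx)
    exps = [math.exp(scores[i] - m) for i in top_idx]
    z = sum(exps)
    weights = [e / z for e in exps]

    # Weighted sum of the selected value vectors.
    dim = len(values[0])
    context = [
        sum(w * values[i][d] for w, i in zip(weights, top_idx))
        for d in range(dim)
    ]
    return context, top_idx, weights


# Toy usage: four history steps, keep the two most relevant ones.
keys = [[0.0, 1.0], [2.0, 0.0], [-1.0, 0.0], [2.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 3.0]]
context, top_idx, weights = topk_sparse_attention([1.0, 0.0], keys, values, k=2)
print(top_idx, weights, context)
```

Because unselected steps receive exactly zero weight, an erroneous entry elsewhere in the history cannot influence the retrieved context unless it scores among the top k.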
Rollouts on a physical robot across five tasks (20 rollouts per task). HALO improves on every task, reaching 55% average success versus 36% for the Standard Transformer baseline.
Average task success rate across four simulation tasks (50 rollouts each). HALO achieves 41% average success, outperforming all baselines.
We group failures into two categories: manipulation failures and memory failures. We find that HALO reduces manipulation failures by an absolute 8% and memory failures by 25% over the Standard Transformer baseline in the Retrieve Object task. Below we show the failure modes of HALO. We hope they inspire future work.
If you find this work useful, please cite:
@inproceedings{shah2026halo,
title={Memory Retrieval in Visuomotor Policies for Long-Horizon Robot Control},
author={Shah, Rutav and Li, Yisu and Bello, Femi and Zhu, Yuke and Mart{\'{i}}n-Mart{\'{i}}n, Roberto},
booktitle={Proceedings of Robotics: Science and Systems},
year={2026}
}