Traditionally, task-conditioned robot policies assume access to all information about the task, such as the reward function being optimized. Intelligent agents such as humans, in contrast, know how to look for important information in their surroundings and take relevant actions based on that context. For example, when given the task of serving a beverage, looking at the time of day can inform the agent what to serve.
In this work, we factorize the problem of looking for information and acting into information-seeking (IS) and information-receiving (IR), respectively: the IS agent is trained to "look" for relevant task context, and the IR agent is trained to act to complete the task. Our method, DISaM (Dual Information-Seeking And Manipulation), splits training into two phases. In Phase 1, we learn the IR policy, which takes in ground-truth context information and controls the movement of the robot. In Phase 2, we learn an IS policy as well as an image encoder such that the context can be correctly reconstructed from the camera observation. Once all parts are trained, they together form a system that takes in image observations and controls both the robot and the camera.
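To make the two-phase structure concrete, here is a minimal sketch of how the three components and their training objectives could be organized. The module names (IRPolicy, ISPolicy, ContextEncoder), the MLP architectures, and the behavior-cloning losses are illustrative placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn

class IRPolicy(nn.Module):
    """Information-receiving policy: (observation, context) -> robot action."""
    def __init__(self, obs_dim, ctx_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + ctx_dim, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, obs, ctx):
        return self.net(torch.cat([obs, ctx], dim=-1))

class ContextEncoder(nn.Module):
    """Predicts the task context from the camera observation (flattened features here)."""
    def __init__(self, img_dim, ctx_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim, 256), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(256, ctx_dim),
        )

    def forward(self, img):
        return self.net(img)

class ISPolicy(nn.Module):
    """Information-seeking policy: camera observation -> camera ("looking") action."""
    def __init__(self, img_dim, cam_act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim, 256), nn.ReLU(),
            nn.Linear(256, cam_act_dim),
        )

    def forward(self, img):
        return self.net(img)

# Phase 1: learn the IR policy, conditioning on ground-truth context labels.
def train_ir_step(ir, optimizer, obs, ctx_gt, act_gt):
    loss = nn.functional.mse_loss(ir(obs, ctx_gt), act_gt)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Phase 2: learn the IS policy and the encoder so that the context can be
# reconstructed from the camera observations reached by the IS actions.
def train_is_step(is_pi, enc, optimizer, img, cam_act_gt, ctx_gt):
    loss = nn.functional.mse_loss(is_pi(img), cam_act_gt) \
         + nn.functional.mse_loss(enc(img), ctx_gt)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```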
During deployment, DISaM estimates the uncertainty of the IR policy over the next action by conditioning it on several contexts generated with the encoder. If the uncertainty of the IR policy is high (above a threshold), the IS policy takes information-seeking actions. Once the IS policy has found the correct context, the IR uncertainty over the next action falls below the threshold and DISaM executes IR actions to complete the task.
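A sketch of this uncertainty-gated switching, reusing the module interfaces from the sketch above, is shown below. Generating multiple contexts via dropout-enabled forward passes of the encoder and measuring uncertainty as the variance of the resulting IR actions are illustrative assumptions; the paper's exact estimator may differ.

```python
import torch

@torch.no_grad()
def disam_step(ir, is_pi, enc, img, obs, threshold=0.05, n_samples=8):
    """One control step: act with IR if its action uncertainty is low, else seek information."""
    enc.train()  # keep dropout active so repeated passes yield different context samples
    contexts = torch.stack([enc(img) for _ in range(n_samples)])   # (n_samples, ctx_dim)
    actions = torch.stack([ir(obs, c) for c in contexts])          # (n_samples, act_dim)
    uncertainty = actions.var(dim=0).mean().item()

    if uncertainty > threshold:
        # IR is unsure about the context: move the camera to gather information.
        return "IS", is_pi(img)
    # Context is resolved: execute the manipulation action (here, the mean over samples).
    return "IR", actions.mean(dim=0)
```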
The following videos demonstrate how control is switched between the information-seeking (IS) and information-receiving (IR) agents. The text at the top is a semantic representation of the information that IR is uncertain about, and we annotate the image when IS is able to find that information.
Below we provide more rollouts in the simulation environments to demonstrate the variety of behaviors the IS policy learns. The left frame corresponds to the IS agent's observations and the right frame is a task visualization.