OopsieVerse

A Safety Benchmark with Damage-Aware Simulation for Robot Manipulation

Arnav Balaji*, Arpit Bahety*, Sriniket Ambatipudi, Daniel Lam, Junhong Xu, Roberto Martín-Martín

University of Texas at Austin

*Equal contribution; order by dice roll

Robotics: Science and Systems (RSS) 2026


Overview

What is OopsieVerse?

What if your home robot finishes the job but breaks your kitchen in the process?

Robots keep getting better at handling everyday objects. But finishing the task isn't the whole story. A robot that picks up an egg and cracks it still isn't ready for a real home. Neither is one that pours a glass of water and spills half of it. Safety is the part that's missing. And today's simulators barely measure it.

OopsieVerse is a unified, damage-aware simulation framework for household manipulation. At its core, DamageSim turns physical signals like contact forces, heat, and liquid into measurable mechanical, thermal, and fluid damage. This lets the benchmark score not just whether a robot finished the task, but whether it did so safely. It runs in both BEHAVIOR-1K and RoboCasa. It also supports safer data collection, damage-aware imitation and reinforcement learning, and Vision-Language-Action safety evaluation.

Explore the work

01 DamageSim

A simulator-agnostic plugin that turns contact forces, heat, and liquid exposure into measurable mechanical, thermal, and fluid damage.

02 OopsieBench

A suite of 32 household tasks that contrast easy-but-risky strategies with safer, more careful ones.

03 Use Cases

Safer data collection, damage-aware imitation and reinforcement learning, VLA safety evals, and sim-to-real transfer.

Cite

BibTeX

@inproceedings{balaji2026oopsieverse,
  title={OopsieVerse: A Safety Benchmark with Damage-Aware Simulation for Robot Manipulation},
  author={Balaji, Arnav and Bahety, Arpit and Ambatipudi, Sriniket and Lam, Daniel and Xu, Junhong and Mart{\'i}n-Mart{\'i}n, Roberto},
  booktitle={Robotics: Science and Systems (RSS), 2026},
  year={2026}
}

Framework

DamageSim

DamageSim is our simulator-agnostic plugin that makes physical safety measurable by tracking object-centric “health.” It monitors simulator signals, such as contact forces, temperature, and liquid exposure, and converts them into mechanical (e.g., impact or compression), thermal, and fluid damage, which can be used as observations, rewards, or termination conditions. We instantiate it in RoboCasa (MuJoCo) and BEHAVIOR-1K (Omniverse), showcasing its consistency across simulators.

Mechanical Damage

Thermal Damage

Fluid Damage
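To make the idea of object-centric health concrete, here is a minimal sketch of how simulator signals might be converted into the three damage types. It is not the DamageSim API: the class `ObjectHealth`, its fields, the thresholds, and the gain constants are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ObjectHealth:
    """Hypothetical per-object health record; every constant here is an assumption."""

    name: str
    health: float = 100.0
    force_threshold: float = 20.0  # N; forces below this cause no damage (assumed)
    temp_threshold: float = 80.0   # deg C; temperatures below this cause no damage (assumed)
    water_sensitive: bool = False  # whether liquid exposure harms this object (assumed)
    mech_gain: float = 0.5         # health lost per newton above the threshold (assumed)
    thermal_gain: float = 2.0      # health lost per degree-second above the threshold (assumed)
    fluid_rate: float = 10.0       # health lost per second of liquid exposure (assumed)

    def step(self, contact_force: float, temperature: float, wet: bool, dt: float) -> Dict[str, float]:
        """Convert one simulation step of raw signals into per-modality damage."""
        mechanical = self.mech_gain * max(0.0, contact_force - self.force_threshold)
        thermal = self.thermal_gain * max(0.0, temperature - self.temp_threshold) * dt
        fluid = self.fluid_rate * dt if (wet and self.water_sensitive) else 0.0
        # Health never drops below zero.
        self.health = max(0.0, self.health - (mechanical + thermal + fluid))
        return {"mechanical": mechanical, "thermal": thermal, "fluid": fluid, "health": self.health}
```

With these made-up constants, an egg with `force_threshold=5.0` that receives a 25 N contact at room temperature loses 10 health points in that step. Because the tracker only consumes scalar signals, the same logic could in principle sit on top of any simulator that reports them.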

Damage-Augmented POMDP Implementation with DamageSim

DamageSim is simulator-agnostic. We instantiate it in BEHAVIOR-1K (NVIDIA Omniverse) and RoboCasa (MuJoCo) to demonstrate consistent safety measurement across different physics backends.

MuJoCo: RoboCasa

Omniverse: BEHAVIOR-1K
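One way to picture a damage-augmented decision process is as a thin wrapper around an existing environment. The sketch below is only an illustration under our own assumptions: `ToyEnv` is a stand-in rather than a real simulator, and `DamageAugmentedEnv`, `damage_fn`, and the four-tuple `step` signature are hypothetical names and conventions, not the interface shipped with DamageSim. It appends health to the observation and ends the episode when health reaches zero.

```python
from typing import Any, Callable, Dict, Tuple


class ToyEnv:
    """Stand-in environment (not a real simulator): the action is read as a contact force in newtons."""

    def reset(self) -> float:
        return 0.0

    def step(self, action: float) -> Tuple[float, float, bool, Dict[str, Any]]:
        # Returns (observation, task reward, done flag, info carrying raw signals).
        return 0.0, 1.0, False, {"contact_force": float(action)}


class DamageAugmentedEnv:
    """Hypothetical wrapper that adds health to observations and terminates at zero health."""

    def __init__(self, env: Any, damage_fn: Callable[[Dict[str, Any]], float], max_health: float = 100.0):
        self.env = env
        self.damage_fn = damage_fn  # maps raw simulator signals to a damage amount
        self.max_health = max_health
        self.health = max_health

    def reset(self) -> Dict[str, float]:
        self.health = self.max_health
        return {"obs": self.env.reset(), "health": self.health}

    def step(self, action: float) -> Tuple[Dict[str, float], float, bool, Dict[str, Any]]:
        obs, reward, done, info = self.env.step(action)
        damage = self.damage_fn(info)
        self.health = max(0.0, self.health - damage)
        done = done or self.health <= 0.0      # damage as a termination condition
        info = {**info, "damage": damage}      # exposed so a reward term can use it
        return {"obs": obs, "health": self.health}, reward, done, info
```

The wrapper leaves the task reward untouched and only reports the per-step damage in `info`, so how strongly damage is penalized remains a separate design choice.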

Benchmark

OopsieBench

OopsieBench is a suite of 32 household tasks, all listed below. The suite is designed to (i) expose policies to realistic, physically damaging failure modes in household manipulation and (ii) make safety measurable by contrasting easy but risky strategies with safer ones that require more careful interaction (e.g., gentler contact, safer approaches, or avoiding hazards). The benchmark spans diverse scenes, objects, and damage modalities, is cross-platform (BEHAVIOR-1K and RoboCasa), and includes a dataset of safe and unsafe human teleoperation demonstrations for five tasks.

Pick Egg (RoboCasa)

Place In Microwave (RoboCasa)

Turn On Stove (RoboCasa)

Turn On Faucet (B1K)

Open Single Door (RoboCasa)

Fill Bowl (B1K)

Attach Camera (B1K)

Food in Microwave (B1K)

Dishes to Sink (RoboCasa)

Turn On Microwave (RoboCasa)

Place Plate (RoboCasa)

Nav & Lift Bowl (RoboCasa)

Serve Pastry (RoboCasa)

Pick Egg (B1K)

Pour Water (B1K)

Turn on Stove (RoboCasa)

Nav to Table (B1K)

Wipe Counter (RoboCasa)

Wipe Counter (B1K)

Shelve Item (B1K)

Turn On Faucet (RoboCasa)

Heat Saucepot (B1K)

Open Drawer (B1K)

Pick Scrub (B1K)

Counter to Microwave (RoboCasa)

Open Single Door (B1K)

Toggle on Microwave (RoboCasa)

Place Plate (B1K)

Add Firewood (B1K)

Prepare Breakfast (RoboCasa)

Prepare Coffee (RoboCasa)

Ignite Wood (B1K)


Experiments

What can you do with OopsieVerse?

1 Safety-aware data collection

OopsieVerse can provide real-time safety feedback during data collection, helping operators record safer demonstrations. The feedback takes two forms: (1) damage-based coloration and (2) health bars for tracked objects. The health bars are particularly helpful when damage occurs outside the current camera view (e.g., the wineglass falling behind the counter).
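As a rough illustration of damage-based coloration, the sketch below maps an object's remaining health to an RGB tint that fades from green to red. The function name `health_to_rgb` and the linear green-to-red ramp are assumptions for this example; the actual rendering in OopsieVerse may differ.

```python
from typing import Tuple


def health_to_rgb(health: float, max_health: float = 100.0) -> Tuple[int, int, int]:
    """Map remaining health to an (R, G, B) tint: full health is green, zero health is red."""
    if max_health <= 0.0:
        raise ValueError("max_health must be positive")
    # Clamp the health fraction to [0, 1] so out-of-range values still give a valid color.
    frac = min(1.0, max(0.0, health / max_health))
    red = int(round(255 * (1.0 - frac)))
    green = int(round(255 * frac))
    return (red, green, 0)
```

The same fraction could drive the fill level of a health bar, which stays visible even when the damaged object itself is out of view.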

2 Imitation learning (IL)

Many real-world (or sim-collected) demonstration datasets mix safe and unsafe behavior. Using DamageSim, we can automatically flag trajectories that incur damage and filter them out to construct a dataset of only safe demonstrations. In the videos below, the top row shows an IL policy trained on the full dataset (including unsafe trajectories), while the bottom row shows a policy trained on the filtered safe-only demonstrations.

Pour Glass

Add Firewood

Lift Egg

Shelve Item

Wipe Countertop
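The filtering step described above can be sketched in a few lines. The trajectory layout (a dict with `"actions"` and per-step `"damage"` lists) and the function `filter_safe` are hypothetical; they stand in for whatever format the dataset actually uses.

```python
from typing import Dict, List

# Hypothetical trajectory layout: {"actions": [...], "damage": [per-step damage values]}
Trajectory = Dict[str, list]


def filter_safe(trajectories: List[Trajectory], max_damage: float = 0.0) -> List[Trajectory]:
    """Keep only trajectories whose cumulative damage does not exceed max_damage."""
    return [t for t in trajectories if sum(t["damage"]) <= max_damage]


demos = [
    {"actions": [0, 1], "damage": [0.0, 0.0]},  # no damage recorded: kept
    {"actions": [1, 1], "damage": [0.0, 3.5]},  # damage recorded: dropped
]
safe_demos = filter_safe(demos)
```

A nonzero `max_damage` would relax the filter to tolerate minor damage, which trades dataset size against strictness.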

3 Reinforcement learning (RL)

DamageSim lets us turn physically grounded damage signals into a shaping reward that penalizes harmful interactions. Combined with the task reward, this yields a learning signal that encourages agents to complete the task while minimizing damage. We evaluate this idea in three settings: (i) pure RL with PPO on the Place Plate task, with and without the damage reward; (ii) behavior cloning on the Move Glass of Water task followed by PPO fine-tuning with the damage reward; and (iii) fine-tuning a Shelve Item policy, initially trained on the full IL dataset, with an additional damage reward using DSRL. The video below shows the training process for (iii), illustrating that over time the agent learns to reduce damage while maintaining high task performance.
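A minimal sketch of such a shaping term follows, assuming a simple linear penalty. The function `shaped_reward` and the weight `damage_weight` are our own illustrative names, and the linear form is an assumption rather than the exact reward used in the experiments.

```python
def shaped_reward(task_reward: float, damage: float, damage_weight: float = 1.0) -> float:
    """Combine the task reward with a linear penalty on the damage incurred this step."""
    if damage < 0.0:
        raise ValueError("damage must be non-negative")
    return task_reward - damage_weight * damage


# Same task reward, but the second step incurs damage and is penalized for it.
gentle = shaped_reward(task_reward=1.0, damage=0.0, damage_weight=2.0)
rough = shaped_reward(task_reward=1.0, damage=0.25, damage_weight=2.0)
```

Setting `damage_weight` to zero recovers the task-only reward, which corresponds to the "without the damage reward" baseline in this sketch.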

4 VLA evaluations

We evaluate the safety of modern manipulation policies by benchmarking GR00T, a state-of-the-art Vision-Language-Action model from NVIDIA. Even when GR00T achieves high task success, it frequently causes damage, resulting in much lower safe success rates and degraded environment health. This underscores the limits of evaluating VLAs solely by task completion and highlights the need for benchmarks and learning signals that explicitly account for harmful interactions.

Attach Camera

Open Microwave Door

Pick up Scrubber

Ignite Wood

Open Single Door

Turn on Microwave

Counter to Microwave

Turn on Stove
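The gap between task success and safe success can be illustrated with a small metric computation. The `Episode` record and the definition used here (an episode counts as a safe success only if it succeeds with zero total damage) are assumptions made for this sketch; the benchmark may define its metrics differently.

```python
from typing import Dict, List, NamedTuple


class Episode(NamedTuple):
    """Hypothetical per-episode evaluation record."""

    success: bool        # did the policy complete the task?
    total_damage: float  # cumulative damage over the episode
    final_health: float  # environment health at the end, on an assumed 0-100 scale


def summarize(episodes: List[Episode]) -> Dict[str, float]:
    """Compute task success rate, safe success rate, and mean final health."""
    if not episodes:
        raise ValueError("need at least one episode")
    n = len(episodes)
    return {
        "success_rate": sum(e.success for e in episodes) / n,
        # Assumed definition: task completed AND no damage incurred.
        "safe_success_rate": sum(e.success and e.total_damage == 0.0 for e in episodes) / n,
        "mean_final_health": sum(e.final_health for e in episodes) / n,
    }


rollouts = [
    Episode(True, 0.0, 100.0),
    Episode(True, 12.0, 88.0),
    Episode(False, 0.0, 100.0),
    Episode(True, 5.0, 95.0),
]
report = summarize(rollouts)
```

In this toy data, three of four episodes succeed but only one does so without damage, so the safe success rate is a third of the task success rate.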

5 Sim2Real transfer

Our ultimate goal is safe real-world robot behavior, so we test whether policies trained with OopsieVerse transfer safely to the real world. Compared to a baseline IL policy trained on all data, a damage-aware IL policy trained on health-filtered episodes behaves more cautiously in the Pour Water and Shelve Cereal Box tasks. It moves farther away from the laptop when pouring water and learns to make space for the cereal box by pushing non-fragile objects, such as the cracker box, instead of fragile objects like glass bottles. Our experiments suggest that explicit damage signals obtained via OopsieVerse can translate into improved safety on hardware.

Pour Water (Safe + Unsafe Data)

Pour Water (Safety-filtered Data)

Shelve Cereal Box (Safe + Unsafe Data)

Shelve Cereal Box (Safety-filtered Data)
