
# RoboMME

[RoboMME](https://robomme.github.io) is a memory-augmented manipulation benchmark built on ManiSkill (SAPIEN). It evaluates a robot's ability to retain and use information across an episode — counting, object permanence, reference, and imitation.

- **16 tasks** across 4 memory-skill suites
- **1,600 training demos** (100 per task), plus 50 validation and 50 test episodes per task
- **Dataset**: [`lerobot/robomme`](https://huggingface.co/datasets/lerobot/robomme) — LeRobot v3.0, 768K frames at 10 fps
- **Simulator**: ManiSkill / SAPIEN, Panda arm, Linux only

![RoboMME benchmark tasks overview](https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2603.04639/gradient.png)

## Tasks

| Suite                             | Tasks                                                         |
| --------------------------------- | ------------------------------------------------------------- |
| **Counting** (temporal memory)    | BinFill, PickXtimes, SwingXtimes, StopCube                    |
| **Permanence** (spatial memory)   | VideoUnmask, VideoUnmaskSwap, ButtonUnmask, ButtonUnmaskSwap  |
| **Reference** (object memory)     | PickHighlight, VideoRepick, VideoPlaceButton, VideoPlaceOrder |
| **Imitation** (procedural memory) | MoveCube, InsertPeg, PatternLock, RouteStick                  |

## Installation

> RoboMME requires **Linux** (ManiSkill/SAPIEN uses Vulkan rendering). Docker is recommended to isolate dependency conflicts.

### Native (Linux)

```bash
# --override is a uv option; plain pip has no equivalent flag
uv pip install --override <(printf 'gymnasium==0.29.1\nnumpy==1.26.4\n') \
  -e '.[smolvla,av-dep]' \
  'robomme @ git+https://github.com/RoboMME/robomme_benchmark.git@main'
```

> **Dependency note**: `mani-skill` (pulled by `robomme`) pins `gymnasium==0.29.1` and `numpy<2.0.0`, which conflict with lerobot's base `numpy>=2.0.0`. That's why `robomme` is not a pyproject extra — use the override install above, or the Docker approach below to avoid conflicts entirely.
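
After a native install it can be useful to confirm that the overrides actually took effect. The following is a minimal sketch using only the standard library; `REQUIRED_PINS`, `installed_versions`, and `pin_mismatches` are illustrative names, and the pinned versions are copied from the install command above.

```python
from importlib import metadata

# Pins taken from the override install above.
REQUIRED_PINS = {"gymnasium": "0.29.1", "numpy": "1.26.4"}


def installed_versions(packages):
    """Return {package: version or None} for the current environment."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def pin_mismatches(installed, pins=REQUIRED_PINS):
    """List human-readable mismatches between installed versions and pins."""
    problems = []
    for name, wanted in pins.items():
        have = installed.get(name)
        if have != wanted:
            problems.append(f"{name}: expected {wanted}, found {have or 'not installed'}")
    return problems


# Prints nothing when the environment matches the pins.
for line in pin_mismatches(installed_versions(REQUIRED_PINS)):
    print(line)
```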

### Docker (recommended)

```bash
# Build base image first (from repo root)
docker build -f docker/Dockerfile.eval-base -t lerobot-eval-base .

# Build RoboMME eval image (applies gymnasium + numpy pin overrides)
docker build -f docker/Dockerfile.benchmark.robomme -t lerobot-robomme .
```

`docker/Dockerfile.benchmark.robomme` overrides the installed packages with `gymnasium==0.29.1` and `numpy==1.26.4` after lerobot is installed. Both pinned versions work at runtime for the gymnasium and numpy APIs that lerobot actually calls.

## Running Evaluation

### Default (single task, single episode)

```bash
lerobot-eval \
    --policy.path=<your_policy_repo> \
    --env.type=robomme \
    --env.task=PickXtimes \
    --env.dataset_split=test \
    --env.task_ids=[0] \
    --eval.batch_size=1 \
    --eval.n_episodes=1
```

### Multi-task evaluation

Evaluate several tasks in one run by passing a comma-separated list of task names to `--env.task`. Use `--env.task_ids` to select which episode indices are evaluated for each task. For the test split, 50 episodes per task is recommended.

```bash
lerobot-eval \
    --policy.path=<your_policy_repo> \
    --env.type=robomme \
    --env.task=PickXtimes,BinFill,StopCube,MoveCube,InsertPeg \
    --env.dataset_split=test \
    --env.task_ids=[0,1,2,3,4,5,6,7,8,9] \
    --eval.batch_size=1 \
    --eval.n_episodes=50
```
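
When sweeping over many task lists, it may be easier to assemble the command from Python than to edit the shell line by hand. The sketch below only builds the argument list from the flags shown above; `build_eval_command` is a hypothetical helper, not something lerobot provides.

```python
import shlex


def build_eval_command(policy_path, tasks, task_ids, n_episodes, split="test", batch_size=1):
    """Assemble the lerobot-eval argument list used in the examples above."""
    ids = ",".join(str(i) for i in task_ids)
    return [
        "lerobot-eval",
        f"--policy.path={policy_path}",
        "--env.type=robomme",
        f"--env.task={','.join(tasks)}",
        f"--env.dataset_split={split}",
        f"--env.task_ids=[{ids}]",
        f"--eval.batch_size={batch_size}",
        f"--eval.n_episodes={n_episodes}",
    ]


cmd = build_eval_command("<your_policy_repo>", ["PickXtimes", "BinFill"], range(10), n_episodes=50)
print(shlex.join(cmd))  # shell-quoted form of the command
# To launch it: subprocess.run(cmd, check=True)
```

Passing the list to `subprocess.run` avoids shell quoting issues with the square brackets in `--env.task_ids`.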

### Key CLI options for `env.type=robomme`

| Option               | Default       | Description                                        |
| -------------------- | ------------- | -------------------------------------------------- |
| `env.task`           | `PickXtimes`  | Any of the 16 task names above (comma-separated)   |
| `env.dataset_split`  | `test`        | `train`, `val`, or `test`                          |
| `env.action_space`   | `joint_angle` | `joint_angle` (8-D) or `ee_pose` (7-D)             |
| `env.episode_length` | `300`         | Max steps per episode                              |
| `env.task_ids`       | `null`        | List of episode indices to evaluate (null = `[0]`) |
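
To make the defaults concrete, here is a rough sketch of how the options in the table combine. `RoboMMEOptions` is a hypothetical stand-in written for this page, not the real config class, and it assumes that every listed task is paired with every listed episode index.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RoboMMEOptions:
    """Hypothetical mirror of the CLI defaults in the table above."""

    task: str = "PickXtimes"
    dataset_split: str = "test"
    action_space: str = "joint_angle"
    episode_length: int = 300
    task_ids: Optional[list] = None

    def resolved_task_ids(self):
        # null / None falls back to the single episode index 0.
        return [0] if self.task_ids is None else list(self.task_ids)

    def eval_plan(self):
        """Return every (task name, episode index) pair this config selects."""
        tasks = [t.strip() for t in self.task.split(",") if t.strip()]
        return [(t, i) for t in tasks for i in self.resolved_task_ids()]


print(RoboMMEOptions().eval_plan())
print(RoboMMEOptions(task="BinFill,StopCube", task_ids=[0, 1]).eval_plan())
```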

## Dataset

The dataset [`lerobot/robomme`](https://huggingface.co/datasets/lerobot/robomme) is in **LeRobot v3.0 format** and can be loaded directly:

```python
from lerobot.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("lerobot/robomme")
```

### Dataset features

| Feature            | Shape         | Description                     |
| ------------------ | ------------- | ------------------------------- |
| `image`            | (256, 256, 3) | Front camera RGB                |
| `wrist_image`      | (256, 256, 3) | Wrist camera RGB                |
| `actions`          | (8,)          | Joint angles + gripper          |
| `state`            | (8,)          | Joint positions + gripper state |
| `simple_subgoal`   | str           | High-level language annotation  |
| `grounded_subgoal` | str           | Grounded language annotation    |
| `episode_index`    | int           | Episode ID                      |
| `frame_index`      | int           | Frame within episode            |

### Feature key alignment (training)

The env wrapper exposes `pixels/image` and `pixels/wrist_image` as observation keys. The `features_map` in `RoboMMEEnv` maps these to `observation.images.image` and `observation.images.wrist_image` for the policy. State is exposed as `agent_pos` and maps to `observation.state`.

The dataset's `image` and `wrist_image` columns already align with the policy input keys, so no renaming is needed when fine-tuning.
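
The key mapping can be illustrated with a small standalone sketch. It assumes the slash in `pixels/image` denotes nesting, so the observation is a dictionary with a `pixels` sub-dictionary next to `agent_pos`. `FEATURES_MAP` and `to_policy_keys` here are illustrative re-creations, not the actual `RoboMMEEnv` code.

```python
# Mapping described above; the slash is assumed to denote nesting under "pixels".
FEATURES_MAP = {
    "pixels/image": "observation.images.image",
    "pixels/wrist_image": "observation.images.wrist_image",
    "agent_pos": "observation.state",
}


def to_policy_keys(obs):
    """Flatten a nested env observation and rename its keys for the policy."""
    renamed = {}
    for env_key, policy_key in FEATURES_MAP.items():
        value = obs
        for part in env_key.split("/"):
            value = value[part]  # raises KeyError if the env did not provide it
        renamed[policy_key] = value
    return renamed


# Placeholder strings stand in for the real image and state arrays.
example_obs = {
    "pixels": {"image": "front-rgb", "wrist_image": "wrist-rgb"},
    "agent_pos": "joint-state",
}
print(to_policy_keys(example_obs))
```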

## Action Spaces

| Type          | Dim | Description                                               |
| ------------- | --- | --------------------------------------------------------- |
| `joint_angle` | 8   | 7 joint angles + 1 gripper (−1 closed, +1 open, absolute) |
| `ee_pose`     | 7   | xyz + roll/pitch/yaw + gripper                            |

Set via `--env.action_space=joint_angle` (default) or `--env.action_space=ee_pose`.
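
A quick length check catches the most common mistake, which is feeding an 8-D joint action to an `ee_pose` environment or the reverse. The sketch below is illustrative only: `split_action` is a hypothetical helper, and it assumes the gripper command is the last element, following the order in which the table lists the components.

```python
# Dimensions from the table above.
ACTION_DIMS = {"joint_angle": 8, "ee_pose": 7}


def split_action(action, action_space="joint_angle"):
    """Check the action length, then split it into (arm values, gripper value).

    Assumes the gripper command is the last element of the action.
    """
    if action_space not in ACTION_DIMS:
        raise ValueError(f"Unknown action space: {action_space!r}")
    expected = ACTION_DIMS[action_space]
    if len(action) != expected:
        raise ValueError(f"{action_space} expects {expected} values, got {len(action)}")
    return list(action[:-1]), action[-1]


arm, gripper = split_action([0.0] * 7 + [1.0])  # 7 joint angles, gripper open (+1)
print(len(arm), gripper)
```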

## Platform Notes

- **Linux only**: ManiSkill requires SAPIEN/Vulkan. macOS and Windows are not supported.
- **GPU recommended**: Rendering is CPU-capable but slow; CUDA + Vulkan gives full speed.
- **gymnasium / numpy conflict**: See the installation note above. The Docker image handles this automatically.
- **ManiSkill fork**: `robomme` depends on a specific ManiSkill fork (`YinpeiDai/ManiSkill`), pulled in automatically via the `robomme` package.
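
A script that wraps the evaluation can fail early on unsupported systems instead of erroring deep inside the simulator import. This is a minimal sketch using the standard library; `robomme_platform_issue` is a hypothetical name.

```python
import platform


def robomme_platform_issue(system_name):
    """Return a message if the OS cannot run RoboMME, otherwise None."""
    if system_name != "Linux":
        return f"RoboMME needs Linux (SAPIEN/Vulkan); detected {system_name}."
    return None


# platform.system() returns names such as "Linux", "Darwin", or "Windows".
issue = robomme_platform_issue(platform.system())
if issue:
    print(issue)
```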