docs/source/robocerebra.mdx

# RoboCerebra

[RoboCerebra](https://robocerebra-project.github.io/) is a long-horizon manipulation benchmark that evaluates **high-level reasoning, planning, and memory** in VLAs. Episodes chain multiple sub-goals with language-grounded intermediate instructions, built on top of LIBERO's simulator stack (MuJoCo + robosuite, Franka Panda 7-DOF).

- Paper: [RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation](https://arxiv.org/abs/2506.06677)
- Project website: [robocerebra-project.github.io](https://robocerebra-project.github.io/)
- Dataset: [`lerobot/robocerebra_unified`](https://huggingface.co/datasets/lerobot/robocerebra_unified) — LeRobot v3.0, 6,660 episodes / 571,116 frames at 20 fps, 1,728 language-grounded sub-tasks.
- Pretrained policy: [`lerobot/smolvla_robocerebra`](https://huggingface.co/lerobot/smolvla_robocerebra)

## Available tasks

RoboCerebra reuses LIBERO's simulator, so evaluation runs against the LIBERO `libero_10` long-horizon suite:

| Suite     | CLI name    | Tasks | Description                                                   |
| --------- | ----------- | ----- | ------------------------------------------------------------- |
| LIBERO-10 | `libero_10` | 10    | Long-horizon kitchen/living room tasks chaining 3–6 sub-goals |

Each RoboCerebra episode in the dataset is segmented into multiple sub-tasks with natural-language instructions, which the unified dataset exposes as independent supervision signals.

## Installation

RoboCerebra piggybacks on LIBERO, so the `libero` extra is all you need:

```bash
pip install -e ".[libero]"
```

<Tip>
RoboCerebra requires Linux (MuJoCo / robosuite). Set the rendering backend before training or evaluation:

```bash
export MUJOCO_GL=egl  # for headless servers (HPC, cloud)
```

</Tip>

## Evaluation

RoboCerebra eval runs against LIBERO's `libero_10` suite with RoboCerebra's camera naming (`image` + `wrist_image`) and an extra empty-camera slot so a three-view-trained policy receives the expected input layout:

```bash
lerobot-eval \
  --policy.path=lerobot/smolvla_robocerebra \
  --env.type=libero \
  --env.task=libero_10 \
  --env.fps=20 \
  --env.obs_type=pixels_agent_pos \
  --env.observation_height=256 \
  --env.observation_width=256 \
  '--env.camera_name_mapping={"agentview_image": "image", "robot0_eye_in_hand_image": "wrist_image"}' \
  --eval.batch_size=1 \
  --eval.n_episodes=10 \
  --eval.use_async_envs=false \
  --policy.device=cuda \
  '--rename_map={"observation.images.image": "observation.images.camera1", "observation.images.wrist_image": "observation.images.camera2"}' \
  --policy.empty_cameras=1
```

### Recommended evaluation episodes

**10 episodes per task** across the `libero_10` suite (100 total) for reproducible benchmarking. Matches the protocol used in the RoboCerebra paper.

## Policy inputs and outputs

**Observations:**

- `observation.state` — 8-dim proprioceptive state (7 joint positions + gripper)
- `observation.images.image` — third-person view, 256×256 HWC uint8
- `observation.images.wrist_image` — wrist-mounted camera view, 256×256 HWC uint8

**Actions:**

- Continuous control in `Box(-1, 1, shape=(7,))` — end-effector delta (6D) + gripper (1D)

## Training

The unified dataset at [`lerobot/robocerebra_unified`](https://huggingface.co/datasets/lerobot/robocerebra_unified) exposes two RGB streams and language-grounded sub-task annotations:

| Feature                          | Shape         | Description          |
| -------------------------------- | ------------- | -------------------- |
| `observation.images.image`       | (256, 256, 3) | Third-person view    |
| `observation.images.wrist_image` | (256, 256, 3) | Wrist-mounted camera |
| `observation.state`              | (8,)          | Joint pos + gripper  |
| `action`                         | (7,)          | EEF delta + gripper  |

Fine-tune a SmolVLA base on it:

```bash
lerobot-train \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=lerobot/robocerebra_unified \
  --env.type=libero \
  --env.task=libero_10 \
  --output_dir=outputs/smolvla_robocerebra
```

## Reproducing published results

The released checkpoint [`lerobot/smolvla_robocerebra`](https://huggingface.co/lerobot/smolvla_robocerebra) was trained on `lerobot/robocerebra_unified` and evaluated with the command in the [Evaluation](#evaluation) section. CI runs the same command with `--eval.n_episodes=1` as a smoke test on every PR touching the benchmark.
-												feat(envs): add RoboCerebra long-horizon manipulation benchmark (#3314)

* feat(ci): add RoboCerebra benchmark eval job

- Docker image with robosuite/libero deps for RoboCerebra eval
- CI workflow: 1-episode eval with pepijn223/smolvla_robocerebra
- Reuses libero env with rename_map + empty_cameras=3

* docs(robocerebra): add benchmark page and toctree entry

Add a dedicated docs page for RoboCerebra that points at the canonical
dataset lerobot/robocerebra_unified and shows how to run eval + fine-tune
against it. Wire it into the Benchmarks section of the toctree so
doc-builder picks it up.

* ci: point benchmark eval checkpoints at the lerobot/ org mirrors

pepijn223/smolvla_* → lerobot/smolvla_* across every benchmark job in
this branch (libero, metaworld, and the per-branch benchmark). The
checkpoints were mirrored into the lerobot/ org and that's the canonical
location going forward.

* fix(robocerebra): drop alias extra + simplify docker image

Two problems rolled up:

1. `uv sync --locked --extra test` was failing because pyproject.toml added
   a `robocerebra = ["lerobot[libero]"]` alias extra but uv.lock wasn't
   regenerated. Drop the alias — nothing in CI actually needs the extra
   name (the Dockerfile just installs what it needs directly), so this
   restores pyproject.toml and uv.lock to byte-exact origin/main.

2. Rebase docker/Dockerfile.benchmark.robocerebra on
   huggingface/lerobot-gpu:latest (same pattern as libero/metaworld/…).
   The nightly image already ships lerobot[all] which includes [libero],
   so the RoboCerebra image is essentially identical to the LIBERO one:
   fetch libero-assets, write ~/.libero/config.yaml, overlay source.
   92 → 43 lines.

Also repoint the CI workflow comment that referenced the removed extra.

* ci: use dedicated lerobot/smolvla_robocerebra checkpoint for smoke eval

Replace the generic pepijn223/smolvla_libero placeholder with the
purpose-trained lerobot/smolvla_robocerebra model in the RoboCerebra
CI smoke test.

* fix(ci): align RoboCerebra eval with training pipeline

Fixes 5 mismatches that caused 0% success rate:
- env.type: robocerebra (unregistered) → libero
- resolution: 360x360 (default) → 256x256 (matches dataset)
- camera_name_mapping: map eye_in_hand → wrist_image (not image2)
- empty_cameras: 3 → 1 (matches training)
- add HF_USER_TOKEN guard on eval step

* fix(ci): set env.fps=20 and explicit obs_type for RoboCerebra eval

Match the dataset's 20 FPS (LiberoEnv defaults to 30) and make
obs_type=pixels_agent_pos explicit for safety against future changes.

* docs(robocerebra): align page with adding_benchmarks template

Rework docs/source/robocerebra.mdx to follow the standard benchmark
doc structure: intro + links + available tasks + installation + eval
+ recommended episodes + policy I/O + training + reproducing results.

- Point everything at lerobot/smolvla_robocerebra (the released
  checkpoint), not the personal pepijn223 mirror.
- Add the --env.fps=20 and --env.obs_type=pixels_agent_pos flags
  that CI actually uses, so copy-paste eval reproduces CI.
- Split the "Training" block out of the recipe section into its own
  section with the feature table.
- Add an explicit "Reproducing published results" section pointing
  at the CI smoke eval.

* fix: integrate PR #3314 review feedback

- ci(robocerebra): drop redundant hf auth login step (auth is
  already performed inside the eval step's container).
- ci(robocerebra): add Docker Hub login before the image build
  to pick up the authenticated rate limit.
- docs(robocerebra): align eval snippet with the CI command
  (observation size, camera_name_mapping, use_async_envs, device,
  empty_cameras=1).

* fix(envs): preserve AsyncVectorEnv metadata/unwrapped in lazy eval envs

Port of #3416 onto this branch.

* ci: gate Docker Hub login on secret availability

* Update .github/workflows/benchmark_tests.yml

Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>

* Update .github/workflows/benchmark_tests.yml

Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
											
										
										
											2026-04-20 19:12:15 +02:00
+								# RoboCerebra
 								[RoboCerebra](https://robocerebra-project.github.io/) is a long-horizon manipulation benchmark that evaluates **high-level reasoning, planning, and memory** in VLAs. Episodes chain multiple sub-goals with language-grounded intermediate instructions, built on top of LIBERO's simulator stack (MuJoCo + robosuite, Franka Panda 7-DOF).
 								- Paper: [RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation](https://arxiv.org/abs/2506.06677)
 								- Project website: [robocerebra-project.github.io](https://robocerebra-project.github.io/)
 								- Dataset: [`lerobot/robocerebra_unified`](https://huggingface.co/datasets/lerobot/robocerebra_unified) — LeRobot v3.0, 6,660 episodes / 571,116 frames at 20 fps, 1,728 language-grounded sub-tasks.
 								- Pretrained policy: [`lerobot/smolvla_robocerebra`](https://huggingface.co/lerobot/smolvla_robocerebra)
 								## Available tasks
 								RoboCerebra reuses LIBERO's simulator, so evaluation runs against the LIBERO `libero_10` long-horizon suite:
 								| Suite     | CLI name    | Tasks | Description                                                   |
 								| --------- | ----------- | ----- | ------------------------------------------------------------- |
 								| LIBERO-10 | `libero_10` | 10    | Long-horizon kitchen/living room tasks chaining 3–6 sub-goals |
 								Each RoboCerebra episode in the dataset is segmented into multiple sub-tasks with natural-language instructions, which the unified dataset exposes as independent supervision signals.
 								## Installation
 								RoboCerebra piggybacks on LIBERO, so the `libero` extra is all you need:
 								```bash
 								pip install -e ".[libero]"
 								```
 								<Tip>
 								RoboCerebra requires Linux (MuJoCo / robosuite). Set the rendering backend before training or evaluation:
 								```bash
 								export MUJOCO_GL=egl  # for headless servers (HPC, cloud)
 								```
 								</Tip>
 								## Evaluation
 								RoboCerebra eval runs against LIBERO's `libero_10` suite with RoboCerebra's camera naming (`image` + `wrist_image`) and an extra empty-camera slot so a three-view-trained policy receives the expected input layout:
 								```bash
 								lerobot-eval \
 								  --policy.path=lerobot/smolvla_robocerebra \
 								  --env.type=libero \
 								  --env.task=libero_10 \
 								  --env.fps=20 \
 								  --env.obs_type=pixels_agent_pos \
 								  --env.observation_height=256 \
 								  --env.observation_width=256 \
 								  '--env.camera_name_mapping={"agentview_image": "image", "robot0_eye_in_hand_image": "wrist_image"}' \
 								  --eval.batch_size=1 \
 								  --eval.n_episodes=10 \
 								  --eval.use_async_envs=false \
 								  --policy.device=cuda \
 								  '--rename_map={"observation.images.image": "observation.images.camera1", "observation.images.wrist_image": "observation.images.camera2"}' \
 								  --policy.empty_cameras=1
 								```
 								### Recommended evaluation episodes
 								**10 episodes per task** across the `libero_10` suite (100 total) for reproducible benchmarking. Matches the protocol used in the RoboCerebra paper.
 								## Policy inputs and outputs
 								**Observations:**
 								- `observation.state` — 8-dim proprioceptive state (7 joint positions + gripper)
 								- `observation.images.image` — third-person view, 256×256 HWC uint8
 								- `observation.images.wrist_image` — wrist-mounted camera view, 256×256 HWC uint8
 								**Actions:**
 								- Continuous control in `Box(-1, 1, shape=(7,))` — end-effector delta (6D) + gripper (1D)
 								## Training
 								The unified dataset at [`lerobot/robocerebra_unified`](https://huggingface.co/datasets/lerobot/robocerebra_unified) exposes two RGB streams and language-grounded sub-task annotations:
 								| Feature                          | Shape         | Description          |
 								| -------------------------------- | ------------- | -------------------- |
 								| `observation.images.image`       | (256, 256, 3) | Third-person view    |
 								| `observation.images.wrist_image` | (256, 256, 3) | Wrist-mounted camera |
 								| `observation.state`              | (8,)          | Joint pos + gripper  |
 								| `action`                         | (7,)          | EEF delta + gripper  |
 								Fine-tune a SmolVLA base on it:
 								```bash
 								lerobot-train \
 								  --policy.path=lerobot/smolvla_base \
 								  --dataset.repo_id=lerobot/robocerebra_unified \
 								  --env.type=libero \
 								  --env.task=libero_10 \
 								  --output_dir=outputs/smolvla_robocerebra
 								```
 								## Reproducing published results
 								The released checkpoint [`lerobot/smolvla_robocerebra`](https://huggingface.co/lerobot/smolvla_robocerebra) was trained on `lerobot/robocerebra_unified` and evaluated with the command in the [Evaluation](#evaluation) section. CI runs the same command with `--eval.n_episodes=1` as a smoke test on every PR touching the benchmark.