# VLABench

[VLABench](https://github.com/OpenMOSS/VLABench) is a large-scale benchmark for **language-conditioned robotic manipulation with long-horizon reasoning**. The upstream suite covers 100 task categories across 2,000+ objects and evaluates six dimensions of robot intelligence: mesh & texture understanding, spatial reasoning, world-knowledge transfer, semantic instruction comprehension, physical-law understanding, and long-horizon planning. It is built on MuJoCo / dm_control and uses a Franka Panda 7-DOF arm. LeRobot exposes **43 of these tasks** through `--env.task` (21 primitives + 22 composites; see [Available tasks](#available-tasks) below).

- Paper: [VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning](https://arxiv.org/abs/2412.18194)
- GitHub: [OpenMOSS/VLABench](https://github.com/OpenMOSS/VLABench)
- Project website: [vlabench.github.io](https://vlabench.github.io)
- Pretrained policy: [`lerobot/smolvla_vlabench`](https://huggingface.co/lerobot/smolvla_vlabench)

<img
  src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/vlabench.png"
  alt="VLABench benchmark overview"
  width="85%"
/>

## Available tasks

LeRobot exposes two VLABench task suites, **43 tasks** in total, through `--env.task`:

| Suite     | CLI name    | Tasks | Description                                                      |
| --------- | ----------- | ----- | ---------------------------------------------------------------- |
| Primitive | `primitive` | 21    | Single / few-skill combinations (select, insert, physics QA)     |
| Composite | `composite` | 22    | Multi-step reasoning and long-horizon planning (cook, rearrange) |

**Primitive tasks:** `select_fruit`, `select_toy`, `select_chemistry_tube`, `add_condiment`, `select_book`, `select_painting`, `select_drink`, `insert_flower`, `select_billiards`, `select_ingredient`, `select_mahjong`, `select_poker`, and physical-reasoning tasks (`density_qa`, `friction_qa`, `magnetism_qa`, `reflection_qa`, `simple_cuestick_usage`, `simple_seesaw_usage`, `sound_speed_qa`, `thermal_expansion_qa`, `weight_qa`).

**Composite tasks:** `cluster_billiards`, `cluster_book`, `cluster_drink`, `cluster_toy`, `cook_dishes`, `cool_drink`, `find_unseen_object`, `get_coffee`, `hammer_nail`, `heat_food`, `make_juice`, `play_mahjong`, `play_math_game`, `play_poker`, `play_snooker`, `rearrange_book`, `rearrange_chemistry_tube`, `set_dining_table`, `set_study_table`, `store_food`, `take_chemistry_experiment`, `use_seesaw_complex`.

`--env.task` accepts three forms:

- a single task name (`select_fruit`)
- a comma-separated list (`select_fruit,heat_food`)
- a suite shortcut (`primitive`, `composite`, or `primitive,composite`)
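
The three forms can be pictured as one expansion step from a spec string to a flat list of task names. The sketch below is illustrative only: `expand_task_spec` and the `SUITES` table are hypothetical names, not LeRobot's actual parser, and accepting a mix of suite shortcuts and task names in one list is an assumption.

```python
# Illustrative sketch only: NOT LeRobot's parser. Task names are copied
# from the lists above; the function and table names are hypothetical.
PRIMITIVE = [
    "select_fruit", "select_toy", "select_chemistry_tube", "add_condiment",
    "select_book", "select_painting", "select_drink", "insert_flower",
    "select_billiards", "select_ingredient", "select_mahjong", "select_poker",
    "density_qa", "friction_qa", "magnetism_qa", "reflection_qa",
    "simple_cuestick_usage", "simple_seesaw_usage", "sound_speed_qa",
    "thermal_expansion_qa", "weight_qa",
]
COMPOSITE = [
    "cluster_billiards", "cluster_book", "cluster_drink", "cluster_toy",
    "cook_dishes", "cool_drink", "find_unseen_object", "get_coffee",
    "hammer_nail", "heat_food", "make_juice", "play_mahjong",
    "play_math_game", "play_poker", "play_snooker", "rearrange_book",
    "rearrange_chemistry_tube", "set_dining_table", "set_study_table",
    "store_food", "take_chemistry_experiment", "use_seesaw_complex",
]
SUITES = {"primitive": PRIMITIVE, "composite": COMPOSITE}


def expand_task_spec(spec: str) -> list[str]:
    """Expand a task name, comma-separated list, or suite shortcut."""
    tasks = []
    for item in spec.split(","):
        name = item.strip()
        if name in SUITES:
            tasks.extend(SUITES[name])  # suite shortcut -> all its tasks
        elif name in PRIMITIVE or name in COMPOSITE:
            tasks.append(name)  # single known task
        else:
            raise ValueError(f"unknown VLABench task or suite: {name!r}")
    return tasks


print(expand_task_spec("select_fruit,heat_food"))
print(len(expand_task_spec("primitive,composite")))  # 43
```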

## Installation

VLABench is **not on PyPI** — its only distribution is the [OpenMOSS/VLABench](https://github.com/OpenMOSS/VLABench) GitHub repo — so LeRobot does not expose a `vlabench` extra. Install it manually as an editable clone, alongside the MuJoCo / dm_control pins VLABench needs, then fetch the mesh assets:

```bash
# After following the standard LeRobot installation instructions.

git clone https://github.com/OpenMOSS/VLABench.git ~/VLABench
git clone https://github.com/motion-planning/rrt-algorithms.git ~/rrt-algorithms
pip install -e ~/VLABench -e ~/rrt-algorithms
pip install "mujoco==3.2.2" "dm-control==1.0.22" \
            open3d colorlog scikit-learn openai gdown

python ~/VLABench/scripts/download_assets.py
```

<Tip>
VLABench requires Linux (`sys_platform == 'linux'`) and Python 3.10+. Set the MuJoCo rendering backend before running:

```bash
export MUJOCO_GL=egl  # for headless servers (HPC, cloud)
```

</Tip>
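
A quick preflight check can catch an unsupported platform or interpreter before a long evaluation starts. This is a minimal sketch that assumes only the requirements stated in the tip above; `preflight` is a hypothetical helper and is not part of LeRobot or VLABench.

```python
import os
import sys


def preflight(platform=sys.platform, version=sys.version_info[:2], env=None):
    """Return a list of human-readable problems; an empty list means OK."""
    env = os.environ if env is None else env
    problems = []
    if platform != "linux":
        problems.append(f"VLABench requires Linux, found {platform!r}")
    if tuple(version) < (3, 10):
        problems.append(f"Python 3.10+ required, found {version[0]}.{version[1]}")
    if "MUJOCO_GL" not in env:
        # Only a hint: EGL is the backend suggested above for headless servers.
        problems.append("MUJOCO_GL is unset; export MUJOCO_GL=egl on headless servers")
    return problems


for problem in preflight():
    print("warning:", problem)
```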

## Evaluation

All eval snippets below mirror the command CI runs (see `.github/workflows/benchmark_tests.yml`). The `--rename_map` argument maps VLABench's `image` / `second_image` / `wrist_image` camera keys onto the three-camera (`camera1` / `camera2` / `camera3`) input layout the released `smolvla_vlabench` policy was trained on.
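
To make the effect of the rename map concrete, the sketch below applies the same mapping to a toy observation dictionary and rebuilds the CLI argument from it. It assumes the flag takes the JSON object shown in the commands in this section; `rename_keys` is a hypothetical helper, not LeRobot's implementation, and the observation values are placeholders.

```python
import json

# Same mapping as the --rename_map argument used in the eval commands.
RENAME_MAP = {
    "observation.images.image": "observation.images.camera1",
    "observation.images.second_image": "observation.images.camera2",
    "observation.images.wrist_image": "observation.images.camera3",
}


def rename_keys(observation: dict, rename_map: dict) -> dict:
    """Rename mapped keys and pass every other key through unchanged."""
    return {rename_map.get(key, key): value for key, value in observation.items()}


# Placeholder values stand in for the real state vector and camera frames.
observation = {
    "observation.state": "state",
    "observation.images.image": "front frame",
    "observation.images.second_image": "second frame",
    "observation.images.wrist_image": "wrist frame",
}
renamed = rename_keys(observation, RENAME_MAP)
print(sorted(renamed))

# Building the argument from the dict avoids hand-editing the long JSON string.
cli_argument = "--rename_map=" + json.dumps(RENAME_MAP)
print(cli_argument)
```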

### Single-task evaluation (recommended for quick iteration)

```bash
lerobot-eval \
  --policy.path=lerobot/smolvla_vlabench \
  --env.type=vlabench \
  --env.task=select_fruit \
  --eval.batch_size=1 \
  --eval.n_episodes=10 \
  --eval.use_async_envs=false \
  --policy.device=cuda \
  '--rename_map={"observation.images.image": "observation.images.camera1", "observation.images.second_image": "observation.images.camera2", "observation.images.wrist_image": "observation.images.camera3"}'
```

### Multi-task evaluation

Pass a comma-separated list of tasks:

```bash
lerobot-eval \
  --policy.path=lerobot/smolvla_vlabench \
  --env.type=vlabench \
  --env.task=select_fruit,select_toy,add_condiment,heat_food \
  --eval.batch_size=1 \
  --eval.n_episodes=10 \
  --eval.use_async_envs=false \
  --policy.device=cuda \
  '--rename_map={"observation.images.image": "observation.images.camera1", "observation.images.second_image": "observation.images.camera2", "observation.images.wrist_image": "observation.images.camera3"}'
```

### Suite-wide evaluation

Run an entire suite (all 21 primitives or all 22 composites):

```bash
lerobot-eval \
  --policy.path=lerobot/smolvla_vlabench \
  --env.type=vlabench \
  --env.task=primitive \
  --eval.batch_size=1 \
  --eval.n_episodes=10 \
  --eval.use_async_envs=false \
  --policy.device=cuda \
  --env.max_parallel_tasks=1 \
  '--rename_map={"observation.images.image": "observation.images.camera1", "observation.images.second_image": "observation.images.camera2", "observation.images.wrist_image": "observation.images.camera3"}'
```

Or both suites:

```bash
lerobot-eval \
  --policy.path=lerobot/smolvla_vlabench \
  --env.type=vlabench \
  --env.task=primitive,composite \
  --eval.batch_size=1 \
  --eval.n_episodes=10 \
  --eval.use_async_envs=false \
  --policy.device=cuda \
  --env.max_parallel_tasks=1 \
  '--rename_map={"observation.images.image": "observation.images.camera1", "observation.images.second_image": "observation.images.camera2", "observation.images.wrist_image": "observation.images.camera3"}'
```

### Recommended evaluation episodes

Use **10 episodes per task** for reproducible benchmarking (210 episodes in total for the full primitive suite, 220 for the composite suite). This matches the protocol in the VLABench paper.
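
The per-suite totals follow directly from the suite sizes listed under [Available tasks](#available-tasks). A small worked calculation, with variable names chosen here for illustration:

```python
EPISODES_PER_TASK = 10
SUITE_SIZES = {"primitive": 21, "composite": 22}  # task counts from this page

# Total episodes per suite = number of tasks x episodes per task.
totals = {suite: n_tasks * EPISODES_PER_TASK for suite, n_tasks in SUITE_SIZES.items()}

print(totals)                # {'primitive': 210, 'composite': 220}
print(sum(totals.values()))  # 430 episodes when both suites are run
```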

## Policy inputs and outputs

**Observations:**

- `observation.state` — 7-dim end-effector state (position xyz + Euler xyz + gripper)
- `observation.images.image` — front camera, 480×480 HWC uint8
- `observation.images.second_image` — second camera, 480×480 HWC uint8
- `observation.images.wrist_image` — wrist camera, 480×480 HWC uint8

**Actions:**

- Continuous control in `Box(-1, 1, shape=(7,))` — 3D position + 3D Euler orientation + 1D gripper.
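
The action layout above can be spelled out as a small validation helper. This is a sketch under the stated layout (position, then Euler orientation, then gripper); `split_action` is a hypothetical name and does not exist in LeRobot or VLABench.

```python
ACTION_DIM = 7  # 3D position + 3D Euler orientation + 1D gripper


def split_action(action):
    """Split a flat 7-dim action into named parts, checking the [-1, 1] bounds."""
    values = [float(a) for a in action]
    if len(values) != ACTION_DIM:
        raise ValueError(f"expected {ACTION_DIM} values, got {len(values)}")
    if any(not -1.0 <= v <= 1.0 for v in values):
        raise ValueError("action components must lie in [-1, 1]")
    return {
        "position": values[0:3],  # xyz
        "euler": values[3:6],     # Euler xyz
        "gripper": values[6],     # scalar gripper command
    }


parts = split_action([0.1, -0.2, 0.3, 0.0, 0.5, -0.5, 1.0])
print(parts["position"], parts["euler"], parts["gripper"])
```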

## Training

### Datasets

Pre-collected VLABench datasets in LeRobot format are available on the Hub:

- [`VLABench/vlabench_primitive_ft_lerobot_video`](https://huggingface.co/datasets/VLABench/vlabench_primitive_ft_lerobot_video) — 5,000 episodes, 128 tasks, 480×480 images.
- [`VLABench/vlabench_composite_ft_lerobot_video`](https://huggingface.co/datasets/VLABench/vlabench_composite_ft_lerobot_video) — 5,977 episodes, 167 tasks, 224×224 images.

### Example training command

Fine-tune a SmolVLA base on the primitive suite:

```bash
lerobot-train \
  --policy.type=smolvla \
  --policy.repo_id=${HF_USER}/smolvla_vlabench_primitive \
  --policy.load_vlm_weights=true \
  --policy.push_to_hub=true \
  --dataset.repo_id=VLABench/vlabench_primitive_ft_lerobot_video \
  --env.type=vlabench \
  --env.task=select_fruit \
  --output_dir=./outputs/smolvla_vlabench_primitive \
  --steps=100000 \
  --batch_size=4 \
  --eval_freq=5000 \
  --eval.batch_size=1 \
  --eval.n_episodes=1 \
  --save_freq=10000
```
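
As a rough guide to what this schedule implies, the arithmetic below counts evaluation rounds, checkpoints, and training samples. It assumes that evaluation and checkpointing fire whenever the step count is a multiple of the corresponding frequency and that training runs in a single process; treat the numbers as an estimate rather than a statement of LeRobot's exact behavior.

```python
# Values copied from the training command above.
STEPS = 100_000
BATCH_SIZE = 4
EVAL_FREQ = 5_000
SAVE_FREQ = 10_000

n_evals = STEPS // EVAL_FREQ        # evaluation rounds during training
n_checkpoints = STEPS // SAVE_FREQ  # checkpoints written
samples_seen = STEPS * BATCH_SIZE   # training samples drawn, with repetition

print(n_evals, n_checkpoints, samples_seen)  # 20 10 400000
```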

## Reproducing published results

The released checkpoint [`lerobot/smolvla_vlabench`](https://huggingface.co/lerobot/smolvla_vlabench) was trained on the primitive-suite dataset above and is evaluated with the [Single-task](#single-task-evaluation-recommended-for-quick-iteration) / [Suite-wide](#suite-wide-evaluation) commands. CI runs a 10-primitive-task smoke eval (one episode each) on every PR touching the benchmark.