Compare commits

..

13 Commits

Author SHA1 Message Date
Pepijn
63dedac255 fix(ci): downgrade contents permission to read in claude.yml
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 19:19:31 +02:00
Pepijn
b0286b10cf chore: remove root CLAUDE.md (moved to .github/CLAUDE.md)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 18:04:48 +02:00
Pepijn
7a8b02cd32 refactor(ci): move CLAUDE.md to .github/ to keep repo root clean
CLAUDE.md is CI-only config — moving it to .github/ ensures it is not
visible at the repo root when contributors clone lerobot. Both workflows
now explicitly reference .github/CLAUDE.md in their prompt/system-prompt.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 18:03:06 +02:00
Pepijn
892e9f13b7 docs(claude): remove LOC minimization guideline from CLAUDE.md
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 17:59:55 +02:00
Pepijn
4b8436aefa feat(ci): restrict @claude trigger to repo owners, members, and collaborators
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 17:57:50 +02:00
Pepijn
9d97426cb8 Merge branch 'main' into fix/claude-code-action-precommit 2026-04-08 17:56:32 +02:00
Pepijn
e8f504edaa feat(ci): use claude-opus-4-6 for PR reviews
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 17:56:01 +02:00
Pepijn
db7334a384 docs(claude): add Processor to core abstractions in CLAUDE.md
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 17:54:17 +02:00
Pepijn
5de7aa5a4f refactor(envs): move benchmark dispatch into EnvConfig subclasses (#3272)
* docs(benchmarks): add benchmark integration guide and standardize benchmark docs

Add a comprehensive guide for adding new benchmarks to LeRobot, and
refactor the existing LIBERO and Meta-World docs to follow the new
standardized template.

* refactor(envs): move dispatch logic from factory into EnvConfig subclasses

Replace hardcoded if/elif chains in factory.py with create_envs() and
get_env_processors() methods on EnvConfig. New benchmarks now only need
to register a config subclass — no factory.py edits required.

Net -23 lines: factory.py shrinks from ~200 to ~70 lines of logic.

* docs(benchmarks): clean up adding-benchmarks guide for clarity

Rewrite for simpler language, better structure, and easier navigation.
Move quick-reference table to the top, fold eval explanation into
architecture section, condense the doc template to a bulleted outline.

* fix link

* fix task count

* fix(tests): fix 3 failing dispatch tests

- test_registry_all_types: skip non-EnvConfig stubs (e.g. TestPluginConfig)
- test_processors_delegation: use None instead of abstract PreTrainedConfig
- test_custom_get_env_processors_override: use DataProcessorPipeline for isinstance check (PolicyProcessorPipeline is a subscripted generic)

* fix: enable SmolVLA eval on LIBERO with custom camera mappings

- Thread camera_name_mapping from LiberoEnv config through to gym envs
- Sync features_map with camera_name_mapping in LiberoEnv.__post_init__
- Fix render() to use first available camera instead of hardcoded "image"
- Handle non-dict final_info in rollout by falling back to info["is_success"]
- Add use_peft legacy field to SmolVLAConfig for checkpoint compat
- Add defaults to GR00TN15Config init=False fields for transformers 5.3

Made-with: Cursor

* fix: use direct AutoresetMode import for gymnasium compat

Made-with: Cursor

* fix: handle gymnasium < 1.0 without AutoresetMode

Made-with: Cursor

* refactor: revert policy changes, keep env-only camera mapping fixes

- Revert GR00T N1.5 default_factory/default changes (transformers compat)
- Revert SmolVLA use_peft legacy field
- Apply ruff formatting fixes
- camera_name_mapping stays entirely in env/eval layer (no policy changes)

Made-with: Cursor

* Update docs/source/env_processor.mdx

Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>

* Update docs/source/env_processor.mdx

Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>

* Update docs/source/env_processor.mdx

Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>

* fix(eval): raise RuntimeError for unsupported final_info format (Gymnasium < 1.0)

Made-with: Cursor

* style: fix markdown code fences in env_processor.mdx

Made-with: Cursor

* docs: remove duplicate code blocks in env_processor.mdx

Made-with: Cursor

* style: revert quadruple backticks to triple (prettier compat)

* docs(env_processor): add EnvConfig subclass step and policy_cfg examples

- Add missing '### 2. Update Your EnvConfig Subclass' section with
  get_env_processors() snippet
- Update factory usage example to show policy_cfg parameter and
  keyword-argument style for both SmolVLA and ACT cases

* docs(env_processor): rename step 2 and fix policy_cfg examples

- Rename '### 2. Update the Factory' → '### 2. Update Your EnvConfig Subclass'
- Update factory usage examples to use keyword-argument style with
  policy_cfg parameter for both SmolVLA and ACT cases

---------

Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
2026-04-08 17:48:58 +02:00
Pepijn
fc8d89b128 feat(ci): add CLAUDE.md and improve claude-code-action workflows
- Add CLAUDE.md with lerobot-specific review instructions (core abstractions,
  engineering principles, ML-specific checks, PR checklist)
- Enable use_sticky_comment: true on both workflows (single updating comment per PR)
- Add structured lerobot-specific review prompt to claude-code-review.yml
- Upgrade permissions: contents/pull-requests/issues write for interactive claude.yml
- Add actions: read to claude-code-review.yml for CI log access
- Set FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true to suppress Node.js 20 deprecation warnings

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 17:47:41 +02:00
Pepijn
e0bde22193 fix(ci): pin claude-code-action to commit SHA and add persist-credentials: false
Fixes pre-commit zizmor failures from PR #3322:
- Pin anthropics/claude-code-action@v1 to commit hash (26ddc358) to satisfy blanket pinning policy
- Add persist-credentials: false to actions/checkout steps to suppress credential-persistence warning
- Remove trailing blank lines to satisfy end-of-file-fixer

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 17:34:53 +02:00
Pauline Bailly-Masson
055f20f658 "Claude Code Review workflow" 2026-04-08 17:22:05 +02:00
Pauline Bailly-Masson
30d2fe3bb3 "Claude PR Assistant workflow" 2026-04-08 17:22:03 +02:00
22 changed files with 350 additions and 1255 deletions

86
.github/CLAUDE.md vendored Normal file
View File

@@ -0,0 +1,86 @@
# LeRobot — Claude Code Instructions
You are a senior robotics ML engineer reviewing code for **LeRobot**, a PyTorch framework for real-world robot learning.
Apply these principles to every PR review, fix, or task.
---
## Core Abstractions
These are the load-bearing types. Handle them with care — breaking changes here affect every user.
| Type | Location | Role |
| ---------------- | ---------------------------- | ------------------------------------------------------------ |
| `LeRobotDataset` | `src/lerobot/datasets/` | Streaming replay buffer; HF Hub integration |
| `Policy` | `src/lerobot/policies/` | Base class for all learning agents (ACT, Diffusion, SARM, …) |
| `Robot` | `src/lerobot/robots/` | Hardware abstraction; carries `_output_pipeline` |
| `Teleoperator` | `src/lerobot/teleoperators/` | Leader-side hardware abstraction; carries `_output_pipeline` |
| `Env` | `src/lerobot/envs/` | Gym-like robotics environments |
| `Processor` | `src/lerobot/processor/` | Data transformation pipelines attached to robots/teleops |
**Never break their public APIs without a migration note and explicit user approval.**
---
## Engineering Principles
### Code quality
- Explicit over magic — no hidden control flow, no implicit state.
- No deep inheritance trees. Prefer composition.
- No decorative comment separators (`===`, `---`, etc.).
- Add comments only where the logic is non-obvious.
- No over-engineering. YAGNI applies strictly.
### Type safety
- All new and modified Python code must be fully typed (PEP 484).
- `mypy --strict` must pass on changed files.
- Do not widen or weaken existing type signatures.
### Backwards compatibility
- Public API changes require migration notes.
- Additive changes are preferred over modifications.
- `so100_follower` / `so101_follower` are aliases — never bleed changes there unintentionally.
### HF ecosystem
- Use `push_to_hub()`, HF Hub dataset streaming, and `evaluate` scripts.
- Dataset changes must preserve streaming compatibility.
- Prefer reusing HF primitives over rolling custom solutions.
---
## PR Review Checklist
Before approving or marking P1 issues resolved, verify:
- [ ] `pre-commit run -a` would pass (ruff, mypy, typos, zizmor, bandit)
- [ ] All new/modified code is typed and passes `mypy --strict`
- [ ] New features have unit tests; no silent behavioral changes
- [ ] Public APIs of `LeRobotDataset`, `Policy`, `Robot`, `Teleoperator`, `Env` are unchanged (or migration note present)
- [ ] HF Hub streaming still works for dataset changes
- [ ] No unnecessary abstractions introduced
- [ ] No breaking changes to training scripts (`lerobot-train`, `lerobot-eval`, `lerobot-record`)
---
## ML-Specific Checks
Flag these as **P1** if found:
- **Data leakage**: train and val/test splits must be constructed before any normalization or augmentation that uses train statistics.
- **Loss function errors**: verify reduction mode (`mean` vs `sum`), correct masking, correct shape alignment.
- **Gradient flow**: new modules must have gradients flowing (check `requires_grad`, no detached tensors in the loss path by accident).
- **Distributed training**: operations on tensors must be DDP-safe; no in-place ops on parameters; batch norm needs `SyncBatchNorm` if used.
- **Memory leaks**: no accumulation of tensors outside the training loop; `optimizer.zero_grad()` called correctly.
---
## What to Skip
- Don't flag style nitpicks on unchanged surrounding code.
- Don't propose refactors outside the PR's scope.
- Don't add docstrings or comments to code the PR didn't touch.
- Don't suggest speculative future features (YAGNI).

View File

@@ -1,309 +0,0 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Integration tests: build an isolated Docker image per benchmark and run a
# 1-episode smoke eval. Each benchmark gets its own image so incompatible
# dependency trees (e.g. hf-libero vs metaworld==3.0.0) can never collide.
#
# To add a new benchmark:
# 1. Add docker/Dockerfile.benchmark.<name> (install only lerobot[<name>])
# 2. Copy one of the jobs below and adjust the image name and eval command.
name: Benchmark Integration Tests
on:
# Run manually from the Actions tab
workflow_dispatch:
# Run every Monday at 02:00 UTC.
schedule:
- cron: "0 2 * * 1"
push:
branches:
- feat/benchmark-ci
- main
paths:
- "src/lerobot/envs/**"
- "src/lerobot/scripts/lerobot_eval.py"
- "docker/Dockerfile.benchmark.*"
- ".github/workflows/benchmark_tests.yml"
- "pyproject.toml"
pull_request:
branches:
- main
paths:
- "src/lerobot/envs/**"
- "src/lerobot/scripts/lerobot_eval.py"
- "docker/Dockerfile.benchmark.*"
- ".github/workflows/benchmark_tests.yml"
- "pyproject.toml"
permissions:
contents: read
env:
UV_VERSION: "0.8.0"
PYTHON_VERSION: "3.12"
# Cancel in-flight runs for the same branch/PR.
concurrency:
group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
cancel-in-progress: true
jobs:
# ── LIBERO ────────────────────────────────────────────────────────────────
# Isolated image: lerobot[libero] only (hf-libero, dm-control, mujoco chain)
libero-integration-test:
name: Libero — build image + 1-episode eval
runs-on:
group: aws-g6-4xlarge-plus
env:
HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
persist-credentials: false
lfs: true
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
with:
cache-binary: false
# Build the benchmark-specific image; layer cache lives in the runner's
# local Docker daemon — reused across re-runs on the same machine.
- name: Build Libero benchmark image
uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
with:
context: .
file: docker/Dockerfile.benchmark.libero
push: false
load: true
tags: lerobot-benchmark-libero:ci
cache-from: type=local,src=/tmp/.buildx-cache-libero
cache-to: type=local,dest=/tmp/.buildx-cache-libero,mode=max
- name: Login to Hugging Face
if: env.HF_USER_TOKEN != ''
run: |
docker run --rm \
-e HF_HOME=/tmp/hf \
lerobot-benchmark-libero:ci \
bash -c "hf auth login --token '$HF_USER_TOKEN' --add-to-git-credential && hf auth whoami"
- name: Run Libero smoke eval (1 episode)
run: |
# Named container (no --rm) so we can docker cp artifacts out.
# Output to /tmp inside the container — user_lerobot cannot create
# root-level dirs like /artifacts.
docker run --name libero-eval --gpus all \
--shm-size=4g \
-e HF_HOME=/tmp/hf \
-e HF_USER_TOKEN="${HF_USER_TOKEN}" \
-e HF_HUB_DOWNLOAD_TIMEOUT=300 \
lerobot-benchmark-libero:ci \
bash -c "
hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
lerobot-eval \
--policy.path=pepijn223/smolvla_libero \
--env.type=libero \
--env.task=libero_spatial \
--eval.batch_size=1 \
--eval.n_episodes=1 \
--eval.use_async_envs=false \
--policy.device=cuda \
'--env.camera_name_mapping={\"agentview_image\": \"camera1\", \"robot0_eye_in_hand_image\": \"camera2\"}' \
--policy.empty_cameras=1 \
--output_dir=/tmp/eval-artifacts
python3 /lerobot/scripts/ci/extract_task_descriptions.py \
--env libero --task libero_spatial \
--output /tmp/eval-artifacts/task_descriptions.json 2>/dev/null || true
"
- name: Copy Libero artifacts from container
if: always()
run: |
mkdir -p /tmp/libero-artifacts
docker cp libero-eval:/tmp/eval-artifacts/. /tmp/libero-artifacts/ 2>/dev/null || true
docker rm -f libero-eval || true
- name: Parse Libero eval metrics
if: always()
run: |
python3 scripts/ci/parse_eval_metrics.py \
--artifacts-dir /tmp/libero-artifacts \
--env libero \
--task libero_spatial \
--policy pepijn223/smolvla_libero
- name: Upload Libero rollout video
if: always()
uses: actions/upload-artifact@v4
with:
name: libero-rollout-video
path: /tmp/libero-artifacts/videos/
if-no-files-found: warn
- name: Upload Libero eval metrics
if: always()
uses: actions/upload-artifact@v4
with:
name: libero-metrics
path: /tmp/libero-artifacts/metrics.json
if-no-files-found: warn
# ── LIBERO TRAIN+EVAL SMOKE ──────────────────────────────────────────────
# Train SmolVLA for 1 step (batch_size=1, dataset episode 0 only) then
# immediately runs eval inside the training loop (eval_freq=1, 1 episode).
# Tests the full train→eval-within-training pipeline end-to-end.
- name: Run Libero train+eval smoke (1 step, eval_freq=1)
run: |
docker run --name libero-train-smoke --gpus all \
--shm-size=4g \
-e HF_HOME=/tmp/hf \
-e HF_USER_TOKEN="${HF_USER_TOKEN}" \
-e HF_HUB_DOWNLOAD_TIMEOUT=300 \
lerobot-benchmark-libero:ci \
bash -c "
hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
accelerate launch --num_processes=1 \$(which lerobot-train) \
--policy.path=lerobot/smolvla_base \
--policy.load_vlm_weights=true \
--policy.scheduler_decay_steps=25000 \
--policy.freeze_vision_encoder=false \
--policy.train_expert_only=false \
--dataset.repo_id=lerobot/libero \
--dataset.episodes=[0] \
--dataset.use_imagenet_stats=false \
--env.type=libero \
--env.task=libero_spatial \
'--env.camera_name_mapping={\"agentview_image\": \"camera1\", \"robot0_eye_in_hand_image\": \"camera2\"}' \
--policy.empty_cameras=1 \
--output_dir=/tmp/train-smoke \
--steps=1 \
--batch_size=1 \
--eval_freq=1 \
--eval.n_episodes=1 \
--eval.batch_size=1 \
--eval.use_async_envs=false \
--save_freq=1 \
--policy.push_to_hub=false \
'--rename_map={\"observation.images.image\": \"observation.images.camera1\", \"observation.images.image2\": \"observation.images.camera2\"}'
"
- name: Copy Libero train-smoke artifacts from container
if: always()
run: |
mkdir -p /tmp/libero-train-smoke-artifacts
docker cp libero-train-smoke:/tmp/train-smoke/. /tmp/libero-train-smoke-artifacts/ 2>/dev/null || true
docker rm -f libero-train-smoke || true
- name: Upload Libero train-smoke eval video
if: always()
uses: actions/upload-artifact@v4
with:
name: libero-train-smoke-video
path: /tmp/libero-train-smoke-artifacts/eval/
if-no-files-found: warn
# ── METAWORLD ─────────────────────────────────────────────────────────────
# Isolated image: lerobot[metaworld] only (metaworld==3.0.0, mujoco>=3 chain)
metaworld-integration-test:
name: MetaWorld — build image + 1-episode eval
runs-on:
group: aws-g6-4xlarge-plus
env:
HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
persist-credentials: false
lfs: true
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
with:
cache-binary: false
- name: Build MetaWorld benchmark image
uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
with:
context: .
file: docker/Dockerfile.benchmark.metaworld
push: false
load: true
tags: lerobot-benchmark-metaworld:ci
cache-from: type=local,src=/tmp/.buildx-cache-metaworld
cache-to: type=local,dest=/tmp/.buildx-cache-metaworld,mode=max
- name: Run MetaWorld smoke eval (1 episode)
run: |
docker run --name metaworld-eval --gpus all \
--shm-size=4g \
-e HF_HOME=/tmp/hf \
-e HF_USER_TOKEN="${HF_USER_TOKEN}" \
-e HF_HUB_DOWNLOAD_TIMEOUT=300 \
lerobot-benchmark-metaworld:ci \
bash -c "
hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
lerobot-eval \
--policy.path=pepijn223/smolvla_metaworld \
--env.type=metaworld \
--env.task=metaworld-push-v3 \
--eval.batch_size=1 \
--eval.n_episodes=1 \
--eval.use_async_envs=false \
--policy.device=cuda \
'--rename_map={\"observation.image\": \"observation.images.camera1\"}' \
--policy.empty_cameras=2 \
--output_dir=/tmp/eval-artifacts
python3 /lerobot/scripts/ci/extract_task_descriptions.py \
--env metaworld --task metaworld-push-v3 \
--output /tmp/eval-artifacts/task_descriptions.json 2>/dev/null || true
"
- name: Copy MetaWorld artifacts from container
if: always()
run: |
mkdir -p /tmp/metaworld-artifacts
docker cp metaworld-eval:/tmp/eval-artifacts/. /tmp/metaworld-artifacts/ 2>/dev/null || true
docker rm -f metaworld-eval || true
- name: Parse MetaWorld eval metrics
if: always()
run: |
python3 scripts/ci/parse_eval_metrics.py \
--artifacts-dir /tmp/metaworld-artifacts \
--env metaworld \
--task metaworld-push-v3 \
--policy pepijn223/smolvla_metaworld
- name: Upload MetaWorld rollout video
if: always()
uses: actions/upload-artifact@v4
with:
name: metaworld-rollout-video
path: /tmp/metaworld-artifacts/videos/
if-no-files-found: warn
- name: Upload MetaWorld eval metrics
if: always()
uses: actions/upload-artifact@v4
with:
name: metaworld-metrics
path: /tmp/metaworld-artifacts/metrics.json
if-no-files-found: warn

View File

@@ -0,0 +1,49 @@
name: Claude Code Review
on:
pull_request:
types: [opened, synchronize, ready_for_review, reopened]
jobs:
claude-review:
runs-on: ubuntu-latest
permissions:
contents: read
pull-requests: write
issues: read
id-token: write
actions: read
env:
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
fetch-depth: 1
persist-credentials: false
- name: Run Claude Code Review
id: claude-review
uses: anthropics/claude-code-action@26ddc358fe3befff50c5ec2f80304c90c763f6f8 # v1
with:
anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
use_sticky_comment: true
prompt: |
Read `.github/CLAUDE.md` for lerobot-specific conventions, then review this PR.
Provide structured, actionable feedback.
Focus areas (in priority order):
1. **Correctness**: Logic errors, off-by-ones, wrong tensor shapes, incorrect loss functions
2. **Type safety**: All new/modified Python code must pass `mypy --strict`; check for missing annotations
3. **Backwards compatibility**: Does this break `LeRobotDataset`, `Policy`, `Robot`, `Teleoperator`, `Env`, or `Processor` public APIs?
4. **Tests**: New features must have tests; no silent behavioral changes
5. **Code style**: Explicit over magic, no unnecessary abstractions, no decorative comments
6. **HF integration**: Dataset streaming, `push_to_hub`, HF Hub compatibility preserved?
7. **pre-commit**: Would `pre-commit run -a` pass? (ruff, mypy, typos, zizmor)
Format findings as P1 (must fix) / P2 (should fix) / P3 (nice to have).
Skip P3 if the PR is already high quality.
claude_args: '--model claude-opus-4-6'
# See https://github.com/anthropics/claude-code-action/blob/main/docs/usage.md
# or https://code.claude.com/docs/en/cli-reference for available options

58
.github/workflows/claude.yml vendored Normal file
View File

@@ -0,0 +1,58 @@
name: Claude Code
on:
issue_comment:
types: [created]
pull_request_review_comment:
types: [created]
issues:
types: [opened, assigned]
pull_request_review:
types: [submitted]
jobs:
claude:
if: |
(github.event_name == 'issue_comment' &&
contains(github.event.comment.body, '@claude') &&
(github.event.comment.author_association == 'OWNER' || github.event.comment.author_association == 'MEMBER' || github.event.comment.author_association == 'COLLABORATOR')) ||
(github.event_name == 'pull_request_review_comment' &&
contains(github.event.comment.body, '@claude') &&
(github.event.comment.author_association == 'OWNER' || github.event.comment.author_association == 'MEMBER' || github.event.comment.author_association == 'COLLABORATOR')) ||
(github.event_name == 'pull_request_review' &&
contains(github.event.review.body, '@claude') &&
(github.event.review.author_association == 'OWNER' || github.event.review.author_association == 'MEMBER' || github.event.review.author_association == 'COLLABORATOR')) ||
(github.event_name == 'issues' &&
(contains(github.event.issue.body, '@claude') || contains(github.event.issue.title, '@claude')) &&
(github.event.issue.author_association == 'OWNER' || github.event.issue.author_association == 'MEMBER' || github.event.issue.author_association == 'COLLABORATOR'))
runs-on: ubuntu-latest
permissions:
contents: read
pull-requests: write
issues: write
id-token: write
actions: read
env:
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
fetch-depth: 1
persist-credentials: false
- name: Run Claude Code
id: claude
uses: anthropics/claude-code-action@26ddc358fe3befff50c5ec2f80304c90c763f6f8 # v1
with:
anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
use_sticky_comment: true
# This is an optional setting that allows Claude to read CI results on PRs
additional_permissions: |
actions: read
claude_args: '--system-prompt "Read .github/CLAUDE.md for lerobot-specific conventions before responding."'
# See https://github.com/anthropics/claude-code-action/blob/main/docs/usage.md
# or https://code.claude.com/docs/en/cli-reference for available options

View File

@@ -1,89 +0,0 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Isolated benchmark image for LIBERO integration tests.
# Installs only lerobot[libero] so its dep tree (hf-libero, dm-control, mujoco)
# cannot conflict with other benchmarks.
#
# Build: docker build -f docker/Dockerfile.benchmark.libero -t lerobot-benchmark-libero .
# Run: docker run --gpus all --rm lerobot-benchmark-libero lerobot-eval ...
ARG CUDA_VERSION=12.4.1
ARG OS_VERSION=22.04
FROM nvidia/cuda:${CUDA_VERSION}-base-ubuntu${OS_VERSION}
ARG PYTHON_VERSION=3.12
ENV DEBIAN_FRONTEND=noninteractive \
MUJOCO_GL=egl \
PATH=/lerobot/.venv/bin:$PATH \
CUDA_VISIBLE_DEVICES=0 \
DEVICE=cuda
# System deps — same set as Dockerfile.internal
RUN apt-get update && apt-get install -y --no-install-recommends \
software-properties-common build-essential git curl \
libglib2.0-0 libgl1-mesa-glx libegl1-mesa ffmpeg \
libusb-1.0-0-dev speech-dispatcher libgeos-dev portaudio19-dev \
cmake pkg-config ninja-build \
&& add-apt-repository -y ppa:deadsnakes/ppa \
&& apt-get update \
&& apt-get install -y --no-install-recommends \
python${PYTHON_VERSION} \
python${PYTHON_VERSION}-venv \
python${PYTHON_VERSION}-dev \
&& curl -LsSf https://astral.sh/uv/install.sh | sh \
&& mv /root/.local/bin/uv /usr/local/bin/uv \
&& useradd --create-home --shell /bin/bash user_lerobot \
&& usermod -aG sudo user_lerobot \
&& apt-get clean && rm -rf /var/lib/apt/lists/*
WORKDIR /lerobot
RUN chown -R user_lerobot:user_lerobot /lerobot
USER user_lerobot
ENV HOME=/home/user_lerobot \
HF_HOME=/home/user_lerobot/.cache/huggingface \
HF_LEROBOT_HOME=/home/user_lerobot/.cache/huggingface/lerobot \
TORCH_HOME=/home/user_lerobot/.cache/torch \
TRITON_CACHE_DIR=/home/user_lerobot/.cache/triton
RUN uv venv --python python${PYTHON_VERSION}
# Install only lerobot[libero] — completely isolated from metaworld's dep tree
COPY --chown=user_lerobot:user_lerobot setup.py pyproject.toml uv.lock README.md MANIFEST.in ./
COPY --chown=user_lerobot:user_lerobot src/ src/
RUN uv sync --locked --extra libero --extra smolvla --no-cache
# Pre-download lerobot/libero-assets from HF Hub so nothing is fetched at
# runtime (which times out on CI). Point the libero config at the cached path.
# libero/libero/__init__.py calls input() when ~/.libero/config.yaml is missing,
# so we write the config before any libero import can happen.
RUN LIBERO_DIR=$(python${PYTHON_VERSION} -c \
"import importlib.util, os; s=importlib.util.find_spec('libero'); \
print(os.path.join(os.path.dirname(s.origin), 'libero'))") && \
mkdir -p /home/user_lerobot/.libero && \
python${PYTHON_VERSION} -c "\
from huggingface_hub import snapshot_download; \
snapshot_download(repo_id='lerobot/libero-assets', repo_type='dataset', \
local_dir='/home/user_lerobot/.libero/assets')" && \
printf "assets: /home/user_lerobot/.libero/assets\nbddl_files: ${LIBERO_DIR}/bddl_files\ndatasets: ${LIBERO_DIR}/../datasets\ninit_states: ${LIBERO_DIR}/init_files\n" \
> /home/user_lerobot/.libero/config.yaml
RUN chmod +x /lerobot/.venv/lib/python${PYTHON_VERSION}/site-packages/triton/backends/nvidia/bin/ptxas
COPY --chown=user_lerobot:user_lerobot . .
CMD ["/bin/bash"]

View File

@@ -1,74 +0,0 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Isolated benchmark image for MetaWorld integration tests.
# Installs only lerobot[metaworld] so its dep tree (metaworld==3.0.0, mujoco>=3)
# cannot conflict with other benchmarks.
#
# Build: docker build -f docker/Dockerfile.benchmark.metaworld -t lerobot-benchmark-metaworld .
# Run: docker run --gpus all --rm lerobot-benchmark-metaworld lerobot-eval ...
ARG CUDA_VERSION=12.4.1
ARG OS_VERSION=22.04
FROM nvidia/cuda:${CUDA_VERSION}-base-ubuntu${OS_VERSION}
ARG PYTHON_VERSION=3.12
ENV DEBIAN_FRONTEND=noninteractive \
MUJOCO_GL=egl \
PATH=/lerobot/.venv/bin:$PATH \
CUDA_VISIBLE_DEVICES=0 \
DEVICE=cuda
# System deps — same set as Dockerfile.internal
RUN apt-get update && apt-get install -y --no-install-recommends \
software-properties-common build-essential git curl \
libglib2.0-0 libgl1-mesa-glx libegl1-mesa ffmpeg \
libusb-1.0-0-dev speech-dispatcher libgeos-dev portaudio19-dev \
cmake pkg-config ninja-build \
&& add-apt-repository -y ppa:deadsnakes/ppa \
&& apt-get update \
&& apt-get install -y --no-install-recommends \
python${PYTHON_VERSION} \
python${PYTHON_VERSION}-venv \
python${PYTHON_VERSION}-dev \
&& curl -LsSf https://astral.sh/uv/install.sh | sh \
&& mv /root/.local/bin/uv /usr/local/bin/uv \
&& useradd --create-home --shell /bin/bash user_lerobot \
&& usermod -aG sudo user_lerobot \
&& apt-get clean && rm -rf /var/lib/apt/lists/*
WORKDIR /lerobot
RUN chown -R user_lerobot:user_lerobot /lerobot
USER user_lerobot
ENV HOME=/home/user_lerobot \
HF_HOME=/home/user_lerobot/.cache/huggingface \
HF_LEROBOT_HOME=/home/user_lerobot/.cache/huggingface/lerobot \
TORCH_HOME=/home/user_lerobot/.cache/torch \
TRITON_CACHE_DIR=/home/user_lerobot/.cache/triton
RUN uv venv --python python${PYTHON_VERSION}
# Install only lerobot[metaworld] — completely isolated from libero's dep tree
COPY --chown=user_lerobot:user_lerobot setup.py pyproject.toml uv.lock README.md MANIFEST.in ./
COPY --chown=user_lerobot:user_lerobot src/ src/
RUN uv sync --locked --extra metaworld --extra smolvla --no-cache
RUN chmod +x /lerobot/.venv/lib/python${PYTHON_VERSION}/site-packages/triton/backends/nvidia/bin/ptxas
COPY --chown=user_lerobot:user_lerobot . .
CMD ["/bin/bash"]

View File

@@ -73,8 +73,6 @@
title: Control & Train Robots in Sim (LeIsaac) title: Control & Train Robots in Sim (LeIsaac)
title: "Simulation" title: "Simulation"
- sections: - sections:
- local: evaluation
title: Evaluation (lerobot-eval)
- local: adding_benchmarks - local: adding_benchmarks
title: Adding a New Benchmark title: Adding a New Benchmark
- local: libero - local: libero

View File

@@ -26,7 +26,7 @@ During evaluation, data moves through four stages:
1. gym.Env ──→ raw observations (numpy dicts) 1. gym.Env ──→ raw observations (numpy dicts)
2. Preprocessing ──→ standard LeRobot keys + task description 2. Preprocessing ──→ standard LeRobot keys + task description
(preprocess_observation in envs/utils.py, env.call("task_description")) (preprocess_observation, add_envs_task in envs/utils.py)
3. Processors ──→ env-specific then policy-specific transforms 3. Processors ──→ env-specific then policy-specific transforms
(env_preprocessor, policy_preprocessor) (env_preprocessor, policy_preprocessor)
@@ -122,17 +122,15 @@ Each `EnvConfig` subclass declares two dicts that tell the policy what to expect
### Checklist ### Checklist
| File | Required | Why | | File | Required | Why |
| ----------------------------------------- | -------- | ------------------------------------------------------------ | | ---------------------------------------- | -------- | ------------------------------------------------------------ |
| `src/lerobot/envs/<benchmark>.py` | Yes | Wraps the simulator as a standard gym.Env | | `src/lerobot/envs/<benchmark>.py` | Yes | Wraps the simulator as a standard gym.Env |
| `src/lerobot/envs/configs.py` | Yes | Registers your benchmark and its `create_envs()` for the CLI | | `src/lerobot/envs/configs.py` | Yes | Registers your benchmark and its `create_envs()` for the CLI |
| `src/lerobot/processor/env_processor.py` | Optional | Custom observation/action transforms | | `src/lerobot/processor/env_processor.py` | Optional | Custom observation/action transforms |
| `src/lerobot/envs/utils.py` | Optional | Only if you need new raw observation keys | | `src/lerobot/envs/utils.py` | Optional | Only if you need new raw observation keys |
| `pyproject.toml` | Yes | Declares benchmark-specific dependencies | | `pyproject.toml` | Yes | Declares benchmark-specific dependencies |
| `docs/source/<benchmark>.mdx` | Yes | User-facing documentation page | | `docs/source/<benchmark>.mdx` | Yes | User-facing documentation page |
| `docs/source/_toctree.yml` | Yes | Adds your page to the docs sidebar | | `docs/source/_toctree.yml` | Yes | Adds your page to the docs sidebar |
| `docker/Dockerfile.benchmark.<benchmark>` | Yes | Isolated Docker image for CI smoke tests |
| `.github/workflows/benchmark_tests.yml` | Yes | CI job that builds the image and runs a 1-episode smoke eval |
### 1. The gym.Env wrapper (`src/lerobot/envs/<benchmark>.py`) ### 1. The gym.Env wrapper (`src/lerobot/envs/<benchmark>.py`)
@@ -163,8 +161,6 @@ class MyBenchmarkEnv(gym.Env):
... ...
``` ```
**GPU-based simulators (e.g. MuJoCo with EGL rendering):** If your simulator allocates GPU/EGL contexts during `__init__`, defer that allocation to a `_ensure_env()` helper called on first `reset()`/`step()`. This avoids inheriting stale GPU handles when `AsyncVectorEnv` spawns worker processes. See `LiberoEnv._ensure_env()` for the pattern.
Also provide a factory function that returns the nested dict structure: Also provide a factory function that returns the nested dict structure:
```python ```python
@@ -211,7 +207,7 @@ class MyBenchmarkEnvConfig(EnvConfig):
def gym_kwargs(self) -> dict: def gym_kwargs(self) -> dict:
return {"obs_type": self.obs_type, "render_mode": self.render_mode} return {"obs_type": self.obs_type, "render_mode": self.render_mode}
def create_envs(self, n_envs: int, use_async_envs: bool = True): def create_envs(self, n_envs: int, use_async_envs: bool = False):
"""Override for multi-task benchmarks or custom env creation.""" """Override for multi-task benchmarks or custom env creation."""
from lerobot.envs.<benchmark> import create_<benchmark>_envs from lerobot.envs.<benchmark> import create_<benchmark>_envs
return create_<benchmark>_envs(task=self.task, n_envs=n_envs, ...) return create_<benchmark>_envs(task=self.task, n_envs=n_envs, ...)
@@ -297,87 +293,14 @@ Add your benchmark to the "Benchmarks" section:
title: "Benchmarks" title: "Benchmarks"
``` ```
### 7. CI smoke test (`docker/` + `.github/workflows/benchmark_tests.yml`)
Each benchmark must have an isolated Docker image and a CI job that runs a 1-episode eval. This catches install-time regressions (broken transitive deps, import errors, interactive prompts) before they reach users.
**Create `docker/Dockerfile.benchmark.<benchmark>`** — copy an existing one and change only the extra name:
```dockerfile
# Isolated benchmark image — installs lerobot[<benchmark>] only.
# Build: docker build -f docker/Dockerfile.benchmark.<benchmark> -t lerobot-benchmark-<benchmark> .
ARG CUDA_VERSION=12.4.1
ARG OS_VERSION=22.04
FROM nvidia/cuda:${CUDA_VERSION}-base-ubuntu${OS_VERSION}
ARG PYTHON_VERSION=3.12
# ... (same system deps as Dockerfile.benchmark.libero) ...
RUN uv sync --locked --extra <benchmark> --no-cache
```
Each benchmark gets its own image so its dependency tree (pinned simulator packages, specific mujoco/scipy versions) cannot conflict with other benchmarks.
**Add a job to `.github/workflows/benchmark_tests.yml`** — copy an existing job block and adjust:
```yaml
<benchmark>-integration-test:
name: <Benchmark> — build image + 1-episode eval
runs-on:
group: aws-g6-4xlarge-plus
env:
HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
persist-credentials: false
lfs: true
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
with:
cache-binary: false
- name: Build <Benchmark> image
uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
with:
context: .
file: docker/Dockerfile.benchmark.<benchmark>
push: false
load: true
tags: lerobot-benchmark-<benchmark>:ci
cache-from: type=local,src=/tmp/.buildx-cache-<benchmark>
cache-to: type=local,dest=/tmp/.buildx-cache-<benchmark>,mode=max
- name: Run <Benchmark> smoke eval (1 episode)
run: |
docker run --rm --gpus all \
--shm-size=4g \
-e HF_HOME=/tmp/hf \
-e HF_USER_TOKEN="${HF_USER_TOKEN}" \
lerobot-benchmark-<benchmark>:ci \
bash -c "
hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
lerobot-eval \
--policy.path=<hub_policy_path> \
--env.type=<benchmark> \
--env.task=<task> \
--eval.batch_size=1 \
--eval.n_episodes=1 \
--eval.use_async_envs=false \
--policy.device=cuda
"
```
**Tips:**
- If the benchmark library prompts for user input on import (like LIBERO asking for a dataset folder), pass the relevant env var in the `docker run` command (e.g. `-e LIBERO_DATA_FOLDER=/tmp/libero_data`).
- The job is scoped to only trigger on changes to `src/lerobot/envs/**`, `src/lerobot/scripts/lerobot_eval.py`, and the Dockerfiles — it won't run on unrelated PRs.
## Verifying your integration ## Verifying your integration
After completing the steps above, confirm that everything works: After completing the steps above, confirm that everything works:
1. **Install** — `pip install -e ".[mybenchmark]"` and verify the dependency group installs cleanly. 1. **Install** — `pip install -e ".[mybenchmark]"` and verify the dependency group installs cleanly.
2. **Smoke test env creation** — call `make_env()` with your config in Python, check that the returned dict has the expected `{suite: {task_id: VectorEnv}}` shape, and that `reset()` returns observations with the right keys. 2. **Smoke test env creation** — call `make_env()` with your config in Python, check that the returned dict has the expected `{suite: {task_id: VectorEnv}}` shape, and that `reset()` returns observations with the right keys.
3. **Run a full eval** — `lerobot-eval --env.type=<name> --env.task=<task> --eval.n_episodes=1 --policy.path=<any_compatible_policy>` to exercise the full pipeline end-to-end. (`batch_size` defaults to auto-tuning based on CPU cores; pass `--eval.batch_size=1` to force a single environment.) 3. **Run a full eval** — `lerobot-eval --env.type=<name> --env.task=<task> --eval.n_episodes=1 --eval.batch_size=1 --policy.path=<any_compatible_policy>` to exercise the full pipeline end-to-end.
4. **Check success detection** — verify that `info["is_success"]` flips to `True` when the task is actually completed. This is what the eval loop uses to compute success rates. 4. **Check success detection** — verify that `info["is_success"]` flips to `True` when the task is actually completed. This is what the eval loop uses to compute success rates.
5. **Add CI smoke test** — follow step 7 above to add a Dockerfile and CI job. This ensures the install stays green as dependencies evolve.
## Writing a benchmark doc page ## Writing a benchmark doc page
@@ -388,7 +311,7 @@ Each benchmark `.mdx` page should include:
- **Overview image or GIF.** - **Overview image or GIF.**
- **Available tasks** — table of task suites with counts and brief descriptions. - **Available tasks** — table of task suites with counts and brief descriptions.
- **Installation** — `pip install -e ".[<benchmark>]"` plus any extra steps (env vars, system packages). - **Installation** — `pip install -e ".[<benchmark>]"` plus any extra steps (env vars, system packages).
- **Evaluation** — recommended `lerobot-eval` command with `n_episodes` for reproducible results. `batch_size` defaults to auto; only specify it if needed. Include single-task and multi-task examples if applicable. See the [Evaluation guide](evaluation) for details. - **Evaluation** — recommended `lerobot-eval` command with `n_episodes` and `batch_size` for reproducible results. Include single-task and multi-task examples if applicable.
- **Policy inputs and outputs** — observation keys with shapes, action space description. - **Policy inputs and outputs** — observation keys with shapes, action space description.
- **Recommended evaluation episodes** — how many episodes per task is standard. - **Recommended evaluation episodes** — how many episodes per task is standard.
- **Training** — example `lerobot-train` command. - **Training** — example `lerobot-train` command.

View File

@@ -90,11 +90,17 @@ The same policy can work with different environment processors, and the same env
```python ```python
# Use SmolVLA policy with LIBERO environment # Use SmolVLA policy with LIBERO environment
libero_preprocessor, libero_postprocessor = make_env_pre_post_processors(libero_cfg) # Use SmolVLA policy with LIBERO environment
libero_preprocessor, libero_postprocessor = make_env_pre_post_processors(
env_cfg=libero_cfg,
policy_cfg=smolvla_cfg,
)
smolvla_preprocessor, smolvla_postprocessor = make_pre_post_processors(smolvla_cfg) smolvla_preprocessor, smolvla_postprocessor = make_pre_post_processors(smolvla_cfg)
# Or use ACT policy with the same LIBERO environment # Or use ACT policy with the same LIBERO environment
libero_preprocessor, libero_postprocessor = make_env_pre_post_processors(libero_cfg) libero_preprocessor, libero_postprocessor = make_env_pre_post_processors(
env_cfg=libero_cfg,
policy_cfg=act_cfg,
)
act_preprocessor, act_postprocessor = make_pre_post_processors(act_cfg) act_preprocessor, act_postprocessor = make_pre_post_processors(act_cfg)
``` ```
@@ -151,7 +157,7 @@ observation = {
### Factory Function ### Factory Function
The `make_env_pre_post_processors` function follows the same pattern as `make_pre_post_processors` for policies: The `make_env_pre_post_processors` function delegates to `env_cfg.get_env_processors()`:
```python ```python
from lerobot.envs.factory import make_env_pre_post_processors from lerobot.envs.factory import make_env_pre_post_processors
@@ -159,47 +165,31 @@ from lerobot.envs.configs import LiberoEnv, PushtEnv
# For LIBERO: Returns LiberoProcessorStep in preprocessor # For LIBERO: Returns LiberoProcessorStep in preprocessor
libero_cfg = LiberoEnv(task="libero_spatial", camera_name=["agentview"]) libero_cfg = LiberoEnv(task="libero_spatial", camera_name=["agentview"])
env_preprocessor, env_postprocessor = make_env_pre_post_processors(libero_cfg) env_preprocessor, env_postprocessor = make_env_pre_post_processors(libero_cfg, policy_cfg)
# For other environments: Returns identity processors (no-op) # For other environments: Returns identity processors (no-op)
pusht_cfg = PushtEnv() pusht_cfg = PushtEnv()
env_preprocessor, env_postprocessor = make_env_pre_post_processors(pusht_cfg) env_preprocessor, env_postprocessor = make_env_pre_post_processors(pusht_cfg, policy_cfg)
``` ```
### Implementation in `envs/factory.py` ### How It Works
Each `EnvConfig` subclass can override `get_env_processors()` to return benchmark-specific
processor pipelines. The base class returns identity (no-op) processors by default.
```python ```python
def make_env_pre_post_processors( # In your EnvConfig subclass:
env_cfg: EnvConfig, def get_env_processors(self):
) -> tuple[ from lerobot.processor.pipeline import PolicyProcessorPipeline
PolicyProcessorPipeline[dict[str, Any], dict[str, Any]], return (
PolicyProcessorPipeline[dict[str, Any], dict[str, Any]], PolicyProcessorPipeline(steps=[MyProcessorStep()]),
]: PolicyProcessorPipeline(steps=[]),
""" )
Create preprocessor and postprocessor pipelines for environment observations.
Args:
env_cfg: The configuration of the environment.
Returns:
A tuple containing:
- preprocessor: Pipeline that processes environment observations
- postprocessor: Pipeline that processes environment outputs
"""
# For LIBERO environments, add the LiberoProcessorStep to preprocessor
if isinstance(env_cfg, LiberoEnv) or "libero" in env_cfg.type:
preprocessor = PolicyProcessorPipeline(steps=[LiberoProcessorStep()])
else:
# For all other environments, return an identity preprocessor
preprocessor = PolicyProcessorPipeline(steps=[])
# Postprocessor is currently identity for all environments
# Future: Could add environment-specific action transformations
postprocessor = PolicyProcessorPipeline(steps=[])
return preprocessor, postprocessor
``` ```
The factory function `make_env_pre_post_processors` simply delegates to this method,
with a special case for `XVLAConfig` policies which override the env processors entirely.
### Integration in Evaluation ### Integration in Evaluation
In `lerobot_eval.py`, the environment processors are created once and used throughout: In `lerobot_eval.py`, the environment processors are created once and used throughout:
@@ -219,7 +209,10 @@ def eval_main(cfg: EvalPipelineConfig):
) )
# Create environment processors (NEW!) # Create environment processors (NEW!)
env_preprocessor, env_postprocessor = make_env_pre_post_processors(env_cfg=cfg.env) env_preprocessor, env_postprocessor = make_env_pre_post_processors(
env_cfg=cfg.env,
policy_cfg=cfg.policy,
)
# Run evaluation with both processor types # Run evaluation with both processor types
eval_policy_all( eval_policy_all(
@@ -323,21 +316,22 @@ class MyEnvProcessorStep(ObservationProcessorStep):
return processed return processed
``` ```
### 2. Update the Factory ### 2. Update Your `EnvConfig` Subclass
```python ```python
# In src/lerobot/envs/factory.py # In src/lerobot/envs/configs.py
@EnvConfig.register_subclass("myenv")
@dataclass
class MyEnvConfig(EnvConfig):
# ... task/features/gym kwargs ...
def make_env_pre_post_processors(env_cfg: EnvConfig): def get_env_processors(self):
if isinstance(env_cfg, LiberoEnv) or "libero" in env_cfg.type: from lerobot.processor.pipeline import PolicyProcessorPipeline
preprocessor = PolicyProcessorPipeline(steps=[LiberoProcessorStep()])
elif isinstance(env_cfg, MyEnvConfig) or "myenv" in env_cfg.type:
preprocessor = PolicyProcessorPipeline(steps=[MyEnvProcessorStep()])
else:
preprocessor = PolicyProcessorPipeline(steps=[])
postprocessor = PolicyProcessorPipeline(steps=[]) return (
return preprocessor, postprocessor PolicyProcessorPipeline(steps=[MyEnvProcessorStep()]),
PolicyProcessorPipeline(steps=[]),
)
``` ```
### 3. Use in Evaluation ### 3. Use in Evaluation

View File

@@ -1,162 +0,0 @@
# Evaluation
`lerobot-eval` runs a trained policy on a simulation benchmark and reports success rate, reward, and (optionally) episode videos. It handles environment creation, batched rollouts, and metric aggregation automatically.
## Quick start
Evaluate a Hub-hosted policy on LIBERO:
```bash
lerobot-eval \
--policy.path=pepijn223/smolvla_libero \
--env.type=libero \
--env.task=libero_spatial \
--eval.n_episodes=10 \
--policy.device=cuda
```
Evaluate a local checkpoint:
```bash
lerobot-eval \
--policy.path=outputs/train/act_pusht/checkpoints/005000/pretrained_model \
--env.type=pusht \
--eval.n_episodes=10
```
`batch_size` defaults to **auto** (based on CPU cores). The script picks the right number of parallel environments for your machine.
## Key flags
| Flag | Default | Description |
| ----------------------- | -------------- | ------------------------------------------------------------------------------------- |
| `--policy.path` | required | Hub repo ID or local path to a pretrained model |
| `--env.type` | required | Benchmark name (`pusht`, `libero`, `metaworld`, etc.) |
| `--env.task` | varies | Task or suite name (e.g. `libero_spatial`, `libero_10`) |
| `--eval.n_episodes` | `50` | Total episodes to run (across all tasks) |
| `--eval.batch_size` | `0` (auto) | Number of parallel environments. `0` = auto-tune from CPU cores |
| `--eval.use_async_envs` | `true` | Use `AsyncVectorEnv` (parallel stepping). Auto-downgrades to sync when `batch_size=1` |
| `--policy.device` | `cuda` | Inference device |
| `--policy.use_amp` | `false` | Mixed-precision inference (saves VRAM, faster on Ampere+) |
| `--seed` | `1000` | Random seed for reproducibility |
| `--output_dir` | auto-generated | Where to write results and videos |
### Environment-specific flags
Some benchmarks accept additional flags through `--env.*`:
```bash
# LIBERO: map simulator camera names to policy feature names
--env.camera_name_mapping='{"agentview_image": "camera1", "robot0_eye_in_hand_image": "camera2"}'
# Fill unused camera slots with zeros
--policy.empty_cameras=1
```
See each benchmark's documentation ([LIBERO](libero), [Meta-World](metaworld)) for benchmark-specific flags.
## How batch_size works
`batch_size` controls how many environments run in parallel within a single `VectorEnv`:
| `batch_size` | Behavior |
| ------------- | -------------------------------------------------------------------- |
| `0` (default) | Auto-tune: `floor(cpu_cores × 0.7)`, capped by `n_episodes` and `64` |
| `1` | Single environment, synchronous. Useful for debugging |
| `N` | N environments step in parallel via `AsyncVectorEnv` |
When `batch_size > 1` and `use_async_envs=true`, each environment runs in its own subprocess via Gymnasium's `AsyncVectorEnv`. This parallelizes the simulation stepping (the main bottleneck), while the policy runs a single batched forward pass on GPU.
**Example:** On a 16-core machine with `n_episodes=100`:
- Auto batch_size = `floor(16 × 0.7)` = `11`
- 11 environments step simultaneously → ~11× faster than sequential
## Performance
### AsyncVectorEnv (default)
`AsyncVectorEnv` spawns one subprocess per environment. Each subprocess has its own simulator instance. While the policy computes actions on GPU, all environments step in parallel on CPU:
```
GPU: [inference]....[inference]....[inference]....
CPU: [step × N]....................[step × N]......
↑ parallel ↑ parallel
```
For GPU-based simulators (LIBERO, Meta-World), the environments use **lazy initialization**: the GPU/EGL context is created inside the worker subprocess on first `reset()`, not in the parent process. This avoids `EGL_BAD_CONTEXT` crashes from inheriting stale GPU handles across `fork()`.
### Lazy task loading
For multi-task benchmarks (e.g. LIBERO with 10 tasks), environments are wrapped in `_LazyAsyncVectorEnv` which defers worker creation until the task is actually evaluated. This keeps peak process count = `batch_size` instead of `n_tasks × batch_size`. After each task completes, workers are closed to free resources.
### Tuning for speed
| Situation | Recommendation |
| ------------------------------ | ----------------------------------------------------- |
| Slow eval, low GPU utilization | Increase `batch_size` (or leave at auto) |
| Out of memory (system RAM) | Decrease `batch_size` |
| Out of GPU memory | Decrease `batch_size`, or use `--policy.use_amp=true` |
| Debugging / single-stepping | `--eval.batch_size=1 --eval.use_async_envs=false` |
## Output
Results are written to `output_dir` (default: `outputs/eval/<date>/<time>_<job_name>/`):
- `eval_info.json` — full metrics: per-episode, per-task, per-group, and overall aggregates
- `videos/` — episode recordings (when `--eval.n_episodes_to_render > 0`)
### Metrics
| Metric | Description |
| ---------------- | -------------------------------------------------------------------- |
| `pc_success` | Success rate (%). Based on `info["is_success"]` from the environment |
| `avg_sum_reward` | Mean cumulative reward per episode |
| `avg_max_reward` | Mean peak reward per episode |
| `n_episodes` | Total episodes evaluated |
| `eval_s` | Total wall-clock time |
| `eval_ep_s` | Mean wall-clock time per episode |
## Multi-task evaluation
For benchmarks with multiple tasks (LIBERO suites, Meta-World MT50), `lerobot-eval` automatically:
1. Creates environments for all tasks in the selected suite(s)
2. Evaluates each task sequentially (one task's workers at a time)
3. Aggregates metrics per-task, per-group (suite), and overall
```bash
# Evaluate all 10 tasks in libero_spatial
lerobot-eval \
--policy.path=pepijn223/smolvla_libero \
--env.type=libero \
--env.task=libero_spatial \
--eval.n_episodes=10
# Evaluate multiple suites
lerobot-eval \
--policy.path=pepijn223/smolvla_libero \
--env.type=libero \
--env.task="libero_spatial,libero_object" \
--eval.n_episodes=10
```
## API usage
You can call the eval functions directly from Python:
```python
from lerobot.envs.factory import make_env
from lerobot.policies.factory import make_policy
from lerobot.scripts.lerobot_eval import eval_policy
envs = make_env(env_cfg, n_envs=10)
policy = make_policy(cfg=policy_cfg, env_cfg=env_cfg)
metrics = eval_policy(
env=envs["libero_spatial"][0],
policy=policy,
n_episodes=10,
)
print(metrics["pc_success"])
```

View File

@@ -1,89 +0,0 @@
#!/usr/bin/env python3
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Extract natural-language task descriptions for a benchmark suite.
Runs inside the benchmark Docker container (where the env library is installed)
immediately after lerobot-eval, writing a JSON file that parse_eval_metrics.py
picks up and embeds in metrics.json.
Output format: {"<suite>_<task_idx>": "<nl instruction>", ...}
Usage:
python scripts/ci/extract_task_descriptions.py \\
--env libero --task libero_spatial \\
--output /tmp/eval-artifacts/task_descriptions.json
"""
from __future__ import annotations
import argparse
import json
import sys
from pathlib import Path
def _libero_descriptions(task_suite: str) -> dict[str, str]:
from libero.libero import benchmark # type: ignore[import-untyped]
suite_dict = benchmark.get_benchmark_dict()
if task_suite not in suite_dict:
print(
f"[extract_task_descriptions] Unknown LIBERO suite '{task_suite}'. "
f"Available: {list(suite_dict.keys())}",
file=sys.stderr,
)
return {}
suite = suite_dict[task_suite]()
return {f"{task_suite}_{i}": suite.get_task(i).language for i in range(suite.n_tasks)}
def _metaworld_descriptions(task_name: str) -> dict[str, str]:
# MetaWorld tasks don't expose a separate NL description attribute;
# use a cleaned version of the task name as the description.
label = task_name.removeprefix("metaworld-").replace("-", " ").strip()
return {f"{task_name}_0": label}
def main() -> int:
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument("--env", required=True, help="Environment family (libero, metaworld, ...)")
parser.add_argument("--task", required=True, help="Task/suite name (e.g. libero_spatial)")
parser.add_argument("--output", required=True, help="Path to write task_descriptions.json")
args = parser.parse_args()
descriptions: dict[str, str] = {}
try:
if args.env == "libero":
descriptions = _libero_descriptions(args.task)
elif args.env == "metaworld":
descriptions = _metaworld_descriptions(args.task)
else:
print(
f"[extract_task_descriptions] No description extractor for env '{args.env}'.",
file=sys.stderr,
)
except Exception as exc:
print(f"[extract_task_descriptions] Warning: {exc}", file=sys.stderr)
out_path = Path(args.output)
out_path.parent.mkdir(parents=True, exist_ok=True)
out_path.write_text(json.dumps(descriptions, indent=2))
print(f"[extract_task_descriptions] {len(descriptions)} descriptions → {out_path}")
return 0
if __name__ == "__main__":
sys.exit(main())

View File

@@ -1,129 +0,0 @@
#!/usr/bin/env python3
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Parse lerobot-eval output into a small metrics.json artifact.
Reads eval_info.json written by lerobot-eval --output_dir and extracts the
key metrics needed by the health dashboard. Handles both single-task and
multi-task eval output formats.
Usage:
python scripts/ci/parse_eval_metrics.py \\
--artifacts-dir /tmp/libero-artifacts \\
--env libero \\
--task libero_spatial \\
--policy pepijn223/smolvla_libero
Writes <artifacts-dir>/metrics.json. The CI workflow then uploads this file
as a GitHub Actions artifact named "<env>-metrics".
"""
from __future__ import annotations
import argparse
import json
import math
import sys
from pathlib import Path
def _extract_metrics(info: dict) -> tuple[float | None, int | None, float | None, float | None]:
"""Extract (pc_success, n_episodes, avg_sum_reward, eval_s) from eval_info.json.
Handles two output shapes:
- Single-task: {"aggregated": {"pc_success": 80.0, ...}}
- Multi-task: {"overall": {"pc_success": 80.0, "n_episodes": 5, ...}}
"""
for key in ("aggregated", "overall"):
if key not in info:
continue
agg = info[key]
pc = agg.get("pc_success")
n = agg.get("n_episodes")
reward = agg.get("avg_sum_reward")
eval_s = agg.get("eval_s")
if pc is not None and not math.isnan(pc):
return (
float(pc),
int(n) if n is not None else None,
float(reward) if reward is not None else None,
float(eval_s) if eval_s is not None else None,
)
return None, None, None, None
def main() -> int:
parser = argparse.ArgumentParser(
description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter
)
parser.add_argument("--artifacts-dir", required=True, help="Path to the mounted artifacts volume")
parser.add_argument("--env", required=True, help="Environment name (e.g. libero)")
parser.add_argument("--task", required=True, help="Task name (e.g. libero_spatial)")
parser.add_argument("--policy", required=True, help="Policy hub path (e.g. pepijn223/smolvla_libero)")
args = parser.parse_args()
artifacts_dir = Path(args.artifacts_dir)
eval_info_path = artifacts_dir / "eval_info.json"
pc_success: float | None = None
n_episodes: int | None = None
avg_sum_reward: float | None = None
eval_s: float | None = None
if eval_info_path.exists():
try:
info = json.loads(eval_info_path.read_text())
pc_success, n_episodes, avg_sum_reward, eval_s = _extract_metrics(info)
except (json.JSONDecodeError, KeyError, TypeError) as exc:
print(f"[parse_eval_metrics] Warning: could not parse eval_info.json: {exc}", file=sys.stderr)
else:
print(
f"[parse_eval_metrics] Warning: {eval_info_path} not found — eval may have failed.",
file=sys.stderr,
)
task_descriptions: dict[str, str] = {}
task_desc_path = artifacts_dir / "task_descriptions.json"
if task_desc_path.exists():
try:
task_descriptions = json.loads(task_desc_path.read_text())
except json.JSONDecodeError as exc:
print(
f"[parse_eval_metrics] Warning: could not parse task_descriptions.json: {exc}",
file=sys.stderr,
)
metrics = {
"env": args.env,
"task": args.task,
"policy": args.policy,
"pc_success": pc_success,
"n_episodes": n_episodes,
"avg_sum_reward": avg_sum_reward,
"eval_s": eval_s,
"task_descriptions": task_descriptions,
}
out_path = artifacts_dir / "metrics.json"
out_path.write_text(json.dumps(metrics, indent=2))
print(f"[parse_eval_metrics] Written: {out_path}")
print(json.dumps(metrics, indent=2))
return 0
if __name__ == "__main__":
sys.exit(main())

View File

@@ -65,27 +65,20 @@ class WandBConfig:
class EvalConfig: class EvalConfig:
n_episodes: int = 50 n_episodes: int = 50
# `batch_size` specifies the number of environments to use in a gym.vector.VectorEnv. # `batch_size` specifies the number of environments to use in a gym.vector.VectorEnv.
# Set to 0 for auto-tuning based on available CPU cores and n_episodes. batch_size: int = 50
batch_size: int = 0
# `use_async_envs` specifies whether to use asynchronous environments (multiprocessing). # `use_async_envs` specifies whether to use asynchronous environments (multiprocessing).
# Defaults to True; automatically downgraded to SyncVectorEnv when batch_size=1. use_async_envs: bool = False
use_async_envs: bool = True
def __post_init__(self) -> None: def __post_init__(self) -> None:
if self.batch_size == 0:
self.batch_size = self._auto_batch_size()
if self.batch_size > self.n_episodes: if self.batch_size > self.n_episodes:
self.batch_size = self.n_episodes raise ValueError(
"The eval batch size is greater than the number of eval episodes "
def _auto_batch_size(self) -> int: f"({self.batch_size} > {self.n_episodes}). As a result, {self.batch_size} "
"""Pick batch_size based on CPU cores, capped by n_episodes.""" f"eval environments will be instantiated, but only {self.n_episodes} will be used. "
import math "This might significantly slow down evaluation. To fix this, you should update your command "
import os f"to increase the number of episodes to match the batch size (e.g. `eval.n_episodes={self.batch_size}`), "
f"or lower the batch size (e.g. `eval.batch_size={self.n_episodes}`)."
cpu_cores = os.cpu_count() or 4 )
# Each async env worker needs ~1 core; leave headroom for main process + inference.
by_cpu = max(1, math.floor(cpu_cores * 0.7))
return min(by_cpu, self.n_episodes, 64)
@dataclass @dataclass

View File

@@ -44,13 +44,6 @@ from lerobot.utils.constants import (
) )
def _make_vec_env_cls(use_async: bool, n_envs: int):
"""Return the right VectorEnv constructor."""
if use_async and n_envs > 1:
return gym.vector.AsyncVectorEnv
return gym.vector.SyncVectorEnv
@dataclass @dataclass
class EnvConfig(draccus.ChoiceRegistry, abc.ABC): class EnvConfig(draccus.ChoiceRegistry, abc.ABC):
task: str | None = None task: str | None = None
@@ -82,14 +75,13 @@ class EnvConfig(draccus.ChoiceRegistry, abc.ABC):
def create_envs( def create_envs(
self, self,
n_envs: int, n_envs: int,
use_async_envs: bool = True, use_async_envs: bool = False,
) -> dict[str, dict[int, gym.vector.VectorEnv]]: ) -> dict[str, dict[int, gym.vector.VectorEnv]]:
"""Create {suite: {task_id: VectorEnv}}. """Create {suite: {task_id: VectorEnv}}.
Default: single-task env via gym.make(). Multi-task benchmarks override. Default: single-task env via gym.make(). Multi-task benchmarks override.
AsyncVectorEnv is the default for n_envs > 1; auto-downgraded to Sync for n_envs=1.
""" """
env_cls = gym.vector.AsyncVectorEnv if (use_async_envs and n_envs > 1) else gym.vector.SyncVectorEnv env_cls = gym.vector.AsyncVectorEnv if use_async_envs else gym.vector.SyncVectorEnv
if self.gym_id not in gym_registry: if self.gym_id not in gym_registry:
print(f"gym id '{self.gym_id}' not found, attempting to import '{self.package_name}'...") print(f"gym id '{self.gym_id}' not found, attempting to import '{self.package_name}'...")
@@ -402,22 +394,17 @@ class LiberoEnv(EnvConfig):
@property @property
def gym_kwargs(self) -> dict: def gym_kwargs(self) -> dict:
kwargs: dict[str, Any] = { kwargs: dict[str, Any] = {"obs_type": self.obs_type, "render_mode": self.render_mode}
"obs_type": self.obs_type,
"render_mode": self.render_mode,
"observation_height": self.observation_height,
"observation_width": self.observation_width,
}
if self.task_ids is not None: if self.task_ids is not None:
kwargs["task_ids"] = self.task_ids kwargs["task_ids"] = self.task_ids
return kwargs return kwargs
def create_envs(self, n_envs: int, use_async_envs: bool = True): def create_envs(self, n_envs: int, use_async_envs: bool = False):
from lerobot.envs.libero import create_libero_envs from lerobot.envs.libero import create_libero_envs
if self.task is None: if self.task is None:
raise ValueError("LiberoEnv requires a task to be specified") raise ValueError("LiberoEnv requires a task to be specified")
env_cls = _make_vec_env_cls(use_async_envs, n_envs) env_cls = gym.vector.AsyncVectorEnv if use_async_envs else gym.vector.SyncVectorEnv
return create_libero_envs( return create_libero_envs(
task=self.task, task=self.task,
n_envs=n_envs, n_envs=n_envs,
@@ -481,12 +468,12 @@ class MetaworldEnv(EnvConfig):
"render_mode": self.render_mode, "render_mode": self.render_mode,
} }
def create_envs(self, n_envs: int, use_async_envs: bool = True): def create_envs(self, n_envs: int, use_async_envs: bool = False):
from lerobot.envs.metaworld import create_metaworld_envs from lerobot.envs.metaworld import create_metaworld_envs
if self.task is None: if self.task is None:
raise ValueError("MetaWorld requires a task to be specified") raise ValueError("MetaWorld requires a task to be specified")
env_cls = _make_vec_env_cls(use_async_envs, n_envs) env_cls = gym.vector.AsyncVectorEnv if use_async_envs else gym.vector.SyncVectorEnv
return create_metaworld_envs( return create_metaworld_envs(
task=self.task, task=self.task,
n_envs=n_envs, n_envs=n_envs,

View File

@@ -58,7 +58,7 @@ def make_env_pre_post_processors(
def make_env( def make_env(
cfg: EnvConfig | str, cfg: EnvConfig | str,
n_envs: int = 1, n_envs: int = 1,
use_async_envs: bool = True, use_async_envs: bool = False,
hub_cache_dir: str | None = None, hub_cache_dir: str | None = None,
trust_remote_code: bool = False, trust_remote_code: bool = False,
) -> dict[str, dict[int, gym.vector.VectorEnv]]: ) -> dict[str, dict[int, gym.vector.VectorEnv]]:

View File

@@ -29,7 +29,6 @@ from gymnasium import spaces
from libero.libero import benchmark, get_libero_path from libero.libero import benchmark, get_libero_path
from libero.libero.envs import OffScreenRenderEnv from libero.libero.envs import OffScreenRenderEnv
from lerobot.envs.utils import _LazyAsyncVectorEnv
from lerobot.types import RobotObservation from lerobot.types import RobotObservation
@@ -151,17 +150,7 @@ class LiberoEnv(gym.Env):
self.init_state_id = self.episode_index # tie each sub-env to a fixed init state self.init_state_id = self.episode_index # tie each sub-env to a fixed init state
# Extract task metadata without allocating GPU resources (safe before fork). self._env = self._make_envs_task(task_suite, self.task_id)
task = task_suite.get_task(task_id)
self.task = task.name
self.task_description = task.language
self._task_bddl_file = os.path.join(
get_libero_path("bddl_files"), task.problem_folder, task.bddl_file
)
self._env: OffScreenRenderEnv | None = (
None # deferred — created on first reset() inside the worker subprocess
)
default_steps = 500 default_steps = 500
self._max_episode_steps = ( self._max_episode_steps = (
TASK_SUITE_MAX_STEPS.get(task_suite_name, default_steps) TASK_SUITE_MAX_STEPS.get(task_suite_name, default_steps)
@@ -232,33 +221,29 @@ class LiberoEnv(gym.Env):
low=ACTION_LOW, high=ACTION_HIGH, shape=(ACTION_DIM,), dtype=np.float32 low=ACTION_LOW, high=ACTION_HIGH, shape=(ACTION_DIM,), dtype=np.float32
) )
def _ensure_env(self) -> None:
"""Create the underlying OffScreenRenderEnv on first use.
Called inside the worker subprocess after fork(), so each worker gets
its own clean EGL context rather than inheriting a stale one from the
parent process (which causes EGL_BAD_CONTEXT crashes with AsyncVectorEnv).
"""
if self._env is not None:
return
env = OffScreenRenderEnv(
bddl_file_name=self._task_bddl_file,
camera_heights=self.observation_height,
camera_widths=self.observation_width,
)
env.reset()
self._env = env
def render(self): def render(self):
self._ensure_env()
raw_obs = self._env.env._get_observations() raw_obs = self._env.env._get_observations()
pixels = self._format_raw_obs(raw_obs)["pixels"] pixels = self._format_raw_obs(raw_obs)["pixels"]
image = next(iter(pixels.values())) image = next(iter(pixels.values()))
image = image[::-1, ::-1] # flip both H and W for visualization image = image[::-1, ::-1] # flip both H and W for visualization
return image return image
def _make_envs_task(self, task_suite: Any, task_id: int = 0):
task = task_suite.get_task(task_id)
self.task = task.name
self.task_description = task.language
task_bddl_file = os.path.join(get_libero_path("bddl_files"), task.problem_folder, task.bddl_file)
env_args = {
"bddl_file_name": task_bddl_file,
"camera_heights": self.observation_height,
"camera_widths": self.observation_width,
}
env = OffScreenRenderEnv(**env_args)
env.reset()
return env
def _format_raw_obs(self, raw_obs: RobotObservation) -> RobotObservation: def _format_raw_obs(self, raw_obs: RobotObservation) -> RobotObservation:
assert self._env is not None, "_format_raw_obs called before _ensure_env()"
images = {} images = {}
for camera_name in self.camera_name: for camera_name in self.camera_name:
image = raw_obs[camera_name] image = raw_obs[camera_name]
@@ -310,7 +295,6 @@ class LiberoEnv(gym.Env):
) )
def reset(self, seed=None, **kwargs): def reset(self, seed=None, **kwargs):
self._ensure_env()
super().reset(seed=seed) super().reset(seed=seed)
self._env.seed(seed) self._env.seed(seed)
raw_obs = self._env.reset() raw_obs = self._env.reset()
@@ -337,8 +321,6 @@ class LiberoEnv(gym.Env):
return observation, info return observation, info
def step(self, action: np.ndarray) -> tuple[RobotObservation, float, bool, bool, dict[str, Any]]: def step(self, action: np.ndarray) -> tuple[RobotObservation, float, bool, bool, dict[str, Any]]:
self._ensure_env()
assert self._env is not None
if action.ndim != 1: if action.ndim != 1:
raise ValueError( raise ValueError(
f"Expected action to be 1-D (shape (action_dim,)), " f"Expected action to be 1-D (shape (action_dim,)), "
@@ -363,8 +345,7 @@ class LiberoEnv(gym.Env):
return observation, reward, terminated, truncated, info return observation, reward, terminated, truncated, info
def close(self): def close(self):
if self._env is not None: self._env.close()
self._env.close()
def _make_env_fns( def _make_env_fns(
@@ -447,8 +428,6 @@ def create_libero_envs(
if task_ids_filter is not None: if task_ids_filter is not None:
print(f"Restricting to task_ids={task_ids_filter}") print(f"Restricting to task_ids={task_ids_filter}")
is_async = env_cls is gym.vector.AsyncVectorEnv
out: dict[str, dict[int, Any]] = defaultdict(dict) out: dict[str, dict[int, Any]] = defaultdict(dict)
for suite_name in suite_names: for suite_name in suite_names:
suite = _get_suite(suite_name) suite = _get_suite(suite_name)
@@ -457,11 +436,6 @@ def create_libero_envs(
if not selected: if not selected:
raise ValueError(f"No tasks selected for suite '{suite_name}' (available: {total}).") raise ValueError(f"No tasks selected for suite '{suite_name}' (available: {total}).")
# All tasks in a suite share identical observation/action spaces.
# Probe once and reuse to avoid creating a temp env per task.
cached_obs_space: spaces.Space | None = None
cached_act_space: spaces.Space | None = None
for tid in selected: for tid in selected:
fns = _make_env_fns( fns = _make_env_fns(
suite=suite, suite=suite,
@@ -475,14 +449,8 @@ def create_libero_envs(
control_mode=control_mode, control_mode=control_mode,
camera_name_mapping=camera_name_mapping, camera_name_mapping=camera_name_mapping,
) )
if is_async: out[suite_name][tid] = env_cls(fns)
lazy = _LazyAsyncVectorEnv(fns, cached_obs_space, cached_act_space)
if cached_obs_space is None:
cached_obs_space = lazy.observation_space
cached_act_space = lazy.action_space
out[suite_name][tid] = lazy
else:
out[suite_name][tid] = env_cls(fns)
print(f"Built vec env | suite={suite_name} | task_id={tid} | n_envs={n_envs}") print(f"Built vec env | suite={suite_name} | task_id={tid} | n_envs={n_envs}")
# return plain dicts for predictability
return {suite: dict(task_map) for suite, task_map in out.items()} return {suite: dict(task_map) for suite, task_map in out.items()}

View File

@@ -25,7 +25,6 @@ import metaworld.policies as policies
import numpy as np import numpy as np
from gymnasium import spaces from gymnasium import spaces
from lerobot.envs.utils import _LazyAsyncVectorEnv
from lerobot.types import RobotObservation from lerobot.types import RobotObservation
# ---- Load configuration data from the external JSON file ---- # ---- Load configuration data from the external JSON file ----
@@ -98,9 +97,8 @@ class MetaworldEnv(gym.Env):
self.visualization_height = visualization_height self.visualization_height = visualization_height
self.camera_name = camera_name self.camera_name = camera_name
self._env_name = self.task # already stripped of "metaworld-" prefix above self._env = self._make_envs_task(self.task)
self._env = None # deferred — created on first reset() inside the worker subprocess self._max_episode_steps = self._env.max_path_length
self._max_episode_steps = 500 # MT1 environments always have max_path_length=500
self.task_description = TASK_DESCRIPTIONS[self.task] self.task_description = TASK_DESCRIPTIONS[self.task]
self.expert_policy = TASK_POLICY_MAPPING[self.task]() self.expert_policy = TASK_POLICY_MAPPING[self.task]()
@@ -138,24 +136,6 @@ class MetaworldEnv(gym.Env):
self.action_space = spaces.Box(low=-1, high=1, shape=(ACTION_DIM,), dtype=np.float32) self.action_space = spaces.Box(low=-1, high=1, shape=(ACTION_DIM,), dtype=np.float32)
def _ensure_env(self) -> None:
"""Create the underlying MetaWorld env on first use.
Called inside the worker subprocess after fork(), so each worker gets
its own clean rendering context rather than inheriting a stale one from
the parent process (which causes crashes with AsyncVectorEnv).
"""
if self._env is not None:
return
mt1 = metaworld.MT1(self._env_name, seed=42)
env = mt1.train_classes[self._env_name](render_mode="rgb_array", camera_name=self.camera_name)
env.set_task(mt1.train_tasks[0])
if self.camera_name == "corner2":
env.model.cam_pos[2] = [0.75, 0.075, 0.7]
env.reset()
env._freeze_rand_vec = False # otherwise no randomization
self._env = env
def render(self) -> np.ndarray: def render(self) -> np.ndarray:
""" """
Render the current environment frame. Render the current environment frame.
@@ -163,13 +143,26 @@ class MetaworldEnv(gym.Env):
Returns: Returns:
np.ndarray: The rendered RGB image from the environment. np.ndarray: The rendered RGB image from the environment.
""" """
self._ensure_env()
image = self._env.render() image = self._env.render()
if self.camera_name == "corner2": if self.camera_name == "corner2":
# Images from this camera are flipped — correct them # Images from this camera are flipped — correct them
image = np.flip(image, (0, 1)) image = np.flip(image, (0, 1))
return image return image
def _make_envs_task(self, env_name: str):
mt1 = metaworld.MT1(env_name, seed=42)
env = mt1.train_classes[env_name](render_mode="rgb_array", camera_name=self.camera_name)
env.set_task(mt1.train_tasks[0])
if self.camera_name == "corner2":
env.model.cam_pos[2] = [
0.75,
0.075,
0.7,
] # corner2 position, similar to https://arxiv.org/pdf/2206.14244
env.reset()
env._freeze_rand_vec = False # otherwise no randomization
return env
def _format_raw_obs(self, raw_obs: np.ndarray) -> RobotObservation: def _format_raw_obs(self, raw_obs: np.ndarray) -> RobotObservation:
image = None image = None
if self._env is not None: if self._env is not None:
@@ -216,7 +209,6 @@ class MetaworldEnv(gym.Env):
observation (RobotObservation): The initial formatted observation. observation (RobotObservation): The initial formatted observation.
info (Dict[str, Any]): Additional info about the reset state. info (Dict[str, Any]): Additional info about the reset state.
""" """
self._ensure_env()
super().reset(seed=seed) super().reset(seed=seed)
raw_obs, info = self._env.reset(seed=seed) raw_obs, info = self._env.reset(seed=seed)
@@ -240,7 +232,6 @@ class MetaworldEnv(gym.Env):
truncated (bool): Whether the episode was truncated due to a time limit. truncated (bool): Whether the episode was truncated due to a time limit.
info (Dict[str, Any]): Additional environment info. info (Dict[str, Any]): Additional environment info.
""" """
self._ensure_env()
if action.ndim != 1: if action.ndim != 1:
raise ValueError( raise ValueError(
f"Expected action to be 1-D (shape (action_dim,)), " f"Expected action to be 1-D (shape (action_dim,)), "
@@ -272,8 +263,7 @@ class MetaworldEnv(gym.Env):
return observation, reward, terminated, truncated, info return observation, reward, terminated, truncated, info
def close(self): def close(self):
if self._env is not None: self._env.close()
self._env.close()
# ---- Main API ---------------------------------------------------------------- # ---- Main API ----------------------------------------------------------------
@@ -307,9 +297,6 @@ def create_metaworld_envs(
print(f"Creating Meta-World envs | task_groups={task_groups} | n_envs(per task)={n_envs}") print(f"Creating Meta-World envs | task_groups={task_groups} | n_envs(per task)={n_envs}")
is_async = env_cls is gym.vector.AsyncVectorEnv
cached_obs_space = None
cached_act_space = None
out: dict[str, dict[int, Any]] = defaultdict(dict) out: dict[str, dict[int, Any]] = defaultdict(dict)
for group in task_groups: for group in task_groups:
@@ -322,14 +309,7 @@ def create_metaworld_envs(
# build n_envs factories # build n_envs factories
fns = [(lambda tn=task_name: MetaworldEnv(task=tn, **gym_kwargs)) for _ in range(n_envs)] fns = [(lambda tn=task_name: MetaworldEnv(task=tn, **gym_kwargs)) for _ in range(n_envs)]
if is_async: out[group][tid] = env_cls(fns)
lazy = _LazyAsyncVectorEnv(fns, cached_obs_space, cached_act_space)
if cached_obs_space is None:
cached_obs_space = lazy.observation_space
cached_act_space = lazy.action_space
out[group][tid] = lazy
else:
out[group][tid] = env_cls(fns)
# return a plain dict for consistency # return a plain dict for consistency
return {group: dict(task_map) for group, task_map in out.items()} return {group: dict(task_map) for group, task_map in out.items()}

View File

@@ -16,7 +16,7 @@
import importlib.util import importlib.util
import os import os
import warnings import warnings
from collections.abc import Callable, Mapping, Sequence from collections.abc import Mapping, Sequence
from functools import singledispatch from functools import singledispatch
from typing import Any from typing import Any
@@ -130,99 +130,56 @@ def env_to_policy_features(env_cfg: EnvConfig) -> dict[str, PolicyFeature]:
return policy_features return policy_features
def _sub_env_has_attr(env: gym.vector.VectorEnv, attr: str) -> bool: def are_all_envs_same_type(env: gym.vector.VectorEnv) -> bool:
try: first_type = type(env.envs[0]) # Get type of first env
env.get_attr(attr) return all(type(e) is first_type for e in env.envs) # Fast type check
return True
except (AttributeError, Exception):
return False
class _LazyAsyncVectorEnv:
"""Defers AsyncVectorEnv creation until first use.
Creating all tasks' AsyncVectorEnvs upfront spawns N_tasks × n_envs worker
processes, all of which allocate EGL/GPU resources immediately. Since tasks
are evaluated sequentially, only one task's workers need to be alive at a
time. This wrapper stores the factory functions and creates the real
AsyncVectorEnv on first reset()/step()/call(), keeping peak process count = n_envs.
"""
def __init__(
self,
env_fns: list[Callable],
observation_space=None,
action_space=None,
):
self._env_fns = env_fns
self._env: gym.vector.AsyncVectorEnv | None = None
self.num_envs = len(env_fns)
if observation_space is not None and action_space is not None:
self.observation_space = observation_space
self.action_space = action_space
else:
tmp = env_fns[0]()
self.observation_space = tmp.observation_space
self.action_space = tmp.action_space
tmp.close()
self.single_observation_space = self.observation_space
self.single_action_space = self.action_space
def _ensure(self) -> None:
if self._env is None:
self._env = gym.vector.AsyncVectorEnv(self._env_fns, context="forkserver", shared_memory=True)
def reset(self, **kwargs):
self._ensure()
return self._env.reset(**kwargs)
def step(self, actions):
self._ensure()
return self._env.step(actions)
def call(self, name, *args, **kwargs):
self._ensure()
return self._env.call(name, *args, **kwargs)
def get_attr(self, name):
self._ensure()
return self._env.get_attr(name)
def close(self) -> None:
if self._env is not None:
self._env.close()
self._env = None
def check_env_attributes_and_types(env: gym.vector.VectorEnv) -> None: def check_env_attributes_and_types(env: gym.vector.VectorEnv) -> None:
with warnings.catch_warnings(): with warnings.catch_warnings():
warnings.simplefilter("once", UserWarning) warnings.simplefilter("once", UserWarning) # Apply filter only in this function
if not (_sub_env_has_attr(env, "task_description") and _sub_env_has_attr(env, "task")): if not (hasattr(env.envs[0], "task_description") and hasattr(env.envs[0], "task")):
warnings.warn( warnings.warn(
"The environment does not have 'task_description' and 'task'. Some policies require these features.", "The environment does not have 'task_description' and 'task'. Some policies require these features.",
UserWarning, UserWarning,
stacklevel=2, stacklevel=2,
) )
if not are_all_envs_same_type(env):
warnings.warn(
"The environments have different types. Make sure you infer the right task from each environment. Empty task will be passed instead.",
UserWarning,
stacklevel=2,
)
def add_envs_task(env: gym.vector.VectorEnv, observation: RobotObservation) -> RobotObservation: def add_envs_task(env: gym.vector.VectorEnv, observation: RobotObservation) -> RobotObservation:
"""Adds task feature to the observation dict with respect to the first environment attribute.""" """Adds task feature to the observation dict with respect to the first environment attribute."""
if _sub_env_has_attr(env, "task_description"): if hasattr(env.envs[0], "task_description"):
task_result = list(env.call("task_description")) task_result = env.call("task_description")
if isinstance(task_result, tuple):
task_result = list(task_result)
if not isinstance(task_result, list):
raise TypeError(f"Expected task_description to return a list, got {type(task_result)}")
if not all(isinstance(item, str) for item in task_result): if not all(isinstance(item, str) for item in task_result):
raise TypeError("All items in task_description result must be strings") raise TypeError("All items in task_description result must be strings")
observation["task"] = task_result observation["task"] = task_result
elif _sub_env_has_attr(env, "task"): elif hasattr(env.envs[0], "task"):
task_result = list(env.call("task")) task_result = env.call("task")
if isinstance(task_result, tuple):
task_result = list(task_result)
if not isinstance(task_result, list):
raise TypeError(f"Expected task to return a list, got {type(task_result)}")
if not all(isinstance(item, str) for item in task_result): if not all(isinstance(item, str) for item in task_result):
raise TypeError("All items in task result must be strings") raise TypeError("All items in task result must be strings")
observation["task"] = task_result observation["task"] = task_result
else: else: # For envs without language instructions, e.g. aloha transfer cube and etc.
num_envs = observation[list(observation.keys())[0]].shape[0] num_envs = observation[list(observation.keys())[0]].shape[0]
observation["task"] = ["" for _ in range(num_envs)] observation["task"] = ["" for _ in range(num_envs)]
return observation return observation

View File

@@ -136,8 +136,8 @@ class TokenizerProcessorStep(ObservationProcessorStep):
# Standardize to a list of strings for the tokenizer # Standardize to a list of strings for the tokenizer
if isinstance(task, str): if isinstance(task, str):
return [task] return [task]
elif isinstance(task, (list, tuple)) and all(isinstance(t, str) for t in task): elif isinstance(task, list) and all(isinstance(t, str) for t in task):
return list(task) return task
return None return None

View File

@@ -73,6 +73,7 @@ from lerobot.configs import parser
from lerobot.configs.eval import EvalPipelineConfig from lerobot.configs.eval import EvalPipelineConfig
from lerobot.envs.factory import make_env, make_env_pre_post_processors from lerobot.envs.factory import make_env, make_env_pre_post_processors
from lerobot.envs.utils import ( from lerobot.envs.utils import (
add_envs_task,
check_env_attributes_and_types, check_env_attributes_and_types,
close_envs, close_envs,
preprocess_observation, preprocess_observation,
@@ -165,15 +166,9 @@ def rollout(
if return_observations: if return_observations:
all_observations.append(deepcopy(observation)) all_observations.append(deepcopy(observation))
# Infer "task" from sub-environments (prefer natural language description). # Infer "task" from attributes of environments.
# env.call() works with both SyncVectorEnv and AsyncVectorEnv. # TODO: works with SyncVectorEnv but not AsyncVectorEnv
try: observation = add_envs_task(env, observation)
observation["task"] = list(env.call("task_description"))
except Exception:
try:
observation["task"] = list(env.call("task"))
except Exception:
observation["task"] = [""] * env.num_envs
# Apply environment-specific preprocessing (e.g., LiberoProcessorStep for LIBERO) # Apply environment-specific preprocessing (e.g., LiberoProcessorStep for LIBERO)
observation = env_preprocessor(observation) observation = env_preprocessor(observation)
@@ -323,9 +318,8 @@ def eval_policy(
n_to_render_now = min(max_episodes_rendered - n_episodes_rendered, env.num_envs) n_to_render_now = min(max_episodes_rendered - n_episodes_rendered, env.num_envs)
if isinstance(env, gym.vector.SyncVectorEnv): if isinstance(env, gym.vector.SyncVectorEnv):
ep_frames.append(np.stack([env.envs[i].render() for i in range(n_to_render_now)])) # noqa: B023 ep_frames.append(np.stack([env.envs[i].render() for i in range(n_to_render_now)])) # noqa: B023
elif hasattr(env, "call"): elif isinstance(env, gym.vector.AsyncVectorEnv):
# Here we must render all frames and discard any we don't need. # Here we must render all frames and discard any we don't need.
# Covers AsyncVectorEnv and _LazyAsyncVectorEnv (which wraps one).
ep_frames.append(np.stack(env.call("render")[:n_to_render_now])) ep_frames.append(np.stack(env.call("render")[:n_to_render_now]))
if max_episodes_rendered > 0: if max_episodes_rendered > 0:
@@ -527,7 +521,7 @@ def eval_main(cfg: EvalPipelineConfig):
logging.info(colored("Output dir:", "yellow", attrs=["bold"]) + f" {cfg.output_dir}") logging.info(colored("Output dir:", "yellow", attrs=["bold"]) + f" {cfg.output_dir}")
logging.info(f"Making environment (batch_size={cfg.eval.batch_size}, async={cfg.eval.use_async_envs}).") logging.info("Making environment.")
envs = make_env( envs = make_env(
cfg.env, cfg.env,
n_envs=cfg.eval.batch_size, n_envs=cfg.eval.batch_size,
@@ -761,39 +755,23 @@ def eval_policy_all(
) )
if max_parallel_tasks <= 1: if max_parallel_tasks <= 1:
prefetch_thread: threading.Thread | None = None # sequential path (single accumulator path on the main thread)
for i, (task_group, task_id, env) in enumerate(tasks): # NOTE: keeping a single-threaded accumulator avoids concurrent list appends or locks
if prefetch_thread is not None: for task_group, task_id, env in tasks:
prefetch_thread.join() tg, tid, metrics = task_runner(task_group, task_id, env)
prefetch_thread = None _accumulate_to(tg, metrics)
per_task_infos.append({"task_group": tg, "task_id": tid, "metrics": metrics})
try:
tg, tid, metrics = task_runner(task_group, task_id, env)
_accumulate_to(tg, metrics)
per_task_infos.append({"task_group": tg, "task_id": tid, "metrics": metrics})
finally:
env.close()
# Prefetch next task's workers *after* closing current env to prevent
# GPU memory overlap between consecutive tasks.
if i + 1 < len(tasks):
next_env = tasks[i + 1][2]
if hasattr(next_env, "_ensure"):
prefetch_thread = threading.Thread(target=next_env._ensure, daemon=True)
prefetch_thread.start()
else: else:
# threaded path: submit all tasks, consume completions on main thread and accumulate there
with cf.ThreadPoolExecutor(max_workers=max_parallel_tasks) as executor: with cf.ThreadPoolExecutor(max_workers=max_parallel_tasks) as executor:
fut2meta = {} fut2meta = {}
for task_group, task_id, env in tasks: for task_group, task_id, env in tasks:
fut = executor.submit(task_runner, task_group, task_id, env) fut = executor.submit(task_runner, task_group, task_id, env)
fut2meta[fut] = (task_group, task_id, env) fut2meta[fut] = (task_group, task_id)
for fut in cf.as_completed(fut2meta): for fut in cf.as_completed(fut2meta):
tg, tid, env = fut2meta[fut] tg, tid, metrics = fut.result()
try: _accumulate_to(tg, metrics)
tg, tid, metrics = fut.result() per_task_infos.append({"task_group": tg, "task_id": tid, "metrics": metrics})
_accumulate_to(tg, metrics)
per_task_infos.append({"task_group": tg, "task_id": tid, "metrics": metrics})
finally:
env.close()
# compute aggregated metrics helper (robust to lists/scalars) # compute aggregated metrics helper (robust to lists/scalars)
def _agg_from_list(xs): def _agg_from_list(xs):

View File

@@ -90,7 +90,7 @@ def test_base_create_envs():
envs = _Env().create_envs(n_envs=2) envs = _Env().create_envs(n_envs=2)
assert "_dispatch_base_test" in envs assert "_dispatch_base_test" in envs
env = envs["_dispatch_base_test"][0] env = envs["_dispatch_base_test"][0]
assert isinstance(env, gym.vector.VectorEnv) assert isinstance(env, gym.vector.SyncVectorEnv)
assert env.num_envs == 2 assert env.num_envs == 2
env.close() env.close()
finally: finally:

View File

@@ -189,30 +189,6 @@ def test_list_of_strings_tokenization(mock_auto_tokenizer):
assert attention_mask.shape == (2, 8) assert attention_mask.shape == (2, 8)
@require_package("transformers")
@patch("lerobot.processor.tokenizer_processor.AutoTokenizer")
def test_tuple_of_strings_tokenization(mock_auto_tokenizer):
"""Test tokenization of a tuple of strings (returned by VectorEnv.call())."""
mock_tokenizer = MockTokenizer(vocab_size=100)
mock_auto_tokenizer.from_pretrained.return_value = mock_tokenizer
processor = TokenizerProcessorStep(tokenizer_name="test-tokenizer", max_length=8)
transition = create_transition(
observation={"state": torch.tensor([1.0, 2.0])},
action=torch.tensor([0.1, 0.2]),
complementary_data={"task": ("pick up cube", "place on table")},
)
result = processor(transition)
observation = result[TransitionKey.OBSERVATION]
tokens = observation[f"{OBS_LANGUAGE}.tokens"]
attention_mask = observation[f"{OBS_LANGUAGE}.attention_mask"]
assert tokens.shape == (2, 8)
assert attention_mask.shape == (2, 8)
@require_package("transformers") @require_package("transformers")
@patch("lerobot.processor.tokenizer_processor.AutoTokenizer") @patch("lerobot.processor.tokenizer_processor.AutoTokenizer")
def test_custom_keys(mock_auto_tokenizer): def test_custom_keys(mock_auto_tokenizer):