Commit Graph

1445 Commits

Author SHA1 Message Date
Pepijn
b86935c64b Merge branch 'main' into codex/model-profiling 2026-04-21 11:23:26 +02:00
Pepijn
a2f72e42f6 fix(profiling): convert uint8 images to float32 in deterministic forward
Mirror the uint8 → float32/255 conversion the train loop applies after
the dataloader (PR #3406). The reference batch in
`write_deterministic_forward_artifacts` skipped this step because it
calls `preprocessor(default_collate(...))` directly, which caused
SmolVLA and xVLA to crash with:

    NotImplementedError: "upsample_bilinear2d_out_frame" not implemented for 'Byte'

inside their `resize_with_pad` → `F.interpolate(..., mode="bilinear")`
path. Other policies dodged it because their image-prep casts first.

Made-with: Cursor
2026-04-20 23:33:24 +02:00
Pepijn
a515eadc96 refactor(profiling): consolidate into single module
Unify the profiling subsystem into one file per reviewer request.

Before (4 files):
  src/lerobot/utils/profiling_utils.py        399 LOC
  scripts/ci/run_model_profiling.py           337 LOC
  profiling/model_profiling_specs.json        181 LOC
  tests/scripts/test_model_profiling.py       423 LOC

After (2 files):
  src/lerobot/utils/model_profiling.py        758 LOC — TrainingProfiler +
                                                       CI orchestrator +
                                                       POLICY_SPECS (inline)
  tests/test_model_profiling.py               315 LOC

Net: -267 LOC and 4 files → 2. All functionality preserved: per-step
forward/backward/optimizer timings, torch profiler tables + chrome
traces, deterministic-forward fingerprint, HF Hub result upload, and
the same CLI surface.

Changes:
- Collapse `_StepTimingCollector` into inline attributes on
  `TrainingProfiler` (no separate class).
- Drop `ProfilingSpec` dataclass; specs are plain dicts.
- Inline the JSON matrix as a module-level `POLICY_SPECS` dict —
  one less file to keep in sync with the training args.
- CI workflow invokes `python -m lerobot.utils.model_profiling` in
  place of the standalone script.
- Tests import `lerobot.utils.model_profiling` directly instead of
  loading a script-by-path. Removed JSON schema tests that no
  longer apply.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 21:31:17 +02:00
Pepijn
a07f22e22c feat(envs): add LIBERO-plus robustness benchmark (#3313)
* feat(envs): add LIBERO-plus robustness benchmark integration

- LiberoPlusEnv config (subclass of LiberoEnv, same gym interface)
- Docker image installing LIBERO-plus fork via PYTHONPATH
- CI workflow: 1-episode smoke eval with pepijn223/smolvla_libero_plus
- pyproject.toml: libero_plus extra

* fix(libero): use suite's perturbation-aware init_states loader

LIBERO-plus's Benchmark class exposes a `get_task_init_states(i)` method that
strips perturbation suffixes (`_table_N`, `_tb_N`, `_view_`, `_language_`,
`_light_`, `_add_`, `_level`) and loads the underlying base `.pruned_init`
file — the on-disk name for a perturbation variant doesn't exist as a file,
only the base does. lerobot's loader was bypassing that logic and trying to
read the suffix-bearing filename directly, which failed for every non-zero
task id and killed the eval before any rollout video could be written.

Delegate to the suite's method when it exists; fall back to the path-based
loader for vanilla LIBERO (which does not provide the method).

Also drop the hf-libero install + init_files copy from the LIBERO-plus
Dockerfile — the LIBERO-plus clone already ships both `bddl_files/` and
`init_files/` for all five suites, so the copy was unnecessary and the
`cp -r` into an existing dir produced a confusing nested layout.

* fix(libero): resolve LIBERO-plus perturbation init_states path ourselves

Delegating to `task_suite.get_task_init_states(i)` works for path resolution
but LIBERO-plus's method calls `torch.load(path)` without `weights_only=False`,
which fails on PyTorch 2.6+ because the pickled init_states contains numpy
objects not in the default allowlist:

    _pickle.UnpicklingError: Weights only load failed.
    WeightsUnpickler error: Unsupported global:
      GLOBAL numpy.core.multiarray._reconstruct was not an allowed global.

Mirror LIBERO-plus's suffix-stripping logic (`_table_N`, `_tb_N`, `_view_`,
`_language_`, `_light_`, `_add_`, `_level`) in our own helper so we can pass
`weights_only=False` ourselves. Vanilla LIBERO task names don't contain any
of these patterns except for `_table_` when followed by the word `center`
(e.g. `pick_up_the_black_bowl_from_table_center_...`), and the regex
requires `_table_\\d+` so semantic uses are preserved.

* fix(libero-plus): download perturbation assets from Sylvest/LIBERO-plus

LIBERO-plus's bddl_base_domain.py resolves scene XMLs with
`os.path.join(DIR_PATH, "../assets")`, so the `assets` key in config.yaml
has no effect on scene lookup — MuJoCo always opens
`<clone>/libero/libero/assets/scenes/...`. With no such directory present,
every perturbation task fails on:

    FileNotFoundError: No such file or directory:
      .../libero-plus/libero/libero/assets/scenes/tabletop_table_Cobblestone01_GLOSS_6K.xml

These textures, views, and extra objects ship only in the 6.4 GB `assets.zip`
published at `Sylvest/LIBERO-plus` (the LIBERO-plus README explicitly says
to download and unzip it into the package dir). Fetch it via `hf_hub_download`,
unzip into `${LIBERO_PLUS_ROOT}/`, install `unzip`, and point config.yaml at
the extracted dir so everything stays consistent. The download lives in its
own Docker layer so subsequent rebuilds reuse the cached assets.

Drops the lerobot/libero-assets snapshot_download — that mirror only has
vanilla LIBERO textures and is ignored for scene loading anyway.

* fix(libero-plus): flatten deep path prefix from Sylvest/LIBERO-plus assets.zip

The 6.4 GB zip ships with every entry prefixed by
`inspire/hdd/project/embodied-multimodality/public/syfei/libero_new/release/dataset/LIBERO-plus-0/assets/...`
(the author's internal filesystem layout, not the layout the LIBERO-plus
README promises), so the previous `unzip -d ${LIBERO_PLUS_ROOT}/` created
`${LIBERO_PLUS_ROOT}/inspire/.../assets/` — robosuite still opened
`${LIBERO_PLUS_ROOT}/assets/scenes/tabletop_table_Cobblestone01_GLOSS_6K.xml`
and hit the same FileNotFoundError.

Extract to a scratch dir, then `mv` the nested `assets/` subtree to the
expected location. Verified the target file exists in the zip central
directory under that exact prefix.

* refactor(libero): inline init_states resolver behind single regex

Collapse the three-style suffix stripper (split/re.sub/in) into one
compiled regex, drop the (Path, bool) tuple return, and move the
`_add_`/`_level` reshape branch into the caller so each branch loads
its own file and returns directly. Net: -11 lines, one fewer helper.

* refactor(libero-plus): rebase docker image on huggingface/lerobot-gpu

Mirror the libero/metaworld/robomme pattern: start from the nightly GPU
image (apt deps, python, uv, venv, lerobot[all] already there) and only
layer on what LIBERO-plus uniquely needs — its wand/ImageMagick build
deps, the non-extra runtime pips (robosuite==1.4.1, bddl, …), the
PYTHONPATH-shadowed fork, and the 6.4 GB assets.zip.

Drops ~50 lines of duplicated base setup (CUDA FROM, apt python, uv
install, user creation, venv init) the nightly already provides.
123 → 73 lines.

Also:
- Add libero_plus to docs/source/_toctree.yml under Benchmarks so
  doc-builder's TOC integrity check stops failing.
- Repoint the docs dataset link from pepijn223/libero_plus_lerobot to
  the canonical lerobot/libero_plus.
- Revert the stray uv.lock churn (revision/marker diff that crept in
  from an unrelated resolve — unrelated to LIBERO-plus).

* fix(libero-plus): stop touching pyproject + uv.lock

The fast-tests job was rejecting the branch because pyproject.toml had a
[libero_plus] extra whose git dep wasn't represented in uv.lock.

The Docker image no longer needs the extra — it clones LIBERO-plus
directly and PYTHONPATH-shadows hf-libero. Drop [libero_plus] from
pyproject and restore pyproject.toml + uv.lock to exactly what's on
origin/main, so `uv sync --locked --extra test` is a no-op for this PR.

Also repoint the doc/CI/env comments that still mentioned the extra at
the Docker install path.

* fix(libero-plus): strip perturbation metadata from task descriptions

LIBERO-plus builds task.language by space-joining the perturbation-variant
filename, so every non-_language_ variant inherits a trailing blob like
"view 0 0 100 0 0 initstate 0 noise 45" or "add 16". That shows up in the
dashboard video labels and no longer matches the base instruction stored
in the training dataset.

Strip those tokens in extract_task_descriptions.py with an end-anchored
regex over the {view,initstate,noise,add,tb,table,light,level}(+digits)
vocabulary. The anchor preserves mid-sentence literal uses of those words
(e.g. "from table center and place it on the plate") — only the trailing
metadata chain is removed. _language_ variants carry real BDDL-sourced
text and are left untouched.

* ci: point benchmark eval checkpoints at the lerobot/ org mirrors

pepijn223/smolvla_* → lerobot/smolvla_* across every benchmark job in
this branch (libero, metaworld, and the per-branch benchmark). The
checkpoints were mirrored into the lerobot/ org and that's the canonical
location going forward.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: integrate PR #3313 review feedback

- docs: fix paper link to arxiv, add benchmark image, add suite descriptions,
  add LIBERO-plus replacement warning, restructure eval section to match
  LIBERO doc style, fix policy I/O section, remove false try/except claim
- docker: fix shell grouping for hf-libero uninstall, replace hardcoded
  asset path with dynamic find
- ci: add Docker Hub login step, add HF_USER_TOKEN guard on eval step
- envs: add is_libero_plus param to get_task_init_states so vanilla LIBERO
  always takes the simple path

* fix(docs): use correct LIBERO-plus teaser image URL

* ci(libero-plus): drop redundant hf auth login step

The standalone login step ran `hf auth login` in a throwaway
`docker run --rm` container, so no credentials persisted. Auth is
already performed inside the eval step's container. Removing the
redundant step per PR #3313 review feedback.

* fix(envs): preserve AsyncVectorEnv metadata/unwrapped in lazy eval envs

Port of #3416 onto this branch. Without these attributes eval crashes
when calling `env.unwrapped.metadata["render_fps"]` with async vector
envs. Adds `metadata` / `unwrapped` to `_LazyAsyncVectorEnv` and
caches the metadata alongside obs/action spaces in the LIBERO and
MetaWorld factories.

* ci: gate Docker Hub login on secret availability

Fork PRs cannot access `secrets.DOCKERHUB_LEROBOT_{USERNAME,PASSWORD}`,
which made every benchmark job fail at the login step before any of
the actual build/eval work could run. Gate the login on the env-var
expansion of the username so the step is skipped (not failed) when
secrets are absent. Mirrors the existing pattern in the VLABench job.

* Update .github/workflows/benchmark_tests.yml

Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>

* Update scripts/ci/extract_task_descriptions.py

Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>

* Update .github/workflows/benchmark_tests.yml

Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>

* Update docker/Dockerfile.benchmark.libero_plus

Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>

* Update .github/workflows/benchmark_tests.yml

Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>

* fix(libero-plus): address review feedback

* ci(libero-plus): fix YAML indentation in upload-artifact steps

The `uses:` key on two upload-artifact steps was at column 0 instead
of nested under the step, causing `pre-commit run check-yaml` to fail
with "expected <block end>, but found '<block mapping start>'".


Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
2026-04-20 21:07:21 +02:00
Pepijn
282c31cfef feat(envs): add RoboMME benchmark (#3311)
* feat(envs): add RoboMME benchmark integration

- RoboMME env wrapper with image/wrist_image/state observations
- Docker image with Vulkan, SAPIEN, mani-skill deps
- CI workflow: 1-episode smoke eval with pepijn223/smolvla_robomme
- preprocess_observation: handle image/wrist_image/state keys
- pyproject.toml: robomme extra

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor(docker): rebase RoboMME image on huggingface/lerobot-gpu

Mirror the libero/metaworld pattern: start from the nightly GPU image
(which already has apt deps, uv, venv, and lerobot[all] preinstalled)
and only layer on what RoboMME uniquely needs — the Vulkan libs
ManiSkill/SAPIEN requires, plus the robomme extra with the
gymnasium/numpy overrides.

Drops 48 lines of duplicated base setup (CUDA FROM, python install,
user creation, venv init, base apt deps) that the nightly image already
provides. Net: 102 → 54 lines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(robomme): drop prototype-branch note and move dataset to lerobot/robomme

- Remove the "Related work" block referencing the prototype branch
  feat/robomme-integration; the PR stands on its own.
- Point all dataset references at lerobot/robomme (docs, env module
  docstring, RoboMMEEnvConfig docstring) — this is the canonical HF
  location once the dataset is mirrored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(robomme): make docs build + fast tests green

1. Docs: add robomme to _toctree.yml under Benchmarks so doc-builder's
   TOC integrity check stops rejecting the new page.

2. Fast tests: robomme's mani-skill transitively pins numpy<2 which is
   unsatisfiable against the project's numpy>=2 base pin, so `uv sync`
   couldn't resolve a universal lockfile.

   Drop robomme as a pyproject extra entirely — it truly cannot coexist
   with the rest of the dep tree. The Dockerfile installs robomme
   directly from its git URL via `uv pip install --override`, which was
   already the runtime path. pyproject, docs, env docstrings, and the
   CI job comment all now point to the docker-only install.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test(robomme): realign unit tests with current env API

The tests were written against an earlier env layout and never updated when
the wrapper was refactored, so CI's fast-test job was failing with:

- KeyError: 'front_rgb' / 'wrist_rgb' — these were renamed to the
  lerobot-canonical 'image' / 'wrist_image' keys (matching the dataset
  columns and preprocess_observation's built-in fallbacks).
- AssertionError: 'robomme' not in result — create_robomme_envs now
  returns {task_name: {task_id: env}}, not {'robomme': {...}}, so
  comma-separated task lists work.
- ModuleNotFoundError: lerobot.envs.lazy_vec_env — LazyVectorEnv was
  removed; create_robomme_envs is straightforward synchronous now.

Rewrite the 7 failing cases against the current API, drop the three
LazyVectorEnv tests, and add a multi-task test so the new comma-separated
task parsing is covered. Stub install/teardown is moved into helpers
(`_install_robomme_stub` / `_uninstall_robomme_stub`) so individual tests
stop repeating six boilerplate lines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* ci: point benchmark eval checkpoints at the lerobot/ org mirrors

pepijn223/smolvla_* → lerobot/smolvla_* across every benchmark job in
this branch (libero, metaworld, and the per-branch benchmark). The
checkpoints were mirrored into the lerobot/ org and that's the canonical
location going forward.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: integrate PR #3311 review feedback

- envs: rename obs keys to pixels/image, pixels/wrist_image, agent_pos
- envs: add __post_init__ for dynamic action_dim in RoboMMEEnv config
- envs: remove special-case obs conversion in utils.py (no longer needed)
- ci: add Docker Hub login, HF_USER_TOKEN guard, --env.task_ids=[0]
- scripts: extract_task_descriptions supports multiple task_ids
- docs: title to # RoboMME, add image, restructure eval section
- tests: update all key assertions to match new obs naming

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(docs): use correct RoboMME teaser image URL

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* ci(robomme): smoke-eval 10 tasks instead of 5

Broader coverage on the RoboMME benchmark CI job: bump the smoke eval
from 5 tasks to 10 (one episode each), all drawn from ROBOMME_TASKS.

Tasks now run: PickXtimes, BinFill, StopCube, MoveCube, InsertPeg,
SwingXtimes, VideoUnmask, ButtonUnmask, PickHighlight, PatternLock.

Updated the parse_eval_metrics.py `--task` label from the single
`PickXtimes` stub to the full comma list so the metrics artifact
reflects what was actually run. `parse_eval_metrics.py` already reads
`overall` for multi-task runs, so no parser change is needed.

Made-with: Cursor

* fix(robomme): nest `pixels` as a dict so preprocess_observation picks it up

`_convert_obs` was returning flat keys (`pixels/image`,
`pixels/wrist_image`). `preprocess_observation()` in envs/utils.py
keys off the top-level `"pixels"` entry and, not finding it,
silently dropped every image from the batch. The policy then saw
zero image features and raised

    ValueError: All image features are missing from the batch.

Match the LIBERO layout: return
`{"pixels": {"image": ..., "wrist_image": ...}, "agent_pos": ...}`
and declare the same shape in `observation_space`.

Made-with: Cursor

* fix(robomme): align docs and tests with nested pixels obs layout

Addresses PR #3311 review feedback:
- Docs: correct observation keys to `pixels/image` / `pixels/wrist_image`
  (mapped to `observation.images.image` / `observation.images.wrist_image`)
  and drop the now-obsolete column-rename snippet.
- Tests: assert `result["pixels"]["image"]` instead of flat `pixels/image`,
  matching the nested layout required by `preprocess_observation()`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(envs): preserve AsyncVectorEnv metadata/unwrapped in lazy eval envs

Port of #3416 onto this branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: gate Docker Hub login on secret availability

Fork PRs cannot access `secrets.DOCKERHUB_LEROBOT_{USERNAME,PASSWORD}`,
which made every benchmark job fail at the login step. Gate the login
on the env-var expansion of the username so the step is skipped (not
failed) when secrets are absent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(robomme): address review feedback

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-20 20:21:27 +02:00
Pepijn
a147fa4439 feat(envs): add RoboCerebra long-horizon manipulation benchmark (#3314)
* feat(ci): add RoboCerebra benchmark eval job

- Docker image with robosuite/libero deps for RoboCerebra eval
- CI workflow: 1-episode eval with pepijn223/smolvla_robocerebra
- Reuses libero env with rename_map + empty_cameras=3

* docs(robocerebra): add benchmark page and toctree entry

Add a dedicated docs page for RoboCerebra that points at the canonical
dataset lerobot/robocerebra_unified and shows how to run eval + fine-tune
against it. Wire it into the Benchmarks section of the toctree so
doc-builder picks it up.

* ci: point benchmark eval checkpoints at the lerobot/ org mirrors

pepijn223/smolvla_* → lerobot/smolvla_* across every benchmark job in
this branch (libero, metaworld, and the per-branch benchmark). The
checkpoints were mirrored into the lerobot/ org and that's the canonical
location going forward.

* fix(robocerebra): drop alias extra + simplify docker image

Two problems rolled up:

1. `uv sync --locked --extra test` was failing because pyproject.toml added
   a `robocerebra = ["lerobot[libero]"]` alias extra but uv.lock wasn't
   regenerated. Drop the alias — nothing in CI actually needs the extra
   name (the Dockerfile just installs what it needs directly), so this
   restores pyproject.toml and uv.lock to byte-exact origin/main.

2. Rebase docker/Dockerfile.benchmark.robocerebra on
   huggingface/lerobot-gpu:latest (same pattern as libero/metaworld/…).
   The nightly image already ships lerobot[all] which includes [libero],
   so the RoboCerebra image is essentially identical to the LIBERO one:
   fetch libero-assets, write ~/.libero/config.yaml, overlay source.
   92 → 43 lines.

Also repoint the CI workflow comment that referenced the removed extra.

* ci: use dedicated lerobot/smolvla_robocerebra checkpoint for smoke eval

Replace the generic pepijn223/smolvla_libero placeholder with the
purpose-trained lerobot/smolvla_robocerebra model in the RoboCerebra
CI smoke test.

* fix(ci): align RoboCerebra eval with training pipeline

Fixes 5 mismatches that caused 0% success rate:
- env.type: robocerebra (unregistered) → libero
- resolution: 360x360 (default) → 256x256 (matches dataset)
- camera_name_mapping: map eye_in_hand → wrist_image (not image2)
- empty_cameras: 3 → 1 (matches training)
- add HF_USER_TOKEN guard on eval step

* fix(ci): set env.fps=20 and explicit obs_type for RoboCerebra eval

Match the dataset's 20 FPS (LiberoEnv defaults to 30) and make
obs_type=pixels_agent_pos explicit for safety against future changes.

* docs(robocerebra): align page with adding_benchmarks template

Rework docs/source/robocerebra.mdx to follow the standard benchmark
doc structure: intro + links + available tasks + installation + eval
+ recommended episodes + policy I/O + training + reproducing results.

- Point everything at lerobot/smolvla_robocerebra (the released
  checkpoint), not the personal pepijn223 mirror.
- Add the --env.fps=20 and --env.obs_type=pixels_agent_pos flags
  that CI actually uses, so copy-paste eval reproduces CI.
- Split the "Training" block out of the recipe section into its own
  section with the feature table.
- Add an explicit "Reproducing published results" section pointing
  at the CI smoke eval.

* fix: integrate PR #3314 review feedback

- ci(robocerebra): drop redundant hf auth login step (auth is
  already performed inside the eval step's container).
- ci(robocerebra): add Docker Hub login before the image build
  to pick up the authenticated rate limit.
- docs(robocerebra): align eval snippet with the CI command
  (observation size, camera_name_mapping, use_async_envs, device,
  empty_cameras=1).

* fix(envs): preserve AsyncVectorEnv metadata/unwrapped in lazy eval envs

Port of #3416 onto this branch.

* ci: gate Docker Hub login on secret availability

* Update .github/workflows/benchmark_tests.yml

Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>

* Update .github/workflows/benchmark_tests.yml

Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
2026-04-20 19:12:15 +02:00
Pepijn
0f1c9b0851 feat(envs): add RoboTwin 2.0 benchmark (#3315)
* feat(envs): add RoboTwin 2.0 benchmark integration

- RoboTwinEnvConfig with 4-camera setup (head/front/left_wrist/right_wrist)
- Docker image with SAPIEN, mplib, CuRobo, pytorch3d (Python 3.12)
- CI workflow: 1-episode smoke eval with pepijn223/smolvla_robotwin
- RoboTwinProcessorStep for state float32 casting
- Camera rename_map: head_camera/front_camera/left_wrist -> camera1/2/3

* fix(robotwin): re-enable autograd for CuRobo planner warmup and take_action

lerobot_eval wraps the full rollout in torch.no_grad() (lerobot_eval.py:566),
but RoboTwin's setup_demo → load_robot → CuroboPlanner(...) runs
motion_gen.warmup(), which invokes Newton's-method trajectory optimization.
That optimizer calls cost.backward() internally, which raises

    RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

when autograd is disabled. take_action() hits the same planner path at every
step. Wrap both setup_demo and take_action in torch.enable_grad() so CuRobo's
optimizer can build its computation graph. Policy inference is unaffected —
rollout()'s inner torch.inference_mode() block around select_action() is
untouched, so we still don't allocate grad buffers during policy forward.

* fix(robotwin): read nested get_obs() output and use aloha-agilex camera names

RoboTwin's base_task.get_obs() returns a nested dict:

    {"observation": {cam: {"rgb": ..., "intrinsic_matrix": ...}},
     "joint_action": {"left_arm": ..., "left_gripper": ...,
                      "right_arm": ..., "right_gripper": ...,
                      "vector": np.ndarray},
     "endpose": {...}}

Our _get_obs was reading raw["{cam}_rgb"] / raw["{cam}"] and raw["joint_action"]
as if they were flat, so np.asarray(raw["joint_action"], dtype=float64) tripped
on a dict and raised

    TypeError: float() argument must be a string or a real number, not 'dict'

Fix:
- Pull images from raw["observation"][cam]["rgb"]
- Pull joint state from raw["joint_action"]["vector"] (the flat array)
- Update the default camera tuple to (head_camera, left_camera, right_camera)
  to match RoboTwin's actual wrist-camera names (envs/camera/camera.py:135-151)

* refactor(robotwin): drop defensive dict guards, cache black fallback frame

_get_obs was guarding every dict access with isinstance(..., dict) in case
RoboTwin's get_obs returned something else — but the API contract
(envs/_base_task.py:437) always returns a dict, so the guards were silently
masking real failures behind plausible-looking zero observations. Drop them.

Also:
- Cache a single black fallback frame in __init__ instead of allocating
  a fresh np.zeros((H, W, 3), uint8) for every missing camera on every
  step — the "camera not exposed" set is static per env.
- Only allocate the zero joint_state on the fallback path (not unconditionally
  before the real value overwrites it).
- Replace .flatten() with .ravel() (no copy when already 1-D).
- Fold the nested-dict schema comment and two identical torch.enable_grad()
  rationales into a single Autograd section in the class docstring.
- Fix stale `left_wrist` camera name in the observation docstring.

* fix(robotwin): align observation_space dims with D435 camera output

lerobot_eval crashed in gym.vector's SyncVectorEnv.reset with:

    ValueError: Output array is the wrong shape

because RoboTwinEnvConfig declared observation_space = (480, 640, 3) but
task_config/demo_clean.yml specifies head_camera_type=D435, which renders
(240, 320, 3). gym.vector.concatenate pre-allocates a buffer from the
declared space, so the first np.stack raises on shape mismatch.

Changes:
- Config defaults now 240×320 (the D435 dims in _camera_config.yml), with
  a comment pointing at the source of truth.
- RoboTwinEnv.__init__ accepts observation_height/width as Optional and
  falls back to setup_kwargs["head_camera_h/w"] so the env is self-consistent
  even if the config is not in sync.
- Config camera_names / features_map use the actual aloha-agilex camera
  names (head_camera, left_camera, right_camera). Drops the stale
  "front_camera" and "left_wrist"/"right_wrist" entries that never matched
  anything RoboTwin exposes.
- CI workflow's rename_map updated to match the new camera names.

* fix(robotwin): expose _max_episode_steps for lerobot_eval.rollout

rollout() does `env.call("_max_episode_steps")` (lerobot_eval.py:157) to
know when to stop stepping. LiberoEnv and MetaworldEnv set this attribute;
RoboTwinEnv was tracking the limit under `episode_length` only, so the call
raised AttributeError once CuRobo finished warming up.

* fix(robotwin): install av-dep so lerobot_eval can write rollout MP4s

write_video (utils/io_utils.py:53) lazily imports PyAV via require_package
and raises silently inside the video-writing thread when the extra is not
installed — so the eval itself succeeds with pc_success=100 but no MP4
ever lands in videos/, and the artifact upload reports "No files were
found". Add av-dep to the install line (same pattern as the RoboMME image).

* feat(robotwin): eval 5 diverse tasks per CI run with NL descriptions

Widen the smoke eval from a single task (beat_block_hammer) to five:
click_bell, handover_block, open_laptop, stack_blocks_two on top of the
original. Each gets its own rollout video in videos/<task>_0/ so the
dashboard can surface visually distinct behaviours.

extract_task_descriptions.py now has a RoboTwin branch that reads
`description/task_instruction/<task>.json` (already shipped in the clone
at /opt/robotwin) and pulls the `full_description` field. CI cds into
the clone before invoking the script so the relative path resolves.

parse_eval_metrics.py is invoked with the same 5-task list so the
metrics.json embeds one entry per task.

* ci: point benchmark eval checkpoints at the lerobot/ org mirrors

pepijn223/smolvla_* → lerobot/smolvla_* across every benchmark job in
this branch (libero, metaworld, and the per-branch benchmark). The
checkpoints were mirrored into the lerobot/ org and that's the canonical
location going forward.

* refactor(robotwin): rebase docker image on huggingface/lerobot-gpu

Mirror the libero/metaworld/libero_plus/robomme pattern: start from the
nightly GPU image (apt deps, python, uv, venv, lerobot[all] already
there) and layer on only what RoboTwin 2.0 uniquely needs —
cuda-nvcc + cuda-cudart-dev (CuRobo builds from source), Vulkan libs +
NVIDIA ICD (SAPIEN renderer), sapien/mplib/open3d/pytorch3d/curobo
installs, the mplib + sapien upstream patches, and the TianxingChen
asset download.

Drops ~90 lines of duplicated base setup (CUDA FROM, apt python, uv
install, user creation, venv init, base lerobot install). 199 → 110.

Also repoint the docs + env docstring dataset link from
hxma/RoboTwin-LeRobot-v3.0 to the canonical lerobot/robotwin_unified.

* docs(robotwin): add robotwin to _toctree.yml under Benchmarks

doc-builder's TOC integrity check was rejecting the branch because
docs/source/robotwin.mdx existed but wasn't listed in _toctree.yml.


* fix(robotwin): defer YAML lookup and realign tests with current API

__init__ was eagerly calling _load_robotwin_setup_kwargs just to read
head_camera_h/w from the YAML. That import (`from envs import CONFIGS_PATH`)
required a real RoboTwin install, so constructing the env — and thus every
test in tests/envs/test_robotwin.py — blew up with ModuleNotFoundError
on fast-tests where RoboTwin isn't installed.

Replace the eager lookup with DEFAULT_CAMERA_H/W constants (240×320, the
D435 dims baked into task_config/demo_clean.yml). reset() still resolves
the full setup_kwargs lazily — that's fine because reset() is only
called inside the benchmark Docker image where RoboTwin is present.

Also resync the test file with the current env API:
  - mock get_obs() as the real nested {"observation": {cam: {"rgb": …}},
    "joint_action": {"vector": …}} shape
  - patch both _load_robotwin_task and _load_robotwin_setup_kwargs
    (_patch_load → _patch_runtime)
  - drop `front_camera` / `left_wrist` from assertions — aloha-agilex
    exposes head_camera + left_camera + right_camera, not those
  - black-frame test now uses left_camera as the missing camera
  - setup_demo call check loosened to the caller-provided seed/is_test
    bits (full kwargs include the YAML-derived blob)

* fix: integrate PR #3315 review feedback

- ci: add Docker Hub login step, add HF_USER_TOKEN guard on eval step
- docker: tie patches to pinned versions with removal guidance, remove
  unnecessary HF_TOKEN for public dataset, fix hadolint warnings
- docs: fix paper link to arxiv, add teaser image, fix camera names
  (4→3 cameras), fix observation dims (480x640→240x320)


* fix(docs): correct RoboTwin 2.0 paper arxiv link


* fix(docs): use correct RoboTwin 2.0 teaser image URL


* fix(docs): use plain markdown image to fix MDX build

* ci(robotwin): smoke-eval 10 tasks instead of 5

Broader coverage on the RoboTwin 2.0 benchmark CI job: bump the smoke
eval from 5 tasks to 10 (one episode each). Added tasks are all drawn
from ROBOTWIN_TASKS and mirror the shape/complexity of the existing
set (simple single-object or single-fixture manipulations).

Tasks now run: beat_block_hammer, click_bell, handover_block,
open_laptop, stack_blocks_two, click_alarmclock, close_laptop,
close_microwave, open_microwave, place_block.

`parse_eval_metrics.py` reads `overall` for multi-task runs so no
parser change is needed. Bumped the step name and the metrics label
to reflect the 10-task layout.


* fix(ci): swap 4 broken RoboTwin tasks in smoke eval

The smoke eval hit two upstream issues:
- `open_laptop`: bug in OpenMOSS/RoboTwin main — `check_success()` uses
  `self.arm_tag`, but that attribute is only set inside `play_once()`
  (the scripted-expert path). During eval `take_action()` calls
  `check_success()` directly, hitting `AttributeError: 'open_laptop'
  object has no attribute 'arm_tag'`.
- `close_laptop`, `close_microwave`, `place_block`: not present in
  upstream RoboTwin `envs/` at all — our ROBOTWIN_TASKS tuple drifted
  from upstream and these names leaked into CI.

Replace the four broken tasks with upstream-confirmed equivalents
that exist both in ROBOTWIN_TASKS and in RoboTwin's `envs/`:
`adjust_bottle`, `lift_pot`, `stamp_seal`, `turn_switch`.

New 10-task smoke set: beat_block_hammer, click_bell, handover_block,
stack_blocks_two, click_alarmclock, open_microwave, adjust_bottle,
lift_pot, stamp_seal, turn_switch.


* fix(robotwin): sync ROBOTWIN_TASKS + doc with upstream (50 tasks)

The local ROBOTWIN_TASKS tuple drifted from upstream
RoboTwin-Platform/RoboTwin. Users passing names like `close_laptop`,
`close_microwave`, `dump_bin`, `place_block`, `pour_water`,
`fold_cloth`, etc. got past our validator (the names were in the
tuple) but then crashed inside robosuite with a confusing error,
because those tasks don't exist in upstream `envs/`.

- Replace ROBOTWIN_TASKS with a verbatim mirror of upstream's
  `envs/` directory: 50 tasks as of main (was 60 with many
  stale entries). Added a `gh api`-based one-liner comment so
  future bumps are mechanical.
- Update the `60 tasks` claims in robotwin.mdx and
  RoboTwinEnvConfig's docstring to `50`.
- Replace the stale example-task table in robotwin.mdx with ten
  upstream-confirmed examples, and flag `open_laptop` as
  temporarily broken (its `check_success()` uses `self.arm_tag`
  which is only set inside `play_once()`; eval-mode callers hit
  AttributeError).
- Rebuild the "Full benchmark" command with the actual 50-task
  list, omitting `open_laptop`.


* test(robotwin): lower task-count floor from 60 to 50

ROBOTWIN_TASKS was trimmed to 50 tasks (see comment in
`src/lerobot/envs/robotwin.py:48`), but the assertion still
required ≥60, causing CI failures. Align the test with the
current upstream task count.


* fix(envs): preserve AsyncVectorEnv metadata/unwrapped in lazy eval envs

Port of #3416 onto this branch.

* ci: gate Docker Hub login on secret availability


* fix: integrate PR #3315 review feedback

- envs(robotwin): default `observation_height/width` in
  `create_robotwin_envs` to `DEFAULT_CAMERA_H/W` (240/320) so they
  match the D435 dims baked into `task_config/demo_clean.yml`.
- envs(robotwin): resolve `task_config/demo_clean.yml` via
  `CONFIGS_PATH` instead of a cwd-relative path; works regardless
  of where `lerobot-eval` is invoked.
- envs(robotwin): replace `print()` calls in `create_robotwin_envs`
  with `logger.info(...)` (module-level `logger = logging.getLogger`).
- envs(robotwin): use `_LazyAsyncVectorEnv` for the async path so
  async workers start lazily (matches LIBERO / RoboCasa / VLABench).
- envs(robotwin): cast `agent_pos` space + joint-state output to
  float32 end-to-end (was mixed float64/float32).
- envs(configs): use the existing `_make_vec_env_cls(use_async,
  n_envs)` helper in `RoboTwinEnvConfig.create_envs`; drop the
  `get_env_processors` override so RoboTwin uses the identity
  processor inherited from `EnvConfig`.
- processor: delete `RoboTwinProcessorStep` — the float32 cast now
  happens in the wrapper itself, so the processor is redundant.
- tests: drop the `TestRoboTwinProcessorStep` suite; update the
  mock obs fixture to use float32 `joint_action.vector`.
- ci: hoist `ROBOTWIN_POLICY` and `ROBOTWIN_TASKS` to job-level
  env vars so the task list and policy aren't duplicated across
  eval / extract / parse steps.
- docker: pin RoboTwin + CuRobo upstream clones to commit SHAs
  (`RoboTwin@0aeea2d6`, `curobo@ca941586`) for reproducibility.
2026-04-20 17:46:39 +02:00
Pepijn
e699e52388 feat(envs): add RoboCasa365 benchmark integration (#3375)
* feat(envs): add RoboCasa365 benchmark integration

Add RoboCasa365 (arXiv:2603.04356) as a new simulation benchmark with
365 everyday kitchen manipulation tasks across 2,500 diverse environments.

New files:
- src/lerobot/envs/robocasa.py: gym.Env wrapper with deferred env creation,
  flat 12D action / 16D state vectors, 3-camera support
- docs/source/robocasa.mdx: user-facing documentation
- docker/Dockerfile.benchmark.robocasa: CI benchmark image

Modified files:
- src/lerobot/envs/configs.py: RoboCasaEnv config (--env.type=robocasa)
- pyproject.toml: robocasa optional dependency group
- docs/source/_toctree.yml: sidebar entry
- .github/workflows/benchmark_tests.yml: integration test job

Refs: https://arxiv.org/abs/2603.04356, https://robocasa.ai
Related: huggingface/lerobot#321

* fix(docker): use uv pip to install robocasa in benchmark image

The huggingface/lerobot-gpu base image uses `uv` with a venv at
/lerobot/.venv — `pip` is not on PATH, so `pip install` fails with
"pip: not found". Switch to `uv pip install` which installs into the
existing venv.

Also drop the @v1.0.0 tag pin from the robocasa git URL since the
upstream repo may not have that tag; use default branch instead.

* fix(robocasa): editable install + switch to lerobot/smolvla_robocasa

- pip install from git omits data files like box_links_assets.json
  (not declared in package_data). Clone and install editable so the
  source tree is used at runtime.
- Download only tex + fixtures_lw asset types (smoke test doesn't need
  objaverse/aigen objects). Pipe 'y' to auto-accept download prompt.
- Switch CI policy from pepijn223/smolvla_robocasa to lerobot/smolvla_robocasa.

* fix(docker): re-install lerobot editably after COPY

The nightly huggingface/lerobot-gpu image predates the RoboCasaEnv
registration — so `lerobot-eval --env.type=robocasa` fails at argparse
with "invalid choice" even after COPY . . overlays the new source.
Force an editable reinstall so the venv picks up the current configs.py.


* fix(ci): add rename_map for robocasa eval (image* -> camera*)

Policy lerobot/smolvla_robocasa expects observation.images.camera1/2/3,
but RoboCasaEnv produces observation.images.image/image2/image3.

* fix(robocasa): override RoboCasaGymEnv default split (test -> all)

RoboCasaGymEnv defaults split="test", but create_env only accepts
{None, "all", "pretrain", "target"}, so the out-of-the-box default
crashes with ValueError. Always pass "all" when split is None.


* fix(docker): also download objs_lw (lightwheel objects) for robocasa

Kitchen tasks (e.g. CloseFridge) reference lightwheel object meshes
like Stool022/model.xml. fixtures_lw alone isn't enough — we also
need objs_lw. Still skipping objaverse/aigen to keep image size down.

Made-with: Cursor

* feat(robocasa): raw camera names + benchmark-group task shortcuts

Align the LeRobot env with RoboCasa's native conventions so policies
trained on the upstream datasets don't need a --rename_map at eval
time, and expose the standard task groups as first-class --env.task
values.

- Preserve raw RoboCasa camera names (e.g. robot0_agentview_left)
  as observation.images.<name> end-to-end. Drops camera_name_mapping
  and DEFAULT_CAMERA_NAME_MAPPING; features/features_map are now
  built dynamically from the parsed camera list.
- Accept benchmark-group names as --env.task: atomic_seen,
  composite_seen, composite_unseen, pretrain50/100/200/300. Expanded
  lazily via robocasa.utils.dataset_registry and auto-sets the
  split ("target" | "pretrain").
- Update CI smoke-eval rename_map to map raw cam names to the
  camera1/2/3 keys expected by lerobot/smolvla_robocasa.


* docs(robocasa): single-task smolvla train+eval recipe on pepijn223/robocasa_CloseFridge

- Rewrite observation section to use raw RoboCasa camera keys
  (observation.images.robot0_agentview_{left,right},
  observation.images.robot0_eye_in_hand).
- Add a "Training on a single task" section with a full smolvla
  training command on pepijn223/robocasa_CloseFridge, plus matching
  single-task eval command.
- Document benchmark-group task shortcuts (atomic_seen, composite_seen,
  composite_unseen, pretrain50/100/200/300) as valid --env.task values.


* fix(robocasa): restrict obj_registries to lightwheel by default

CloseFridge (and most kitchen tasks) crashed at reset with
`ValueError: Probabilities contain NaN` coming out of
`sample_kitchen_object_helper`. RoboCasa's upstream default
`obj_registries=("objaverse", "lightwheel")` normalizes per-registry
candidate counts as probabilities; when a sampled category has zero
mjcf paths in every configured registry (because the objaverse asset
pack isn't on disk — ~30GB, skipped by our Docker build), the 0/0
divide yields NaNs and `rng.choice` raises.

- Add `obj_registries: list[str] = ["lightwheel"]` to `RoboCasaEnv`
  config; thread it through `create_robocasa_envs`, `_make_env_fns`,
  and the gym.Env wrapper to the underlying `RoboCasaGymEnv` (which
  forwards to `create_env` → `robosuite.make` → kitchen env).
- Default matches what `download_kitchen_assets --type objs_lw`
  actually ships, so the env works out of the box without a 30GB
  objaverse download.
- Document the override (`--env.obj_registries='[objaverse,lightwheel]'`)
  for users who have downloaded the full asset set.


* fix(docker): also download tex_generative for robocasa benchmark

RoboCasa's lightwheel kitchen fixtures embed references to
`generative_textures/wall/tex*.png` directly in their MuJoCo XML, so
`MjModel.from_xml_string` errors out at reset time with
"No such file or directory" even when the env is constructed with
`generative_textures=None`. The generative textures live under a
separate asset registry key (`tex_generative`) in
`download_kitchen_assets`, distinct from the base `tex` pack we were
already fetching.

- Add `tex_generative` to the download list so the fixture XMLs
  resolve.
- Document the remaining omissions (objaverse/aigen, ~30GB) and how
  the runtime side pairs this with obj_registries=["lightwheel"] to
  avoid sampling from categories whose assets aren't on disk.

* ci(robocasa): smoke-eval 10 atomic tasks instead of 1

Broader coverage in the benchmark CI job: evaluate SmolVLA on ten
fixture-centric atomic RoboCasa tasks (one episode each) instead of
just CloseFridge. The tasks are all drawn from TARGET_TASKS.atomic_seen
and selected to avoid object-manipulation categories that would require
the objaverse/aigen asset packs (we only ship objs_lw in the Docker
image, paired with obj_registries=["lightwheel"] on the runtime side).

Tasks: CloseFridge, OpenCabinet, OpenDrawer, TurnOnMicrowave,
TurnOffStove, CloseToasterOvenDoor, SlideDishwasherRack,
TurnOnSinkFaucet, NavigateKitchen, TurnOnElectricKettle.

`scripts/ci/parse_eval_metrics.py` already handles multi-task output
via the `overall` key, so no parser changes needed. Bumped the metrics
artifact's task label to `atomic_smoke_10` to reflect the grouping.

* fix(pyproject): drop unresolvable robocasa extra

robocasa's upstream setup.py hardcodes `lerobot==0.3.3` in
install_requires. Exposing it as the `lerobot[robocasa]` extra made
uv's dep resolver cycle: `lerobot[robocasa]` -> robocasa -> lerobot
(a different version) -> unsolvable. This broke every `uv sync` — even
invocations with an unrelated extra like `--extra test` — because uv
validates the whole lockfile graph.

- Remove the `robocasa` extra from pyproject.toml. Installation
  instructions in docs/source/robocasa.mdx now walk users through the
  manual `git clone` + `pip install --no-deps` flow, which matches
  what the Docker image already does and sidesteps the cyclic dep
  entirely.
- Dockerfile: `uv pip install -e ~/robocasa --no-deps` so the
  shadowed lerobot==0.3.3 never lands in the image; install
  robocasa's actual runtime deps (numpy, numba, scipy, mujoco,
  tianshou, etc.) explicitly.

* docs(robocasa): align page with adding_benchmarks template

Rework docs/source/robocasa.mdx to follow the standard benchmark doc
structure: intro + links + available tasks (with family breakdown and
first-class benchmark-group shortcuts) + installation + eval +
recommended episodes + policy I/O + training + reproducing results.

- Fix the paper link (was pointing at a non-existent arxiv ID).
- Surface lerobot/smolvla_robocasa and pepijn223/robocasa_CloseFridge
  in the top-of-page links so they're findable without reading the
  training section.
- Add an explicit "Object registries" subsection explaining the
  `--env.obj_registries=[objaverse,lightwheel]` override path.
- Add an explicit "Reproducing published results" section pointing
  at the CI smoke eval.

* fix: integrate PR #3375 review feedback

- envs(robocasa): hoist the duplicated `_parse_camera_names` helper
  out of `libero.py` and `robocasa.py` into `envs/utils.py` as the
  public `parse_camera_names`; call sites updated.
- envs(robocasa): give each factory a distinct `episode_index`
  (`0..n_envs-1`) and derive a per-worker seed series in `reset()`
  so n_envs workers don't all roll the same scene under a shared
  outer seed.
- envs(robocasa): drop the unused `**kwargs` on `_make_env`; declare
  `visualization_height` / `visualization_width` on both the wrapper
  and the `RoboCasaEnv` config + propagate via `gym_kwargs`.
- envs(robocasa): emit `info["final_info"]` on termination (matching
  MetaWorld) so downstream vector-env auto-reset keeps the terminal
  task/success flags.
- docs(robocasa): add `--rename_map` (robot0_agentview_left/
  eye_in_hand/agentview_right → camera1/2/3) plus CI-parity flags to
  all three eval snippets.
- docker(robocasa): pin robocasa + robosuite git SHAs and the pip
  dep versions (pygame, Pillow, opencv-python, pyyaml, pynput, tqdm,
  termcolor, imageio, h5py, lxml, hidapi, gymnasium) for
  reproducible benchmark images.
- ci(robocasa): update the workflow comment — there is no
  `lerobot[robocasa]` extra; robocasa/robosuite are installed
  manually because upstream's `lerobot==0.3.3` pin shadows ours.

* docs(robocasa): add benchmark banner image

* fix(envs): preserve AsyncVectorEnv metadata/unwrapped in lazy eval envs

Port of #3416 onto this branch. Also threads the cached metadata
through the RoboCasa factory so async eval on `--env.type=robocasa`
keeps the same improvement.


* fix: integrate PR #3375 review feedback (round 2)

- envs(robocasa): when the caller passes `seed=None` to `reset()`,
  fall back to `self.episode_index` for the inner env seed so each
  worker still samples a distinct trajectory instead of all workers
  inheriting the same global RNG state.
- envs(robocasa): replace the two module-level `print()` calls in
  `create_robocasa_envs` with `logger.info(...)` via a module-level
  `logger = logging.getLogger(__name__)`.
- ci(robocasa): run `scripts/ci/extract_task_descriptions.py` after
  the eval so `metrics.json` carries per-task natural-language
  labels, matching LIBERO / MetaWorld / VLABench jobs. Added a
  `_robocasa_descriptions()` extractor that splits CamelCase task
  names into word-level labels keyed by `<task>_0`.
2026-04-20 17:10:53 +02:00
Haoming Song
b2765b39b8 Cache lazy async env metadata for eval (#3416)
Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com>
2026-04-20 15:33:13 +02:00
Pepijn
777b808c70 ci: skip Docker Hub login step on fork PRs (#3417)
On fork PRs, `secrets.DOCKERHUB_LEROBOT_*` expand to empty strings,
which fails `docker/login-action@v3` with `Error: Username and
password required` before any of the actual build/eval work runs.

Gate the login step on the env-var expansion of the username so the
step is skipped (not failed) when secrets are absent. On the main
repo + maintainer-approved fork runs (`pull_request_review` path),
the secrets resolve normally, the step runs, and image pulls get
the authenticated Docker Hub rate limit.

Scope: only `benchmark_tests.yml`, the lone benchmark workflow that
triggers on `pull_request` from forks. `full_tests.yml` and
`latest_deps_tests.yml` run under `pull_request_review` / schedule /
workflow_dispatch, where secrets are already guaranteed.

Context: surfaced on #3416 where an external contributor's PR failed
at the login step before any test could run.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 15:14:35 +02:00
Pepijn
8d982614a6 Merge remote-tracking branch 'origin/main' into codex/model-profiling
# Conflicts:
#	src/lerobot/configs/train.py
2026-04-20 11:32:10 +02:00
Defalt
5c43fa1cce fix(policies): replace deprecated torch.cuda.amp.autocast with torch.amp.autocast (#3167)
Co-authored-by: Steven Palma <imstevenpmwork@ieee.org>
2026-04-19 16:25:08 +02:00
k1000dai
3f16d98a9b episods→episodes (#3410)
Fixing typo
2026-04-19 12:58:06 +02:00
whats2000
52f508c51c fix(dataset): cleanup_interrupted_episode wipes image temp dirs (#3405) 2026-04-19 12:04:24 +02:00
Steven Palma
a8b72d9615 feat(dataset): 2x faster dataloader via parallel decode, uint8 transport, and persistent workers (#3406)
* feat(dataset): 2xfaster dataloader

* fix(dataset): streaming return uint8 decode

* fix(tests): adjust normalization step comparison

* fix(dataset): with threadexecutor + False default

* chore(dataset): make it a config

* fix(test): account for uint8 in training path testing
2026-04-19 00:08:22 +02:00
Steven Palma
760220d532 chore(dependencies): update uv.lock (#3365)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2026-04-18 22:32:05 +02:00
Shu Jiuhe
a99943ca26 Improve loading performance in _absolute_to_relative_idx when remapping indices (#3279) 2026-04-18 19:28:50 +02:00
Cheng Yin
a9821af61b fix(record): pass rename_map to make_policy in lerobot-record (#3240)
* fix(record): pass rename_map to make_policy in lerobot-record

Fixes #3181. The rename_map from dataset config was used for preprocessor
construction but not passed to make_policy(), causing feature mismatch
errors when camera key names differ between dataset and model config.

make_policy() already accepts a rename_map parameter and uses it to skip
visual feature consistency validation when remapping is active, but
lerobot_record.py was not passing it through.

* style: fix ruff format for ternary expression

---------

Co-authored-by: Steven Palma <imstevenpmwork@ieee.org>
2026-04-17 16:40:08 +02:00
Pepijn
c8df80ae91 Merge remote-tracking branch 'origin/main' into codex/model-profiling 2026-04-17 12:27:11 +01:00
Steven Palma
d4a229444b fix(ci): not fail when skipped (#3399) 2026-04-17 12:02:38 +02:00
Pepijn
1ac8e96575 refactor(profiling): shrink lerobot_train.py diff via start()/finalize()
Replace the `with profiler or nullcontext():` wrap around the entire
training loop with explicit `profiler.start()` / `profiler.finalize()`
calls, and tighten `_section(...)` regions in `update_policy` to only
wrap the hot calls (forward / backward / optimizer.step).

This avoids ~120 lines of pure re-indentation noise while keeping the
exact same artifacts on disk and the same public behavior.

lerobot_train.py diff vs main: 267 -> 29 changed lines.

Made-with: Cursor
2026-04-17 10:59:43 +01:00
Steven Palma
098ebb4d72 feat(ci): send slack notification if latest dependecy test is broken (#3398) 2026-04-17 11:28:24 +02:00
Pepijn
a6dd28e8b4 fix(profiling): tolerate groot dep-install failure
groot's only policy-specific dependency is flash-attn, which has no
prebuilt wheel for torch 2.10 and requires nvcc to build from source.
The CI image is based on nvidia/cuda:12.4.1-base, which ships the
CUDA runtime but not the compiler toolkit, so the source build fails
with `/usr/local/cuda/bin/nvcc: No such file or directory`. The
repo's own pyproject.toml already carries a TODO acknowledging this:
gr00t needs bespoke flash-attn install steps.

Treat this as an environmental limitation rather than a regression:
dep-install failures for groot are logged via `::warning::` and skip
the policy without failing the job. Dep-install failures for any
other policy remain fatal, so real regressions still surface.

Made-with: Cursor
2026-04-16 21:15:14 +02:00
Pepijn
1842100402 feat(profiling): record forward/backward/optimizer timings
The dashboard expects per-phase timings (forward_s, backward_s,
optimizer_s) in step_timing_summary.json, but only total_update_s
and dataloading_s were collected — leaving every chart except
dataloading empty.

Add a lightweight TrainingProfiler.section(name) context manager
that times a region with torch.cuda.synchronize before and after
(so GPU work is captured, not just the kernel-launch latency) and
accumulates per-section samples into step_timing_summary.json.

Wrap forward, backward (incl. grad clip), and optimizer (incl.
zero_grad and scheduler.step) in update_policy with these sections.
When profiling is off (profiler=None) the wrappers become no-ops,
so training performance is unchanged outside CI.

Made-with: Cursor
2026-04-16 20:26:27 +02:00
Pepijn
00e9defb80 fix(profiling): build flash-attn without isolation for groot
groot depends on flash-attn, which fails to build in uv's default
isolated build env because it doesn't declare torch as a build-time
dependency. Torch is a core lerobot dep and is already present in
the target venv when groot is synced, so we can safely disable
build isolation just for flash-attn. The flag is a no-op for
policies that don't pull in flash-attn.

Made-with: Cursor
2026-04-16 20:21:58 +02:00
Pepijn
b81eef43c8 fix(profiling): wall_x OOM and xvla rename_map
- wall_x: switch to SGD optimizer + explicit scheduler overrides.
  The 4B-param model casts to bf16 internally, but AdamW's exp_avg/
  exp_avg_sq states blow past the 22 GB GPU. Same fix we applied to
  pi0/pi05/pi0_fast.
- xvla: fix rename_map. Dataset (libero_plus) exposes front/wrist
  image keys; the model expects image/image2. Previous map was
  direction-reversed and left the batch without any recognized
  image feature.

Made-with: Cursor
2026-04-16 19:49:12 +02:00
Pepijn
d483dd4c4b feat(profiling): profile groot, xvla, diffusion, wall_x on PRs
Add groot, xvla, diffusion and wall_x (wall-oss-flow) to the smoke
profiling filter and switch the runner to per-policy dependency
resolution. Each policy now gets its own `uv sync --extra <policy>`
pass followed by a profiling run, so heavy or conflicting extras
(flash-attn, peft, diffusers, etc.) can never block another policy's
profiling. A failure in one policy is logged and surfaces a non-zero
exit at the end instead of aborting the matrix.

Made-with: Cursor
2026-04-16 19:04:27 +02:00
Pepijn
a56423fa33 Merge branch 'main' into codex/model-profiling 2026-04-16 18:58:35 +02:00
Maxime Ellerbach
9bc2df80bb chore(docs): adding a jupyter notebook that gives you ready-to-paste commands (#3395)
* chore(docs): adding an example quickstart jupyter notebook that gives you ready-to-paste commands

* some fixes in the commands

* uv lock

* Adding notebook to all

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net>

* uv lock again

---------

Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-04-16 17:53:35 +02:00
Pepijn
da7da741f1 fix(profiling): use SGD for pi0/pi05/pi0_fast and free CUDA cache after deterministic forward
Adam optimizer states (exp_avg + exp_avg_sq) require ~16GB extra on top of
model params and gradients for 4B parameter models, exceeding the 22GB GPU.
SGD has zero optimizer state overhead and profiling only measures
forward/backward timing anyway.

Also adds torch.cuda.empty_cache() after deterministic forward to release
transient memory before the training loop starts.

Made-with: Cursor
2026-04-16 16:09:56 +02:00
Pepijn
b1e16783de refactor: extract profiling into self-contained TrainingProfiler class
Move all profiling orchestration out of lerobot_train.py and
TrainPipelineConfig into a TrainingProfiler class in profiling_utils.py.

- lerobot_train.py: ~74 lines of profiling code reduced to ~7 call sites
- TrainPipelineConfig: 10 profile_* fields reduced to 2 (mode + output_dir)
- update_policy: reverted to clean main-branch signature (no timing_collector)
- TrainingProfiler encapsulates torch profiler, timing collection,
  deterministic forward artifacts, and all output writing
- CI script (run_model_profiling.py) unchanged—it only passes the 2 kept fields

Made-with: Cursor
2026-04-16 16:00:49 +02:00
Pepijn
a4544ffea7 fix(profiling): use bf16 dtype and gradient checkpointing for pi0/pi05
Enable --policy.dtype=bfloat16 and --policy.gradient_checkpointing=true
for pi0, pi0_fast, and pi05 profiling specs. Combined with use_amp=true,
this brings the 4B-param VLA models well within the 22GB GPU budget.

Made-with: Cursor
2026-04-16 15:35:25 +02:00
Pepijn
dbe01b0444 fix(profiling): fix pi0 cuBLAS error and pi05 OOM on 22GB GPU
- Move cudnn_deterministic to per-spec train_args instead of hardcoding
  it for all models. cuBLAS deterministic mode triggers internal errors
  on Gemma-based models (pi0, pi05) during backward pass.
- Enable use_amp=true for pi0, pi0_fast, and pi05 to reduce memory
  footprint from fp32 (~16GB weights alone) to bf16, fitting within
  22GB GPU budget with room for activations and gradients.
- Small models (act, diffusion, multi_task_dit) still use deterministic
  mode for reproducible profiling results.

Made-with: Cursor
2026-04-16 15:34:17 +02:00
Pepijn
e16a95a78e refactor(profiling): remove cProfile, keep torch profiler only
Remove cProfile wrapping from the training loop and profiling utilities.
The torch profiler already captures fine-grained timing and operator
breakdowns; cProfile added redundant overhead without actionable
insight for GPU-bound models.

- Remove render_cprofile_summary, run_with_cprofile from profiling_utils
- Replace cProfile-wrapped calls in lerobot_train with direct calls
- Remove cprofile_summaries from artifact index in run_model_profiling
- Update tests to match

Made-with: Cursor
2026-04-16 15:34:17 +02:00
Pepijn
4137b5785d fix(profiling): align libero smoke specs with pretrained policies 2026-04-16 15:11:54 +02:00
Pepijn
8ece10e484 feat(ci): profile more models in pr smoke runs 2026-04-16 14:49:37 +02:00
Pepijn
ddeb216ab9 fix(ci): skip hub publish for pr profiling runs 2026-04-16 14:38:43 +02:00
Pepijn
d46d67f75d fix(profiling): forward GIT_REF + PR_NUMBER into Docker container
The previous commit moved these expressions from inline shell expansion
to job-level env: vars, but the profiling script runs inside a Docker
container. Job-level env vars are only visible in the runner, not inside
the container — they need explicit -e flags on the docker run command
(same pattern as HOST_GIT_COMMIT which was already forwarded).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 13:38:13 +02:00
Pepijn
b746cd3c61 fix(profiling): sort import + move expressions to env vars for zizmor
Pre-commit Quality gate flagged two issues:

1. ruff/isort: `from numbers import Real` must sort after
   `from collections.abc import Callable` (stdlib alphabetical order).

2. zizmor (high): `github.head_ref`, `github.ref_name`,
   `github.event.inputs.git_ref`, and `github.event.pull_request.head.sha`
   were expanded directly in `run:` shell blocks, which zizmor flags as
   attacker-controllable. Move all four into job-level `env:` vars
   (GIT_REF, PR_NUMBER, HOST_GIT_COMMIT) so the shell only sees env-var
   references — the same pattern the workflow already uses for
   PROFILE_MODE, POLICY_FILTER, etc.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 13:30:13 +02:00
Pepijn
6d1a5fca02 fix(profiling): keep ci green when hub publish is unauthorized 2026-04-16 13:07:30 +02:00
Pepijn
8d7099cd7d fix(profiling): publish preview runs via hf dataset prs 2026-04-16 12:50:57 +02:00
Pepijn
516f39685a fix(profiling): skip dataset creation on publish 2026-04-16 12:09:03 +02:00
Pepijn
b27e838376 fix(profiling): publish preview rows to existing dataset 2026-04-16 11:54:35 +02:00
Pepijn
40470648d1 feat(profiling): publish preview runs for dashboard debugging 2026-04-16 10:54:34 +02:00
Pepijn
25e5062b2c fix(profiling): read generic device timings from profiler 2026-04-16 10:29:01 +02:00
Pepijn
35e3b28da1 fix(profiling): normalize timing metrics before export 2026-04-16 10:11:14 +02:00
Pepijn
ed8a98dda6 fix(profiling): preserve policy mode for deterministic forward 2026-04-16 09:50:29 +02:00
Pepijn
9dc38d9993 fix(ci): isolate torch cache in profiling job 2026-04-16 09:32:16 +02:00
Pepijn
3922f81791 fix(ci): set HF_LEROBOT_HOME in profiling job 2026-04-15 23:35:27 +02:00
Pepijn
28e8483297 fix(ci): disable policy hub push in profiling runs 2026-04-15 23:02:28 +02:00