feat(eval): add per-episode timing logs to eval worker

Logs avg env step time, avg inference call time, and totals per episode to identify whether env or policy server is the bottleneck. Made-with: Cursor
feat(eval): respect n_action_steps in inference server
2026-06-01 03:11:29 +00:00 · 2026-03-25 07:30:49 +01:00 · 2026-03-25 06:21:02 +01:00 · 2026-03-24 22:38:03 +01:00 · 2026-03-24 22:14:27 +01:00 · 2026-03-24 20:28:58 +01:00
40 changed files with 5141 additions and 1030 deletions
--- a/.dockerignore
+++ b/.dockerignore
@@ -12,6 +12,11 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

+# Python virtual environments — never copy into Docker images
+.venv
+venv
+env/
+
 # Misc
 .git
 tmp
--- a/docker/Dockerfile.benchmark
+++ b/docker/Dockerfile.benchmark
@@ -0,0 +1,200 @@
+# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Benchmark evaluation container — one image per benchmark, built via BENCHMARK arg.
+#
+# Supported values for BENCHMARK:
+#   libero       — LIBERO suite (spatial / object / goal / 10 / 90)
+#   libero_plus  — LIBERO-plus extended benchmark (requires robosuite, bddl, robomimic)
+#   robomme      — RoboMME memory-augmented manipulation benchmark
+#   robocasa     — RoboCasa kitchen composite-task benchmark
+#
+# Build:
+#   docker build --build-arg BENCHMARK=libero -f docker/Dockerfile.benchmark \
+#                -t lerobot-benchmark-libero .
+#
+# Run (interactive):
+#   docker run --gpus all --rm -it lerobot-benchmark-libero
+# Run eval:
+#   docker run --gpus all --rm lerobot-benchmark-libero lerobot-eval --help
+
+ARG CUDA_VERSION=12.4.1
+ARG OS_VERSION=22.04
+FROM nvidia/cuda:${CUDA_VERSION}-base-ubuntu${OS_VERSION}
+
+ARG PYTHON_VERSION=3.12
+ARG BENCHMARK=libero
+
+ENV DEBIAN_FRONTEND=noninteractive \
+    MUJOCO_GL=egl \
+    PYOPENGL_PLATFORM=egl \
+    EGL_PLATFORM=device \
+    NVIDIA_DRIVER_CAPABILITIES=all \
+    NVIDIA_VISIBLE_DEVICES=all \
+    PATH=/lerobot/.venv/bin:$PATH \
+    CMAKE_POLICY_VERSION_MINIMUM=3.5 \
+    CUDA_VISIBLE_DEVICES=0 \
+    DEVICE=cuda \
+    BENCHMARK=${BENCHMARK}
+
+# ── Base system deps (shared across all benchmarks) ───────────────────────────
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    software-properties-common build-essential git curl \
+    libglib2.0-0 libgl1 libgl1-mesa-glx libgles2 \
+    libegl1 libegl1-mesa libegl1-mesa-dev \
+    libglew-dev libglfw3 libglfw3-dev libgl1-mesa-dri \
+    libglvnd-dev libosmesa6 libosmesa6-dev \
+    libvulkan1 mesa-vulkan-drivers \
+    libsm6 libxext6 libxrender-dev \
+    ffmpeg libusb-1.0-0-dev speech-dispatcher libgeos-dev portaudio19-dev \
+    cmake pkg-config ninja-build \
+    && add-apt-repository -y ppa:deadsnakes/ppa \
+    && apt-get update \
+    && apt-get install -y --no-install-recommends \
+       python${PYTHON_VERSION} \
+       python${PYTHON_VERSION}-venv \
+       python${PYTHON_VERSION}-dev \
+    && curl -LsSf https://astral.sh/uv/install.sh | sh \
+    && mv /root/.local/bin/uv /usr/local/bin/uv \
+    && useradd --create-home --shell /bin/bash user_lerobot \
+    && usermod -aG sudo user_lerobot \
+    && apt-get clean && rm -rf /var/lib/apt/lists/*
+
+# ── NVIDIA EGL + Vulkan vendor ICDs (lets GLVND find the GPU driver) ──────────
+RUN mkdir -p /usr/share/vulkan/icd.d /usr/share/glvnd/egl_vendor.d \
+    && printf '{"file_format_version":"1.0.0","ICD":{"library_path":"libGLX_nvidia.so.0","api_version":"1.2.155"}}\n' \
+       > /usr/share/vulkan/icd.d/nvidia_icd.json \
+    && printf '{"file_format_version":"1.0.0","ICD":{"library_path":"libEGL_nvidia.so.0"}}\n' \
+       > /usr/share/glvnd/egl_vendor.d/10_nvidia.json
+
+# ── Benchmark-specific system deps ────────────────────────────────────────────
+# libero_plus: the `wand` Python package requires ImageMagick headers.
+RUN case "${BENCHMARK}" in \
+    libero_plus) \
+        apt-get update && apt-get install -y --no-install-recommends \
+            libmagickwand-dev \
+        && apt-get clean && rm -rf /var/lib/apt/lists/* ;; \
+    esac
+
+WORKDIR /lerobot
+RUN chown -R user_lerobot:user_lerobot /lerobot
+
+USER user_lerobot
+
+ENV HOME=/home/user_lerobot \
+    HF_HOME=/home/user_lerobot/.cache/huggingface \
+    HF_LEROBOT_HOME=/home/user_lerobot/.cache/huggingface/lerobot \
+    TORCH_HOME=/home/user_lerobot/.cache/torch \
+    TRITON_CACHE_DIR=/home/user_lerobot/.cache/triton
+
+RUN uv venv --seed --python python${PYTHON_VERSION}
+
+# Copy only the dependency manifests first so Docker can cache this layer
+# independently of source-code changes.
+COPY --chown=user_lerobot:user_lerobot setup.py pyproject.toml README.md MANIFEST.in ./
+COPY --chown=user_lerobot:user_lerobot src/ src/
+
+ARG UNBOUND_DEPS=false
+RUN if [ "$UNBOUND_DEPS" = "true" ]; then \
+    sed -i 's/,[[:space:]]*<[0-9\.]*//g' pyproject.toml; \
+    echo "Dependencies unbound:" && cat pyproject.toml; \
+    fi
+
+# Install lerobot core + the selected benchmark extra.
+# LIBERO-plus needs a dedicated install path because the upstream package is
+# import-broken when installed via the extras chain alone.
+RUN case "${BENCHMARK}" in \
+    libero_plus) \
+        PATH=/usr/bin:/bin:/lerobot/.venv/bin:$PATH /lerobot/.venv/bin/python -m pip install --no-cache-dir \
+            "hf-libero>=0.1.3,<0.2.0" \
+            "hf-egl-probe>=1.0.1" \
+            "transformers>=5.3.0,<6.0.0" \
+            "scipy>=1.14.0,<2.0.0" \
+            "bddl>=1.0.1,<2.0.0" \
+            "future" \
+            "easydict>=1.9" \
+            "wand" \
+            "scikit-image>=0.20.0" \
+            "gym>=0.25.0,<0.27.0" \
+        && git clone --depth 1 https://github.com/sylvestf/LIBERO-plus.git /tmp/LIBERO-plus \
+        && PATH=/usr/bin:/bin:/lerobot/.venv/bin:$PATH /lerobot/.venv/bin/python -m pip install --no-cache-dir --no-deps /tmp/LIBERO-plus \
+        && /lerobot/.venv/bin/python -c "import pathlib, site; pathlib.Path(site.getsitepackages()[0], 'libero_plus_repo.pth').write_text('/tmp/LIBERO-plus\n')" \
+        && /lerobot/.venv/bin/python -m pip install --no-cache-dir . \
+        && /lerobot/.venv/bin/python -c "\
+import os, yaml, importlib.util; \
+root = os.path.dirname(importlib.util.find_spec('libero.libero').origin); \
+d = dict(benchmark_root=root, bddl_files=os.path.join(root,'bddl_files'), \
+init_states=os.path.join(root,'init_files'), datasets=os.path.join(root,'..','datasets'), \
+assets=os.path.join(root,'assets')); \
+cfg_dir = os.path.expanduser('~/.libero'); os.makedirs(cfg_dir, exist_ok=True); \
+yaml.dump(d, open(os.path.join(cfg_dir,'config.yaml'),'w')); print('libero config created')" \
+        && /lerobot/.venv/bin/python -c "from libero.libero import benchmark, get_libero_path; print('libero OK')" ;; \
+    libero) \
+        uv pip install --no-cache ".[libero]" \
+        && /lerobot/.venv/bin/python -c "\
+import os, yaml, importlib.util; \
+root = os.path.dirname(importlib.util.find_spec('libero.libero').origin); \
+d = dict(benchmark_root=root, bddl_files=os.path.join(root,'bddl_files'), \
+init_states=os.path.join(root,'init_files'), datasets=os.path.join(root,'..','datasets'), \
+assets=os.path.join(root,'assets')); \
+cfg_dir = os.path.expanduser('~/.libero'); os.makedirs(cfg_dir, exist_ok=True); \
+yaml.dump(d, open(os.path.join(cfg_dir,'config.yaml'),'w')); print('libero config created')" \
+        && /lerobot/.venv/bin/python -c "from libero.libero import benchmark, get_libero_path; print('libero OK')" ;; \
+    *) \
+        uv pip install --no-cache ".[${BENCHMARK}]" ;; \
+    esac
+
+# LIBERO-plus requires ~6 GB of scene/texture/object assets from HuggingFace.
+# Download at build time so containers don't need network access at runtime.
+USER root
+COPY <<'FETCH_ASSETS' /tmp/fetch_assets.py
+from huggingface_hub import hf_hub_download
+hf_hub_download("Sylvest/LIBERO-plus", "assets.zip",
+                repo_type="dataset", local_dir="/tmp/libero-plus-assets")
+FETCH_ASSETS
+COPY <<'VERIFY_ASSETS' /tmp/verify_assets.py
+from pathlib import Path
+from libero.libero import get_libero_path
+d = Path(get_libero_path("benchmark_root")) / "assets" / "scenes"
+assert d.is_dir(), f"assets missing at {d}"
+print("assets OK:", d)
+VERIFY_ASSETS
+RUN if [ "${BENCHMARK}" = "libero_plus" ]; then \
+    apt-get update && apt-get install -y --no-install-recommends unzip \
+    && apt-get clean && rm -rf /var/lib/apt/lists/* \
+    && /lerobot/.venv/bin/python /tmp/fetch_assets.py \
+    && unzip -q /tmp/libero-plus-assets/assets.zip -d /tmp/libero-plus-unzipped \
+    && ASSETS_DIR=$(/lerobot/.venv/bin/python -c "from libero.libero import get_libero_path; print(get_libero_path('benchmark_root'))") \
+    && SRC=$(find /tmp/libero-plus-unzipped -type d -name assets | head -1) \
+    && mv "$SRC" "$ASSETS_DIR/assets" \
+    && chown -R user_lerobot:user_lerobot "$ASSETS_DIR/assets" \
+    && rm -rf /tmp/libero-plus-assets /tmp/libero-plus-unzipped /tmp/fetch_assets.py \
+    && /lerobot/.venv/bin/python /tmp/verify_assets.py \
+    && rm /tmp/verify_assets.py; \
+    fi
+USER user_lerobot
+
+# Triton requires its ptxas binary to be executable (NVIDIA-specific).
+RUN if [ -f /lerobot/.venv/lib/python${PYTHON_VERSION}/site-packages/triton/backends/nvidia/bin/ptxas ]; then \
+    chmod +x /lerobot/.venv/lib/python${PYTHON_VERSION}/site-packages/triton/backends/nvidia/bin/ptxas; \
+    fi
+
+# Verify EGL probe is importable (runtime GPU check requires NVIDIA drivers at container start).
+RUN /lerobot/.venv/bin/python -c "import egl_probe; print('egl_probe OK')" \
+    2>/dev/null || echo 'NOTE: egl_probe not installed (non-libero build), skipping'
+
+# Copy full source (tests, examples, configs, etc.)
+COPY --chown=user_lerobot:user_lerobot . .
+
+CMD ["/bin/bash"]
--- a/docker/Dockerfile.eval-base
+++ b/docker/Dockerfile.eval-base
@@ -0,0 +1,78 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+ARG CUDA_VERSION=12.4.1
+ARG OS_VERSION=22.04
+FROM nvidia/cuda:${CUDA_VERSION}-base-ubuntu${OS_VERSION}
+
+ARG PYTHON_VERSION=3.12
+
+ENV DEBIAN_FRONTEND=noninteractive \
+    MUJOCO_GL=egl \
+    PYOPENGL_PLATFORM=egl \
+    EGL_PLATFORM=device \
+    NVIDIA_DRIVER_CAPABILITIES=all \
+    NVIDIA_VISIBLE_DEVICES=all \
+    PATH=/lerobot/.venv/bin:$PATH \
+    # cmake 4.x removed backward compat with cmake_minimum_required < 3.5.
+    # This env var re-enables it so packages like egl-probe can compile.
+    CMAKE_POLICY_VERSION_MINIMUM=3.5
+
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    software-properties-common build-essential git curl \
+    libglib2.0-0 libgl1 libgl1-mesa-glx libgles2 \
+    libegl1 libegl1-mesa libegl1-mesa-dev \
+    libglew-dev libglfw3 libglvnd-dev \
+    libosmesa6 libosmesa6-dev \
+    libvulkan1 mesa-vulkan-drivers \
+    libsm6 libxext6 libxrender-dev \
+    ffmpeg libusb-1.0-0-dev speech-dispatcher libgeos-dev portaudio19-dev \
+    cmake pkg-config ninja-build \
+    && add-apt-repository -y ppa:deadsnakes/ppa \
+    && apt-get update \
+    && apt-get install -y --no-install-recommends \
+       python${PYTHON_VERSION} \
+       python${PYTHON_VERSION}-venv \
+       python${PYTHON_VERSION}-dev \
+    && curl -LsSf https://astral.sh/uv/install.sh | sh \
+    && mv /root/.local/bin/uv /usr/local/bin/uv \
+    && useradd --create-home --shell /bin/bash user_lerobot \
+    && apt-get clean && rm -rf /var/lib/apt/lists/*
+
+# NVIDIA EGL + Vulkan vendor ICDs (lets GLVND find the GPU driver)
+RUN mkdir -p /usr/share/vulkan/icd.d /usr/share/glvnd/egl_vendor.d \
+    && printf '{"file_format_version":"1.0.0","ICD":{"library_path":"libGLX_nvidia.so.0","api_version":"1.2.155"}}\n' \
+       > /usr/share/vulkan/icd.d/nvidia_icd.json \
+    && printf '{"file_format_version":"1.0.0","ICD":{"library_path":"libEGL_nvidia.so.0"}}\n' \
+       > /usr/share/glvnd/egl_vendor.d/10_nvidia.json
+
+WORKDIR /lerobot
+RUN chown -R user_lerobot:user_lerobot /lerobot
+USER user_lerobot
+
+ENV HOME=/home/user_lerobot \
+    HF_HOME=/home/user_lerobot/.cache/huggingface \
+    HF_LEROBOT_HOME=/home/user_lerobot/.cache/huggingface/lerobot \
+    TORCH_HOME=/home/user_lerobot/.cache/torch \
+    TRITON_CACHE_DIR=/home/user_lerobot/.cache/triton
+
+RUN uv venv --seed --python python${PYTHON_VERSION}
+
+COPY --chown=user_lerobot:user_lerobot setup.py pyproject.toml README.md MANIFEST.in ./
+COPY --chown=user_lerobot:user_lerobot src/ src/
+RUN uv pip install --no-cache .
+
+COPY --chown=user_lerobot:user_lerobot . .
+
+CMD ["/bin/bash"]
--- a/docker/Dockerfile.eval-libero
+++ b/docker/Dockerfile.eval-libero
@@ -0,0 +1,20 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+FROM lerobot-eval-base:latest
+
+RUN uv pip install --no-cache ".[libero]" \
+    && python -c "import libero"
+
+CMD ["/bin/bash"]
--- a/docker/Dockerfile.eval-libero-plus
+++ b/docker/Dockerfile.eval-libero-plus
@@ -0,0 +1,47 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+FROM lerobot-eval-base:latest
+
+# Install libero_plus deps explicitly rather than via ".[libero_plus]" extras chain.
+# uv has a bug where it considers packages "already resolved" when coming through
+# a nested lerobot[libero] → lerobot[libero_plus] extras chain, silently skipping them.
+RUN uv pip install --no-cache \
+    "hf-libero>=0.1.3,<0.2.0" \
+    "hf-egl-probe>=1.0.1" \
+    "transformers>=5.3.0,<6.0.0" \
+    "scipy>=1.14.0,<2.0.0" \
+    "bddl>=1.0.1,<2.0.0" \
+    "future" \
+    "easydict>=1.9" \
+    "wand" \
+    "scikit-image>=0.20.0" \
+    "gym>=0.25.0,<0.27.0"
+
+# Clone LIBERO-plus; install with --no-deps (runtime deps declared above via hf-libero).
+# Add .pth so the libero module can locate its data files at runtime.
+RUN git clone --depth 1 https://github.com/sylvestf/LIBERO-plus.git /tmp/LIBERO-plus \
+    && uv pip install --no-cache --no-deps /tmp/LIBERO-plus \
+    && python -c "import pathlib, site; pathlib.Path(site.getsitepackages()[0], 'libero_plus_repo.pth').write_text('/tmp/LIBERO-plus\n')" \
+    && python -c "\
+import os, yaml, importlib.util; \
+root = os.path.dirname(importlib.util.find_spec('libero.libero').origin); \
+d = dict(benchmark_root=root, bddl_files=os.path.join(root,'bddl_files'), \
+init_states=os.path.join(root,'init_files'), datasets=os.path.join(root,'..','datasets'), \
+assets=os.path.join(root,'assets')); \
+cfg_dir = os.path.expanduser('~/.libero'); os.makedirs(cfg_dir, exist_ok=True); \
+yaml.dump(d, open(os.path.join(cfg_dir,'config.yaml'),'w')); print('libero config created')" \
+    && python -c "from libero.libero import benchmark, get_libero_path; print('libero OK')"
+
+CMD ["/bin/bash"]
--- a/docker/Dockerfile.eval-metaworld
+++ b/docker/Dockerfile.eval-metaworld
@@ -0,0 +1,20 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+FROM lerobot-eval-base:latest
+
+RUN uv pip install --no-cache ".[metaworld]" \
+    && python -c "import metaworld"
+
+CMD ["/bin/bash"]
--- a/docker/Dockerfile.eval-robocasa
+++ b/docker/Dockerfile.eval-robocasa
@@ -0,0 +1,40 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+FROM lerobot-eval-base:latest
+
+# robocasa README says to use master branch of ARISE-Initiative/robosuite.
+# Install it with deps (robosuite from master has modern dep declarations).
+RUN git clone --depth 1 https://github.com/ARISE-Initiative/robosuite.git /tmp/robosuite \
+    && uv pip install --no-cache /tmp/robosuite
+
+# Clone robocasa and install with --no-deps to skip its lerobot==0.3.3 pin.
+# Install robocasa's actual runtime deps explicitly instead.
+RUN git clone --depth 1 https://github.com/robocasa/robocasa.git /tmp/robocasa \
+    && uv pip install --no-cache --no-deps /tmp/robocasa \
+    && uv pip install --no-cache \
+        "scikit-image>=0.20.0" \
+        "numba>=0.61.0,<0.62.0" \
+        "mujoco==3.3.1" \
+        "h5py" \
+        "lxml" \
+        "tianshou==0.4.10" \
+        "easydict>=1.9"
+
+# robocasa/__init__.py asserts numpy.__version__ in ["2.2.5"] — pin it last
+# so no subsequent package can bump it away.
+RUN uv pip install --no-cache "numpy==2.2.5" \
+    && python -c "import robocasa"
+
+CMD ["/bin/bash"]
--- a/docker/Dockerfile.eval-robomme
+++ b/docker/Dockerfile.eval-robomme
@@ -0,0 +1,26 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+FROM lerobot-eval-base:latest
+
+# mani-skill==3.0.0b21 (robomme dep) pins gymnasium==0.29.1 and numpy<2.0.0,
+# conflicting with lerobot's gymnasium>=1.1.1 and numpy>=2.0.0.
+# Both overrides are safe at runtime:
+#   - gymnasium 0.29.x has the same 5-tuple step() API as 1.x (since gym 0.26)
+#   - numpy 1.26.4 is API-compatible with lerobot's actual usage (no 2.x-only APIs used)
+RUN printf 'gymnasium==0.29.1\nnumpy==1.26.4\n' > /tmp/robomme_override.txt \
+    && uv pip install --no-cache --override /tmp/robomme_override.txt ".[robomme]" \
+    && python -c "import robomme"
+
+CMD ["/bin/bash"]
--- a/docker/build_benchmark_images.sh
+++ b/docker/build_benchmark_images.sh
@@ -0,0 +1,120 @@
+#!/usr/bin/env bash
+# Build (and optionally push) all lerobot benchmark eval images.
+#
+# Usage:
+#   # Build locally only (for testing on this machine)
+#   bash docker/build_benchmark_images.sh
+#
+#   # Build and push to Docker Hub under your org
+#   bash docker/build_benchmark_images.sh --push --hub_org=pepijn223
+#
+#   # Force-rebuild base image (e.g. after Dockerfile.eval-base changes)
+#   bash docker/build_benchmark_images.sh --no-cache-base --push --hub_org=pepijn223
+#
+#   # Build only specific benchmarks
+#   bash docker/build_benchmark_images.sh --benchmarks="libero_plus robomme"
+#
+# After building, run eval with:
+#   lerobot-eval --eval.runtime=docker --eval.docker.pull=false \
+#     --eval.docker.image=<hub_org>/lerobot-benchmark-<benchmark>:latest ...
+#   OR (if run locally with the default tag):
+#   lerobot-eval --eval.runtime=docker --eval.docker.pull=false \
+#     --env.type=<benchmark> ...   # auto-resolves to lerobot-benchmark-<benchmark>
+
+set -euo pipefail
+
+PUSH=false
+HUB_ORG=""
+BENCHMARKS="libero libero_plus robomme robocasa metaworld"
+NO_CACHE_BASE=false
+PROGRESS="auto"
+
+for arg in "$@"; do
+    case "$arg" in
+        --push)            PUSH=true ;;
+        --hub_org=*)       HUB_ORG="${arg#*=}" ;;
+        --benchmarks=*)    BENCHMARKS="${arg#*=}" ;;
+        --no-cache-base)   NO_CACHE_BASE=true ;;
+        --plain)           PROGRESS="plain" ;;
+        *)                 echo "Unknown arg: $arg"; exit 1 ;;
+    esac
+done
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+REPO_ROOT="$(cd "${SCRIPT_DIR}/.." && pwd)"
+
+if [[ "$PUSH" == "true" && -z "$HUB_ORG" ]]; then
+    echo "ERROR: --push requires --hub_org=<your-dockerhub-org>"
+    exit 1
+fi
+
+ok()   { echo "[OK]   $*"; }
+fail() { echo "[FAIL] $*"; exit 1; }
+
+BASE_CACHE_FLAG=""
+if [[ "$NO_CACHE_BASE" == "true" ]]; then
+    BASE_CACHE_FLAG="--no-cache"
+fi
+
+echo "=== Building lerobot-eval-base ==="
+docker build \
+    ${BASE_CACHE_FLAG} \
+    --progress="${PROGRESS}" \
+    -f "${SCRIPT_DIR}/Dockerfile.eval-base" \
+    -t lerobot-eval-base:latest \
+    "${REPO_ROOT}" || fail "lerobot-eval-base build failed"
+ok "lerobot-eval-base"
+
+for BENCHMARK in $BENCHMARKS; do
+    LOCAL_TAG="lerobot-benchmark-${BENCHMARK}:latest"
+    DOCKERFILE="${SCRIPT_DIR}/Dockerfile.eval-${BENCHMARK//_/-}"
+
+    # Handle underscore → hyphen mapping for filename lookup
+    DOCKERFILE_HYPHEN="${SCRIPT_DIR}/Dockerfile.eval-${BENCHMARK//_/-}"
+    DOCKERFILE_UNDERSCORE="${SCRIPT_DIR}/Dockerfile.eval-${BENCHMARK}"
+    if [[ -f "$DOCKERFILE_HYPHEN" ]]; then
+        DOCKERFILE="$DOCKERFILE_HYPHEN"
+    elif [[ -f "$DOCKERFILE_UNDERSCORE" ]]; then
+        DOCKERFILE="$DOCKERFILE_UNDERSCORE"
+    else
+        fail "No Dockerfile found for benchmark '${BENCHMARK}' (tried ${DOCKERFILE_HYPHEN} and ${DOCKERFILE_UNDERSCORE})"
+    fi
+
+    echo ""
+    echo "=== Building ${LOCAL_TAG} from $(basename ${DOCKERFILE}) ==="
+    docker build \
+        --progress="${PROGRESS}" \
+        -f "${DOCKERFILE}" \
+        -t "${LOCAL_TAG}" \
+        "${REPO_ROOT}" || fail "${LOCAL_TAG} build failed"
+    ok "${LOCAL_TAG}"
+
+    if [[ "$PUSH" == "true" ]]; then
+        HUB_TAG="${HUB_ORG}/lerobot-benchmark-${BENCHMARK}:latest"
+        docker tag "${LOCAL_TAG}" "${HUB_TAG}"
+        docker push "${HUB_TAG}" || fail "push ${HUB_TAG} failed"
+        ok "Pushed ${HUB_TAG}"
+    fi
+done
+
+echo ""
+echo "=== Smoke-testing images ==="
+for BENCHMARK in $BENCHMARKS; do
+    LOCAL_TAG="lerobot-benchmark-${BENCHMARK}:latest"
+    echo "  Smoke test: ${LOCAL_TAG}"
+    docker run --rm -e BENCHMARK="${BENCHMARK}" \
+        "${LOCAL_TAG}" bash docker/smoke_test_benchmark.sh \
+        && ok "smoke test ${BENCHMARK}" \
+        || echo "[WARN] smoke test failed for ${BENCHMARK} (may need GPU)"
+done
+
+echo ""
+echo "All benchmark images built successfully."
+if [[ "$PUSH" == "true" ]]; then
+    echo "Pushed to Docker Hub under: ${HUB_ORG}/"
+    echo ""
+    echo "To use Hub images in eval, pass:"
+    for BENCHMARK in $BENCHMARKS; do
+        echo "  --eval.docker.image=${HUB_ORG}/lerobot-benchmark-${BENCHMARK}:latest"
+    done
+fi
--- a/docker/smoke_test_benchmark.sh
+++ b/docker/smoke_test_benchmark.sh
@@ -0,0 +1,115 @@
+#!/usr/bin/env bash
+# Smoke-test a benchmark container: verifies imports and CLI entry-points.
+#
+# Build and run for a specific benchmark:
+#   docker build --build-arg BENCHMARK=libero -f docker/Dockerfile.benchmark -t lerobot-benchmark-libero .
+#   docker run --gpus all --rm -e BENCHMARK=libero lerobot-benchmark-libero bash docker/smoke_test_benchmark.sh
+#
+# Test all benchmarks individually:
+#   for b in libero libero_plus robomme robocasa; do
+#     docker build --build-arg BENCHMARK=$b -f docker/Dockerfile.benchmark -t lerobot-benchmark-$b .
+#     docker run --gpus all --rm -e BENCHMARK=$b lerobot-benchmark-$b bash docker/smoke_test_benchmark.sh
+#   done
+
+set -euo pipefail
+
+BENCHMARK="${BENCHMARK:-libero}"
+PASS=0
+FAIL=0
+
+ok()   { echo "[PASS] $*"; PASS=$((PASS + 1)); }
+fail() { echo "[FAIL] $*"; FAIL=$((FAIL + 1)); }
+
+python_import() {
+    local module="$1"
+    if python -c "import ${module}" 2>/dev/null; then
+        ok "import ${module}"
+    else
+        fail "import ${module}"
+    fi
+}
+
+cli_help() {
+    local cmd="$1"
+    if "${cmd}" --help > /dev/null 2>&1; then
+        ok "${cmd} --help"
+    else
+        fail "${cmd} --help"
+    fi
+}
+
+echo "=== Smoke test: benchmark=${BENCHMARK} ==="
+
+# ── lerobot core ──────────────────────────────────────────────────────────────
+python_import "lerobot"
+python_import "lerobot.envs"
+python_import "lerobot.configs.eval"
+cli_help "lerobot-eval"
+
+# ── Benchmark-specific env import ─────────────────────────────────────────────
+case "${BENCHMARK}" in
+    libero)
+        python_import "lerobot.envs.libero"
+        python -c "
+from lerobot.envs.configs import LiberoEnv
+cfg = LiberoEnv(task='libero_spatial/KITCHEN_SCENE1_open_the_bottom_drawer_of_the_cabinet')
+print('  LiberoEnv config OK:', cfg.type)
+" && ok "LiberoEnv config instantiation" || fail "LiberoEnv config instantiation"
+        ;;
+
+    libero_plus)
+        python_import "lerobot.envs.libero"
+        python -c "
+from lerobot.envs.configs import LiberoPlusEnv
+cfg = LiberoPlusEnv()
+print('  LiberoPlusEnv config OK:', cfg.type)
+" && ok "LiberoPlusEnv config instantiation" || fail "LiberoPlusEnv config instantiation"
+        # Verify the LIBERO-plus package itself is importable
+        python_import "libero"
+        python_import "robosuite"
+        ;;
+
+    robomme)
+        python_import "lerobot.envs.robomme"
+        python -c "
+from lerobot.envs.robomme import ROBOMME_TASKS, RoboMMEGymEnv
+assert len(ROBOMME_TASKS) == 16, f'Expected 16 tasks, got {len(ROBOMME_TASKS)}'
+print('  ROBOMME_TASKS OK:', ROBOMME_TASKS[:3], '...')
+" && ok "RoboMME task list" || fail "RoboMME task list"
+        python -c "
+from lerobot.envs.configs import RoboMMEEnv
+cfg = RoboMMEEnv(task='PickXtimes')
+print('  RoboMMEEnv config OK:', cfg.type)
+" && ok "RoboMMEEnv config instantiation" || fail "RoboMMEEnv config instantiation"
+        python_import "robomme"
+        ;;
+
+    robocasa)
+        python_import "lerobot.envs.robocasa"
+        python -c "
+from lerobot.envs.robocasa import ACTION_DIM, STATE_DIM
+assert ACTION_DIM == 12, f'Expected ACTION_DIM=12, got {ACTION_DIM}'
+assert STATE_DIM == 16, f'Expected STATE_DIM=16, got {STATE_DIM}'
+print('  ACTION_DIM:', ACTION_DIM, '  STATE_DIM:', STATE_DIM)
+" && ok "RoboCasa constants" || fail "RoboCasa constants"
+        python -c "
+from lerobot.envs.configs import RoboCasaEnv
+cfg = RoboCasaEnv(task='PickPlaceCounterToCabinet')
+print('  RoboCasaEnv config OK:', cfg.type)
+" && ok "RoboCasaEnv config instantiation" || fail "RoboCasaEnv config instantiation"
+        python_import "robocasa"
+        python_import "robosuite"
+        ;;
+
+    *)
+        echo "Unknown BENCHMARK='${BENCHMARK}'. Valid values: libero, libero_plus, robomme, robocasa"
+        exit 1
+        ;;
+esac
+
+# ── Summary ───────────────────────────────────────────────────────────────────
+echo ""
+echo "=== Results: ${PASS} passed, ${FAIL} failed ==="
+if [ "${FAIL}" -gt 0 ]; then
+    exit 1
+fi
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -19,6 +19,8 @@
    title: Multi GPU training
  - local: peft_training
    title: Training with PEFT (e.g., LoRA)
+  - local: benchmark_training
+    title: Benchmark Training & Evaluation
  title: "Tutorials"
 - sections:
  - local: lerobot-dataset-v3
@@ -31,8 +33,6 @@
    title: Using Subtasks in the Dataset
  - local: streaming_video_encoding
    title: Streaming Video Encoding
-  - local: multi_dataset_training
-    title: Multi-Dataset Training
  title: "Datasets"
 - sections:
  - local: act
--- a/docs/source/benchmark_training.mdx
+++ b/docs/source/benchmark_training.mdx
@@ -0,0 +1,398 @@
+# Benchmark Training & Evaluation
+
+This guide explains how to train and evaluate policies on the simulation benchmarks
+integrated in LeRobot: **LIBERO**, **LIBERO-plus**, **MetaWorld**, **RoboCasa**, and **RoboMME**.
+
+The workflow is:
+
+1. Pick one or more benchmarks.
+2. For each benchmark, train a policy on its combined dataset (multi-GPU).
+3. Upload the trained policy to the Hugging Face Hub.
+4. Evaluate the policy on every task suite within that benchmark.
+
+## Prerequisites
+
+Install the benchmark-specific dependencies for the environments you want to evaluate on:
+
+```bash
+# LIBERO (original)
+pip install -e ".[libero]"
+
+# LIBERO-plus
+pip install -e ".[libero_plus]"
+
+# MetaWorld
+pip install -e ".[metaworld]"
+
+# RoboCasa
+pip install -e ".[robocasa]"
+
+# RoboMME
+pip install -e ".[robomme]"
+```
+
+`libero_plus` includes the same EGL probe dependencies as `libero` so headless
+renderer setup is consistent between both installs.
+
+If your environment has CMake build-isolation issues, use the same fallback as
+standard LIBERO installs:
+
+```bash
+PATH=/usr/bin:/bin:$PATH pip install --no-build-isolation -e ".[libero-plus]"
+```
+
+For multi-GPU training you also need [Accelerate](https://huggingface.co/docs/accelerate):
+
+```bash
+pip install accelerate
+```
+
+## Docker-isolated evaluation (EnvHub)
+
+LeRobot eval now supports running the full eval worker in a Docker container
+while keeping policy loading compatible with local checkpoints and local code changes.
+
+Use `lerobot-eval` with `--eval.runtime=docker`:
+
+```bash
+lerobot-eval \
+  --policy.path=outputs/train/my_policy/checkpoints/050000/pretrained_model \
+  --env.type=libero_plus \
+  --eval.runtime=docker \
+  --eval.docker.envhub_ref=envhub://lerobot/libero_plus@v1 \
+  --eval.n_episodes=10 \
+  --eval.batch_size=10
+```
+
+`eval.docker.envhub_ref` is optional. If omitted, LeRobot resolves a default
+image from `env.type`. You can also override the image directly:
+
+```bash
+--eval.docker.image=docker://ghcr.io/huggingface/lerobot-eval-libero-plus:latest
+```
+
+By default (`eval.docker.use_local_code=true`), the local repository is mounted
+in the container and added to `PYTHONPATH`, so edited policy/env code and local
+checkpoints continue to work without rebuilding the image for each change.
+
+Common Docker runtime options:
+
+```bash
+--eval.docker.pull=true \
+--eval.docker.gpus=all \
+--eval.docker.shm_size=8g \
+--eval.docker.use_local_code=true
+```
+
+The benchmark runner supports the same Docker eval path (extra args are
+forwarded to each generated `lerobot-eval` call):
+
+```bash
+lerobot-benchmark eval \
+    --benchmarks libero_plus,robocasa \
+    --hub-user $HF_USER \
+    --n-episodes 50 \
+    --eval.runtime=docker \
+    --eval.docker.pull=true
+```
+
+Build benchmark images locally:
+
+```bash
+make build-eval-images
+```
+
+## Fast single-machine eval tuning
+
+`lerobot-eval` now has two orthogonal throughput knobs:
+
+- `eval.batch_size`: number of sub-envs per task (inside one vector env).
+- `env.max_parallel_tasks`: number of tasks scheduled concurrently.
+- `eval.instance_count`: number of full eval instances (process-level sharding).
+
+Use them in this order:
+
+1. Increase `eval.batch_size` first for per-task throughput.
+2. Then increase `env.max_parallel_tasks` to overlap tasks, while monitoring RAM/VRAM.
+3. Optionally increase `eval.instance_count` for process-level parallelism (best with enough CPU/RAM and small models).
+
+The eval logs print the active scheduler mode (`sequential`, `threaded`, or `batched_lazy`) so you can verify the effective concurrency path.
+
+### Suggested starting points
+
+| Benchmark | Conservative | Faster (single GPU) | Notes |
+|---|---|---|---|
+| `libero` / `libero_plus` | `eval.batch_size=1`, `env.max_parallel_tasks=4` | `eval.batch_size=1`, `env.max_parallel_tasks=16` | For large suite sweeps, increase `max_parallel_tasks` before `batch_size` to avoid MuJoCo memory spikes. |
+| `metaworld` | `eval.batch_size=8`, `env.max_parallel_tasks=1` | `eval.batch_size=16`, `env.max_parallel_tasks=2` | Prefer larger per-task vectorization first. |
+| `robocasa` | `eval.batch_size=4`, `env.max_parallel_tasks=1` | `eval.batch_size=8`, `env.max_parallel_tasks=2` | Rendering/memory can dominate at high image resolution. |
+| `robomme` | `eval.batch_size=4`, `env.max_parallel_tasks=1` | `eval.batch_size=8`, `env.max_parallel_tasks=2` | Start small and scale gradually with task count. |
+
+### Local fast eval recipe
+
+```bash
+lerobot-eval \
+  --policy.path=$HF_USER/smolvla_libero_plus \
+  --env.type=libero_plus \
+  --eval.n_episodes=1 \
+  --eval.batch_size=1 \
+  --env.max_parallel_tasks=16 \
+  --eval.instance_count=2 \
+  --rename_map='{"observation.images.image":"observation.images.camera1","observation.images.image2":"observation.images.camera2"}' \
+  --output_dir=outputs/eval/smolvla_libero_plus \
+  --push_to_hub=true
+```
+
+### Docker fast eval recipe
+
+```bash
+lerobot-eval \
+  --policy.path=$HF_USER/smolvla_libero_plus \
+  --env.type=libero_plus \
+  --eval.runtime=docker \
+  --eval.docker.envhub_ref=envhub://lerobot/libero_plus@v1 \
+  --eval.docker.gpus=all \
+  --eval.docker.shm_size=16g \
+  --eval.n_episodes=1 \
+  --eval.batch_size=1 \
+  --env.max_parallel_tasks=16
+```
+
+## Quick start — single benchmark
+
+Train SmolVLA on LIBERO-plus with 4 GPUs for 50 000 steps:
+
+```bash
+lerobot-benchmark train \
+    --benchmarks libero_plus \
+    --policy-path lerobot/smolvla_base \
+    --hub-user $HF_USER \
+    --num-gpus 4 \
+    --steps 50000 \
+    --batch-size 32 \
+    --wandb
+```
+
+This trains on the combined LIBERO-plus dataset and pushes the checkpoint to
+`$HF_USER/smolvla_libero_plus` on the Hub.
+
+Then evaluate on **all four** LIBERO suites (spatial, object, goal, 10):
+
+```bash
+lerobot-benchmark eval \
+    --benchmarks libero_plus \
+    --hub-user $HF_USER \
+    --n-episodes 50
+```
+
+This automatically runs a separate `lerobot-eval` for each suite.
+
+## Full sweep — multiple benchmarks
+
+Run training **and** evaluation across all benchmarks:
+
+```bash
+lerobot-benchmark all \
+    --benchmarks libero,libero_plus,metaworld,robocasa,robomme \
+    --policy-path lerobot/smolvla_base \
+    --hub-user $HF_USER \
+    --num-gpus 4 \
+    --steps 50000 \
+    --batch-size 32 \
+    --wandb \
+    --push-eval-to-hub
+```
+
+For each benchmark the runner:
+1. Trains a policy on its dataset.
+2. Evaluates on every eval task in the benchmark (e.g. 4 suites for LIBERO).
+3. Pushes HF-native `.eval_results` rows (and optional artifacts) to the Hub.
+
+<Tip>
+
+Use `--dry-run` to print the exact `lerobot-train` / `lerobot-eval` commands without executing them, so you can inspect or modify them before running.
+
+</Tip>
+
+## Using the CLI directly (without the benchmark runner)
+
+You can also compose the commands yourself. The benchmark runner is a thin wrapper; here is what it does under the hood.
+
+### Training
+
+```bash
+accelerate launch \
+    --multi_gpu \
+    --num_processes=4 \
+    $(which lerobot-train) \
+    --policy.path=lerobot/smolvla_base \
+    --dataset.repo_id=$HF_USER/libero_plus \
+    --policy.repo_id=$HF_USER/smolvla_libero_plus \
+    --env.type=libero_plus \
+    --env.task=libero_spatial \
+    --steps=50000 \
+    --batch_size=32 \
+    --eval_freq=10000 \
+    --save_freq=10000 \
+    --output_dir=outputs/train/smolvla_libero_plus \
+    --job_name=smolvla_libero_plus \
+    --policy.push_to_hub=true \
+    --wandb.enable=true
+```
+
+### Evaluation (run once per suite)
+
+```bash
+for SUITE in libero_spatial libero_object libero_goal libero_10; do
+    lerobot-eval \
+        --policy.path=$HF_USER/smolvla_libero_plus \
+        --env.type=libero_plus \
+        --env.task=$SUITE \
+        --eval.n_episodes=50 \
+        --eval.batch_size=10 \
+        --output_dir=outputs/eval/smolvla_libero_plus/$SUITE \
+        --policy.device=cuda \
+        --push_to_hub=true \
+        --benchmark_dataset_id=lerobot/sim-benchmarks
+done
+```
+
+## Available benchmarks
+
+| Benchmark | Env type | Dataset | Eval tasks | Action dim |
+|---|---|---|---|---|
+| `libero` | `libero` | `{hub_user}/libero` | spatial, object, goal, 10 | 7 |
+| `libero_plus` | `libero_plus` | `{hub_user}/libero_plus` | spatial, object, goal, 10 | 7 |
+| `metaworld` | `metaworld` | `{hub_user}/metaworld` | push-v2 | 4 |
+| `robocasa` | `robocasa` | `{hub_user}/robocasa` | PickPlaceCounterToCabinet | 12 |
+| `robomme` | `robomme` | `{hub_user}/robomme` | PickXtimes | 8 |
+
+Run `lerobot-benchmark list` to see the full registry with all eval tasks.
+
+## Policy naming convention
+
+The benchmark runner stores trained policies under:
+
+```
+{hub_user}/{policy_name}_{benchmark}
+```
+
+The default `--policy-name` is `smolvla`. So training on `libero_plus` as user `alice` produces `alice/smolvla_libero_plus`.
+
+You can override this, e.g. `--policy-name pi05` if training π₀.₅ instead.
+
+## Multi-GPU considerations
+
+The effective batch size is `batch_size × num_gpus`. With `--batch-size=32` and
+`--num-gpus=4`, you train with an effective batch of 128 per step. LeRobot does **not**
+auto-scale the learning rate; see the [Multi-GPU Training guide](./multi_gpu_training) for
+details on when and how to adjust it.
+
+## Custom benchmarks
+
+To add a new benchmark, edit the `BENCHMARK_REGISTRY` in
+`src/lerobot/scripts/lerobot_benchmark.py`:
+
+```python
+from lerobot.scripts.lerobot_benchmark import BenchmarkEntry, BENCHMARK_REGISTRY
+
+BENCHMARK_REGISTRY["my_benchmark"] = BenchmarkEntry(
+    dataset_repo_id="{hub_user}/my_dataset",
+    env_type="my_env",
+    env_task="MyDefaultTask",
+    eval_tasks=["TaskA", "TaskB", "TaskC"],
+)
+```
+
+Then use `--benchmarks my_benchmark` as usual. The runner will train once and
+evaluate separately on TaskA, TaskB, and TaskC.
+
+## Outputs
+
+After training and evaluation, your outputs directory looks like:
+
+```
+outputs/
+├── train/
+│   ├── smolvla_libero/
+│   │   ├── checkpoints/
+│   │   └── ...
+│   ├── smolvla_libero_plus/
+│   ├── smolvla_robocasa/
+│   └── smolvla_robomme/
+└── eval/
+    ├── smolvla_libero/
+    │   ├── libero_spatial/
+    │   │   ├── eval_info.json
+    │   │   └── videos/
+    │   ├── libero_object/
+    │   ├── libero_goal/
+    │   └── libero_10/
+    ├── smolvla_libero_plus/
+    │   ├── libero_spatial/
+    │   ├── libero_object/
+    │   ├── libero_goal/
+    │   └── libero_10/
+    ├── smolvla_robocasa/
+    └── smolvla_robomme/
+```
+
+Each `eval_info.json` contains per-episode rewards, success rates, and aggregate metrics.
+
+## HF Eval Results + Leaderboard
+
+LeRobot publishes benchmark scores using Hugging Face's native
+`/.eval_results/*.yaml` format, which powers model-page eval cards and
+benchmark leaderboards.
+
+Add `--push-eval-to-hub` to push results after each eval run:
+
+```bash
+lerobot-benchmark eval \
+    --benchmarks libero_plus,robocasa \
+    --hub-user $HF_USER \
+    --benchmark-dataset-id lerobot/sim-benchmarks \
+    --push-eval-to-hub
+```
+
+This writes one or more files under `.eval_results/` in the model repo, for example:
+
+```yaml
+- dataset:
+    id: lerobot/sim-benchmarks
+    task_id: libero_plus/spatial
+  value: 82.4
+  notes: lerobot-eval
+```
+
+Notes:
+- `--benchmark-dataset-id` points to your consolidated benchmark dataset repo.
+- `task_id` values are derived from `env.type` and evaluated suite/task names.
+- Eval artifacts (`eval_info.json`, `eval_config.json`, videos) are still uploaded
+  for provenance, but leaderboard ranking comes from `.eval_results`.
+
+## Passing extra arguments
+
+Any arguments after the recognized flags are forwarded to `lerobot-train` or
+`lerobot-eval`.
+
+Example (training): use PEFT/LoRA during training.
+
+```bash
+lerobot-benchmark train \
+    --benchmarks libero_plus \
+    --policy-path lerobot/smolvla_base \
+    --hub-user $HF_USER \
+    --num-gpus 4 \
+    --steps 50000 \
+    --peft.method_type=LORA --peft.r=16
+```
+
+Example (evaluation): forward Docker runtime flags to each `lerobot-eval` call.
+
+```bash
+lerobot-benchmark eval \
+    --benchmarks libero_plus \
+    --hub-user $HF_USER \
+    --eval.runtime=docker \
+    --eval.docker.envhub_ref=envhub://lerobot/libero_plus@v1
+```
--- a/docs/source/multi_dataset_training.mdx
+++ b/docs/source/multi_dataset_training.mdx
@@ -1,232 +0,0 @@
-# Multi-Dataset Training
-
-This guide covers how to train a single policy on multiple heterogeneous datasets using `MultiLeRobotDataset`.
-
-## Overview
-
-Real-world robot learning datasets come from different environments, robots, and camera setups. A RoboCasa dataset might have three cameras named `robot0_agentview_left`, `robot0_agentview_right`, and `robot0_eye_in_hand`, while a LIBERO dataset uses `observation.images.front` and `observation.images.wrist`, and a RoboMME dataset uses bare `image` and `wrist_image` keys. State and action dimensions also differ.
-
-`MultiLeRobotDataset` lets you train on all of them jointly by:
-
- **Mapping** each dataset's feature keys into a shared namespace
- **Padding** features that a dataset doesn't have with zeros
- **Weighting** how often each dataset is sampled
- **Transforming** samples per-dataset (e.g. padding actions to a common dimension)
- **Aggregating** statistics across all sub-datasets for normalization
-
-## Configuration
-
-Multi-dataset training is configured via `MultiDatasetConfig` in a YAML config file. Instead of a single `dataset.repo_id`, you provide a `datasets` list where each entry is a `SubDatasetConfig`.
-
-### SubDatasetConfig fields
-
-| Field | Type | Default | Description |
-|-------|------|---------|-------------|
-| `repo_id` | `str` | required | HuggingFace repo ID or local dataset name |
-| `root` | `str \| None` | `None` | Local root directory for the dataset |
-| `episodes` | `list[int] \| None` | `None` | Subset of episode indices to use |
-| `revision` | `str \| None` | `None` | Dataset version / revision |
-| `video_backend` | `str` | auto | Video decoding backend (`pyav`, `torchcodec`, etc.) |
-| `weight` | `float` | `1.0` | Relative sampling weight for this dataset |
-| `feature_map` | `dict[str, str]` | `{}` | Maps dataset keys to unified policy keys |
-| `transforms` | `list` | `None` | Per-dataset transform steps (applied per sample) |
-
-### Example: Three-dataset config
-
-```yaml
-dataset:
-  type: multi
-  use_imagenet_stats: true
-  datasets:
-    # RoboCasa: 3 cameras, state(16), action(12)
-    - repo_id: pepijn223/robocasa_PrepareCoffee
-      root: /data/robocasa_PrepareCoffee
-      weight: 1.0
-      feature_map:
-        observation.images.robot0_agentview_left: observation.images.front_left
-        observation.images.robot0_agentview_right: observation.images.front_right
-        observation.images.robot0_eye_in_hand: observation.images.wrist
-
-    # LIBERO-plus: 2 cameras, state(8), action(7)
-    - repo_id: pepijn223/libero_plus_lerobot
-      root: /data/libero_plus_lerobot
-      weight: 0.5
-      feature_map:
-        observation.images.front: observation.images.front_left
-        observation.images.wrist: observation.images.wrist
-      transforms:
-        - type: pad_action
-          kwargs: {target_dim: 12}
-        - type: pad_state
-          kwargs: {target_dim: 16}
-
-    # RoboMME: 2 cameras (non-standard keys), state(8), action(8)
-    - repo_id: pepijn223/robomme_data_lerobot
-      root: /data/robomme_data_lerobot
-      weight: 0.3
-      feature_map:
-        image: observation.images.front_left
-        wrist_image: observation.images.wrist
-        state: observation.state
-        actions: action
-      transforms:
-        - type: pad_action
-          kwargs: {target_dim: 12}
-        - type: pad_state
-          kwargs: {target_dim: 16}
-```
-
-## Feature Mapping
-
-The `feature_map` dictionary renames dataset-local keys into a shared namespace. Keys not listed pass through unchanged. In the example above, all three datasets end up with the same camera key names (`observation.images.front_left`, `observation.images.wrist`) even though they use different conventions internally.
-
-After mapping, the **union** of all features across datasets defines the unified schema. If a feature exists in some datasets but not others, it is automatically zero-padded for datasets that lack it, and a boolean `{key}_is_pad` flag is added to the sample so the policy can optionally mask padded features.
-
-## Automatic Padding
-
-When a sub-dataset doesn't have a feature that exists in the unified schema:
-
- **Images/videos**: padded with a black frame (zeros) matching the expected resolution
- **Float tensors** (state, action): padded with zeros
- **Integer/bool tensors**: padded with zeros / False
-
-A companion `{key}_is_pad = True` tensor is added so the model can distinguish real data from padding.
-
-## Per-Dataset Transforms
-
-Each sub-dataset can have its own `transforms` pipeline that runs after feature renaming but before cross-dataset padding. This is useful for making shapes compatible before PyTorch's collate function stacks the batch.
-
-### Built-in transforms
-
-| Name | Description | Parameters |
-|------|-------------|------------|
-| `pad_action` | Zero-pad `action` to a target dimension | `target_dim: int` |
-| `pad_state` | Zero-pad `observation.state` to a target dimension | `target_dim: int` |
-| `resize_images` | Resize all `observation.images.*` tensors | `height: int`, `width: int` |
-
-### Custom transforms
-
-You can register your own transforms in `lerobot/datasets/transforms.py`:
-
-```python
-from lerobot.datasets.transforms import DatasetTransformStep, register_dataset_transform
-
-@register_dataset_transform("my_transform")
-class MyTransform(DatasetTransformStep):
-    def __init__(self, some_param: int):
-        self.some_param = some_param
-
-    def __call__(self, sample: dict) -> dict:
-        # Modify sample in-place or return a new dict
-        sample["action"] = sample["action"] * self.some_param
-        return sample
-```
-
-Then reference it in the config:
-
-```yaml
-transforms:
-  - type: my_transform
-    kwargs: {some_param: 2}
-```
-
-## Weighted Sampling
-
-The `weight` field on each sub-dataset controls how often it is sampled during training. Weights are relative and automatically normalized to probabilities. For example, with weights `[1.0, 0.5, 0.3]`, the first dataset is sampled roughly 56% of the time, the second 28%, and the third 16%.
-
-This uses `WeightedEpisodeAwareSampler`, which respects episode boundaries (so `drop_n_last_frames` and similar policy settings work correctly) while sampling across datasets proportionally.
-
-## Stats Aggregation
-
-Normalization statistics (mean, std, min, max, quantiles) are automatically aggregated across all sub-datasets using the mapped feature keys. The aggregation uses a weighted parallel variance algorithm so that datasets with more frames contribute proportionally to the global statistics.
-
-The aggregated stats are used by the standard LeRobot preprocessor for normalization during training.
-
-## Training
-
-Launch training the same way as single-dataset training. The factory and training script automatically detect `MultiDatasetConfig` and set up the weighted sampler:
-
-```bash
-python -m lerobot.scripts.lerobot_train \
-    --config_path path/to/multi_dataset_config.yaml
-```
-
-## Architecture
-
-The data flow during training with `MultiLeRobotDataset`:
-
-```
-┌─────────────────────────────────────────────────────────┐
-│  MultiLeRobotDataset.__getitem__(global_idx)            │
-│                                                         │
-│  1. Map global_idx → (dataset_idx, local_idx)           │
-│  2. Fetch sample from sub-dataset                       │
-│  3. Rename keys via feature_map                         │
-│  4. Apply per-dataset transforms (pad_action, etc.)     │
-│  5. Zero-pad missing features + add _is_pad flags       │
-│  6. Add dataset_index tag                               │
-└─────────────────────┬───────────────────────────────────┘
-                      │
-         ┌────────────▼────────────┐
-         │  PyTorch DataLoader     │
-         │  (collates into batch)  │
-         └────────────┬────────────┘
-                      │
-         ┌────────────▼────────────┐
-         │  LeRobot Preprocessor   │
-         │  (normalize, tokenize)  │
-         └────────────┬────────────┘
-                      │
-         ┌────────────▼────────────┐
-         │  Policy forward + loss  │
-         └─────────────────────────┘
-```
-
-## API Reference
-
-### `NewMultiLeRobotDataset`
-
-```python
-from lerobot.datasets.multi_dataset import NewMultiLeRobotDataset
-
-dataset = NewMultiLeRobotDataset(
-    configs=[...],           # list[SubDatasetConfig]
-    image_transforms=None,   # optional image augmentation
-    delta_timestamps=None,   # optional temporal neighbors
-    tolerance_s=1e-4,        # timestamp tolerance
-)
-
-dataset.num_frames      # total frames across all sub-datasets
-dataset.num_episodes    # total episodes
-dataset.meta            # MultiDatasetMeta (stats, features, episodes)
-dataset.dataset_weights # list of per-dataset weights
-dataset.features        # unified feature dict (union of all mapped features)
-dataset.camera_keys     # unified camera key list
-```
-
-### `WeightedEpisodeAwareSampler`
-
-```python
-from lerobot.datasets.sampler import WeightedEpisodeAwareSampler
-
-sampler = WeightedEpisodeAwareSampler(
-    dataset_from_indices=dataset.meta.episodes["dataset_from_index"],
-    dataset_to_indices=dataset.meta.episodes["dataset_to_index"],
-    dataset_membership=dataset.meta.episodes["dataset_source"],
-    dataset_weights=dataset.dataset_weights,
-    shuffle=True,
-)
-```
-
-### `DatasetTransformPipeline`
-
-```python
-from lerobot.datasets.transforms import DatasetTransformPipeline, DatasetTransformStepConfig
-
-pipeline = DatasetTransformPipeline([
-    DatasetTransformStepConfig(type="pad_action", kwargs={"target_dim": 12}),
-    DatasetTransformStepConfig(type="pad_state", kwargs={"target_dim": 16}),
-])
-
-sample = pipeline(sample)  # modifies the sample dict
-```
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -174,15 +174,41 @@ video_benchmark = ["scikit-image>=0.23.2,<0.26.0", "pandas>=2.2.2,<2.4.0"]
 # NOTE: Explicitly listing scipy helps flatten the dependecy tree.
 aloha = ["gym-aloha>=0.1.2,<0.2.0", "lerobot[scipy-dep]"]
 pusht = ["gym-pusht>=0.1.5,<0.2.0", "pymunk>=6.6.0,<7.0.0"] # TODO: Fix pymunk version in gym-pusht instead
-libero = ["lerobot[transformers-dep]", "hf-libero>=0.1.3,<0.2.0; sys_platform == 'linux'", "lerobot[scipy-dep]"]
-libero_plus = [
+libero = [
    "lerobot[transformers-dep]",
-    "libero @ git+https://github.com/sylvestf/LIBERO-plus.git@main ; sys_platform == 'linux'",
+    "hf-libero>=0.1.3,<0.2.0; sys_platform == 'linux'",
+    # hf-egl-probe is the fixed fork of egl-probe (robomimic transitive dep).
+    # egl-probe's CMakeLists.txt requires cmake_minimum_required < 3.5 which
+    # modern cmake rejects. Installing hf-egl-probe first satisfies the egl_probe
+    # import without source compilation.
+    "hf-egl-probe>=1.0.1; sys_platform == 'linux'",
    "lerobot[scipy-dep]",
 ]
+libero_plus = [
+    # Inherit all of libero's deps (hf-libero → robosuite/robomimic/egl-probe/scipy/transformers).
+    # LIBERO-plus extends LIBERO with extra task suites; its Python module is installed
+    # from the git clone in Dockerfile.eval-libero-plus (overrides hf-libero via .pth).
+    "lerobot[libero]",
+    # Additional runtime deps declared by LIBERO-plus but absent from its setup.py:
+    "bddl>=1.0.1,<2.0.0; sys_platform == 'linux'",
+    "future; sys_platform == 'linux'",  # bddl transitive dep not declared in its metadata
+    "easydict>=1.9; sys_platform == 'linux'",
+    "wand; sys_platform == 'linux'",
+    "scikit-image>=0.20.0; sys_platform == 'linux'",
+    "gym>=0.25.0,<0.27.0; sys_platform == 'linux'",
+]
+libero-plus = ["lerobot[libero_plus]"]
 robomme = [
    "robomme @ git+https://github.com/RoboMME/robomme_benchmark.git@main ; sys_platform == 'linux'",
 ]
+robocasa = [
+    # robocasa and its robosuite fork are not on PyPI; both installed from source
+    # in Dockerfile.eval-robocasa (requires ARISE-Initiative/robosuite@robocasa_v1.4.1
+    # for PandaOmron and other robocasa-specific robots).
+    "easydict>=1.9; sys_platform == 'linux'",
+    "scikit-image>=0.20.0; sys_platform == 'linux'",
+    "lerobot[scipy-dep]",
+]
 metaworld = ["metaworld==3.0.0", "lerobot[scipy-dep]"]

 # All
@@ -228,6 +254,7 @@ lerobot-replay="lerobot.scripts.lerobot_replay:main"
 lerobot-setup-motors="lerobot.scripts.lerobot_setup_motors:main"
 lerobot-teleoperate="lerobot.scripts.lerobot_teleoperate:main"
 lerobot-eval="lerobot.scripts.lerobot_eval:main"
+lerobot-eval-worker="lerobot.scripts.lerobot_eval_worker:main"
 lerobot-train="lerobot.scripts.lerobot_train:main"
 lerobot-train-tokenizer="lerobot.scripts.lerobot_train_tokenizer:main"
 lerobot-dataset-viz="lerobot.scripts.lerobot_dataset_viz:main"
@@ -235,7 +262,9 @@ lerobot-info="lerobot.scripts.lerobot_info:main"
 lerobot-find-joint-limits="lerobot.scripts.lerobot_find_joint_limits:main"
 lerobot-imgtransform-viz="lerobot.scripts.lerobot_imgtransform_viz:main"
 lerobot-edit-dataset="lerobot.scripts.lerobot_edit_dataset:main"
+lerobot-leaderboard="lerobot.scripts.lerobot_leaderboard:main"
 lerobot-setup-can="lerobot.scripts.lerobot_setup_can:main"
+lerobot-benchmark="lerobot.scripts.lerobot_benchmark:main"

 # ---------------- Tool Configurations ----------------
 [tool.setuptools.package-data]
--- a/scripts/multimodal_analysis.py
+++ b/scripts/multimodal_analysis.py
@@ -0,0 +1,689 @@
+#!/usr/bin/env python
+
+# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+Chunk-level multi-modality analysis for comparing full/mixed vs curated datasets.
+
+Treats each action chunk (sliding window of CHUNK_SIZE consecutive frames) as the
+atomic unit, tagged by the SARM progress score at its start frame. For each
+progress band, compares the full vs HQ dataset on:
+
+  1. Intra-band action variance
+  2. Progress delta per chunk
+  3. GMM + BIC optimal K (number of distinct strategies)
+  4. PCA embedding (visual cluster inspection)
+
+Usage:
+    python chunk_multimodality_analysis.py \\
+        --full-dataset lerobot-data-collection/level12_rac_2_2026-02-08_1 \\
+        --hq-dataset   lerobot-data-collection/level2_final_quality3 \\
+        --output-dir   ./chunk_analysis
+"""
+
+from __future__ import annotations
+
+import argparse
+import logging
+from collections import defaultdict
+from pathlib import Path
+
+import matplotlib.pyplot as plt
+import numpy as np
+import pandas as pd
+from huggingface_hub import hf_hub_download
+from scipy.stats import gaussian_kde
+from sklearn.decomposition import PCA
+from sklearn.mixture import GaussianMixture
+from sklearn.preprocessing import StandardScaler
+
+from lerobot.datasets.lerobot_dataset import LeRobotDataset
+
+logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
+logger = logging.getLogger(__name__)
+
+# ── Visual style ──────────────────────────────────────────────────────────
+
+BG = "#0e1117"
+CARD = "#1a1d27"
+BORDER = "#2a2d3a"
+SUB = "#8b8fa8"
+TEXT = "#e8eaf0"
+C_FULL = "#f7934f"
+C_HQ = "#4dc98a"
+
+
+def _style_ax(ax: plt.Axes) -> None:
+    ax.set_facecolor(CARD)
+    ax.tick_params(colors=SUB, labelsize=8)
+    for spine in ax.spines.values():
+        spine.set_color(BORDER)
+
+
+def _save(fig: plt.Figure, path: Path) -> None:
+    fig.savefig(path, dpi=150, bbox_inches="tight", facecolor=BG)
+    plt.close(fig)
+    logger.info("Saved %s", path)
+
+
+# ── Step 0: Load episodes ────────────────────────────────────────────────
+
+def _load_sarm_progress(repo_id: str) -> pd.DataFrame | None:
+    """Try to download sarm_progress.parquet from the Hub."""
+    try:
+        path = hf_hub_download(
+            repo_id=repo_id, filename="sarm_progress.parquet",
+            repo_type="dataset",
+        )
+        df = pd.read_parquet(path)
+        col = "progress_sparse" if "progress_sparse" in df.columns else "progress_dense"
+        if col not in df.columns:
+            logger.warning("sarm_progress.parquet has no progress columns — ignoring")
+            return None
+        logger.info("Loaded SARM progress (%s) for %s (%d rows)", col, repo_id, len(df))
+        return df.rename(columns={col: "progress"})[["episode_index", "frame_index", "progress"]]
+    except Exception as exc:
+        logger.warning("Could not load sarm_progress.parquet for %s: %s", repo_id, exc)
+        return None
+
+
+def load_episodes(
+    repo_id: str,
+    n_joints: int = 16,
+    max_episodes: int | None = None,
+) -> list[dict]:
+    dataset = LeRobotDataset(repo_id, download_videos=False)
+    raw = dataset.hf_dataset
+
+    sarm_df = _load_sarm_progress(repo_id)
+    # Build per-episode progress arrays from SARM parquet (indexed by frame_index)
+    sarm_by_ep: dict[int, dict[int, float]] = {}
+    if sarm_df is not None:
+        if max_episodes is not None:
+            sarm_df = sarm_df[sarm_df["episode_index"] < max_episodes]
+        for ep_id, grp in sarm_df.groupby("episode_index"):
+            sarm_by_ep[int(ep_id)] = dict(
+                zip(grp["frame_index"].astype(int), grp["progress"].astype(float))
+            )
+
+    episodes: dict[int, dict] = defaultdict(lambda: {"actions": [], "progress": []})
+    for row in raw:
+        ep = int(row["episode_index"])
+        if max_episodes is not None and ep >= max_episodes:
+            continue
+        action = np.array(row["action"], dtype=np.float32)[:n_joints]
+        episodes[ep]["actions"].append(action)
+        fi = int(row["frame_index"])
+        ep_prog = sarm_by_ep.get(ep, {})
+        episodes[ep]["progress"].append(ep_prog.get(fi, float("nan")))
+
+    has_sarm = len(sarm_lookup) > 0
+    result = []
+    for ep_id, d in sorted(episodes.items()):
+        actions = np.stack(d["actions"])
+        T = len(actions)
+        if has_sarm:
+            prog = np.array(d["progress"], dtype=np.float32)
+            prog = np.clip(np.nan_to_num(prog, nan=0.0), 0.0, 1.0)
+            prog = np.maximum.accumulate(prog)
+        else:
+            prog = np.linspace(0.0, 1.0, T, dtype=np.float32)
+        result.append({"episode": ep_id, "actions": actions, "progress": prog})
+
+    src = "SARM" if has_sarm else "time-based"
+    logger.info("Progress source: %s", src)
+    return result
+
+
+# ── Step 1: Filter short episodes ────────────────────────────────────────
+
+def auto_length_threshold(
+    episodes_full: list[dict], episodes_hq: list[dict]
+) -> int:
+    all_lengths = np.array(
+        [e["actions"].shape[0] for e in episodes_full + episodes_hq]
+    )
+    kde = gaussian_kde(all_lengths, bw_method=0.25)
+    xs = np.linspace(all_lengths.min(), np.percentile(all_lengths, 40), 300)
+    return int(xs[np.argmin(kde(xs))])
+
+
+def plot_length_distribution(
+    episodes_full: list[dict],
+    episodes_hq: list[dict],
+    threshold: int,
+    out_path: Path,
+) -> None:
+    lens_full = np.array([e["actions"].shape[0] for e in episodes_full])
+    lens_hq = np.array([e["actions"].shape[0] for e in episodes_hq])
+    all_lens = np.concatenate([lens_full, lens_hq])
+
+    fig, ax = plt.subplots(figsize=(10, 5))
+    fig.patch.set_facecolor(BG)
+    _style_ax(ax)
+
+    bins = np.linspace(all_lens.min(), all_lens.max(), 50)
+    ax.hist(lens_full, bins=bins, alpha=0.5, color=C_FULL, label="Full/Mixed")
+    ax.hist(lens_hq, bins=bins, alpha=0.5, color=C_HQ, label="HQ")
+
+    xs = np.linspace(all_lens.min(), all_lens.max(), 300)
+    kde = gaussian_kde(all_lens, bw_method=0.25)
+    ax.plot(xs, kde(xs) * len(all_lens) * (bins[1] - bins[0]), color=TEXT, lw=1.5, label="KDE (combined)")
+
+    ax.axvline(threshold, color="#ff4b4b", ls="--", lw=1.5, label=f"Threshold = {threshold}")
+    ax.set_xlabel("Episode length (frames)", color=SUB)
+    ax.set_ylabel("Count", color=SUB)
+    ax.set_title("Episode Length Distribution", color=TEXT, fontsize=13)
+    ax.legend(facecolor=CARD, edgecolor=BORDER, labelcolor=TEXT, fontsize=8)
+    _save(fig, out_path)
+
+
+def filter_episodes(episodes: list[dict], min_length: int) -> list[dict]:
+    kept = [e for e in episodes if e["actions"].shape[0] >= min_length]
+    logger.info("Kept %d / %d episodes (min_length=%d)", len(kept), len(episodes), min_length)
+    return kept
+
+
+# ── Step 2: Extract chunks ───────────────────────────────────────────────
+
+def extract_chunks(
+    episodes: list[dict],
+    chunk_size: int = 30,
+    chunk_stride: int = 15,
+) -> list[dict]:
+    chunks = []
+    for ep in episodes:
+        actions = ep["actions"]
+        T = len(actions)
+        prog = ep["progress"]
+
+        for t in range(0, T - chunk_size, chunk_stride):
+            chunk = actions[t : t + chunk_size]
+            p_start = float(prog[t])
+            p_end = float(prog[min(t + chunk_size, T - 1)])
+
+            chunks.append({
+                "action_mean": chunk.mean(axis=0).astype(np.float32),
+                "action_flat": chunk.flatten().astype(np.float32),
+                "progress_start": p_start,
+                "progress_delta": p_end - p_start,
+                "episode": ep["episode"],
+            })
+    return chunks
+
+
+# ── Step 3: Adaptive progress bands ─────────────────────────────────────
+
+def make_bands(n_bands: int = 5) -> list[tuple[float, float]]:
+    edges = np.linspace(0.0, 1.0, n_bands + 1)
+    return [(float(edges[i]), float(edges[i + 1])) for i in range(n_bands)]
+
+
+def assign_bands(
+    chunks: list[dict], band_edges: list[tuple[float, float]]
+) -> list[dict]:
+    n = len(band_edges)
+    for c in chunks:
+        p = c["progress_start"]
+        c["band"] = next(
+            (bi for bi, (lo, hi) in enumerate(band_edges) if p < hi),
+            n - 1,
+        )
+    return chunks
+
+
+def split_by_band(chunks: list[dict], n_bands: int) -> dict[int, list[dict]]:
+    out: dict[int, list[dict]] = {b: [] for b in range(n_bands)}
+    for c in chunks:
+        out[c["band"]].append(c)
+    return out
+
+
+# ── Step 4: Intra-band action variance ──────────────────────────────────
+
+def band_variance_matrix(
+    bands: dict[int, list[dict]], n_bands: int, n_joints: int
+) -> np.ndarray:
+    var_mat = np.full((n_bands, n_joints), np.nan)
+    for b, clist in bands.items():
+        if len(clist) < 3:
+            continue
+        means = np.stack([c["action_mean"] for c in clist])
+        var_mat[b] = np.var(means, axis=0)
+    return var_mat
+
+
+def plot_variance_heatmap(
+    var_full: np.ndarray,
+    var_hq: np.ndarray,
+    band_edges: list[tuple[float, float]],
+    out_path: Path,
+) -> None:
+    n_bands = var_full.shape[0]
+    vmin = 0.0
+    vmax = max(np.nanmax(var_full), np.nanmax(var_hq))
+
+    band_labels = [f"{lo:.0%}–{hi:.0%}" for lo, hi in band_edges]
+    joint_labels = [f"J{j}" for j in range(var_full.shape[1])]
+
+    fig, axes = plt.subplots(3, 1, figsize=(12, 10), gridspec_kw={"height_ratios": [3, 3, 2]})
+    fig.patch.set_facecolor(BG)
+    fig.suptitle("Intra-Band Action Variance", color=TEXT, fontsize=14, y=0.98)
+
+    for ax_idx, (mat, label) in enumerate([(var_full, "Full/Mixed"), (var_hq, "HQ")]):
+        ax = axes[ax_idx]
+        _style_ax(ax)
+        im = ax.imshow(mat, aspect="auto", cmap="YlOrRd", vmin=vmin, vmax=vmax)
+        ax.set_yticks(range(n_bands))
+        ax.set_yticklabels(band_labels, fontsize=7, color=SUB)
+        ax.set_xticks(range(var_full.shape[1]))
+        ax.set_xticklabels(joint_labels, fontsize=7, color=SUB)
+        ax.set_title(f"Panel {'A' if ax_idx == 0 else 'B'}: {label}", color=TEXT, fontsize=11)
+        fig.colorbar(im, ax=ax, fraction=0.02, pad=0.02)
+
+    with np.errstate(invalid="ignore"):
+        mean_full = np.nanmean(var_full, axis=1)
+        mean_hq = np.nanmean(var_hq, axis=1)
+    ratio = np.where(np.isnan(mean_full) | np.isnan(mean_hq), np.nan,
+                     mean_full / (mean_hq + 1e-8))
+    ax_bar = axes[2]
+    _style_ax(ax_bar)
+    colors = [
+        "#ff4b4b" if r > 2.0 else "#ffaa33" if r > 1.2 else C_HQ
+        for r in ratio
+    ]
+    ax_bar.bar(range(n_bands), ratio, color=colors, edgecolor=BORDER)
+    ax_bar.axhline(1.0, color=SUB, ls="--", lw=0.8)
+    ax_bar.set_xticks(range(n_bands))
+    ax_bar.set_xticklabels(band_labels, fontsize=7, color=SUB)
+    ax_bar.set_ylabel("Variance ratio\n(Full / HQ)", color=SUB, fontsize=9)
+    ax_bar.set_title("Panel C: Variance Ratio per Band", color=TEXT, fontsize=11)
+
+    fig.tight_layout(rect=[0, 0, 1, 0.96])
+    _save(fig, out_path)
+
+
+# ── Step 5: Progress delta per band ──────────────────────────────────────
+
+def plot_progress_delta(
+    bands_full: dict[int, list[dict]],
+    bands_hq: dict[int, list[dict]],
+    band_edges: list[tuple[float, float]],
+    out_path: Path,
+) -> None:
+    n_bands = len(band_edges)
+    band_labels = [f"{lo:.0%}–{hi:.0%}" for lo, hi in band_edges]
+    x = np.arange(n_bands)
+    w = 0.35
+
+    means_full, stds_full = [], []
+    means_hq, stds_hq = [], []
+    all_deltas_full, all_deltas_hq = [], []
+
+    for b in range(n_bands):
+        df = np.array([c["progress_delta"] for c in bands_full.get(b, [])])
+        dh = np.array([c["progress_delta"] for c in bands_hq.get(b, [])])
+        means_full.append(np.mean(df) if len(df) > 0 else 0)
+        stds_full.append(np.std(df) if len(df) > 0 else 0)
+        means_hq.append(np.mean(dh) if len(dh) > 0 else 0)
+        stds_hq.append(np.std(dh) if len(dh) > 0 else 0)
+        all_deltas_full.extend(df.tolist())
+        all_deltas_hq.extend(dh.tolist())
+
+    fig, (ax_bar, ax_viol) = plt.subplots(1, 2, figsize=(14, 5), gridspec_kw={"width_ratios": [3, 1]})
+    fig.patch.set_facecolor(BG)
+    fig.suptitle("Progress Delta per Chunk", color=TEXT, fontsize=14)
+
+    _style_ax(ax_bar)
+    ax_bar.bar(x - w / 2, means_full, w, yerr=stds_full, color=C_FULL, edgecolor=BORDER,
+               capsize=3, label="Full/Mixed", error_kw={"ecolor": SUB})
+    ax_bar.bar(x + w / 2, means_hq, w, yerr=stds_hq, color=C_HQ, edgecolor=BORDER,
+               capsize=3, label="HQ", error_kw={"ecolor": SUB})
+    ax_bar.set_xticks(x)
+    ax_bar.set_xticklabels(band_labels, fontsize=7, color=SUB, rotation=30)
+    ax_bar.set_ylabel("Mean progress Δ", color=SUB)
+    ax_bar.legend(facecolor=CARD, edgecolor=BORDER, labelcolor=TEXT, fontsize=8)
+
+    _style_ax(ax_viol)
+    data_viol = [np.array(all_deltas_full), np.array(all_deltas_hq)]
+    if all(len(d) > 0 for d in data_viol):
+        parts = ax_viol.violinplot(data_viol, positions=[0, 1], showmeans=True, showmedians=True)
+        for pc, c in zip(parts["bodies"], [C_FULL, C_HQ]):
+            pc.set_facecolor(c)
+            pc.set_alpha(0.7)
+        for key in ("cmeans", "cmedians", "cbars", "cmins", "cmaxes"):
+            if key in parts:
+                parts[key].set_color(SUB)
+    ax_viol.set_xticks([0, 1])
+    ax_viol.set_xticklabels(["Full", "HQ"], color=SUB)
+    ax_viol.set_ylabel("Progress Δ", color=SUB)
+    ax_viol.set_title("Overall Distribution", color=TEXT, fontsize=10)
+
+    fig.tight_layout()
+    _save(fig, out_path)
+
+
+# ── Step 6: GMM + BIC per band ──────────────────────────────────────────
+
+def gmm_optimal_k(
+    band_chunks: list[dict],
+    pca_components: int = 15,
+    max_k: int = 12,
+    seed: int = 42,
+) -> int | None:
+    if len(band_chunks) < 20:
+        return None
+    X = np.stack([c["action_flat"] for c in band_chunks])
+    X = StandardScaler().fit_transform(X)
+    n = min(pca_components, X.shape[1], X.shape[0] - 1)
+    X_r = PCA(n_components=n, random_state=seed).fit_transform(X)
+    bics = []
+    for k in range(1, min(max_k + 1, len(X_r) // 6)):
+        gmm = GaussianMixture(
+            n_components=k, covariance_type="full",
+            n_init=5, max_iter=300, random_state=seed,
+        )
+        gmm.fit(X_r)
+        bics.append((k, gmm.bic(X_r)))
+    if not bics:
+        return None
+    return min(bics, key=lambda x: x[1])[0]
+
+
+def plot_gmm_bic(
+    bands_full: dict[int, list[dict]],
+    bands_hq: dict[int, list[dict]],
+    band_edges: list[tuple[float, float]],
+    seed: int,
+    out_path: Path,
+) -> tuple[list[int | None], list[int | None]]:
+    n_bands = len(band_edges)
+    ks_full = [gmm_optimal_k(bands_full.get(b, []), seed=seed) for b in range(n_bands)]
+    ks_hq = [gmm_optimal_k(bands_hq.get(b, []), seed=seed) for b in range(n_bands)]
+
+    band_labels = [f"{lo:.0%}–{hi:.0%}" for lo, hi in band_edges]
+
+    fig, ax = plt.subplots(figsize=(10, 5))
+    fig.patch.set_facecolor(BG)
+    _style_ax(ax)
+
+    xs = np.arange(n_bands)
+    valid_full = [(i, k) for i, k in enumerate(ks_full) if k is not None]
+    valid_hq = [(i, k) for i, k in enumerate(ks_hq) if k is not None]
+
+    if valid_full:
+        xi, yi = zip(*valid_full)
+        ax.plot(xi, yi, "o-", color=C_FULL, label="Full/Mixed", lw=2, markersize=7)
+    if valid_hq:
+        xi, yi = zip(*valid_hq)
+        ax.plot(xi, yi, "o-", color=C_HQ, label="HQ", lw=2, markersize=7)
+
+    if valid_full and valid_hq:
+        all_x = sorted(set([i for i, _ in valid_full]) & set([i for i, _ in valid_hq]))
+        if len(all_x) >= 2:
+            kf_interp = {i: k for i, k in valid_full}
+            kh_interp = {i: k for i, k in valid_hq}
+            shared_x = [i for i in all_x if i in kf_interp and i in kh_interp]
+            yf = [kf_interp[i] for i in shared_x]
+            yh = [kh_interp[i] for i in shared_x]
+            ax.fill_between(shared_x, yf, yh, alpha=0.15, color=TEXT)
+
+    ax.set_xticks(xs)
+    ax.set_xticklabels(band_labels, fontsize=7, color=SUB, rotation=30)
+    ax.set_ylabel("Optimal K (GMM-BIC)", color=SUB)
+    ax.set_title("Number of Distinct Strategies per Band", color=TEXT, fontsize=13)
+    ax.legend(facecolor=CARD, edgecolor=BORDER, labelcolor=TEXT, fontsize=9)
+    ax.yaxis.set_major_locator(plt.MaxNLocator(integer=True))
+    fig.tight_layout()
+    _save(fig, out_path)
+    return ks_full, ks_hq
+
+
+# ── Step 7: PCA scatter per band ────────────────────────────────────────
+
+def plot_pca_scatter(
+    bands_full: dict[int, list[dict]],
+    bands_hq: dict[int, list[dict]],
+    band_edges: list[tuple[float, float]],
+    out_path: Path,
+) -> None:
+    n_plot = min(4, len(band_edges))
+    fig, axes = plt.subplots(2, n_plot, figsize=(4 * n_plot, 7))
+    fig.patch.set_facecolor(BG)
+    fig.suptitle("PCA of Action Chunks per Band", color=TEXT, fontsize=14)
+
+    if n_plot == 1:
+        axes = axes.reshape(2, 1)
+
+    for col, b in enumerate(range(n_plot)):
+        cf = bands_full.get(b, [])
+        ch = bands_hq.get(b, [])
+        lo, hi = band_edges[b]
+
+        for row, (clist, color, label) in enumerate([
+            (cf, C_FULL, "Full/Mixed"), (ch, C_HQ, "HQ")
+        ]):
+            ax = axes[row, col]
+            _style_ax(ax)
+            if row == 0:
+                ax.set_title(f"{lo:.0%}–{hi:.0%}", color=TEXT, fontsize=10)
+            if col == 0:
+                ax.set_ylabel(label, color=SUB, fontsize=9)
+
+            if len(cf) < 3 or len(ch) < 3:
+                ax.text(0.5, 0.5, "Too few\nchunks", transform=ax.transAxes,
+                        ha="center", va="center", color=SUB, fontsize=9)
+                continue
+
+            X_full_b = np.stack([c["action_flat"] for c in cf])
+            X_hq_b = np.stack([c["action_flat"] for c in ch])
+            X_all = np.vstack([X_full_b, X_hq_b])
+            X_all = StandardScaler().fit_transform(X_all)
+            X_2d = PCA(n_components=2, random_state=42).fit_transform(X_all)
+
+            X_2d_full = X_2d[: len(cf)]
+            X_2d_hq = X_2d[len(cf) :]
+
+            pts = X_2d_full if row == 0 else X_2d_hq
+            ax.scatter(pts[:, 0], pts[:, 1], s=8, alpha=0.5, color=color, edgecolors="none")
+
+    fig.tight_layout(rect=[0, 0, 1, 0.95])
+    _save(fig, out_path)
+
+
+# ── Plot 1: Chunk counts per band ───────────────────────────────────────
+
+def plot_chunk_counts(
+    bands_full: dict[int, list[dict]],
+    bands_hq: dict[int, list[dict]],
+    band_edges: list[tuple[float, float]],
+    out_path: Path,
+) -> None:
+    n_bands = len(band_edges)
+    band_labels = [f"{lo:.0%}–{hi:.0%}" for lo, hi in band_edges]
+    x = np.arange(n_bands)
+    w = 0.35
+
+    counts_full = [len(bands_full.get(b, [])) for b in range(n_bands)]
+    counts_hq = [len(bands_hq.get(b, [])) for b in range(n_bands)]
+
+    fig, ax = plt.subplots(figsize=(10, 5))
+    fig.patch.set_facecolor(BG)
+    _style_ax(ax)
+
+    ax.bar(x - w / 2, counts_full, w, color=C_FULL, edgecolor=BORDER, label="Full/Mixed")
+    ax.bar(x + w / 2, counts_hq, w, color=C_HQ, edgecolor=BORDER, label="HQ")
+    ax.set_xticks(x)
+    ax.set_xticklabels(band_labels, fontsize=7, color=SUB, rotation=30)
+    ax.set_ylabel("Chunk count", color=SUB)
+    ax.set_title("Chunk Counts per Progress Band", color=TEXT, fontsize=13)
+    ax.legend(facecolor=CARD, edgecolor=BORDER, labelcolor=TEXT, fontsize=8)
+    fig.tight_layout()
+    _save(fig, out_path)
+
+
+# ── Summary figure ───────────────────────────────────────────────────────
+
+def plot_summary(
+    var_full: np.ndarray,
+    var_hq: np.ndarray,
+    band_edges: list[tuple[float, float]],
+    ks_full: list[int | None],
+    ks_hq: list[int | None],
+    bands_full: dict[int, list[dict]],
+    bands_hq: dict[int, list[dict]],
+    out_path: Path,
+) -> None:
+    with np.errstate(invalid="ignore"):
+        mean_full = np.nanmean(var_full, axis=1)
+        mean_hq = np.nanmean(var_hq, axis=1)
+    ratio = np.where(np.isnan(mean_full) | np.isnan(mean_hq), np.nan,
+                     mean_full / (mean_hq + 1e-8))
+    valid_ratio = ratio[~np.isnan(ratio)]
+    mean_ratio = float(np.mean(valid_ratio)) if len(valid_ratio) > 0 else float("nan")
+    peak_idx = int(np.argmax(valid_ratio)) if len(valid_ratio) > 0 else 0
+    peak_ratio = float(valid_ratio[peak_idx]) if len(valid_ratio) > 0 else float("nan")
+    lo, hi = band_edges[peak_idx]
+    peak_band = f"{lo:.0%}–{hi:.0%}"
+
+    valid_kf = [k for k in ks_full if k is not None]
+    valid_kh = [k for k in ks_hq if k is not None]
+    mean_k_full = np.mean(valid_kf) if valid_kf else float("nan")
+    mean_k_hq = np.mean(valid_kh) if valid_kh else float("nan")
+
+    n_bands = len(band_edges)
+    deltas_full = [c["progress_delta"] for b in range(n_bands) for c in bands_full.get(b, [])]
+    deltas_hq = [c["progress_delta"] for b in range(n_bands) for c in bands_hq.get(b, [])]
+    mean_delta_full = float(np.mean(deltas_full)) if deltas_full else float("nan")
+    mean_delta_hq = float(np.mean(deltas_hq)) if deltas_hq else float("nan")
+
+    rows = [
+        ("Mean variance ratio (Full / HQ)", f"{mean_ratio:.2f}x"),
+        ("Peak variance ratio", f"{peak_ratio:.2f}x at {peak_band}"),
+        ("Mean GMM K — Full", f"{mean_k_full:.1f}"),
+        ("Mean GMM K — HQ", f"{mean_k_hq:.1f}"),
+        ("Mean progress Δ — Full", f"{mean_delta_full:.4f}"),
+        ("Mean progress Δ — HQ", f"{mean_delta_hq:.4f}"),
+    ]
+
+    fig, ax = plt.subplots(figsize=(8, 3))
+    fig.patch.set_facecolor(BG)
+    ax.set_facecolor(CARD)
+    ax.axis("off")
+
+    table = ax.table(
+        cellText=[[m, v] for m, v in rows],
+        colLabels=["Metric", "Value"],
+        loc="center",
+        cellLoc="left",
+    )
+    table.auto_set_font_size(False)
+    table.set_fontsize(10)
+    for key, cell in table.get_celld().items():
+        cell.set_edgecolor(BORDER)
+        cell.set_facecolor(CARD)
+        cell.set_text_props(color=TEXT)
+        if key[0] == 0:
+            cell.set_text_props(color=TEXT, fontweight="bold")
+    table.scale(1, 1.6)
+    ax.set_title("Summary Statistics", color=TEXT, fontsize=13, pad=15)
+    fig.tight_layout()
+    _save(fig, out_path)
+
+    for metric, value in rows:
+        logger.info("  %s: %s", metric, value)
+
+
+# ── Main ─────────────────────────────────────────────────────────────────
+
+def main(args: argparse.Namespace) -> None:
+    out = Path(args.output_dir)
+    out.mkdir(parents=True, exist_ok=True)
+
+    logger.info("Loading FULL dataset: %s", args.full_dataset)
+    episodes_full = load_episodes(args.full_dataset, args.n_joints, args.max_episodes)
+    logger.info("Loading HQ dataset: %s", args.hq_dataset)
+    episodes_hq = load_episodes(args.hq_dataset, args.n_joints, args.max_episodes)
+    logger.info("Loaded %d full episodes, %d HQ episodes", len(episodes_full), len(episodes_hq))
+
+    # Step 1: length threshold + filter
+    if args.min_episode_length is not None:
+        threshold = args.min_episode_length
+    else:
+        threshold = auto_length_threshold(episodes_full, episodes_hq)
+    logger.info("Episode length threshold: %d", threshold)
+
+    plot_length_distribution(episodes_full, episodes_hq, threshold, out / "0_length_distribution.png")
+    episodes_full = filter_episodes(episodes_full, threshold)
+    episodes_hq = filter_episodes(episodes_hq, threshold)
+
+    # Step 2: extract chunks
+    chunks_full = extract_chunks(episodes_full, args.chunk_size, args.chunk_stride)
+    chunks_hq = extract_chunks(episodes_hq, args.chunk_size, args.chunk_stride)
+    logger.info("Extracted %d full chunks, %d HQ chunks", len(chunks_full), len(chunks_hq))
+
+    # Step 3: fixed equal-width bands over episode-relative progress
+    band_edges = make_bands(args.n_bands)
+    n_bands = len(band_edges)
+    logger.info("Progress bands (%d): %s", n_bands,
+                [f"{lo:.0%}–{hi:.0%}" for lo, hi in band_edges])
+
+    chunks_full = assign_bands(chunks_full, band_edges)
+    chunks_hq = assign_bands(chunks_hq, band_edges)
+    bands_full = split_by_band(chunks_full, n_bands)
+    bands_hq = split_by_band(chunks_hq, n_bands)
+
+    # Plot 1: chunk counts
+    plot_chunk_counts(bands_full, bands_hq, band_edges, out / "1_chunk_counts_per_band.png")
+
+    # Step 4: variance heatmap
+    var_full = band_variance_matrix(bands_full, n_bands, args.n_joints)
+    var_hq = band_variance_matrix(bands_hq, n_bands, args.n_joints)
+    plot_variance_heatmap(var_full, var_hq, band_edges, out / "2_variance_heatmap.png")
+
+    # Step 5: progress delta
+    plot_progress_delta(bands_full, bands_hq, band_edges, out / "3_progress_delta_per_band.png")
+
+    # Step 6: GMM BIC
+    ks_full, ks_hq = plot_gmm_bic(bands_full, bands_hq, band_edges, args.seed, out / "4_gmm_bic_per_band.png")
+
+    # Step 7: PCA scatter
+    plot_pca_scatter(bands_full, bands_hq, band_edges, out / "5_pca_per_band.png")
+
+    # Summary
+    plot_summary(var_full, var_hq, band_edges, ks_full, ks_hq,
+                 bands_full, bands_hq, out / "6_summary.png")
+
+    logger.info("All figures saved to %s", out)
+
+
+if __name__ == "__main__":
+    p = argparse.ArgumentParser(
+        description="Chunk-level multi-modality analysis: Full/Mixed vs HQ dataset.",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+    )
+    p.add_argument("--full-dataset", default="lerobot-data-collection/level12_rac_2_2026-02-08_1")
+    p.add_argument("--hq-dataset", default="lerobot-data-collection/level2_final_quality3_trim_0_hil_data")
+    p.add_argument("--output-dir", default="./chunk_analysis")
+    p.add_argument("--chunk-size", type=int, default=30)
+    p.add_argument("--chunk-stride", type=int, default=15)
+    p.add_argument("--n-bands", type=int, default=5, help="Number of equal-width progress bands")
+    p.add_argument("--max-episodes", type=int, default=500)
+    p.add_argument("--n-joints", type=int, default=16)
+    p.add_argument("--min-episode-length", type=int, default=None,
+                   help="Override auto-detected length filter threshold")
+    p.add_argument("--seed", type=int, default=42)
+    args = p.parse_args()
+    main(args)
--- a/scripts/train_smolvla_libero_plus.slurm
+++ b/scripts/train_smolvla_libero_plus.slurm
@@ -0,0 +1,29 @@
+#!/bin/bash
+#SBATCH --job-name=smolvla_libero_plus
+#SBATCH --partition=hopper-prod
+#SBATCH --nodes=1
+#SBATCH --ntasks-per-node=1
+#SBATCH --gpus-per-node=4
+#SBATCH --cpus-per-task=48
+#SBATCH --mem=200G
+#SBATCH --time=12:00:00
+#SBATCH --output=logs/smolvla_libero_plus_%j.out
+#SBATCH --error=logs/smolvla_libero_plus_%j.err
+
+set -euo pipefail
+
+eval "$(conda shell.bash hook 2>/dev/null)"
+conda activate lerobot312
+
+cd /admin/home/pepijn/lerobot_wt_robocasa
+
+lerobot-benchmark train \
+    --benchmarks libero_plus \
+    --policy-path lerobot/smolvla_base \
+    --hub-user pepijn223 \
+    --num-gpus 4 \
+    --steps 30000 \
+    --batch-size 32 \
+    --eval-freq 0 \
+    --wandb \
+    --dataset.repo_id=pepijn223/libero_plus_lerobot
--- a/src/lerobot/configs/default.py
+++ b/src/lerobot/configs/default.py
@@ -16,13 +16,18 @@

 from dataclasses import dataclass, field

-from lerobot.datasets.transforms import DatasetTransformStepConfig, ImageTransformsConfig
+from lerobot.datasets.transforms import ImageTransformsConfig
 from lerobot.datasets.video_utils import get_safe_default_codec


@dataclass
 class DatasetConfig:
+    # You may provide a list of datasets here. `train.py` creates them all and concatenates them. Note: only data
+    # keys common between the datasets are kept. Each dataset gets and additional transform that inserts the
+    # "dataset_index" into the returned item. The index mapping is made according to the order in which the
+    # datasets are provided.
    repo_id: str
+    # Root directory where the dataset will be stored (e.g. 'dataset/path'). If None, defaults to $HF_LEROBOT_HOME/repo_id.
    root: str | None = None
    episodes: list[int] | None = None
    image_transforms: ImageTransformsConfig = field(default_factory=ImageTransformsConfig)
@@ -32,32 +37,6 @@ class DatasetConfig:
    streaming: bool = False


-@dataclass
-class SubDatasetConfig:
-    """Configuration for a single dataset within a MultiDatasetConfig."""
-
-    repo_id: str
-    root: str | None = None
-    episodes: list[int] | None = None
-    revision: str | None = None
-    video_backend: str = field(default_factory=get_safe_default_codec)
-    weight: float = 1.0
-    # Maps dataset-local feature keys to unified policy keys.
-    # Keys not listed pass through unchanged.
-    feature_map: dict[str, str] = field(default_factory=dict)
-    # Per-dataset transforms applied after feature renaming, before cross-dataset padding.
-    transforms: list[DatasetTransformStepConfig] | None = None
-
-
-@dataclass
-class MultiDatasetConfig:
-    """Configuration for training on multiple datasets jointly."""
-
-    datasets: list[SubDatasetConfig] = field(default_factory=list)
-    image_transforms: ImageTransformsConfig = field(default_factory=ImageTransformsConfig)
-    use_imagenet_stats: bool = True
-
-
@dataclass
 class WandBConfig:
    enable: bool = False
@@ -70,15 +49,64 @@ class WandBConfig:
    mode: str | None = None  # Allowed values: 'online', 'offline' 'disabled'. Defaults to 'online'


+@dataclass
+class EvalDockerConfig:
+    # Docker image to use for evaluation (e.g., "ghcr.io/org/lerobot-eval-libero:latest").
+    # Takes precedence over eval.envhub_ref.
+    image: str | None = None
+    # Optional EnvHub reference to resolve an image, e.g. "envhub://lerobot/libero_plus@v1".
+    envhub_ref: str | None = None
+    # If true, mount the local repository and prefer local source code in the container.
+    use_local_code: bool = True
+    # Pull the image before running.
+    pull: bool = True
+    # Docker --gpus value. Set to None to disable GPU flags and run CPU-only.
+    gpus: str | None = "all"
+    # Docker --shm-size value (increase when using larger eval.batch_size values).
+    shm_size: str = "8g"
+    # Port on which the host HTTP policy inference server listens.
+    port: int = 50051
+
+
@dataclass
 class EvalConfig:
    n_episodes: int = 50
-    # `batch_size` specifies the number of environments to use in a gym.vector.VectorEnv.
+    # Number of sub-envs per task inside one VectorEnv. Increase to improve per-task
+    # inference throughput until GPU or simulator memory saturates.
    batch_size: int = 50
-    # `use_async_envs` specifies whether to use asynchronous environments (multiprocessing).
+    # Use AsyncVectorEnv (multiprocessing). Prefer SyncVectorEnv unless your environment
+    # spends significant time in Python-side stepping and can benefit from process parallelism.
    use_async_envs: bool = False
+    # Runtime where evaluation executes: "local", "docker", or "multiprocess".
+    # "multiprocess" spawns local worker processes + policy servers.
+    runtime: str = "local"
+    docker: EvalDockerConfig = field(default_factory=EvalDockerConfig)
+    # Number of parallel eval script instances to launch for one run.
+    # instance_count > 1 enables multi-instance task sharding.
+    instance_count: int = 1
+    # 0-indexed shard id for this process. Users usually leave this at 0.
+    # Additional shards are launched automatically by `lerobot-eval` when instance_count > 1.
+    instance_id: int = 0
+    # Number of policy inference servers to run in parallel (docker/multiprocess runtimes).
+    # Each server loads a copy of the model and listens on consecutive ports
+    # starting from eval.docker.port. Workers are round-robin assigned.
+    policy_servers: int = 1
+    # Base port for policy servers in multiprocess mode.
+    port: int = 50051

    def __post_init__(self) -> None:
+        if self.runtime not in {"local", "docker", "multiprocess"}:
+            raise ValueError(
+                f"Unsupported eval.runtime '{self.runtime}'. Expected one of: local, docker, multiprocess."
+            )
+        if self.instance_count < 1:
+            raise ValueError("eval.instance_count must be >= 1.")
+        if self.instance_id < 0 or self.instance_id >= self.instance_count:
+            raise ValueError(
+                f"eval.instance_id must be in [0, {self.instance_count - 1}] (got {self.instance_id})."
+            )
+        if self.policy_servers < 1:
+            raise ValueError("eval.policy_servers must be >= 1.")
        if self.batch_size > self.n_episodes:
            raise ValueError(
                "The eval batch size is greater than the number of eval episodes "
--- a/src/lerobot/configs/eval.py
+++ b/src/lerobot/configs/eval.py
@@ -40,6 +40,8 @@ class EvalPipelineConfig:
    rename_map: dict[str, str] = field(default_factory=dict)
    # Explicit consent to execute remote code from the Hub (required for hub environments).
    trust_remote_code: bool = False
+    # Push eval results (metrics JSON, rollout videos, model card update) to the model's Hub repo.
+    push_to_hub: bool = False

    def __post_init__(self) -> None:
        # HACK: We parse again the cli args here to get the pretrained path if there was one.
--- a/src/lerobot/configs/train.py
+++ b/src/lerobot/configs/train.py
@@ -24,7 +24,7 @@ from huggingface_hub.errors import HfHubHTTPError

 from lerobot import envs
 from lerobot.configs import parser
-from lerobot.configs.default import DatasetConfig, EvalConfig, MultiDatasetConfig, PeftConfig, WandBConfig
+from lerobot.configs.default import DatasetConfig, EvalConfig, PeftConfig, WandBConfig
 from lerobot.configs.policies import PreTrainedConfig
 from lerobot.optim import OptimizerConfig
 from lerobot.optim.schedulers import LRSchedulerConfig
@@ -35,7 +35,7 @@ TRAIN_CONFIG_NAME = "train_config.json"

@dataclass
 class TrainPipelineConfig(HubMixin):
-    dataset: DatasetConfig | MultiDatasetConfig
+    dataset: DatasetConfig
    env: envs.EnvConfig | None = None
    policy: PreTrainedConfig | None = None
    # Set `dir` to where you would like to save all of the run outputs. If you run another training session
@@ -129,9 +129,8 @@ class TrainPipelineConfig(HubMixin):
            train_dir = f"{now:%Y-%m-%d}/{now:%H-%M-%S}_{self.job_name}"
            self.output_dir = Path("outputs/train") / train_dir

-        if isinstance(self.dataset, MultiDatasetConfig):
-            if len(self.dataset.datasets) < 1:
-                raise ValueError("MultiDatasetConfig.datasets must contain at least one sub-dataset.")
+        if isinstance(self.dataset.repo_id, list):
+            raise NotImplementedError("LeRobotMultiDataset is not currently implemented.")

        if not self.use_policy_training_preset and (self.optimizer is None or self.scheduler is None):
            raise ValueError("Optimizer and Scheduler must be set when the policy presets are not used.")
@@ -144,7 +143,8 @@ class TrainPipelineConfig(HubMixin):
                "'policy.repo_id' argument missing. Please specify it to push the model to the hub."
            )

-        if self.use_rabc and not self.rabc_progress_path and isinstance(self.dataset, DatasetConfig):
+        if self.use_rabc and not self.rabc_progress_path:
+            # Auto-detect from dataset path
            repo_id = self.dataset.repo_id
            if self.dataset.root:
                self.rabc_progress_path = str(Path(self.dataset.root) / "sarm_progress.parquet")
--- a/src/lerobot/datasets/factory.py
+++ b/src/lerobot/datasets/factory.py
@@ -18,14 +18,13 @@ from pprint import pformat

 import torch

-from lerobot.configs.default import DatasetConfig, MultiDatasetConfig
 from lerobot.configs.policies import PreTrainedConfig
 from lerobot.configs.train import TrainPipelineConfig
 from lerobot.datasets.lerobot_dataset import (
    LeRobotDataset,
    LeRobotDatasetMetadata,
+    MultiLeRobotDataset,
 )
-from lerobot.datasets.multi_dataset import NewMultiLeRobotDataset
 from lerobot.datasets.streaming_dataset import StreamingLeRobotDataset
 from lerobot.datasets.transforms import ImageTransforms
 from lerobot.utils.constants import ACTION, OBS_PREFIX, REWARD
@@ -69,81 +68,70 @@ def resolve_delta_timestamps(
    return delta_timestamps


-def make_dataset(cfg: TrainPipelineConfig) -> LeRobotDataset | NewMultiLeRobotDataset:
-    """Create a single or multi-dataset depending on the config type.
+def make_dataset(cfg: TrainPipelineConfig) -> LeRobotDataset | MultiLeRobotDataset:
+    """Handles the logic of setting up delta timestamps and image transforms before creating a dataset.
+
+    Args:
+        cfg (TrainPipelineConfig): A TrainPipelineConfig config which contains a DatasetConfig and a PreTrainedConfig.
+
+    Raises:
+        NotImplementedError: The MultiLeRobotDataset is currently deactivated.

    Returns:
-        LeRobotDataset | NewMultiLeRobotDataset
+        LeRobotDataset | MultiLeRobotDataset
    """
-    if isinstance(cfg.dataset, MultiDatasetConfig):
-        return _make_multi_dataset(cfg)
-
-    return _make_single_dataset(cfg)
-
-
-def _make_single_dataset(cfg: TrainPipelineConfig) -> LeRobotDataset:
-    ds_cfg: DatasetConfig = cfg.dataset  # type: ignore[assignment]
    image_transforms = (
-        ImageTransforms(ds_cfg.image_transforms) if ds_cfg.image_transforms.enable else None
+        ImageTransforms(cfg.dataset.image_transforms) if cfg.dataset.image_transforms.enable else None
    )
-    ds_meta = LeRobotDatasetMetadata(ds_cfg.repo_id, root=ds_cfg.root, revision=ds_cfg.revision)
-    delta_timestamps = resolve_delta_timestamps(cfg.policy, ds_meta)

-    if not ds_cfg.streaming:
-        dataset = LeRobotDataset(
-            ds_cfg.repo_id,
-            root=ds_cfg.root,
-            episodes=ds_cfg.episodes,
-            delta_timestamps=delta_timestamps,
-            image_transforms=image_transforms,
-            revision=ds_cfg.revision,
-            video_backend=ds_cfg.video_backend,
-            tolerance_s=cfg.tolerance_s,
+    if isinstance(cfg.dataset.repo_id, str):
+        ds_meta = LeRobotDatasetMetadata(
+            cfg.dataset.repo_id, root=cfg.dataset.root, revision=cfg.dataset.revision
        )
+        delta_timestamps = resolve_delta_timestamps(cfg.policy, ds_meta)
+        if not cfg.dataset.streaming:
+            dataset = LeRobotDataset(
+                cfg.dataset.repo_id,
+                root=cfg.dataset.root,
+                episodes=cfg.dataset.episodes,
+                delta_timestamps=delta_timestamps,
+                image_transforms=image_transforms,
+                revision=cfg.dataset.revision,
+                video_backend=cfg.dataset.video_backend,
+                tolerance_s=cfg.tolerance_s,
+            )
+        else:
+            dataset = StreamingLeRobotDataset(
+                cfg.dataset.repo_id,
+                root=cfg.dataset.root,
+                episodes=cfg.dataset.episodes,
+                delta_timestamps=delta_timestamps,
+                image_transforms=image_transforms,
+                revision=cfg.dataset.revision,
+                max_num_shards=cfg.num_workers,
+                tolerance_s=cfg.tolerance_s,
+            )
    else:
-        dataset = StreamingLeRobotDataset(
-            ds_cfg.repo_id,
-            root=ds_cfg.root,
-            episodes=ds_cfg.episodes,
-            delta_timestamps=delta_timestamps,
+        raise NotImplementedError("The MultiLeRobotDataset isn't supported for now.")
+        dataset = MultiLeRobotDataset(
+            cfg.dataset.repo_id,
+            # TODO(aliberts): add proper support for multi dataset
+            # delta_timestamps=delta_timestamps,
            image_transforms=image_transforms,
-            revision=ds_cfg.revision,
-            max_num_shards=cfg.num_workers,
-            tolerance_s=cfg.tolerance_s,
+            video_backend=cfg.dataset.video_backend,
+        )
+        logging.info(
+            "Multiple datasets were provided. Applied the following index mapping to the provided datasets: "
+            f"{pformat(dataset.repo_id_to_index, indent=2)}"
        )

-    if ds_cfg.use_imagenet_stats:
+    if cfg.dataset.use_imagenet_stats:
+        if dataset.meta.stats is None:
+            dataset.meta.stats = {}
        for key in dataset.meta.camera_keys:
-            for stats_type, stats_val in IMAGENET_STATS.items():
-                dataset.meta.stats[key][stats_type] = torch.tensor(stats_val, dtype=torch.float32)
-
-    return dataset
-
-
-def _make_multi_dataset(cfg: TrainPipelineConfig) -> NewMultiLeRobotDataset:
-    multi_cfg: MultiDatasetConfig = cfg.dataset  # type: ignore[assignment]
-    image_transforms = (
-        ImageTransforms(multi_cfg.image_transforms) if multi_cfg.image_transforms.enable else None
-    )
-
-    dataset = NewMultiLeRobotDataset(
-        configs=multi_cfg.datasets,
-        image_transforms=image_transforms,
-        tolerance_s=cfg.tolerance_s,
-    )
-
-    logging.info(
-        "MultiLeRobotDataset created with %d sub-datasets:\n%s",
-        len(multi_cfg.datasets),
-        pformat(
-            {i: c.repo_id for i, c in enumerate(multi_cfg.datasets)},
-            indent=2,
-        ),
-    )
-
-    if multi_cfg.use_imagenet_stats:
-        for key in dataset.meta.camera_keys:
-            for stats_type, stats_val in IMAGENET_STATS.items():
-                dataset.meta.stats[key][stats_type] = torch.tensor(stats_val, dtype=torch.float32)
+            if key not in dataset.meta.stats:
+                dataset.meta.stats[key] = {}
+            for stats_type, stats in IMAGENET_STATS.items():
+                dataset.meta.stats[key][stats_type] = torch.tensor(stats, dtype=torch.float32)

    return dataset
--- a/src/lerobot/datasets/multi_dataset.py
+++ b/src/lerobot/datasets/multi_dataset.py
@@ -1,364 +0,0 @@
-"""MultiLeRobotDataset: joint training over heterogeneous LeRobot datasets.
-
-Supports:
- Per-dataset feature mapping (rename keys to a unified namespace)
- Automatic zero-padding for features missing in some datasets
- Per-dataset transform pipelines
- Weighted sampling via dataset weights
- Aggregated stats across all sub-datasets
- A ``meta`` shim compatible with EpisodeAwareSampler and make_policy
-"""
-
-from __future__ import annotations
-
-import logging
-from collections.abc import Callable
-
-import numpy as np
-import torch
-import torch.utils.data
-
-from lerobot.configs.default import SubDatasetConfig
-from lerobot.datasets.compute_stats import aggregate_stats
-from lerobot.datasets.lerobot_dataset import LeRobotDataset
-from lerobot.datasets.transforms import DatasetTransformPipeline
-
-
-class MultiDatasetMeta:
-    """Lightweight metadata shim that exposes the same interface as ``LeRobotDatasetMetadata``.
-
-    Built by aggregating the metadata of multiple sub-datasets after their
-    feature keys have been mapped to a unified namespace.
-    """
-
-    def __init__(
-        self,
-        datasets: list[LeRobotDataset],
-        feature_maps: list[dict[str, str]],
-    ):
-        self._datasets = datasets
-        self._feature_maps = feature_maps
-
-        self._unified_features = self._build_unified_features()
-        self._episodes = self._build_episodes()
-        self._stats = self._build_stats()
-
-    # ------------------------------------------------------------------
-    # Feature union
-    # ------------------------------------------------------------------
-
-    def _build_unified_features(self) -> dict[str, dict]:
-        """Build feature dict as the *union* of all mapped feature keys."""
-        unified: dict[str, dict] = {}
-        for ds, fmap in zip(self._datasets, self._feature_maps):
-            for original_key, feat_info in ds.meta.features.items():
-                mapped_key = fmap.get(original_key, original_key)
-                if mapped_key not in unified:
-                    unified[mapped_key] = dict(feat_info)
-                else:
-                    existing_shape = tuple(unified[mapped_key]["shape"])
-                    new_shape = tuple(feat_info["shape"])
-                    if existing_shape != new_shape and unified[mapped_key]["dtype"] == feat_info["dtype"]:
-                        logging.warning(
-                            "Feature '%s' has shape %s in one dataset but %s in another. "
-                            "The larger shape will be used (padding applied automatically).",
-                            mapped_key,
-                            existing_shape,
-                            new_shape,
-                        )
-                        if np.prod(new_shape) > np.prod(existing_shape):
-                            unified[mapped_key] = dict(feat_info)
-        return unified
-
-    # ------------------------------------------------------------------
-    # Episode metadata (global flat indexing)
-    # ------------------------------------------------------------------
-
-    def _build_episodes(self) -> dict[str, list]:
-        """Concatenate episode boundaries across sub-datasets with frame offsets.
-
-        Produces the same column structure as ``load_episodes()`` so that
-        ``EpisodeAwareSampler`` and ``WeightedEpisodeAwareSampler`` can consume it.
-        """
-        from_indices: list[int] = []
-        to_indices: list[int] = []
-        dataset_source: list[int] = []
-
-        frame_offset = 0
-        for ds_idx, ds in enumerate(self._datasets):
-            eps = ds.meta.episodes
-            for ep in eps:
-                from_indices.append(ep["dataset_from_index"] + frame_offset)
-                to_indices.append(ep["dataset_to_index"] + frame_offset)
-                dataset_source.append(ds_idx)
-            frame_offset += ds.num_frames
-
-        return {
-            "dataset_from_index": from_indices,
-            "dataset_to_index": to_indices,
-            "dataset_source": dataset_source,
-        }
-
-    # ------------------------------------------------------------------
-    # Stats aggregation
-    # ------------------------------------------------------------------
-
-    def _build_stats(self) -> dict[str, dict[str, np.ndarray]]:
-        """Aggregate stats across sub-datasets using mapped feature keys."""
-        mapped_stats_list: list[dict[str, dict]] = []
-        for ds, fmap in zip(self._datasets, self._feature_maps):
-            reverse_map = {v: k for k, v in fmap.items()}
-            mapped: dict[str, dict] = {}
-            for unified_key in self._unified_features:
-                original_key = reverse_map.get(unified_key, unified_key)
-                if original_key in ds.meta.stats:
-                    mapped[unified_key] = ds.meta.stats[original_key]
-            mapped_stats_list.append(mapped)
-
-        return aggregate_stats(mapped_stats_list)
-
-    # ------------------------------------------------------------------
-    # Properties matching LeRobotDatasetMetadata API
-    # ------------------------------------------------------------------
-
-    @property
-    def features(self) -> dict[str, dict]:
-        return self._unified_features
-
-    @property
-    def image_keys(self) -> list[str]:
-        return [k for k, f in self._unified_features.items() if f["dtype"] == "image"]
-
-    @property
-    def video_keys(self) -> list[str]:
-        return [k for k, f in self._unified_features.items() if f["dtype"] == "video"]
-
-    @property
-    def camera_keys(self) -> list[str]:
-        return [k for k, f in self._unified_features.items() if f["dtype"] in ("video", "image")]
-
-    @property
-    def names(self) -> dict[str, list | dict]:
-        return {k: f["names"] for k, f in self._unified_features.items()}
-
-    @property
-    def shapes(self) -> dict[str, tuple]:
-        return {k: tuple(f["shape"]) for k, f in self._unified_features.items()}
-
-    @property
-    def fps(self) -> int:
-        fps_values = {ds.meta.fps for ds in self._datasets}
-        if len(fps_values) > 1:
-            logging.warning("Sub-datasets have different FPS values: %s. Using the first.", fps_values)
-        return self._datasets[0].meta.fps
-
-    @property
-    def stats(self) -> dict[str, dict[str, np.ndarray]]:
-        return self._stats
-
-    @stats.setter
-    def stats(self, value: dict):
-        self._stats = value
-
-    @property
-    def episodes(self) -> dict[str, list]:
-        return self._episodes
-
-    @property
-    def total_episodes(self) -> int:
-        return sum(ds.meta.total_episodes for ds in self._datasets)
-
-    @property
-    def total_frames(self) -> int:
-        return sum(ds.meta.total_frames for ds in self._datasets)
-
-    @property
-    def total_tasks(self) -> int:
-        return sum(ds.meta.total_tasks for ds in self._datasets)
-
-    @property
-    def info(self) -> dict:
-        return {
-            "fps": self.fps,
-            "features": self._unified_features,
-            "total_episodes": self.total_episodes,
-            "total_frames": self.total_frames,
-            "total_tasks": self.total_tasks,
-            "codebase_version": "v3.0",
-        }
-
-
-class NewMultiLeRobotDataset(torch.utils.data.Dataset):
-    """Dataset that wraps multiple ``LeRobotDataset`` instances with feature mapping and padding.
-
-    Each sub-dataset can have different feature names and shapes.  A per-dataset
-    ``feature_map`` renames keys into a shared namespace.  Features that a given
-    sub-dataset does not provide are zero-padded so every ``__getitem__`` returns
-    the full unified feature set.
-    """
-
-    def __init__(
-        self,
-        configs: list[SubDatasetConfig],
-        image_transforms: Callable | None = None,
-        delta_timestamps: dict[str, list[float]] | None = None,
-        tolerance_s: float = 1e-4,
-    ):
-        super().__init__()
-        self._configs = configs
-        self.image_transforms = image_transforms
-
-        self._datasets: list[LeRobotDataset] = []
-        self._feature_maps: list[dict[str, str]] = []
-        self._transform_pipelines: list[DatasetTransformPipeline | None] = []
-        self._weights: list[float] = []
-
-        for cfg in configs:
-            ds = LeRobotDataset(
-                repo_id=cfg.repo_id,
-                root=cfg.root,
-                episodes=cfg.episodes,
-                image_transforms=image_transforms,
-                delta_timestamps=delta_timestamps,
-                tolerance_s=tolerance_s,
-                revision=cfg.revision,
-                video_backend=cfg.video_backend,
-            )
-            self._datasets.append(ds)
-            self._feature_maps.append(cfg.feature_map or {})
-            self._transform_pipelines.append(
-                DatasetTransformPipeline(cfg.transforms) if cfg.transforms else None
-            )
-            self._weights.append(cfg.weight)
-
-        self._meta = MultiDatasetMeta(self._datasets, self._feature_maps)
-
-        # Pre-compute cumulative frame counts for fast index mapping.
-        self._cumulative_frames: list[int] = []
-        total = 0
-        for ds in self._datasets:
-            total += ds.num_frames
-            self._cumulative_frames.append(total)
-
-        # Build reverse maps (unified_key -> original_key) per dataset for padding.
-        self._reverse_maps: list[dict[str, str]] = []
-        for fmap in self._feature_maps:
-            self._reverse_maps.append({v: k for k, v in fmap.items()})
-
-        logging.info(
-            "MultiLeRobotDataset: %d sub-datasets, %d total frames, %d total episodes, "
-            "%d unified features",
-            len(self._datasets),
-            self.num_frames,
-            self.num_episodes,
-            len(self._meta.features),
-        )
-
-    # ------------------------------------------------------------------
-    # Public interface
-    # ------------------------------------------------------------------
-
-    @property
-    def meta(self) -> MultiDatasetMeta:
-        return self._meta
-
-    @property
-    def dataset_weights(self) -> list[float]:
-        return self._weights
-
-    @property
-    def num_frames(self) -> int:
-        return self._cumulative_frames[-1] if self._cumulative_frames else 0
-
-    @property
-    def num_episodes(self) -> int:
-        return sum(ds.num_episodes for ds in self._datasets)
-
-    @property
-    def episodes(self) -> list[int] | None:
-        return None
-
-    @property
-    def fps(self) -> int:
-        return self._meta.fps
-
-    @property
-    def features(self) -> dict[str, dict]:
-        return self._meta.features
-
-    @property
-    def camera_keys(self) -> list[str]:
-        return self._meta.camera_keys
-
-    # ------------------------------------------------------------------
-    # Indexing
-    # ------------------------------------------------------------------
-
-    def _locate(self, idx: int) -> tuple[int, int]:
-        """Map a global frame index to (dataset_index, local_index)."""
-        for ds_idx, cum in enumerate(self._cumulative_frames):
-            if idx < cum:
-                local = idx - (self._cumulative_frames[ds_idx - 1] if ds_idx > 0 else 0)
-                return ds_idx, local
-        raise IndexError(f"Index {idx} out of range (total {self.num_frames})")
-
-    def __len__(self) -> int:
-        return self.num_frames
-
-    def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
-        ds_idx, local_idx = self._locate(idx)
-        item = self._datasets[ds_idx][local_idx]
-
-        # 1. Rename keys according to feature_map.
-        fmap = self._feature_maps[ds_idx]
-        if fmap:
-            renamed: dict[str, torch.Tensor] = {}
-            for key, value in item.items():
-                renamed[fmap.get(key, key)] = value
-            item = renamed
-
-        # 2. Apply per-dataset transform pipeline.
-        pipeline = self._transform_pipelines[ds_idx]
-        if pipeline is not None:
-            item = pipeline(item)
-
-        # 3. Pad missing features with zeros.
-        reverse_map = self._reverse_maps[ds_idx]
-        ds_features = self._datasets[ds_idx].meta.features
-        for unified_key, feat_info in self._meta.features.items():
-            if unified_key in item:
-                continue
-            original_key = reverse_map.get(unified_key, unified_key)
-            if original_key in ds_features:
-                continue
-            shape = tuple(feat_info["shape"])
-            dtype = feat_info["dtype"]
-            if dtype in ("video", "image"):
-                # Camera tensors are (C, H, W) after transforms.
-                c, h, w = (shape[2], shape[0], shape[1]) if len(shape) == 3 else (3, shape[0], shape[1])
-                item[unified_key] = torch.zeros(c, h, w, dtype=torch.float32)
-            elif dtype in ("float32", "float64"):
-                item[unified_key] = torch.zeros(shape, dtype=torch.float32)
-            elif dtype in ("int32", "int64"):
-                item[unified_key] = torch.zeros(shape, dtype=torch.int64)
-            elif dtype == "bool":
-                item[unified_key] = torch.zeros(shape, dtype=torch.bool)
-            else:
-                item[unified_key] = torch.zeros(shape, dtype=torch.float32)
-            item[f"{unified_key}_is_pad"] = torch.tensor(True)
-
-        # 4. Tag which dataset this sample came from.
-        item["dataset_index"] = torch.tensor(ds_idx)
-        return item
-
-    def __repr__(self) -> str:
-        repo_ids = [c.repo_id for c in self._configs]
-        return (
-            f"NewMultiLeRobotDataset(\n"
-            f"  repo_ids={repo_ids},\n"
-            f"  num_frames={self.num_frames},\n"
-            f"  num_episodes={self.num_episodes},\n"
-            f"  unified_features={list(self._meta.features.keys())},\n"
-            f"  weights={self._weights},\n"
-            f")"
-        )
--- a/src/lerobot/datasets/sampler.py
+++ b/src/lerobot/datasets/sampler.py
@@ -59,80 +59,3 @@ class EpisodeAwareSampler:

    def __len__(self) -> int:
        return len(self.indices)
-
-
-class WeightedEpisodeAwareSampler:
-    """Sampler that draws frames from multiple datasets according to per-dataset weights.
-
-    Each iteration first selects a sub-dataset proportionally to its weight, then
-    uniformly samples a frame from that sub-dataset's valid index set.  Episode
-    boundary information is respected so that dropped frames are excluded.
-
-    Args:
-        dataset_from_indices: Start index for each episode (global, flat).
-        dataset_to_indices: End index (exclusive) for each episode (global, flat).
-        dataset_membership: Which sub-dataset each episode belongs to (integer id).
-        dataset_weights: Relative sampling weight per sub-dataset.
-        episode_indices_to_use: If given, only episodes in this set are used.
-        drop_n_first_frames: Frames to skip at the start of each episode.
-        drop_n_last_frames: Frames to skip at the end of each episode.
-        shuffle: Whether to shuffle within each epoch.
-        num_samples: How many samples per epoch. Defaults to total valid frames.
-        generator: Optional torch.Generator for reproducibility.
-    """
-
-    def __init__(
-        self,
-        dataset_from_indices: list[int],
-        dataset_to_indices: list[int],
-        dataset_membership: list[int],
-        dataset_weights: list[float],
-        episode_indices_to_use: list | None = None,
-        drop_n_first_frames: int = 0,
-        drop_n_last_frames: int = 0,
-        shuffle: bool = False,
-        num_samples: int | None = None,
-        generator: torch.Generator | None = None,
-    ):
-        n_datasets = max(dataset_membership) + 1 if dataset_membership else 0
-        self._per_dataset_indices: list[list[int]] = [[] for _ in range(n_datasets)]
-
-        episodes_to_use = set(episode_indices_to_use) if episode_indices_to_use is not None else None
-
-        for ep_idx, (start, end, ds_id) in enumerate(
-            zip(dataset_from_indices, dataset_to_indices, dataset_membership, strict=True)
-        ):
-            if episodes_to_use is not None and ep_idx not in episodes_to_use:
-                continue
-            frame_range = range(start + drop_n_first_frames, end - drop_n_last_frames)
-            self._per_dataset_indices[ds_id].extend(frame_range)
-
-        # Normalise weights (only over datasets that actually have frames).
-        raw_weights = list(dataset_weights[:n_datasets])
-        self._weights = torch.zeros(n_datasets)
-        for i, w in enumerate(raw_weights):
-            if len(self._per_dataset_indices[i]) > 0:
-                self._weights[i] = w
-        total_w = self._weights.sum()
-        if total_w > 0:
-            self._weights /= total_w
-
-        self._total_frames = sum(len(idx) for idx in self._per_dataset_indices)
-        self._num_samples = num_samples if num_samples is not None else self._total_frames
-        self.shuffle = shuffle
-        self._generator = generator
-
-    def __iter__(self) -> Iterator[int]:
-        if not self.shuffle:
-            for ds_indices in self._per_dataset_indices:
-                yield from ds_indices
-            return
-
-        for _ in range(self._num_samples):
-            ds_id = int(torch.multinomial(self._weights, 1, generator=self._generator).item())
-            indices = self._per_dataset_indices[ds_id]
-            local_idx = int(torch.randint(len(indices), (1,), generator=self._generator).item())
-            yield indices[local_idx]
-
-    def __len__(self) -> int:
-        return self._num_samples
--- a/src/lerobot/datasets/transforms.py
+++ b/src/lerobot/datasets/transforms.py
@@ -14,13 +14,11 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 import collections
-import logging
 from collections.abc import Callable, Sequence
 from dataclasses import dataclass, field
 from typing import Any

 import torch
-import torch.nn.functional as F_nn
 from torchvision.transforms import v2
 from torchvision.transforms.v2 import (
    Transform,
@@ -260,114 +258,3 @@ class ImageTransforms(Transform):

    def forward(self, *inputs: Any) -> Any:
        return self.tf(*inputs)
-
-
-
-# Per-dataset transform pipeline (used by MultiLeRobotDataset)
-
-@dataclass
-class DatasetTransformStepConfig:
-    """Config for a single per-dataset transform step."""
-
-    type: str
-    kwargs: dict[str, Any] = field(default_factory=dict)
-
-
-_DATASET_TRANSFORM_REGISTRY: dict[str, type["DatasetTransformStep"]] = {}
-
-
-def register_dataset_transform(name: str):
-    """Decorator to register a DatasetTransformStep by name."""
-
-    def decorator(cls: type["DatasetTransformStep"]) -> type["DatasetTransformStep"]:
-        _DATASET_TRANSFORM_REGISTRY[name] = cls
-        return cls
-
-    return decorator
-
-
-class DatasetTransformStep:
-    """Base class for a single per-dataset transform applied to a sample dict."""
-
-    def __call__(self, sample: dict) -> dict:
-        raise NotImplementedError
-
-
-@register_dataset_transform("pad_action")
-class PadAction(DatasetTransformStep):
-    """Zero-pad the ``action`` tensor to *target_dim* along the last axis."""
-
-    def __init__(self, target_dim: int):
-        self.target_dim = target_dim
-
-    def __call__(self, sample: dict) -> dict:
-        action = sample.get("action")
-        if action is None:
-            return sample
-        current = action.shape[-1]
-        if current < self.target_dim:
-            sample["action"] = F_nn.pad(action, (0, self.target_dim - current))
-        return sample
-
-
-@register_dataset_transform("pad_state")
-class PadState(DatasetTransformStep):
-    """Zero-pad ``observation.state`` to *target_dim* along the last axis."""
-
-    def __init__(self, target_dim: int):
-        self.target_dim = target_dim
-
-    def __call__(self, sample: dict) -> dict:
-        state = sample.get("observation.state")
-        if state is None:
-            return sample
-        current = state.shape[-1]
-        if current < self.target_dim:
-            sample["observation.state"] = F_nn.pad(state, (0, self.target_dim - current))
-        return sample
-
-
-@register_dataset_transform("resize_images")
-class ResizeImages(DatasetTransformStep):
-    """Resize all image/video camera tensors to (height, width)."""
-
-    def __init__(self, height: int, width: int):
-        self.size = (height, width)
-
-    def __call__(self, sample: dict) -> dict:
-        for key in list(sample.keys()):
-            if not key.startswith("observation.images."):
-                continue
-            img = sample[key]
-            if not isinstance(img, torch.Tensor) or img.ndim < 3:
-                continue
-            sample[key] = F.resize(img, self.size, antialias=True)
-        return sample
-
-
-class DatasetTransformPipeline:
-    """Sequential pipeline of DatasetTransformStep instances."""
-
-    def __init__(self, configs: list[DatasetTransformStepConfig] | None = None):
-        self.steps: list[DatasetTransformStep] = []
-        if configs:
-            for cfg in configs:
-                self.steps.append(self._build(cfg))
-
-    @staticmethod
-    def _build(cfg: DatasetTransformStepConfig) -> DatasetTransformStep:
-        cls = _DATASET_TRANSFORM_REGISTRY.get(cfg.type)
-        if cls is None:
-            raise ValueError(
-                f"Unknown dataset transform '{cfg.type}'. "
-                f"Available: {list(_DATASET_TRANSFORM_REGISTRY)}"
-            )
-        return cls(**cfg.kwargs)
-
-    def __call__(self, sample: dict) -> dict:
-        for step in self.steps:
-            sample = step(sample)
-        return sample
-
-    def __repr__(self) -> str:
-        return f"DatasetTransformPipeline(steps={self.steps})"
--- a/src/lerobot/envs/configs.py
+++ b/src/lerobot/envs/configs.py
@@ -45,6 +45,10 @@ class EnvConfig(draccus.ChoiceRegistry, abc.ABC):
    fps: int = 30
    features: dict[str, PolicyFeature] = field(default_factory=dict)
    features_map: dict[str, str] = field(default_factory=dict)
+    # Upper bound on concurrent task evaluation in `lerobot-eval`.
+    # - For lazy wrappers (e.g. LIBERO/LIBERO-plus), values >1 can enable chunked
+    #   task batching with one policy forward pass over multiple tasks.
+    # - For other envs, values >1 use a threaded task scheduler fallback.
    max_parallel_tasks: int = 1
    disable_env_checker: bool = True

@@ -356,7 +360,7 @@ class LiberoPlusEnv(LiberoEnv):
    in experiment configs (`env.type=libero_plus`).
    """

-    task: str = "libero_spatial"
+    task: str = "libero_spatial,libero_object,libero_goal,libero_10"


@EnvConfig.register_subclass("robocasa")
--- a/src/lerobot/envs/docker_runtime.py
+++ b/src/lerobot/envs/docker_runtime.py
@@ -0,0 +1,442 @@
+#!/usr/bin/env python
+
+# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Docker runtime for lerobot-eval.
+
+The policy stays on the host GPU; gym environments run inside Docker containers.
+Each container runs `lerobot-eval-worker`, which calls back to a host HTTP inference
+server for action chunks.
+
+Architecture:
+    host (GPU):
+        1. Load policy + preprocessors from EvalPipelineConfig.
+        2. Start ``policy_servers`` HTTP inference servers on consecutive ports.
+        3. Spawn ``instance_count`` Docker containers, round-robin assigned to servers.
+        4. Wait; collect per-task JSON written to the mounted output volume.
+        5. Merge shards → aggregate → write eval_info.json.
+
+    container (CPU only):
+        1. make_env(cfg.env) → shard tasks by (instance_id, instance_count).
+        2. For each task: run n_episodes, POST obs to /predict_chunk, step env.
+        3. Write per-task JSON to /results/worker_{instance_id}.json.
+"""
+
+from __future__ import annotations
+
+import json
+import logging
+import pickle  # nosec B403 — internal serialisation only
+import platform
+import subprocess  # nosec B404
+import sys
+import threading
+import time
+from http.server import BaseHTTPRequestHandler, HTTPServer
+from pathlib import Path
+from typing import TYPE_CHECKING, Any
+
+import numpy as np
+import torch
+
+from lerobot.envs.factory import make_env_pre_post_processors
+from lerobot.policies.factory import make_policy, make_pre_post_processors
+from lerobot.utils.utils import get_safe_torch_device
+
+if TYPE_CHECKING:
+    from lerobot.configs.eval import EvalPipelineConfig
+
+logger = logging.getLogger(__name__)
+
+
+# ---------------------------------------------------------------------------
+# HTTP inference server (host side)
+# ---------------------------------------------------------------------------
+
+
+class _PolicyInferenceHandler(BaseHTTPRequestHandler):
+    """POST /predict_chunk → pickled numpy action chunk."""
+
+    server: _InferenceServer
+
+    def do_POST(self) -> None:
+        if self.path != "/predict_chunk":
+            self.send_error(404)
+            return
+        length = int(self.headers["Content-Length"])
+        body = self.rfile.read(length)
+        payload: dict = pickle.loads(body)  # nosec B301
+        obs_t: dict = payload["obs_t"]
+
+        with self.server._lock:
+            chunk_np = self.server._predict(obs_t)
+
+        resp = pickle.dumps(chunk_np)  # nosec B301
+        self.send_response(200)
+        self.send_header("Content-Type", "application/octet-stream")
+        self.send_header("Content-Length", str(len(resp)))
+        self.end_headers()
+        self.wfile.write(resp)
+
+    def log_message(self, fmt: str, *args: Any) -> None:  # noqa: ANN401
+        pass  # suppress per-request logs
+
+
+class _InferenceServer(HTTPServer):
+    """Wraps the loaded policy behind a trivial HTTP interface."""
+
+    def __init__(
+        self,
+        addr: tuple[str, int],
+        policy: Any,
+        env_preprocessor: Any,
+        preprocessor: Any,
+        postprocessor: Any,
+    ) -> None:
+        super().__init__(addr, _PolicyInferenceHandler)
+        self._policy = policy
+        self._env_preprocessor = env_preprocessor
+        self._preprocessor = preprocessor
+        self._postprocessor = postprocessor
+        self._lock = threading.Lock()
+        self._device = torch.device(str(policy.config.device))
+
+    def _predict(self, obs_t: dict) -> np.ndarray:
+        """Apply full preprocessing pipeline and return (n_action_steps, A) numpy chunk."""
+        obs = self._env_preprocessor(obs_t)
+        obs = self._preprocessor(obs)
+        obs_gpu: dict = {k: v.to(self._device) if isinstance(v, torch.Tensor) else v for k, v in obs.items()}
+        with torch.no_grad():
+            chunk: torch.Tensor = self._policy.predict_action_chunk(obs_gpu)  # (B, T, A)
+
+        n_action_steps = getattr(self._policy.config, "n_action_steps", chunk.shape[1])
+        batch, n_steps, action_dim = chunk.shape
+        chunk_2d = chunk.reshape(batch * n_steps, action_dim)  # (B*T, A)
+        chunk_2d = self._postprocessor(chunk_2d)  # (B*T, A)
+        return chunk_2d[:n_action_steps].cpu().numpy()  # (n_action_steps, A)
+
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+
+def _get_host_ip() -> str:
+    """Return the IP that containers can use to reach the host."""
+    if platform.system() in ("Darwin", "Windows"):
+        return "host.docker.internal"
+    return "172.17.0.1"  # Linux Docker bridge default gateway
+
+
+def _resolve_image(cfg: EvalPipelineConfig) -> str:
+    """Return the Docker image name to use for the env containers."""
+    if cfg.eval.docker.image:
+        return cfg.eval.docker.image
+    return f"lerobot-benchmark-{cfg.env.type}"
+
+
+def _env_argv() -> list[str]:
+    """Extract --env.* args from sys.argv to forward verbatim to the worker."""
+    return [arg for arg in sys.argv[1:] if arg.startswith("--env.")]
+
+
+def _spawn_container(
+    *,
+    image: str,
+    instance_id: int,
+    instance_count: int,
+    server_address: str,
+    n_episodes: int,
+    seed: int,
+    output_dir: Path,
+    docker_cfg: Any,
+    env_argv: list[str],
+) -> subprocess.Popen:
+    output_dir.mkdir(parents=True, exist_ok=True)
+    container_results = "/results"
+
+    cmd: list[str] = ["docker", "run", "--rm"]
+    if docker_cfg.gpus:
+        cmd += [f"--gpus={docker_cfg.gpus}"]
+    cmd += [f"--shm-size={docker_cfg.shm_size}"]
+    cmd += ["-v", f"{output_dir.resolve()}:{container_results}"]
+    # Allow containers on Linux to resolve host.docker.internal.
+    cmd += ["--add-host=host.docker.internal:host-gateway"]
+    cmd.append(image)
+
+    cmd += [
+        "lerobot-eval-worker",
+        *env_argv,
+        f"--server_address={server_address}",
+        f"--n_episodes={n_episodes}",
+        f"--seed={seed}",
+        f"--instance_id={instance_id}",
+        f"--instance_count={instance_count}",
+        f"--output_path={container_results}/worker_{instance_id}.json",
+    ]
+
+    logger.info(
+        "Spawning container %d/%d: %s",
+        instance_id + 1,
+        instance_count,
+        " ".join(cmd),
+    )
+    return subprocess.Popen(cmd)  # nosec B603 B607
+
+
+# ---------------------------------------------------------------------------
+# Public entry point
+# ---------------------------------------------------------------------------
+
+
+def run_eval_in_docker(cfg: EvalPipelineConfig) -> None:
+    """Run eval with env in Docker containers and policy on the host GPU.
+
+    Writes ``eval_info.json`` to ``cfg.output_dir``.  Called by
+    ``lerobot_eval._run_eval_worker`` when ``eval.runtime == "docker"``.
+    """
+    # Import here to avoid circular import at module level.
+    from lerobot.scripts.lerobot_eval import _aggregate_eval_from_per_task
+
+    start_t = time.time()
+    output_dir = Path(cfg.output_dir)
+    output_dir.mkdir(parents=True, exist_ok=True)
+    docker_cfg = cfg.eval.docker
+
+    # Optionally pull the image before starting.
+    image = _resolve_image(cfg)
+    if docker_cfg.pull:
+        logger.info("Pulling Docker image: %s", image)
+        subprocess.run(["docker", "pull", image], check=True)  # nosec B603 B607
+
+    # ── Load policy + all preprocessors on the host GPU ──────────────────
+    device = get_safe_torch_device(cfg.policy.device, log=True)
+    torch.backends.cudnn.benchmark = True
+    torch.backends.cuda.matmul.allow_tf32 = True
+
+    policy = make_policy(cfg=cfg.policy, env_cfg=cfg.env, rename_map=cfg.rename_map)
+    policy.eval()
+
+    preprocessor_overrides: dict = {
+        "device_processor": {"device": str(device)},
+        "rename_observations_processor": {"rename_map": cfg.rename_map},
+    }
+    preprocessor, postprocessor = make_pre_post_processors(
+        policy_cfg=cfg.policy,
+        pretrained_path=cfg.policy.pretrained_path,
+        preprocessor_overrides=preprocessor_overrides,
+    )
+    env_preprocessor, _env_postprocessor = make_env_pre_post_processors(
+        env_cfg=cfg.env,
+        policy_cfg=cfg.policy,
+    )
+
+    # ── Start HTTP inference server(s) ────────────────────────────────────
+    n_policy_servers = cfg.eval.policy_servers
+    base_port = docker_cfg.port
+    host_ip = _get_host_ip()
+    instance_count = cfg.eval.instance_count
+    env_argv = _env_argv()
+
+    servers: list[_InferenceServer] = []
+    for s_idx in range(n_policy_servers):
+        port = base_port + s_idx
+        if s_idx > 0:
+            policy = make_policy(cfg=cfg.policy, env_cfg=cfg.env, rename_map=cfg.rename_map)
+            policy.eval()
+            preprocessor, postprocessor = make_pre_post_processors(
+                policy_cfg=cfg.policy,
+                pretrained_path=cfg.policy.pretrained_path,
+                preprocessor_overrides=preprocessor_overrides,
+            )
+            env_preprocessor, _ = make_env_pre_post_processors(
+                env_cfg=cfg.env, policy_cfg=cfg.policy,
+            )
+        srv = _InferenceServer(
+            ("0.0.0.0", port),  # nosec B104
+            policy=policy,
+            env_preprocessor=env_preprocessor,
+            preprocessor=preprocessor,
+            postprocessor=postprocessor,
+        )
+        t = threading.Thread(target=srv.serve_forever, daemon=True)
+        t.start()
+        servers.append(srv)
+        logger.info("Policy inference server %d/%d running on port %d", s_idx + 1, n_policy_servers, port)
+
+    # ── Spawn containers (round-robin across policy servers) ──────────────
+    container_dirs: list[Path] = []
+    procs: list[subprocess.Popen] = []
+    try:
+        for i in range(instance_count):
+            assigned_port = base_port + (i % n_policy_servers)
+            server_address = f"{host_ip}:{assigned_port}"
+            shard_dir = output_dir / "shards" / str(i)
+            container_dirs.append(shard_dir)
+            proc = _spawn_container(
+                image=image,
+                instance_id=i,
+                instance_count=instance_count,
+                server_address=server_address,
+                n_episodes=cfg.eval.n_episodes,
+                seed=cfg.seed,
+                output_dir=shard_dir,
+                docker_cfg=docker_cfg,
+                env_argv=env_argv,
+            )
+            procs.append(proc)
+
+        failed: list[tuple[int, int]] = []
+        for i, proc in enumerate(procs):
+            rc = proc.wait()
+            if rc != 0:
+                failed.append((i, rc))
+                logger.error("Container %d/%d exited with code %d", i + 1, instance_count, rc)
+        if failed:
+            raise RuntimeError(f"Docker eval containers failed (instance_id, exit_code): {failed}")
+
+    finally:
+        for srv in servers:
+            srv.shutdown()
+
+    # ── Collect and merge per-task results ───────────────────────────────
+    per_task: list[dict] = []
+    for i, shard_dir in enumerate(container_dirs):
+        result_file = shard_dir / f"worker_{i}.json"
+        with open(result_file) as f:
+            shard_data: dict = json.load(f)
+        per_task.extend(shard_data.get("per_task", []))
+
+    per_task.sort(key=lambda x: (x["task_group"], x["task_id"]))
+
+    info = _aggregate_eval_from_per_task(per_task, total_eval_s=time.time() - start_t)
+    with open(output_dir / "eval_info.json", "w") as f:
+        json.dump(info, f, indent=2)
+
+    logger.info("Docker eval complete. Results: %s/eval_info.json", output_dir)
+
+
+def run_eval_multiprocess(cfg: EvalPipelineConfig) -> None:
+    """Run eval with multiple local worker processes and policy servers (no Docker).
+
+    Same architecture as Docker runtime but spawns `lerobot-eval-worker` as local
+    subprocesses. Works on SLURM clusters and anywhere Docker is unavailable.
+    """
+    from lerobot.scripts.lerobot_eval import _aggregate_eval_from_per_task
+
+    start_t = time.time()
+    output_dir = Path(cfg.output_dir)
+    output_dir.mkdir(parents=True, exist_ok=True)
+
+    device = get_safe_torch_device(cfg.policy.device, log=True)
+    torch.backends.cudnn.benchmark = True
+    torch.backends.cuda.matmul.allow_tf32 = True
+
+    policy = make_policy(cfg=cfg.policy, env_cfg=cfg.env, rename_map=cfg.rename_map)
+    policy.eval()
+
+    preprocessor_overrides: dict = {
+        "device_processor": {"device": str(device)},
+        "rename_observations_processor": {"rename_map": cfg.rename_map},
+    }
+    preprocessor, postprocessor = make_pre_post_processors(
+        policy_cfg=cfg.policy,
+        pretrained_path=cfg.policy.pretrained_path,
+        preprocessor_overrides=preprocessor_overrides,
+    )
+    env_preprocessor, _env_postprocessor = make_env_pre_post_processors(
+        env_cfg=cfg.env, policy_cfg=cfg.policy,
+    )
+
+    n_policy_servers = cfg.eval.policy_servers
+    base_port = cfg.eval.port
+    instance_count = cfg.eval.instance_count
+    env_argv = _env_argv()
+
+    servers: list[_InferenceServer] = []
+    for s_idx in range(n_policy_servers):
+        port = base_port + s_idx
+        if s_idx > 0:
+            policy = make_policy(cfg=cfg.policy, env_cfg=cfg.env, rename_map=cfg.rename_map)
+            policy.eval()
+            preprocessor, postprocessor = make_pre_post_processors(
+                policy_cfg=cfg.policy,
+                pretrained_path=cfg.policy.pretrained_path,
+                preprocessor_overrides=preprocessor_overrides,
+            )
+            env_preprocessor, _ = make_env_pre_post_processors(
+                env_cfg=cfg.env, policy_cfg=cfg.policy,
+            )
+        srv = _InferenceServer(
+            ("0.0.0.0", port),  # nosec B104
+            policy=policy,
+            env_preprocessor=env_preprocessor,
+            preprocessor=preprocessor,
+            postprocessor=postprocessor,
+        )
+        t = threading.Thread(target=srv.serve_forever, daemon=True)
+        t.start()
+        servers.append(srv)
+        logger.info("Policy server %d/%d on port %d", s_idx + 1, n_policy_servers, port)
+
+    worker_dirs: list[Path] = []
+    procs: list[subprocess.Popen] = []
+    try:
+        for i in range(instance_count):
+            assigned_port = base_port + (i % n_policy_servers)
+            shard_dir = output_dir / "shards" / str(i)
+            shard_dir.mkdir(parents=True, exist_ok=True)
+            worker_dirs.append(shard_dir)
+
+            cmd = [
+                sys.executable, "-m", "lerobot.scripts.lerobot_eval_worker",
+                *env_argv,
+                f"--server_address=127.0.0.1:{assigned_port}",
+                f"--n_episodes={cfg.eval.n_episodes}",
+                f"--seed={cfg.seed}",
+                f"--instance_id={i}",
+                f"--instance_count={instance_count}",
+                f"--output_path={shard_dir / f'worker_{i}.json'}",
+            ]
+            logger.info("Spawning worker %d/%d → port %d", i + 1, instance_count, assigned_port)
+            procs.append(subprocess.Popen(cmd))  # nosec B603
+
+        failed: list[tuple[int, int]] = []
+        for i, proc in enumerate(procs):
+            rc = proc.wait()
+            if rc != 0:
+                failed.append((i, rc))
+                logger.error("Worker %d/%d exited with code %d", i + 1, instance_count, rc)
+        if failed:
+            raise RuntimeError(f"Multiprocess eval workers failed (id, exit_code): {failed}")
+
+    finally:
+        for srv in servers:
+            srv.shutdown()
+
+    per_task: list[dict] = []
+    for i, shard_dir in enumerate(worker_dirs):
+        result_file = shard_dir / f"worker_{i}.json"
+        with open(result_file) as f:
+            shard_data: dict = json.load(f)
+        per_task.extend(shard_data.get("per_task", []))
+
+    per_task.sort(key=lambda x: (x["task_group"], x["task_id"]))
+
+    info = _aggregate_eval_from_per_task(per_task, total_eval_s=time.time() - start_t)
+    with open(output_dir / "eval_info.json", "w") as f:
+        json.dump(info, f, indent=2)
+
+    logger.info("Multiprocess eval complete. Results: %s/eval_info.json", output_dir)
--- a/src/lerobot/envs/factory.py
+++ b/src/lerobot/envs/factory.py
@@ -125,7 +125,7 @@ def make_env(
    use_async_envs: bool = False,
    hub_cache_dir: str | None = None,
    trust_remote_code: bool = False,
-) -> dict[str, dict[int, gym.vector.VectorEnv]]:
+) -> dict[str, dict[int, Any]]:
    """Makes a gym vector environment according to the config or Hub reference.

    Args:
@@ -143,8 +143,9 @@ def make_env(
        ModuleNotFoundError: If the requested env package is not installed

    Returns:
-        dict[str, dict[int, gym.vector.VectorEnv]]:
-            A mapping from suite name to indexed vectorized environments.
+        dict[str, dict[int, Any]]:
+            A mapping from suite name to indexed environments. Values are either
+            materialized vector envs or lazy wrappers that materialize on first use.
            - For multi-task benchmarks (e.g., LIBERO): one entry per suite, and one vec env per task_id.
            - For single-task environments: a single suite entry (cfg.type) with task_id=0.

@@ -191,6 +192,11 @@ def make_env(
        if cfg.task is None:
            raise ValueError("LiberoEnv requires a task to be specified")

+        if cfg.type == "libero_plus":
+            from lerobot.envs.libero import _check_libero_plus_assets
+
+            _check_libero_plus_assets()
+
        return create_libero_envs(
            task=cfg.task,
            n_envs=n_envs,
--- a/src/lerobot/envs/lazy_vec_env.py
+++ b/src/lerobot/envs/lazy_vec_env.py
@@ -0,0 +1,58 @@
+#!/usr/bin/env python
+
+# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import annotations
+
+from collections.abc import Callable, Sequence
+from typing import Any
+
+
+class LazyVectorEnv:
+    """Defer vector-env construction until first usage.
+
+    This is useful for benchmarks with many tasks: we can register one env object
+    per task without eagerly allocating all simulator/rendering resources.
+    """
+
+    def __init__(self, env_cls: Callable[[Sequence[Callable[[], Any]]], Any], factory_fns: list[Callable]):
+        self._env_cls = env_cls
+        self._factory_fns = factory_fns
+        self._env = None
+
+    @property
+    def env_cls(self) -> Callable[[Sequence[Callable[[], Any]]], Any]:
+        return self._env_cls
+
+    @property
+    def factory_fns(self) -> list[Callable]:
+        return self._factory_fns
+
+    @property
+    def num_factory_fns(self) -> int:
+        return len(self._factory_fns)
+
+    def materialize(self):
+        if self._env is None:
+            self._env = self._env_cls(self._factory_fns)
+        return self._env
+
+    def close(self):
+        if self._env is not None:
+            self._env.close()
+            self._env = None
+
+    def __getattr__(self, name):
+        return getattr(self.materialize(), name)
+
--- a/src/lerobot/envs/libero.py
+++ b/src/lerobot/envs/libero.py
@@ -16,6 +16,7 @@
 from __future__ import annotations

 import os
+import re
 from collections import defaultdict
 from collections.abc import Callable, Iterable, Mapping, Sequence
 from functools import partial
@@ -28,15 +29,220 @@ import torch
 from gymnasium import spaces

 try:
-    from libero.libero import benchmark, get_libero_path
-    from libero.libero.envs import OffScreenRenderEnv
+    import libero as _libero_pkg  # noqa: F401
 except ImportError:
-    # LIBERO-plus may be installed from source with an extra nested package level.
-    from libero.libero.libero import benchmark, get_libero_path
-    from libero.libero.libero.envs import OffScreenRenderEnv
+    raise ImportError(
+        "Could not import libero. Install benchmark dependencies with one of:\n"
+        "  pip install -e \".[libero]\"\n"
+        "  pip install -e \".[libero_plus]\"  (alias: \".[libero-plus]\")"
+    )

+# LIBERO's env_wrapper unconditionally imports wand (ImageMagick Python binding)
+# which requires the system-level libMagickWand library. The wand features are only
+# used for visual noise perturbations and are not needed for standard evaluation.
+# Pre-install a stub so the import succeeds even without ImageMagick.
+import sys
+import types
+
+if "wand" not in sys.modules:
+    try:
+        import wand.api  # noqa: F401
+    except (ImportError, OSError):
+
+        class _AttrSink:
+            """Accepts any attribute get/set without error."""
+
+            def __getattr__(self, _name):
+                return self
+
+            def __setattr__(self, _name, _value):
+                pass
+
+            def __call__(self, *a, **kw):
+                pass
+
+        _wand = types.ModuleType("wand")
+        _wand_api = types.ModuleType("wand.api")
+        _wand_api.library = _AttrSink()
+        _wand_image = types.ModuleType("wand.image")
+        _wand_image.Image = type("Image", (), {})
+        _wand.api = _wand_api
+        _wand.image = _wand_image
+        sys.modules["wand"] = _wand
+        sys.modules["wand.api"] = _wand_api
+        sys.modules["wand.image"] = _wand_image
+
+from libero.libero import benchmark, get_libero_path
+from libero.libero.envs import OffScreenRenderEnv
+
+from lerobot.envs.lazy_vec_env import LazyVectorEnv
 from lerobot.processor import RobotObservation

+_ASSET_DOWNLOAD_INSTRUCTIONS = """\
+LIBERO-plus assets not found at: {assets_dir}
+
+The LIBERO-plus benchmark requires ~6 GB of scene/texture/object assets that
+are hosted separately on Hugging Face.  To download and install them:
+
+    python -c "
+from huggingface_hub import hf_hub_download
+hf_hub_download('Sylvest/LIBERO-plus', 'assets.zip',
+                repo_type='dataset', local_dir='/tmp/libero-plus-assets')
+"
+    unzip /tmp/libero-plus-assets/assets.zip -d /tmp/libero-plus-assets-unzipped
+    # The zip contains a deeply nested path; move the assets directory:
+    mv /tmp/libero-plus-assets-unzipped/inspire/*/assets {assets_dir}
+    rm -rf /tmp/libero-plus-assets /tmp/libero-plus-assets-unzipped
+
+See https://huggingface.co/datasets/Sylvest/LIBERO-plus for details.
+"""
+
+
+def _check_libero_plus_assets() -> None:
+    """Validate that LIBERO-plus scene assets are present."""
+    assets_dir = Path(get_libero_path("benchmark_root")) / "assets"
+    if not (assets_dir / "scenes").is_dir():
+        raise FileNotFoundError(_ASSET_DOWNLOAD_INSTRUCTIONS.format(assets_dir=assets_dir))
+
+
+# ---- Perturbation support for LIBERO-Plus -----------------------------------
+
+PERTURBATION_DIMENSIONS = (
+    "Camera Viewpoints",
+    "Robot Initial States",
+    "Language Instructions",
+    "Light Conditions",
+    "Background Textures",
+    "Sensor Noise",
+    "Objects Layout",
+)
+
+PERTURBATION_SHORT_KEYS = {
+    "Camera Viewpoints": "camera",
+    "Robot Initial States": "robot",
+    "Language Instructions": "language",
+    "Light Conditions": "light",
+    "Background Textures": "background",
+    "Sensor Noise": "noise",
+    "Objects Layout": "layout",
+}
+
+
+def load_task_classification() -> dict:
+    """Load task_classification.json shipped with LIBERO-Plus."""
+    import json
+
+    benchmark_root = Path(get_libero_path("benchmark_root"))
+    candidates = [
+        benchmark_root / "benchmark" / "task_classification.json",
+        benchmark_root / "task_classification.json",
+        benchmark_root.parent / "benchmark" / "task_classification.json",
+    ]
+    for p in candidates:
+        if p.exists():
+            with open(p) as f:
+                return json.load(f)
+    raise FileNotFoundError(
+        f"task_classification.json not found. Tried: {[str(c) for c in candidates]}"
+    )
+
+
+def build_perturbation_index(suite_name: str) -> dict[int, str]:
+    """Return {0-indexed task_id: perturbation_dimension} for *suite_name*."""
+    tc = load_task_classification()
+    suite_data = tc.get(suite_name, {})
+    index: dict[int, str] = {}
+
+    # LIBERO-Plus task_classification.json has appeared in two shapes:
+    # 1) dict[suite][task_id_str] -> meta
+    # 2) dict[suite] -> list[{id, category, ...}]
+    if isinstance(suite_data, list):
+        for item in suite_data:
+            if not isinstance(item, dict):
+                continue
+            raw_id = item.get("id")
+            if raw_id is None:
+                continue
+            try:
+                # list-form ids are 1-indexed in current LIBERO-Plus release.
+                tid = int(raw_id) - 1
+            except (TypeError, ValueError):
+                continue
+            if tid < 0:
+                continue
+            dim = item.get("perturbation_type") or item.get("category", "unknown")
+            index[tid] = dim
+        return index
+
+    if isinstance(suite_data, dict):
+        # Handle both 0-indexed and 1-indexed key conventions.
+        numeric_keys: list[int] = []
+        for k in suite_data:
+            try:
+                numeric_keys.append(int(k))
+            except (TypeError, ValueError):
+                continue
+        one_indexed = bool(numeric_keys) and 0 not in numeric_keys and min(numeric_keys) >= 1
+
+        for task_id_str, meta in suite_data.items():
+            try:
+                tid = int(task_id_str)
+            except (TypeError, ValueError):
+                continue
+            if one_indexed:
+                tid -= 1
+            if tid < 0:
+                continue
+            if isinstance(meta, dict):
+                dim = meta.get("perturbation_type") or meta.get("category", "unknown")
+            else:
+                dim = "unknown"
+            index[tid] = dim
+        return index
+
+    return index
+
+
+def aggregate_by_perturbation(
+    per_task: list[dict], suite_indices: dict[str, dict[int, str]]
+) -> dict[str, dict]:
+    """Aggregate per-task eval results by perturbation dimension.
+
+    Args:
+        per_task: list of {"task_group": str, "task_id": int, "metrics": {...}}
+        suite_indices: {suite_name: {task_id: dimension_name}} from build_perturbation_index
+
+    Returns:
+        {short_key: {"pc_success": float, "n_episodes": int}} for each perturbation dimension
+    """
+    dim_successes: dict[str, list] = defaultdict(list)
+    for entry in per_task:
+        suite = entry["task_group"]
+        tid = entry["task_id"]
+        idx = suite_indices.get(suite, {})
+        dim = idx.get(tid, "unknown")
+        short = PERTURBATION_SHORT_KEYS.get(dim, dim.lower().replace(" ", "_"))
+        successes = entry["metrics"].get("successes", [])
+        dim_successes[short].extend(successes)
+
+    results: dict[str, dict] = {}
+    all_successes: list = []
+    for short_key in list(PERTURBATION_SHORT_KEYS.values()) + ["unknown"]:
+        if short_key not in dim_successes:
+            continue
+        s = dim_successes[short_key]
+        all_successes.extend(s)
+        results[short_key] = {
+            "pc_success": float(np.nanmean(s) * 100) if s else float("nan"),
+            "n_episodes": len(s),
+        }
+    if all_successes:
+        results["total"] = {
+            "pc_success": float(np.nanmean(all_successes) * 100),
+            "n_episodes": len(all_successes),
+        }
+    return results
+

 def _parse_camera_names(camera_name: str | Sequence[str]) -> list[str]:
    """Normalize camera_name into a non-empty list of strings."""
@@ -74,13 +280,35 @@ def _select_task_ids(total_tasks: int, task_ids: Iterable[int] | None) -> list[i


 def get_task_init_states(task_suite: Any, i: int) -> np.ndarray:
-    init_states_path = (
-        Path(get_libero_path("init_states"))
-        / task_suite.tasks[i].problem_folder
-        / task_suite.tasks[i].init_states_file
+    init_states_dir = Path(get_libero_path("init_states")) / task_suite.tasks[i].problem_folder
+    init_states_file = task_suite.tasks[i].init_states_file
+
+    # 1. Direct match
+    direct = init_states_dir / init_states_file
+    if direct.exists():
+        return torch.load(direct, weights_only=False)  # nosec B614
+
+    # 2. LIBERO-Plus perturbation filenames append suffixes like
+    #    _view_0_0_100_0_0_initstate_1, _language_19, _noise_45, _table_1, _tb_9, _add_16
+    #    to the base task name.  Instead of regex-matching every variant, find the
+    #    longest existing base file whose stem is a prefix of the perturbation stem.
+    stem, ext = os.path.splitext(init_states_file)
+    best_match: Path | None = None
+    best_len = 0
+    for candidate in init_states_dir.glob(f"*{ext}"):
+        cstem = candidate.stem
+        if stem == cstem or (stem.startswith(cstem) and stem[len(cstem)] == "_"):
+            if len(cstem) > best_len:
+                best_len = len(cstem)
+                best_match = candidate
+
+    if best_match is not None:
+        return torch.load(best_match, weights_only=False)  # nosec B614
+
+    raise FileNotFoundError(
+        f"Could not find init states for task {i}. "
+        f"Tried '{init_states_file}' and prefix matching in '{init_states_dir}'."
    )
-    init_states = torch.load(init_states_path, weights_only=False)  # nosec B614
-    return init_states


 def get_libero_dummy_action():
@@ -100,6 +328,29 @@ TASK_SUITE_MAX_STEPS: dict[str, int] = {
 }


+def _make_offscreen_env_with_renderer_fallback(env_args: dict[str, Any]) -> Any:
+    """Create OffScreenRenderEnv and fallback to OSMesa if EGL is unavailable."""
+    try:
+        return OffScreenRenderEnv(**env_args)
+    except ImportError as exc:
+        msg = str(exc)
+        if "EGL" not in msg and "PLATFORM_DEVICE" not in msg:
+            raise
+
+        # Headless clusters often miss EGL PLATFORM_DEVICE support. Retry with
+        # software rendering to keep evaluation working.
+        os.environ["MUJOCO_GL"] = "osmesa"
+        os.environ["PYOPENGL_PLATFORM"] = "osmesa"
+        try:
+            return OffScreenRenderEnv(**env_args)
+        except Exception as fallback_exc:
+            raise ImportError(
+                "Failed to initialize robosuite offscreen renderer with both EGL and "
+                "OSMesa backends. Set up EGL-capable drivers or install OSMesa (e.g. "
+                "`conda install -c conda-forge mesalib`) and retry."
+            ) from fallback_exc
+
+
 class LiberoEnv(gym.Env):
    metadata = {"render_modes": ["rgb_array"], "render_fps": 80}

@@ -153,6 +404,7 @@ class LiberoEnv(gym.Env):
        # Load once and keep
        self._init_states = get_task_init_states(task_suite, self.task_id) if self.init_states else None
        self._reset_stride = n_envs  # when performing a reset, append `_reset_stride` to `init_state_id`.
+        self._init_state_error_warned = False

        self.init_state_id = self.episode_index  # tie each sub-env to a fixed init state

@@ -244,7 +496,7 @@ class LiberoEnv(gym.Env):
            "camera_heights": self.observation_height,
            "camera_widths": self.observation_width,
        }
-        env = OffScreenRenderEnv(**env_args)
+        env = _make_offscreen_env_with_renderer_fallback(env_args)
        env.reset()
        return env

@@ -304,8 +556,21 @@ class LiberoEnv(gym.Env):
        self._env.seed(seed)
        raw_obs = self._env.reset()
        if self.init_states and self._init_states is not None:
-            raw_obs = self._env.set_init_state(self._init_states[self.init_state_id % len(self._init_states)])
-            self.init_state_id += self._reset_stride  # Change init_state_id when reset
+            try:
+                raw_obs = self._env.set_init_state(self._init_states[self.init_state_id % len(self._init_states)])
+                self.init_state_id += self._reset_stride  # Change init_state_id when reset
+            except Exception as exc:
+                # Some LIBERO-Plus perturbation tasks (notably object-layout variants)
+                # can have different simulator state dimensions than their base init files.
+                # Fall back to plain env.reset() instead of aborting the whole evaluation.
+                self.init_states = False
+                if not self._init_state_error_warned:
+                    print(
+                        "WARNING: Failed to apply init state for "
+                        f"task_id={self.task_id} ({self.task}). "
+                        f"Falling back to plain reset. Error: {exc}"
+                    )
+                    self._init_state_error_warned = True

        # After reset, objects may be unstable (slightly floating, intersecting, etc.).
        # Step the simulator with a no-op action for a few frames so everything settles.
@@ -331,7 +596,17 @@ class LiberoEnv(gym.Env):
                f"Expected action to be 1-D (shape (action_dim,)), "
                f"but got shape {action.shape} with ndim={action.ndim}"
            )
-        raw_obs, reward, done, info = self._env.step(action)
+
+        try:
+            raw_obs, reward, done, info = self._env.step(action)
+        except ValueError as e:
+            if "terminated episode" not in str(e):
+                raise
+            # Robosuite's internal done flag is stale (e.g. from a previous
+            # termination that wasn't properly cleared by SyncVectorEnv).
+            # Signal termination so the caller resets us.
+            obs, reset_info = self.reset()
+            return obs, 0.0, True, False, {"is_success": False, **reset_info}

        is_success = self._env.check_success()
        terminated = done or is_success
@@ -351,7 +626,6 @@ class LiberoEnv(gym.Env):
                "done": bool(done),
                "is_success": bool(is_success),
            }
-            self.reset()
        truncated = False
        return observation, reward, terminated, truncated, info

@@ -394,6 +668,9 @@ def _make_env_fns(
    return fns


+_LazyVecEnv = LazyVectorEnv
+
+
 # ---- Main API ----------------------------------------------------------------


@@ -437,12 +714,23 @@ def create_libero_envs(
        print(f"Restricting to task_ids={task_ids_filter}")

    out: dict[str, dict[int, Any]] = defaultdict(dict)
+    total_tasks = 0
    for suite_name in suite_names:
        suite = _get_suite(suite_name)
        total = len(suite.tasks)
        selected = _select_task_ids(total, task_ids_filter)
        if not selected:
            raise ValueError(f"No tasks selected for suite '{suite_name}' (available: {total}).")
+        total_tasks += len(selected)
+
+    lazy = total_tasks > 1
+    if lazy:
+        print(f"Using lazy env creation for {total_tasks} tasks (envs created on demand)")
+
+    for suite_name in suite_names:
+        suite = _get_suite(suite_name)
+        total = len(suite.tasks)
+        selected = _select_task_ids(total, task_ids_filter)

        for tid in selected:
            fns = _make_env_fns(
@@ -456,8 +744,11 @@ def create_libero_envs(
                gym_kwargs=gym_kwargs,
                control_mode=control_mode,
            )
-            out[suite_name][tid] = env_cls(fns)
-            print(f"Built vec env | suite={suite_name} | task_id={tid} | n_envs={n_envs}")
+            if lazy:
+                out[suite_name][tid] = LazyVectorEnv(env_cls, fns)
+            else:
+                out[suite_name][tid] = env_cls(fns)
+                print(f"Built vec env | suite={suite_name} | task_id={tid} | n_envs={n_envs}")

    # return plain dicts for predictability
    return {suite: dict(task_map) for suite, task_map in out.items()}
--- a/src/lerobot/envs/metaworld.py
+++ b/src/lerobot/envs/metaworld.py
@@ -25,6 +25,7 @@ import metaworld.policies as policies
 import numpy as np
 from gymnasium import spaces

+from lerobot.envs.lazy_vec_env import LazyVectorEnv
 from lerobot.processor import RobotObservation

 # ---- Load configuration data from the external JSON file ----
@@ -297,19 +298,24 @@ def create_metaworld_envs(

    print(f"Creating Meta-World envs | task_groups={task_groups} | n_envs(per task)={n_envs}")

+    group_to_tasks = {group: DIFFICULTY_TO_TASKS.get(group, [group]) for group in task_groups}
+    total_tasks = sum(len(tasks) for tasks in group_to_tasks.values())
+    lazy = total_tasks > 50
+    if lazy:
+        print(f"Using lazy env creation for {total_tasks} tasks (envs created on demand)")
+
    out: dict[str, dict[int, Any]] = defaultdict(dict)

    for group in task_groups:
-        # if not in difficulty presets, treat it as a single custom task
-        tasks = DIFFICULTY_TO_TASKS.get(group, [group])
+        tasks = group_to_tasks[group]

        for tid, task_name in enumerate(tasks):
-            print(f"Building vec env | group={group} | task_id={tid} | task={task_name}")
+            if not lazy:
+                print(f"Building vec env | group={group} | task_id={tid} | task={task_name}")

            # build n_envs factories
            fns = [(lambda tn=task_name: MetaworldEnv(task=tn, **gym_kwargs)) for _ in range(n_envs)]
-
-            out[group][tid] = env_cls(fns)
+            out[group][tid] = LazyVectorEnv(env_cls, fns) if lazy else env_cls(fns)

    # return a plain dict for consistency
    return {group: dict(task_map) for group, task_map in out.items()}
--- a/src/lerobot/envs/robocasa.py
+++ b/src/lerobot/envs/robocasa.py
@@ -24,6 +24,7 @@ import gymnasium as gym
 import numpy as np
 from gymnasium import spaces

+from lerobot.envs.lazy_vec_env import LazyVectorEnv

 # Action layout (flat 12D, normalized to [-1, 1]):
 #   [0:3]   end_effector_position  (delta x, y, z)
@@ -256,8 +257,12 @@ def create_robocasa_envs(

    gym_kwargs = dict(gym_kwargs or {})
    out: dict[str, dict[int, Any]] = defaultdict(dict)
+    total_tasks = len(task_list)
+    lazy = total_tasks > 50

    print(f"Creating RoboCasa envs | tasks={task_list} | n_envs(per task)={n_envs} | split={split}")
+    if lazy:
+        print(f"Using lazy env creation for {total_tasks} tasks (envs created on demand)")
    for task in task_list:
        fns = _make_env_fns(
            task=task,
@@ -267,7 +272,8 @@ def create_robocasa_envs(
            episode_length=episode_length,
            gym_kwargs=gym_kwargs,
        )
-        out["robocasa"][len(out["robocasa"])] = env_cls(fns)
-        print(f"  Built vec env | task={task} | n_envs={n_envs}")
+        out["robocasa"][len(out["robocasa"])] = LazyVectorEnv(env_cls, fns) if lazy else env_cls(fns)
+        if not lazy:
+            print(f"  Built vec env | task={task} | n_envs={n_envs}")

    return {suite: dict(task_map) for suite, task_map in out.items()}
--- a/src/lerobot/envs/robomme.py
+++ b/src/lerobot/envs/robomme.py
@@ -14,12 +14,16 @@ Install: pip install robomme  (or from source: https://github.com/RoboMME/robomm

 from __future__ import annotations

+from collections.abc import Callable, Sequence
+from functools import partial
 from typing import Any

 import gymnasium as gym
 import numpy as np
 from gymnasium import spaces

+from lerobot.envs.lazy_vec_env import LazyVectorEnv
+
 ROBOMME_TASKS = [
    "BinFill", "PickXtimes", "SwingXtimes", "StopCube",
    "VideoUnmask", "VideoUnmaskSwap", "ButtonUnmask", "ButtonUnmaskSwap",
@@ -113,6 +117,29 @@ class RoboMMEGymEnv(gym.Env):
        }


+def _make_env_fns(
+    *,
+    task: str,
+    n_envs: int,
+    action_space_type: str,
+    dataset: str,
+    episode_length: int,
+    task_id: int,
+) -> list[Callable[[], RoboMMEGymEnv]]:
+    """Build n_envs factory callables for one RoboMME task id."""
+
+    def _make_one(episode_index: int) -> RoboMMEGymEnv:
+        return RoboMMEGymEnv(
+            task=task,
+            action_space_type=action_space_type,
+            dataset=dataset,
+            episode_idx=episode_index,
+            max_steps=episode_length,
+        )
+
+    return [partial(_make_one, task_id + i) for i in range(n_envs)]
+
+
 def create_robomme_envs(
    task: str,
    n_envs: int = 1,
@@ -120,35 +147,35 @@ def create_robomme_envs(
    dataset: str = "test",
    episode_length: int = 300,
    task_ids: list[int] | None = None,
-    env_cls=None,
+    env_cls: Callable[[Sequence[Callable[[], Any]]], Any] | None = None,
 ) -> dict[str, dict[int, gym.vector.VectorEnv]]:
    """Create vectorized RoboMME environments for evaluation.

    Returns {suite_name: {task_id: VectorEnv}} matching lerobot's expected format.
    """
-    if env_cls is None:
-        env_cls = gym.vector.SyncVectorEnv
+    if env_cls is None or not callable(env_cls):
+        raise ValueError("env_cls must be a callable that wraps a list of env factory callables.")
+    if not isinstance(n_envs, int) or n_envs <= 0:
+        raise ValueError(f"n_envs must be a positive int; got {n_envs}.")

    if task_ids is None:
        task_ids = [0]

    suite_name = "robomme"
    envs_by_task = {}
+    lazy = len(task_ids) > 50
+    if lazy:
+        print(f"Using lazy env creation for {len(task_ids)} tasks (envs created on demand)")

    for task_id in task_ids:
-        def _make_one(ep_idx=task_id):
-            return RoboMMEGymEnv(
-                task=task,
-                action_space_type=action_space_type,
-                dataset=dataset,
-                episode_idx=ep_idx,
-                max_steps=episode_length,
-            )
-
-        vec = env_cls(
-            [_make_one for _ in range(n_envs)],
-            autoreset_mode=gym.vector.AutoresetMode.SAME_STEP,
+        fns = _make_env_fns(
+            task=task,
+            n_envs=n_envs,
+            action_space_type=action_space_type,
+            dataset=dataset,
+            episode_length=episode_length,
+            task_id=task_id,
        )
-        envs_by_task[task_id] = vec
+        envs_by_task[task_id] = LazyVectorEnv(env_cls, fns) if lazy else env_cls(fns)

    return {suite_name: envs_by_task}
--- a/src/lerobot/scripts/lerobot_benchmark.py
+++ b/src/lerobot/scripts/lerobot_benchmark.py
@@ -0,0 +1,462 @@
+#!/usr/bin/env python
+
+# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Benchmark runner: train and evaluate policies across simulation benchmarks.
+
+Orchestrates per-benchmark training and evaluation using the existing
+``lerobot-train`` and ``lerobot-eval`` CLI tools.
+
+Typical usage::
+
+    # Train SmolVLA on LIBERO-plus (4 GPUs, 50k steps):
+    lerobot-benchmark train \\
+        --benchmarks libero_plus \\
+        --policy-path lerobot/smolvla_base \\
+        --hub-user $HF_USER \\
+        --num-gpus 4 --steps 50000
+
+    # Evaluate the trained policies:
+    lerobot-benchmark eval \\
+        --benchmarks libero_plus \\
+        --hub-user $HF_USER
+
+    # Full pipeline (train → upload → eval) for multiple benchmarks:
+    lerobot-benchmark all \\
+        --benchmarks libero_plus,robocasa,robomme \\
+        --policy-path lerobot/smolvla_base \\
+        --hub-user $HF_USER \\
+        --num-gpus 4 --steps 50000
+"""
+
+from __future__ import annotations
+
+import argparse
+import logging
+import shutil
+import subprocess
+import sys
+from dataclasses import dataclass, field
+from pathlib import Path
+
+logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
+log = logging.getLogger(__name__)
+
+
+@dataclass
+class BenchmarkEntry:
+    """Training + evaluation settings for a single benchmark.
+
+    When ``eval_tasks`` is set, evaluation runs once per task in the list
+    (e.g. libero_spatial, libero_object, …).  ``env_task`` is still used as
+    the task for mid-training evaluation during ``lerobot-train``.
+    """
+
+    dataset_repo_id: str
+    env_type: str
+    env_task: str
+    eval_tasks: list[str] | None = None
+    train_overrides: dict[str, str] = field(default_factory=dict)
+    eval_overrides: dict[str, str] = field(default_factory=dict)
+
+
+LIBERO_SUITES = ["libero_spatial", "libero_object", "libero_goal", "libero_10"]
+
+# Each benchmark maps a human-readable name to its dataset and eval env.
+# ``dataset_repo_id`` can contain ``{hub_user}`` which is interpolated at
+# runtime from ``--hub-user``.
+BENCHMARK_REGISTRY: dict[str, BenchmarkEntry] = {
+    "libero": BenchmarkEntry(
+        dataset_repo_id="{hub_user}/libero",
+        env_type="libero",
+        env_task="libero_spatial",
+        eval_tasks=LIBERO_SUITES,
+    ),
+    "libero_plus": BenchmarkEntry(
+        dataset_repo_id="{hub_user}/libero_plus",
+        env_type="libero_plus",
+        env_task="libero_spatial",
+        eval_tasks=LIBERO_SUITES,
+    ),
+    "metaworld": BenchmarkEntry(
+        dataset_repo_id="{hub_user}/metaworld",
+        env_type="metaworld",
+        env_task="metaworld-push-v2",
+    ),
+    "robocasa": BenchmarkEntry(
+        dataset_repo_id="{hub_user}/robocasa",
+        env_type="robocasa",
+        env_task="PickPlaceCounterToCabinet",
+    ),
+    "robomme": BenchmarkEntry(
+        dataset_repo_id="{hub_user}/robomme",
+        env_type="robomme",
+        env_task="PickXtimes",
+    ),
+}
+
+
+def _policy_repo_id(hub_user: str, policy_name: str, benchmark: str) -> str:
+    return f"{hub_user}/{policy_name}_{benchmark}"
+
+
+def _extra_keys(extra_args: list[str]) -> set[str]:
+    """Extract ``--key`` prefixes from extra CLI args for override detection."""
+    keys: set[str] = set()
+    for arg in extra_args:
+        if arg.startswith("--") and "=" in arg:
+            keys.add(arg.split("=", 1)[0])
+    return keys
+
+
+def _build_train_cmd(
+    benchmark: BenchmarkEntry,
+    *,
+    policy_path: str,
+    hub_user: str,
+    policy_name: str,
+    benchmark_name: str,
+    num_gpus: int,
+    steps: int,
+    batch_size: int,
+    eval_freq: int,
+    save_freq: int,
+    wandb: bool,
+    extra_args: list[str],
+) -> list[str]:
+    """Build the ``accelerate launch lerobot-train`` command list."""
+    lerobot_train = shutil.which("lerobot-train")
+    if lerobot_train is None:
+        raise RuntimeError("lerobot-train not found on PATH. Is lerobot installed?")
+
+    # Strip bare "--" separators that argparse may pass through
+    cleaned_extra = [a for a in extra_args if a != "--"]
+    overridden = _extra_keys(cleaned_extra)
+
+    repo_id = _policy_repo_id(hub_user, policy_name, benchmark_name)
+    dataset_id = benchmark.dataset_repo_id.format(hub_user=hub_user)
+
+    defaults: list[tuple[str, str]] = [
+        ("--policy.path", policy_path),
+        ("--dataset.repo_id", dataset_id),
+        ("--policy.repo_id", repo_id),
+        ("--env.type", benchmark.env_type),
+        ("--env.task", benchmark.env_task),
+        ("--steps", str(steps)),
+        ("--batch_size", str(batch_size)),
+        ("--eval_freq", str(eval_freq)),
+        ("--save_freq", str(save_freq)),
+        ("--output_dir", f"outputs/train/{policy_name}_{benchmark_name}"),
+        ("--job_name", f"{policy_name}_{benchmark_name}"),
+        ("--policy.push_to_hub", "true"),
+    ]
+    if wandb:
+        defaults.append(("--wandb.enable", "true"))
+    for k, v in benchmark.train_overrides.items():
+        defaults.append((f"--{k}", v))
+
+    cmd: list[str] = [
+        "accelerate", "launch",
+        "--multi_gpu",
+        f"--num_processes={num_gpus}",
+        lerobot_train,
+    ]
+    for key, val in defaults:
+        if key not in overridden:
+            cmd.append(f"{key}={val}")
+    cmd.extend(cleaned_extra)
+    return cmd
+
+
+def _build_eval_cmd(
+    benchmark: BenchmarkEntry,
+    *,
+    hub_user: str,
+    policy_name: str,
+    benchmark_name: str,
+    eval_task: str | None = None,
+    n_episodes: int,
+    batch_size_eval: int,
+    extra_args: list[str],
+) -> list[str]:
+    """Build the ``lerobot-eval`` command list.
+
+    ``eval_task`` overrides the benchmark's ``env_task`` so the same
+    benchmark can be evaluated on multiple suites (e.g. LIBERO).
+    """
+    lerobot_eval = shutil.which("lerobot-eval")
+    if lerobot_eval is None:
+        raise RuntimeError("lerobot-eval not found on PATH. Is lerobot installed?")
+
+    task = eval_task or benchmark.env_task
+    repo_id = _policy_repo_id(hub_user, policy_name, benchmark_name)
+    out_dir = _eval_output_dir(policy_name, benchmark_name, eval_task=task)
+
+    cleaned_extra = [a for a in extra_args if a != "--"]
+    overridden = _extra_keys(cleaned_extra)
+
+    defaults: list[tuple[str, str]] = [
+        ("--policy.path", repo_id),
+        ("--env.type", benchmark.env_type),
+        ("--env.task", task),
+        ("--eval.n_episodes", str(n_episodes)),
+        ("--eval.batch_size", str(batch_size_eval)),
+        ("--output_dir", out_dir),
+        ("--policy.device", "cuda"),
+    ]
+    for k, v in benchmark.eval_overrides.items():
+        defaults.append((f"--{k}", v))
+
+    cmd: list[str] = [lerobot_eval]
+    for key, val in defaults:
+        if key not in overridden:
+            cmd.append(f"{key}={val}")
+    cmd.extend(cleaned_extra)
+    return cmd
+
+
+def _eval_output_dir(policy_name: str, benchmark_name: str, eval_task: str | None = None) -> Path:
+    if eval_task:
+        return Path(f"outputs/eval/{policy_name}_{benchmark_name}/{eval_task}")
+    return Path(f"outputs/eval/{policy_name}_{benchmark_name}")
+
+
+def _run(cmd: list[str], *, dry_run: bool) -> None:
+    log.info("Command: %s", " \\\n    ".join(cmd))
+    if dry_run:
+        log.info("[dry-run] Skipping execution.")
+        return
+    result = subprocess.run(cmd, check=False)
+    if result.returncode != 0:
+        log.error("Command failed with exit code %d", result.returncode)
+        sys.exit(result.returncode)
+
+
+def _push_eval_to_hub(
+    *,
+    hub_user: str,
+    policy_name: str,
+    benchmark_name: str,
+    eval_task: str | None = None,
+    dry_run: bool,
+) -> None:
+    """Upload eval results (metrics + videos) to the policy repo on the Hub."""
+    from huggingface_hub import HfApi
+
+    repo_id = _policy_repo_id(hub_user, policy_name, benchmark_name)
+    local_dir = _eval_output_dir(policy_name, benchmark_name, eval_task=eval_task)
+    hub_path = f"eval/{eval_task}" if eval_task else f"eval/{benchmark_name}"
+
+    if not local_dir.exists():
+        log.warning("Eval output dir %s does not exist, skipping hub upload.", local_dir)
+        return
+
+    log.info("Uploading eval results from %s to %s (path_in_repo=%s)", local_dir, repo_id, hub_path)
+    if dry_run:
+        log.info("[dry-run] Skipping upload.")
+        return
+
+    api = HfApi()
+    api.upload_folder(
+        folder_path=str(local_dir),
+        repo_id=repo_id,
+        path_in_repo=hub_path,
+        repo_type="model",
+        commit_message=f"Upload eval results for {eval_task or benchmark_name}",
+    )
+
+
+def _resolve_benchmarks(names: str) -> list[tuple[str, BenchmarkEntry]]:
+    out = []
+    for name in names.split(","):
+        name = name.strip()
+        if name not in BENCHMARK_REGISTRY:
+            available = ", ".join(BENCHMARK_REGISTRY)
+            raise ValueError(f"Unknown benchmark '{name}'. Available: {available}")
+        out.append((name, BENCHMARK_REGISTRY[name]))
+    return out
+
+
+def cmd_train(args: argparse.Namespace) -> None:
+    benchmarks = _resolve_benchmarks(args.benchmarks)
+    for bname, bentry in benchmarks:
+        log.info("=== Training on benchmark: %s ===", bname)
+        cmd = _build_train_cmd(
+            bentry,
+            policy_path=args.policy_path,
+            hub_user=args.hub_user,
+            policy_name=args.policy_name,
+            benchmark_name=bname,
+            num_gpus=args.num_gpus,
+            steps=args.steps,
+            batch_size=args.batch_size,
+            eval_freq=args.eval_freq,
+            save_freq=args.save_freq,
+            wandb=args.wandb,
+            extra_args=args.extra,
+        )
+        _run(cmd, dry_run=args.dry_run)
+
+
+def _run_eval_for_benchmark(
+    bname: str,
+    bentry: BenchmarkEntry,
+    args: argparse.Namespace,
+) -> None:
+    """Run evaluation for a single benchmark, iterating over all its eval_tasks."""
+    tasks = bentry.eval_tasks or [bentry.env_task]
+    for task in tasks:
+        log.info("=== Evaluating %s / %s ===", bname, task)
+        cmd = _build_eval_cmd(
+            bentry,
+            hub_user=args.hub_user,
+            policy_name=args.policy_name,
+            benchmark_name=bname,
+            eval_task=task if bentry.eval_tasks else None,
+            n_episodes=args.n_episodes,
+            batch_size_eval=args.batch_size_eval,
+            extra_args=args.extra,
+        )
+        _run(cmd, dry_run=args.dry_run)
+        if args.push_eval_to_hub:
+            _push_eval_to_hub(
+                hub_user=args.hub_user,
+                policy_name=args.policy_name,
+                benchmark_name=bname,
+                eval_task=task if bentry.eval_tasks else None,
+                dry_run=args.dry_run,
+            )
+
+
+def cmd_eval(args: argparse.Namespace) -> None:
+    benchmarks = _resolve_benchmarks(args.benchmarks)
+    for bname, bentry in benchmarks:
+        _run_eval_for_benchmark(bname, bentry, args)
+
+
+def cmd_all(args: argparse.Namespace) -> None:
+    """Train on each benchmark, then evaluate each."""
+    benchmarks = _resolve_benchmarks(args.benchmarks)
+
+    log.info("Phase 1: Training on %d benchmark(s)", len(benchmarks))
+    for bname, bentry in benchmarks:
+        log.info("=== Training on benchmark: %s ===", bname)
+        cmd = _build_train_cmd(
+            bentry,
+            policy_path=args.policy_path,
+            hub_user=args.hub_user,
+            policy_name=args.policy_name,
+            benchmark_name=bname,
+            num_gpus=args.num_gpus,
+            steps=args.steps,
+            batch_size=args.batch_size,
+            eval_freq=args.eval_freq,
+            save_freq=args.save_freq,
+            wandb=args.wandb,
+            extra_args=args.extra,
+        )
+        _run(cmd, dry_run=args.dry_run)
+
+    log.info("Phase 2: Evaluating %d benchmark(s)", len(benchmarks))
+    for bname, bentry in benchmarks:
+        _run_eval_for_benchmark(bname, bentry, args)
+
+
+def _add_common_args(p: argparse.ArgumentParser) -> None:
+    p.add_argument(
+        "--benchmarks", required=True,
+        help="Comma-separated benchmark names (e.g. libero_plus,robocasa,robomme).",
+    )
+    p.add_argument("--hub-user", required=True, help="HuggingFace Hub username.")
+    p.add_argument(
+        "--policy-name", default="smolvla",
+        help="Short policy name used in repo IDs and output dirs (default: smolvla).",
+    )
+    p.add_argument("--dry-run", action="store_true", help="Print commands without executing.")
+
+
+def _add_train_args(p: argparse.ArgumentParser) -> None:
+    p.add_argument("--policy-path", default="lerobot/smolvla_base", help="Pretrained policy path.")
+    p.add_argument("--num-gpus", type=int, default=4, help="Number of GPUs.")
+    p.add_argument("--steps", type=int, default=50_000, help="Total training steps.")
+    p.add_argument("--batch-size", type=int, default=32, help="Per-GPU batch size.")
+    p.add_argument("--eval-freq", type=int, default=10_000, help="Eval every N steps (0 to disable).")
+    p.add_argument("--save-freq", type=int, default=10_000, help="Save checkpoint every N steps.")
+    p.add_argument("--wandb", action="store_true", help="Enable Weights & Biases logging.")
+
+
+def _add_eval_args(p: argparse.ArgumentParser) -> None:
+    p.add_argument("--n-episodes", type=int, default=50, help="Number of eval episodes.")
+    p.add_argument("--batch-size-eval", type=int, default=10, help="Eval batch size (parallel envs).")
+    p.add_argument(
+        "--push-eval-to-hub", action="store_true",
+        help="Upload eval results (metrics + videos) to the policy repo on the Hub.",
+    )
+
+
+def build_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser(
+        prog="lerobot-benchmark",
+        description="Train and evaluate policies across simulation benchmarks.",
+    )
+    sub = parser.add_subparsers(dest="command", required=True)
+
+    # train
+    p_train = sub.add_parser("train", help="Train a policy on each selected benchmark.")
+    _add_common_args(p_train)
+    _add_train_args(p_train)
+    p_train.set_defaults(func=cmd_train)
+
+    # eval
+    p_eval = sub.add_parser("eval", help="Evaluate trained policies on each benchmark.")
+    _add_common_args(p_eval)
+    _add_eval_args(p_eval)
+    p_eval.set_defaults(func=cmd_eval)
+
+    # all (train + eval)
+    p_all = sub.add_parser("all", help="Train then evaluate on each benchmark.")
+    _add_common_args(p_all)
+    _add_train_args(p_all)
+    _add_eval_args(p_all)
+    p_all.set_defaults(func=cmd_all)
+
+    # list
+    p_list = sub.add_parser("list", help="List available benchmarks.")
+    p_list.set_defaults(func=lambda _args: _list_benchmarks())
+
+    return parser
+
+
+def _list_benchmarks() -> None:
+    print("Available benchmarks:\n")
+    for name, entry in BENCHMARK_REGISTRY.items():
+        print(f"  {name}")
+        print(f"    dataset:  {entry.dataset_repo_id}")
+        print(f"    env:      {entry.env_type}")
+        if entry.eval_tasks:
+            print(f"    eval on:  {', '.join(entry.eval_tasks)}")
+        else:
+            print(f"    eval on:  {entry.env_task}")
+        print()
+
+
+def main() -> None:
+    parser = build_parser()
+    args, extra = parser.parse_known_args()
+    args.extra = extra
+    args.func(args)
+
+
+if __name__ == "__main__":
+    main()
--- a/src/lerobot/scripts/lerobot_eval.py
+++ b/src/lerobot/scripts/lerobot_eval.py
@@ -49,6 +49,9 @@ You can learn about the CLI options for this script in the `EvalPipelineConfig`
 import concurrent.futures as cf
 import json
 import logging
+import shutil
+import subprocess
+import sys
 import threading
 import time
 from collections import defaultdict
@@ -71,7 +74,9 @@ from tqdm import trange

 from lerobot.configs import parser
 from lerobot.configs.eval import EvalPipelineConfig
+
 from lerobot.envs.factory import make_env, make_env_pre_post_processors
+from lerobot.envs.lazy_vec_env import LazyVectorEnv
 from lerobot.envs.utils import (
    add_envs_task,
    check_env_attributes_and_types,
@@ -82,6 +87,11 @@ from lerobot.policies.factory import make_policy, make_pre_post_processors
 from lerobot.policies.pretrained import PreTrainedPolicy
 from lerobot.processor import PolicyAction, PolicyProcessorPipeline
 from lerobot.utils.constants import ACTION, DONE, OBS_STR, REWARD
+from lerobot.utils.hf_eval_results import (
+    build_eval_results_rows,
+    default_eval_date,
+    upload_eval_results_yaml,
+)
 from lerobot.utils.import_utils import register_third_party_plugins
 from lerobot.utils.io_utils import write_video
 from lerobot.utils.random_utils import set_seed
@@ -502,9 +512,178 @@ def _compile_episode_data(
    return data_dict


+def _serializable_config(obj: Any) -> Any:
+    """Recursively convert a config dict so it is JSON-serializable."""
+    if isinstance(obj, dict):
+        return {k: _serializable_config(v) for k, v in obj.items()}
+    if isinstance(obj, (list, tuple)):
+        return [_serializable_config(v) for v in obj]
+    if isinstance(obj, Path):
+        return str(obj)
+    if isinstance(obj, (int, float, str, bool, type(None))):
+        return obj
+    return str(obj)
+
+
+def push_eval_to_hub(
+    repo_id: str,
+    output_dir: Path,
+    info: dict,
+    env_type: str,
+    env_task: str | None,
+    benchmark_dataset_id: str,
+    source_url: str | None = None,
+    notes: str | None = None,
+) -> str:
+    """Upload eval artifacts and `.eval_results` rows to the Hub.
+
+    Args:
+        repo_id: HF model repo (e.g. "user/my_policy").
+        output_dir: Local directory containing eval_info.json and videos/.
+        info: The eval results dict (as returned by eval_policy_all).
+        env_type: Environment type string (e.g. "libero_plus", "pusht").
+        env_task: The env task string from eval config.
+        benchmark_dataset_id: HF dataset id of the consolidated benchmark dataset.
+        source_url: Optional source URL for `.eval_results` attribution.
+        notes: Optional setup notes to include in `.eval_results`.
+
+    Returns:
+        URL of the last Hub commit.
+    """
+    from huggingface_hub import HfApi
+
+    api = HfApi()
+    api.create_repo(repo_id=repo_id, exist_ok=True)
+
+    commit_url = ""
+
+    # 1. Upload eval_info.json
+    eval_json_path = output_dir / "eval_info.json"
+    if eval_json_path.exists():
+        commit_url = api.upload_file(
+            path_or_fileobj=str(eval_json_path),
+            path_in_repo=f"eval/{env_type}/eval_info.json",
+            repo_id=repo_id,
+            commit_message=f"Upload eval results for {env_type}",
+        )
+
+    # 2. Upload eval_config.json (policy, env, and eval settings used)
+    eval_config_path = output_dir / "eval_config.json"
+    if eval_config_path.exists():
+        api.upload_file(
+            path_or_fileobj=str(eval_config_path),
+            path_in_repo=f"eval/{env_type}/eval_config.json",
+            repo_id=repo_id,
+            commit_message=f"Upload eval config for {env_type}",
+        )
+
+    # 3. Upload rollout videos
+    videos_dir = output_dir / "videos"
+    if videos_dir.is_dir():
+        api.upload_folder(
+            folder_path=str(videos_dir),
+            path_in_repo=f"eval/{env_type}/videos",
+            repo_id=repo_id,
+            commit_message=f"Upload eval rollout videos for {env_type}",
+        )
+
+    # 4. Upload HF-native `.eval_results` rows (canonical leaderboard surface).
+    rows = build_eval_results_rows(
+        info=info,
+        env_type=env_type,
+        env_task=env_task,
+        benchmark_dataset_id=benchmark_dataset_id,
+        source_url=source_url,
+        notes=notes,
+        eval_date=default_eval_date(),
+    )
+    commit_url = upload_eval_results_yaml(
+        api=api,
+        repo_id=repo_id,
+        rows=rows,
+        env_type=env_type,
+        env_task=env_task,
+        output_dir=output_dir,
+    )
+
+    logging.info(f"Eval results pushed to https://huggingface.co/{repo_id}")
+    return commit_url
+
+
@parser.wrap()
 def eval_main(cfg: EvalPipelineConfig):
    logging.info(pformat(asdict(cfg)))
+    # Multi-instance orchestration only applies to local runtime.
+    # For docker runtime, instance_count controls the number of env containers
+    # spawned directly by run_eval_in_docker — no extra lerobot-eval processes needed.
+    if cfg.eval.runtime == "local" and cfg.eval.instance_count > 1 and cfg.eval.instance_id == 0:
+        _orchestrate_multi_instance_eval(cfg)
+    else:
+        _run_eval_worker(cfg)
+
+
+def _maybe_add_libero_plus_perturbation(info: dict, cfg: EvalPipelineConfig) -> None:
+    if cfg.env.type != "libero_plus":
+        return
+    try:
+        from lerobot.envs.libero import aggregate_by_perturbation, build_perturbation_index
+
+        suite_names = [s.strip() for s in cfg.env.task.split(",") if s.strip()]
+        suite_indices = {s: build_perturbation_index(s) for s in suite_names}
+        perturbation_results = aggregate_by_perturbation(info["per_task"], suite_indices)
+        info["perturbation_results"] = perturbation_results
+        print("\n=== Perturbation Results ===")
+        for dim, stats in perturbation_results.items():
+            print(f"  {dim}: {stats['pc_success']:.1f}% ({stats['n_episodes']} episodes)")
+    except Exception as exc:
+        # Never fail a finished long-running eval on post-processing.
+        print(f"WARNING: Failed to compute LIBERO-Plus perturbation breakdown: {exc}")
+        print("Continuing with per-suite + overall metrics only.")
+
+
+def _save_eval_outputs(cfg: EvalPipelineConfig, info: dict) -> None:
+    output_dir = Path(cfg.output_dir)
+    output_dir.mkdir(parents=True, exist_ok=True)
+    with open(output_dir / "eval_info.json", "w") as f:
+        json.dump(info, f, indent=2)
+
+    eval_cfg_dict = _serializable_config(asdict(cfg))
+    with open(output_dir / "eval_config.json", "w") as f:
+        json.dump(eval_cfg_dict, f, indent=2)
+
+
+def _maybe_push_eval_outputs(cfg: EvalPipelineConfig, info: dict) -> None:
+    if not cfg.push_to_hub:
+        return
+    repo_id = str(cfg.policy.pretrained_path)
+    try:
+        push_eval_to_hub(
+            repo_id=repo_id,
+            output_dir=Path(cfg.output_dir),
+            info=info,
+            env_type=cfg.env.type,
+            env_task=cfg.env.task,
+            benchmark_dataset_id=cfg.benchmark_dataset_id,
+            source_url=cfg.eval_result_source_url,
+            notes=cfg.eval_result_notes,
+        )
+    except Exception as exc:
+        logging.warning("Failed to push eval artifacts/results to Hub: %s", exc)
+
+
+def _run_eval_worker(cfg: EvalPipelineConfig) -> dict:
+    logging.info(pformat(asdict(cfg)))
+
+    if cfg.eval.runtime in ("docker", "multiprocess"):
+        from lerobot.envs.docker_runtime import run_eval_in_docker, run_eval_multiprocess
+
+        if cfg.eval.runtime == "docker":
+            run_eval_in_docker(cfg)
+        else:
+            run_eval_multiprocess(cfg)
+        output_dir = Path(cfg.output_dir)
+        with open(output_dir / "eval_info.json") as f:
+            return json.load(f)

    # Check device is available
    device = get_safe_torch_device(cfg.policy.device, log=True)
@@ -548,34 +727,123 @@ def eval_main(cfg: EvalPipelineConfig):
    # Create environment-specific preprocessor and postprocessor (e.g., for LIBERO environments)
    env_preprocessor, env_postprocessor = make_env_pre_post_processors(env_cfg=cfg.env, policy_cfg=cfg.policy)

-    with torch.no_grad(), torch.autocast(device_type=device.type) if cfg.policy.use_amp else nullcontext():
-        info = eval_policy_all(
-            envs=envs,
-            policy=policy,
-            env_preprocessor=env_preprocessor,
-            env_postprocessor=env_postprocessor,
-            preprocessor=preprocessor,
-            postprocessor=postprocessor,
-            n_episodes=cfg.eval.n_episodes,
-            max_episodes_rendered=10,
-            videos_dir=Path(cfg.output_dir) / "videos",
-            start_seed=cfg.seed,
-            max_parallel_tasks=cfg.env.max_parallel_tasks,
-        )
-        print("Overall Aggregated Metrics:")
-        print(info["overall"])
+    try:
+        with (
+            torch.no_grad(),
+            torch.autocast(device_type=device.type) if cfg.policy.use_amp else nullcontext(),
+        ):
+            info = eval_policy_all(
+                envs=envs,
+                policy=policy,
+                env_preprocessor=env_preprocessor,
+                env_postprocessor=env_postprocessor,
+                preprocessor=preprocessor,
+                postprocessor=postprocessor,
+                n_episodes=cfg.eval.n_episodes,
+                max_episodes_rendered=10,
+                videos_dir=Path(cfg.output_dir) / "videos",
+                start_seed=cfg.seed,
+                max_parallel_tasks=cfg.env.max_parallel_tasks,
+                instance_count=cfg.eval.instance_count,
+                instance_id=cfg.eval.instance_id,
+            )
+            print("Overall Aggregated Metrics:")
+            print(info["overall"])

-        # Print per-suite stats
-        for task_group, task_group_info in info.items():
-            print(f"\nAggregated Metrics for {task_group}:")
-            print(task_group_info)
-    # Close all vec envs
-    close_envs(envs)
+            for key, val in info.get("per_group", {}).items():
+                print(f"\nAggregated Metrics for {key}:")
+                print(val)

-    # Save info
-    with open(Path(cfg.output_dir) / "eval_info.json", "w") as f:
-        json.dump(info, f, indent=2)
+            _maybe_add_libero_plus_perturbation(info, cfg)
+    finally:
+        close_envs(envs)

+    _save_eval_outputs(cfg, info)
+    _maybe_push_eval_outputs(cfg, info)
+
+    logging.info("End of eval")
+    return info
+
+
+def _orchestrate_multi_instance_eval(cfg: EvalPipelineConfig) -> None:
+    start_t = time.time()
+    root_output_dir = Path(cfg.output_dir)
+    instances_root = root_output_dir / "instances"
+    instances_root.mkdir(parents=True, exist_ok=True)
+
+    n_instances = cfg.eval.instance_count
+    logging.info(f"Launching multi-instance eval with {n_instances} workers.")
+
+    # Spawn workers for shard 1..N-1, run shard 0 in-process.
+    child_procs: list[tuple[int, subprocess.Popen]] = []
+    argv = [
+        arg
+        for arg in sys.argv[1:]
+        if not arg.startswith("--eval.instance_id=")
+        and not arg.startswith("--output_dir=")
+        and not arg.startswith("--push_to_hub=")
+    ]
+    for i in range(1, n_instances):
+        child_output_dir = instances_root / str(i)
+        cmd = [
+            sys.executable,
+            "-m",
+            "lerobot.scripts.lerobot_eval",
+            *argv,
+            f"--eval.instance_id={i}",
+            f"--output_dir={child_output_dir}",
+            "--push_to_hub=false",
+        ]
+        logging.info("Starting eval worker %s/%s", i + 1, n_instances)
+        child_procs.append((i, subprocess.Popen(cmd)))
+
+    cfg0 = deepcopy(cfg)
+    cfg0.eval.instance_id = 0
+    cfg0.push_to_hub = False
+    cfg0.output_dir = instances_root / "0"
+    _run_eval_worker(cfg0)
+
+    failed = []
+    for idx, proc in child_procs:
+        rc = proc.wait()
+        if rc != 0:
+            failed.append((idx, rc))
+    if failed:
+        raise RuntimeError(f"Multi-instance eval failed for workers: {failed}")
+
+    partial_infos: list[dict] = []
+    for i in range(n_instances):
+        info_path = instances_root / str(i) / "eval_info.json"
+        with open(info_path) as f:
+            partial_infos.append(json.load(f))
+
+    merged_per_task = []
+    for info in partial_infos:
+        merged_per_task.extend(info.get("per_task", []))
+    merged_per_task.sort(key=lambda x: (x["task_group"], x["task_id"]))
+
+    # Merge videos from each shard into final output dir.
+    merged_videos_dir = root_output_dir / "videos"
+    for i in range(n_instances):
+        shard_dir = instances_root / str(i)
+        shard_videos = shard_dir / "videos"
+        if shard_videos.is_dir():
+            shutil.copytree(shard_videos, merged_videos_dir, dirs_exist_ok=True)
+            old_prefix = str(shard_videos)
+            new_prefix = str(merged_videos_dir)
+            for entry in merged_per_task:
+                paths = entry.get("metrics", {}).get("video_paths", [])
+                entry["metrics"]["video_paths"] = [
+                    p.replace(old_prefix, new_prefix, 1) if p.startswith(old_prefix) else p for p in paths
+                ]
+
+    merged_info = _aggregate_eval_from_per_task(merged_per_task, total_eval_s=time.time() - start_t)
+    _maybe_add_libero_plus_perturbation(merged_info, cfg)
+    print("Overall Aggregated Metrics:")
+    print(merged_info["overall"])
+
+    _save_eval_outputs(cfg, merged_info)
+    _maybe_push_eval_outputs(cfg, merged_info)
    logging.info("End of eval")


@@ -590,6 +858,179 @@ class TaskMetrics(TypedDict):
 ACC_KEYS = ("sum_rewards", "max_rewards", "successes", "video_paths")


+def _aggregate_eval_from_per_task(per_task_infos: list[dict], total_eval_s: float) -> dict:
+    """Aggregate eval metrics from per-task payloads."""
+    group_acc: dict[str, dict[str, list]] = defaultdict(lambda: {k: [] for k in ACC_KEYS})
+    overall: dict[str, list] = {k: [] for k in ACC_KEYS}
+
+    def _append(group: str, key: str, value: Any):
+        if value is None:
+            return
+        if isinstance(value, list):
+            group_acc[group][key].extend(value)
+            overall[key].extend(value)
+        else:
+            group_acc[group][key].append(value)
+            overall[key].append(value)
+
+    for entry in per_task_infos:
+        group = entry["task_group"]
+        metrics = entry["metrics"]
+        _append(group, "sum_rewards", metrics.get("sum_rewards"))
+        _append(group, "max_rewards", metrics.get("max_rewards"))
+        _append(group, "successes", metrics.get("successes"))
+        paths = metrics.get("video_paths", [])
+        if paths:
+            group_acc[group]["video_paths"].extend(paths)
+            overall["video_paths"].extend(paths)
+
+    def _agg_from_list(xs: list[float]) -> float:
+        if not xs:
+            return float("nan")
+        arr = np.array(xs, dtype=float)
+        return float(np.nanmean(arr))
+
+    groups_aggregated = {}
+    for group, acc in group_acc.items():
+        groups_aggregated[group] = {
+            "avg_sum_reward": _agg_from_list(acc["sum_rewards"]),
+            "avg_max_reward": _agg_from_list(acc["max_rewards"]),
+            "pc_success": _agg_from_list(acc["successes"]) * 100 if acc["successes"] else float("nan"),
+            "n_episodes": len(acc["sum_rewards"]),
+            "video_paths": list(acc["video_paths"]),
+        }
+
+    overall_agg = {
+        "avg_sum_reward": _agg_from_list(overall["sum_rewards"]),
+        "avg_max_reward": _agg_from_list(overall["max_rewards"]),
+        "pc_success": _agg_from_list(overall["successes"]) * 100 if overall["successes"] else float("nan"),
+        "n_episodes": len(overall["sum_rewards"]),
+        "eval_s": total_eval_s,
+        "eval_ep_s": total_eval_s / max(1, len(overall["sum_rewards"])),
+        "video_paths": list(overall["video_paths"]),
+    }
+
+    return {"per_task": per_task_infos, "per_group": groups_aggregated, "overall": overall_agg}
+
+
+def _eval_task_batch(
+    batch: list[tuple[str, int, LazyVectorEnv]],
+    policy,
+    env_preprocessor: PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
+    env_postprocessor: PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
+    preprocessor: PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
+    postprocessor: PolicyProcessorPipeline[PolicyAction, PolicyAction],
+    start_seed: int | None,
+    max_episodes_rendered: int = 0,
+    videos_dir: Path | None = None,
+) -> list[tuple[str, int, TaskMetrics]]:
+    """Evaluate N tasks in a single batched rollout for GPU efficiency.
+
+    Each task contributes one sub-env to a combined SyncVectorEnv so the policy
+    processes all N observations in one forward pass per step.
+    """
+    all_fns: list[Callable] = []
+    task_slices: list[tuple[str, int, int, int]] = []
+    offset = 0
+    for task_group, task_id, lazy_env in batch:
+        fns = lazy_env.factory_fns
+        if not fns:
+            continue
+        start = offset
+        offset += len(fns)
+        all_fns.extend(fns)
+        task_slices.append((task_group, task_id, start, offset))
+
+    if not all_fns:
+        return []
+
+    env_cls = batch[0][2].env_cls
+    combined_env = env_cls(all_fns)
+
+    try:
+        seeds = None
+        if start_seed is not None:
+            seeds = list(range(start_seed, start_seed + combined_env.num_envs))
+
+        ep_frames: list[np.ndarray] = []
+
+        def render_frame(env: gym.vector.VectorEnv):
+            if max_episodes_rendered <= 0:
+                return
+            n = min(max_episodes_rendered, env.num_envs)
+            if isinstance(env, gym.vector.SyncVectorEnv):
+                ep_frames.append(np.stack([env.envs[i].render() for i in range(n)]))
+            elif isinstance(env, gym.vector.AsyncVectorEnv):
+                ep_frames.append(np.stack(env.call("render")[:n]))
+
+        rollout_data = rollout(
+            env=combined_env,
+            policy=policy,
+            env_preprocessor=env_preprocessor,
+            env_postprocessor=env_postprocessor,
+            preprocessor=preprocessor,
+            postprocessor=postprocessor,
+            seeds=seeds,
+            render_callback=render_frame if max_episodes_rendered > 0 else None,
+        )
+
+        n_steps = rollout_data["done"].shape[1]
+        done_indices = torch.argmax(rollout_data["done"].to(int), dim=1)
+        mask = (torch.arange(n_steps) <= einops.repeat(done_indices + 1, "b -> b s", s=n_steps)).int()
+        batch_sum_rewards = einops.reduce((rollout_data["reward"] * mask), "b n -> b", "sum")
+        batch_max_rewards = einops.reduce((rollout_data["reward"] * mask), "b n -> b", "max")
+        batch_successes = einops.reduce((rollout_data["success"] * mask), "b n -> b", "any")
+
+        video_paths_per_task: dict[tuple[str, int], list[str]] = defaultdict(list)
+        if max_episodes_rendered > 0 and ep_frames and videos_dir:
+            stacked = np.stack(ep_frames, axis=1)  # (batch, time, h, w, c)
+            rendered = 0
+            threads = []
+            for tg, tid, start_i, end_i in task_slices:
+                if rendered >= max_episodes_rendered:
+                    break
+                task_dir = videos_dir / f"{tg}_{tid}"
+                task_dir.mkdir(parents=True, exist_ok=True)
+                for env_idx in range(start_i, end_i):
+                    if rendered >= max_episodes_rendered:
+                        break
+                    episode_index = env_idx - start_i
+                    video_path = task_dir / f"eval_episode_{episode_index}.mp4"
+                    video_paths_per_task[(tg, tid)].append(str(video_path))
+                    di = done_indices[env_idx].item()
+                    thread = threading.Thread(
+                        target=write_video,
+                        args=(
+                            str(video_path),
+                            stacked[env_idx, : di + 1],
+                            combined_env.unwrapped.metadata["render_fps"],
+                        ),
+                    )
+                    thread.start()
+                    threads.append(thread)
+                    rendered += 1
+            for t in threads:
+                t.join()
+
+        results: list[tuple[str, int, TaskMetrics]] = []
+        for tg, tid, start_i, end_i in task_slices:
+            results.append(
+                (
+                    tg,
+                    tid,
+                    TaskMetrics(
+                        sum_rewards=batch_sum_rewards[start_i:end_i].tolist(),
+                        max_rewards=batch_max_rewards[start_i:end_i].tolist(),
+                        successes=batch_successes[start_i:end_i].tolist(),
+                        video_paths=video_paths_per_task.get((tg, tid), []),
+                    ),
+                )
+            )
+        return results
+    finally:
+        combined_env.close()
+
+
 def eval_one(
    env: gym.vector.VectorEnv,
    *,
@@ -634,7 +1075,7 @@ def eval_one(
 def run_one(
    task_group: str,
    task_id: int,
-    env,
+    env: Any,
    *,
    policy,
    env_preprocessor,
@@ -678,7 +1119,7 @@ def run_one(


 def eval_policy_all(
-    envs: dict[str, dict[int, gym.vector.VectorEnv]],
+    envs: dict[str, dict[int, Any]],
    policy,
    env_preprocessor: PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
    env_postprocessor: PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
@@ -691,6 +1132,8 @@ def eval_policy_all(
    return_episode_data: bool = False,
    start_seed: int | None = None,
    max_parallel_tasks: int = 1,
+    instance_count: int = 1,
+    instance_id: int = 0,
 ) -> dict:
    """
    Evaluate a nested `envs` dict: {task_group: {task_id: vec_env}}.
@@ -703,6 +1146,12 @@ def eval_policy_all(

    # Flatten envs into list of (task_group, task_id, env)
    tasks = [(tg, tid, vec) for tg, group in envs.items() for tid, vec in group.items()]
+    if instance_count > 1:
+        total_tasks = len(tasks)
+        tasks = [task for idx, task in enumerate(tasks) if idx % instance_count == instance_id]
+        logging.info(
+            f"Instance shard {instance_id + 1}/{instance_count}: {len(tasks)}/{total_tasks} tasks assigned."
+        )

    # accumulators: track metrics at both per-group level and across all groups
    group_acc: dict[str, dict[str, list]] = defaultdict(lambda: {k: [] for k in ACC_KEYS})
@@ -748,59 +1197,77 @@ def eval_policy_all(
        start_seed=start_seed,
    )

-    if max_parallel_tasks <= 1:
-        # sequential path (single accumulator path on the main thread)
-        # NOTE: keeping a single-threaded accumulator avoids concurrent list appends or locks
-        for task_group, task_id, env in tasks:
-            tg, tid, metrics = task_runner(task_group, task_id, env)
-            _accumulate_to(tg, metrics)
-            per_task_infos.append({"task_group": tg, "task_id": tid, "metrics": metrics})
-    else:
-        # threaded path: submit all tasks, consume completions on main thread and accumulate there
-        with cf.ThreadPoolExecutor(max_workers=max_parallel_tasks) as executor:
-            fut2meta = {}
-            for task_group, task_id, env in tasks:
-                fut = executor.submit(task_runner, task_group, task_id, env)
-                fut2meta[fut] = (task_group, task_id)
-            for fut in cf.as_completed(fut2meta):
-                tg, tid, metrics = fut.result()
+    all_lazy = all(isinstance(env, LazyVectorEnv) for _, _, env in tasks)
+    single_factory_per_task = all(
+        not isinstance(env, LazyVectorEnv) or env.num_factory_fns == 1 for _, _, env in tasks
+    )
+    can_batch = max_parallel_tasks > 1 and all_lazy and single_factory_per_task and n_episodes == 1
+
+    if can_batch:
+        # Multi-task batched path: combine N tasks into one SyncVectorEnv per chunk
+        # so the policy processes all N observations in a single forward pass per step.
+        chunk_size = max_parallel_tasks
+        logging.info(f"Task scheduler mode: batched_lazy (chunk_size={chunk_size})")
+        n_chunks = (len(tasks) + chunk_size - 1) // chunk_size
+        rendered_so_far = 0
+        for chunk_idx in range(n_chunks):
+            chunk = tasks[chunk_idx * chunk_size : (chunk_idx + 1) * chunk_size]
+            render_budget = max(0, max_episodes_rendered - rendered_so_far)
+            logging.info(
+                f"Batch {chunk_idx + 1}/{n_chunks}: evaluating {len(chunk)} tasks "
+                f"({chunk_idx * chunk_size + 1}–{chunk_idx * chunk_size + len(chunk)}/{len(tasks)})"
+            )
+            batch_results = _eval_task_batch(
+                chunk,
+                policy=policy,
+                env_preprocessor=env_preprocessor,
+                env_postprocessor=env_postprocessor,
+                preprocessor=preprocessor,
+                postprocessor=postprocessor,
+                start_seed=start_seed,
+                max_episodes_rendered=render_budget,
+                videos_dir=videos_dir,
+            )
+            for tg, tid, metrics in batch_results:
                _accumulate_to(tg, metrics)
                per_task_infos.append({"task_group": tg, "task_id": tid, "metrics": metrics})
+                rendered_so_far += len(metrics.get("video_paths", []))

-    # compute aggregated metrics helper (robust to lists/scalars)
-    def _agg_from_list(xs):
-        if not xs:
-            return float("nan")
-        arr = np.array(xs, dtype=float)
-        return float(np.nanmean(arr))
+            if overall["successes"]:
+                sr = np.nanmean(overall["successes"]) * 100
+                logging.info(f"  running success rate: {sr:.1f}%")
+    elif max_parallel_tasks <= 1:
+        logging.info("Task scheduler mode: sequential")
+        for task_group, task_id, env in tasks:
+            try:
+                tg, tid, metrics = task_runner(task_group, task_id, env)
+                _accumulate_to(tg, metrics)
+                per_task_infos.append({"task_group": tg, "task_id": tid, "metrics": metrics})
+            finally:
+                env.close()
+    else:
+        # Threaded fallback for cases where batched lazy mode cannot be used.
+        if all_lazy and n_episodes != 1:
+            logging.info("Task scheduler mode: threaded (lazy batching disabled because n_episodes != 1)")
+        elif all_lazy and not single_factory_per_task:
+            logging.info("Task scheduler mode: threaded (lazy batching disabled because eval.batch_size > 1)")
+        else:
+            logging.info(f"Task scheduler mode: threaded (max_workers={max_parallel_tasks})")
+        with cf.ThreadPoolExecutor(max_workers=max_parallel_tasks) as executor:
+            fut2meta: dict[cf.Future, tuple[str, int, Any]] = {}
+            for task_group, task_id, env in tasks:
+                fut = executor.submit(task_runner, task_group, task_id, env)
+                fut2meta[fut] = (task_group, task_id, env)
+            for fut in cf.as_completed(fut2meta):
+                tg, tid, env = fut2meta[fut]
+                try:
+                    _, _, metrics = fut.result()
+                    _accumulate_to(tg, metrics)
+                    per_task_infos.append({"task_group": tg, "task_id": tid, "metrics": metrics})
+                finally:
+                    env.close()

-    # compute per-group aggregates
-    groups_aggregated = {}
-    for group, acc in group_acc.items():
-        groups_aggregated[group] = {
-            "avg_sum_reward": _agg_from_list(acc["sum_rewards"]),
-            "avg_max_reward": _agg_from_list(acc["max_rewards"]),
-            "pc_success": _agg_from_list(acc["successes"]) * 100 if acc["successes"] else float("nan"),
-            "n_episodes": len(acc["sum_rewards"]),
-            "video_paths": list(acc["video_paths"]),
-        }
-
-    # overall aggregates
-    overall_agg = {
-        "avg_sum_reward": _agg_from_list(overall["sum_rewards"]),
-        "avg_max_reward": _agg_from_list(overall["max_rewards"]),
-        "pc_success": _agg_from_list(overall["successes"]) * 100 if overall["successes"] else float("nan"),
-        "n_episodes": len(overall["sum_rewards"]),
-        "eval_s": time.time() - start_t,
-        "eval_ep_s": (time.time() - start_t) / max(1, len(overall["sum_rewards"])),
-        "video_paths": list(overall["video_paths"]),
-    }
-
-    return {
-        "per_task": per_task_infos,
-        "per_group": groups_aggregated,
-        "overall": overall_agg,
-    }
+    return _aggregate_eval_from_per_task(per_task_infos, total_eval_s=time.time() - start_t)


 def main():
--- a/src/lerobot/scripts/lerobot_eval_worker.py
+++ b/src/lerobot/scripts/lerobot_eval_worker.py
@@ -0,0 +1,218 @@
+#!/usr/bin/env python
+
+# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Docker eval worker — runs inside a benchmark container.
+
+Runs gym episodes for a sharded subset of the configured env's tasks, calling
+a remote HTTP policy inference server (running on the host GPU) for action chunks.
+
+Usage (normally invoked by docker_runtime.run_eval_in_docker, not directly):
+    lerobot-eval-worker \\
+        --env.type=libero_plus \\
+        --server_address=host.docker.internal:50051 \\
+        --n_episodes=5 \\
+        --seed=1000 \\
+        --instance_id=0 \\
+        --instance_count=2 \\
+        --output_path=/results/worker_0.json
+"""
+
+from __future__ import annotations
+
+import json
+import logging
+import pickle  # nosec B403 — internal serialisation only
+import time
+import urllib.request
+from dataclasses import dataclass, field
+from pathlib import Path
+
+import draccus
+import numpy as np
+
+from lerobot import envs  # noqa: F401 — registers all env subclasses
+from lerobot.envs.configs import EnvConfig
+from lerobot.envs.factory import make_env
+from lerobot.envs.utils import add_envs_task, preprocess_observation
+from lerobot.utils.utils import init_logging
+
+logger = logging.getLogger(__name__)
+
+
+@dataclass
+class EvalWorkerConfig:
+    env: EnvConfig
+    # Address of the policy inference HTTP server on the host.
+    server_address: str = "host.docker.internal:50051"
+    # Number of episodes to run per task.
+    n_episodes: int = 1
+    # Starting random seed; episode i of a task uses seed + i.
+    seed: int = 0
+    # 0-indexed shard id for this worker.
+    instance_id: int = 0
+    # Total number of shards (workers).
+    instance_count: int = 1
+    # Path (inside the container) to write the JSON per-task results.
+    output_path: Path = field(default_factory=lambda: Path("/results/worker.json"))
+    # Timeout in seconds for each HTTP request to the policy server.
+    server_timeout: float = 120.0
+
+
+def _call_server(server_address: str, obs_t: dict, timeout: float) -> np.ndarray:
+    """POST pickled obs to /predict_chunk, return numpy chunk (T, action_dim)."""
+    body = pickle.dumps({"obs_t": obs_t})  # nosec B301
+    req = urllib.request.Request(
+        f"http://{server_address}/predict_chunk",
+        data=body,
+        method="POST",
+        headers={"Content-Type": "application/octet-stream"},
+    )
+    with urllib.request.urlopen(req, timeout=timeout) as resp:  # nosec B310
+        return pickle.loads(resp.read())  # nosec B301
+
+
+def run_worker(cfg: EvalWorkerConfig) -> dict:
+    """Run cfg.n_episodes episodes per assigned task. Returns per-task results dict."""
+    # Build envs: {task_group: {task_id: vec_env}}
+    envs_dict = make_env(cfg.env, n_envs=1)
+
+    # Flatten to list of (task_group, task_id, env)
+    tasks = [
+        (task_group, task_id, vec)
+        for task_group, group in envs_dict.items()
+        for task_id, vec in group.items()
+    ]
+
+    # Shard: this worker handles tasks where index % instance_count == instance_id
+    if cfg.instance_count > 1:
+        total = len(tasks)
+        assigned = {i for i in range(total) if i % cfg.instance_count == cfg.instance_id}
+        for i, (_, _, env) in enumerate(tasks):
+            if i not in assigned:
+                try:
+                    env.close()
+                except Exception:
+                    pass
+        tasks = [t for i, t in enumerate(tasks) if i in assigned]
+        logger.info(
+            "Shard %d/%d: %d/%d tasks assigned.",
+            cfg.instance_id + 1,
+            cfg.instance_count,
+            len(tasks),
+            total,
+        )
+
+    per_task: list[dict] = []
+
+    for task_group, task_id, env in tasks:
+        sum_rewards: list[float] = []
+        max_rewards: list[float] = []
+        successes: list[bool] = []
+
+        for ep_idx in range(cfg.n_episodes):
+            obs, _info = env.reset(seed=[cfg.seed + ep_idx])
+            obs_t = preprocess_observation(obs)
+            obs_t = add_envs_task(env, obs_t)
+
+            action_buffer: list[np.ndarray] = []  # each element: (1, action_dim)
+            ep_rewards: list[float] = []
+            ep_success = False
+            done = np.zeros(1, dtype=bool)
+            ep_steps = 0
+            ep_infer_time = 0.0
+            ep_env_time = 0.0
+            ep_infer_calls = 0
+
+            while not np.all(done):
+                if not action_buffer:
+                    t0 = time.monotonic()
+                    chunk_np = _call_server(cfg.server_address, obs_t, cfg.server_timeout)
+                    ep_infer_time += time.monotonic() - t0
+                    ep_infer_calls += 1
+                    action_buffer = [chunk_np[i : i + 1] for i in range(chunk_np.shape[0])]
+
+                action_np = action_buffer.pop(0)
+                t0 = time.monotonic()
+                obs, reward, terminated, truncated, info = env.step(action_np)
+                ep_env_time += time.monotonic() - t0
+                ep_steps += 1
+
+                done = terminated | truncated | done
+                ep_rewards.append(float(np.mean(reward)))
+
+                if "final_info" in info:
+                    final_info = info["final_info"]
+                    if isinstance(final_info, dict) and "is_success" in final_info:
+                        ep_success = bool(final_info["is_success"][0])
+
+                if not np.all(done):
+                    obs_t = preprocess_observation(obs)
+                    obs_t = add_envs_task(env, obs_t)
+
+            sum_rewards.append(float(np.sum(ep_rewards)))
+            max_rewards.append(float(np.max(ep_rewards)) if ep_rewards else 0.0)
+            successes.append(ep_success)
+            avg_env_ms = (ep_env_time / ep_steps * 1000) if ep_steps else 0
+            avg_infer_ms = (ep_infer_time / ep_infer_calls * 1000) if ep_infer_calls else 0
+            logger.info(
+                "Task %s[%d] ep %d/%d — success=%s | %d steps, %d infer calls | "
+                "env %.0fms/step, infer %.0fms/call (env %.1fs, infer %.1fs total)",
+                task_group,
+                task_id,
+                ep_idx + 1,
+                cfg.n_episodes,
+                ep_success,
+                ep_steps,
+                ep_infer_calls,
+                avg_env_ms,
+                avg_infer_ms,
+                ep_env_time,
+                ep_infer_time,
+            )
+
+        per_task.append(
+            {
+                "task_group": task_group,
+                "task_id": task_id,
+                "metrics": {
+                    "sum_rewards": sum_rewards,
+                    "max_rewards": max_rewards,
+                    "successes": successes,
+                    "video_paths": [],
+                },
+            }
+        )
+        env.close()
+
+    return {"per_task": per_task}
+
+
+def worker_main(cfg: EvalWorkerConfig) -> None:
+    results = run_worker(cfg)
+    output = Path(cfg.output_path)
+    output.parent.mkdir(parents=True, exist_ok=True)
+    output.write_text(json.dumps(results, indent=2))
+    logger.info("Worker %d wrote results to %s", cfg.instance_id, output)
+
+
+def main() -> None:
+    init_logging()
+    cfg = draccus.parse(config_class=EvalWorkerConfig)
+    worker_main(cfg)
+
+
+if __name__ == "__main__":
+    main()
--- a/src/lerobot/scripts/lerobot_leaderboard.py
+++ b/src/lerobot/scripts/lerobot_leaderboard.py
@@ -0,0 +1,588 @@
+#!/usr/bin/env python
+
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Generate an interactive eval leaderboard from Hub model repos.
+
+Reads eval results (as pushed by ``lerobot-eval --push_to_hub``) from one or
+more Hugging Face model repos and produces a self-contained HTML page with a
+sortable, filterable leaderboard table.
+
+Usage::
+
+    # models.txt contains one HF repo ID per line (lines starting with # are ignored)
+    lerobot-leaderboard models.txt --output leaderboard.html
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import logging
+import sys
+from dataclasses import dataclass, field
+from pathlib import Path
+
+logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
+logger = logging.getLogger(__name__)
+
+
+@dataclass
+class ModelEntry:
+    repo_id: str
+    policy_type: str = "—"
+    dataset: str = "—"
+    training_steps: str = "—"
+    batch_size: str = "—"
+    # env_type -> {group_name -> pc_success}
+    eval_results: dict[str, dict[str, float]] = field(default_factory=dict)
+    # env_type -> overall pc_success
+    eval_overall: dict[str, float] = field(default_factory=dict)
+    # env_type -> n_episodes
+    eval_n_episodes: dict[str, int] = field(default_factory=dict)
+
+
+def _try_download(repo_id: str, filename: str) -> dict | None:
+    """Download a JSON file from a Hub repo, return parsed dict or None."""
+    from huggingface_hub import hf_hub_download
+    from huggingface_hub.utils import EntryNotFoundError, RepositoryNotFoundError
+
+    try:
+        path = hf_hub_download(repo_id, filename, repo_type="model")
+        with open(path) as f:
+            return json.load(f)
+    except (EntryNotFoundError, RepositoryNotFoundError, OSError):
+        return None
+
+
+def _list_eval_dirs(repo_id: str) -> list[str]:
+    """List env_type subdirectories under eval/ in a Hub repo."""
+    from huggingface_hub import HfApi
+    from huggingface_hub.utils import RepositoryNotFoundError
+
+    api = HfApi()
+    try:
+        files = api.list_repo_files(repo_id, repo_type="model")
+    except RepositoryNotFoundError:
+        return []
+
+    env_types = set()
+    for f in files:
+        if f.startswith("eval/") and f.count("/") >= 2:
+            env_types.add(f.split("/")[1])
+    return sorted(env_types)
+
+
+def fetch_model_entry(repo_id: str) -> ModelEntry:
+    """Fetch all available metadata and eval results for a single model."""
+    entry = ModelEntry(repo_id=repo_id)
+
+    # Policy config
+    policy_cfg = _try_download(repo_id, "config.json")
+    if policy_cfg:
+        entry.policy_type = policy_cfg.get("type", "—")
+
+    # Training config
+    train_cfg = _try_download(repo_id, "train_config.json")
+    if train_cfg:
+        ds = train_cfg.get("dataset", {})
+        entry.dataset = ds.get("repo_id", "—") if isinstance(ds, dict) else str(ds)
+        entry.training_steps = str(train_cfg.get("steps", "—"))
+        entry.batch_size = str(train_cfg.get("batch_size", "—"))
+
+    # Eval results per env_type
+    for env_type in _list_eval_dirs(repo_id):
+        eval_info = _try_download(repo_id, f"eval/{env_type}/eval_info.json")
+        if not eval_info:
+            continue
+
+        per_group = eval_info.get("per_group", {})
+        group_results = {}
+        for group_name, stats in per_group.items():
+            group_results[group_name] = stats.get("pc_success", float("nan"))
+
+        entry.eval_results[env_type] = group_results
+
+        overall = eval_info.get("overall", {})
+        entry.eval_overall[env_type] = overall.get("pc_success", float("nan"))
+        entry.eval_n_episodes[env_type] = overall.get("n_episodes", 0)
+
+    return entry
+
+
+def fetch_all(repo_ids: list[str]) -> list[ModelEntry]:
+    entries = []
+    for repo_id in repo_ids:
+        logger.info(f"Fetching {repo_id}...")
+        try:
+            entries.append(fetch_model_entry(repo_id))
+        except Exception as e:
+            logger.warning(f"Failed to fetch {repo_id}: {e}")
+    return entries
+
+
+def collect_all_env_types(entries: list[ModelEntry]) -> list[str]:
+    """Collect all unique env_types across all entries, sorted."""
+    env_types: set[str] = set()
+    for e in entries:
+        env_types.update(e.eval_overall.keys())
+    return sorted(env_types)
+
+
+def collect_all_groups(entries: list[ModelEntry]) -> dict[str, list[str]]:
+    """Collect all unique group names per env_type."""
+    groups: dict[str, set[str]] = {}
+    for e in entries:
+        for env_type, group_results in e.eval_results.items():
+            groups.setdefault(env_type, set()).update(group_results.keys())
+    return {k: sorted(v) for k, v in groups.items()}
+
+
+def build_html(entries: list[ModelEntry], title: str = "LeRobot Eval Leaderboard") -> str:
+    env_types = collect_all_env_types(entries)
+    all_groups = collect_all_groups(entries)
+
+    # Build column structure: fixed cols + per env_type (overall + per-group sub-columns)
+    # We'll build the data as JSON and let JS handle rendering
+    table_data = []
+    for e in entries:
+        row = {
+            "repo_id": e.repo_id,
+            "policy_type": e.policy_type,
+            "dataset": e.dataset,
+            "training_steps": e.training_steps,
+            "batch_size": e.batch_size,
+        }
+        for env_type in env_types:
+            overall = e.eval_overall.get(env_type)
+            row[f"{env_type}__overall"] = round(overall, 1) if overall is not None else None
+            n_ep = e.eval_n_episodes.get(env_type)
+            row[f"{env_type}__n_episodes"] = n_ep if n_ep else None
+            for group in all_groups.get(env_type, []):
+                val = e.eval_results.get(env_type, {}).get(group)
+                row[f"{env_type}__{group}"] = round(val, 1) if val is not None else None
+        table_data.append(row)
+
+    # Build column definitions for the JS table
+    columns_json = json.dumps(_build_column_defs(env_types, all_groups))
+    data_json = json.dumps(table_data)
+
+    return _HTML_TEMPLATE.format(
+        title=title,
+        columns_json=columns_json,
+        data_json=data_json,
+    )
+
+
+def _build_column_defs(env_types: list[str], all_groups: dict[str, list[str]]) -> list[dict]:
+    cols = [
+        {"key": "repo_id", "label": "Model", "group": "Model Info", "sortable": True, "type": "link"},
+        {"key": "policy_type", "label": "Policy", "group": "Model Info", "sortable": True, "type": "text"},
+        {"key": "dataset", "label": "Dataset", "group": "Model Info", "sortable": True, "type": "text"},
+        {
+            "key": "training_steps",
+            "label": "Steps",
+            "group": "Training",
+            "sortable": True,
+            "type": "number",
+        },
+        {
+            "key": "batch_size",
+            "label": "Batch",
+            "group": "Training",
+            "sortable": True,
+            "type": "number",
+        },
+    ]
+    for env_type in env_types:
+        cols.append(
+            {
+                "key": f"{env_type}__overall",
+                "label": "Overall %",
+                "group": env_type,
+                "sortable": True,
+                "type": "pct",
+            }
+        )
+        for group in all_groups.get(env_type, []):
+            cols.append(
+                {
+                    "key": f"{env_type}__{group}",
+                    "label": f"{group} %",
+                    "group": env_type,
+                    "sortable": True,
+                    "type": "pct",
+                }
+            )
+        cols.append(
+            {
+                "key": f"{env_type}__n_episodes",
+                "label": "Episodes",
+                "group": env_type,
+                "sortable": True,
+                "type": "number",
+            }
+        )
+    return cols
+
+
+_HTML_TEMPLATE = """\
+<!DOCTYPE html>
+<html lang="en">
+<head>
+<meta charset="utf-8" />
+<meta name="viewport" content="width=device-width, initial-scale=1" />
+<title>{title}</title>
+<style>
+  :root {{
+    --bg: #0d1117;
+    --surface: #161b22;
+    --border: #30363d;
+    --text: #e6edf3;
+    --text-muted: #8b949e;
+    --accent: #58a6ff;
+    --green: #3fb950;
+    --yellow: #d29922;
+    --red: #f85149;
+    --header-bg: #1c2128;
+  }}
+  * {{ box-sizing: border-box; margin: 0; padding: 0; }}
+  body {{
+    font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif;
+    background: var(--bg);
+    color: var(--text);
+    padding: 24px;
+    line-height: 1.5;
+  }}
+  h1 {{
+    font-size: 1.75rem;
+    font-weight: 600;
+    margin-bottom: 8px;
+  }}
+  .subtitle {{
+    color: var(--text-muted);
+    font-size: 0.9rem;
+    margin-bottom: 20px;
+  }}
+  .controls {{
+    display: flex;
+    gap: 12px;
+    margin-bottom: 16px;
+    flex-wrap: wrap;
+    align-items: center;
+  }}
+  .controls input {{
+    background: var(--surface);
+    border: 1px solid var(--border);
+    color: var(--text);
+    padding: 8px 14px;
+    border-radius: 6px;
+    font-size: 0.875rem;
+    width: 280px;
+    outline: none;
+  }}
+  .controls input:focus {{ border-color: var(--accent); }}
+  .controls label {{
+    color: var(--text-muted);
+    font-size: 0.8rem;
+    display: flex;
+    align-items: center;
+    gap: 6px;
+    cursor: pointer;
+  }}
+  .controls input[type="checkbox"] {{
+    width: auto;
+    accent-color: var(--accent);
+  }}
+  .table-wrap {{
+    overflow-x: auto;
+    border: 1px solid var(--border);
+    border-radius: 8px;
+  }}
+  table {{
+    border-collapse: collapse;
+    width: 100%;
+    font-size: 0.85rem;
+    white-space: nowrap;
+  }}
+  thead th {{
+    background: var(--header-bg);
+    color: var(--text-muted);
+    font-weight: 600;
+    text-transform: uppercase;
+    font-size: 0.7rem;
+    letter-spacing: 0.05em;
+    padding: 10px 14px;
+    border-bottom: 1px solid var(--border);
+    cursor: pointer;
+    user-select: none;
+    position: sticky;
+    top: 0;
+    z-index: 2;
+  }}
+  thead th:hover {{ color: var(--text); }}
+  thead th .arrow {{ margin-left: 4px; opacity: 0.4; }}
+  thead th.sorted .arrow {{ opacity: 1; color: var(--accent); }}
+  thead tr.group-header th {{
+    text-align: center;
+    font-size: 0.75rem;
+    letter-spacing: 0.08em;
+    border-right: 1px solid var(--border);
+    cursor: default;
+  }}
+  tbody tr {{ border-bottom: 1px solid var(--border); }}
+  tbody tr:hover {{ background: rgba(88,166,255,0.06); }}
+  td {{
+    padding: 10px 14px;
+    vertical-align: middle;
+  }}
+  td.model-cell a {{
+    color: var(--accent);
+    text-decoration: none;
+    font-weight: 500;
+  }}
+  td.model-cell a:hover {{ text-decoration: underline; }}
+  td.pct {{
+    font-weight: 600;
+    font-variant-numeric: tabular-nums;
+    text-align: right;
+  }}
+  td.number {{
+    text-align: right;
+    font-variant-numeric: tabular-nums;
+  }}
+  .pct-high {{ color: var(--green); }}
+  .pct-mid  {{ color: var(--yellow); }}
+  .pct-low  {{ color: var(--red); }}
+  .pct-na   {{ color: var(--text-muted); font-weight: 400; }}
+  .best-in-col {{ background: rgba(63,185,80,0.12); }}
+  footer {{
+    margin-top: 20px;
+    color: var(--text-muted);
+    font-size: 0.75rem;
+  }}
+  footer a {{ color: var(--accent); text-decoration: none; }}
+</style>
+</head>
+<body>
+
+<h1>&#129302; {title}</h1>
+<p class="subtitle">Click any column header to sort. Filter by typing below.</p>
+
+<div class="controls">
+  <input type="text" id="filter" placeholder="Filter models..." />
+  <label><input type="checkbox" id="toggleGroups" checked /> Show sub-suite columns</label>
+</div>
+
+<div class="table-wrap">
+  <table id="leaderboard">
+    <thead></thead>
+    <tbody></tbody>
+  </table>
+</div>
+
+<footer>
+  Generated by <a href="https://github.com/huggingface/lerobot">LeRobot</a> &middot;
+  Eval results from <a href="https://huggingface.co">Hugging Face Hub</a>
+</footer>
+
+<script>
+const COLUMNS = {columns_json};
+const DATA    = {data_json};
+
+let sortKey = null, sortAsc = true, showGroups = true;
+
+function pctClass(v) {{
+  if (v == null) return 'pct-na';
+  if (v >= 70)   return 'pct-high';
+  if (v >= 40)   return 'pct-mid';
+  return 'pct-low';
+}}
+
+function visibleCols() {{
+  if (showGroups) return COLUMNS;
+  return COLUMNS.filter(c => !c.key.includes('__') || c.key.endsWith('__overall') || c.key.endsWith('__n_episodes'));
+}}
+
+function bestPerCol(rows) {{
+  const best = {{}};
+  for (const c of visibleCols()) {{
+    if (c.type !== 'pct') continue;
+    let max = -Infinity;
+    for (const r of rows) {{
+      const v = r[c.key];
+      if (v != null && v > max) max = v;
+    }}
+    best[c.key] = max === -Infinity ? null : max;
+  }}
+  return best;
+}}
+
+function render() {{
+  const filter = document.getElementById('filter').value.toLowerCase();
+  let rows = DATA.filter(r => r.repo_id.toLowerCase().includes(filter)
+    || (r.policy_type||'').toLowerCase().includes(filter)
+    || (r.dataset||'').toLowerCase().includes(filter));
+
+  if (sortKey) {{
+    rows.sort((a, b) => {{
+      let va = a[sortKey], vb = b[sortKey];
+      if (va == null && vb == null) return 0;
+      if (va == null) return 1;
+      if (vb == null) return -1;
+      if (typeof va === 'string') va = va.toLowerCase();
+      if (typeof vb === 'string') vb = vb.toLowerCase();
+      return sortAsc ? (va < vb ? -1 : va > vb ? 1 : 0) : (va > vb ? -1 : va < vb ? 1 : 0);
+    }});
+  }}
+
+  const cols = visibleCols();
+  const best = bestPerCol(rows);
+
+  // Group header row
+  const groups = [];
+  let lastGroup = null;
+  for (const c of cols) {{
+    if (c.group !== lastGroup) {{
+      groups.push({{ label: c.group, span: 1 }});
+      lastGroup = c.group;
+    }} else {{
+      groups[groups.length - 1].span++;
+    }}
+  }}
+
+  const thead = document.querySelector('#leaderboard thead');
+  thead.innerHTML = '';
+
+  // Group header
+  const gtr = document.createElement('tr');
+  gtr.className = 'group-header';
+  for (const g of groups) {{
+    const th = document.createElement('th');
+    th.colSpan = g.span;
+    th.textContent = g.label;
+    gtr.appendChild(th);
+  }}
+  thead.appendChild(gtr);
+
+  // Column headers
+  const htr = document.createElement('tr');
+  for (const c of cols) {{
+    const th = document.createElement('th');
+    th.innerHTML = c.label + ' <span class="arrow">' + (sortKey === c.key ? (sortAsc ? '&#9650;' : '&#9660;') : '&#8693;') + '</span>';
+    if (sortKey === c.key) th.classList.add('sorted');
+    th.addEventListener('click', () => {{
+      if (sortKey === c.key) {{ sortAsc = !sortAsc; }}
+      else {{ sortKey = c.key; sortAsc = c.type === 'pct' ? false : true; }}
+      render();
+    }});
+    htr.appendChild(th);
+  }}
+  thead.appendChild(htr);
+
+  // Body
+  const tbody = document.querySelector('#leaderboard tbody');
+  tbody.innerHTML = '';
+  for (const r of rows) {{
+    const tr = document.createElement('tr');
+    for (const c of cols) {{
+      const td = document.createElement('td');
+      const v = r[c.key];
+      if (c.type === 'link') {{
+        td.className = 'model-cell';
+        const a = document.createElement('a');
+        a.href = 'https://huggingface.co/' + v;
+        a.target = '_blank';
+        a.textContent = v;
+        td.appendChild(a);
+      }} else if (c.type === 'pct') {{
+        td.className = 'pct ' + pctClass(v);
+        td.textContent = v != null ? v.toFixed(1) : '—';
+        if (v != null && best[c.key] != null && v === best[c.key] && rows.length > 1) {{
+          td.classList.add('best-in-col');
+        }}
+      }} else if (c.type === 'number') {{
+        td.className = 'number';
+        td.textContent = v != null ? v : '—';
+      }} else {{
+        td.textContent = v || '—';
+      }}
+      tr.appendChild(td);
+    }}
+    tbody.appendChild(tr);
+  }}
+}}
+
+document.getElementById('filter').addEventListener('input', render);
+document.getElementById('toggleGroups').addEventListener('change', (e) => {{
+  showGroups = e.target.checked;
+  render();
+}});
+
+render();
+</script>
+</body>
+</html>
+"""
+
+
+def parse_args(argv: list[str] | None = None) -> argparse.Namespace:
+    p = argparse.ArgumentParser(
+        description="Generate an interactive eval leaderboard from Hub model repos.",
+    )
+    p.add_argument(
+        "repo_ids_file",
+        type=str,
+        help="Path to a text file with one repo ID per line.",
+    )
+    p.add_argument(
+        "--output",
+        type=str,
+        default="leaderboard.html",
+        help="Output HTML file path (default: leaderboard.html).",
+    )
+    p.add_argument(
+        "--title",
+        type=str,
+        default="LeRobot Eval Leaderboard",
+        help="Title shown in the leaderboard page.",
+    )
+    return p.parse_args(argv)
+
+
+def main(argv: list[str] | None = None):
+    args = parse_args(argv)
+
+    path = Path(args.repo_ids_file)
+    if not path.exists():
+        logger.error(f"File not found: {path}")
+        sys.exit(1)
+    repo_ids = [line.strip() for line in path.read_text().splitlines() if line.strip() and not line.startswith("#")]
+    if not repo_ids:
+        logger.error(f"No repo IDs found in {path}")
+        sys.exit(1)
+
+    entries = fetch_all(repo_ids)
+    if not entries:
+        logger.error("No valid entries found.")
+        sys.exit(1)
+
+    html = build_html(entries, title=args.title)
+    out = Path(args.output)
+    out.write_text(html)
+    logger.info(f"Leaderboard written to {out.resolve()}")
+
+
+if __name__ == "__main__":
+    main()
--- a/src/lerobot/scripts/lerobot_train.py
+++ b/src/lerobot/scripts/lerobot_train.py
@@ -29,8 +29,7 @@ from tqdm import tqdm
 from lerobot.configs import parser
 from lerobot.configs.train import TrainPipelineConfig
 from lerobot.datasets.factory import make_dataset
-from lerobot.datasets.multi_dataset import NewMultiLeRobotDataset
-from lerobot.datasets.sampler import EpisodeAwareSampler, WeightedEpisodeAwareSampler
+from lerobot.datasets.sampler import EpisodeAwareSampler
 from lerobot.datasets.utils import cycle
 from lerobot.envs.factory import make_env, make_env_pre_post_processors
 from lerobot.envs.utils import close_envs
@@ -344,25 +343,13 @@ def train(cfg: TrainPipelineConfig, accelerator: Accelerator | None = None):
        logging.info(f"{num_total_params=} ({format_big_number(num_total_params)})")

    # create dataloader for offline training
-    drop_n_last = getattr(cfg.policy, "drop_n_last_frames", 0)
-
-    if isinstance(dataset, NewMultiLeRobotDataset):
-        shuffle = False
-        sampler = WeightedEpisodeAwareSampler(
-            dataset.meta.episodes["dataset_from_index"],
-            dataset.meta.episodes["dataset_to_index"],
-            dataset_membership=dataset.meta.episodes["dataset_source"],
-            dataset_weights=dataset.dataset_weights,
-            drop_n_last_frames=drop_n_last,
-            shuffle=True,
-        )
-    elif drop_n_last > 0:
+    if hasattr(cfg.policy, "drop_n_last_frames"):
        shuffle = False
        sampler = EpisodeAwareSampler(
            dataset.meta.episodes["dataset_from_index"],
            dataset.meta.episodes["dataset_to_index"],
            episode_indices_to_use=dataset.episodes,
-            drop_n_last_frames=drop_n_last,
+            drop_n_last_frames=cfg.policy.drop_n_last_frames,
            shuffle=True,
        )
    else:
@@ -373,7 +360,7 @@ def train(cfg: TrainPipelineConfig, accelerator: Accelerator | None = None):
        dataset,
        num_workers=cfg.num_workers,
        batch_size=cfg.batch_size,
-        shuffle=shuffle and not getattr(cfg.dataset, "streaming", False),
+        shuffle=shuffle and not cfg.dataset.streaming,
        sampler=sampler,
        pin_memory=device.type == "cuda",
        drop_last=False,
--- a/src/lerobot/utils/hf_eval_results.py
+++ b/src/lerobot/utils/hf_eval_results.py
@@ -0,0 +1,161 @@
+#!/usr/bin/env python
+
+# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Helpers for building and uploading HF-native `.eval_results` YAML rows.
+
+The `.eval_results` format is consumed by the HuggingFace leaderboard surface.
+See https://huggingface.co/docs/evaluate/en/evaluation-on-hub for the spec.
+"""
+
+from __future__ import annotations
+
+import math
+from datetime import date
+from pathlib import Path
+from typing import Any
+
+import yaml
+
+
+def default_eval_date() -> str:
+    """Return today's UTC date as an ISO-8601 string (YYYY-MM-DD)."""
+    return date.today().isoformat()
+
+
+def build_eval_results_rows(
+    *,
+    info: dict,
+    env_type: str,
+    env_task: str | None,
+    benchmark_dataset_id: str,
+    eval_date: str,
+    source_url: str | None = None,
+    notes: str | None = None,
+) -> list[dict[str, Any]]:
+    """Build a list of `.eval_results` rows from an eval info dict.
+
+    Each row represents one (task_group, metric) combination.  When no
+    per-group breakdown is available a single overall row is emitted.
+
+    Args:
+        info: The dict returned by ``_aggregate_eval_from_per_task`` / ``eval_policy_all``.
+            Expected keys: ``overall``, optionally ``per_group``.
+        env_type: Environment type string (e.g. ``"libero_plus"``).
+        env_task: The env task string from eval config (may be ``None``).
+        benchmark_dataset_id: HF dataset repo-id of the consolidated benchmark dataset.
+        eval_date: ISO-8601 date string (use ``default_eval_date()``).
+        source_url: Optional URL to the evaluation run / report.
+        notes: Optional free-text notes about the evaluation setup.
+
+    Returns:
+        A list of row dicts ready for serialisation with ``upload_eval_results_yaml``.
+    """
+    rows: list[dict[str, Any]] = []
+    task_name = env_task or env_type
+
+    def _safe(v: float) -> float:
+        return 0.0 if (v is None or (isinstance(v, float) and math.isnan(v))) else float(v)
+
+    def _make_row(config_name: str, pc_success: float, n_episodes: int) -> dict[str, Any]:
+        row: dict[str, Any] = {
+            "task": {
+                "type": "robotics",
+                "name": task_name,
+            },
+            "dataset": {
+                "name": benchmark_dataset_id,
+                "type": benchmark_dataset_id,
+                "config": config_name,
+                "split": "test",
+            },
+            "metrics": [
+                {
+                    "type": "success_rate",
+                    "value": _safe(pc_success),
+                    "name": "Success Rate (%)",
+                },
+                {
+                    "type": "n_episodes",
+                    "value": n_episodes,
+                    "name": "Number of Episodes",
+                },
+            ],
+            "evaluated_at": eval_date,
+        }
+        if source_url:
+            row["source_url"] = source_url
+        if notes:
+            row["notes"] = notes
+        return row
+
+    per_group: dict = info.get("per_group", {})
+    if per_group:
+        for group_name, group_metrics in per_group.items():
+            rows.append(
+                _make_row(
+                    config_name=group_name,
+                    pc_success=group_metrics.get("pc_success", float("nan")),
+                    n_episodes=group_metrics.get("n_episodes", 0),
+                )
+            )
+    else:
+        overall = info.get("overall", {})
+        rows.append(
+            _make_row(
+                config_name=env_type,
+                pc_success=overall.get("pc_success", float("nan")),
+                n_episodes=overall.get("n_episodes", 0),
+            )
+        )
+
+    return rows
+
+
+def upload_eval_results_yaml(
+    *,
+    api: Any,
+    repo_id: str,
+    rows: list[dict[str, Any]],
+    env_type: str,
+    env_task: str | None,
+    output_dir: Path,
+) -> str:
+    """Serialise ``rows`` to YAML and upload to the Hub model repo.
+
+    The file is written locally to ``output_dir/eval_results.yaml`` and
+    then uploaded to ``eval/{env_type}/eval_results.yaml`` in ``repo_id``.
+
+    Args:
+        api: An instantiated ``huggingface_hub.HfApi`` object.
+        repo_id: HF model repo (e.g. ``"user/my_policy"``).
+        rows: Rows produced by ``build_eval_results_rows``.
+        env_type: Environment type string (used for the Hub path prefix).
+        env_task: The env task string (unused, kept for API symmetry).
+        output_dir: Local directory to write the YAML before uploading.
+
+    Returns:
+        URL of the Hub commit containing the uploaded file.
+    """
+    yaml_path = Path(output_dir) / "eval_results.yaml"
+    yaml_path.write_text(yaml.dump({"eval_results": rows}, sort_keys=False, allow_unicode=True))
+
+    commit_url = api.upload_file(
+        path_or_fileobj=str(yaml_path),
+        path_in_repo=f"eval/{env_type}/eval_results.yaml",
+        repo_id=repo_id,
+        commit_message=f"Upload eval results YAML for {env_type}",
+    )
+    return str(commit_url)
--- a/tests/envs/test_lazy_vec_env.py
+++ b/tests/envs/test_lazy_vec_env.py
@@ -0,0 +1,62 @@
+#!/usr/bin/env python
+
+# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from lerobot.envs.lazy_vec_env import LazyVectorEnv
+
+
+class _DummyVectorEnv:
+    def __init__(self):
+        self.marker = "ok"
+        self.closed = False
+
+    def close(self):
+        self.closed = True
+
+
+def test_lazy_vec_env_materializes_only_on_access():
+    created = []
+
+    def _make_env(fns):
+        created.append(len(fns))
+        return _DummyVectorEnv()
+
+    lazy = LazyVectorEnv(_make_env, [lambda: None, lambda: None])
+    assert created == []
+    assert lazy.num_factory_fns == 2
+
+    assert lazy.marker == "ok"
+    assert created == [2]
+
+    # Second access should re-use the same materialized env.
+    assert lazy.marker == "ok"
+    assert created == [2]
+
+
+def test_lazy_vec_env_can_rematerialize_after_close():
+    created = []
+
+    def _make_env(fns):
+        created.append(len(fns))
+        return _DummyVectorEnv()
+
+    lazy = LazyVectorEnv(_make_env, [lambda: None])
+    lazy.materialize()
+    assert created == [1]
+
+    lazy.close()
+    lazy.materialize()
+    assert created == [1, 1]
+
--- a/tests/scripts/test_eval_scheduler.py
+++ b/tests/scripts/test_eval_scheduler.py
@@ -0,0 +1,244 @@
+#!/usr/bin/env python
+
+# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from lerobot.envs.lazy_vec_env import LazyVectorEnv
+from lerobot.scripts import lerobot_eval
+
+
+class _DummyTaskEnv:
+    def __init__(self):
+        self.close_calls = 0
+
+    def close(self):
+        self.close_calls += 1
+
+
+class _TrackedLazyEnv(LazyVectorEnv):
+    def __init__(self, n_factory_fns: int = 1):
+        super().__init__(lambda fns: None, [lambda: None for _ in range(n_factory_fns)])
+        self.close_calls = 0
+
+    def close(self):
+        self.close_calls += 1
+        super().close()
+
+
+def _fake_metrics():
+    return {
+        "sum_rewards": [1.0],
+        "max_rewards": [1.0],
+        "successes": [True],
+        "video_paths": [],
+    }
+
+
+def test_eval_policy_all_sequential_closes_envs(monkeypatch):
+    def _fake_run_one(task_group, task_id, env, **kwargs):  # noqa: ARG001
+        return task_group, task_id, _fake_metrics()
+
+    monkeypatch.setattr(lerobot_eval, "run_one", _fake_run_one)
+    env_a = _DummyTaskEnv()
+    env_b = _DummyTaskEnv()
+    envs = {"suite": {0: env_a, 1: env_b}}
+
+    result = lerobot_eval.eval_policy_all(
+        envs=envs,
+        policy=None,
+        env_preprocessor=None,
+        env_postprocessor=None,
+        preprocessor=None,
+        postprocessor=None,
+        n_episodes=1,
+        max_parallel_tasks=1,
+    )
+
+    assert env_a.close_calls == 1
+    assert env_b.close_calls == 1
+    assert result["overall"]["n_episodes"] == 2
+
+
+def test_eval_policy_all_threaded_fallback_closes_envs(monkeypatch):
+    def _fake_run_one(task_group, task_id, env, **kwargs):  # noqa: ARG001
+        return task_group, task_id, _fake_metrics()
+
+    monkeypatch.setattr(lerobot_eval, "run_one", _fake_run_one)
+    env_a = _DummyTaskEnv()
+    env_b = _DummyTaskEnv()
+    env_c = _DummyTaskEnv()
+    envs = {"suite": {0: env_a, 1: env_b, 2: env_c}}
+
+    result = lerobot_eval.eval_policy_all(
+        envs=envs,
+        policy=None,
+        env_preprocessor=None,
+        env_postprocessor=None,
+        preprocessor=None,
+        postprocessor=None,
+        n_episodes=1,
+        max_parallel_tasks=2,
+    )
+
+    assert env_a.close_calls == 1
+    assert env_b.close_calls == 1
+    assert env_c.close_calls == 1
+    assert result["overall"]["n_episodes"] == 3
+
+
+def test_eval_policy_all_uses_batched_lazy_mode(monkeypatch):
+    def _run_one_should_not_be_called(*args, **kwargs):
+        raise AssertionError("run_one should not run in batched lazy mode")
+
+    chunk_sizes = []
+
+    def _fake_eval_task_batch(chunk, **kwargs):  # noqa: ARG001
+        chunk_sizes.append(len(chunk))
+        return [(tg, tid, _fake_metrics()) for tg, tid, _ in chunk]
+
+    monkeypatch.setattr(lerobot_eval, "run_one", _run_one_should_not_be_called)
+    monkeypatch.setattr(lerobot_eval, "_eval_task_batch", _fake_eval_task_batch)
+
+    envs = {
+        "suite": {
+            0: LazyVectorEnv(lambda fns: None, [lambda: None]),
+            1: LazyVectorEnv(lambda fns: None, [lambda: None]),
+            2: LazyVectorEnv(lambda fns: None, [lambda: None]),
+        }
+    }
+
+    result = lerobot_eval.eval_policy_all(
+        envs=envs,
+        policy=None,
+        env_preprocessor=None,
+        env_postprocessor=None,
+        preprocessor=None,
+        postprocessor=None,
+        n_episodes=1,
+        max_parallel_tasks=2,
+    )
+
+    assert chunk_sizes == [2, 1]
+    assert result["overall"]["n_episodes"] == 3
+
+
+def test_eval_policy_all_disables_batched_lazy_when_n_episodes_not_one(monkeypatch):
+    def _fake_run_one(task_group, task_id, env, **kwargs):  # noqa: ARG001
+        return task_group, task_id, _fake_metrics()
+
+    def _batch_should_not_run(*args, **kwargs):
+        raise AssertionError("_eval_task_batch should not run when n_episodes != 1")
+
+    monkeypatch.setattr(lerobot_eval, "run_one", _fake_run_one)
+    monkeypatch.setattr(lerobot_eval, "_eval_task_batch", _batch_should_not_run)
+
+    env_a = _TrackedLazyEnv()
+    env_b = _TrackedLazyEnv()
+    envs = {"suite": {0: env_a, 1: env_b}}
+
+    result = lerobot_eval.eval_policy_all(
+        envs=envs,
+        policy=None,
+        env_preprocessor=None,
+        env_postprocessor=None,
+        preprocessor=None,
+        postprocessor=None,
+        n_episodes=2,
+        max_parallel_tasks=2,
+    )
+
+    assert env_a.close_calls == 1
+    assert env_b.close_calls == 1
+    assert result["overall"]["n_episodes"] == 2
+
+
+def test_eval_policy_all_disables_batched_lazy_when_batch_size_above_one(monkeypatch):
+    def _fake_run_one(task_group, task_id, env, **kwargs):  # noqa: ARG001
+        return task_group, task_id, _fake_metrics()
+
+    def _batch_should_not_run(*args, **kwargs):
+        raise AssertionError("_eval_task_batch should not run when eval.batch_size > 1")
+
+    monkeypatch.setattr(lerobot_eval, "run_one", _fake_run_one)
+    monkeypatch.setattr(lerobot_eval, "_eval_task_batch", _batch_should_not_run)
+
+    env_a = _TrackedLazyEnv(n_factory_fns=2)
+    env_b = _TrackedLazyEnv(n_factory_fns=2)
+    envs = {"suite": {0: env_a, 1: env_b}}
+
+    result = lerobot_eval.eval_policy_all(
+        envs=envs,
+        policy=None,
+        env_preprocessor=None,
+        env_postprocessor=None,
+        preprocessor=None,
+        postprocessor=None,
+        n_episodes=1,
+        max_parallel_tasks=2,
+    )
+
+    assert env_a.close_calls == 1
+    assert env_b.close_calls == 1
+    assert result["overall"]["n_episodes"] == 2
+
+
+def test_eval_policy_all_applies_instance_sharding(monkeypatch):
+    called = []
+
+    def _fake_run_one(task_group, task_id, env, **kwargs):  # noqa: ARG001
+        called.append(task_id)
+        return task_group, task_id, _fake_metrics()
+
+    monkeypatch.setattr(lerobot_eval, "run_one", _fake_run_one)
+    envs = {"suite": {0: _DummyTaskEnv(), 1: _DummyTaskEnv(), 2: _DummyTaskEnv(), 3: _DummyTaskEnv()}}
+
+    result = lerobot_eval.eval_policy_all(
+        envs=envs,
+        policy=None,
+        env_preprocessor=None,
+        env_postprocessor=None,
+        preprocessor=None,
+        postprocessor=None,
+        n_episodes=1,
+        max_parallel_tasks=1,
+        instance_count=2,
+        instance_id=1,
+    )
+
+    assert called == [1, 3]
+    assert result["overall"]["n_episodes"] == 2
+
+
+def test_aggregate_eval_from_per_task_merges_groups_and_overall():
+    per_task = [
+        {
+            "task_group": "a",
+            "task_id": 0,
+            "metrics": {"sum_rewards": [1.0], "max_rewards": [2.0], "successes": [True], "video_paths": ["v0"]},
+        },
+        {
+            "task_group": "b",
+            "task_id": 1,
+            "metrics": {"sum_rewards": [3.0], "max_rewards": [4.0], "successes": [False], "video_paths": []},
+        },
+    ]
+
+    merged = lerobot_eval._aggregate_eval_from_per_task(per_task, total_eval_s=10.0)
+
+    assert merged["overall"]["n_episodes"] == 2
+    assert merged["overall"]["avg_sum_reward"] == 2.0
+    assert merged["overall"]["pc_success"] == 50.0
+    assert merged["overall"]["eval_s"] == 10.0
+    assert set(merged["per_group"]) == {"a", "b"}
+
--- a/tests/scripts/test_eval_worker.py
+++ b/tests/scripts/test_eval_worker.py
@@ -0,0 +1,34 @@
+#!/usr/bin/env python
+
+# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from pathlib import Path
+
+from lerobot.scripts import lerobot_eval_worker
+
+
+def test_worker_main_writes_results(monkeypatch, tmp_path: Path):
+    cfg = lerobot_eval_worker.EvalWorkerConfig(
+        env="pusht",  # type: ignore[arg-type]
+        instance_id=3,
+        output_path=tmp_path / "worker.json",
+    )
+
+    monkeypatch.setattr(lerobot_eval_worker, "run_worker", lambda _cfg: {"per_task": []})
+
+    lerobot_eval_worker.worker_main(cfg)
+
+    assert cfg.output_path.exists()
+    assert cfg.output_path.read_text().strip() == '{\n  "per_task": []\n}'
Author	SHA1	Message	Date
Pepijn Kooijmans	8770c011b0	feat(eval): add per-episode timing logs to eval worker Logs avg env step time, avg inference call time, and totals per episode to identify whether env or policy server is the bottleneck. Made-with: Cursor	2026-03-25 07:30:49 +01:00
Pepijn Kooijmans	ddcda8f1ca	feat(eval): respect n_action_steps in inference server The server now returns only n_action_steps actions from the predicted chunk instead of the full chunk_size, enabling more frequent re-planning when n_action_steps < chunk_size. Made-with: Cursor	2026-03-25 06:21:02 +01:00
Pepijn Kooijmans	4f8ebe41b3	fix(eval): create independent preprocessors per policy server Each _InferenceServer now gets its own preprocessor/postprocessor instances, preventing RuntimeError from HuggingFace tokenizer's non-thread-safe Rust borrow checker when multiple servers run concurrently. Made-with: Cursor	2026-03-24 22:38:03 +01:00
Pepijn Kooijmans	066976e078	feat(eval): add multiprocess runtime -- no Docker needed New eval.runtime=multiprocess spawns local lerobot-eval-worker subprocesses instead of Docker containers. Supports eval.policy_servers for parallel inference. Works on SLURM clusters and anywhere Docker is unavailable. Usage: lerobot-eval --eval.runtime=multiprocess \ --eval.instance_count=8 --eval.policy_servers=4 --eval.port=50051 Made-with: Cursor	2026-03-24 22:14:27 +01:00
Pepijn Kooijmans	b3c2592ace	feat(eval): multi-policy-server support for Docker eval Add eval.policy_servers parameter (default 1) that spawns N independent policy inference servers on consecutive ports. Containers are round-robin assigned across servers, enabling parallel GPU inference for small models like SmolVLA (~1.4GB each). Usage: --eval.policy_servers=4 --eval.instance_count=20 → 4 model copies on GPU, 20 containers distributed across them. Made-with: Cursor	2026-03-24 20:28:58 +01:00
Pepijn Kooijmans	b97ea8999f	fix(docker): create libero config.yaml for non-plus LIBERO builds The generic libero benchmark case was falling through to the wildcard install path, which doesn't pre-create ~/.libero/config.yaml. This caused an interactive input() prompt that crashes in Docker (EOFError). Made-with: Cursor	2026-03-24 07:11:18 +01:00
Pepijn Kooijmans	69aeda68f5	Docker EGL/GLVND support + asset download refactor + imagenet stats fix - Add NVIDIA EGL/Vulkan vendor ICDs and graphics libs to both Dockerfiles - Refactor LIBERO-plus asset download into a separate build step - Fix KeyError in datasets/factory.py when stats dict is None or missing keys Made-with: Cursor	2026-03-23 23:33:22 +01:00
Pepijn Kooijmans	a9e355bd03	Lazy env creation + smart sharding to fix container OOM	2026-03-23 23:15:23 +01:00
Pepijn Kooijmans	aae68e3448	fix(docker): use recursive glob for deeply nested asset zip structure The LIBERO-plus assets.zip has a deeply nested path (inspire/hdd/project/.../assets) that didn't match the shallow glob. Use recursive glob to find assets/scenes regardless of nesting depth. Made-with: Cursor	2026-03-23 18:57:44 +01:00
Pepijn Kooijmans	4b9f6c4aed	fix(docker): download LIBERO-plus assets (~6 GB) at image build time The benchmark containers were missing the scene/texture/object assets required by LIBERO-plus. Download them from HuggingFace Hub during the Docker build so containers are self-contained and ready to run. Made-with: Cursor	2026-03-23 18:48:25 +01:00
Pepijn Kooijmans	6057638fc1	fix(docker): pre-create libero config.yaml to avoid interactive input() prompt The upstream libero __init__.py calls input() when ~/.libero/config.yaml is missing, which crashes in non-interactive Docker containers with EOFError. Pre-create the config with default paths at build time using importlib.util.find_spec to locate the module without triggering the problematic import. Made-with: Cursor	2026-03-23 18:44:14 +01:00
Pepijn Kooijmans	e52e7e644a	fix(docker): add libero_plus install workaround to generic Dockerfile.benchmark The generic Dockerfile.benchmark was using a plain `uv pip install ".[libero_plus]"` which silently fails to make `libero` importable due to an upstream LIBERO-plus packaging bug. Port the dedicated clone + .pth workaround from Dockerfile.eval-libero-plus so `docker build --build-arg BENCHMARK=libero_plus` produces working containers. Also fix eval worker using nonexistent `parser.parse()` — use `draccus.parse()`. Made-with: Cursor	2026-03-23 18:31:57 +01:00
Pepijn	8633608d26	fix(docker): pin numpy==2.2.5 in separate RUN for robocasa robocasa/__init__.py hard-asserts numpy==2.2.5. When bundled with other packages in one uv install command, uv silently skips the numpy pin (same "already resolved" bug hit with libero_plus). Moving the pin to a dedicated final RUN step guarantees it is applied last. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-22 22:04:42 -07:00
Pepijn	900e6b59c8	fix(docker): pin mujoco==3.3.1 for robocasa (hard assert on import) robocasa/__init__.py asserts mujoco.__version__ == "3.3.1" and aborts with an error if any other version is installed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-22 21:53:37 -07:00
Pepijn	f844fe683c	fix(docker): use robosuite master branch for robocasa (per README) robocasa README explicitly says to use the master branch of ARISE-Initiative/robosuite (no robocasa-specific branch exists). Also install robocasa with --no-deps to bypass its lerobot==0.3.3 pin, and declare its actual runtime deps explicitly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-22 21:51:18 -07:00
Pepijn	4403675b31	fix(docker): install robocasa's robosuite fork (adds PandaOmron) Standard robosuite 1.4.x from PyPI doesn't include PandaOmron and other robocasa-specific robots. robocasa requires the fork at ARISE-Initiative/robosuite@robocasa_v1.4.1. Install both from source with --no-deps; shared deps (easydict, scikit-image, scipy) installed explicitly first. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-22 21:40:48 -07:00
Pepijn	d18be0c3f4	feat(docker): add metaworld to default benchmark build list Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-22 21:38:54 -07:00
Pepijn	866f8adf11	fix(docker): install robocasa from GitHub source (not on PyPI) robocasa is not published to PyPI, so uv can't resolve it as a plain package dep. Fix by installing its runtime deps explicitly and cloning robocasa from GitHub with --no-deps (same pattern as libero_plus). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-22 21:38:19 -07:00
Pepijn	3d6310c03d	fix(docker): also override numpy==1.26.4 for robomme image mani-skill==3.0.0b21 requires numpy<2.0.0 in addition to gymnasium==0.29.1, both conflicting with lerobot's base requirements. numpy 1.26.4 is runtime-compatible with lerobot's usage (no numpy 2.x-only APIs are used in the eval worker or env wrappers). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-22 21:34:50 -07:00
Pepijn	c3b26382e7	fix(docker): override gymnasium==0.29.1 for robomme image mani-skill==3.0.0b21 (robomme dep) pins gymnasium==0.29.1, conflicting with lerobot's gymnasium>=1.1.1. Use uv --override to force 0.29.1. Both 0.29.x and 1.x use the same 5-tuple step() API (introduced in gymnasium 0.26), so the eval worker and RoboMMEGymEnv wrapper are fully compatible with the downgraded version. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-22 21:28:34 -07:00
Pepijn	e54e582a6f	fix(docker): bypass uv extras chain bug in libero_plus Dockerfile uv silently skips packages when resolving a nested extras chain (lerobot[libero_plus] -> lerobot[libero] -> hf-libero -> robosuite). POST-INSTALL grep confirmed robosuite absent after install despite uv reporting 'Resolved 113 packages, Installed 1'. Fix: install all libero_plus deps directly by name, bypassing the extras chain entirely. Also add --plain flag to build script for verbose output. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-22 21:12:26 -07:00
Pepijn	418791ebba	debug(docker): add pre/post-install diagnostics to libero_plus Dockerfile Temporary diagnostic to identify why uv sees robosuite as already installed in the base venv despite it not being a base lerobot dep. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-22 21:03:42 -07:00
Pepijn	ee3354a885	fix(docker): fix libero_plus deps by replacing git dep with lerobot[libero] The libero @ git+...@main dep had empty install_requires, causing uv to skip robosuite (and other deps) during resolution — they appeared "already resolved" from a stale git dep cache even though not installed. Fix: use lerobot[libero] as the dep source (hf-libero properly declares all deps including robosuite via robomimic). The LIBERO-plus Python module is installed from the git clone with --no-deps, so hf-libero's declared deps are used but LIBERO-plus's environments override via .pth. Also remove egl_probe (broken original) duplicate alongside hf-egl-probe. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-22 20:57:14 -07:00
Pepijn	2cd06fe95b	fix(docker): exclude .venv from Docker build context Without this, the server's local .venv gets copied into the image by the final COPY . . step in Dockerfile.eval-base, overwriting the freshly-created uv venv. uv then sees those packages as already installed and skips them — but they may be missing or built for the wrong environment, causing ModuleNotFoundError at runtime. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-22 16:13:32 -07:00
Pepijn	7be84cb545	fix(docker): add CMAKE_POLICY_VERSION_MINIMUM=3.5 for cmake 4.x compat cmake 4.x removed backward compat with cmake_minimum_required < 3.5, breaking egl-probe compilation. Setting CMAKE_POLICY_VERSION_MINIMUM=3.5 in the base image ENV re-enables it so robomimic's egl-probe builds. Also adds --no-cache-base flag to build script so the base can be force-rebuilt when Dockerfile.eval-base changes, and pins hf-egl-probe in libero extras as the upstream-fixed fork of egl-probe. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-22 15:45:36 -07:00
Pepijn	c35af1ae6a	fix(docker): use uv pip instead of pip in benchmark Dockerfiles pip's backtracking resolver hits 'resolution-too-deep' on complex dependency graphs (robomme → mani-skill, libero_plus → robosuite/bddl). uv resolves the same graphs in seconds without backtracking issues. Also removes the now-redundant PATH= prefix since uv and python are already on PATH via the base image ENV. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-22 15:17:03 -07:00
Pepijn	6fc024704e	fix(deps): add missing future dep for bddl in libero_plus extras bddl imports future (from-future) at package init but doesn't declare it as a dependency. This caused ModuleNotFoundError inside the benchmark Docker container when verifying the libero_plus install. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-22 11:17:51 -07:00
Pepijn	c3b7a18f01	feat(docker): add build_benchmark_images.sh to build and push all eval images Builds lerobot-eval-base then each benchmark image (libero, libero_plus, robomme, robocasa), runs the smoke tests, and optionally pushes to Docker Hub. Usage: bash docker/build_benchmark_images.sh # local only bash docker/build_benchmark_images.sh --push --hub_org=<org> # push to Hub Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-22 11:03:09 -07:00
Pepijn	7fc0cdf68a	fix(eval): skip multi-instance orchestration when runtime=docker _orchestrate_multi_instance_eval spawns extra lerobot-eval processes that each call run_eval_in_docker again, creating N^2 containers. For docker runtime, instance_count directly controls how many env-worker containers are spawned by run_eval_in_docker — no process-level orchestration is needed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-20 22:48:06 -07:00
Pepijn	23bf69ebab	feat(eval): add hf_eval_results utility for HF leaderboard upload Implements the missing lerobot.utils.hf_eval_results module imported by lerobot_eval.py but never created. Provides: - default_eval_date(): today's UTC date as ISO-8601 - build_eval_results_rows(): converts eval info dict → HF .eval_results rows - upload_eval_results_yaml(): serialises rows and uploads to Hub model repo Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-20 22:41:21 -07:00
Pepijn	3d5d8fa88a	feat(eval): implement docker runtime with HTTP policy inference server Add docker_runtime.py (host-side) and lerobot_eval_worker.py (container-side) for --eval.runtime=docker. Policy loads once on the host GPU; Docker containers run env-only workers that call back via HTTP for action chunks, maximising GPU utilisation across parallel benchmark tasks. - _InferenceServer: HTTP server wrapping predict_action_chunk with a single lock - run_eval_in_docker: spawns instance_count containers, collects + merges per-task JSON, writes eval_info.json compatible with _aggregate_eval_from_per_task - lerobot-eval-worker CLI: make_env → shard tasks → run episodes → write JSON - EvalDockerConfig: add port field (default 50051) - pyproject.toml: add lerobot-eval-worker entry point Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-20 22:35:59 -07:00
Pepijn	e80c9e6270	chore(docker): remove docker-compose, use individual build/run commands Replace docker-compose.benchmark.yml with per-image docker build/run instructions. Each benchmark is built and tested independently via --build-arg BENCHMARK=<name> on Dockerfile.benchmark. Co-Authored-By: Claude <noreply@anthropic.com>	2026-03-20 22:21:28 -07:00
Pepijn	39cf11d5dc	feat(docker): add per-benchmark evaluation containers Add Dockerfile.benchmark (parameterized via ARG BENCHMARK), a docker-compose.benchmark.yml with services for libero, libero_plus, robomme, and robocasa, and a smoke_test_benchmark.sh that verifies imports and CLI entry-points in each container. Also add the missing `robocasa` optional dep group to pyproject.toml (the docs already referenced `pip install ".[robocasa]"` but the group was not defined). Build a specific benchmark image: docker build --build-arg BENCHMARK=robomme \ -f docker/Dockerfile.benchmark -t lerobot-benchmark-robomme . Build all via compose: docker compose -f docker/docker-compose.benchmark.yml build Smoke-test inside a container: docker compose -f docker/docker-compose.benchmark.yml run --rm robomme \ bash docker/smoke_test_benchmark.sh Co-Authored-By: Claude <noreply@anthropic.com>	2026-03-20 22:21:28 -07:00
Pepijn Kooijmans	285c500aef	speed up benchmark eval scheduling and docker workflow	2026-03-21 06:09:01 +01:00
Pepijn Kooijmans	f60d163588	refactor(leaderboard): use a single positional file arg for repo IDs Replace --repo-ids and --repo-ids-file with a single positional repo_ids_file argument. Lines starting with # are ignored. Made-with: Cursor	2026-03-16 02:58:28 +01:00
Pepijn Kooijmans	4a04465bb8	feat: add lerobot-leaderboard to generate interactive eval comparison pages New CLI tool that fetches eval results from multiple Hub model repos and produces a self-contained HTML leaderboard with sortable columns, per-suite breakdowns, best-in-column highlighting, and filtering. Made-with: Cursor	2026-03-16 02:57:37 +01:00
Pepijn Kooijmans	464532ec37	feat(eval): include eval config (policy, env, eval settings) in Hub push Saves eval_config.json locally and uploads it alongside results. The model card now includes a collapsible "Eval configuration" section showing the full config JSON used for the evaluation run. Made-with: Cursor	2026-03-16 02:42:38 +01:00
Pepijn Kooijmans	89f9bd78ab	feat(eval): add --push_to_hub to upload eval results, videos, and model card to Hub Adds a push_to_hub flag to lerobot-eval that uploads eval_info.json, rollout videos, and appends an evaluation results table to the model card on Hugging Face. Also declares missing LIBERO-plus runtime deps in pyproject.toml and adds an asset validation check for libero_plus. Made-with: Cursor	2026-03-16 02:39:24 +01:00
pepijn	c9cfc88602	feat: add benchmark orchestration, LIBERO-plus install parity, and eval hardening - Add lerobot-benchmark CLI for multi-benchmark train/eval workflows - Add benchmark_training.mdx documentation - Add libero-plus pip extra alias with EGL probe deps matching standard libero - Harden libero.py: wand mock, init-state fallback, renderer EGL→OSMesa fallback - Add multimodal_analysis.py script and SLURM training template Made-with: Cursor	2026-03-15 05:52:53 +00:00
pepijn	7bef12a461	feat(envs): add RoboMME memory-augmented manipulation benchmark - RoboMMEEnv config with 16 tasks across 4 suites (Counting, Permanence, Reference, Imitation) - Gymnasium wrapper around BenchmarkEnvBuilder (robomme.py) - Environment factory wiring for env_type="robomme" - robomme optional dependency in pyproject.toml Made-with: Cursor	2026-03-13 04:44:32 +00:00