fix(robotwin): pin compatible curobo in benchmark image

Merge remote-tracking branch 'origin/feat/robotwin-benchmark' into feat/robotwin-benchmark
Merge branch 'main' into feat/robotwin-benchmark
2026-06-02 11:51:25 +00:00 · 2026-04-21 18:41:16 +02:00 · 2026-04-20 17:31:28 +02:00 · 2026-04-20 17:17:00 +02:00 · 2026-04-20 15:28:45 +02:00 · 2026-04-20 15:18:41 +02:00
45 changed files with 512 additions and 4985 deletions
--- a/.github/workflows/benchmark_tests.yml
+++ b/.github/workflows/benchmark_tests.yml
@@ -525,421 +525,3 @@ jobs:
          name: robocasa-metrics
          path: /tmp/robocasa-artifacts/metrics.json
          if-no-files-found: warn
-
-  # ── ROBOCEREBRA ───────────────────────────────────────────────────────────
-  # Reuses the LIBERO simulator (libero_10 suite) with RoboCerebra camera
-  # defaults (image/wrist_image). The image is layered on
-  # huggingface/lerobot-gpu, which already ships [libero] as part of [all].
-  robocerebra-integration-test:
-    name: RoboCerebra — build image + 1-episode eval
-    runs-on:
-      group: aws-g6-4xlarge-plus
-    env:
-      HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}
-
-    steps:
-      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd  # v6.0.2
-        with:
-          persist-credentials: false
-          lfs: true
-
-      - name: Set up Docker Buildx
-        uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
-        with:
-          cache-binary: false
-
-      - name: Login to Docker Hub
-        if: ${{ env.DOCKERHUB_USERNAME != '' }}
-        uses: docker/login-action@v3 # zizmor: ignore[unpinned-uses]
-        with:
-          username: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
-          password: ${{ secrets.DOCKERHUB_LEROBOT_PASSWORD }}
-        env:
-          DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
-
-      - name: Build RoboCerebra benchmark image
-        uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
-        with:
-          context: .
-          file: docker/Dockerfile.benchmark.robocerebra
-          push: false
-          load: true
-          tags: lerobot-benchmark-robocerebra:ci
-          cache-from: type=local,src=/tmp/.buildx-cache-robocerebra
-          cache-to: type=local,dest=/tmp/.buildx-cache-robocerebra,mode=max
-
-      - name: Run RoboCerebra smoke eval (1 episode)
-        if: env.HF_USER_TOKEN != ''
-        run: |
-          docker run --name robocerebra-eval --gpus all \
-            --shm-size=4g \
-            -e HF_HOME=/tmp/hf \
-            -e HF_USER_TOKEN="${HF_USER_TOKEN}" \
-            -e HF_HUB_DOWNLOAD_TIMEOUT=300 \
-            -e LIBERO_DATA_FOLDER=/tmp/libero_data \
-            lerobot-benchmark-robocerebra:ci \
-            bash -c "
-              hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
-              lerobot-eval \
-                --policy.path=lerobot/smolvla_robocerebra \
-                --env.type=libero \
-                --env.task=libero_10 \
-                --env.fps=20 \
-                --env.obs_type=pixels_agent_pos \
-                --env.observation_height=256 \
-                --env.observation_width=256 \
-                '--env.camera_name_mapping={\"agentview_image\": \"image\", \"robot0_eye_in_hand_image\": \"wrist_image\"}' \
-                --eval.batch_size=1 \
-                --eval.n_episodes=1 \
-                --eval.use_async_envs=false \
-                --policy.device=cuda \
-                '--rename_map={\"observation.images.image\": \"observation.images.camera1\", \"observation.images.wrist_image\": \"observation.images.camera2\"}' \
-                --policy.empty_cameras=1 \
-                --output_dir=/tmp/eval-artifacts
-              python scripts/ci/extract_task_descriptions.py \
-                --env libero --task libero_10 \
-                --output /tmp/eval-artifacts/task_descriptions.json
-            "
-
-      - name: Copy RoboCerebra artifacts from container
-        if: always()
-        run: |
-          mkdir -p /tmp/robocerebra-artifacts
-          docker cp robocerebra-eval:/tmp/eval-artifacts/. /tmp/robocerebra-artifacts/ 2>/dev/null || true
-          docker rm -f robocerebra-eval || true
-
-      - name: Parse RoboCerebra eval metrics
-        if: always()
-        run: |
-          python3 scripts/ci/parse_eval_metrics.py \
-            --artifacts-dir /tmp/robocerebra-artifacts \
-            --env robocerebra \
-            --task libero_10 \
-            --policy lerobot/smolvla_robocerebra
-
-      - name: Upload RoboCerebra rollout video
-        if: always()
-        uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
-        with:
-          name: robocerebra-rollout-video
-          path: /tmp/robocerebra-artifacts/videos/
-          if-no-files-found: warn
-
-      - name: Upload RoboCerebra eval metrics
-        if: always()
-        uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
-        with:
-          name: robocerebra-metrics
-          path: /tmp/robocerebra-artifacts/metrics.json
-          if-no-files-found: warn
-
-  # ── ROBOMME ───────────────────────────────────────────────────────────────
-  # Isolated image: mani-skill/SAPIEN/Vulkan chain with gymnasium and numpy
-  # overrides (robomme can't be a pyproject extra due to numpy<2 pin).
-  robomme-integration-test:
-    name: RoboMME — build image + 1-episode eval
-    runs-on:
-      group: aws-g6-4xlarge-plus
-    env:
-      HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}
-      ROBOMME_POLICY: lerobot/smolvla_robomme
-      ROBOMME_TASKS: PickXtimes,BinFill,StopCube,MoveCube,InsertPeg,SwingXtimes,VideoUnmask,ButtonUnmask,PickHighlight,PatternLock
-
-    steps:
-      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
-        with:
-          persist-credentials: false
-          lfs: true
-
-      - name: Set up Docker Buildx
-        uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
-        with:
-          cache-binary: false
-
-      - name: Login to Docker Hub
-        if: ${{ env.DOCKERHUB_USERNAME != '' }}
-        uses: docker/login-action@v3 # zizmor: ignore[unpinned-uses]
-        with:
-          username: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
-          password: ${{ secrets.DOCKERHUB_LEROBOT_PASSWORD }}
-        env:
-          DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
-
-      - name: Build RoboMME benchmark image
-        uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
-        with:
-          context: .
-          file: docker/Dockerfile.benchmark.robomme
-          push: false
-          load: true
-          tags: lerobot-benchmark-robomme:ci
-
-      - name: Run RoboMME smoke eval (10 tasks, 1 episode each)
-        if: env.HF_USER_TOKEN != ''
-        run: |
-          docker run --name robomme-eval --gpus all \
-            --shm-size=4g \
-            -e HF_HOME=/tmp/hf \
-            -e HF_USER_TOKEN="${HF_USER_TOKEN}" \
-            -e HF_HUB_DOWNLOAD_TIMEOUT=300 \
-            -e ROBOMME_POLICY="${ROBOMME_POLICY}" \
-            -e ROBOMME_TASKS="${ROBOMME_TASKS}" \
-            lerobot-benchmark-robomme:ci \
-            bash -c "
-              hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
-              lerobot-eval \
-                --policy.path=\"\$ROBOMME_POLICY\" \
-                --env.type=robomme \
-                --env.task=\"\$ROBOMME_TASKS\" \
-                --env.dataset_split=test \
-                --env.task_ids=[0] \
-                --eval.batch_size=1 \
-                --eval.n_episodes=1 \
-                --eval.use_async_envs=false \
-                --policy.device=cuda \
-                '--rename_map={\"observation.images.image\": \"observation.images.camera1\", \"observation.images.wrist_image\": \"observation.images.camera2\"}' \
-                --policy.empty_cameras=3 \
-                --output_dir=/tmp/eval-artifacts
-              python scripts/ci/extract_task_descriptions.py \
-                --env robomme --task \"\$ROBOMME_TASKS\" \
-                --output /tmp/eval-artifacts/task_descriptions.json
-            "
-
-      - name: Copy RoboMME artifacts from container
-        if: always()
-        run: |
-          mkdir -p /tmp/robomme-artifacts
-          docker cp robomme-eval:/tmp/eval-artifacts/. /tmp/robomme-artifacts/ 2>/dev/null || true
-          docker rm -f robomme-eval || true
-
-      - name: Parse RoboMME eval metrics
-        if: always()
-        run: |
-          python3 scripts/ci/parse_eval_metrics.py \
-            --artifacts-dir /tmp/robomme-artifacts \
-            --env robomme \
-            --task "${ROBOMME_TASKS}" \
-            --policy "${ROBOMME_POLICY}"
-
-      - name: Upload RoboMME rollout video
-        if: always()
-        uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
-        with:
-          name: robomme-rollout-video
-          path: /tmp/robomme-artifacts/videos/
-          if-no-files-found: warn
-
-      - name: Upload RoboMME eval metrics
-        if: always()
-        uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
-        with:
-          name: robomme-metrics
-          path: /tmp/robomme-artifacts/metrics.json
-          if-no-files-found: warn
-
-  # ── LIBERO-plus ───────────────────────────────────────────────────────────
-  # Isolated image: LIBERO-plus fork cloned into /home/user_lerobot on top of
-  # huggingface/lerobot-gpu (see docker/Dockerfile.benchmark.libero_plus).
-  libero-plus-integration-test:
-    name: LIBERO-plus — build image + 1-episode eval
-    runs-on:
-      group: aws-g6-4xlarge-plus
-    env:
-      HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}
-      LIBERO_PLUS_SUITE: libero_spatial
-      LIBERO_PLUS_POLICY: lerobot/smolvla_libero_plus
-      LIBERO_PLUS_TASK_IDS: "[0,100,260,500,1000,1500,2000,2400]"
-
-    steps:
-      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd  # v6.0.2
-        with:
-          persist-credentials: false
-          lfs: true
-
-      - name: Set up Docker Buildx
-        uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
-        with:
-          cache-binary: false
-
-      - name: Login to Docker Hub
-        if: ${{ env.DOCKERHUB_USERNAME != '' }}
-        uses: docker/login-action@v3 # zizmor: ignore[unpinned-uses]
-        with:
-          username: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
-          password: ${{ secrets.DOCKERHUB_LEROBOT_PASSWORD }}
-        env:
-          DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
-
-      - name: Build LIBERO-plus benchmark image
-        uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
-        with:
-          context: .
-          file: docker/Dockerfile.benchmark.libero_plus
-          push: false
-          load: true
-          tags: lerobot-benchmark-libero-plus:ci
-          cache-from: type=local,src=/tmp/.buildx-cache-libero-plus
-          cache-to: type=local,dest=/tmp/.buildx-cache-libero-plus,mode=max
-
-      - name: Run LIBERO-plus smoke eval (1 episode)
-        if: env.HF_USER_TOKEN != ''
-        run: |
-          docker run --name libero-plus-eval --gpus all \
-            --shm-size=4g \
-            -e HF_HOME=/tmp/hf \
-            -e HF_USER_TOKEN="${HF_USER_TOKEN}" \
-            -e HF_HUB_DOWNLOAD_TIMEOUT=300 \
-            -e LIBERO_PLUS_SUITE="${LIBERO_PLUS_SUITE}" \
-            -e LIBERO_PLUS_POLICY="${LIBERO_PLUS_POLICY}" \
-            -e LIBERO_PLUS_TASK_IDS="${LIBERO_PLUS_TASK_IDS}" \
-            lerobot-benchmark-libero-plus:ci \
-            bash -c "
-              hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
-              lerobot-eval \
-                --policy.path=\"\$LIBERO_PLUS_POLICY\" \
-                --env.type=libero_plus \
-                --env.task=\"\$LIBERO_PLUS_SUITE\" \
-                --env.task_ids=\"\$LIBERO_PLUS_TASK_IDS\" \
-                --eval.batch_size=1 \
-                --eval.n_episodes=1 \
-                --eval.use_async_envs=false \
-                --policy.device=cuda \
-                '--env.camera_name_mapping={\"agentview_image\": \"camera1\", \"robot0_eye_in_hand_image\": \"camera2\"}' \
-                --policy.empty_cameras=1 \
-                --output_dir=/tmp/eval-artifacts
-              python scripts/ci/extract_task_descriptions.py \
-                --env libero_plus --task \"\$LIBERO_PLUS_SUITE\" \
-                --output /tmp/eval-artifacts/task_descriptions.json
-            "
-
-      - name: Copy LIBERO-plus artifacts from container
-        if: always()
-        run: |
-          mkdir -p /tmp/libero-plus-artifacts
-          docker cp libero-plus-eval:/tmp/eval-artifacts/. /tmp/libero-plus-artifacts/ 2>/dev/null || true
-          docker rm -f libero-plus-eval || true
-
-      - name: Parse LIBERO-plus eval metrics
-        if: always()
-        run: |
-          python3 scripts/ci/parse_eval_metrics.py \
-            --artifacts-dir /tmp/libero-plus-artifacts \
-            --env libero_plus \
-            --task "${LIBERO_PLUS_SUITE}" \
-            --policy "${LIBERO_PLUS_POLICY}"
-
-      - name: Upload LIBERO-plus rollout video
-        if: always()
-        uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
-        with:
-          name: libero-plus-rollout-video
-          path: /tmp/libero-plus-artifacts/videos/
-          if-no-files-found: warn
-
-      - name: Upload LIBERO-plus eval metrics
-        if: always()
-        uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
-        with:
-          name: libero-plus-metrics
-          path: /tmp/libero-plus-artifacts/metrics.json
-          if-no-files-found: warn
-
-  # ── VLABENCH ─────────────────────────────────────────────────────────────
-  # Isolated image: lerobot[vlabench] only (VLABench, mujoco==3.2.2, dm-control chain)
-  vlabench-integration-test:
-    name: VLABench — build image + 1-episode eval
-    runs-on:
-      group: aws-g6-4xlarge-plus
-    env:
-      HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}
-
-    steps:
-      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd  # v6.0.2
-        with:
-          persist-credentials: false
-          lfs: true
-
-      - name: Set up Docker Buildx
-        uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
-        with:
-          cache-binary: false
-
-      - name: Login to Docker Hub
-        if: ${{ env.DOCKERHUB_USERNAME != '' }}
-        uses: docker/login-action@v3 # zizmor: ignore[unpinned-uses]
-        with:
-          username: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
-          password: ${{ secrets.DOCKERHUB_LEROBOT_PASSWORD }}
-        env:
-          DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
-
-      - name: Build VLABench benchmark image
-        uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
-        with:
-          context: .
-          file: docker/Dockerfile.benchmark.vlabench
-          push: false
-          load: true
-          tags: lerobot-benchmark-vlabench:ci
-          build-args: |
-            VLABENCH_ASSETS_REPO=lerobot/vlabench-assets
-
-      - name: Run VLABench smoke eval (10 tasks, 1 episode each)
-        if: env.HF_USER_TOKEN != ''
-        run: |
-          docker run --name vlabench-eval --gpus all \
-            --shm-size=4g \
-            -e HF_HOME=/tmp/hf \
-            -e HF_USER_TOKEN="${HF_USER_TOKEN}" \
-            -e HF_HUB_DOWNLOAD_TIMEOUT=300 \
-            -e MUJOCO_GL=egl \
-            lerobot-benchmark-vlabench:ci \
-            bash -c "
-              hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
-              lerobot-eval \
-                --policy.path=lerobot/smolvla_vlabench \
-                --env.type=vlabench \
-                --env.task=select_fruit,select_toy,select_book,select_painting,select_drink,select_ingredient,select_billiards,select_poker,add_condiment,insert_flower \
-                --eval.batch_size=1 \
-                --eval.n_episodes=1 \
-                --eval.use_async_envs=false \
-                --policy.device=cuda \
-                '--rename_map={\"observation.images.image\": \"observation.images.camera1\", \"observation.images.second_image\": \"observation.images.camera2\", \"observation.images.wrist_image\": \"observation.images.camera3\"}' \
-                --output_dir=/tmp/eval-artifacts
-              python scripts/ci/extract_task_descriptions.py \
-                --env vlabench \
-                --task select_fruit,select_toy,select_book,select_painting,select_drink,select_ingredient,select_billiards,select_poker,add_condiment,insert_flower \
-                --output /tmp/eval-artifacts/task_descriptions.json
-            "
-
-      - name: Copy VLABench artifacts from container
-        if: always()
-        run: |
-          mkdir -p /tmp/vlabench-artifacts
-          docker cp vlabench-eval:/tmp/eval-artifacts/. /tmp/vlabench-artifacts/ 2>/dev/null || true
-          docker rm -f vlabench-eval || true
-
-      - name: Parse VLABench eval metrics
-        if: always()
-        run: |
-          python3 scripts/ci/parse_eval_metrics.py \
-            --artifacts-dir /tmp/vlabench-artifacts \
-            --env vlabench \
-            --task select_fruit,select_toy,select_book,select_painting,select_drink,select_ingredient,select_billiards,select_poker,add_condiment,insert_flower \
-            --policy lerobot/smolvla_vlabench
-
-      - name: Upload VLABench rollout video
-        if: always()
-        uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
-        with:
-          name: vlabench-rollout-video
-          path: /tmp/vlabench-artifacts/videos/
-          if-no-files-found: warn
-
-      - name: Upload VLABench eval metrics
-        if: always()
-        uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
-        with:
-          name: vlabench-metrics
-          path: /tmp/vlabench-artifacts/metrics.json
-          if-no-files-found: warn
--- a/docker/Dockerfile.benchmark.libero_plus
+++ b/docker/Dockerfile.benchmark.libero_plus
@@ -1,84 +0,0 @@
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-# Benchmark image for LIBERO-plus integration tests.
-# Extends the nightly GPU image (which has lerobot[all]) with the LIBERO-plus
-# fork source + its 6.4 GB perturbation assets.
-#
-# Build:  docker build -f docker/Dockerfile.benchmark.libero_plus -t lerobot-benchmark-libero-plus .
-# Run:    docker run --gpus all --rm lerobot-benchmark-libero-plus lerobot-eval ...
-
-FROM huggingface/lerobot-gpu:latest
-ENV MUJOCO_GL=egl
-
-# unzip for the 6.4 GB assets.zip; the rest are LIBERO-plus build-time extras
-# (wand / ImageMagick / fontconfig) not in the nightly base.
-USER root
-RUN apt-get update \
-    && apt-get install -y --no-install-recommends \
-         unzip libexpat1 libfontconfig1-dev libmagickwand-dev \
-    && apt-get clean && rm -rf /var/lib/apt/lists/*
-USER user_lerobot
-
-# robosuite==1.4.1 is mandatory (the fork uses `single_arm_env` removed in
-# v1.5+). The rest are LIBERO-plus runtime deps pulled from its setup.py.
-# We install these explicitly instead of via the [libero_plus] extra because
-# the extra's `libero @ git+...` dep installs as a namespace package and then
-# clone and PYTHONPATH-override it below.
-RUN uv pip install --no-cache \
-        "robosuite==1.4.1" \
-        "bddl==1.0.1" \
-        "easydict==1.13" \
-        "mujoco==3.7.0" \
-        "matplotlib==3.10.8" \
-        "Wand==0.6.13" \
-        "scikit-image==0.25.2" \
-        "gym==0.26.2"
-
-# Clone LIBERO-plus and make it importable as `libero`. The nightly base has
-# hf-libero (10 tasks) preinstalled via lerobot[libero]; uninstall it so
-# Python resolves `import libero` to the 2402-task LIBERO-plus module instead.
-# Pinned to the current upstream main SHA so benchmark builds stay reproducible.
-ARG LIBERO_PLUS_SHA=4976dc3
-ENV LIBERO_PLUS_ROOT=/home/user_lerobot/libero-plus/libero/libero
-RUN git clone https://github.com/sylvestf/LIBERO-plus.git /home/user_lerobot/libero-plus \
-    && git -C /home/user_lerobot/libero-plus checkout ${LIBERO_PLUS_SHA} \
-    && cd /home/user_lerobot/libero-plus && uv pip install --no-cache --no-deps -e "." \
-    && (uv pip uninstall hf-libero 2>/dev/null || true)
-ENV PYTHONPATH="/home/user_lerobot/libero-plus:${PYTHONPATH}"
-
-# Perturbation textures/scenes: bddl_base_domain.py resolves XMLs via
-# DIR_PATH/../assets (package-relative, ignoring ~/.libero/config.yaml). All
-# 2402 tasks reference files that ship only in Sylvest/LIBERO-plus's
-# assets.zip (6.4 GB) under a deep author-internal prefix — extract and
-# flatten it under ${LIBERO_PLUS_ROOT}/assets.
-RUN python -c "\
-from huggingface_hub import hf_hub_download; \
-hf_hub_download(repo_id='Sylvest/LIBERO-plus', repo_type='dataset', \
-                filename='assets.zip', local_dir='/tmp/libero-plus-dl')" \
-    && unzip -q /tmp/libero-plus-dl/assets.zip -d /tmp/libero-plus-dl/extract \
-    && ASSETS_DIR=$(find /tmp/libero-plus-dl/extract -type d -name assets | head -1) \
-    && mv "${ASSETS_DIR}" ${LIBERO_PLUS_ROOT}/assets \
-    && rm -rf /tmp/libero-plus-dl
-
-# Point ~/.libero/config.yaml at the clone so LIBERO-plus's imports are
-# non-interactive (it calls input() when the config is missing).
-RUN mkdir -p /home/user_lerobot/.libero \
-    && printf "assets: ${LIBERO_PLUS_ROOT}/assets\nbddl_files: ${LIBERO_PLUS_ROOT}/bddl_files\ndatasets: ${LIBERO_PLUS_ROOT}/../datasets\ninit_states: ${LIBERO_PLUS_ROOT}/init_files\n" \
-       > /home/user_lerobot/.libero/config.yaml
-
-# Overlay the PR's source code on top of the nightly image.
-COPY --chown=user_lerobot:user_lerobot . .
-
-CMD ["/bin/bash"]
--- a/docker/Dockerfile.benchmark.robocerebra
+++ b/docker/Dockerfile.benchmark.robocerebra
@@ -1,43 +0,0 @@
-# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-# Benchmark image for RoboCerebra integration tests.
-# RoboCerebra reuses LIBERO's simulator (libero_10 suite) with a different
-# rename_map, so this image is identical to the LIBERO benchmark image —
-# extends the nightly GPU base with LIBERO assets + the PR's source code.
-#
-# Build:  docker build -f docker/Dockerfile.benchmark.robocerebra -t lerobot-benchmark-robocerebra .
-# Run:    docker run --gpus all --rm lerobot-benchmark-robocerebra lerobot-eval ...
-
-FROM huggingface/lerobot-gpu:latest
-
-# Pre-download lerobot/libero-assets from HF Hub so nothing is fetched at
-# runtime (which times out on CI). Point the libero config at the cached path.
-# libero/libero/__init__.py calls input() when ~/.libero/config.yaml is missing,
-# so we write the config before any libero import can happen.
-RUN LIBERO_DIR=$(python -c \
-      "import importlib.util, os; s=importlib.util.find_spec('libero'); \
-       print(os.path.join(os.path.dirname(s.origin), 'libero'))") && \
-    mkdir -p /home/user_lerobot/.libero && \
-    python -c "\
-from huggingface_hub import snapshot_download; \
-snapshot_download(repo_id='lerobot/libero-assets', repo_type='dataset', \
-                  local_dir='/home/user_lerobot/.libero/assets')" && \
-    printf "assets: /home/user_lerobot/.libero/assets\nbddl_files: ${LIBERO_DIR}/bddl_files\ndatasets: ${LIBERO_DIR}/../datasets\ninit_states: ${LIBERO_DIR}/init_files\n" \
-    > /home/user_lerobot/.libero/config.yaml
-
-# Overlay the PR's source code on top of the nightly image.
-COPY --chown=user_lerobot:user_lerobot . .
-
-CMD ["/bin/bash"]
--- a/docker/Dockerfile.benchmark.robomme
+++ b/docker/Dockerfile.benchmark.robomme
@@ -1,56 +0,0 @@
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-# Benchmark image for RoboMME integration tests.
-# Extends the nightly GPU image (which has lerobot[all]) with Vulkan system
-# libs for ManiSkill/SAPIEN and the robomme extra. robomme isn't in [all]
-# because mani-skill hard-pins gymnasium==0.29.1 and numpy<2.0.0 which
-# conflict with lerobot's defaults; both are safe at runtime:
-#   - gymnasium 0.29.x has the same 5-tuple step() API as 1.x (since 0.26)
-#   - numpy 1.26.4 is API-compatible with lerobot's actual usage.
-#
-# Build:  docker build -f docker/Dockerfile.benchmark.robomme -t lerobot-benchmark-robomme .
-# Run:    docker run --gpus all --rm lerobot-benchmark-robomme lerobot-eval ...
-
-FROM huggingface/lerobot-gpu:latest
-
-# NVIDIA Container Toolkit: expose Vulkan driver capability for headless rendering.
-ENV NVIDIA_DRIVER_CAPABILITIES=all \
-    VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/nvidia_icd.json
-
-# ManiSkill/SAPIEN's renderer needs Vulkan, which isn't in the base image.
-USER root
-RUN apt-get update \
-    && apt-get install -y --no-install-recommends \
-         libvulkan1 libvulkan-dev mesa-vulkan-drivers \
-    && mkdir -p /usr/share/vulkan/icd.d \
-    && echo '{"file_format_version":"1.0.0","ICD":{"library_path":"libGLX_nvidia.so.0","api_version":"1.3.0"}}' \
-       > /usr/share/vulkan/icd.d/nvidia_icd.json \
-    && apt-get clean && rm -rf /var/lib/apt/lists/*
-USER user_lerobot
-
-# Install smolvla + av-dep via the PR's pyproject, then layer robomme on top
-# with gymnasium/numpy overrides. robomme isn't a pyproject extra because its
-# mani-skill pin conflicts with lerobot's base numpy>=2 (see pyproject.toml).
-COPY --chown=user_lerobot:user_lerobot setup.py pyproject.toml uv.lock README.md MANIFEST.in ./
-RUN printf 'gymnasium==0.29.1\nnumpy==1.26.4\n' > /tmp/robomme_override.txt \
-    && uv pip install --no-cache --override /tmp/robomme_override.txt \
-         -e ".[smolvla,av-dep]" \
-         "robomme @ git+https://github.com/RoboMME/robomme_benchmark.git@main" \
-    && python -c "import robomme; print('robomme import OK')"
-
-# Overlay the PR's source code on top of the nightly image.
-COPY --chown=user_lerobot:user_lerobot . .
-
-CMD ["/bin/bash"]
--- a/docker/Dockerfile.benchmark.robotwin
+++ b/docker/Dockerfile.benchmark.robotwin
@@ -111,22 +111,15 @@ EOF
 WORKDIR ${ROBOTWIN_ROOT}
 RUN python script/update_embodiment_config_path.py

-ENV PYTHONPATH="${ROBOTWIN_ROOT}"
+ENV PYTHONPATH="${ROBOTWIN_ROOT}:${PYTHONPATH}"

-# Fail the image build early if the CuRobo package layout regresses. Importing
-# RoboTwin's planner here is too eager because CuRobo constructs CUDA-backed
-# defaults at import time, while Docker builds don't have access to an NVIDIA
-# driver.
+# Fail the image build early if the CuRobo/RoboTwin import chain regresses.
 RUN python - <<'EOF'
-from pathlib import Path
-
 from curobo.types.math import Pose
-
-planner_src = (Path("/opt/robotwin/envs/robot/planner.py")).read_text()
-assert "from curobo.types.math import Pose as CuroboPose" in planner_src
+from envs.robot.planner import CuroboPlanner

 print("CuRobo import OK:", Pose.__name__)
-print("RoboTwin planner import references curobo.types.math")
+print("RoboTwin planner import OK:", CuroboPlanner.__name__)
 EOF

 # Return to the lerobot source directory (set by base image) before overlaying.
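
Review note on the hunk above: it replaces a text-level check of `planner.py` with a real import of `CuroboPlanner`. The removed comment explained that the real import was avoided because CuRobo constructs CUDA-backed defaults at import time and Docker builds have no NVIDIA driver. As a rough illustration of that trade-off (a sketch, not the project's code), the snippet below contrasts the two styles. The helper names `source_imports_name` and `import_attr` are hypothetical, and standard-library modules stand in for CuRobo and RoboTwin so the example runs anywhere.

```python
import ast
import importlib


def source_imports_name(source: str, module: str, name: str) -> bool:
    """Return True if `source` contains `from <module> import <name>` (any alias).

    Static check: the text is parsed, but nothing is imported or executed,
    so it cannot trigger import-time side effects.
    """
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.ImportFrom)
            and node.level == 0  # absolute imports only
            and node.module == module
            and any(alias.name == name for alias in node.names)
        ):
            return True
    return False


def import_attr(module: str, name: str):
    """Import `module` for real and return its attribute `name`.

    Dynamic check: raises ImportError or AttributeError if the chain is
    broken, but also runs whatever the module does at import time.
    """
    return getattr(importlib.import_module(module), name)


if __name__ == "__main__":
    # Stand-in snippet; a real check would read the planner file from disk.
    demo = "from curobo.types.math import Pose as CuroboPose\n"
    print(source_imports_name(demo, "curobo.types.math", "Pose"))

    # Standard-library stand-in for the CuRobo/RoboTwin modules (assumption).
    print(import_attr("pathlib", "Path").__name__)
```

The static form only proves the import statement is present in the source; the dynamic form proves the whole chain resolves, at the cost of needing whatever the imported modules require at import time.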
--- a/docker/Dockerfile.benchmark.vlabench
+++ b/docker/Dockerfile.benchmark.vlabench
@@ -1,99 +0,0 @@
-# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-# Benchmark image for VLABench integration tests.
-# Extends the nightly GPU image with the PR's source code and VLABench setup.
-#
-# Build:  docker build -f docker/Dockerfile.benchmark.vlabench -t lerobot-benchmark-vlabench .
-# Run:    docker run --gpus all --rm lerobot-benchmark-vlabench lerobot-eval ...
-
-FROM huggingface/lerobot-gpu:latest
-
-# Install VLABench from GitHub (not on PyPI) and pin MuJoCo/dm-control.
-# Shallow-clone without submodule recursion (nested SSH-only submodules fail in CI).
-# Editable install (-e) because VLABench/utils/ has no __init__.py, so
-# find_packages() omits it from wheels; editable mode uses the source tree directly.
-# rrt-algorithms has the same packaging issue (rrt/ dir missing __init__.py).
-# Patch: constant.py calls os.listdir on ~100 asset/obj/meshes/* dirs at import
-# time. Guard the call so missing dirs return [] instead of crashing (in case
-# the asset download is partial).
-#
-# Pinned upstream SHAs for reproducible benchmark runs. Bump when you need
-# an upstream fix; don't rely on `main`/`develop` drift.
-ARG VLABENCH_SHA=cf588fe60c0c7282174fe979f5913170cfe69017
-ARG RRT_ALGORITHMS_SHA=e51d95ee489a225220d6ae2a764c4111f6ba7d85
-RUN git clone https://github.com/OpenMOSS/VLABench.git ~/VLABench && \
-    git -C ~/VLABench checkout ${VLABENCH_SHA} && \
-    git clone https://github.com/motion-planning/rrt-algorithms.git ~/rrt-algorithms && \
-    git -C ~/rrt-algorithms checkout ${RRT_ALGORITHMS_SHA} && \
-    python3 -c "\
-import pathlib; \
-p = pathlib.Path.home() / 'VLABench/VLABench/configs/constant.py'; \
-t = p.read_text(); \
-p.write_text(t.replace( \
-    'subdirs = os.listdir(xml_dir)', \
-    'if not os.path.isdir(xml_dir): return []\n    subdirs = os.listdir(xml_dir)'))" && \
-    uv pip install --no-cache -e ~/VLABench -e ~/rrt-algorithms \
-      mujoco==3.2.2 dm-control==1.0.22 \
-      open3d colorlog scikit-learn openai gdown
-
-# Download VLABench mesh assets. Task configs reference object meshes
-# (obj/meshes/fruit/, containers/basket/, tablewares/plates/, etc.); without
-# them the task builder picks from an empty mesh list and crashes with
-# IndexError at task-build time (random.choice([]) in config_manager.py).
-#
-# Preferred source: an HF Hub mirror. Set VLABENCH_ASSETS_REPO at build time
-# (e.g. --build-arg VLABENCH_ASSETS_REPO=lerobot/vlabench-assets) and we'll
-# snapshot_download the repo into VLABench's assets dir. This is the reliable
-# path for CI — Google Drive frequently returns HTTP 429 ("Too many users have
-# viewed or downloaded this file recently") on shared academic files.
-#
-# After download we *validate* that at least one XML exists under each
-# task-critical subtree and fail the build loudly if not. Silent-empty asset
-# dirs are the #1 cause of VLABench runtime crashes in CI, so we surface them
-# here rather than after a 10-minute eval build.
-#
-# Fallback: VLABench's own gdown-based script. Best-effort only.
-ARG VLABENCH_ASSETS_REPO=""
-RUN ASSETS_DIR="$HOME/VLABench/VLABench/assets" && \
-    if [ -n "${VLABENCH_ASSETS_REPO}" ]; then \
-        echo "Downloading VLABench assets from HF Hub: ${VLABENCH_ASSETS_REPO}" && \
-        uv pip install --no-cache "huggingface_hub[hf_xet]>=0.26" && \
-        python -c "from huggingface_hub import snapshot_download; \
-p = snapshot_download(repo_id='${VLABENCH_ASSETS_REPO}', repo_type='dataset', \
-    local_dir='${ASSETS_DIR}', allow_patterns=['obj/**', 'scenes/**']); \
-print('snapshot_download returned:', p)"; \
-    else \
-        echo "No VLABENCH_ASSETS_REPO set — falling back to gdown" && \
-        python ~/VLABench/scripts/download_assets.py --choice all; \
-    fi && \
-    python -c "\
-from pathlib import Path; \
-import sys; \
-root = Path('${ASSETS_DIR}'); \
-checks = ['obj/meshes/tablewares/plates', 'obj/meshes/containers/basket', 'obj/meshes/fruit', 'obj/meshes/containers/tray']; \
-failed = []; \
-print(f'Validating VLABench assets under {root}'); \
-[print(f'  {c}: {len(list((root/c).rglob(\"*.xml\")))} XMLs') for c in checks]; \
-[failed.append(c) for c in checks if not any((root/c).rglob('*.xml'))]; \
-sys.exit(f'Empty asset dirs (no *.xml): {failed}') if failed else print('All asset dirs populated.')"
-
-# Overlay the PR's source code on top of the nightly image.
-COPY --chown=user_lerobot:user_lerobot . .
-
-# Re-install lerobot editably so the new source (with VLABenchEnv registration
-# and updated obs handling) replaces the stale package baked into the nightly image.
-RUN uv pip install --no-cache --no-deps -e .
-
-CMD ["/bin/bash"]
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -31,10 +31,8 @@
    title: Porting Large Datasets
  - local: using_dataset_tools
    title: Using the Dataset Tools
-  - local: language_and_recipes
-    title: Language Columns and Recipes
-  - local: tools
-    title: Tools
+  - local: dataset_subtask
+    title: Using Subtasks in the Dataset
  - local: streaming_video_encoding
    title: Streaming Video Encoding
  title: "Datasets"
@@ -79,22 +77,14 @@
    title: Adding a New Benchmark
  - local: libero
    title: LIBERO
-  - local: libero_plus
-    title: LIBERO-plus
  - local: metaworld
    title: Meta-World
  - local: robotwin
    title: RoboTwin 2.0
  - local: robocasa
    title: RoboCasa365
-  - local: robocerebra
-    title: RoboCerebra
-  - local: robomme
-    title: RoboMME
  - local: envhub_isaaclab_arena
    title: NVIDIA IsaacLab Arena Environments
-  - local: vlabench
-    title: VLABench
  title: "Benchmarks"
 - sections:
  - local: introduction_processors
--- a/docs/source/dataset_subtask.mdx
+++ b/docs/source/dataset_subtask.mdx
@@ -0,0 +1,277 @@
+# Using Subtasks in LeRobot Datasets
+
+Subtask annotations in robotics datasets can improve a robot's reasoning about, and understanding of, the task it is executing. Subtasks are particularly useful for:
+
+- **Hierarchical policies**: Building policies that include subtask predictions to visualize robot reasoning in real time
+- **Reward modeling**: Helping reward models understand task progression (e.g., SARM-style stage-aware reward models)
+- **Task decomposition**: Breaking down complex manipulation tasks into atomic, interpretable steps
+
+LeRobotDataset now supports subtasks as part of its dataset structure, alongside tasks.
+
+## What are Subtasks?
+
+While a **task** describes the overall goal (e.g., "Pick up the apple and place it in the basket"), **subtasks** break down the execution into finer-grained steps:
+
+1. "Approach the apple"
+2. "Grasp the apple"
+3. "Lift the apple"
+4. "Move to basket"
+5. "Release the apple"
+
+Each frame in the dataset can be annotated with its corresponding subtask, enabling models to learn and predict these intermediate stages.
+
+<img
+  src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/subtask-asset.png"
+  alt="An overview of subtask annotation showing how frames are labeled with intermediate subtask stages"
+  width="80%"
+/>
+
+<p>
+  <em>Figure: Overview of subtask annotation.</em>
+</p>
+
+**Reference:** _Subtask-learning based for robot self-assembly in flexible collaborative assembly in manufacturing_, Original Article, Published: 19 April 2022.
+
+## Dataset Structure
+
+Subtask information is stored in the dataset metadata:
+
+```
+my-dataset/
+├── data/
+│   └── ...
+├── meta/
+│   ├── info.json
+│   ├── stats.json
+│   ├── tasks.parquet
+│   ├── subtasks.parquet      # Subtask index → subtask string mapping
+│   └── episodes/
+│       └── ...
+└── videos/
+    └── ...
+```
+
+### Subtasks Parquet File
+
+The `meta/subtasks.parquet` file maps subtask indices to their natural language descriptions:
+
+| subtask_index | subtask (index column) |
+| ------------- | ---------------------- |
+| 0             | "Approach the apple"   |
+| 1             | "Grasp the apple"      |
+| 2             | "Lift the apple"       |
+| ...           | ...                    |
+
+### Frame-Level Annotations
+
+Each frame in the dataset can include a `subtask_index` field that references the subtasks parquet file:
+
+```python
+# Example frame data in the parquet file
+{
+    "index": 42,
+    "timestamp": 1.4,
+    "episode_index": 0,
+    "task_index": 0,
+    "subtask_index": 2,  # References "Lift the apple"
+    "observation.state": [...],
+    "action": [...],
+}
+```
+
+## Annotating Datasets with Subtasks
+
+We provide a Hugging Face Space for annotating any LeRobotDataset with subtasks:
+
+**[https://huggingface.co/spaces/lerobot/annotate](https://huggingface.co/spaces/lerobot/annotate)**
+
+When your annotation is complete, click "Push to Hub" to upload the annotated dataset.
+
+To run the annotation Space locally instead, follow the instructions at
+[github.com/huggingface/lerobot-annotate](https://github.com/huggingface/lerobot-annotate).
+
+## Loading Datasets with Subtasks
+
+When you load a dataset with subtask annotations, the subtask information is automatically available:
+
+```python
+from lerobot.datasets import LeRobotDataset
+
+# Load a dataset with subtask annotations
+dataset = LeRobotDataset("jadechoghari/collect-fruit-annotated")
+
+# Access a sample
+sample = dataset[100]
+
+# The sample includes both task and subtask information
+print(sample["task"])        # "Collect the fruit"
+print(sample["subtask"])     # "Grasp the apple"
+print(sample["task_index"])  # tensor(0)
+print(sample["subtask_index"])  # tensor(2)
+```
+
+### Checking for Subtask Support
+
+You can check if a dataset has subtask annotations:
+
+```python
+# Check if subtasks are available
+has_subtasks = (
+    "subtask_index" in dataset.features
+    and dataset.meta.subtasks is not None
+)
+
+if has_subtasks:
+    print(f"Dataset has {len(dataset.meta.subtasks)} unique subtasks")
+    print("Subtasks:", list(dataset.meta.subtasks.index))
+```
+
+## Using Subtasks for Training
+
+### With the Tokenizer Processor
+
+The `TokenizerProcessorStep` automatically handles subtask tokenization for Vision-Language-Action (VLA) models:
+
+```python
+from lerobot.processor import TokenizerProcessorStep
+
+# Create a tokenizer processor step
+tokenizer_processor = TokenizerProcessorStep(
+    tokenizer_name_or_path="google/paligemma-3b-pt-224",
+    padding="max_length",
+    max_length=64,
+)
+
+# The processor will automatically tokenize subtasks if present in the batch
+# and add them to the observation under:
+# - "observation.subtask.tokens"
+# - "observation.subtask.attention_mask"
+```
+
+When subtasks are available in the batch, the tokenizer processor adds:
+
+- `observation.subtask.tokens`: Tokenized subtask text
+- `observation.subtask.attention_mask`: Attention mask for the subtask tokens
+
+### DataLoader with Subtasks
+
+```python
+import torch
+from lerobot.datasets import LeRobotDataset
+
+dataset = LeRobotDataset("jadechoghari/collect-fruit-annotated")
+
+dataloader = torch.utils.data.DataLoader(
+    dataset,
+    batch_size=16,
+    shuffle=True,
+)
+
+for batch in dataloader:
+    # Access subtask information in the batch
+    subtasks = batch["subtask"]  # List of subtask strings
+    subtask_indices = batch["subtask_index"]  # Tensor of subtask indices
+
+    # Use for training hierarchical policies or reward models
+    print(f"Batch subtasks: {set(subtasks)}")
+```
+
+## Example Datasets with Subtask Annotations
+
+Try loading a dataset with subtask annotations:
+
+```python
+from lerobot.datasets import LeRobotDataset
+
+# Example dataset with subtask annotations
+dataset = LeRobotDataset("jadechoghari/collect-fruit-annotated")
+
+# Explore the subtasks
+print("Available subtasks:")
+for subtask_name in dataset.meta.subtasks.index:
+    print(f"  - {subtask_name}")
+
+# Get subtask distribution
+subtask_counts = {}
+for i in range(len(dataset)):
+    sample = dataset[i]
+    subtask = sample["subtask"]
+    subtask_counts[subtask] = subtask_counts.get(subtask, 0) + 1
+
+print("\nSubtask distribution:")
+for subtask, count in sorted(subtask_counts.items(), key=lambda x: -x[1]):
+    print(f"  {subtask}: {count} frames")
+```
+
+## Use Cases
+
+### 1. Hierarchical Policy Training
+
+Train policies that predict both actions and the current subtask:
+
+```python
+import torch.nn as nn
+
+class HierarchicalPolicy(nn.Module):
+    def __init__(self, encoder, hidden_dim, action_dim, num_subtasks):
+        super().__init__()
+        # Illustrative: any module mapping observations to (batch, hidden_dim) features
+        self.encoder = encoder
+        self.action_head = nn.Linear(hidden_dim, action_dim)
+        self.subtask_head = nn.Linear(hidden_dim, num_subtasks)
+
+    def forward(self, observations):
+        features = self.encoder(observations)
+        actions = self.action_head(features)
+        subtask_logits = self.subtask_head(features)
+        return actions, subtask_logits
+```
+
+### 2. Stage-Aware Reward Modeling (SARM)
+
+Build reward models that understand task progression:
+
+```python
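+import torch
+import torch.nn as nn
+
+# Simplified sketch (not the actual SARM architecture). SARM predicts:
+# - Stage: which subtask is being executed (discrete)
+# - Progress: how far along the subtask is (continuous, 0-1)
+
+class SARMRewardModel(nn.Module):
+    def __init__(self, encoder, hidden_dim, num_stages):
+        super().__init__()
+        # Illustrative: any module mapping observations to (batch, hidden_dim) features
+        self.encoder = encoder
+        self.stage_classifier = nn.Linear(hidden_dim, num_stages)
+        self.progress_regressor = nn.Linear(hidden_dim, 1)
+
+    def forward(self, observations):
+        features = self.encoder(observations)
+        stage_logits = self.stage_classifier(features)
+        progress = torch.sigmoid(self.progress_regressor(features))  # squash to (0, 1)
+        return stage_logits, progress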
+```
+
+### 3. Progress Visualization
+
+Monitor robot execution by tracking subtask progression:
+
+```python
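+def visualize_execution(model, observations, subtask_names):
+    # Assumes each `obs` is a single (unbatched) observation and `subtask_names`
+    # is a list ordered by subtask index, e.g. list(dataset.meta.subtasks.index)
+    for t, obs in enumerate(observations):
+        action, subtask_logits = model(obs)
+        predicted_subtask = subtask_names[subtask_logits.argmax().item()]
+        print(f"t={t}: Executing '{predicted_subtask}'")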
+```
+
+## API Reference
+
+### LeRobotDataset Properties
+
+| Property                    | Type                   | Description                                |
+| --------------------------- | ---------------------- | ------------------------------------------ |
+| `meta.subtasks`             | `pd.DataFrame \| None` | DataFrame mapping subtask names to indices |
+| `features["subtask_index"]` | `dict`                 | Feature spec for subtask_index if present  |
+
+### Sample Keys
+
+When subtasks are available, each sample includes:
+
+| Key             | Type           | Description                          |
+| --------------- | -------------- | ------------------------------------ |
+| `subtask_index` | `torch.Tensor` | Integer index of the current subtask |
+| `subtask`       | `str`          | Natural language subtask description |
+
+## Related Resources
+
+- [SARM Paper](https://arxiv.org/pdf/2509.25358) - Stage-Aware Reward Modeling for Long Horizon Robot Manipulation
+- [LeRobot Annotate Space](https://huggingface.co/spaces/lerobot/annotate) - Interactive annotation tool
+- [LeRobotDataset v3.0](./lerobot-dataset-v3) - Dataset format documentation
--- a/docs/source/language_and_recipes.mdx
+++ b/docs/source/language_and_recipes.mdx
@@ -1,109 +0,0 @@
-# Language columns and recipes
-
-LeRobot stores reusable language annotations directly next to frame data in `data/chunk-*/file-*.parquet`.
-The two optional columns are:
-
- `language_persistent`: a list of rows broadcast across every frame in an episode for state that remains active, such as `subtask`, `plan`, and `memory`.
- `language_events`: a list of rows only on the exact frame where an event was emitted, such as `interjection`, `vqa`, and speech tool calls.
-
-Both columns share the same row shape (event rows omit `timestamp` because the
-frame the row sits on already provides it):
-
-```text
-role: string
-content: string | null
-style: string | null
-timestamp: float64        # persistent rows only
-camera: string | null     # observation.images.* feature key, view-dependent rows only
-tool_calls: list[Json] | null
-```
-
-The `camera` field tags rows whose `content` is grounded in a specific camera
-view. Rows of view-dependent styles (`vqa`, and the reserved `motion` /
-`trace`) MUST set `camera` to the matching `observation.images.*` feature key.
-Rows of every other style MUST leave `camera` as `null`. Pipeline writers and
-the validator enforce this via `validate_camera_field(style, camera)`.
-
-`meta/tasks.parquet` remains the canonical source for the task. The special `${task}` recipe binding always reads that task string and does not depend on language annotations.
-
-## Architecture
-
-The language stack has three layers:
-
-1. `lerobot.datasets.language` defines the schema, style registry, and `column_for_style`.
-2. `lerobot.datasets.language_render` resolves rows and renders messages.
-3. `RenderMessagesStep` turns dataset samples into `messages`, `message_streams`, and `target_message_indices`.
-
-`LeRobotDataset` stays recipe-agnostic. It passes `language_persistent` and `language_events` through when present, and unannotated datasets keep their existing behavior.
-
-## Temporal semantics
-
-Persistent styles are active after emission until replaced:
-
- `active_at(t, style=subtask)`
- `nth_prev(style=memory, offset=1)`
- `nth_next(style=subtask, offset=1)`
-
-Event styles only exist on their exact timestamp:
-
- `emitted_at(t, style=interjection)`
- `emitted_at(t, style=vqa, role=user, camera=observation.images.top)`
- `emitted_at(t, role=assistant, tool_name=say)`
-
-Exact event matching has no tolerance window, so writers must stamp event rows with frame timestamps from the parquet data.
-
-## View-dependent resolution
-
-For view-dependent styles (`vqa`, `motion`, `trace`), the resolver gains a
-`camera=` filter parallel to `role=` and `tool_name=`. Datasets with multiple
-cameras typically emit one (`vqa`, `user`) + (`vqa`, `assistant`) pair per
-camera at the same timestamp; without `camera=`, those resolvers see two
-matches and raise an ambiguity error. Recipes consume each camera through its
-own binding plus a matching image block, e.g.
-
-```yaml
-ask_vqa_top:
-  bindings:
-    vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.top)"
-    vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.top)"
-  messages:
-    - role: user
-      stream: high_level
-      if_present: vqa_query
-      content:
-        - { type: image, feature: observation.images.top }
-        - { type: text, text: "${vqa_query}" }
-    - { role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa }
-```
-
-Add one such sub-recipe per camera the dataset records.
-
-## Recipe anatomy
-
-Recipes are YAML files backed by `TrainingRecipe` and `MessageTurn`.
-
-```yaml
-messages:
-  - { role: user, content: "${task}", stream: high_level }
-  - { role: assistant, content: "${subtask}", stream: low_level, target: true }
-```
-
-Rendered samples use HF-style chat messages plus LeRobot sidecars:
-
-```python
-sample["messages"]
-sample["message_streams"]
-sample["target_message_indices"]
-```
-
-The renderer does not apply a tokenizer chat template. Policy processors decide how to serialize the messages for their backbone.
-
-## Blends
-
-Blend recipes select one weighted sub-recipe deterministically from the sample index.
-The canonical `recipes/pi05_hirobot.yaml` combines memory updates, interjection responses, high-level subtask prediction, low-level execution, and VQA.
-
-## Graceful absence
-
-If both language columns are missing, `None`, or empty, `RenderMessagesStep` is a no-op.
-If an event-scoped branch is selected on a frame without the required event row, rendering returns `None`, allowing a loader to retry another sample.
--- a/docs/source/libero_plus.mdx
+++ b/docs/source/libero_plus.mdx
@@ -1,188 +0,0 @@
-# LIBERO-plus
-
-LIBERO-plus is a **robustness benchmark** for Vision-Language-Action (VLA) models built on top of [LIBERO](./libero). It systematically stress-tests policies by applying **seven independent perturbation dimensions** to the original LIBERO task set, exposing failure modes that standard benchmarks miss.
-
- Paper: [In-depth Robustness Analysis of Vision-Language-Action Models](https://arxiv.org/abs/2510.13626)
- GitHub: [sylvestf/LIBERO-plus](https://github.com/sylvestf/LIBERO-plus)
- Dataset: [lerobot/libero_plus](https://huggingface.co/datasets/lerobot/libero_plus)
-
-![An overview of the LIBERO-plus benchmark perturbation dimensions](https://github.com/sylvestf/LIBERO-plus/raw/main/static/images/libero-plus.jpg)
-
-## Perturbation dimensions
-
-LIBERO-plus creates ~10 000 task variants by perturbing each original LIBERO task along these axes:
-
-| Dimension             | What changes                                          |
-| --------------------- | ----------------------------------------------------- |
-| Objects layout        | Target position, presence of confounding objects      |
-| Camera viewpoints     | Camera position, orientation, field-of-view           |
-| Robot initial states  | Manipulator start pose                                |
-| Language instructions | LLM-rewritten task description (paraphrase / synonym) |
-| Light conditions      | Intensity, direction, color, shadow                   |
-| Background textures   | Scene surface and object appearance                   |
-| Sensor noise          | Photometric distortions and image degradation         |
-
-## Available task suites
-
-LIBERO-plus covers the same five suites as LIBERO:
-
-| Suite          | CLI name         | Tasks | Max steps | Description                                        |
-| -------------- | ---------------- | ----- | --------- | -------------------------------------------------- |
-| LIBERO-Spatial | `libero_spatial` | 10    | 280       | Tasks requiring reasoning about spatial relations  |
-| LIBERO-Object  | `libero_object`  | 10    | 280       | Tasks centered on manipulating different objects   |
-| LIBERO-Goal    | `libero_goal`    | 10    | 300       | Goal-conditioned tasks with changing targets       |
-| LIBERO-90      | `libero_90`      | 90    | 400       | Short-horizon tasks from the LIBERO-100 collection |
-| LIBERO-Long    | `libero_10`      | 10    | 520       | Long-horizon tasks from the LIBERO-100 collection  |
-
-<Tip warning={true}>
-  Installing LIBERO-plus **replaces** vanilla LIBERO — it uninstalls `hf-libero`
-  so that `import libero` resolves to the LIBERO-plus fork. You cannot have both
-  installed at the same time. To switch back to vanilla LIBERO, uninstall the
-  fork and reinstall with `pip install -e ".[libero]"`.
-</Tip>
-
-## Installation
-
-### System dependencies (Linux only)
-
-```bash
-sudo apt install libexpat1 libfontconfig1-dev libmagickwand-dev
-```
-
-### Python package
-
-```bash
-pip install -e ".[libero]" "robosuite==1.4.1" bddl easydict mujoco wand scikit-image gym
-git clone https://github.com/sylvestf/LIBERO-plus.git
-cd LIBERO-plus && pip install --no-deps -e .
-pip uninstall -y hf-libero  # so `import libero` resolves to the fork
-```
-
-LIBERO-plus is installed from its GitHub fork rather than a pyproject extra — the fork ships as a namespace package that pip can't handle, so it must be cloned and added to `PYTHONPATH`. See `docker/Dockerfile.benchmark.libero_plus` for the canonical install. MuJoCo is required, so only Linux is supported.
-
-<Tip>
-Set the MuJoCo rendering backend before running evaluation:
-
-```bash
-export MUJOCO_GL=egl   # headless / HPC / cloud
-```
-
-</Tip>
-
-### Download LIBERO-plus assets
-
-LIBERO-plus ships its extended asset pack separately. Download `assets.zip` from the [Hugging Face dataset](https://huggingface.co/datasets/Sylvest/LIBERO-plus/tree/main) and extract it into the LIBERO-plus package directory:
-
-```bash
-# After installing the package, find where it was installed:
-python -c "import libero; print(libero.__file__)"
-# Then extract assets.zip into <package_root>/libero/assets/
-```
-
-## Evaluation
-
-### Default evaluation (recommended)
-
-Evaluate across the four standard suites (10 episodes per task):
-
-```bash
-lerobot-eval \
-  --policy.path="your-policy-id" \
-  --env.type=libero_plus \
-  --env.task=libero_spatial,libero_object,libero_goal,libero_10 \
-  --eval.batch_size=1 \
-  --eval.n_episodes=10 \
-  --env.max_parallel_tasks=1
-```
-
-### Single-suite evaluation
-
-Evaluate on one LIBERO-plus suite:
-
-```bash
-lerobot-eval \
-  --policy.path="your-policy-id" \
-  --env.type=libero_plus \
-  --env.task=libero_spatial \
-  --eval.batch_size=1 \
-  --eval.n_episodes=10
-```
-
- `--env.task` picks the suite (`libero_spatial`, `libero_object`, etc.).
- `--env.task_ids` restricts to specific task indices (`[0]`, `[1,2,3]`, etc.). Omit to run all tasks in the suite.
- `--eval.batch_size` controls how many environments run in parallel.
- `--eval.n_episodes` sets how many episodes to run per task.
-
-### Multi-suite evaluation
-
-Benchmark a policy across multiple suites at once by passing a comma-separated list:
-
-```bash
-lerobot-eval \
-  --policy.path="your-policy-id" \
-  --env.type=libero_plus \
-  --env.task=libero_spatial,libero_object \
-  --eval.batch_size=1 \
-  --eval.n_episodes=10
-```
-
-### Control mode
-
-LIBERO-plus supports two control modes — `relative` (default) and `absolute`. Different VLA checkpoints are trained with different action parameterizations, so make sure the mode matches your policy:
-
-```bash
--env.control_mode=relative   # or "absolute"
-```
-
-### Policy inputs and outputs
-
-**Observations:**
-
- `observation.state` — 8-dim proprioceptive features (eef position, axis-angle orientation, gripper qpos)
- `observation.images.image` — main camera view (`agentview_image`), HWC uint8
- `observation.images.image2` — wrist camera view (`robot0_eye_in_hand_image`), HWC uint8
-
-**Actions:**
-
- Continuous control in `Box(-1, 1, shape=(7,))` — 6D end-effector delta + 1D gripper
-
-### Recommended evaluation episodes
-
-For reproducible benchmarking, use **10 episodes per task** across all four standard suites (Spatial, Object, Goal, Long). This gives 400 total episodes and matches the protocol used for published results.
-
-## Training
-
-### Dataset
-
-A LeRobot-format training dataset for LIBERO-plus is available at:
-
- [lerobot/libero_plus](https://huggingface.co/datasets/lerobot/libero_plus)
-
-### Example training command
-
-```bash
-lerobot-train \
-    --policy.type=smolvla \
-    --policy.repo_id=${HF_USER}/smolvla_libero_plus \
-    --policy.load_vlm_weights=true \
-    --dataset.repo_id=lerobot/libero_plus \
-    --env.type=libero_plus \
-    --env.task=libero_spatial \
-    --output_dir=./outputs/ \
-    --steps=100000 \
-    --batch_size=4 \
-    --eval.batch_size=1 \
-    --eval.n_episodes=1 \
-    --eval_freq=1000
-```
-
-## Relationship to LIBERO
-
-LIBERO-plus is a drop-in extension of LIBERO:
-
- Same Python gym interface (`LiberoEnv`, `LiberoProcessorStep`)
- Same camera names and observation/action format
- Same task suite names
- Installs under the same `libero` Python package name (different GitHub repo)
-
-To use the original LIBERO benchmark, see [LIBERO](./libero) and use `--env.type=libero`.
--- a/docs/source/robocerebra.mdx
+++ b/docs/source/robocerebra.mdx
@@ -1,99 +0,0 @@
-# RoboCerebra
-
-[RoboCerebra](https://robocerebra-project.github.io/) is a long-horizon manipulation benchmark that evaluates **high-level reasoning, planning, and memory** in VLAs. Episodes chain multiple sub-goals with language-grounded intermediate instructions, built on top of LIBERO's simulator stack (MuJoCo + robosuite, Franka Panda 7-DOF).
-
- Paper: [RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation](https://arxiv.org/abs/2506.06677)
- Project website: [robocerebra-project.github.io](https://robocerebra-project.github.io/)
- Dataset: [`lerobot/robocerebra_unified`](https://huggingface.co/datasets/lerobot/robocerebra_unified) — LeRobot v3.0, 6,660 episodes / 571,116 frames at 20 fps, 1,728 language-grounded sub-tasks.
- Pretrained policy: [`lerobot/smolvla_robocerebra`](https://huggingface.co/lerobot/smolvla_robocerebra)
-
-## Available tasks
-
-RoboCerebra reuses LIBERO's simulator, so evaluation runs against the LIBERO `libero_10` long-horizon suite:
-
-| Suite     | CLI name    | Tasks | Description                                                   |
-| --------- | ----------- | ----- | ------------------------------------------------------------- |
-| LIBERO-10 | `libero_10` | 10    | Long-horizon kitchen/living room tasks chaining 3–6 sub-goals |
-
-Each RoboCerebra episode in the dataset is segmented into multiple sub-tasks with natural-language instructions, which the unified dataset exposes as independent supervision signals.
-
-## Installation
-
-RoboCerebra piggybacks on LIBERO, so the `libero` extra is all you need:
-
-```bash
-pip install -e ".[libero]"
-```
-
-<Tip>
-RoboCerebra requires Linux (MuJoCo / robosuite). Set the rendering backend before training or evaluation:
-
-```bash
-export MUJOCO_GL=egl  # for headless servers (HPC, cloud)
-```
-
-</Tip>
-
-## Evaluation
-
-RoboCerebra eval runs against LIBERO's `libero_10` suite with RoboCerebra's camera naming (`image` + `wrist_image`) and an extra empty-camera slot so a three-view-trained policy receives the expected input layout:
-
-```bash
-lerobot-eval \
-  --policy.path=lerobot/smolvla_robocerebra \
-  --env.type=libero \
-  --env.task=libero_10 \
-  --env.fps=20 \
-  --env.obs_type=pixels_agent_pos \
-  --env.observation_height=256 \
-  --env.observation_width=256 \
-  '--env.camera_name_mapping={"agentview_image": "image", "robot0_eye_in_hand_image": "wrist_image"}' \
-  --eval.batch_size=1 \
-  --eval.n_episodes=10 \
-  --eval.use_async_envs=false \
-  --policy.device=cuda \
-  '--rename_map={"observation.images.image": "observation.images.camera1", "observation.images.wrist_image": "observation.images.camera2"}' \
-  --policy.empty_cameras=1
-```
-
-### Recommended evaluation episodes
-
-**10 episodes per task** across the `libero_10` suite (100 total) for reproducible benchmarking. Matches the protocol used in the RoboCerebra paper.
-
-## Policy inputs and outputs
-
-**Observations:**
-
- `observation.state` — 8-dim proprioceptive state (7 joint positions + gripper)
- `observation.images.image` — third-person view, 256×256 HWC uint8
- `observation.images.wrist_image` — wrist-mounted camera view, 256×256 HWC uint8
-
-**Actions:**
-
- Continuous control in `Box(-1, 1, shape=(7,))` — end-effector delta (6D) + gripper (1D)
-
-## Training
-
-The unified dataset at [`lerobot/robocerebra_unified`](https://huggingface.co/datasets/lerobot/robocerebra_unified) exposes two RGB streams and language-grounded sub-task annotations:
-
-| Feature                          | Shape         | Description          |
-| -------------------------------- | ------------- | -------------------- |
-| `observation.images.image`       | (256, 256, 3) | Third-person view    |
-| `observation.images.wrist_image` | (256, 256, 3) | Wrist-mounted camera |
-| `observation.state`              | (8,)          | Joint pos + gripper  |
-| `action`                         | (7,)          | EEF delta + gripper  |
-
-Fine-tune a SmolVLA base on it:
-
-```bash
-lerobot-train \
-  --policy.path=lerobot/smolvla_base \
-  --dataset.repo_id=lerobot/robocerebra_unified \
-  --env.type=libero \
-  --env.task=libero_10 \
-  --output_dir=outputs/smolvla_robocerebra
-```
-
-## Reproducing published results
-
-The released checkpoint [`lerobot/smolvla_robocerebra`](https://huggingface.co/lerobot/smolvla_robocerebra) was trained on `lerobot/robocerebra_unified` and evaluated with the command in the [Evaluation](#evaluation) section. CI runs the same command with `--eval.n_episodes=1` as a smoke test on every PR touching the benchmark.
--- a/docs/source/robomme.mdx
+++ b/docs/source/robomme.mdx
@@ -1,130 +0,0 @@
-# RoboMME
-
-[RoboMME](https://robomme.github.io) is a memory-augmented manipulation benchmark built on ManiSkill (SAPIEN). It evaluates a robot's ability to retain and use information across an episode — counting, object permanence, reference, and imitation.
-
- **16 tasks** across 4 memory-skill suites
- **1,600 training demos** (100 per task, 50 val, 50 test)
- **Dataset**: [`lerobot/robomme`](https://huggingface.co/datasets/lerobot/robomme) — LeRobot v3.0, 768K frames at 10 fps
- **Simulator**: ManiSkill / SAPIEN, Panda arm, Linux only
-
-![RoboMME benchmark tasks overview](https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2603.04639/gradient.png)
-
-## Tasks
-
-| Suite                             | Tasks                                                         |
-| --------------------------------- | ------------------------------------------------------------- |
-| **Counting** (temporal memory)    | BinFill, PickXtimes, SwingXtimes, StopCube                    |
-| **Permanence** (spatial memory)   | VideoUnmask, VideoUnmaskSwap, ButtonUnmask, ButtonUnmaskSwap  |
-| **Reference** (object memory)     | PickHighlight, VideoRepick, VideoPlaceButton, VideoPlaceOrder |
-| **Imitation** (procedural memory) | MoveCube, InsertPeg, PatternLock, RouteStick                  |
-
-## Installation
-
-> RoboMME requires **Linux** (ManiSkill/SAPIEN uses Vulkan rendering). Docker is recommended to isolate dependency conflicts.
-
-### Native (Linux)
-
-```bash
-pip install --override <(printf 'gymnasium==0.29.1\nnumpy==1.26.4\n') \
-  -e '.[smolvla,av-dep]' \
-  'robomme @ git+https://github.com/RoboMME/robomme_benchmark.git@main'
-```
-
-> **Dependency note**: `mani-skill` (pulled by `robomme`) pins `gymnasium==0.29.1` and `numpy<2.0.0`, which conflict with lerobot's base `numpy>=2.0.0`. That's why `robomme` is not a pyproject extra — use the override install above, or the Docker approach below to avoid conflicts entirely.
-
-### Docker (recommended)
-
-```bash
-# Build base image first (from repo root)
-docker build -f docker/Dockerfile.eval-base -t lerobot-eval-base .
-
-# Build RoboMME eval image (applies gymnasium + numpy pin overrides)
-docker build -f docker/Dockerfile.benchmark.robomme -t lerobot-robomme .
-```
-
-The `docker/Dockerfile.benchmark.robomme` image overrides `gymnasium==0.29.1` and `numpy==1.26.4` after lerobot's install. Both versions are runtime-safe for lerobot's actual API usage.
-
-## Running Evaluation
-
-### Default (single task, single episode)
-
-```bash
-lerobot-eval \
-    --policy.path=<your_policy_repo> \
-    --env.type=robomme \
-    --env.task=PickXtimes \
-    --env.dataset_split=test \
-    --env.task_ids=[0] \
-    --eval.batch_size=1 \
-    --eval.n_episodes=1
-```
-
-### Multi-task evaluation
-
-Evaluate multiple tasks in one run by comma-separating task names. Use `task_ids` to control which episodes are evaluated per task. Recommended: 50 episodes per task for the test split.
-
-```bash
-lerobot-eval \
-    --policy.path=<your_policy_repo> \
-    --env.type=robomme \
-    --env.task=PickXtimes,BinFill,StopCube,MoveCube,InsertPeg \
-    --env.dataset_split=test \
-    --env.task_ids=[0,1,2,3,4,5,6,7,8,9] \
-    --eval.batch_size=1 \
-    --eval.n_episodes=50
-```
-
-### Key CLI options for `env.type=robomme`
-
-| Option               | Default       | Description                                        |
-| -------------------- | ------------- | -------------------------------------------------- |
-| `env.task`           | `PickXtimes`  | Any of the 16 task names above (comma-separated)   |
-| `env.dataset_split`  | `test`        | `train`, `val`, or `test`                          |
-| `env.action_space`   | `joint_angle` | `joint_angle` (8-D) or `ee_pose` (7-D)             |
-| `env.episode_length` | `300`         | Max steps per episode                              |
-| `env.task_ids`       | `null`        | List of episode indices to evaluate (null = `[0]`) |
-
-## Dataset
-
-The dataset [`lerobot/robomme`](https://huggingface.co/datasets/lerobot/robomme) is in **LeRobot v3.0 format** and can be loaded directly:
-
-```python
-from lerobot.datasets.lerobot_dataset import LeRobotDataset
-
-dataset = LeRobotDataset("lerobot/robomme")
-```
-
-### Dataset features
-
-| Feature            | Shape         | Description                     |
-| ------------------ | ------------- | ------------------------------- |
-| `image`            | (256, 256, 3) | Front camera RGB                |
-| `wrist_image`      | (256, 256, 3) | Wrist camera RGB                |
-| `actions`          | (8,)          | Joint angles + gripper          |
-| `state`            | (8,)          | Joint positions + gripper state |
-| `simple_subgoal`   | str           | High-level language annotation  |
-| `grounded_subgoal` | str           | Grounded language annotation    |
-| `episode_index`    | int           | Episode ID                      |
-| `frame_index`      | int           | Frame within episode            |
-
-### Feature key alignment (training)
-
-The env wrapper exposes `pixels/image` and `pixels/wrist_image` as observation keys. The `features_map` in `RoboMMEEnv` maps these to `observation.images.image` and `observation.images.wrist_image` for the policy. State is exposed as `agent_pos` and maps to `observation.state`.
-
-The dataset's `image` and `wrist_image` columns already align with the policy input keys, so no renaming is needed when fine-tuning.
-
-## Action Spaces
-
-| Type          | Dim | Description                                               |
-| ------------- | --- | --------------------------------------------------------- |
-| `joint_angle` | 8   | 7 joint angles + 1 gripper (−1 closed, +1 open, absolute) |
-| `ee_pose`     | 7   | xyz + roll/pitch/yaw + gripper                            |
-
-Set via `--env.action_space=joint_angle` (default) or `--env.action_space=ee_pose`.
-
-## Platform Notes
-
-- **Linux only**: ManiSkill requires SAPIEN/Vulkan. macOS and Windows are not supported.
-- **GPU recommended**: Rendering is CPU-capable but slow; CUDA + Vulkan gives full speed.
-- **gymnasium / numpy conflict**: See installation note above. Docker image handles this automatically.
-- **ManiSkill fork**: `robomme` depends on a specific ManiSkill fork (`YinpeiDai/ManiSkill`), pulled in automatically via the `robomme` package.
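
Reviewer note on the removed RoboMME page: the feature key alignment it describes amounts to a key rename from env observation keys to policy input keys. A minimal sketch of that idea, assuming a plain-dict observation; `FEATURES_MAP` and `to_policy_keys` are hypothetical names used for illustration and are not the `RoboMMEEnv` implementation:

```python
# Hypothetical mapping mirroring the key alignment described in the removed page.
FEATURES_MAP = {
    "pixels/image": "observation.images.image",
    "pixels/wrist_image": "observation.images.wrist_image",
    "agent_pos": "observation.state",
}


def to_policy_keys(obs: dict) -> dict:
    """Rename env observation keys to policy input keys; reject unmapped keys."""
    unknown = set(obs) - set(FEATURES_MAP)
    if unknown:
        raise KeyError(f"unmapped observation keys: {sorted(unknown)}")
    return {FEATURES_MAP[key]: value for key, value in obs.items()}


# Placeholder values stand in for camera frames and the 8-D state vector.
env_obs = {"pixels/image": "front", "pixels/wrist_image": "wrist", "agent_pos": [0.0] * 8}
policy_obs = to_policy_keys(env_obs)
print(sorted(policy_obs))
```

Raising on unmapped keys is a design choice of this sketch: it surfaces a camera-name mismatch at the first step instead of silently dropping an input.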
--- a/docs/source/tools.mdx
+++ b/docs/source/tools.mdx
@@ -1,198 +0,0 @@
-# Tools
-
-LeRobot v3.1 supports **tool calls** in policies — assistant messages can
-emit structured invocations like `say(text="OK, starting now")` that the
-runtime dispatches to a real implementation (TTS, controller, logger, …).
-
-This page covers:
-
-1. Where the tool catalog lives (PR 1).
-2. How the annotation pipeline produces tool-call atoms (PR 2).
-3. How to add your own tool (PR 3).
-
-## Where tools are declared
-
-Two layers.
-
-**The catalog** — a list of OpenAI-style function schemas — lives at
-`meta/info.json["tools"]` on each dataset. Example:
-
-```json
-{
-  "features": { "...": "..." },
-  "tools": [
-    {
-      "type": "function",
-      "function": {
-        "name": "say",
-        "description": "Speak a short utterance to the user via the TTS executor.",
-        "parameters": {
-          "type": "object",
-          "properties": {
-            "text": { "type": "string", "description": "The verbatim text to speak." }
-          },
-          "required": ["text"]
-        }
-      }
-    }
-  ]
-}
-```
-
-Read it via the dataset metadata accessor:
-
-```python
-from lerobot.datasets.dataset_metadata import LeRobotDatasetMetadata
-
-meta = LeRobotDatasetMetadata(repo_id="pepijn/super_poulain_final_annotations")
-tools = meta.tools     # list[dict] — OpenAI tool schemas
-```
-
-If the dataset's `info.json` doesn't declare any tools, `meta.tools`
-returns `DEFAULT_TOOLS` from `lerobot.datasets.language` — currently a
-single-entry list with the canonical `say` schema. So unannotated
-datasets and chat-template consumers keep working without any
-configuration:
-
-```python
-prompt_str = tokenizer.apply_chat_template(
-    sample["messages"],
-    tools=meta.tools,                 # works either way
-    add_generation_prompt=False,
-    tokenize=False,
-)
-```
-
-**The implementations** — runnable Python — live under
-`src/lerobot/tools/`, one file per tool. The `say` implementation
-arrives in PR 3 and wraps Kyutai's pocket-tts model.
-
-## Per-row tool *invocations*
-
-The catalog above describes *what can be called*. The actual *call* — the
-function name plus the argument values — is stored per-row, on the
-assistant atoms in `language_events`:
-
-```json
-{
-  "role": "assistant",
-  "content": null,
-  "style": null,
-  "timestamp": 12.4,
-  "camera": null,
-  "tool_calls": [
-    { "type": "function",
-      "function": { "name": "say", "arguments": { "text": "On it." } } }
-  ]
-}
-```
-
-Recipes splice these into rendered messages via `tool_calls_from`:
-
-```yaml
-user_interjection_response:
-  bindings:
-    speech: "emitted_at(t, role=assistant, tool_name=say)"
-  messages:
-    - { role: user,      content: "${task}",         stream: high_level }
-    - { role: assistant, content: "${current_plan}", stream: high_level,
-        target: true, tool_calls_from: speech }
-```
-
-The model's training target is one assistant turn that carries both the
-plan text *and* the `say` tool call. At inference, the runtime parses
-the generated text back into structured `tool_calls` and dispatches to
-the matching implementation.
-
-## How to add your own tool
-
-Three steps. Concrete example: a `record_observation` tool the policy
-can call to capture an extra observation outside the regular control
-loop.
-
-### Step 1 — declare the schema
-
-Add an entry under `meta/info.json["tools"]`. Either edit the file
-directly on disk *before* running the annotation pipeline (it'll be
-preserved) or hand it to `lerobot-annotate` via a config flag (PR 2 —
-exact CLI lands with the pipeline change).
-
-```json
-{
-  "tools": [
-    { "type": "function", "function": { "name": "say", "...": "..." } },
-    {
-      "type": "function",
-      "function": {
-        "name": "record_observation",
-        "description": "Capture a high-resolution still image for the user.",
-        "parameters": {
-          "type": "object",
-          "properties": {
-            "label": { "type": "string", "description": "Short label for the saved image." }
-          },
-          "required": ["label"]
-        }
-      }
-    }
-  ]
-}
-```
-
-The schema follows OpenAI's function-calling convention exactly, so the
-chat template can render it natively.
-
-### Step 2 — implement the call
-
-Create `src/lerobot/tools/record_observation.py`:
-
-```python
-from .base import Tool
-from typing import Any
-
-RECORD_OBSERVATION_SCHEMA: dict[str, Any] = { "...": "..." }   # mirrors the JSON above
-
-
-class RecordObservationTool:
-    name = "record_observation"
-    schema = RECORD_OBSERVATION_SCHEMA
-
-    def __init__(self, schema: dict | None = None, output_dir: str = "."):
-        self.output_dir = output_dir
-
-    def call(self, arguments: dict) -> str:
-        label = arguments["label"]
-        # ... save the latest camera frame to <output_dir>/<label>.png ...
-        return f"saved {label}.png"
-```
-
-One file per tool keeps dependencies isolated — `record_observation`
-might pull `pillow`, while `say` (PR 3) pulls `pocket-tts`. Users
-installing only the tools they need avoid heavy transitive deps.
-
-### Step 3 — register it
-
-Add to `src/lerobot/tools/registry.py` (PR 3):
-
-```python
-from .record_observation import RecordObservationTool
-
-TOOL_REGISTRY["record_observation"] = RecordObservationTool
-```
-
-That's it. At runtime `get_tools(meta)` looks up each schema in
-`meta.tools`, instantiates the matching registered class, and returns
-a name → instance dict the dispatcher can route into.
-
-## Where this fits in the three-PR stack
-
-| Layer | PR | What lands |
-|---|---|---|
-| Catalog storage in `meta/info.json` + `meta.tools` accessor | PR 1 | This page; `SAY_TOOL_SCHEMA`, `DEFAULT_TOOLS` constants in `lerobot.datasets.language`; `LeRobotDatasetMetadata.tools` property |
-| Annotation pipeline writes `tools` to meta after a run; honors anything users pre-populated | PR 2 | `lerobot-annotate` ensures `meta/info.json["tools"]` includes the canonical `say` and merges any user-declared tools |
-| Runnable implementations under `src/lerobot/tools/`; runtime dispatcher; `say.py` wired to Kyutai's pocket-tts | PR 3 | One file per tool; `Tool` protocol; `TOOL_REGISTRY`; optional `[tools]` extra in `pyproject.toml` |
-
-If you want to use a tool *without* writing an implementation (e.g. for
-training-time chat-template formatting only), step 1 alone is enough —
-the model still learns to *generate* the call. Steps 2 and 3 are only
-needed to actually *execute* it at inference.
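
Reviewer note on the removed tools page: the registry lookup and dispatch it describes can be sketched in a few lines. This is an illustration under stated assumptions, not the PR 3 code: `get_tools` here takes the schema list directly rather than a metadata object, and `dispatch` is a hypothetical helper name.

```python
from __future__ import annotations

from typing import Any, Protocol


class Tool(Protocol):
    """Structural interface assumed for this sketch: a name and a call method."""

    name: str

    def call(self, arguments: dict[str, Any]) -> str: ...


class RecordObservationTool:
    name = "record_observation"

    def __init__(self, schema: dict[str, Any] | None = None, output_dir: str = "."):
        self.schema = schema
        self.output_dir = output_dir

    def call(self, arguments: dict[str, Any]) -> str:
        # A real tool would write a frame to disk; the sketch only reports the name.
        return f"saved {arguments['label']}.png"


TOOL_REGISTRY: dict[str, type] = {"record_observation": RecordObservationTool}


def get_tools(schemas: list[dict[str, Any]]) -> dict[str, Tool]:
    """Instantiate a registered class for each declared schema, keyed by tool name."""
    tools: dict[str, Tool] = {}
    for schema in schemas:
        name = schema["function"]["name"]
        cls = TOOL_REGISTRY.get(name)
        if cls is not None:  # schema-only tools (no implementation) are skipped
            tools[name] = cls(schema=schema)
    return tools


def dispatch(tools: dict[str, Tool], tool_call: dict[str, Any]) -> str:
    """Route one parsed tool call to its implementation."""
    fn = tool_call["function"]
    return tools[fn["name"]].call(fn["arguments"])


schemas = [
    {"type": "function", "function": {"name": "say"}},
    {"type": "function", "function": {"name": "record_observation"}},
]
tools = get_tools(schemas)
result = dispatch(
    tools,
    {"type": "function", "function": {"name": "record_observation", "arguments": {"label": "shelf"}}},
)
print(result)
```

Skipping schemas that have no registered class matches the page's closing point: a tool can be declared for training-time formatting without being executable at inference.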
--- a/docs/source/vlabench.mdx
+++ b/docs/source/vlabench.mdx
@@ -1,176 +0,0 @@
-# VLABench
-
-[VLABench](https://github.com/OpenMOSS/VLABench) is a large-scale benchmark for **language-conditioned robotic manipulation with long-horizon reasoning**. The upstream suite covers 100 task categories across 2,000+ objects and evaluates six dimensions of robot intelligence: mesh & texture understanding, spatial reasoning, world-knowledge transfer, semantic instruction comprehension, physical-law understanding, and long-horizon planning. Built on MuJoCo / dm_control with a Franka Panda 7-DOF arm. LeRobot exposes **43 of these tasks** through `--env.task` (21 primitives + 22 composites, see [Available tasks](#available-tasks) below).
-
-- Paper: [VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning](https://arxiv.org/abs/2412.18194)
-- GitHub: [OpenMOSS/VLABench](https://github.com/OpenMOSS/VLABench)
-- Project website: [vlabench.github.io](https://vlabench.github.io)
-- Pretrained policy: [`lerobot/smolvla_vlabench`](https://huggingface.co/lerobot/smolvla_vlabench)
-
-<img
-  src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/vlabench.png"
-  alt="VLABench benchmark overview"
-  width="85%"
-/>
-
-## Available tasks
-
-VLABench ships two task suites covering **43 task categories** in LeRobot's `--env.task` surface:
-
-| Suite     | CLI name    | Tasks | Description                                                      |
-| --------- | ----------- | ----- | ---------------------------------------------------------------- |
-| Primitive | `primitive` | 21    | Single / few-skill combinations (select, insert, physics QA)     |
-| Composite | `composite` | 22    | Multi-step reasoning and long-horizon planning (cook, rearrange) |
-
-**Primitive tasks:** `select_fruit`, `select_toy`, `select_chemistry_tube`, `add_condiment`, `select_book`, `select_painting`, `select_drink`, `insert_flower`, `select_billiards`, `select_ingredient`, `select_mahjong`, `select_poker`, and physical-reasoning tasks (`density_qa`, `friction_qa`, `magnetism_qa`, `reflection_qa`, `simple_cuestick_usage`, `simple_seesaw_usage`, `sound_speed_qa`, `thermal_expansion_qa`, `weight_qa`).
-
-**Composite tasks:** `cluster_billiards`, `cluster_book`, `cluster_drink`, `cluster_toy`, `cook_dishes`, `cool_drink`, `find_unseen_object`, `get_coffee`, `hammer_nail`, `heat_food`, `make_juice`, `play_mahjong`, `play_math_game`, `play_poker`, `play_snooker`, `rearrange_book`, `rearrange_chemistry_tube`, `set_dining_table`, `set_study_table`, `store_food`, `take_chemistry_experiment`, `use_seesaw_complex`.
-
-`--env.task` accepts three forms:
-
-- a single task name (`select_fruit`)
-- a comma-separated list (`select_fruit,heat_food`)
-- a suite shortcut (`primitive`, `composite`, or `primitive,composite`)
-
-## Installation
-
-VLABench is **not on PyPI** — its only distribution is the [OpenMOSS/VLABench](https://github.com/OpenMOSS/VLABench) GitHub repo — so LeRobot does not expose a `vlabench` extra. Install it manually as an editable clone, alongside the MuJoCo / dm_control pins VLABench needs, then fetch the mesh assets:
-
-```bash
-# After following the standard LeRobot installation instructions.
-
-git clone https://github.com/OpenMOSS/VLABench.git ~/VLABench
-git clone https://github.com/motion-planning/rrt-algorithms.git ~/rrt-algorithms
-pip install -e ~/VLABench -e ~/rrt-algorithms
-pip install "mujoco==3.2.2" "dm-control==1.0.22" \
-            open3d colorlog scikit-learn openai gdown
-
-python ~/VLABench/scripts/download_assets.py
-```
-
-<Tip>
-VLABench requires Linux (`sys_platform == 'linux'`) and Python 3.10+. Set the MuJoCo rendering backend before running:
-
-```bash
-export MUJOCO_GL=egl  # for headless servers (HPC, cloud)
-```
-
-</Tip>
-
-## Evaluation
-
-All eval snippets below mirror the command CI runs (see `.github/workflows/benchmark_tests.yml`). The `--rename_map` argument maps VLABench's `image` / `second_image` / `wrist_image` camera keys onto the three-camera (`camera1` / `camera2` / `camera3`) input layout the released `smolvla_vlabench` policy was trained on.
-
-### Single-task evaluation (recommended for quick iteration)
-
-```bash
-lerobot-eval \
-  --policy.path=lerobot/smolvla_vlabench \
-  --env.type=vlabench \
-  --env.task=select_fruit \
-  --eval.batch_size=1 \
-  --eval.n_episodes=10 \
-  --eval.use_async_envs=false \
-  --policy.device=cuda \
-  '--rename_map={"observation.images.image": "observation.images.camera1", "observation.images.second_image": "observation.images.camera2", "observation.images.wrist_image": "observation.images.camera3"}'
-```
-
-### Multi-task evaluation
-
-Pass a comma-separated list of tasks:
-
-```bash
-lerobot-eval \
-  --policy.path=lerobot/smolvla_vlabench \
-  --env.type=vlabench \
-  --env.task=select_fruit,select_toy,add_condiment,heat_food \
-  --eval.batch_size=1 \
-  --eval.n_episodes=10 \
-  --eval.use_async_envs=false \
-  --policy.device=cuda \
-  '--rename_map={"observation.images.image": "observation.images.camera1", "observation.images.second_image": "observation.images.camera2", "observation.images.wrist_image": "observation.images.camera3"}'
-```
-
-### Suite-wide evaluation
-
-Run an entire suite (all 21 primitives or all 22 composites):
-
-```bash
-lerobot-eval \
-  --policy.path=lerobot/smolvla_vlabench \
-  --env.type=vlabench \
-  --env.task=primitive \
-  --eval.batch_size=1 \
-  --eval.n_episodes=10 \
-  --eval.use_async_envs=false \
-  --policy.device=cuda \
-  --env.max_parallel_tasks=1 \
-  '--rename_map={"observation.images.image": "observation.images.camera1", "observation.images.second_image": "observation.images.camera2", "observation.images.wrist_image": "observation.images.camera3"}'
-```
-
-Or both suites:
-
-```bash
-lerobot-eval \
-  --policy.path=lerobot/smolvla_vlabench \
-  --env.type=vlabench \
-  --env.task=primitive,composite \
-  --eval.batch_size=1 \
-  --eval.n_episodes=10 \
-  --eval.use_async_envs=false \
-  --policy.device=cuda \
-  --env.max_parallel_tasks=1 \
-  '--rename_map={"observation.images.image": "observation.images.camera1", "observation.images.second_image": "observation.images.camera2", "observation.images.wrist_image": "observation.images.camera3"}'
-```
-
-### Recommended evaluation episodes
-
-**10 episodes per task** for reproducible benchmarking (210 total for the full primitive suite, 220 for composite). Matches the protocol in the VLABench paper.
-
-## Policy inputs and outputs
-
-**Observations:**
-
-- `observation.state` — 7-dim end-effector state (position xyz + Euler xyz + gripper)
-- `observation.images.image` — front camera, 480×480 HWC uint8
-- `observation.images.second_image` — second camera, 480×480 HWC uint8
-- `observation.images.wrist_image` — wrist camera, 480×480 HWC uint8
-
-**Actions:**
-
-- Continuous control in `Box(-1, 1, shape=(7,))` — 3D position + 3D Euler orientation + 1D gripper.
-
-## Training
-
-### Datasets
-
-Pre-collected VLABench datasets in LeRobot format on the Hub:
-
-- [`VLABench/vlabench_primitive_ft_lerobot_video`](https://huggingface.co/datasets/VLABench/vlabench_primitive_ft_lerobot_video) — 5,000 episodes, 128 tasks, 480×480 images.
-- [`VLABench/vlabench_composite_ft_lerobot_video`](https://huggingface.co/datasets/VLABench/vlabench_composite_ft_lerobot_video) — 5,977 episodes, 167 tasks, 224×224 images.
-
-### Example training command
-
-Fine-tune a SmolVLA base on the primitive suite:
-
-```bash
-lerobot-train \
-  --policy.type=smolvla \
-  --policy.repo_id=${HF_USER}/smolvla_vlabench_primitive \
-  --policy.load_vlm_weights=true \
-  --policy.push_to_hub=true \
-  --dataset.repo_id=VLABench/vlabench_primitive_ft_lerobot_video \
-  --env.type=vlabench \
-  --env.task=select_fruit \
-  --output_dir=./outputs/smolvla_vlabench_primitive \
-  --steps=100000 \
-  --batch_size=4 \
-  --eval_freq=5000 \
-  --eval.batch_size=1 \
-  --eval.n_episodes=1 \
-  --save_freq=10000
-```
-
-## Reproducing published results
-
-The released checkpoint [`lerobot/smolvla_vlabench`](https://huggingface.co/lerobot/smolvla_vlabench) was trained on the primitive-suite dataset above and is evaluated with the [Single-task](#single-task-evaluation-recommended-for-quick-iteration) / [Suite-wide](#suite-wide-evaluation) commands. CI runs a 10-primitive-task smoke eval (one episode each) on every PR touching the benchmark.
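
Reviewer note on the removed VLABench page: the three accepted `--env.task` forms (single name, comma-separated list, suite shortcut) can be sketched as one expansion step. The task names are copied from the page; `expand_task_spec` and `SUITES` are hypothetical names for illustration and do not claim to match the env's actual parser.

```python
PRIMITIVE_TASKS = (
    "select_fruit select_toy select_chemistry_tube add_condiment select_book "
    "select_painting select_drink insert_flower select_billiards select_ingredient "
    "select_mahjong select_poker density_qa friction_qa magnetism_qa reflection_qa "
    "simple_cuestick_usage simple_seesaw_usage sound_speed_qa thermal_expansion_qa "
    "weight_qa"
).split()

COMPOSITE_TASKS = (
    "cluster_billiards cluster_book cluster_drink cluster_toy cook_dishes cool_drink "
    "find_unseen_object get_coffee hammer_nail heat_food make_juice play_mahjong "
    "play_math_game play_poker play_snooker rearrange_book rearrange_chemistry_tube "
    "set_dining_table set_study_table store_food take_chemistry_experiment "
    "use_seesaw_complex"
).split()

SUITES = {"primitive": PRIMITIVE_TASKS, "composite": COMPOSITE_TASKS}
KNOWN_TASKS = set(PRIMITIVE_TASKS) | set(COMPOSITE_TASKS)


def expand_task_spec(spec: str) -> list[str]:
    """Expand a comma-separated spec of task names and suite shortcuts, in order."""
    tasks: list[str] = []
    for token in (t.strip() for t in spec.split(",") if t.strip()):
        if token in SUITES:
            tasks.extend(SUITES[token])
        elif token in KNOWN_TASKS:
            tasks.append(token)
        else:
            raise ValueError(f"unknown VLABench task or suite: {token!r}")
    return tasks


episodes_per_task = 10  # protocol recommended on the page
for spec in ("select_fruit,heat_food", "primitive", "primitive,composite"):
    tasks = expand_task_spec(spec)
    print(spec, len(tasks), len(tasks) * episodes_per_task)
```

The episode totals follow directly from the task counts: 21 primitives and 22 composites at 10 episodes each give 210 and 220 rollouts respectively.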
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -95,7 +95,7 @@ dependencies = [

 # ── Feature-scoped extras ──────────────────────────────────
 dataset = [
-    "datasets>=4.7.0,<5.0.0",
+    "datasets>=4.0.0,<5.0.0",
    "pandas>=2.0.0,<3.0.0", # NOTE: Transitive dependency of datasets
    "pyarrow>=21.0.0,<30.0.0", # NOTE: Transitive dependency of datasets
    "lerobot[av-dep]",
@@ -212,15 +212,6 @@ aloha = ["lerobot[dataset]", "gym-aloha>=0.1.2,<0.2.0", "lerobot[scipy-dep]"]
 pusht = ["lerobot[dataset]", "gym-pusht>=0.1.5,<0.2.0", "pymunk>=6.6.0,<7.0.0"] # TODO: Fix pymunk version in gym-pusht instead
 libero = ["lerobot[dataset]", "lerobot[transformers-dep]", "hf-libero>=0.1.3,<0.2.0; sys_platform == 'linux'", "lerobot[scipy-dep]"]
 metaworld = ["lerobot[dataset]", "metaworld==3.0.0", "lerobot[scipy-dep]"]
-# NOTE: vlabench is NOT exposed as a `lerobot` extra. Its only distribution
-# is the OpenMOSS/VLABench GitHub repo (package name `VLABench`, no PyPI
-# release), so any `vlabench>=X` pip spec is unresolvable. Install it
-# manually alongside MuJoCo / dm-control — see docs/source/vlabench.mdx
-# for the recipe.
-# NOTE: robomme is NOT a pyproject extra — mani-skill hard-pins numpy<2
-# which conflicts with lerobot's numpy>=2 base pin, so the two trees can't
-# resolve into a single env. Install it only in the RoboMME Docker image
-# via `uv pip install --override` (see docker/Dockerfile.benchmark.robomme).
 # NOTE: robocasa is NOT exposed as a `lerobot` extra. Its setup.py pins
 # `lerobot==0.3.3` in install_requires, which cyclically shadows our own
 # workspace `lerobot` and makes the graph unsolvable under any resolver
--- a/scripts/ci/extract_task_descriptions.py
+++ b/scripts/ci/extract_task_descriptions.py
@@ -31,23 +31,9 @@ from __future__ import annotations

 import argparse
 import json
-import re
 import sys
 from pathlib import Path

-# LIBERO-plus derives task.language by space-joining the perturbation-variant
-# filename (grab_language_from_filename in libero/libero/benchmark/__init__.py),
-# so non-_language_ variants inherit a trailing metadata blob like
-# "view 0 0 100 0 0 initstate 0 noise 45" or "add 16". Strip those tokens so
-# the description matches the base instruction used in the training dataset.
-_LIBERO_PERTURBATION_TAIL_RE = re.compile(
-    r"(?:\s(?:view|initstate|noise|add|tb|table|light|level)(?:\s\d+)+)+$"
-)
-
-
-def _strip_libero_perturbation_tail(instruction: str) -> str:
-    return _LIBERO_PERTURBATION_TAIL_RE.sub("", instruction).strip()
-

 def _libero_descriptions(task_suite: str) -> dict[str, str]:
    from libero.libero import benchmark  # type: ignore[import-untyped]
@@ -61,10 +47,7 @@ def _libero_descriptions(task_suite: str) -> dict[str, str]:
        )
        return {}
    suite = suite_dict[task_suite]()
-    return {
-        f"{task_suite}_{i}": _strip_libero_perturbation_tail(suite.get_task(i).language)
-        for i in range(suite.n_tasks)
-    }
+    return {f"{task_suite}_{i}": suite.get_task(i).language for i in range(suite.n_tasks)}


 def _metaworld_descriptions(task_name: str) -> dict[str, str]:
@@ -109,74 +92,16 @@ def _robocasa_descriptions(task_spec: str) -> dict[str, str]:
    return out


-_ROBOMME_DESCRIPTIONS = {
-    "BinFill": "Fill the target bin with the correct number of cubes",
-    "PickXtimes": "Pick the indicated cube the specified number of times",
-    "SwingXtimes": "Swing the object the specified number of times",
-    "StopCube": "Grasp and stop the moving cube",
-    "VideoUnmask": "Pick the cube shown in the reference video",
-    "VideoUnmaskSwap": "Pick the cube matching the reference video after a swap",
-    "ButtonUnmask": "Press the button indicated by the reference",
-    "ButtonUnmaskSwap": "Press the correct button after objects are swapped",
-    "PickHighlight": "Pick the highlighted cube",
-    "VideoRepick": "Repick the cube shown in the reference video",
-    "VideoPlaceButton": "Place the cube on the button shown in the video",
-    "VideoPlaceOrder": "Place cubes in the order shown in the video",
-    "MoveCube": "Move the cube to the target location",
-    "InsertPeg": "Insert the peg into the target hole",
-    "PatternLock": "Unlock the pattern by pressing buttons in sequence",
-    "RouteStick": "Route the stick through the required waypoints",
-}
-
-
-def _robomme_descriptions(task_names: str, task_ids: list[int] | None = None) -> dict[str, str]:
-    """Return descriptions for each requested RoboMME task. Keys match the
-    video filename pattern `<task>_<task_id>` used by the eval script."""
-    if task_ids is None:
-        task_ids = [0]
-    out: dict[str, str] = {}
-    for name in (t.strip() for t in task_names.split(",") if t.strip()):
-        desc = _ROBOMME_DESCRIPTIONS.get(name, name)
-        for tid in task_ids:
-            out[f"{name}_{tid}"] = desc
-    return out
-
-
-def _vlabench_descriptions(task_spec: str) -> dict[str, str]:
-    """For each task in the comma-separated list, emit a cleaned-name label.
-
-    VLABench tasks carry language instructions on their dm_control task
-    object, but pulling them requires loading the full env per task
-    (~seconds each). The CI smoke-eval already captures the instruction
-    inside its episode info; this mapping is just enough to key
-    `metrics.json` by `<task>_0`.
-    """
-    out: dict[str, str] = {}
-    for task in (t.strip() for t in task_spec.split(",") if t.strip()):
-        out[f"{task}_0"] = task.replace("_", " ").strip()
-    return out
-
-
 def main() -> int:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--env", required=True, help="Environment family (libero, metaworld, ...)")
    parser.add_argument("--task", required=True, help="Task/suite name (e.g. libero_spatial)")
-    parser.add_argument(
-        "--task-ids",
-        type=str,
-        default=None,
-        help="Comma-separated task IDs (e.g. '0,1,2'). Default: [0]",
-    )
    parser.add_argument("--output", required=True, help="Path to write task_descriptions.json")
    args = parser.parse_args()

-    task_ids: list[int] | None = None
-    if args.task_ids:
-        task_ids = [int(x.strip()) for x in args.task_ids.split(",")]
-
    descriptions: dict[str, str] = {}
    try:
-        if args.env in ("libero", "libero_plus"):
+        if args.env == "libero":
            descriptions = _libero_descriptions(args.task)
        elif args.env == "metaworld":
            descriptions = _metaworld_descriptions(args.task)
@@ -184,10 +109,6 @@ def main() -> int:
            descriptions = _robotwin_descriptions(args.task)
        elif args.env == "robocasa":
            descriptions = _robocasa_descriptions(args.task)
-        elif args.env == "robomme":
-            descriptions = _robomme_descriptions(args.task, task_ids=task_ids)
-        elif args.env == "vlabench":
-            descriptions = _vlabench_descriptions(args.task)
        else:
            print(
                f"[extract_task_descriptions] No description extractor for env '{args.env}'.",
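
Reviewer note on the helper removed from `extract_task_descriptions.py`: the regex only strips a trailing run of perturbation keywords that are each followed by digits, so base instructions pass through unchanged. The sketch below reuses the pattern from the diff; the sample instructions are made up for illustration.

```python
import re

# Same pattern as the removed _LIBERO_PERTURBATION_TAIL_RE.
TAIL_RE = re.compile(
    r"(?:\s(?:view|initstate|noise|add|tb|table|light|level)(?:\s\d+)+)+$"
)


def strip_tail(instruction: str) -> str:
    """Drop trailing 'keyword <digits...>' groups from a task description."""
    return TAIL_RE.sub("", instruction).strip()


samples = [
    "pick up the bowl view 0 0 100 0 0 initstate 0 noise 45",  # multi-group tail
    "put the mug on the plate add 16",                          # single-group tail
    "put the bowl on the table",                                # keyword without digits
]
cleaned = [strip_tail(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(f"{before!r} -> {after!r}")
```

After this change `_libero_descriptions` returns `task.language` as-is, so any such tail would appear verbatim in the generated descriptions.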
--- a/src/lerobot/configs/__init__.py
+++ b/src/lerobot/configs/__init__.py
@@ -23,7 +23,6 @@ Import them directly: ``from lerobot.configs.train import TrainPipelineConfig``

 from .default import DatasetConfig, EvalConfig, PeftConfig, WandBConfig
 from .policies import PreTrainedConfig
-from .recipe import MessageTurn, TrainingRecipe, load_recipe
 from .types import (
    FeatureType,
    NormalizationMode,
@@ -42,10 +41,7 @@ __all__ = [
    # Config classes
    "DatasetConfig",
    "EvalConfig",
-    "MessageTurn",
    "PeftConfig",
    "PreTrainedConfig",
-    "TrainingRecipe",
    "WandBConfig",
-    "load_recipe",
 ]
--- a/src/lerobot/configs/recipe.py
+++ b/src/lerobot/configs/recipe.py
@@ -1,193 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from __future__ import annotations
-
-import re
-from dataclasses import dataclass
-from pathlib import Path
-from typing import Any, Literal, get_args
-
-MessageRole = Literal["user", "assistant", "system", "tool"]
-MessageStream = Literal["high_level", "low_level"]
-
-DEFAULT_BINDINGS = {
-    "subtask": "active_at(t, style=subtask)",
-    "memory": "active_at(t, style=memory)",
-    "plan": "active_at(t, style=plan)",
-    "speech": "emitted_at(t, role=assistant, tool_name=say)",
-    "interjection": "emitted_at(t, style=interjection)",
-    "vqa": "emitted_at(t, style=vqa, role=assistant)",
-    "vqa_query": "emitted_at(t, style=vqa, role=user)",
-}
-
-_PLACEHOLDER_RE = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")
-_VALID_ROLES = frozenset(get_args(MessageRole))
-_VALID_STREAMS = frozenset(get_args(MessageStream))
-
-
-@dataclass
-class MessageTurn:
-    """A single chat-style turn in a recipe template.
-
-    ``content`` may be a plain string, a list of HF-style multimodal blocks, or
-    ``None`` when ``tool_calls_from`` supplies tool-call payloads instead.
-    ``stream`` tags the turn for downstream filtering, ``target`` flags it as a
-    training target, and ``if_present`` skips the turn when the named binding
-    resolves to ``None``.
-    """
-
-    role: MessageRole
-    content: str | list[dict[str, Any]] | None = None
-    stream: MessageStream | None = None
-    target: bool = False
-    if_present: str | None = None
-    tool_calls_from: str | None = None
-
-    def __post_init__(self) -> None:
-        """Validate role, stream, and content after dataclass construction."""
-        if self.role not in _VALID_ROLES:
-            raise ValueError(f"Unsupported message role: {self.role!r}")
-        if self.stream is not None and self.stream not in _VALID_STREAMS:
-            raise ValueError(f"Unsupported message stream: {self.stream!r}")
-        if self.content is None and self.tool_calls_from is None:
-            raise ValueError("MessageTurn.content is required unless tool_calls_from is set.")
-        if self.content is not None and not isinstance(self.content, (str, list)):
-            raise TypeError("MessageTurn.content must be a string, a list of HF-style blocks, or None.")
-        if isinstance(self.content, list):
-            for block in self.content:
-                if not isinstance(block, dict) or "type" not in block:
-                    raise ValueError(
-                        "Multimodal content blocks must be HF-style dictionaries with a type key."
-                    )
-
-    @classmethod
-    def from_dict(cls, data: dict[str, Any]) -> MessageTurn:
-        """Construct a :class:`MessageTurn` from a plain dictionary."""
-        return cls(**data)
-
-
-@dataclass
-class TrainingRecipe:
-    """A recipe describing how to render training samples from language rows.
-
-    A recipe is either a *message recipe* (``messages`` plus optional
-    ``bindings``) or a *blend recipe* (``blend`` mapping names to weighted
-    sub-recipes). ``weight`` is only meaningful inside a blend.
-    """
-
-    messages: list[MessageTurn] | None = None
-    bindings: dict[str, str] | None = None
-    blend: dict[str, TrainingRecipe] | None = None
-    weight: float | None = None
-
-    def __post_init__(self) -> None:
-        """Validate that exactly one of ``messages`` or ``blend`` is set."""
-        if self.messages is not None and self.blend is not None:
-            raise ValueError("TrainingRecipe must set only one of messages or blend.")
-        if self.messages is None and self.blend is None:
-            raise ValueError("TrainingRecipe must set one of messages or blend.")
-
-        if self.messages is not None:
-            self._validate_message_recipe()
-        if self.blend is not None:
-            self._validate_blend_recipe()
-
-    @classmethod
-    def from_dict(cls, data: dict[str, Any]) -> TrainingRecipe:
-        """Construct a :class:`TrainingRecipe` from a nested dictionary."""
-        data = dict(data)
-        if data.get("messages") is not None:
-            data["messages"] = [
-                turn if isinstance(turn, MessageTurn) else MessageTurn.from_dict(turn)
-                for turn in data["messages"]
-            ]
-        if data.get("blend") is not None:
-            data["blend"] = {
-                name: recipe if isinstance(recipe, TrainingRecipe) else cls.from_dict(recipe)
-                for name, recipe in data["blend"].items()
-            }
-        return cls(**data)
-
-    @classmethod
-    def from_yaml(cls, path: str | Path) -> TrainingRecipe:
-        """Load a :class:`TrainingRecipe` from a YAML file at ``path``."""
-        import yaml  # type: ignore[import-untyped]
-
-        with open(path) as f:
-            data = yaml.safe_load(f)
-        if not isinstance(data, dict):
-            raise ValueError(f"Recipe YAML must contain a mapping at the top level: {path}")
-        return cls.from_dict(data)
-
-    def _validate_message_recipe(self) -> None:
-        """Ensure every templated binding is known and at least one turn is a target."""
-        assert self.messages is not None
-        known_bindings = set(DEFAULT_BINDINGS) | set(self.bindings or {}) | {"task"}
-
-        for turn in self.messages:
-            missing = self._referenced_bindings(turn) - known_bindings
-            if missing:
-                raise ValueError(f"MessageTurn references unknown binding(s): {sorted(missing)}")
-
-        if not any(turn.target for turn in self.messages):
-            raise ValueError("Message recipes must contain at least one target turn.")
-
-    def _validate_blend_recipe(self) -> None:
-        """Ensure each blend component is a non-empty, weighted message recipe."""
-        assert self.blend is not None
-        if not self.blend:
-            raise ValueError("Blend recipes must contain at least one component.")
-
-        for name, recipe in self.blend.items():
-            if recipe.blend is not None:
-                raise ValueError(f"Blend component {name!r} cannot itself define a blend.")
-            if recipe.messages is None:
-                raise ValueError(f"Blend component {name!r} must define messages.")
-            if recipe.weight is None:
-                raise ValueError(f"Blend component {name!r} must define weight.")
-            if recipe.weight <= 0:
-                raise ValueError(f"Blend component {name!r} must have a positive weight.")
-
-    def _referenced_bindings(self, turn: MessageTurn) -> set[str]:
-        """Return the binding names that ``turn`` references via placeholders or attributes."""
-        names: set[str] = set()
-        if turn.if_present is not None:
-            names.add(turn.if_present)
-        if turn.tool_calls_from is not None:
-            names.add(turn.tool_calls_from)
-        names.update(_placeholders_in_content(turn.content))
-        return names
-
-
-def _placeholders_in_content(content: str | list[dict[str, Any]] | None) -> set[str]:
-    """Return the set of ``${name}`` placeholders found anywhere in ``content``."""
-    if content is None:
-        return set()
-    if isinstance(content, str):
-        return set(_PLACEHOLDER_RE.findall(content))
-
-    names: set[str] = set()
-    for block in content:
-        for value in block.values():
-            if isinstance(value, str):
-                names.update(_PLACEHOLDER_RE.findall(value))
-    return names
-
-
-def load_recipe(path: str | Path) -> TrainingRecipe:
-    """Load a :class:`TrainingRecipe` from a YAML file at ``path``."""
-    return TrainingRecipe.from_yaml(path)
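
Note on the removed recipe validation: `_validate_message_recipe` rejected any turn whose `${name}` placeholders, `if_present`, or `tool_calls_from` named a binding that was not declared. The following is a minimal standalone sketch of that check, assuming the same placeholder pattern as `_PLACEHOLDER_RE` in the removed `language_render.py`. `unknown_bindings`, the plain-dict turns, and the `known` set are hypothetical stand-ins for illustration (the real code used `MessageTurn` objects and `DEFAULT_BINDINGS`, whose contents are not shown in this diff).

```python
import re
from typing import Any

# Same pattern as the removed _PLACEHOLDER_RE: matches ${identifier}.
PLACEHOLDER_RE = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def placeholders_in_content(content: Any) -> set[str]:
    """Collect placeholder names from a string or a list of content blocks."""
    if content is None:
        return set()
    if isinstance(content, str):
        return set(PLACEHOLDER_RE.findall(content))
    names: set[str] = set()
    for block in content:
        for value in block.values():
            if isinstance(value, str):
                names.update(PLACEHOLDER_RE.findall(value))
    return names


def unknown_bindings(turn: dict[str, Any], known: set[str]) -> set[str]:
    """Hypothetical helper: names a turn references that are not in ``known``."""
    names = placeholders_in_content(turn.get("content"))
    for attr in ("if_present", "tool_calls_from"):
        if turn.get(attr) is not None:
            names.add(turn[attr])
    return names - known


# Assumed set of known bindings, chosen only for this example.
known = {"task", "plan", "memory"}
ok_turn = {"role": "user", "content": "${task}\nPlan: ${plan}"}
bad_turn = {"role": "assistant", "content": "${next_subtask}", "if_present": "speech"}

ok_missing = unknown_bindings(ok_turn, known)
bad_missing = unknown_bindings(bad_turn, known)
```

In the removed code a non-empty result led to a `ValueError` listing the sorted missing names.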
--- a/src/lerobot/configs/recipes/pi05_hirobot.yaml
+++ b/src/lerobot/configs/recipes/pi05_hirobot.yaml
@@ -1,74 +0,0 @@
-blend:
-
-  memory_update:
-    weight: 0.10
-    bindings:
-      prior_memory: "nth_prev(style=memory, offset=1)"
-      current_memory: "emitted_at(t, style=memory)"
-      completed_subtask: "nth_prev(style=subtask, offset=1)"
-    messages:
-      - {role: user, content: "${task}", stream: high_level}
-      - {role: assistant, content: "Previous memory: ${prior_memory}", stream: high_level, if_present: prior_memory}
-      - {role: user, content: "Completed subtask: ${completed_subtask}", stream: high_level, if_present: completed_subtask}
-      - {role: assistant, content: "${current_memory}", stream: high_level, target: true, if_present: current_memory}
-
-  user_interjection_response:
-    weight: 0.16
-    bindings:
-      prior_plan: "nth_prev(style=plan, offset=1)"
-      current_plan: "emitted_at(t, style=plan)"
-      interjection: "emitted_at(t, style=interjection)"
-      speech: "emitted_at(t, role=assistant, tool_name=say)"
-    messages:
-      - {role: user, content: "${task}", stream: high_level}
-      - {role: assistant, content: "Previous plan:\n${prior_plan}", stream: high_level, if_present: prior_plan}
-      - {role: user, content: "${interjection}", stream: high_level, if_present: interjection}
-      - {role: assistant, content: "${current_plan}", stream: high_level, target: true, if_present: current_plan, tool_calls_from: speech}
-
-  high_level_subtask:
-    weight: 0.15
-    bindings:
-      next_subtask: "nth_next(style=subtask, offset=1)"
-    messages:
-      - {role: user, content: "${task}\nPlan: ${plan}\nMemory: ${memory}", stream: high_level}
-      - {role: user, content: "Current subtask: ${subtask}", stream: high_level, if_present: subtask}
-      - {role: assistant, content: "${next_subtask}", stream: high_level, target: true}
-
-  low_level_execution:
-    weight: 0.35
-    messages:
-      - {role: user, content: "${task}\nPlan: ${plan}\nMemory: ${memory}", stream: high_level}
-      - {role: assistant, content: "${subtask}", stream: low_level, target: true}
-
-  # VQA is view-dependent: bbox / keypoint / count answers only make sense for
-  # the camera they were grounded against. Each camera gets its own sub-recipe
-  # so the resolver can disambiguate via `camera=...` and the user-turn carries
-  # the matching image block. Adjust the camera keys (and add more sub-recipes)
-  # to match the cameras present on your dataset.
-  ask_vqa_top:
-    weight: 0.10
-    bindings:
-      vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.top)"
-      vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.top)"
-    messages:
-      - role: user
-        stream: high_level
-        if_present: vqa_query
-        content:
-          - {type: image, feature: observation.images.top}
-          - {type: text, text: "${vqa_query}"}
-      - {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa}
-
-  ask_vqa_wrist:
-    weight: 0.10
-    bindings:
-      vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.wrist)"
-      vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.wrist)"
-    messages:
-      - role: user
-        stream: high_level
-        if_present: vqa_query
-        content:
-          - {type: image, feature: observation.images.wrist}
-          - {type: text, text: "${vqa_query}"}
-      - {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa}
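
Note on the blend weights in the removed recipe: they sum to 0.96 rather than 1.0, which the removed `_select_recipe` in `language_render.py` tolerated because it scaled its draw by the total weight. The sketch below reproduces that selection scheme on plain name-to-weight pairs, assuming the same blake2b-based draw keyed on `sample_idx`. `pick_component` and `WEIGHTS` are hypothetical names for illustration, not part of the removed module.

```python
import hashlib

# Weights copied from the removed pi05_hirobot.yaml blend; they sum to 0.96.
WEIGHTS = {
    "memory_update": 0.10,
    "user_interjection_response": 0.16,
    "high_level_subtask": 0.15,
    "low_level_execution": 0.35,
    "ask_vqa_top": 0.10,
    "ask_vqa_wrist": 0.10,
}


def pick_component(sample_idx: int, weights: dict[str, float]) -> str:
    """Deterministically map ``sample_idx`` to a component name by weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("Blend weights must sum to a positive value.")

    # Hash the index to a uniform value in [0, total); unlike the built-in
    # hash(), blake2b gives the same result in every process.
    digest = hashlib.blake2b(str(sample_idx).encode(), digest_size=8).digest()
    draw = int.from_bytes(digest, "big") / 2**64 * total

    cumulative = 0.0
    chosen = ""
    for name, weight in weights.items():
        chosen = name
        cumulative += weight
        if draw < cumulative:
            return name
    # Guard against float rounding leaving draw just above the final sum.
    return chosen


picks = [pick_component(i, WEIGHTS) for i in range(2000)]
share = picks.count("low_level_execution") / len(picks)
```

Because the draw depends only on `sample_idx`, a given sample always renders with the same sub-recipe across epochs and runs; over many samples each component's share approaches its weight divided by the total.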
--- a/src/lerobot/datasets/__init__.py
+++ b/src/lerobot/datasets/__init__.py
@@ -37,14 +37,6 @@ from .dataset_tools import (
 from .factory import make_dataset, resolve_delta_timestamps
 from .image_writer import safe_stop_image_writer
 from .io_utils import load_episodes, write_stats
-from .language import (
-    EVENT_ONLY_STYLES,
-    LANGUAGE_EVENTS,
-    LANGUAGE_PERSISTENT,
-    PERSISTENT_STYLES,
-    STYLE_REGISTRY,
-    column_for_style,
-)
 from .lerobot_dataset import LeRobotDataset
 from .multi_dataset import MultiLeRobotDataset
 from .pipeline_features import aggregate_pipeline_dataset_features, create_initial_features
@@ -61,15 +53,10 @@ __all__ = [
    "CODEBASE_VERSION",
    "DEFAULT_EPISODES_PATH",
    "DEFAULT_QUANTILES",
-    "EVENT_ONLY_STYLES",
    "EpisodeAwareSampler",
-    "LANGUAGE_EVENTS",
-    "LANGUAGE_PERSISTENT",
    "LeRobotDataset",
    "LeRobotDatasetMetadata",
    "MultiLeRobotDataset",
-    "PERSISTENT_STYLES",
-    "STYLE_REGISTRY",
    "StreamingLeRobotDataset",
    "VideoEncodingManager",
    "add_features",
@@ -79,7 +66,6 @@ __all__ = [
    "convert_image_to_video_dataset",
    "create_initial_features",
    "create_lerobot_dataset_card",
-    "column_for_style",
    "delete_episodes",
    "get_feature_stats",
    "load_episodes",
--- a/src/lerobot/datasets/compute_stats.py
+++ b/src/lerobot/datasets/compute_stats.py
@@ -512,7 +512,7 @@ def compute_episode_stats(

    ep_stats = {}
    for key, data in episode_data.items():
-        if features[key]["dtype"] in {"string", "language"}:
+        if features[key]["dtype"] == "string":
            continue

        if features[key]["dtype"] in ["image", "video"]:
--- a/src/lerobot/datasets/dataset_metadata.py
+++ b/src/lerobot/datasets/dataset_metadata.py
@@ -34,6 +34,7 @@ from .io_utils import (
    load_episodes,
    load_info,
    load_stats,
+    load_subtasks,
    load_tasks,
    write_info,
    write_json,
@@ -176,6 +177,7 @@ class LeRobotDatasetMetadata:
        self.info = load_info(self.root)
        check_version_compatibility(self.repo_id, self._version, CODEBASE_VERSION)
        self.tasks = load_tasks(self.root)
+        self.subtasks = load_subtasks(self.root)
        self.episodes = load_episodes(self.root)
        self.stats = load_stats(self.root)

@@ -318,28 +320,6 @@ class LeRobotDatasetMetadata:
        """Keys to access visual modalities (regardless of their storage method)."""
        return [key for key, ft in self.features.items() if ft["dtype"] in ["video", "image"]]

-    @property
-    def tools(self) -> list[dict]:
-        """OpenAI-style tool schemas declared by this dataset.
-
-        Read from ``meta/info.json["tools"]``. Returns a copy, so callers
-        can mutate the result safely. Falls back to
-        :data:`lerobot.datasets.language.DEFAULT_TOOLS` (the canonical
-        ``say`` schema) when the dataset doesn't declare any — that way
-        unannotated datasets and chat-template consumers
-        (``apply_chat_template(messages, tools=meta.tools)``) keep
-        working out of the box.
-
-        Implementations live under :mod:`lerobot.tools` (one file per
-        tool); see ``docs/source/tools.mdx`` for the authoring guide.
-        """
-        from .language import DEFAULT_TOOLS  # noqa: PLC0415  (avoid circular import)
-
-        declared = self.info.get("tools")
-        if isinstance(declared, list) and declared:
-            return [dict(t) for t in declared]
-        return [dict(t) for t in DEFAULT_TOOLS]
-
    @property
    def names(self) -> dict[str, list | dict]:
        """Names of the various dimensions of vector modalities."""
@@ -655,6 +635,7 @@ class LeRobotDatasetMetadata:
        _validate_feature_names(features)

        obj.tasks = None
+        obj.subtasks = None
        obj.episodes = None
        obj.stats = None
        obj.info = create_empty_dataset_info(
--- a/src/lerobot/datasets/dataset_reader.py
+++ b/src/lerobot/datasets/dataset_reader.py
@@ -295,4 +295,9 @@ class DatasetReader:
        task_idx = item["task_index"].item()
        item["task"] = self._meta.tasks.iloc[task_idx].name

+        # add subtask information if available
+        if "subtask_index" in self._meta.features and self._meta.subtasks is not None:
+            subtask_idx = item["subtask_index"].item()
+            item["subtask"] = self._meta.subtasks.iloc[subtask_idx].name
+
        return item
--- a/src/lerobot/datasets/feature_utils.py
+++ b/src/lerobot/datasets/feature_utils.py
@@ -22,12 +22,6 @@ from PIL import Image as PILImage
 from lerobot.utils.constants import DEFAULT_FEATURES
 from lerobot.utils.utils import is_valid_numpy_dtype_string

-from .language import (
-    LANGUAGE_PERSISTENT,
-    is_language_column,
-    language_events_column_feature,
-    language_persistent_column_feature,
-)
 from .utils import (
    DEFAULT_CHUNK_SIZE,
    DEFAULT_DATA_FILE_SIZE_IN_MB,
@@ -51,13 +45,7 @@ def get_hf_features_from_features(features: dict) -> datasets.Features:
    """
    hf_features = {}
    for key, ft in features.items():
-        if is_language_column(key):
-            hf_features[key] = (
-                language_persistent_column_feature()
-                if key == LANGUAGE_PERSISTENT
-                else language_events_column_feature()
-            )
-        elif ft["dtype"] == "video":
+        if ft["dtype"] == "video":
            continue
        elif ft["dtype"] == "image":
            hf_features[key] = datasets.Image()
@@ -254,8 +242,6 @@ def validate_feature_dtype_and_shape(
        return validate_feature_image_or_video(name, expected_shape, value)
    elif expected_dtype == "string":
        return validate_feature_string(name, value)
-    elif expected_dtype == "language":
-        return ""
    else:
        raise NotImplementedError(f"The feature dtype '{expected_dtype}' is not implemented yet.")

--- a/src/lerobot/datasets/io_utils.py
+++ b/src/lerobot/datasets/io_utils.py
@@ -34,6 +34,7 @@ from lerobot.utils.utils import SuppressProgressBars, flatten_dict, unflatten_di
 from .utils import (
    DEFAULT_DATA_FILE_SIZE_IN_MB,
    DEFAULT_EPISODES_PATH,
+    DEFAULT_SUBTASKS_PATH,
    DEFAULT_TASKS_PATH,
    EPISODES_DIR,
    INFO_PATH,
@@ -188,6 +189,14 @@ def load_tasks(local_dir: Path) -> pandas.DataFrame:
    return tasks


+def load_subtasks(local_dir: Path) -> pandas.DataFrame | None:
+    """Load subtasks from subtasks.parquet if it exists."""
+    subtasks_path = local_dir / DEFAULT_SUBTASKS_PATH
+    if subtasks_path.exists():
+        return pd.read_parquet(subtasks_path)
+    return None
+
+
 def write_episodes(episodes: Dataset, local_dir: Path) -> None:
    """Write episode metadata to a parquet file in the LeRobot v3.0 format.
    This function writes episode-level metadata to a single parquet file.
@@ -259,13 +268,11 @@ def hf_transform_to_torch(items_dict: dict[str, list[Any]]) -> dict[str, list[to
        dict: The batch with items converted to torch tensors.
    """
    for key in items_dict:
-        if key in {"language_persistent", "language_events"}:
-            continue
        first_item = items_dict[key][0]
        if isinstance(first_item, PILImage.Image):
            to_tensor = transforms.ToTensor()
            items_dict[key] = [to_tensor(img) for img in items_dict[key]]
-        elif first_item is None or isinstance(first_item, dict):
+        elif first_item is None:
            pass
        else:
            items_dict[key] = [x if isinstance(x, str) else torch.tensor(x) for x in items_dict[key]]
@@ -301,11 +308,7 @@ def item_to_torch(item: dict) -> dict:
        dict: Dictionary with all tensor-like items converted to torch.Tensor.
    """
    for key, val in item.items():
-        if isinstance(val, (np.ndarray | list)) and key not in [
-            "task",
-            "language_persistent",
-            "language_events",
-        ]:
+        if isinstance(val, (np.ndarray | list)) and key not in ["task"]:
            # Convert numpy arrays and lists to torch tensors
            item[key] = torch.tensor(val)
    return item
--- a/src/lerobot/datasets/language.py
+++ b/src/lerobot/datasets/language.py
@@ -1,236 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from __future__ import annotations
-
-from typing import Literal
-
-import datasets
-import pyarrow as pa
-
-LANGUAGE_PERSISTENT = "language_persistent"
-LANGUAGE_EVENTS = "language_events"
-LANGUAGE_COLUMNS = (LANGUAGE_PERSISTENT, LANGUAGE_EVENTS)
-PERSISTENT_ROW_FIELDS = ("role", "content", "style", "timestamp", "camera", "tool_calls")
-EVENT_ROW_FIELDS = ("role", "content", "style", "camera", "tool_calls")
-
-CORE_STYLES = {
-    "subtask",
-    "plan",
-    "memory",
-    "motion",
-    "interjection",
-    "vqa",
-    "trace",
-    "task_aug",
-}
-EXTENDED_STYLES = set()
-STYLE_REGISTRY = CORE_STYLES | EXTENDED_STYLES
-
-PERSISTENT_STYLES = {"subtask", "plan", "memory", "motion", "task_aug"}
-EVENT_ONLY_STYLES = {"interjection", "vqa", "trace"}
-
-# Styles whose ``content`` is grounded in a specific camera view. Rows of these
-# styles MUST carry a non-null ``camera`` referencing an ``observation.images.*``
-# feature key. Rows of every other style MUST have ``camera=None``. ``motion``
-# is intentionally NOT in this set: motion primitives are described in
-# robot-frame (joint / Cartesian) terms, not pixel space, so they are
-# camera-agnostic. ``trace`` is the pixel-trajectory event style and IS
-# view-dependent. The ``camera`` field nevertheless lives on
-# ``PERSISTENT_ROW_FIELDS`` too so the schema, validator, and resolver
-# behave symmetrically across the two columns; persistent rows simply
-# always have ``camera=None`` in practice today.
-VIEW_DEPENDENT_STYLES = {"vqa", "trace"}
-
-LanguageColumn = Literal["language_persistent", "language_events"]
-
-
-def _json_arrow_type() -> pa.DataType:
-    """Return the Arrow JSON type, falling back to ``string`` on older pyarrow."""
-    return pa.json_() if hasattr(pa, "json_") else pa.string()
-
-
-def _json_feature() -> object:
-    """Return the HF ``datasets`` JSON feature, falling back to a string value."""
-    return datasets.Json() if hasattr(datasets, "Json") else datasets.Value("string")
-
-
-def language_persistent_row_arrow_type() -> pa.StructType:
-    """Return the Arrow struct type for a single persistent language row.
-
-    Persistent rows carry their own ``timestamp`` because they represent a state
-    that became active at a specific moment and remains active until superseded.
-    """
-    return pa.struct(
-        [
-            pa.field("role", pa.string(), nullable=False),
-            pa.field("content", pa.string(), nullable=True),
-            pa.field("style", pa.string(), nullable=True),
-            pa.field("timestamp", pa.float64(), nullable=False),
-            pa.field("camera", pa.string(), nullable=True),
-            pa.field("tool_calls", pa.list_(_json_arrow_type()), nullable=True),
-        ]
-    )
-
-
-def language_event_row_arrow_type() -> pa.StructType:
-    """Return the Arrow struct type for a single event language row.
-
-    Event rows have no ``timestamp`` field: each event is stored on the dataset
-    row whose frame timestamp is the event's firing time.
-    """
-    return pa.struct(
-        [
-            pa.field("role", pa.string(), nullable=False),
-            pa.field("content", pa.string(), nullable=True),
-            pa.field("style", pa.string(), nullable=True),
-            pa.field("camera", pa.string(), nullable=True),
-            pa.field("tool_calls", pa.list_(_json_arrow_type()), nullable=True),
-        ]
-    )
-
-
-def language_persistent_arrow_type() -> pa.ListType:
-    """Return the Arrow list type for the ``language_persistent`` column."""
-    return pa.list_(language_persistent_row_arrow_type())
-
-
-def language_events_arrow_type() -> pa.ListType:
-    """Return the Arrow list type for the ``language_events`` column."""
-    return pa.list_(language_event_row_arrow_type())
-
-
-def language_persistent_row_feature() -> dict[str, object]:
-    """Return the HF ``datasets`` feature mapping for a persistent language row."""
-    return {
-        "role": datasets.Value("string"),
-        "content": datasets.Value("string"),
-        "style": datasets.Value("string"),
-        "timestamp": datasets.Value("float64"),
-        "camera": datasets.Value("string"),
-        "tool_calls": datasets.List(_json_feature()),
-    }
-
-
-def language_event_row_feature() -> dict[str, object]:
-    """Return the HF ``datasets`` feature mapping for an event language row."""
-    return {
-        "role": datasets.Value("string"),
-        "content": datasets.Value("string"),
-        "style": datasets.Value("string"),
-        "camera": datasets.Value("string"),
-        "tool_calls": datasets.List(_json_feature()),
-    }
-
-
-def language_persistent_column_feature() -> datasets.List:
-    """Return the HF ``datasets`` feature for the ``language_persistent`` column."""
-    return datasets.List(language_persistent_row_feature())
-
-
-def language_events_column_feature() -> datasets.List:
-    """Return the HF ``datasets`` feature for the ``language_events`` column."""
-    return datasets.List(language_event_row_feature())
-
-
-def language_feature_info() -> dict[str, dict]:
-    """Return the ``info["features"]`` entries for both language columns."""
-    return {
-        LANGUAGE_PERSISTENT: {"dtype": "language", "shape": (1,), "names": None},
-        LANGUAGE_EVENTS: {"dtype": "language", "shape": (1,), "names": None},
-    }
-
-
-def is_language_column(key: str) -> bool:
-    """Return ``True`` if ``key`` is one of the dataset's language column names."""
-    return key in LANGUAGE_COLUMNS
-
-
-def is_view_dependent_style(style: str | None) -> bool:
-    """Return ``True`` if rows of ``style`` must be tagged with a ``camera`` key."""
-    return style in VIEW_DEPENDENT_STYLES
-
-
-def validate_camera_field(style: str | None, camera: str | None) -> None:
-    """Enforce the ``camera`` invariant: required iff ``style`` is view-dependent.
-
-    Raises ``ValueError`` if a view-dependent style is missing ``camera`` or if
-    a non-view-dependent style carries one. Pipeline writers and the validator
-    should call this on every emitted row.
-    """
-    if is_view_dependent_style(style):
-        if not camera:
-            raise ValueError(
-                f"Rows of view-dependent style {style!r} require a non-empty 'camera' "
-                f"field referencing an 'observation.images.*' feature key."
-            )
-    elif camera is not None:
-        raise ValueError(
-            f"Rows of style {style!r} must have camera=None; got camera={camera!r}."
-        )
-
-
-# --- Tool registry --------------------------------------------------------
-# Tools declared on a dataset live in ``meta/info.json["tools"]`` as a list
-# of OpenAI-style function schemas. The runtime / training stack reads them
-# through :class:`LeRobotDatasetMetadata.tools` (with these constants as
-# fallback when the dataset doesn't declare any). Implementations live
-# under :mod:`lerobot.tools` (one file per tool); see
-# ``docs/source/tools.mdx`` for the authoring guide.
-
-SAY_TOOL_SCHEMA: dict = {
-    "type": "function",
-    "function": {
-        "name": "say",
-        "description": "Speak a short utterance to the user via the TTS executor.",
-        "parameters": {
-            "type": "object",
-            "properties": {
-                "text": {
-                    "type": "string",
-                    "description": "The verbatim text to speak.",
-                }
-            },
-            "required": ["text"],
-        },
-    },
-}
-"""Canonical schema for the ``say`` tool emitted by the steerable
-annotation pipeline (PR 2 Module 2). Single source of truth — PR 2's
-writer, PR 3's runtime tool registry, and the dataset visualizer all
-import this constant rather than duplicating the dict."""
-
-DEFAULT_TOOLS: list[dict] = [SAY_TOOL_SCHEMA]
-"""Fallback tools list. Returned by ``LeRobotDatasetMetadata.tools``
-when ``meta/info.json["tools"]`` is unset, so unannotated datasets and
-chat-template consumers (``apply_chat_template(messages, tools=...)``)
-keep working out of the box."""
-
-
-def column_for_style(style: str | None) -> LanguageColumn:
-    """Map a language style to the column where rows of that style are stored.
-
-    Styles in :data:`PERSISTENT_STYLES` route to :data:`LANGUAGE_PERSISTENT`.
-    Styles in :data:`EVENT_ONLY_STYLES` and the implicit ``None`` style route
-    to :data:`LANGUAGE_EVENTS`.
-    """
-    if style is None:
-        return LANGUAGE_EVENTS
-    if style in PERSISTENT_STYLES:
-        return LANGUAGE_PERSISTENT
-    if style in EVENT_ONLY_STYLES:
-        return LANGUAGE_EVENTS
-    raise ValueError(f"Unknown language style: {style!r}")
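
Note on the removed `language.py`: two rules governed where a language row was stored. `column_for_style` routed persistent styles to `language_persistent` and event styles (plus the implicit `None` style) to `language_events`, and `validate_camera_field` required a `camera` exactly when the style was view-dependent. A minimal sketch combining both, with the style sets copied from the removed module; `route_rows` and the sample rows are hypothetical and only illustrate the invariants.

```python
from typing import Any

# Style sets copied from the removed module.
PERSISTENT_STYLES = {"subtask", "plan", "memory", "motion", "task_aug"}
EVENT_ONLY_STYLES = {"interjection", "vqa", "trace"}
VIEW_DEPENDENT_STYLES = {"vqa", "trace"}


def route_rows(rows: list[dict[str, Any]]) -> dict[str, list[dict[str, Any]]]:
    """Hypothetical helper: validate each row, then bucket it by column."""
    columns: dict[str, list[dict[str, Any]]] = {
        "language_persistent": [],
        "language_events": [],
    }
    for row in rows:
        style = row.get("style")
        camera = row.get("camera")

        # Camera invariant: required iff the style is view-dependent.
        if style in VIEW_DEPENDENT_STYLES:
            if not camera:
                raise ValueError(f"Style {style!r} requires a camera key.")
        elif camera is not None:
            raise ValueError(f"Style {style!r} must have camera=None.")

        # Column routing: the None style is treated as an event.
        if style in PERSISTENT_STYLES:
            columns["language_persistent"].append(row)
        elif style is None or style in EVENT_ONLY_STYLES:
            columns["language_events"].append(row)
        else:
            raise ValueError(f"Unknown language style: {style!r}")
    return columns


rows = [
    {"role": "assistant", "style": "plan", "content": "1. pick 2. place", "timestamp": 0.0},
    {"role": "user", "style": "vqa", "content": "How many cups?",
     "camera": "observation.images.top"},
    {"role": "assistant", "style": None, "content": None},
]
routed = route_rows(rows)
```

Only the persistent row carries its own `timestamp`; per the removed docstrings, event rows take their time from the dataset frame they are stored on.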
--- a/src/lerobot/datasets/language_render.py
+++ b/src/lerobot/datasets/language_render.py
@@ -1,593 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from __future__ import annotations
-
-import copy
-import hashlib
-import re
-from collections.abc import Sequence
-from typing import Any
-
-from lerobot.configs.recipe import DEFAULT_BINDINGS, TrainingRecipe
-
-from .language import (
-    EVENT_ONLY_STYLES,
-    LANGUAGE_PERSISTENT,
-    PERSISTENT_STYLES,
-    column_for_style,
-)
-
-LanguageRow = dict[str, Any]
-RenderedMessages = dict[str, list[Any]]
-
-_RESOLVER_RE = re.compile(r"^(?P<name>[A-Za-z_][A-Za-z0-9_]*)\((?P<args>.*)\)$")
-_PLACEHOLDER_RE = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")
-
-
-def active_at(
-    t: float,
-    *,
-    persistent: Sequence[LanguageRow],
-    events: Sequence[LanguageRow] | None = None,
-    style: str | None = None,
-    role: str | None = None,
-    tool_name: str | None = None,
-    camera: str | None = None,
-) -> LanguageRow | None:
-    """Return the persistent row of ``style`` that is active at time ``t``.
-
-    A persistent row is "active" at ``t`` when its own ``timestamp`` is the
-    most recent one ``<= t`` for the given ``style``/``role``/``tool_name``/
-    ``camera`` selector. ``events`` is accepted for resolver-signature
-    uniformity but is not consulted: only persistent styles are valid here.
-    """
-    _validate_persistent_resolver("active_at", style)
-    matches = _matching_rows(
-        persistent, style=style, role=role, tool_name=tool_name, camera=camera
-    )
-    matches = [row for row in matches if _timestamp(row) <= t]
-    return _select_latest(
-        matches, style=style, role=role, tool_name=tool_name, camera=camera
-    )
-
-
-def emitted_at(
-    t: float,
-    *,
-    persistent: Sequence[LanguageRow],
-    events: Sequence[LanguageRow],
-    style: str | None = None,
-    role: str | None = None,
-    tool_name: str | None = None,
-    camera: str | None = None,
-) -> LanguageRow | None:
-    """Return the row of ``style`` emitted at exactly time ``t``.
-
-    For persistent styles, this matches persistent rows whose own ``timestamp``
-    equals ``t``. For event styles, the ``events`` list is assumed to come from
-    the dataset row at frame ``t`` (event rows carry no timestamp of their own),
-    so all matching event rows are considered emitted at ``t``. ``camera``
-    filters by the row's ``camera`` field — required to disambiguate when
-    multiple view-dependent rows share ``(t, role)`` across cameras.
-    """
-    column = column_for_style(style)
-    if column == LANGUAGE_PERSISTENT:
-        matches = [
-            row
-            for row in _matching_rows(
-                persistent, style=style, role=role, tool_name=tool_name, camera=camera
-            )
-            if _timestamp(row) == t
-        ]
-        return _select_one(
-            matches,
-            style=style,
-            role=role,
-            tool_name=tool_name,
-            camera=camera,
-            sort_key=_persistent_sort_key,
-        )
-    matches = _matching_rows(
-        events, style=style, role=role, tool_name=tool_name, camera=camera
-    )
-    return _select_one(
-        matches,
-        style=style,
-        role=role,
-        tool_name=tool_name,
-        camera=camera,
-        sort_key=_event_sort_key,
-    )
-
-
-def nth_prev(
-    t: float,
-    *,
-    persistent: Sequence[LanguageRow],
-    events: Sequence[LanguageRow] | None = None,
-    style: str | None = None,
-    offset: int = 1,
-    role: str | None = None,
-    tool_name: str | None = None,
-    camera: str | None = None,
-) -> LanguageRow | None:
-    """Return the persistent row that was active ``offset`` steps before ``t``.
-
-    Walks back through chronologically sorted persistent rows of ``style``
-    (filtered by optional ``role``/``tool_name``/``camera``) and returns the
-    one ``offset`` positions before the row active at ``t``. Only valid for
-    persistent styles.
-    """
-    return _nth_relative(
-        t,
-        persistent=persistent,
-        style=style,
-        offset=-offset,
-        role=role,
-        tool_name=tool_name,
-        camera=camera,
-        resolver_name="nth_prev",
-    )
-
-
-def nth_next(
-    t: float,
-    *,
-    persistent: Sequence[LanguageRow],
-    events: Sequence[LanguageRow] | None = None,
-    style: str | None = None,
-    offset: int = 1,
-    role: str | None = None,
-    tool_name: str | None = None,
-    camera: str | None = None,
-) -> LanguageRow | None:
-    """Return the persistent row that becomes active ``offset`` steps after ``t``.
-
-    Walks forward through chronologically sorted persistent rows of ``style``
-    (filtered by optional ``role``/``tool_name``/``camera``) and returns the
-    one ``offset`` positions after the row active at ``t``. Only valid for
-    persistent styles.
-    """
-    return _nth_relative(
-        t,
-        persistent=persistent,
-        style=style,
-        offset=offset,
-        role=role,
-        tool_name=tool_name,
-        camera=camera,
-        resolver_name="nth_next",
-    )
-
-
-def render_sample(
-    *,
-    recipe: TrainingRecipe,
-    persistent: Sequence[LanguageRow] | None,
-    events: Sequence[LanguageRow] | None,
-    t: float,
-    sample_idx: int,
-    task: str | None = None,
-    dataset_ctx: Any | None = None,
-) -> RenderedMessages | None:
-    """Render the chat-style messages for a single dataset sample.
-
-    Resolves the recipe's bindings against ``persistent`` and ``events`` rows
-    at frame timestamp ``t``, then expands the recipe's message templates.
-    Returns ``None`` if the resolved sample contains no target message.
-    """
-    persistent_rows = _normalize_rows(persistent or [])
-    event_rows = _normalize_rows(events or [])
-    selected_recipe = _select_recipe(recipe, sample_idx)
-    bindings = _resolve_bindings(
-        selected_recipe,
-        persistent=persistent_rows,
-        events=event_rows,
-        t=t,
-        sample_idx=sample_idx,
-        task=task,
-        dataset_ctx=dataset_ctx,
-    )
-    return _render_message_recipe(selected_recipe, bindings)
-
-
-def _select_recipe(recipe: TrainingRecipe, sample_idx: int) -> TrainingRecipe:
-    """Pick a deterministic blend component for ``sample_idx`` (or return ``recipe``)."""
-    if recipe.blend is None:
-        return recipe
-
-    total_weight = sum(component.weight or 0.0 for component in recipe.blend.values())
-    if total_weight <= 0:
-        raise ValueError("Blend weights must sum to a positive value.")
-
-    digest = hashlib.blake2b(str(sample_idx).encode(), digest_size=8).digest()
-    draw = int.from_bytes(digest, "big") / 2**64 * total_weight
-    cumulative = 0.0
-    last_component: TrainingRecipe | None = None
-    for component in recipe.blend.values():
-        last_component = component
-        cumulative += component.weight or 0.0
-        if draw < cumulative:
-            return component
-    assert last_component is not None
-    return last_component
-
-
-def _resolve_bindings(
-    recipe: TrainingRecipe,
-    *,
-    persistent: Sequence[LanguageRow],
-    events: Sequence[LanguageRow],
-    t: float,
-    sample_idx: int,
-    task: str | None,
-    dataset_ctx: Any | None,
-) -> dict[str, LanguageRow | str | None]:
-    """Resolve every binding in ``recipe`` (plus ``task``) at time ``t``."""
-    bindings: dict[str, LanguageRow | str | None] = {
-        "task": _resolve_task(
-            task, dataset_ctx, persistent=persistent, sample_idx=sample_idx
-        ),
-    }
-    specs = {**DEFAULT_BINDINGS, **(recipe.bindings or {})}
-    for name, spec in specs.items():
-        bindings[name] = _resolve_spec(spec, persistent=persistent, events=events, t=t)
-    return bindings
-
-
-def _resolve_task(
-    task: str | None,
-    dataset_ctx: Any | None,
-    *,
-    persistent: Sequence[LanguageRow] = (),
-    sample_idx: int = 0,
-) -> str | None:
-    """Return the task string for ``sample_idx``.
-
-    Resolution order:
-
-    1. Explicit ``task`` override (caller-supplied) wins.
-    2. If ``persistent`` contains rows of style ``task_aug`` (role=user),
-       deterministically pick one by ``sample_idx`` so each frame of an
-       episode rotates through the available rephrasings across an epoch.
-       This realizes Xiao 2022 / CAST-style task-prompt diversity without
-       changing ``meta/tasks.parquet`` and without forcing recipes to opt
-       in: ``${task}`` automatically picks a rephrasing when one exists,
-       and falls back to the canonical task otherwise. Recipes that want
-       the literal canonical task can override the binding.
-    3. Otherwise read the canonical task from ``dataset_ctx`` (which is
-       backed by ``meta/tasks.parquet``).
-    """
-    if task is not None:
-        return task
-
-    aug_rows = [
-        r
-        for r in persistent
-        if r.get("style") == "task_aug" and r.get("role") == "user"
-    ]
-    if aug_rows:
-        # Deterministic, blake2b-based pick keyed on sample_idx so the
-        # rotation is reproducible across runs (Python's built-in ``hash``
-        # is process-randomized).
-        digest = hashlib.blake2b(
-            f"task_aug:{sample_idx}".encode(), digest_size=8
-        ).digest()
-        idx = int.from_bytes(digest, "big") % len(aug_rows)
-        chosen = aug_rows[idx].get("content")
-        if chosen:
-            return str(chosen)
-
-    if dataset_ctx is None:
-        return None
-    if isinstance(dataset_ctx, dict):
-        return dataset_ctx.get("task")
-    return getattr(dataset_ctx, "task", None)
-
-
-def _resolve_spec(
-    spec: str,
-    *,
-    persistent: Sequence[LanguageRow],
-    events: Sequence[LanguageRow],
-    t: float,
-) -> LanguageRow | None:
-    """Parse a single binding's resolver expression and dispatch to its function."""
-    match = _RESOLVER_RE.match(spec.strip())
-    if match is None:
-        raise ValueError(f"Invalid resolver expression: {spec!r}")
-    name = match.group("name")
-    kwargs = _parse_resolver_args(match.group("args"))
-    kwargs.pop("t_arg", None)
-
-    resolvers = {
-        "active_at": active_at,
-        "emitted_at": emitted_at,
-        "nth_prev": nth_prev,
-        "nth_next": nth_next,
-    }
-    if name not in resolvers:
-        raise ValueError(f"Unknown language resolver: {name!r}")
-    return resolvers[name](t, persistent=persistent, events=events, **kwargs)
-
-
-def _parse_resolver_args(args: str) -> dict[str, Any]:
-    """Parse a comma-separated resolver argument list into a kwargs dict."""
-    kwargs: dict[str, Any] = {}
-    if not args.strip():
-        return kwargs
-
-    parts = [part.strip() for part in args.split(",") if part.strip()]
-    for part in parts:
-        if part == "t":
-            kwargs["t_arg"] = True
-            continue
-        if "=" not in part:
-            raise ValueError(f"Invalid resolver argument: {part!r}")
-        key, value = (item.strip() for item in part.split("=", 1))
-        if key == "offset":
-            kwargs[key] = int(value)
-        else:
-            kwargs[key] = value.strip("\"'")
-    return kwargs
-
-
-def _render_message_recipe(
-    recipe: TrainingRecipe,
-    bindings: dict[str, LanguageRow | str | None],
-) -> RenderedMessages | None:
-    """Expand ``recipe.messages`` into rendered chat messages using ``bindings``."""
-    assert recipe.messages is not None
-    messages: list[dict[str, Any]] = []
-    streams: list[str | None] = []
-    target_indices: list[int] = []
-
-    for turn in recipe.messages:
-        if turn.if_present is not None and bindings.get(turn.if_present) is None:
-            continue
-
-        message = {"role": turn.role}
-        if turn.content is not None:
-            message["content"] = _render_content(turn.content, bindings)
-
-        if turn.tool_calls_from is not None:
-            row = bindings.get(turn.tool_calls_from)
-            tool_calls = row.get("tool_calls") if isinstance(row, dict) else None
-            if tool_calls:
-                message["tool_calls"] = copy.deepcopy(tool_calls)
-
-        message_idx = len(messages)
-        messages.append(message)
-        streams.append(turn.stream)
-        if turn.target:
-            target_indices.append(message_idx)
-
-    if not target_indices:
-        return None
-
-    rendered = {
-        "messages": messages,
-        "message_streams": streams,
-        "target_message_indices": target_indices,
-    }
-    _validate_rendered(rendered)
-    return rendered
-
-
-def _render_content(
-    content: str | list[dict[str, Any]],
-    bindings: dict[str, LanguageRow | str | None],
-) -> str | list[dict[str, Any]]:
-    """Substitute bindings into a string or each string field of multimodal blocks."""
-    if isinstance(content, str):
-        return _substitute(content, bindings)
-
-    rendered_blocks = []
-    for block in content:
-        rendered_block = copy.deepcopy(block)
-        for key, value in rendered_block.items():
-            if isinstance(value, str):
-                rendered_block[key] = _substitute(value, bindings)
-        rendered_blocks.append(rendered_block)
-    return rendered_blocks
-
-
-def _substitute(template: str, bindings: dict[str, LanguageRow | str | None]) -> str:
-    """Replace ``${name}`` placeholders in ``template`` with their bound values."""
-
-    def replace(match: re.Match[str]) -> str:
-        """Resolve a single ``${name}`` match to its bound string value."""
-        name = match.group(1)
-        if name not in bindings:
-            raise ValueError(f"Unknown template binding: {name!r}")
-        value = bindings[name]
-        if value is None:
-            return ""
-        if isinstance(value, dict):
-            content = value.get("content")
-            return "" if content is None else str(content)
-        return str(value)
-
-    return _PLACEHOLDER_RE.sub(replace, template)
-
-
-def _validate_rendered(rendered: RenderedMessages) -> None:
-    """Sanity-check the rendered output for stream/target alignment."""
-    messages = rendered["messages"]
-    streams = rendered["message_streams"]
-    target_indices = rendered["target_message_indices"]
-
-    if len(streams) != len(messages):
-        raise ValueError("message_streams must be aligned with messages.")
-    if not target_indices:
-        raise ValueError("Rendered samples must contain at least one target message.")
-    for idx in target_indices:
-        if idx < 0 or idx >= len(messages):
-            raise ValueError(f"Target message index {idx} is out of bounds.")
-    for idx, stream in enumerate(streams):
-        if stream is None:
-            raise ValueError(f"Rendered message {idx} has no stream.")
-
-
-def _nth_relative(
-    t: float,
-    *,
-    persistent: Sequence[LanguageRow],
-    style: str | None,
-    offset: int,
-    role: str | None,
-    tool_name: str | None,
-    camera: str | None,
-    resolver_name: str,
-) -> LanguageRow | None:
-    """Shared body for ``nth_prev`` / ``nth_next`` with signed ``offset``."""
-    _validate_persistent_resolver(resolver_name, style)
-    if abs(offset) < 1:
-        raise ValueError(f"{resolver_name} offset must be non-zero.")
-
-    rows = sorted(
-        _matching_rows(persistent, style=style, role=role, tool_name=tool_name, camera=camera),
-        key=_persistent_sort_key,
-    )
-    if not rows:
-        return None
-
-    anchor_idx = None
-    for idx, row in enumerate(rows):
-        if _timestamp(row) <= t:
-            anchor_idx = idx
-        else:
-            break
-
-    target_idx = (offset - 1 if offset > 0 else None) if anchor_idx is None else anchor_idx + offset
-
-    if target_idx is None or target_idx < 0 or target_idx >= len(rows):
-        return None
-    return rows[target_idx]
-
-
-def _validate_persistent_resolver(resolver_name: str, style: str | None) -> None:
-    """Reject calls with missing or event-only ``style`` for persistent resolvers."""
-    if style is None:
-        raise ValueError(f"{resolver_name} requires a persistent style.")
-    if style in EVENT_ONLY_STYLES:
-        raise ValueError(f"{resolver_name} cannot be used with event-only style {style!r}.")
-    if style not in PERSISTENT_STYLES:
-        column_for_style(style)
-
-
-def _matching_rows(
-    rows: Sequence[LanguageRow],
-    *,
-    style: str | None,
-    role: str | None,
-    tool_name: str | None,
-    camera: str | None,
-) -> list[LanguageRow]:
-    """Return ``rows`` filtered by optional ``style``/``role``/``tool_name``/``camera`` selectors."""
-    return [
-        row
-        for row in rows
-        if (style is None or row.get("style") == style)
-        and (role is None or row.get("role") == role)
-        and (tool_name is None or _row_has_tool_name(row, tool_name))
-        and (camera is None or row.get("camera") == camera)
-    ]
-
-
-def _select_latest(
-    rows: Sequence[LanguageRow],
-    *,
-    style: str | None,
-    role: str | None,
-    tool_name: str | None,
-    camera: str | None,
-) -> LanguageRow | None:
-    """Return the row tied for the latest ``timestamp`` (disambiguated by selectors)."""
-    if not rows:
-        return None
-    rows = sorted(rows, key=_persistent_sort_key)
-    latest_ts = _timestamp(rows[-1])
-    return _select_one(
-        [row for row in rows if _timestamp(row) == latest_ts],
-        style=style,
-        role=role,
-        tool_name=tool_name,
-        camera=camera,
-        sort_key=_persistent_sort_key,
-    )
-
-
-def _select_one(
-    rows: Sequence[LanguageRow],
-    *,
-    style: str | None,
-    role: str | None,
-    tool_name: str | None,
-    camera: str | None,
-    sort_key: Any,
-) -> LanguageRow | None:
-    """Return the single matching row, or raise if the selectors are ambiguous."""
-    if not rows:
-        return None
-    if len(rows) > 1 and role is None and tool_name is None and camera is None:
-        raise ValueError(
-            f"Ambiguous resolver for style={style!r}; add role=..., tool_name=..., "
-            f"or camera=... to disambiguate."
-        )
-    return sorted(rows, key=sort_key)[0]
-
-
-def _persistent_sort_key(row: LanguageRow) -> tuple[float, str, str]:
-    """Sort key for persistent rows: ``(timestamp, style, role)``."""
-    return (_timestamp(row), row.get("style") or "", row.get("role") or "")
-
-
-def _event_sort_key(row: LanguageRow) -> tuple[str, str]:
-    """Sort key for event rows: ``(style, role)`` (timestamp is implicit in the frame)."""
-    return (row.get("style") or "", row.get("role") or "")
-
-
-def _timestamp(row: LanguageRow) -> float:
-    """Extract a row's ``timestamp`` as a Python float (unwrapping numpy scalars)."""
-    value = row["timestamp"]
-    return float(value.item() if hasattr(value, "item") else value)
-
-
-def _row_has_tool_name(row: LanguageRow, tool_name: str) -> bool:
-    """Return ``True`` if any of the row's tool calls invokes ``tool_name``."""
-    for tool_call in row.get("tool_calls") or []:
-        if isinstance(tool_call, str):
-            continue
-        function = tool_call.get("function") if isinstance(tool_call, dict) else None
-        if isinstance(function, dict) and function.get("name") == tool_name:
-            return True
-    return False
-
-
-def _normalize_rows(rows: Sequence[Any]) -> list[LanguageRow]:
-    """Convert pyarrow scalars / mappings into a fresh list of plain dict rows."""
-    normalized = []
-    for row in rows:
-        if row is None:
-            continue
-        if hasattr(row, "as_py"):
-            row = row.as_py()
-        if not isinstance(row, dict):
-            raise TypeError(f"Language rows must be dictionaries, got {type(row).__name__}.")
-        normalized.append(dict(row))
-    return normalized
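Note on the removed `_select_recipe` helper above: it chose a blend component by hashing `sample_idx` with blake2b rather than drawing from a random generator, so the same sample always maps to the same component across runs. The following is a minimal standalone sketch of that idea, not the removed implementation itself. The function name `pick_component`, the plain `dict` of weights standing in for `recipe.blend`, and the component names are assumptions made for illustration.

```python
import hashlib


def pick_component(weights: dict[str, float], sample_idx: int) -> str:
    """Pick a key from ``weights`` deterministically, proportional to its weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("Blend weights must sum to a positive value.")

    # Hash the sample index to 64 bits, then scale the result onto [0, total].
    digest = hashlib.blake2b(str(sample_idx).encode(), digest_size=8).digest()
    draw = int.from_bytes(digest, "big") / 2**64 * total

    cumulative = 0.0
    last = None
    for name, weight in weights.items():
        last = name
        cumulative += weight
        if draw < cumulative:
            return name
    # Float rounding can push ``draw`` up to ``total``; fall back to the last key.
    return last


# Hypothetical component names, for illustration only.
weights = {"subtask": 3.0, "vqa": 1.0}
picks = [pick_component(weights, i) for i in range(1000)]
```

Because the draw depends only on `sample_idx`, repeated calls with the same index return the same component, which Python's process-randomized built-in `hash` would not guarantee.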
--- a/src/lerobot/datasets/utils.py
+++ b/src/lerobot/datasets/utils.py
@@ -83,6 +83,7 @@ VIDEO_DIR = "videos"

 CHUNK_FILE_PATTERN = "chunk-{chunk_index:03d}/file-{file_index:03d}"
 DEFAULT_TASKS_PATH = "meta/tasks.parquet"
+DEFAULT_SUBTASKS_PATH = "meta/subtasks.parquet"
 DEFAULT_EPISODES_PATH = EPISODES_DIR + "/" + CHUNK_FILE_PATTERN + ".parquet"
 DEFAULT_DATA_PATH = DATA_DIR + "/" + CHUNK_FILE_PATTERN + ".parquet"
 DEFAULT_VIDEO_PATH = VIDEO_DIR + "/{video_key}/" + CHUNK_FILE_PATTERN + ".mp4"
--- a/src/lerobot/envs/configs.py
+++ b/src/lerobot/envs/configs.py
@@ -331,7 +331,6 @@ class LiberoEnv(EnvConfig):
    camera_name_mapping: dict[str, str] | None = None
    observation_height: int = 360
    observation_width: int = 360
-    is_libero_plus: bool = False
    features: dict[str, PolicyFeature] = field(
        default_factory=lambda: {
            ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(7,)),
@@ -433,7 +432,6 @@ class LiberoEnv(EnvConfig):
            control_mode=self.control_mode,
            episode_length=self.episode_length,
            camera_name_mapping=self.camera_name_mapping,
-            is_libero_plus=self.is_libero_plus,
        )

    def get_env_processors(self):
@@ -573,71 +571,6 @@ class RoboCasaEnv(EnvConfig):
        )


-@EnvConfig.register_subclass("vlabench")
-@dataclass
-class VLABenchEnv(EnvConfig):
-    task: str = "select_fruit"
-    fps: int = 10
-    episode_length: int = 500
-    obs_type: str = "pixels_agent_pos"
-    render_mode: str = "rgb_array"
-    render_resolution: tuple[int, int] = (480, 480)
-    robot: str = "franka"
-    action_mode: str = "eef"
-    features: dict[str, PolicyFeature] = field(
-        default_factory=lambda: {
-            ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(7,)),
-        }
-    )
-    features_map: dict[str, str] = field(
-        default_factory=lambda: {
-            ACTION: ACTION,
-            "agent_pos": OBS_STATE,
-            "pixels/image": f"{OBS_IMAGES}.image",
-            "pixels/second_image": f"{OBS_IMAGES}.second_image",
-            "pixels/wrist_image": f"{OBS_IMAGES}.wrist_image",
-        }
-    )
-
-    def __post_init__(self):
-        h, w = self.render_resolution
-        if self.obs_type == "pixels":
-            self.features["pixels/image"] = PolicyFeature(type=FeatureType.VISUAL, shape=(h, w, 3))
-            self.features["pixels/second_image"] = PolicyFeature(type=FeatureType.VISUAL, shape=(h, w, 3))
-            self.features["pixels/wrist_image"] = PolicyFeature(type=FeatureType.VISUAL, shape=(h, w, 3))
-        elif self.obs_type == "pixels_agent_pos":
-            self.features["pixels/image"] = PolicyFeature(type=FeatureType.VISUAL, shape=(h, w, 3))
-            self.features["pixels/second_image"] = PolicyFeature(type=FeatureType.VISUAL, shape=(h, w, 3))
-            self.features["pixels/wrist_image"] = PolicyFeature(type=FeatureType.VISUAL, shape=(h, w, 3))
-            self.features["agent_pos"] = PolicyFeature(type=FeatureType.STATE, shape=(7,))
-        else:
-            raise ValueError(f"Unsupported obs_type: {self.obs_type}")
-
-    @property
-    def gym_kwargs(self) -> dict:
-        return {
-            "obs_type": self.obs_type,
-            "render_mode": self.render_mode,
-            "render_resolution": self.render_resolution,
-            "robot": self.robot,
-            "max_episode_steps": self.episode_length,
-            "action_mode": self.action_mode,
-        }
-
-    def create_envs(self, n_envs: int, use_async_envs: bool = False):
-        from .vlabench import create_vlabench_envs
-
-        if self.task is None:
-            raise ValueError("VLABenchEnv requires a task to be specified")
-        env_cls = _make_vec_env_cls(use_async_envs, n_envs)
-        return create_vlabench_envs(
-            task=self.task,
-            n_envs=n_envs,
-            gym_kwargs=self.gym_kwargs,
-            env_cls=env_cls,
-        )
-
-
 @EnvConfig.register_subclass("isaaclab_arena")
 @dataclass
 class IsaaclabArenaEnv(HubEnvConfig):
@@ -718,30 +651,6 @@ class IsaaclabArenaEnv(HubEnvConfig):
        )


-@EnvConfig.register_subclass("libero_plus")
-@dataclass
-class LiberoPlusEnv(LiberoEnv):
-    """Config for LIBERO-plus robustness benchmark evaluation.
-
-    LIBERO-plus extends LIBERO with 7 perturbation dimensions (camera viewpoints,
-    object layouts, robot initial states, language instructions, lighting, background
-    textures, sensor noise) producing ~10k task variants.
-
-    The gym interface is identical to LIBERO so this class reuses ``LiberoEnv``
-    entirely — only the registered name and default task suite differ.
-
-    Install: see docker/Dockerfile.benchmark.libero_plus — LIBERO-plus ships
-    as a namespace package from a git fork and must be cloned + PYTHONPATH'd
-    rather than installed as a pyproject extra.
-
-    See Also:
-        https://github.com/sylvestf/LIBERO-plus
-    """
-
-    task: str = "libero_spatial"
-    is_libero_plus: bool = True
-
-
 @EnvConfig.register_subclass("robotwin")
 @dataclass
 class RoboTwinEnvConfig(EnvConfig):
@@ -827,60 +736,3 @@ class RoboTwinEnvConfig(EnvConfig):
            observation_width=self.observation_width,
            episode_length=self.episode_length,
        )
-
-
-@EnvConfig.register_subclass("robomme")
-@dataclass
-class RoboMMEEnv(EnvConfig):
-    """RoboMME memory-augmented manipulation benchmark (ManiSkill/SAPIEN).
-
-    16 tasks across 4 suites: Counting, Permanence, Reference, Imitation.
-    Dataset: lerobot/robomme (LeRobot v3.0, 1,600 episodes).
-    Benchmark: https://github.com/RoboMME/robomme_benchmark
-
-    Requires the `robomme` git package installed separately (Linux only);
-    see docker/Dockerfile.benchmark.robomme for the canonical install.
-    """
-
-    task: str = "PickXtimes"
-    fps: int = 10
-    episode_length: int = 300
-    action_space: str = "joint_angle"  # or "ee_pose" (7-D)
-    dataset_split: str = "test"  # "train" | "val" | "test"
-    task_ids: list[int] | None = None
-    features: dict[str, PolicyFeature] = field(default_factory=dict)
-    features_map: dict[str, str] = field(
-        default_factory=lambda: {
-            ACTION: ACTION,
-            "pixels/image": f"{OBS_IMAGES}.image",
-            "pixels/wrist_image": f"{OBS_IMAGES}.wrist_image",
-            "agent_pos": OBS_STATE,
-        }
-    )
-
-    def __post_init__(self):
-        action_dim = 8 if self.action_space == "joint_angle" else 7
-        self.features = {
-            ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(action_dim,)),
-            "pixels/image": PolicyFeature(type=FeatureType.VISUAL, shape=(256, 256, 3)),
-            "pixels/wrist_image": PolicyFeature(type=FeatureType.VISUAL, shape=(256, 256, 3)),
-            "agent_pos": PolicyFeature(type=FeatureType.STATE, shape=(8,)),
-        }
-
-    @property
-    def gym_kwargs(self) -> dict:
-        return {}
-
-    def create_envs(self, n_envs: int, use_async_envs: bool = True):
-        from lerobot.envs.robomme import create_robomme_envs
-
-        env_cls = _make_vec_env_cls(use_async_envs, n_envs)
-        return create_robomme_envs(
-            task=self.task,
-            n_envs=n_envs,
-            action_space_type=self.action_space,
-            dataset=self.dataset_split,
-            episode_length=self.episode_length,
-            task_ids=self.task_ids,
-            env_cls=env_cls,
-        )
--- a/src/lerobot/envs/libero.py
+++ b/src/lerobot/envs/libero.py
@@ -16,7 +16,6 @@
 from __future__ import annotations

 import os
-import re
 from collections import defaultdict
 from collections.abc import Callable, Iterable, Mapping, Sequence
 from functools import partial
@@ -57,34 +56,14 @@ def _select_task_ids(total_tasks: int, task_ids: Iterable[int] | None) -> list[i
    return ids


-# LIBERO-plus perturbation variants encode the perturbation in the filename
-# but on disk only the base `.pruned_init` exists — strip the suffix to match
-# LIBERO-plus's own suite.get_task_init_states() (we reimplement it here so we
-# can pass weights_only=False for PyTorch 2.6+ numpy pickles).
-_LIBERO_PERTURBATION_SUFFIX_RE = re.compile(r"_(?:language|view|light)_[^.]*|_(?:table|tb)_\d+")
-
-
-def get_task_init_states(task_suite: Any, i: int, is_libero_plus: bool = False) -> np.ndarray:
-    task = task_suite.tasks[i]
-    filename = Path(task.init_states_file)
-    root = Path(get_libero_path("init_states"))
-
-    if not is_libero_plus:
-        init_states_path = root / task.problem_folder / filename.name
-        return torch.load(init_states_path, weights_only=False)  # nosec B614
-
-    # LIBERO-plus: `_add_` / `_level` variants store extra-object layouts under
-    # libero_newobj/ as a flat array that must be reshaped to (1, -1).
-    if "_add_" in filename.name or "_level" in filename.name:
-        init_states_path = root / "libero_newobj" / task.problem_folder / filename.name
-        init_states = torch.load(init_states_path, weights_only=False)  # nosec B614
-        return init_states.reshape(1, -1)
-
-    # LIBERO-plus perturbation variants encode the perturbation in the filename
-    # but on disk only the base `.pruned_init` exists — strip the suffix to match.
-    stripped = _LIBERO_PERTURBATION_SUFFIX_RE.sub("", filename.stem) + filename.suffix
-    init_states_path = root / task.problem_folder / stripped
-    return torch.load(init_states_path, weights_only=False)  # nosec B614
+def get_task_init_states(task_suite: Any, i: int) -> np.ndarray:
+    init_states_path = (
+        Path(get_libero_path("init_states"))
+        / task_suite.tasks[i].problem_folder
+        / task_suite.tasks[i].init_states_file
+    )
+    init_states = torch.load(init_states_path, weights_only=False)  # nosec B614
+    return init_states


 def get_libero_dummy_action():
@@ -126,11 +105,9 @@ class LiberoEnv(gym.Env):
        camera_name_mapping: dict[str, str] | None = None,
        num_steps_wait: int = 10,
        control_mode: str = "relative",
-        is_libero_plus: bool = False,
    ):
        super().__init__()
        self.task_id = task_id
-        self.is_libero_plus = is_libero_plus
        self.obs_type = obs_type
        self.render_mode = render_mode
        self.observation_width = observation_width
@@ -157,11 +134,7 @@ class LiberoEnv(gym.Env):
        self.episode_index = episode_index
        self.episode_length = episode_length
        # Load once and keep
-        self._init_states = (
-            get_task_init_states(task_suite, self.task_id, is_libero_plus=self.is_libero_plus)
-            if self.init_states
-            else None
-        )
+        self._init_states = get_task_init_states(task_suite, self.task_id) if self.init_states else None
        self._reset_stride = n_envs  # when performing a reset, append `_reset_stride` to `init_state_id`.

        self.init_state_id = self.episode_index  # tie each sub-env to a fixed init state
@@ -394,7 +367,6 @@ def _make_env_fns(
    gym_kwargs: Mapping[str, Any],
    control_mode: str,
    camera_name_mapping: dict[str, str] | None = None,
-    is_libero_plus: bool = False,
 ) -> list[Callable[[], LiberoEnv]]:
    """Build n_envs factory callables for a single (suite, task_id)."""

@@ -411,7 +383,6 @@ def _make_env_fns(
            n_envs=n_envs,
            control_mode=control_mode,
            camera_name_mapping=camera_name_mapping,
-            is_libero_plus=is_libero_plus,
            **local_kwargs,
        )

@@ -434,7 +405,6 @@ def create_libero_envs(
    control_mode: str = "relative",
    episode_length: int | None = None,
    camera_name_mapping: dict[str, str] | None = None,
-    is_libero_plus: bool = False,
 ) -> dict[str, dict[int, Any]]:
    """
    Create vectorized LIBERO environments with a consistent return shape.
@@ -493,7 +463,6 @@ def create_libero_envs(
                gym_kwargs=gym_kwargs,
                control_mode=control_mode,
                camera_name_mapping=camera_name_mapping,
-                is_libero_plus=is_libero_plus,
            )
            if is_async:
                lazy = _LazyAsyncVectorEnv(fns, cached_obs_space, cached_act_space, cached_metadata)
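Note on the LIBERO-plus filename handling removed from `get_task_init_states` above: perturbation variants were mapped back to the base `.pruned_init` file by stripping a suffix from the filename stem. The sketch below reuses the removed regular expression to show that mapping in isolation. The helper name `strip_perturbation_suffix` and the example filenames are hypothetical and were not taken from the LIBERO-plus data.

```python
import re
from pathlib import Path

# Same pattern as the removed ``_LIBERO_PERTURBATION_SUFFIX_RE``.
SUFFIX_RE = re.compile(r"_(?:language|view|light)_[^.]*|_(?:table|tb)_\d+")


def strip_perturbation_suffix(name: str) -> str:
    """Map a perturbation-variant filename to its base init-states filename."""
    path = Path(name)
    # Only the stem is rewritten; the extension is re-attached unchanged.
    return SUFFIX_RE.sub("", path.stem) + path.suffix


# Hypothetical filenames, for illustration only.
examples = [
    "pick_up_the_bowl_view_0_0_100.pruned_init",
    "pick_up_the_bowl_tb_2.pruned_init",
    "pick_up_the_bowl.pruned_init",
]
stripped = [strip_perturbation_suffix(name) for name in examples]
```

All three example names reduce to the same base file, and a name without a perturbation suffix passes through unchanged.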
--- a/src/lerobot/envs/robomme.py
+++ b/src/lerobot/envs/robomme.py
@@ -1,245 +0,0 @@
-"""RoboMME environment wrapper for LeRobot evaluation.
-
-Wraps the RoboMME ``BenchmarkEnvBuilder`` into a Gymnasium-compatible
-``VectorEnv`` suitable for ``lerobot_eval``.
-
-RoboMME tasks:
-  Counting:    BinFill, PickXtimes, SwingXtimes, StopCube
-  Permanence:  VideoUnmask, VideoUnmaskSwap, ButtonUnmask, ButtonUnmaskSwap
-  Reference:   PickHighlight, VideoRepick, VideoPlaceButton, VideoPlaceOrder
-  Imitation:   MoveCube, InsertPeg, PatternLock, RouteStick
-
-Dataset: lerobot/robomme (LeRobot v3.0, 1,600 episodes)
-Install: see docker/Dockerfile.benchmark.robomme  (Linux only — mani-skill vs numpy pin conflict)
-Benchmark: https://github.com/RoboMME/robomme_benchmark
-"""
-
-from __future__ import annotations
-
-from collections.abc import Callable, Sequence
-from functools import partial
-from typing import Any
-
-import gymnasium as gym
-import numpy as np
-from gymnasium import spaces
-
-from .utils import _LazyAsyncVectorEnv
-
-ROBOMME_TASKS = [
-    "BinFill",
-    "PickXtimes",
-    "SwingXtimes",
-    "StopCube",
-    "VideoUnmask",
-    "VideoUnmaskSwap",
-    "ButtonUnmask",
-    "ButtonUnmaskSwap",
-    "PickHighlight",
-    "VideoRepick",
-    "VideoPlaceButton",
-    "VideoPlaceOrder",
-    "MoveCube",
-    "InsertPeg",
-    "PatternLock",
-    "RouteStick",
-]
-
-
-class RoboMMEGymEnv(gym.Env):
-    """Thin Gymnasium wrapper around a single RoboMME episode env."""
-
-    metadata = {"render_modes": ["rgb_array"], "render_fps": 10}
-
-    def __init__(
-        self,
-        task: str = "PickXtimes",
-        action_space_type: str = "joint_angle",
-        dataset: str = "test",
-        episode_idx: int = 0,
-        max_steps: int = 300,
-    ):
-        super().__init__()
-        from robomme.env_record_wrapper import BenchmarkEnvBuilder
-
-        self._task = task
-        self._action_space_type = action_space_type
-        self._dataset = dataset
-        self._episode_idx = episode_idx
-        self._max_steps = max_steps
-        self._max_episode_steps = max_steps
-
-        self._builder = BenchmarkEnvBuilder(
-            env_id=task,
-            dataset=dataset,
-            action_space=action_space_type,
-            gui_render=False,
-            max_steps=max_steps,
-        )
-        self._env = None
-        self._last_raw_obs: dict | None = None
-
-        action_dim = 8 if action_space_type == "joint_angle" else 7
-        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(action_dim,), dtype=np.float32)
-        # `pixels` must be a nested Dict so `preprocess_observation()` in
-        # envs/utils.py picks it up and maps each camera to
-        # `observation.images.<cam>`. A flat layout (`pixels/image`,
-        # `pixels/wrist_image`) silently drops every image from the batch.
-        self.observation_space = spaces.Dict(
-            {
-                "pixels": spaces.Dict(
-                    {
-                        "image": spaces.Box(0, 255, shape=(256, 256, 3), dtype=np.uint8),
-                        "wrist_image": spaces.Box(0, 255, shape=(256, 256, 3), dtype=np.uint8),
-                    }
-                ),
-                "agent_pos": spaces.Box(-np.inf, np.inf, shape=(8,), dtype=np.float32),
-            }
-        )
-
-    def reset(self, *, seed=None, options=None):
-        super().reset(seed=seed)
-        self._env = self._builder.make_env_for_episode(
-            episode_idx=self._episode_idx,
-            max_steps=self._max_steps,
-        )
-        obs, info = self._env.reset()
-        self._last_raw_obs = obs
-        return self._convert_obs(obs), self._convert_info(info)
-
-    def step(self, action):
-        obs, reward, terminated, truncated, info = self._env.step(action)
-        self._last_raw_obs = obs
-
-        terminated_bool = bool(terminated.item()) if hasattr(terminated, "item") else bool(terminated)
-        truncated_bool = bool(truncated.item()) if hasattr(truncated, "item") else bool(truncated)
-
-        status = info.get("status", "ongoing")
-        is_success = status == "success"
-        conv_info = self._convert_info(info)
-        conv_info["is_success"] = is_success
-
-        return self._convert_obs(obs), float(reward), terminated_bool, truncated_bool, conv_info
-
-    def render(self) -> np.ndarray | None:
-        """Return the front camera image from the last observation for video recording."""
-        if self._last_raw_obs is None:
-            return np.zeros((256, 256, 3), dtype=np.uint8)
-        front = self._last_raw_obs.get("front_rgb_list")
-        if front is None:
-            return np.zeros((256, 256, 3), dtype=np.uint8)
-        frame = front[-1] if isinstance(front, list) else front
-        return np.asarray(frame, dtype=np.uint8)
-
-    def _convert_obs(self, obs: dict) -> dict:
-        front_rgb = (
-            obs["front_rgb_list"][-1] if isinstance(obs["front_rgb_list"], list) else obs["front_rgb_list"]
-        )
-        wrist_rgb = (
-            obs["wrist_rgb_list"][-1] if isinstance(obs["wrist_rgb_list"], list) else obs["wrist_rgb_list"]
-        )
-        joint_state = (
-            obs["joint_state_list"][-1]
-            if isinstance(obs["joint_state_list"], list)
-            else obs["joint_state_list"]
-        )
-        gripper_state = (
-            obs["gripper_state_list"][-1]
-            if isinstance(obs["gripper_state_list"], list)
-            else obs["gripper_state_list"]
-        )
-
-        front_rgb = np.asarray(front_rgb, dtype=np.uint8)
-        wrist_rgb = np.asarray(wrist_rgb, dtype=np.uint8)
-        joint = np.asarray(joint_state, dtype=np.float32).flatten()[:7]
-        gripper = np.asarray(gripper_state, dtype=np.float32).flatten()[:1]
-        state = np.concatenate([joint, gripper])
-
-        return {
-            "pixels": {"image": front_rgb, "wrist_image": wrist_rgb},
-            "agent_pos": state,
-        }
-
-    def _convert_info(self, info: dict) -> dict:
-        return {
-            "status": info.get("status", "ongoing"),
-            "task_goal": info.get("task_goal", ""),
-        }
-
-
-def _make_env_fns(
-    *,
-    task: str,
-    n_envs: int,
-    action_space_type: str,
-    dataset: str,
-    episode_length: int,
-    task_id: int,
-) -> list[Callable[[], RoboMMEGymEnv]]:
-    """Build n_envs factory callables for one RoboMME task id."""
-
-    def _make_one(episode_index: int) -> RoboMMEGymEnv:
-        return RoboMMEGymEnv(
-            task=task,
-            action_space_type=action_space_type,
-            dataset=dataset,
-            episode_idx=episode_index,
-            max_steps=episode_length,
-        )
-
-    return [partial(_make_one, task_id + i) for i in range(n_envs)]
-
-
-def create_robomme_envs(
-    task: str,
-    n_envs: int = 1,
-    action_space_type: str = "joint_angle",
-    dataset: str = "test",
-    episode_length: int = 300,
-    task_ids: list[int] | None = None,
-    env_cls: Callable[[Sequence[Callable[[], Any]]], Any] | None = None,
-) -> dict[str, dict[int, gym.vector.VectorEnv]]:
-    """Create vectorized RoboMME environments for evaluation.
-
-    `task` may be a single RoboMME task name (e.g. "PickXtimes") or a
-    comma-separated list (e.g. "PickXtimes,BinFill,StopCube"). Each task
-    becomes its own suite in the returned mapping.
-
-    Returns {suite_name: {task_id: VectorEnv}} matching lerobot's expected format.
-    """
-    if env_cls is None or not callable(env_cls):
-        raise ValueError("env_cls must be a callable that wraps a list of env factory callables.")
-    if not isinstance(n_envs, int) or n_envs <= 0:
-        raise ValueError(f"n_envs must be a positive int; got {n_envs}.")
-
-    if task_ids is None:
-        task_ids = [0]
-
-    task_names = [t.strip() for t in task.split(",") if t.strip()]
-    is_async = env_cls is gym.vector.AsyncVectorEnv
-    cached_obs_space: spaces.Space | None = None
-    cached_act_space: spaces.Space | None = None
-    cached_metadata: dict[str, Any] | None = None
-    out: dict[str, dict[int, gym.vector.VectorEnv]] = {}
-    for task_name in task_names:
-        envs_by_task: dict[int, gym.vector.VectorEnv] = {}
-        for task_id in task_ids:
-            fns = _make_env_fns(
-                task=task_name,
-                n_envs=n_envs,
-                action_space_type=action_space_type,
-                dataset=dataset,
-                episode_length=episode_length,
-                task_id=task_id,
-            )
-            if is_async:
-                lazy = _LazyAsyncVectorEnv(fns, cached_obs_space, cached_act_space, cached_metadata)
-                if cached_obs_space is None:
-                    cached_obs_space = lazy.observation_space
-                    cached_act_space = lazy.action_space
-                    cached_metadata = lazy.metadata
-                envs_by_task[task_id] = lazy
-            else:
-                envs_by_task[task_id] = env_cls(fns)
-        out[task_name] = envs_by_task
-    return out
--- a/src/lerobot/envs/vlabench.py
+++ b/src/lerobot/envs/vlabench.py
@@ -1,589 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""VLABench environment wrapper for LeRobot.
-
-VLABench is a large-scale benchmark for language-conditioned robotic manipulation
-with long-horizon reasoning, built on MuJoCo/dm_control.
-
- Paper: https://arxiv.org/abs/2412.18194
- GitHub: https://github.com/OpenMOSS/VLABench
- Website: https://vlabench.github.io
-"""
-
-from __future__ import annotations
-
-import contextlib
-import logging
-from collections import defaultdict
-from collections.abc import Callable, Sequence
-from typing import Any
-
-import cv2
-import gymnasium as gym
-import numpy as np
-from gymnasium import spaces
-from scipy.spatial.transform import Rotation
-
-from lerobot.types import RobotObservation
-
-from .utils import _LazyAsyncVectorEnv
-
-logger = logging.getLogger(__name__)
-
-ACTION_DIM = 7  # pos(3) + euler(3) + gripper(1)
-ACTION_LOW = np.array([-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 0.0], dtype=np.float32)
-ACTION_HIGH = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], dtype=np.float32)
-
-# Default max episode steps per task type
-DEFAULT_MAX_EPISODE_STEPS = 500
-
-# VLABench task suites
-PRIMITIVE_TASKS = [
-    "select_fruit",
-    "select_toy",
-    "select_chemistry_tube",
-    "add_condiment",
-    "select_book",
-    "select_painting",
-    "select_drink",
-    "insert_flower",
-    "select_billiards",
-    "select_ingredient",
-    "select_mahjong",
-    "select_poker",
-    # Physical series
-    "density_qa",
-    "friction_qa",
-    "magnetism_qa",
-    "reflection_qa",
-    "simple_cuestick_usage",
-    "simple_seesaw_usage",
-    "sound_speed_qa",
-    "thermal_expansion_qa",
-    "weight_qa",
-]
-
-COMPOSITE_TASKS = [
-    "cluster_billiards",
-    "cluster_book",
-    "cluster_drink",
-    "cluster_toy",
-    "cook_dishes",
-    "cool_drink",
-    "find_unseen_object",
-    "get_coffee",
-    "hammer_nail",
-    "heat_food",
-    "make_juice",
-    "play_mahjong",
-    "play_math_game",
-    "play_poker",
-    "play_snooker",
-    "rearrange_book",
-    "rearrange_chemistry_tube",
-    "set_dining_table",
-    "set_study_table",
-    "store_food",
-    "take_chemistry_experiment",
-    "use_seesaw_complex",
-]
-
-SUITE_TASKS: dict[str, list[str]] = {
-    "primitive": PRIMITIVE_TASKS,
-    "composite": COMPOSITE_TASKS,
-}
-
-
-class VLABenchEnv(gym.Env):
-    """Gymnasium wrapper for VLABench environments.
-
-    Wraps the dm_control-based VLABench simulator behind a standard gym.Env interface.
-    Supports multiple cameras (front, second, wrist) and end-effector control.
-    """
-
-    metadata = {"render_modes": ["rgb_array"], "render_fps": 10}
-
-    def __init__(
-        self,
-        task: str = "select_fruit",
-        obs_type: str = "pixels_agent_pos",
-        render_mode: str = "rgb_array",
-        render_resolution: tuple[int, int] = (480, 480),
-        robot: str = "franka",
-        max_episode_steps: int = DEFAULT_MAX_EPISODE_STEPS,
-        action_mode: str = "eef",
-    ):
-        super().__init__()
-        self.task = task
-        self.obs_type = obs_type
-        self.render_mode = render_mode
-        self.render_resolution = render_resolution
-        self.robot = robot
-        self._max_episode_steps = max_episode_steps
-        self.action_mode = action_mode
-
-        # Deferred — created on first reset() inside worker subprocess to avoid
-        # inheriting stale GPU/EGL contexts when AsyncVectorEnv spawns workers.
-        # We never cache `env.physics`: dm_control exposes it as a weakref
-        # proxy that goes stale across resets (rebuilds the sim), so we always
-        # refetch it via `self._env.physics` at the call site.
-        self._env = None
-        self.task_description = ""  # populated on first reset
-        # Cached world-frame XYZ of the robot base link. The VLABench datasets
-        # log both `observation.state` positions and `actions` positions in
-        # robot-base frame (see VLABench/scripts/convert_to_lerobot.py which
-        # subtracts `robot_frame_pos` from ee_pos). The robot is attached at a
-        # fixed offset per task so this is safe to cache once per env build.
-        self._robot_base_xyz: np.ndarray | None = None
-
-        h, w = self.render_resolution
-
-        if self.obs_type == "state":
-            raise NotImplementedError(
-                "The 'state' observation type is not supported in VLABenchEnv. "
-                "Please use 'pixels' or 'pixels_agent_pos'."
-            )
-        elif self.obs_type == "pixels":
-            self.observation_space = spaces.Dict(
-                {
-                    "pixels": spaces.Dict(
-                        {
-                            "image": spaces.Box(low=0, high=255, shape=(h, w, 3), dtype=np.uint8),
-                            "second_image": spaces.Box(low=0, high=255, shape=(h, w, 3), dtype=np.uint8),
-                            "wrist_image": spaces.Box(low=0, high=255, shape=(h, w, 3), dtype=np.uint8),
-                        }
-                    ),
-                }
-            )
-        elif self.obs_type == "pixels_agent_pos":
-            self.observation_space = spaces.Dict(
-                {
-                    "pixels": spaces.Dict(
-                        {
-                            "image": spaces.Box(low=0, high=255, shape=(h, w, 3), dtype=np.uint8),
-                            "second_image": spaces.Box(low=0, high=255, shape=(h, w, 3), dtype=np.uint8),
-                            "wrist_image": spaces.Box(low=0, high=255, shape=(h, w, 3), dtype=np.uint8),
-                        }
-                    ),
-                    "agent_pos": spaces.Box(low=-np.inf, high=np.inf, shape=(7,), dtype=np.float64),
-                }
-            )
-        else:
-            raise ValueError(f"Unsupported obs_type: {self.obs_type}")
-
-        self.action_space = spaces.Box(low=ACTION_LOW, high=ACTION_HIGH, dtype=np.float32)
-
-    # Max attempts to rebuild the underlying env when MuJoCo throws
-    # `PhysicsError` (e.g. mjWARN_BADQACC) during VLABench's 20-step
-    # reset warm-up. Some random task/layout samples land in unstable
-    # initial configurations; re-sampling the layout almost always
-    # gives a stable one. A handful of upstream tasks (notably
-    # `select_mahjong`) have layout samplers that diverge often enough
-    # to need >>5 retries, so we pick a generous ceiling.
-    _ENSURE_ENV_MAX_ATTEMPTS = 20
-
-    def _ensure_env(self) -> None:
-        """Create the underlying VLABench env on first use.
-
-        Called inside the worker subprocess after fork(), so each worker gets
-        its own clean rendering context rather than inheriting a stale one from
-        the parent process (which causes crashes with AsyncVectorEnv).
-
-        Retries on `PhysicsError`: VLABench's `LM4ManipDMEnv.reset()` runs 20
-        warm-up `step()` calls while toggling gravity/fluids to let the scene
-        settle; for some random layouts MuJoCo's integrator diverges and
-        raises `mjWARN_BADQACC`. Re-sampling the layout almost always yields
-        a stable one, so we retry a number of times before giving up. Between
-        attempts we reseed NumPy's global RNG from OS entropy so the upstream
-        task sampler explores fresh initial states — without this, retries
-        can replay the same diverging configuration when the sampler is
-        deterministic given the current RNG state.
-        """
-        if self._env is not None:
-            return
-
-        import VLABench.robots  # noqa: F401  # type: ignore[import-untyped]
-        import VLABench.tasks  # noqa: F401  # type: ignore[import-untyped]
-        from dm_control.rl.control import PhysicsError  # type: ignore[import-untyped]
-        from VLABench.envs import load_env  # type: ignore[import-untyped]
-
-        h, w = self.render_resolution
-        last_exc: PhysicsError | None = None
-        for attempt in range(1, self._ENSURE_ENV_MAX_ATTEMPTS + 1):
-            try:
-                env = load_env(task=self.task, robot=self.robot, render_resolution=(h, w))
-                self._env = env
-                break
-            except PhysicsError as exc:
-                last_exc = exc
-                logger.warning(
-                    "PhysicsError on attempt %d/%d while building task '%s': %s. Retrying with fresh layout…",
-                    attempt,
-                    self._ENSURE_ENV_MAX_ATTEMPTS,
-                    self.task,
-                    exc,
-                )
-                np.random.seed(None)
-        if self._env is None:
-            assert last_exc is not None
-            raise RuntimeError(
-                f"VLABench task '{self.task}' failed to produce a stable "
-                f"initial layout after {self._ENSURE_ENV_MAX_ATTEMPTS} "
-                f"attempts. This task's upstream sampler diverges too "
-                f"often for the configured robot; consider removing it "
-                f"from the eval set. Last physics error: {last_exc}"
-            ) from last_exc
-
-        # Extract task description from the dm_control task
-        task_obj = self._env.task
-        if hasattr(task_obj, "task_description"):
-            self.task_description = task_obj.task_description
-        elif hasattr(task_obj, "language_instruction"):
-            self.task_description = task_obj.language_instruction
-        else:
-            self.task_description = self.task
-
-        # Cache robot base world position so `_build_ctrl_from_action` and
-        # `_get_obs` can translate between robot-frame (dataset) and
-        # world-frame (dm_control) without hitting physics every call.
-        try:
-            self._robot_base_xyz = np.asarray(self._env.get_robot_frame_position(), dtype=np.float64).reshape(
-                3
-            )
-        except Exception:
-            # Fallback to VLABench's default Franka base position.
-            self._robot_base_xyz = np.array([0.0, -0.4, 0.78], dtype=np.float64)
-
-    def _get_obs(self) -> dict:
-        """Get current observation from the environment."""
-        assert self._env is not None
-
-        obs = self._env.get_observation()
-        h, w = self.render_resolution
-
-        def _to_hwc3(arr: np.ndarray) -> np.ndarray:
-            """Coerce any camera array to the declared (h, w, 3) uint8 shape."""
-            a = np.asarray(arr)
-            # Drop a leading singleton batch dim if present.
-            while a.ndim > 3 and a.shape[0] == 1:
-                a = a[0]
-            if a.ndim == 3 and a.shape[0] in (1, 3, 4) and a.shape[-1] not in (1, 3, 4):
-                # CHW → HWC
-                a = np.transpose(a, (1, 2, 0))
-            if a.ndim == 2:
-                a = np.stack([a] * 3, axis=-1)
-            if a.ndim != 3:
-                return np.zeros((h, w, 3), dtype=np.uint8)
-            # Force 3 channels.
-            if a.shape[-1] == 1:
-                a = np.repeat(a, 3, axis=-1)
-            elif a.shape[-1] == 4:
-                a = a[..., :3]
-            elif a.shape[-1] != 3:
-                return np.zeros((h, w, 3), dtype=np.uint8)
-            if a.shape[:2] != (h, w):
-                a = cv2.resize(a, (w, h), interpolation=cv2.INTER_AREA)
-            return a.astype(np.uint8)
-
-        # Extract camera images — VLABench returns (n_cameras, C, H, W) or individual arrays
-        raw_frames: list[np.ndarray] = []
-        if "rgb" in obs:
-            rgb = obs["rgb"]
-            if isinstance(rgb, np.ndarray):
-                if rgb.ndim == 4:
-                    raw_frames = [rgb[i] for i in range(rgb.shape[0])]
-                elif rgb.ndim == 3:
-                    raw_frames = [rgb]
-
-        image_keys = ["image", "second_image", "wrist_image"]
-        images: dict[str, np.ndarray] = {}
-        for i, key in enumerate(image_keys):
-            if i < len(raw_frames):
-                images[key] = _to_hwc3(raw_frames[i])
-            else:
-                images[key] = np.zeros((h, w, 3), dtype=np.uint8)
-
-        # Convert VLABench's raw ee_state `[pos_world(3), quat_wxyz(4), open(1)]`
-        # to the dataset's observation.state layout `[pos_robot(3), euler_xyz(3),
-        # gripper(1)]`. See VLABench/scripts/convert_to_lerobot.py — positions
-        # are stored in robot-base frame and orientations as scipy extrinsic
-        # 'xyz' euler angles.
-        raw = np.asarray(obs.get("ee_state", np.zeros(8)), dtype=np.float64).ravel()
-        pos_world = raw[:3] if raw.size >= 3 else np.zeros(3, dtype=np.float64)
-        quat_wxyz = raw[3:7] if raw.size >= 7 else np.array([1.0, 0.0, 0.0, 0.0], dtype=np.float64)
-        gripper = float(raw[7]) if raw.size >= 8 else 0.0
-
-        base = self._robot_base_xyz if self._robot_base_xyz is not None else np.zeros(3, dtype=np.float64)
-        pos_robot = pos_world - base
-        euler_xyz = Rotation.from_quat([quat_wxyz[1], quat_wxyz[2], quat_wxyz[3], quat_wxyz[0]]).as_euler(
-            "xyz", degrees=False
-        )
-
-        ee_state = np.concatenate([pos_robot, euler_xyz, [gripper]]).astype(np.float64)
-
-        if self.obs_type == "pixels":
-            return {"pixels": images}
-        elif self.obs_type == "pixels_agent_pos":
-            return {
-                "pixels": images,
-                "agent_pos": ee_state.astype(np.float64),
-            }
-        else:
-            raise ValueError(f"Unknown obs_type: {self.obs_type}")
-
-    # ---- Action adaptation (EEF → joint ctrl) --------------------------------
-    #
-    # The HF vlabench datasets log 7D actions
-    # `[x, y, z (robot frame), rx, ry, rz (scipy extrinsic xyz), gripper]`,
-    # exactly matching VLABench's own eval pipeline (evaluator.base):
-    #   pos, euler, g = policy(...)
-    #   quat = euler_to_quaternion(*euler)      # extrinsic xyz -> wxyz
-    #   _, qpos = robot.get_qpos_from_ee_pos(physics, pos=pos + base, quat=quat)
-    #   env.step(np.concatenate([qpos, [g, g]]))
-    #
-    # VLABench's dm_control task writes `data.ctrl[:] = action` directly — for
-    # Franka that's 9 entries (7 arm joints + 2 gripper fingers). We mirror the
-    # above conversion so the policy's EEF commands actually drive the robot.
-
-    _FRANKA_FINGER_OPEN = 0.04  # qpos when gripper fully open
-
-    def _build_ctrl_from_action(self, action: np.ndarray, ctrl_dim: int) -> np.ndarray:
-        """Convert a 7D EEF action into the `ctrl_dim`-sized joint command vector.
-
-        For the Franka default (ctrl_dim=9): 7 arm joint qposes (via IK) +
-        2 gripper finger qposes (open/closed based on the gripper scalar).
-        If the action is already joint-space (shape matches ctrl_dim), pass
-        through.
-        """
-        if action.shape[0] == ctrl_dim:
-            return action.astype(np.float64, copy=False)
-
-        if action.shape[0] != 7:
-            # Unknown layout — fall back to zero-pad so the sim doesn't crash.
-            padded = np.zeros(ctrl_dim, dtype=np.float64)
-            padded[: min(action.shape[0], ctrl_dim)] = action[:ctrl_dim]
-            return padded
-
-        from dm_control.utils.inverse_kinematics import qpos_from_site_pose
-
-        # Action position is in robot-base frame (see convert_to_lerobot.py);
-        # dm_control's IK expects a world-frame target.
-        base = self._robot_base_xyz if self._robot_base_xyz is not None else np.zeros(3, dtype=np.float64)
-        pos_world = np.asarray(action[:3], dtype=np.float64) + base
-        rx, ry, rz = float(action[3]), float(action[4]), float(action[5])
-        gripper = float(np.clip(action[6], 0.0, 1.0))
-
-        # Dataset euler is scipy extrinsic 'xyz' (same as VLABench's
-        # `euler_to_quaternion`). scipy emits `[x, y, z, w]`; dm_control's IK
-        # and MuJoCo use `[w, x, y, z]`, so reorder.
-        qxyzw = Rotation.from_euler("xyz", [rx, ry, rz], degrees=False).as_quat()
-        quat = np.array([qxyzw[3], qxyzw[0], qxyzw[1], qxyzw[2]], dtype=np.float64)
-
-        assert self._env is not None
-        robot = self._env.task.robot
-        site_name = robot.end_effector_site.full_identifier
-
-        # inplace=False so IK doesn't mutate physics state mid-step — we only
-        # want the solved qpos. Fetch a fresh physics handle — caching it can
-        # yield a stale weakref after a reset.
-        ik_result = qpos_from_site_pose(
-            self._env.physics,
-            site_name=site_name,
-            target_pos=pos_world,
-            target_quat=quat,
-            inplace=False,
-            max_steps=100,
-        )
-        n_dof = robot.n_dof  # 7 for Franka
-        arm_qpos = ik_result.qpos[:n_dof]
-
-        # Dataset gripper convention: 1 = open (finger qpos = 0.04),
-        # 0 = closed (finger qpos = 0.0). See VLABench/scripts/convert_to_lerobot.py
-        # where `trajectory[i][-1] > 0.03` is encoded as `1`.
-        finger_qpos = gripper * self._FRANKA_FINGER_OPEN
-
-        ctrl = np.zeros(ctrl_dim, dtype=np.float64)
-        ctrl[:n_dof] = arm_qpos
-        # Remaining entries are gripper fingers (usually 2 for Franka).
-        ctrl[n_dof:] = finger_qpos
-        return ctrl
-
-    def reset(self, seed=None, **kwargs) -> tuple[RobotObservation, dict[str, Any]]:
-        self._ensure_env()
-        assert self._env is not None
-        super().reset(seed=seed)
-
-        if seed is not None:
-            self._seed_inner_env(int(self.np_random.integers(0, 2**31 - 1)))
-
-        self._env.reset()
-
-        observation = self._get_obs()
-        info = {"is_success": False}
-        return observation, info
-
-    def _seed_inner_env(self, seed: int) -> None:
-        """Propagate `seed` to the inner dm_control env. `Environment.reset()`
-        doesn't accept a seed, so we re-seed the task and environment
-        `RandomState`s directly. Best-effort: silently skipped when the
-        expected attributes are absent on a given VLABench version.
-        """
-        for owner_attr, rng_attr in (("task", "random"), (None, "_random_state")):
-            owner = getattr(self._env, owner_attr) if owner_attr else self._env
-            rng = getattr(owner, rng_attr, None)
-            rng_seed = getattr(rng, "seed", None)
-            if callable(rng_seed):
-                rng_seed(seed)
-
-    def step(self, action: np.ndarray) -> tuple[RobotObservation, float, bool, bool, dict[str, Any]]:
-        from dm_control.rl.control import PhysicsError  # type: ignore[import-untyped]
-
-        self._ensure_env()
-        assert self._env is not None
-
-        if action.ndim != 1:
-            raise ValueError(
-                f"Expected action to be 1-D (shape (action_dim,)), "
-                f"but got shape {action.shape} with ndim={action.ndim}"
-            )
-
-        if self.action_mode not in ("eef", "joint", "delta_eef"):
-            raise ValueError(f"Unknown action_mode: {self.action_mode}")
-
-        # Always refetch physics — dm_control returns a weakref proxy that can
-        # go stale across resets.
-        physics = self._env.physics
-        ctrl_dim = int(physics.data.ctrl.shape[0])
-        ctrl = self._build_ctrl_from_action(action, ctrl_dim)
-        try:
-            timestep = self._env.step(ctrl)
-        except PhysicsError as exc:
-            # Physics integrator diverged (e.g. mjWARN_BADQACC). Treat it as
-            # a graceful failed termination rather than a hard crash — the
-            # rest of the multi-task eval should still run.
-            logger.warning(
-                "PhysicsError during step on task '%s': %s. Terminating episode.",
-                self.task,
-                exc,
-            )
-            observation = self._get_obs()
-            info = {"task": self.task, "is_success": False, "physics_error": True}
-            # Drop the stale env so the next reset() rebuilds it cleanly.
-            with contextlib.suppress(Exception):
-                self._env.close()
-            self._env = None
-            return observation, 0.0, True, False, info
-
-        # Extract reward from dm_control timestep
-        reward = float(timestep.reward) if timestep.reward is not None else 0.0
-
-        # Check success via the task's termination condition
-        is_success = False
-        if hasattr(self._env, "task") and hasattr(self._env.task, "should_terminate_episode"):
-            is_success = bool(self._env.task.should_terminate_episode(self._env.physics))
-
-        terminated = is_success
-        truncated = False
-        info = {
-            "task": self.task,
-            "is_success": is_success,
-        }
-
-        observation = self._get_obs()
-
-        if terminated:
-            self.reset()
-
-        return observation, reward, terminated, truncated, info
-
-    def render(self) -> np.ndarray:
-        self._ensure_env()
-        obs = self._get_obs()
-        return obs["pixels"]["image"]
-
-    def close(self):
-        if self._env is not None:
-            self._env.close()
-            self._env = None
-
-
-# ---- Main API ----------------------------------------------------------------
-
-
-def create_vlabench_envs(
-    task: str,
-    n_envs: int,
-    gym_kwargs: dict[str, Any] | None = None,
-    env_cls: Callable[[Sequence[Callable[[], Any]]], Any] | None = None,
-) -> dict[str, dict[int, Any]]:
-    """
-    Create vectorized VLABench environments with a consistent return shape.
-
-    Returns:
-        dict[suite_name][task_id] -> vec_env (env_cls([...]) with exactly n_envs factories)
-
-    Notes:
-        - n_envs is the number of rollouts *per task*.
-        - `task` can be a suite name ("primitive", "composite"), a comma-separated list of
-          suite names, or individual task names (e.g. "select_fruit,heat_food").
-    """
-    if env_cls is None or not callable(env_cls):
-        raise ValueError("env_cls must be a callable that wraps a list of environment factory callables.")
-    if not isinstance(n_envs, int) or n_envs <= 0:
-        raise ValueError(f"n_envs must be a positive int; got {n_envs}.")
-
-    gym_kwargs = dict(gym_kwargs or {})
-    task_groups = [t.strip() for t in task.split(",") if t.strip()]
-    if not task_groups:
-        raise ValueError("`task` must contain at least one VLABench task or suite name.")
-
-    logger.info(
-        "Creating VLABench envs | task_groups=%s | n_envs(per task)=%d",
-        task_groups,
-        n_envs,
-    )
-
-    is_async = env_cls is gym.vector.AsyncVectorEnv
-    cached_obs_space = None
-    cached_act_space = None
-    cached_metadata = None
-    out: dict[str, dict[int, Any]] = defaultdict(dict)
-
-    for group in task_groups:
-        # Check if it's a suite name, otherwise treat as individual task
-        tasks = SUITE_TASKS.get(group, [group])
-
-        for tid, task_name in enumerate(tasks):
-            logger.info(
-                "Building vec env | group=%s | task_id=%d | task=%s",
-                group,
-                tid,
-                task_name,
-            )
-
-            fns = [(lambda tn=task_name: VLABenchEnv(task=tn, **gym_kwargs)) for _ in range(n_envs)]
-
-            if is_async:
-                lazy = _LazyAsyncVectorEnv(fns, cached_obs_space, cached_act_space, cached_metadata)
-                if cached_obs_space is None:
-                    cached_obs_space = lazy.observation_space
-                    cached_act_space = lazy.action_space
-                    cached_metadata = lazy.metadata
-                out[group][tid] = lazy
-            else:
-                out[group][tid] = env_cls(fns)
-
-    return {group: dict(task_map) for group, task_map in out.items()}
--- a/src/lerobot/processor/__init__.py
+++ b/src/lerobot/processor/__init__.py
@@ -93,7 +93,6 @@ from .relative_action_processor import (
    to_relative_actions,
 )
 from .rename_processor import RenameObservationsProcessorStep, rename_stats
-from .render_messages_processor import RenderMessagesStep
 from .tokenizer_processor import ActionTokenizerProcessorStep, TokenizerProcessorStep

 __all__ = [
@@ -129,7 +128,6 @@ __all__ = [
    "make_default_robot_observation_processor",
    "AbsoluteActionsProcessorStep",
    "RelativeActionsProcessorStep",
-    "RenderMessagesStep",
    "MapDeltaActionToRobotActionStep",
    "MapTensorToDeltaActionDictStep",
    "NewLineTaskProcessorStep",
--- a/src/lerobot/processor/batch_processor.py
+++ b/src/lerobot/processor/batch_processor.py
@@ -174,24 +174,6 @@ class AddBatchDimensionComplementaryDataStep(ComplementaryDataProcessorStep):
            task_index_value = complementary_data["task_index"]
            if isinstance(task_index_value, Tensor) and task_index_value.dim() == 0:
                complementary_data["task_index"] = task_index_value.unsqueeze(0)
-
-        complementary_data.pop("language_persistent", None)
-        complementary_data.pop("language_events", None)
-
-        if "messages" in complementary_data:
-            messages = complementary_data["messages"]
-            if isinstance(messages, list) and (not messages or isinstance(messages[0], dict)):
-                complementary_data["messages"] = [messages]
-
-        if "message_streams" in complementary_data:
-            streams = complementary_data["message_streams"]
-            if isinstance(streams, list) and (not streams or isinstance(streams[0], str)):
-                complementary_data["message_streams"] = [streams]
-
-        if "target_message_indices" in complementary_data:
-            indices = complementary_data["target_message_indices"]
-            if isinstance(indices, list) and (not indices or isinstance(indices[0], int)):
-                complementary_data["target_message_indices"] = [indices]
        return complementary_data

    def transform_features(
--- a/src/lerobot/processor/converters.py
+++ b/src/lerobot/processor/converters.py
@@ -167,35 +167,12 @@ def _extract_complementary_data(batch: dict[str, Any]) -> dict[str, Any]:
    """
    pad_keys = {k: v for k, v in batch.items() if "_is_pad" in k}
    task_key = {"task": batch["task"]} if "task" in batch else {}
+    subtask_key = {"subtask": batch["subtask"]} if "subtask" in batch else {}
    index_key = {"index": batch["index"]} if "index" in batch else {}
    task_index_key = {"task_index": batch["task_index"]} if "task_index" in batch else {}
    episode_index_key = {"episode_index": batch["episode_index"]} if "episode_index" in batch else {}
-    timestamp_key = {"timestamp": batch["timestamp"]} if "timestamp" in batch else {}
-    language_persistent_key = (
-        {"language_persistent": batch["language_persistent"]} if "language_persistent" in batch else {}
-    )
-    language_events_key = {"language_events": batch["language_events"]} if "language_events" in batch else {}
-    messages_key = {"messages": batch["messages"]} if "messages" in batch else {}
-    message_streams_key = {"message_streams": batch["message_streams"]} if "message_streams" in batch else {}
-    target_message_indices_key = (
-        {"target_message_indices": batch["target_message_indices"]}
-        if "target_message_indices" in batch
-        else {}
-    )

-    return {
-        **pad_keys,
-        **task_key,
-        **index_key,
-        **task_index_key,
-        **episode_index_key,
-        **timestamp_key,
-        **language_persistent_key,
-        **language_events_key,
-        **messages_key,
-        **message_streams_key,
-        **target_message_indices_key,
-    }
+    return {**pad_keys, **task_key, **subtask_key, **index_key, **task_index_key, **episode_index_key}


 def create_transition(
--- a/src/lerobot/processor/render_messages_processor.py
+++ b/src/lerobot/processor/render_messages_processor.py
@@ -1,92 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from __future__ import annotations
-
-from dataclasses import dataclass
-from typing import Any
-
-from lerobot.configs import PipelineFeatureType, PolicyFeature
-from lerobot.configs.recipe import TrainingRecipe
-from lerobot.datasets.language import LANGUAGE_EVENTS, LANGUAGE_PERSISTENT
-from lerobot.datasets.language_render import render_sample
-from lerobot.types import EnvTransition, TransitionKey
-
-from .pipeline import ProcessorStep, ProcessorStepRegistry
-
-
-@dataclass
-@ProcessorStepRegistry.register(name="render_messages_processor")
-class RenderMessagesStep(ProcessorStep):
-    """Processor step that turns raw language columns into rendered chat messages.
-
-    Reads ``language_persistent`` and ``language_events`` from the transition's
-    complementary data, renders them through ``recipe`` at the sample timestamp,
-    and replaces the raw columns with the resulting ``messages`` /
-    ``message_streams`` / ``target_message_indices`` keys.
-    """
-
-    recipe: TrainingRecipe
-    dataset_ctx: Any | None = None
-
-    def __call__(self, transition: EnvTransition) -> EnvTransition | None:
-        """Render messages for a single transition; return ``None`` to drop it."""
-        complementary_data = transition.get(TransitionKey.COMPLEMENTARY_DATA) or {}
-        persistent = complementary_data.get(LANGUAGE_PERSISTENT) or []
-        events = complementary_data.get(LANGUAGE_EVENTS) or []
-
-        if not persistent and not events:
-            return transition
-
-        timestamp = complementary_data.get("timestamp")
-        if timestamp is None:
-            raise KeyError("RenderMessagesStep requires sample timestamp in complementary data.")
-
-        sample_idx = complementary_data.get("index", 0)
-        rendered = render_sample(
-            recipe=self.recipe,
-            persistent=persistent,
-            events=events,
-            t=_scalar(timestamp),
-            sample_idx=int(_scalar(sample_idx)),
-            task=complementary_data.get("task"),
-            dataset_ctx=self.dataset_ctx,
-        )
-        if rendered is None:
-            return None
-
-        new_transition = transition.copy()
-        new_complementary_data = dict(complementary_data)
-        new_complementary_data.pop(LANGUAGE_PERSISTENT, None)
-        new_complementary_data.pop(LANGUAGE_EVENTS, None)
-        new_complementary_data.update(rendered)
-        new_transition[TransitionKey.COMPLEMENTARY_DATA] = new_complementary_data
-        return new_transition
-
-    def transform_features(
-        self, features: dict[PipelineFeatureType, dict[str, PolicyFeature]]
-    ) -> dict[PipelineFeatureType, dict[str, PolicyFeature]]:
-        """Pass features through unchanged; rendering only touches complementary data."""
-        return features
-
-
-def _scalar(value: Any) -> float | int:
-    """Unwrap a tensor/array/single-element list into a Python scalar."""
-    if hasattr(value, "item"):
-        return value.item()
-    if isinstance(value, list) and len(value) == 1:
-        return _scalar(value[0])
-    return value
--- a/src/lerobot/scripts/lerobot_train.py
+++ b/src/lerobot/scripts/lerobot_train.py
@@ -47,7 +47,6 @@ from lerobot.datasets import EpisodeAwareSampler, make_dataset
 from lerobot.envs import close_envs, make_env, make_env_pre_post_processors
 from lerobot.optim.factory import make_optimizer_and_scheduler
 from lerobot.policies import PreTrainedPolicy, make_policy, make_pre_post_processors
-from lerobot.utils.collate import lerobot_collate_fn
 from lerobot.utils.import_utils import register_third_party_plugins
 from lerobot.utils.logging_utils import AverageMeter, MetricsTracker
 from lerobot.utils.random_utils import set_seed
@@ -387,7 +386,6 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
        sampler=sampler,
        pin_memory=device.type == "cuda",
        drop_last=False,
-        collate_fn=lerobot_collate_fn,
        prefetch_factor=cfg.prefetch_factor if cfg.num_workers > 0 else None,
        persistent_workers=cfg.persistent_workers and cfg.num_workers > 0,
    )
--- a/src/lerobot/utils/collate.py
+++ b/src/lerobot/utils/collate.py
@@ -1,54 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from __future__ import annotations
-
-from typing import Any
-
-from torch.utils.data._utils.collate import default_collate
-
-from lerobot.datasets.language import LANGUAGE_COLUMNS
-
-_PYTHON_LIST_KEYS = {"messages", "message_streams", "target_message_indices"}
-
-
-def lerobot_collate_fn(batch: list[dict[str, Any] | None]) -> dict[str, Any] | None:
-    """Collate function that preserves Python-list and language fields as lists.
-
-    Drops ``None`` samples (e.g. recipes that yielded no target message), keeps
-    rendered-message and language fields as plain Python lists, and delegates
-    every other key to PyTorch's ``default_collate``.
-    """
-    batch = [sample for sample in batch if sample is not None]
-    if not batch:
-        return None
-
-    preserved = {
-        key: [sample[key] for sample in batch if key in sample]
-        for key in _PYTHON_LIST_KEYS
-        if any(key in sample for sample in batch)
-    }
-    tensorizable = [
-        {
-            key: value
-            for key, value in sample.items()
-            if key not in _PYTHON_LIST_KEYS and key not in LANGUAGE_COLUMNS
-        }
-        for sample in batch
-    ]
-    collated = default_collate(tensorizable)
-    collated.update(preserved)
-    return collated
--- a/tests/configs/test_recipe.py
+++ b/tests/configs/test_recipe.py
@@ -1,32 +0,0 @@
-#!/usr/bin/env python
-
-from pathlib import Path
-
-import pytest
-
-from lerobot.configs.recipe import MessageTurn, TrainingRecipe
-
-
-def test_message_recipe_validates_unknown_binding():
-    with pytest.raises(ValueError, match="unknown binding"):
-        TrainingRecipe(
-            messages=[
-                MessageTurn(role="user", content="${missing}", stream="high_level"),
-                MessageTurn(role="assistant", content="ok", stream="high_level", target=True),
-            ]
-        )
-
-
-def test_canonical_recipe_loads():
-    recipe = TrainingRecipe.from_yaml(Path("src/lerobot/configs/recipes/pi05_hirobot.yaml"))
-
-    assert recipe.blend is not None
-    assert set(recipe.blend) == {
-        "memory_update",
-        "user_interjection_response",
-        "high_level_subtask",
-        "low_level_execution",
-        "ask_vqa_top",
-        "ask_vqa_wrist",
-    }
-    assert sum(component.weight for component in recipe.blend.values()) == pytest.approx(0.96)
--- a/tests/datasets/test_language.py
+++ b/tests/datasets/test_language.py
@@ -1,152 +0,0 @@
-#!/usr/bin/env python
-
-import numpy as np
-import pandas as pd
-import pyarrow as pa
-import pytest
-
-from lerobot.datasets import LeRobotDataset
-from lerobot.datasets.io_utils import write_info
-from lerobot.datasets.language import (
-    EVENT_ONLY_STYLES,
-    LANGUAGE_EVENTS,
-    LANGUAGE_PERSISTENT,
-    PERSISTENT_STYLES,
-    STYLE_REGISTRY,
-    VIEW_DEPENDENT_STYLES,
-    column_for_style,
-    is_view_dependent_style,
-    language_events_arrow_type,
-    language_feature_info,
-    language_persistent_arrow_type,
-    validate_camera_field,
-)
-from lerobot.datasets.utils import DEFAULT_DATA_PATH
-
-
-def test_language_arrow_schema_has_expected_fields():
-    persistent_row_type = language_persistent_arrow_type().value_type
-    event_row_type = language_events_arrow_type().value_type
-
-    assert isinstance(persistent_row_type, pa.StructType)
-    assert persistent_row_type.names == [
-        "role",
-        "content",
-        "style",
-        "timestamp",
-        "camera",
-        "tool_calls",
-    ]
-
-    assert isinstance(event_row_type, pa.StructType)
-    assert event_row_type.names == ["role", "content", "style", "camera", "tool_calls"]
-
-
-def test_style_registry_routes_columns():
-    assert {"subtask", "plan", "memory", "motion", "task_aug"} == PERSISTENT_STYLES
-    assert {"interjection", "vqa", "trace"} == EVENT_ONLY_STYLES
-    assert PERSISTENT_STYLES | EVENT_ONLY_STYLES <= STYLE_REGISTRY
-
-    assert column_for_style("subtask") == LANGUAGE_PERSISTENT
-    assert column_for_style("plan") == LANGUAGE_PERSISTENT
-    assert column_for_style("memory") == LANGUAGE_PERSISTENT
-    assert column_for_style("motion") == LANGUAGE_PERSISTENT
-    assert column_for_style("task_aug") == LANGUAGE_PERSISTENT
-    assert column_for_style("interjection") == LANGUAGE_EVENTS
-    assert column_for_style("vqa") == LANGUAGE_EVENTS
-    assert column_for_style("trace") == LANGUAGE_EVENTS
-    assert column_for_style(None) == LANGUAGE_EVENTS
-
-
-def test_view_dependent_styles():
-    # motion lives in PERSISTENT_STYLES and is described in robot-frame
-    # (joint / Cartesian) terms, so it is NOT view-dependent. Only vqa
-    # (event) and trace (event, pixel-trajectory) carry a camera tag.
-    assert {"vqa", "trace"} == VIEW_DEPENDENT_STYLES
-    assert is_view_dependent_style("vqa")
-    assert is_view_dependent_style("trace")
-    assert not is_view_dependent_style("motion")
-    assert not is_view_dependent_style("subtask")
-    assert not is_view_dependent_style("plan")
-    assert not is_view_dependent_style("interjection")
-    assert not is_view_dependent_style(None)
-
-
-def test_validate_camera_field_requires_camera_for_view_dependent_styles():
-    validate_camera_field("vqa", "observation.images.top")
-    validate_camera_field("trace", "observation.images.front")
-    with pytest.raises(ValueError, match="view-dependent"):
-        validate_camera_field("vqa", None)
-    with pytest.raises(ValueError, match="view-dependent"):
-        validate_camera_field("trace", "")
-
-
-def test_validate_camera_field_rejects_camera_on_non_view_dependent_styles():
-    validate_camera_field("subtask", None)
-    validate_camera_field("plan", None)
-    validate_camera_field("memory", None)
-    validate_camera_field("motion", None)
-    validate_camera_field("interjection", None)
-    validate_camera_field(None, None)
-    with pytest.raises(ValueError, match="must have camera=None"):
-        validate_camera_field("subtask", "observation.images.top")
-    with pytest.raises(ValueError, match="must have camera=None"):
-        validate_camera_field("motion", "observation.images.top")
-    with pytest.raises(ValueError, match="must have camera=None"):
-        validate_camera_field("interjection", "observation.images.top")
-    with pytest.raises(ValueError, match="must have camera=None"):
-        validate_camera_field(None, "observation.images.top")
-
-
-def test_unknown_style_rejected():
-    with pytest.raises(ValueError, match="Unknown language style"):
-        column_for_style("surprise")
-
-
-def test_lerobot_dataset_passes_language_columns_through(tmp_path, empty_lerobot_dataset_factory):
-    root = tmp_path / "language_dataset"
-    dataset = empty_lerobot_dataset_factory(
-        root=root,
-        features={"state": {"dtype": "float32", "shape": (2,), "names": None}},
-        use_videos=False,
-    )
-    dataset.add_frame({"state": np.array([0.0, 1.0], dtype=np.float32), "task": "tidy"})
-    dataset.add_frame({"state": np.array([1.0, 2.0], dtype=np.float32), "task": "tidy"})
-    dataset.save_episode()
-    dataset.finalize()
-
-    persistent = [
-        {
-            "role": "assistant",
-            "content": "reach for the cup",
-            "style": "subtask",
-            "timestamp": 0.0,
-            "camera": None,
-            "tool_calls": None,
-        }
-    ]
-    event = {
-        "role": "user",
-        "content": "what is visible?",
-        "style": "vqa",
-        "camera": "observation.images.top",
-        "tool_calls": None,
-    }
-    data_path = root / DEFAULT_DATA_PATH.format(chunk_index=0, file_index=0)
-    df = pd.read_parquet(data_path)
-    df[LANGUAGE_PERSISTENT] = [persistent, persistent]
-    df[LANGUAGE_EVENTS] = [[event], []]
-    df.to_parquet(data_path)
-
-    info = dataset.meta.info
-    info["features"].update(language_feature_info())
-    write_info(info, root)
-
-    reloaded = LeRobotDataset(repo_id=dataset.repo_id, root=root)
-
-    first = reloaded[0]
-    second = reloaded[1]
-    assert first[LANGUAGE_PERSISTENT] == persistent
-    assert first[LANGUAGE_EVENTS] == [event]
-    assert second[LANGUAGE_PERSISTENT] == persistent
-    assert second[LANGUAGE_EVENTS] == []
--- a/tests/datasets/test_language_render.py
+++ b/tests/datasets/test_language_render.py
@@ -1,388 +0,0 @@
-#!/usr/bin/env python
-
-from pathlib import Path
-
-import pytest
-
-from lerobot.configs.recipe import MessageTurn, TrainingRecipe
-from lerobot.datasets.language_render import active_at, emitted_at, nth_next, nth_prev, render_sample
-
-
-def persistent_row(role, content, style, timestamp, tool_calls=None, camera=None):
-    return {
-        "role": role,
-        "content": content,
-        "style": style,
-        "timestamp": timestamp,
-        "camera": camera,
-        "tool_calls": tool_calls,
-    }
-
-
-def event_row(role, content, style, tool_calls=None, camera=None):
-    return {
-        "role": role,
-        "content": content,
-        "style": style,
-        "camera": camera,
-        "tool_calls": tool_calls,
-    }
-
-
-PERSISTENT = [
-    persistent_row("assistant", "plan 0", "plan", 0.0),
-    persistent_row("assistant", "memory 0", "memory", 0.0),
-    persistent_row("assistant", "subtask 0", "subtask", 0.0),
-    persistent_row("assistant", "memory 1", "memory", 1.0),
-    persistent_row("assistant", "subtask 1", "subtask", 1.0),
-]
-EVENTS_AT_1 = [
-    event_row("user", "what is visible?", "vqa", camera="observation.images.top"),
-    event_row("assistant", '{"count": 2}', "vqa", camera="observation.images.top"),
-]
-EVENTS_AT_2 = [
-    event_row("user", "skip wiping", "interjection"),
-    event_row(
-        "assistant",
-        None,
-        None,
-        [{"type": "function", "function": {"name": "say", "arguments": {"text": "Skipping wiping."}}}],
-    ),
-]
-# Same emission tick, two cameras: triggers per-camera disambiguation in
-# resolvers, mirroring how Module 3 of the annotation pipeline writes one
-# (vqa, user) + (vqa, assistant) pair per camera.
-EVENTS_AT_3_TWO_CAMERAS = [
-    event_row("user", "how many cups (top)?", "vqa", camera="observation.images.top"),
-    event_row("assistant", '{"count": 3}', "vqa", camera="observation.images.top"),
-    event_row("user", "how many cups (wrist)?", "vqa", camera="observation.images.wrist"),
-    event_row("assistant", '{"count": 1}', "vqa", camera="observation.images.wrist"),
-]
-
-
-def test_resolver_temporal_semantics():
-    assert active_at(0.5, persistent=PERSISTENT, style="subtask")["content"] == "subtask 0"
-    assert active_at(1.0, persistent=PERSISTENT, style="subtask")["content"] == "subtask 1"
-    assert emitted_at(0.5, persistent=PERSISTENT, events=[], style="vqa", role="assistant") is None
-    assert (
-        emitted_at(1.0, persistent=PERSISTENT, events=EVENTS_AT_1, style="vqa", role="assistant")["content"]
-        == '{"count": 2}'
-    )
-
-
-def test_persistent_relative_resolvers_reject_event_styles():
-    with pytest.raises(ValueError, match="event-only"):
-        active_at(1.0, persistent=PERSISTENT, style="vqa")
-    with pytest.raises(ValueError, match="event-only"):
-        nth_prev(1.0, persistent=PERSISTENT, style="interjection")
-
-
-def test_nth_prev_and_next():
-    assert nth_prev(1.0, persistent=PERSISTENT, style="subtask", offset=1)["content"] == "subtask 0"
-    assert nth_next(0.0, persistent=PERSISTENT, style="subtask", offset=1)["content"] == "subtask 1"
-
-
-def test_substitution_if_present_multimodal_and_tool_calls():
-    recipe = TrainingRecipe(
-        messages=[
-            MessageTurn(
-                role="user",
-                content=[
-                    {"type": "image", "feature": "observation.images.top"},
-                    {"type": "text", "text": "${task}: ${interjection}"},
-                ],
-                stream="high_level",
-                if_present="interjection",
-            ),
-            MessageTurn(
-                role="assistant",
-                content="${plan}",
-                stream="high_level",
-                target=True,
-                tool_calls_from="speech",
-            ),
-        ],
-        bindings={"plan": "active_at(t, style=plan)"},
-    )
-
-    rendered = render_sample(
-        recipe=recipe,
-        persistent=PERSISTENT,
-        events=EVENTS_AT_2,
-        t=2.0,
-        sample_idx=0,
-        task="clean kitchen",
-    )
-
-    assert rendered["messages"][0]["content"][1]["text"] == "clean kitchen: skip wiping"
-    assert rendered["messages"][1]["content"] == "plan 0"
-    assert rendered["messages"][1]["tool_calls"][0]["function"]["name"] == "say"
-    assert rendered["message_streams"] == ["high_level", "high_level"]
-    assert rendered["target_message_indices"] == [1]
-
-
-def test_exact_event_miss_returns_none_when_target_skips():
-    recipe = TrainingRecipe(
-        messages=[
-            MessageTurn(role="user", content="${vqa_query}", stream="high_level", if_present="vqa_query"),
-            MessageTurn(
-                role="assistant",
-                content="${vqa}",
-                stream="high_level",
-                target=True,
-                if_present="vqa",
-            ),
-        ]
-    )
-
-    assert (
-        render_sample(recipe=recipe, persistent=PERSISTENT, events=EVENTS_AT_2, t=0.0, sample_idx=0) is None
-    )
-
-
-def test_deterministic_blend_sampling():
-    recipe = TrainingRecipe(
-        blend={
-            "a": TrainingRecipe(
-                weight=1.0,
-                messages=[
-                    MessageTurn(role="user", content="${task}", stream="high_level"),
-                    MessageTurn(role="assistant", content="a", stream="high_level", target=True),
-                ],
-            ),
-            "b": TrainingRecipe(
-                weight=1.0,
-                messages=[
-                    MessageTurn(role="user", content="${task}", stream="high_level"),
-                    MessageTurn(role="assistant", content="b", stream="high_level", target=True),
-                ],
-            ),
-        }
-    )
-
-    first = render_sample(
-        recipe=recipe, persistent=PERSISTENT, events=EVENTS_AT_2, t=0.0, sample_idx=123, task="x"
-    )
-    second = render_sample(
-        recipe=recipe, persistent=PERSISTENT, events=EVENTS_AT_2, t=0.0, sample_idx=123, task="x"
-    )
-    assert first == second
-
-
-def test_emitted_at_filters_vqa_by_camera():
-    top = emitted_at(
-        3.0,
-        persistent=PERSISTENT,
-        events=EVENTS_AT_3_TWO_CAMERAS,
-        style="vqa",
-        role="assistant",
-        camera="observation.images.top",
-    )
-    wrist = emitted_at(
-        3.0,
-        persistent=PERSISTENT,
-        events=EVENTS_AT_3_TWO_CAMERAS,
-        style="vqa",
-        role="assistant",
-        camera="observation.images.wrist",
-    )
-    assert top["content"] == '{"count": 3}'
-    assert wrist["content"] == '{"count": 1}'
-
-
-def test_emitted_at_raises_on_ambiguous_per_camera_vqa():
-    with pytest.raises(ValueError, match="Ambiguous resolver"):
-        emitted_at(
-            3.0,
-            persistent=PERSISTENT,
-            events=EVENTS_AT_3_TWO_CAMERAS,
-            style="vqa",
-            role="assistant",
-        )
-
-
-def test_per_camera_blend_renders_both_views():
-    recipe = TrainingRecipe(
-        blend={
-            "top": TrainingRecipe(
-                weight=1.0,
-                bindings={
-                    "vqa_query": (
-                        "emitted_at(t, style=vqa, role=user, camera=observation.images.top)"
-                    ),
-                    "vqa": (
-                        "emitted_at(t, style=vqa, role=assistant, camera=observation.images.top)"
-                    ),
-                },
-                messages=[
-                    MessageTurn(
-                        role="user",
-                        content=[
-                            {"type": "image", "feature": "observation.images.top"},
-                            {"type": "text", "text": "${vqa_query}"},
-                        ],
-                        stream="high_level",
-                        if_present="vqa_query",
-                    ),
-                    MessageTurn(
-                        role="assistant",
-                        content="${vqa}",
-                        stream="high_level",
-                        target=True,
-                        if_present="vqa",
-                    ),
-                ],
-            ),
-            "wrist": TrainingRecipe(
-                weight=1.0,
-                bindings={
-                    "vqa_query": (
-                        "emitted_at(t, style=vqa, role=user, camera=observation.images.wrist)"
-                    ),
-                    "vqa": (
-                        "emitted_at(t, style=vqa, role=assistant, camera=observation.images.wrist)"
-                    ),
-                },
-                messages=[
-                    MessageTurn(
-                        role="user",
-                        content=[
-                            {"type": "image", "feature": "observation.images.wrist"},
-                            {"type": "text", "text": "${vqa_query}"},
-                        ],
-                        stream="high_level",
-                        if_present="vqa_query",
-                    ),
-                    MessageTurn(
-                        role="assistant",
-                        content="${vqa}",
-                        stream="high_level",
-                        target=True,
-                        if_present="vqa",
-                    ),
-                ],
-            ),
-        }
-    )
-
-    rendered_top = render_sample(
-        recipe=recipe.blend["top"],
-        persistent=PERSISTENT,
-        events=EVENTS_AT_3_TWO_CAMERAS,
-        t=3.0,
-        sample_idx=0,
-    )
-    rendered_wrist = render_sample(
-        recipe=recipe.blend["wrist"],
-        persistent=PERSISTENT,
-        events=EVENTS_AT_3_TWO_CAMERAS,
-        t=3.0,
-        sample_idx=0,
-    )
-
-    assert rendered_top["messages"][0]["content"][0]["feature"] == "observation.images.top"
-    assert rendered_top["messages"][0]["content"][1]["text"] == "how many cups (top)?"
-    assert rendered_top["messages"][1]["content"] == '{"count": 3}'
-
-    assert rendered_wrist["messages"][0]["content"][0]["feature"] == "observation.images.wrist"
-    assert rendered_wrist["messages"][0]["content"][1]["text"] == "how many cups (wrist)?"
-    assert rendered_wrist["messages"][1]["content"] == '{"count": 1}'
-
-
-def test_resolve_task_picks_rephrasing_deterministically_per_sample():
-    rephrasings = [
-        persistent_row("user", "tidy the kitchen", "task_aug", 0.0),
-        persistent_row("user", "please clean up the kitchen", "task_aug", 0.0),
-        persistent_row("user", "kitchen needs tidying", "task_aug", 0.0),
-        persistent_row("user", "make the kitchen clean", "task_aug", 0.0),
-    ]
-    recipe = TrainingRecipe(
-        messages=[
-            MessageTurn(role="user", content="${task}", stream="high_level"),
-            MessageTurn(role="assistant", content="ok", stream="high_level", target=True),
-        ]
-    )
-
-    # No explicit task override → resolver consults persistent rows.
-    seen: set[str] = set()
-    for sample_idx in range(64):
-        rendered = render_sample(
-            recipe=recipe,
-            persistent=rephrasings,
-            events=[],
-            t=0.0,
-            sample_idx=sample_idx,
-            dataset_ctx={"task": "canonical kitchen task"},
-        )
-        seen.add(rendered["messages"][0]["content"])
-    # Every rephrasing should be reachable across enough samples.
-    assert seen == {r["content"] for r in rephrasings}
-    # Same sample_idx → same pick (determinism).
-    a = render_sample(
-        recipe=recipe, persistent=rephrasings, events=[], t=0.0, sample_idx=42,
-        dataset_ctx={"task": "canonical"},
-    )
-    b = render_sample(
-        recipe=recipe, persistent=rephrasings, events=[], t=0.0, sample_idx=42,
-        dataset_ctx={"task": "canonical"},
-    )
-    assert a["messages"][0]["content"] == b["messages"][0]["content"]
-
-
-def test_resolve_task_falls_back_to_canonical_without_rephrasings():
-    recipe = TrainingRecipe(
-        messages=[
-            MessageTurn(role="user", content="${task}", stream="high_level"),
-            MessageTurn(role="assistant", content="ok", stream="high_level", target=True),
-        ]
-    )
-    rendered = render_sample(
-        recipe=recipe,
-        persistent=PERSISTENT,  # no task_aug rows
-        events=[],
-        t=0.0,
-        sample_idx=0,
-        dataset_ctx={"task": "clean the kitchen"},
-    )
-    assert rendered["messages"][0]["content"] == "clean the kitchen"
-
-
-def test_resolve_task_explicit_override_beats_rephrasings():
-    rephrasings = [
-        persistent_row("user", "rephrased one", "task_aug", 0.0),
-        persistent_row("user", "rephrased two", "task_aug", 0.0),
-    ]
-    recipe = TrainingRecipe(
-        messages=[
-            MessageTurn(role="user", content="${task}", stream="high_level"),
-            MessageTurn(role="assistant", content="ok", stream="high_level", target=True),
-        ]
-    )
-    rendered = render_sample(
-        recipe=recipe,
-        persistent=rephrasings,
-        events=[],
-        t=0.0,
-        sample_idx=0,
-        task="explicit override wins",
-        dataset_ctx={"task": "canonical"},
-    )
-    assert rendered["messages"][0]["content"] == "explicit override wins"
-
-
-def test_canonical_recipe_can_render_low_level_branch():
-    recipe = TrainingRecipe.from_yaml(Path("src/lerobot/configs/recipes/pi05_hirobot.yaml"))
-    low_level = TrainingRecipe(blend={"low": recipe.blend["low_level_execution"]})
-
-    rendered = render_sample(
-        recipe=low_level,
-        persistent=PERSISTENT,
-        events=[],
-        t=0.5,
-        sample_idx=0,
-        task="clean kitchen",
-    )
-
-    assert rendered["messages"][-1] == {"role": "assistant", "content": "subtask 0"}
-    assert rendered["message_streams"][-1] == "low_level"
-    assert rendered["target_message_indices"] == [1]
--- a/tests/datasets/test_subtask_dataset.py
+++ b/tests/datasets/test_subtask_dataset.py
@@ -0,0 +1,193 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+Tests for subtask functionality in LeRobotDataset.
+
+These tests verify that:
+- Subtask information is correctly loaded from datasets that have subtask data
+- The __getitem__ method correctly adds subtask strings to returned items
+- Missing subtask data is handled gracefully
+"""
+
+import pytest
+
+pytest.importorskip("pandas", reason="pandas is required (install lerobot[dataset])")
+
+import pandas as pd  # noqa: E402
+import torch  # noqa: E402
+
+from lerobot.datasets.lerobot_dataset import LeRobotDataset  # noqa: E402
+
+
+class TestSubtaskDataset:
+    """Tests for subtask handling in LeRobotDataset."""
+
+    @pytest.fixture
+    def subtask_dataset(self):
+        """Load the test subtask dataset from the hub."""
+        # Use the lerobot/pusht-subtask dataset, episodes 1 through 11
+        return LeRobotDataset(
+            repo_id="lerobot/pusht-subtask",
+            episodes=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
+        )
+
+    def test_subtask_dataset_loads(self, subtask_dataset):
+        """Test that the subtask dataset loads successfully."""
+        assert subtask_dataset is not None
+        assert len(subtask_dataset) > 0
+
+    def test_subtask_metadata_loaded(self, subtask_dataset):
+        """Test that subtask metadata is loaded when present in dataset."""
+        # The dataset should have subtasks metadata loaded
+        assert subtask_dataset.meta.subtasks is not None
+        assert isinstance(subtask_dataset.meta.subtasks, pd.DataFrame)
+
+    def test_subtask_index_in_features(self, subtask_dataset):
+        """Test that subtask_index is a feature when dataset has subtasks."""
+        assert "subtask_index" in subtask_dataset.features
+
+    def test_getitem_returns_subtask_string(self, subtask_dataset):
+        """Test that __getitem__ correctly adds subtask string to returned item."""
+        item = subtask_dataset[0]
+
+        # Subtask should be present in the returned item
+        assert "subtask" in item
+        assert isinstance(item["subtask"], str)
+        assert len(item["subtask"]) > 0  # Should not be empty
+
+    def test_getitem_has_subtask_index(self, subtask_dataset):
+        """Test that __getitem__ includes subtask_index."""
+        item = subtask_dataset[0]
+
+        assert "subtask_index" in item
+        assert isinstance(item["subtask_index"], torch.Tensor)
+
+    def test_subtask_index_maps_to_valid_subtask(self, subtask_dataset):
+        """Test that subtask_index correctly maps to a subtask in metadata."""
+        item = subtask_dataset[0]
+
+        subtask_idx = item["subtask_index"].item()
+        subtask_from_metadata = subtask_dataset.meta.subtasks.iloc[subtask_idx].name
+
+        assert item["subtask"] == subtask_from_metadata
+
+    def test_all_items_have_subtask(self, subtask_dataset):
+        """Test that the first few items in the dataset have subtask information."""
+        for i in range(min(len(subtask_dataset), 5)):  # Check first 5 items
+            item = subtask_dataset[i]
+            assert "subtask" in item
+            assert isinstance(item["subtask"], str)
+
+    def test_task_and_subtask_coexist(self, subtask_dataset):
+        """Test that both task and subtask are present in returned items."""
+        item = subtask_dataset[0]
+
+        # Both task and subtask should be present
+        assert "task" in item
+        assert "subtask" in item
+        assert isinstance(item["task"], str)
+        assert isinstance(item["subtask"], str)
+
+
+class TestSubtaskDatasetMissing:
+    """Tests for graceful handling when subtask data is missing."""
+
+    @pytest.fixture
+    def dataset_without_subtasks(self, tmp_path, empty_lerobot_dataset_factory):
+        """Create a dataset without subtask information."""
+        features = {"state": {"dtype": "float32", "shape": (2,), "names": None}}
+        dataset = empty_lerobot_dataset_factory(root=tmp_path / "no_subtask", features=features)
+
+        # Add some frames and save
+        for _ in range(5):
+            dataset.add_frame({"state": torch.randn(2), "task": "Test task"})
+        dataset.save_episode()
+        dataset.finalize()
+
+        # Reload the dataset
+        return LeRobotDataset(dataset.repo_id, root=dataset.root)
+
+    def test_no_subtask_in_features(self, dataset_without_subtasks):
+        """Test that subtask_index is not in features when not provided."""
+        assert "subtask_index" not in dataset_without_subtasks.features
+
+    def test_getitem_without_subtask(self, dataset_without_subtasks):
+        """Test that __getitem__ works when subtask is not present."""
+        item = dataset_without_subtasks[0]
+
+        # Item should still be retrievable
+        assert item is not None
+        assert "state" in item
+        assert "task" in item
+
+        # Subtask should NOT be present
+        assert "subtask" not in item
+
+    def test_subtasks_metadata_is_none(self, dataset_without_subtasks):
+        """Test that subtasks metadata is None when not present."""
+        assert dataset_without_subtasks.meta.subtasks is None
+
+
+class TestSubtaskEdgeCases:
+    """Edge case tests for subtask handling."""
+
+    def test_subtask_with_multiple_episodes(self):
+        """Test subtask handling with multiple episodes if available."""
+        try:
+            dataset = LeRobotDataset(
+                repo_id="lerobot/pusht-subtask",
+                episodes=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
+            )
+        except Exception:
+            pytest.skip("Could not load the lerobot/pusht-subtask dataset")
+
+        # Check first and last items have valid subtasks
+        first_item = dataset[0]
+        last_item = dataset[len(dataset) - 1]
+
+        assert "subtask" in first_item
+        assert "subtask" in last_item
+        assert isinstance(first_item["subtask"], str)
+        assert isinstance(last_item["subtask"], str)
+
+    def test_subtask_index_consistency(self):
+        """Test that same subtask_index returns same subtask string."""
+        try:
+            dataset = LeRobotDataset(
+                repo_id="lerobot/pusht-subtask",
+                episodes=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
+            )
+        except Exception:
+            pytest.skip("Could not load the lerobot/pusht-subtask dataset")
+
+        if len(dataset) < 2:
+            pytest.skip("Dataset too small for this test")
+
+        # Collect subtask_index to subtask mappings
+        subtask_map = {}
+        for i in range(min(len(dataset), 10)):
+            item = dataset[i]
+            idx = item["subtask_index"].item()
+            subtask = item["subtask"]
+
+            if idx in subtask_map:
+                # Same index should always return same subtask
+                assert subtask_map[idx] == subtask, (
+                    f"Inconsistent subtask for index {idx}: '{subtask_map[idx]}' vs '{subtask}'"
+                )
+            else:
+                subtask_map[idx] = subtask
--- a/tests/processor/test_render_messages_processor.py
+++ b/tests/processor/test_render_messages_processor.py
@@ -1,56 +0,0 @@
-#!/usr/bin/env python
-
-import torch
-
-from lerobot.configs.recipe import MessageTurn, TrainingRecipe
-from lerobot.processor.converters import create_transition
-from lerobot.processor.render_messages_processor import RenderMessagesStep
-from lerobot.types import TransitionKey
-
-
-def test_render_messages_step_noops_without_language_columns():
-    recipe = TrainingRecipe(
-        messages=[
-            MessageTurn(role="user", content="${task}", stream="high_level"),
-            MessageTurn(role="assistant", content="${subtask}", stream="low_level", target=True),
-        ]
-    )
-    transition = create_transition(complementary_data={"task": "do it"})
-
-    assert RenderMessagesStep(recipe)(transition) == transition
-
-
-def test_render_messages_step_renders_and_drops_raw_language():
-    recipe = TrainingRecipe(
-        messages=[
-            MessageTurn(role="user", content="${task}", stream="high_level"),
-            MessageTurn(role="assistant", content="${subtask}", stream="low_level", target=True),
-        ]
-    )
-    transition = create_transition(
-        complementary_data={
-            "task": "do it",
-            "timestamp": torch.tensor(0.0),
-            "index": torch.tensor(7),
-            "language_persistent": [
-                {
-                    "role": "assistant",
-                    "content": "reach carefully",
-                    "style": "subtask",
-                    "timestamp": 0.0,
-                    "camera": None,
-                    "tool_calls": None,
-                }
-            ],
-            "language_events": [],
-        }
-    )
-
-    out = RenderMessagesStep(recipe)(transition)
-    data = out[TransitionKey.COMPLEMENTARY_DATA]
-
-    assert "language_persistent" not in data
-    assert "language_events" not in data
-    assert data["messages"][-1]["content"] == "reach carefully"
-    assert data["message_streams"] == ["high_level", "low_level"]
-    assert data["target_message_indices"] == [1]
--- a/tests/test_robomme_env.py
+++ b/tests/test_robomme_env.py
@@ -1,232 +0,0 @@
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Unit tests for the RoboMME env wrapper and config.
-
-RoboMME requires Linux + ManiSkill (Vulkan/SAPIEN), so tests that touch the
-env wrapper mock the ``robomme`` package. Tests that only exercise the
-dataclass config run without any mocking.
-"""
-
-from __future__ import annotations
-
-import sys
-from types import ModuleType
-from unittest.mock import MagicMock
-
-import numpy as np
-
-
-def _install_robomme_stub():
-    """Register a minimal stub for the ``robomme`` package on sys.modules."""
-    stub = ModuleType("robomme")
-    wrapper_stub = ModuleType("robomme.env_record_wrapper")
-
-    class FakeBuilder:
-        def __init__(self, **kwargs):
-            pass
-
-        def make_env_for_episode(self, episode_idx: int, max_steps: int):
-            env = MagicMock()
-            obs = {
-                "front_rgb_list": [np.zeros((256, 256, 3), dtype=np.uint8)],
-                "wrist_rgb_list": [np.zeros((256, 256, 3), dtype=np.uint8)],
-                "joint_state_list": [np.zeros(7, dtype=np.float32)],
-                "gripper_state_list": [np.zeros(2, dtype=np.float32)],
-            }
-            env.reset.return_value = (obs, {"status": "ongoing", "task_goal": "pick the cube"})
-            env.step.return_value = (obs, 0.0, False, False, {"status": "ongoing", "task_goal": ""})
-            return env
-
-    wrapper_stub.BenchmarkEnvBuilder = FakeBuilder
-    stub.env_record_wrapper = wrapper_stub
-    sys.modules["robomme"] = stub
-    sys.modules["robomme.env_record_wrapper"] = wrapper_stub
-
-
-def _uninstall_robomme_stub():
-    sys.modules.pop("robomme", None)
-    sys.modules.pop("robomme.env_record_wrapper", None)
-
-
-# ---------------------------------------------------------------------------
-# Config tests (no sim required)
-# ---------------------------------------------------------------------------
-
-
-def test_robomme_env_config_defaults():
-    from lerobot.envs.configs import RoboMMEEnv
-
-    cfg = RoboMMEEnv()
-    assert cfg.task == "PickXtimes"
-    assert cfg.fps == 10
-    assert cfg.episode_length == 300
-    assert cfg.action_space == "joint_angle"
-    assert cfg.dataset_split == "test"
-    assert cfg.task_ids is None
-
-
-def test_robomme_env_config_type():
-    from lerobot.envs.configs import RoboMMEEnv
-
-    cfg = RoboMMEEnv()
-    assert cfg.type == "robomme"
-
-
-def test_robomme_features_map():
-    from lerobot.envs.configs import RoboMMEEnv
-    from lerobot.utils.constants import ACTION, OBS_IMAGES, OBS_STATE
-
-    cfg = RoboMMEEnv()
-    assert cfg.features_map[ACTION] == ACTION
-    assert cfg.features_map["pixels/image"] == f"{OBS_IMAGES}.image"
-    assert cfg.features_map["pixels/wrist_image"] == f"{OBS_IMAGES}.wrist_image"
-    assert cfg.features_map["agent_pos"] == OBS_STATE
-
-
-def test_robomme_features_action_dim_joint_angle():
-    from lerobot.envs.configs import RoboMMEEnv
-    from lerobot.utils.constants import ACTION
-
-    cfg = RoboMMEEnv(action_space="joint_angle")
-    assert cfg.features[ACTION].shape == (8,)
-
-
-def test_robomme_features_action_dim_ee_pose():
-    """`ee_pose` uses a 7-D action; __post_init__ sets the correct shape."""
-    from lerobot.envs.configs import RoboMMEEnv
-    from lerobot.utils.constants import ACTION
-
-    cfg = RoboMMEEnv(action_space="ee_pose")
-    assert cfg.features[ACTION].shape == (7,)
-
-
-# ---------------------------------------------------------------------------
-# Obs conversion (pure Python, no sim)
-# ---------------------------------------------------------------------------
-
-
-def test_convert_obs_list_format():
-    """_convert_obs takes the last element from list-format obs fields and
-    emits a nested ``pixels`` dict (image, wrist_image) plus ``agent_pos``.
-
-    The nested layout is required so ``preprocess_observation()`` in
-    ``envs/utils.py`` maps each camera to ``observation.images.<cam>``.
-    """
-    _install_robomme_stub()
-    try:
-        from lerobot.envs.robomme import RoboMMEGymEnv
-
-        env = RoboMMEGymEnv.__new__(RoboMMEGymEnv)
-
-        front = np.full((256, 256, 3), 42, dtype=np.uint8)
-        wrist = np.full((256, 256, 3), 7, dtype=np.uint8)
-        joints = np.arange(7, dtype=np.float32)
-        gripper = np.array([0.5, 0.5], dtype=np.float32)
-
-        obs_raw = {
-            "front_rgb_list": [np.zeros_like(front), front],
-            "wrist_rgb_list": [np.zeros_like(wrist), wrist],
-            "joint_state_list": [np.zeros(7, dtype=np.float32), joints],
-            "gripper_state_list": [np.zeros(2, dtype=np.float32), gripper],
-        }
-
-        result = env._convert_obs(obs_raw)
-        np.testing.assert_array_equal(result["pixels"]["image"], front)
-        np.testing.assert_array_equal(result["pixels"]["wrist_image"], wrist)
-        assert result["agent_pos"].shape == (8,)
-        np.testing.assert_array_almost_equal(result["agent_pos"][:7], joints)
-        assert result["agent_pos"][7] == gripper[0]
-    finally:
-        _uninstall_robomme_stub()
-
-
-def test_convert_obs_array_format():
-    """_convert_obs also handles non-list (direct array) obs."""
-    _install_robomme_stub()
-    try:
-        from lerobot.envs.robomme import RoboMMEGymEnv
-
-        env = RoboMMEGymEnv.__new__(RoboMMEGymEnv)
-
-        front = np.zeros((256, 256, 3), dtype=np.uint8)
-        obs_raw = {
-            "front_rgb_list": front,
-            "wrist_rgb_list": front,
-            "joint_state_list": np.zeros(7, dtype=np.float32),
-            "gripper_state_list": np.zeros(2, dtype=np.float32),
-        }
-        result = env._convert_obs(obs_raw)
-        assert result["pixels"]["image"].shape == (256, 256, 3)
-        assert result["pixels"]["wrist_image"].shape == (256, 256, 3)
-        assert result["agent_pos"].shape == (8,)
-    finally:
-        _uninstall_robomme_stub()
-
-
-# ---------------------------------------------------------------------------
-# create_robomme_envs (mocked sim)
-# ---------------------------------------------------------------------------
-
-
-def test_create_robomme_envs_returns_correct_structure():
-    """Single task -> {task_name: {task_id: VectorEnv}} with one entry per task_id."""
-    _install_robomme_stub()
-    try:
-        from lerobot.envs.robomme import create_robomme_envs
-
-        env_cls = MagicMock(return_value=MagicMock())
-        result = create_robomme_envs(
-            task="PickXtimes",
-            n_envs=1,
-            task_ids=[0, 1],
-            env_cls=env_cls,
-        )
-
-        assert "PickXtimes" in result
-        assert 0 in result["PickXtimes"]
-        assert 1 in result["PickXtimes"]
-        assert env_cls.call_count == 2
-    finally:
-        _uninstall_robomme_stub()
-
-
-def test_create_robomme_envs_multi_task():
-    """Comma-separated task list produces one suite per task."""
-    _install_robomme_stub()
-    try:
-        from lerobot.envs.robomme import create_robomme_envs
-
-        env_cls = MagicMock(return_value=MagicMock())
-        result = create_robomme_envs(
-            task="PickXtimes,BinFill,StopCube",
-            n_envs=1,
-            env_cls=env_cls,
-        )
-
-        assert set(result.keys()) == {"PickXtimes", "BinFill", "StopCube"}
-    finally:
-        _uninstall_robomme_stub()
-
-
-def test_create_robomme_envs_raises_on_invalid_env_cls():
-    _install_robomme_stub()
-    try:
-        import pytest
-
-        from lerobot.envs.robomme import create_robomme_envs
-
-        with pytest.raises(ValueError, match="env_cls must be a callable"):
-            create_robomme_envs(task="PickXtimes", n_envs=1, env_cls=None)
-    finally:
-        _uninstall_robomme_stub()
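The removed tests above register a fake `robomme` package on `sys.modules` and pop it again in a `finally` block. A minimal sketch of the same stub-and-restore idea as a context manager follows. The names `stubbed_modules`, `fake_sim`, and `FakeBuilder` are hypothetical and do not come from the repository; unlike the removed helper, this variant restores whatever was previously registered under the same name instead of popping it unconditionally.

```python
import sys
from contextlib import contextmanager
from types import ModuleType


@contextmanager
def stubbed_modules(stubs):
    """Temporarily register stub modules on sys.modules (hypothetical helper).

    `stubs` maps fully qualified module names to module objects. On exit, each
    name is restored to its previous module, or removed if it was absent.
    """
    saved = {name: sys.modules.get(name) for name in stubs}
    sys.modules.update(stubs)
    try:
        yield
    finally:
        for name, previous in saved.items():
            if previous is None:
                sys.modules.pop(name, None)
            else:
                sys.modules[name] = previous


def make_fake_sim():
    """Build a two-level fake package, mirroring the shape of the removed stub."""
    pkg = ModuleType("fake_sim")
    sub = ModuleType("fake_sim.builder")

    class FakeBuilder:
        def make_env(self):
            return "fake-env"

    sub.FakeBuilder = FakeBuilder
    pkg.builder = sub
    return {"fake_sim": pkg, "fake_sim.builder": sub}


with stubbed_modules(make_fake_sim()):
    # The import resolves against the stubs because they are in sys.modules.
    from fake_sim.builder import FakeBuilder

    env = FakeBuilder().make_env()

stub_removed = "fake_sim" not in sys.modules
```

Restoring the previous entry matters on machines where the real package is installed: an unconditional `pop` would drop the real module from `sys.modules` for any test that runs afterwards.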
--- a/tests/utils/test_collate.py
+++ b/tests/utils/test_collate.py
@@ -1,36 +0,0 @@
-#!/usr/bin/env python
-
-import torch
-
-from lerobot.utils.collate import lerobot_collate_fn
-
-
-def test_lerobot_collate_preserves_messages_and_drops_raw_language():
-    batch = [
-        {
-            "index": torch.tensor(0),
-            "messages": [{"role": "assistant", "content": "a"}],
-            "message_streams": ["low_level"],
-            "target_message_indices": [0],
-            "language_persistent": [{"content": "raw"}],
-            "language_events": [],
-        },
-        {
-            "index": torch.tensor(1),
-            "messages": [{"role": "assistant", "content": "b"}],
-            "message_streams": ["low_level"],
-            "target_message_indices": [0],
-            "language_persistent": [{"content": "raw"}],
-            "language_events": [],
-        },
-    ]
-
-    out = lerobot_collate_fn(batch)
-
-    assert out["index"].tolist() == [0, 1]
-    assert out["messages"][0][0]["content"] == "a"
-    assert out["messages"][1][0]["content"] == "b"
-    assert out["message_streams"] == [["low_level"], ["low_level"]]
-    assert out["target_message_indices"] == [[0], [0]]
-    assert "language_persistent" not in out
-    assert "language_events" not in out
Author	SHA1	Message	Date
Pepijn	51025c6cad	fix(robotwin): pin compatible curobo in benchmark image	2026-04-21 18:41:16 +02:00
Pepijn	05783401d3	Merge remote-tracking branch 'origin/feat/robotwin-benchmark' into feat/robotwin-benchmark	2026-04-20 17:31:28 +02:00
Pepijn	cfaeea6b1a	Merge branch 'main' into feat/robotwin-benchmark Resolves conflicts introduced by the RoboCasa365 benchmark merge on main (PR #3375) overlapping with this PR's benchmark CI scaffolding. - .github/workflows/benchmark_tests.yml: keep both the RoboTwin 2.0 job (from this branch) and the new RoboCasa365 job (from main). - docs/source/_toctree.yml: list both `robotwin` and `robocasa` pages under the "Benchmarks" section. - scripts/ci/extract_task_descriptions.py: keep both `_robotwin_descriptions` and `_robocasa_descriptions` helpers and wire them into `main()` alongside each other. Made-with: Cursor	2026-04-20 17:17:00 +02:00
Pepijn	d3909da83a	Merge branch 'main' into feat/robotwin-benchmark	2026-04-20 15:28:45 +02:00
Pepijn	1157fb11e6	fix: integrate PR #3315 review feedback - envs(robotwin): default `observation_height/width` in `create_robotwin_envs` to `DEFAULT_CAMERA_H/W` (240/320) so they match the D435 dims baked into `task_config/demo_clean.yml`. - envs(robotwin): resolve `task_config/demo_clean.yml` via `CONFIGS_PATH` instead of a cwd-relative path; works regardless of where `lerobot-eval` is invoked. - envs(robotwin): replace `print()` calls in `create_robotwin_envs` with `logger.info(...)` (module-level `logger = logging.getLogger`). - envs(robotwin): use `_LazyAsyncVectorEnv` for the async path so async workers start lazily (matches LIBERO / RoboCasa / VLABench). - envs(robotwin): cast `agent_pos` space + joint-state output to float32 end-to-end (was mixed float64/float32). - envs(configs): use the existing `_make_vec_env_cls(use_async, n_envs)` helper in `RoboTwinEnvConfig.create_envs`; drop the `get_env_processors` override so RoboTwin uses the identity processor inherited from `EnvConfig`. - processor: delete `RoboTwinProcessorStep` — the float32 cast now happens in the wrapper itself, so the processor is redundant. - tests: drop the `TestRoboTwinProcessorStep` suite; update the mock obs fixture to use float32 `joint_action.vector`. - ci: hoist `ROBOTWIN_POLICY` and `ROBOTWIN_TASKS` to job-level env vars so the task list and policy aren't duplicated across eval / extract / parse steps. - docker: pin RoboTwin + CuRobo upstream clones to commit SHAs (`RoboTwin@0aeea2d6`, `curobo@ca941586`) for reproducibility. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 15:18:41 +02:00
Pepijn	0fed8b45c2	ci: gate Docker Hub login on secret availability Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 14:27:06 +02:00
Pepijn	c9c6a6ae3d	fix(envs): preserve AsyncVectorEnv metadata/unwrapped in lazy eval envs Port of #3416 onto this branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 14:05:39 +02:00
Pepijn	2cc147d946	test(robotwin): lower task-count floor from 60 to 50 ROBOTWIN_TASKS was trimmed to 50 tasks (see comment in `src/lerobot/envs/robotwin.py:48`), but the assertion still required ≥60, causing CI failures. Align the test with the current upstream task count. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 10:12:58 +02:00
Pepijn	f1ba581ec4	Merge branch 'main' into feat/robotwin-benchmark	2026-04-20 08:45:34 +01:00
Pepijn	46803e88bd	fix(robotwin): sync ROBOTWIN_TASKS + doc with upstream (50 tasks) The local ROBOTWIN_TASKS tuple drifted from upstream RoboTwin-Platform/RoboTwin. Users passing names like `close_laptop`, `close_microwave`, `dump_bin`, `place_block`, `pour_water`, `fold_cloth`, etc. got past our validator (the names were in the tuple) but then crashed inside robosuite with a confusing error, because those tasks don't exist in upstream `envs/`. - Replace ROBOTWIN_TASKS with a verbatim mirror of upstream's `envs/` directory: 50 tasks as of main (was 60 with many stale entries). Added a `gh api`-based one-liner comment so future bumps are mechanical. - Update the `60 tasks` claims in robotwin.mdx and RoboTwinEnvConfig's docstring to `50`. - Replace the stale example-task table in robotwin.mdx with ten upstream-confirmed examples, and flag `open_laptop` as temporarily broken (its `check_success()` uses `self.arm_tag` which is only set inside `play_once()`; eval-mode callers hit AttributeError). - Rebuild the "Full benchmark" command with the actual 50-task list, omitting `open_laptop`. Made-with: Cursor	2026-04-17 15:29:57 +01:00
Pepijn	9ead70f016	fix(ci): swap 4 broken RoboTwin tasks in smoke eval The smoke eval hit two upstream issues: - `open_laptop`: bug in OpenMOSS/RoboTwin main — `check_success()` uses `self.arm_tag`, but that attribute is only set inside `play_once()` (the scripted-expert path). During eval `take_action()` calls `check_success()` directly, hitting `AttributeError: 'open_laptop' object has no attribute 'arm_tag'`. - `close_laptop`, `close_microwave`, `place_block`: not present in upstream RoboTwin `envs/` at all — our ROBOTWIN_TASKS tuple drifted from upstream and these names leaked into CI. Replace the four broken tasks with upstream-confirmed equivalents that exist both in ROBOTWIN_TASKS and in RoboTwin's `envs/`: `adjust_bottle`, `lift_pot`, `stamp_seal`, `turn_switch`. New 10-task smoke set: beat_block_hammer, click_bell, handover_block, stack_blocks_two, click_alarmclock, open_microwave, adjust_bottle, lift_pot, stamp_seal, turn_switch. Made-with: Cursor	2026-04-17 15:18:20 +01:00
Pepijn	84bb033631	ci(robotwin): smoke-eval 10 tasks instead of 5 Broader coverage on the RoboTwin 2.0 benchmark CI job: bump the smoke eval from 5 tasks to 10 (one episode each). Added tasks are all drawn from ROBOTWIN_TASKS and mirror the shape/complexity of the existing set (simple single-object or single-fixture manipulations). Tasks now run: beat_block_hammer, click_bell, handover_block, open_laptop, stack_blocks_two, click_alarmclock, close_laptop, close_microwave, open_microwave, place_block. `parse_eval_metrics.py` reads `overall` for multi-task runs so no parser change is needed. Bumped the step name and the metrics label to reflect the 10-task layout. Made-with: Cursor	2026-04-17 13:40:12 +01:00
Pepijn	78201f3226	Merge branch 'main' into feat/robotwin-benchmark	2026-04-16 18:57:39 +02:00
Pepijn	4ccc4e9a66	fix(docs): use plain markdown image to fix MDX build Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 18:57:27 +02:00
Pepijn	fdbbc35cca	fix(docs): use correct RoboTwin 2.0 teaser image URL Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 18:54:12 +02:00
Pepijn	741a6d5246	fix(docs): correct RoboTwin 2.0 paper arxiv link Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 18:53:55 +02:00
Pepijn	a4102ee86d	fix: integrate PR #3315 review feedback - ci: add Docker Hub login step, add HF_USER_TOKEN guard on eval step - docker: tie patches to pinned versions with removal guidance, remove unnecessary HF_TOKEN for public dataset, fix hadolint warnings - docs: fix paper link to arxiv, add teaser image, fix camera names (4→3 cameras), fix observation dims (480x640→240x320) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 18:33:53 +02:00
Pepijn	ad9662b4a8	fix(robotwin): defer YAML lookup and realign tests with current API __init__ was eagerly calling _load_robotwin_setup_kwargs just to read head_camera_h/w from the YAML. That import (`from envs import CONFIGS_PATH`) required a real RoboTwin install, so constructing the env — and thus every test in tests/envs/test_robotwin.py — blew up with ModuleNotFoundError on fast-tests where RoboTwin isn't installed. Replace the eager lookup with DEFAULT_CAMERA_H/W constants (240×320, the D435 dims baked into task_config/demo_clean.yml). reset() still resolves the full setup_kwargs lazily — that's fine because reset() is only called inside the benchmark Docker image where RoboTwin is present. Also resync the test file with the current env API: - mock get_obs() as the real nested {"observation": {cam: {"rgb": …}}, "joint_action": {"vector": …}} shape - patch both _load_robotwin_task and _load_robotwin_setup_kwargs (_patch_load → _patch_runtime) - drop `front_camera` / `left_wrist` from assertions — aloha-agilex exposes head_camera + left_camera + right_camera, not those - black-frame test now uses left_camera as the missing camera - setup_demo call check loosened to the caller-provided seed/is_test bits (full kwargs include the YAML-derived blob) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 11:14:02 +02:00
Pepijn	f291d3bfa9	docs(robotwin): add robotwin to _toctree.yml under Benchmarks doc-builder's TOC integrity check was rejecting the branch because docs/source/robotwin.mdx existed but wasn't listed in _toctree.yml. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 11:06:34 +02:00
Pepijn	49186359b0	refactor(robotwin): rebase docker image on huggingface/lerobot-gpu Mirror the libero/metaworld/libero_plus/robomme pattern: start from the nightly GPU image (apt deps, python, uv, venv, lerobot[all] already there) and layer on only what RoboTwin 2.0 uniquely needs — cuda-nvcc + cuda-cudart-dev (CuRobo builds from source), Vulkan libs + NVIDIA ICD (SAPIEN renderer), sapien/mplib/open3d/pytorch3d/curobo installs, the mplib + sapien upstream patches, and the TianxingChen asset download. Drops ~90 lines of duplicated base setup (CUDA FROM, apt python, uv install, user creation, venv init, base lerobot install). 199 → 110. Also repoint the docs + env docstring dataset link from hxma/RoboTwin-LeRobot-v3.0 to the canonical lerobot/robotwin_unified. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 11:05:07 +02:00
Pepijn	99792bb17b	ci: point benchmark eval checkpoints at the lerobot/ org mirrors pepijn223/smolvla_* → lerobot/smolvla_* across every benchmark job in this branch (libero, metaworld, and the per-branch benchmark). The checkpoints were mirrored into the lerobot/ org and that's the canonical location going forward. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 11:02:06 +02:00
Pepijn	e67ceb213d	feat(robotwin): eval 5 diverse tasks per CI run with NL descriptions Widen the smoke eval from a single task (beat_block_hammer) to five: click_bell, handover_block, open_laptop, stack_blocks_two on top of the original. Each gets its own rollout video in videos/<task>_0/ so the dashboard can surface visually distinct behaviours. extract_task_descriptions.py now has a RoboTwin branch that reads `description/task_instruction/<task>.json` (already shipped in the clone at /opt/robotwin) and pulls the `full_description` field. CI cds into the clone before invoking the script so the relative path resolves. parse_eval_metrics.py is invoked with the same 5-task list so the metrics.json embeds one entry per task. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 21:03:15 +02:00
Pepijn	793f52e360	fix(robotwin): install av-dep so lerobot_eval can write rollout MP4s write_video (utils/io_utils.py:53) lazily imports PyAV via require_package and raises silently inside the video-writing thread when the extra is not installed — so the eval itself succeeds with pc_success=100 but no MP4 ever lands in videos/, and the artifact upload reports "No files were found". Add av-dep to the install line (same pattern as the RoboMME image). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 20:24:03 +02:00
Pepijn	ae113e0d99	fix(robotwin): expose _max_episode_steps for lerobot_eval.rollout rollout() does `env.call("_max_episode_steps")` (lerobot_eval.py:157) to know when to stop stepping. LiberoEnv and MetaworldEnv set this attribute; RoboTwinEnv was tracking the limit under `episode_length` only, so the call raised AttributeError once CuRobo finished warming up. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 20:00:36 +02:00
Pepijn	61a0269560	fix(robotwin): align observation_space dims with D435 camera output lerobot_eval crashed in gym.vector's SyncVectorEnv.reset with: ValueError: Output array is the wrong shape because RoboTwinEnvConfig declared observation_space = (480, 640, 3) but task_config/demo_clean.yml specifies head_camera_type=D435, which renders (240, 320, 3). gym.vector.concatenate pre-allocates a buffer from the declared space, so the first np.stack raises on shape mismatch. Changes: - Config defaults now 240×320 (the D435 dims in _camera_config.yml), with a comment pointing at the source of truth. - RoboTwinEnv.__init__ accepts observation_height/width as Optional and falls back to setup_kwargs["head_camera_h/w"] so the env is self-consistent even if the config is not in sync. - Config camera_names / features_map use the actual aloha-agilex camera names (head_camera, left_camera, right_camera). Drops the stale "front_camera" and "left_wrist"/"right_wrist" entries that never matched anything RoboTwin exposes. - CI workflow's rename_map updated to match the new camera names. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 19:35:01 +02:00
Pepijn	c2160ca86e	refactor(robotwin): drop defensive dict guards, cache black fallback frame _get_obs was guarding every dict access with isinstance(..., dict) in case RoboTwin's get_obs returned something else — but the API contract (envs/_base_task.py:437) always returns a dict, so the guards were silently masking real failures behind plausible-looking zero observations. Drop them. Also: - Cache a single black fallback frame in __init__ instead of allocating a fresh np.zeros((H, W, 3), uint8) for every missing camera on every step — the "camera not exposed" set is static per env. - Only allocate the zero joint_state on the fallback path (not unconditionally before the real value overwrites it). - Replace .flatten() with .ravel() (no copy when already 1-D). - Fold the nested-dict schema comment and two identical torch.enable_grad() rationales into a single Autograd section in the class docstring. - Fix stale `left_wrist` camera name in the observation docstring. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 18:39:40 +02:00
Pepijn	f40b30202b	fix(robotwin): read nested get_obs() output and use aloha-agilex camera names RoboTwin's base_task.get_obs() returns a nested dict: {"observation": {cam: {"rgb": ..., "intrinsic_matrix": ...}}, "joint_action": {"left_arm": ..., "left_gripper": ..., "right_arm": ..., "right_gripper": ..., "vector": np.ndarray}, "endpose": {...}} Our _get_obs was reading raw["{cam}_rgb"] / raw["{cam}"] and raw["joint_action"] as if they were flat, so np.asarray(raw["joint_action"], dtype=float64) tripped on a dict and raised TypeError: float() argument must be a string or a real number, not 'dict' Fix: - Pull images from raw["observation"][cam]["rgb"] - Pull joint state from raw["joint_action"]["vector"] (the flat array) - Update the default camera tuple to (head_camera, left_camera, right_camera) to match RoboTwin's actual wrist-camera names (envs/camera/camera.py:135-151) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 18:25:08 +02:00
Pepijn	b06f134fe4	fix(robotwin): re-enable autograd for CuRobo planner warmup and take_action lerobot_eval wraps the full rollout in torch.no_grad() (lerobot_eval.py:566), but RoboTwin's setup_demo → load_robot → CuroboPlanner(...) runs motion_gen.warmup(), which invokes Newton's-method trajectory optimization. That optimizer calls cost.backward() internally, which raises RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn when autograd is disabled. take_action() hits the same planner path at every step. Wrap both setup_demo and take_action in torch.enable_grad() so CuRobo's optimizer can build its computation graph. Policy inference is unaffected — rollout()'s inner torch.inference_mode() block around select_action() is untouched, so we still don't allocate grad buffers during policy forward. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 17:39:21 +02:00
Pepijn	5558ea2207	feat(envs): add RoboTwin 2.0 benchmark integration - RoboTwinEnvConfig with 4-camera setup (head/front/left_wrist/right_wrist) - Docker image with SAPIEN, mplib, CuRobo, pytorch3d (Python 3.12) - CI workflow: 1-episode smoke eval with pepijn223/smolvla_robotwin - RoboTwinProcessorStep for state float32 casting - Camera rename_map: head_camera/front_camera/left_wrist -> camera1/2/3 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 16:47:44 +02:00
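Several of the commits above (f40b30202b, c2160ca86e, 1157fb11e6) describe how the wrapper reads RoboTwin's nested `get_obs()` output: images from `raw["observation"][cam]["rgb"]`, joint state from `raw["joint_action"]["vector"]`, a cached black frame for cameras that are not exposed, and a float32 cast. The sketch below illustrates that description only and is not the code in `src/lerobot/envs/robotwin.py`. The function name `convert_obs`, the output layout (`pixels` / `agent_pos`), and the 14-element joint vector are assumptions made for illustration.

```python
import numpy as np

# Camera names and D435 frame size as stated in the commit messages.
CAMERAS = ("head_camera", "left_camera", "right_camera")
HEIGHT, WIDTH = 240, 320


def convert_obs(raw, cameras=CAMERAS, height=HEIGHT, width=WIDTH):
    """Flatten a nested get_obs()-style dict (hypothetical helper).

    Missing cameras fall back to one shared black frame; the joint vector is
    cast to float32 and returned as a 1-D array.
    """
    # One fallback frame shared by all missing cameras (treated as read-only).
    black = np.zeros((height, width, 3), dtype=np.uint8)
    observation = raw["observation"]

    pixels = {}
    for cam in cameras:
        if cam in observation:
            pixels[cam] = np.asarray(observation[cam]["rgb"], dtype=np.uint8)
        else:
            pixels[cam] = black

    # ravel() avoids a copy when the vector is already 1-D.
    agent_pos = np.asarray(raw["joint_action"]["vector"], dtype=np.float32).ravel()
    return {"pixels": pixels, "agent_pos": agent_pos}


# Example input: left_camera is deliberately absent; the vector length is arbitrary.
raw = {
    "observation": {
        "head_camera": {"rgb": np.full((HEIGHT, WIDTH, 3), 42, dtype=np.uint8)},
        "right_camera": {"rgb": np.full((HEIGHT, WIDTH, 3), 7, dtype=np.uint8)},
    },
    "joint_action": {"vector": np.arange(14, dtype=np.float64)},
}
obs = convert_obs(raw)
```

Indexing the nested keys directly, without `isinstance` guards, follows the rationale in c2160ca86e: a malformed observation raises immediately instead of being replaced by plausible-looking zeros.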