Compare commits

..

20 Commits

Author SHA1 Message Date
Steven Palma
62135d846f license + peft-dep + init groot + flat import layering utils dataset 2026-04-12 16:43:24 +02:00
Steven Palma
718d2fc59d fix fast tests 2026-04-12 14:46:23 +02:00
Steven Palma
8e75f61b31 update fast ci tests 2026-04-12 14:11:50 +02:00
Steven Palma
2bf33ccb98 fix leaking imports in minimal testing 2026-04-12 13:52:45 +02:00
Steven Palma
27292a3432 complete migration 2026-04-12 12:19:26 +02:00
Steven Palma
87528186c0 address minor review comments 2026-04-12 11:27:59 +02:00
Steven Palma
8ef4d78178 upgrade uv lock 2026-04-12 10:40:11 +02:00
Steven Palma
5ccf99b930 add explicit transitative deps 2026-04-12 10:20:44 +02:00
Steven Palma
1624fc1797 is_available checks centralized 2026-04-12 09:56:03 +02:00
Steven Palma
b132e2b5d6 docs and examples imports update 2026-04-12 09:43:13 +02:00
Steven Palma
89b4652de0 fix diffusion tests ci 2026-04-11 21:23:12 +02:00
Steven Palma
5940126fb5 fix test imports 2026-04-11 21:07:53 +02:00
Steven Palma
c9636bb53f fix policy imports 2026-04-11 20:39:03 +02:00
Steven Palma
af0d72bd42 refactor import fixes 2026-04-11 18:02:59 +02:00
Steven Palma
d626964119 big imports refactor 2026-04-11 15:03:24 +02:00
Steven Palma
964acd0151 refactor: more changes 2026-04-11 11:13:15 +02:00
Steven Palma
4767f51971 Merge branch 'main' into feat/minimal_default_install 2026-04-10 20:57:38 +02:00
Steven Palma
4c39981908 refactor: minor improvements 2026-04-10 18:31:07 +02:00
Steven Palma
882a6b0965 refactor: several fixes 2026-04-10 15:35:31 +02:00
Steven Palma
e2381633cd feat(dependecies): minimal default tag install 2026-04-10 14:22:13 +02:00
286 changed files with 4986 additions and 28928 deletions

View File

@@ -2,6 +2,11 @@
Short, imperative summary (e.g., "fix(robots): handle None in sensor parser"). See [CONTRIBUTING.md](../CONTRIBUTING.md) for PR conventions.
## Type / Scope
- **Type**: (Bug | Feature | Docs | Performance | Test | CI | Chore)
- **Scope**: (optional — name of module or package affected)
## Summary / Motivation
- One-paragraph description of what changes and why.
@@ -14,14 +19,28 @@ Short, imperative summary (e.g., "fix(robots): handle None in sensor parser"). S
## What changed
- Short, concrete bullets explaining the functional changes (how the behavior or output differs now).
- Short, concrete bullets of the modifications (files/behaviour).
- Short note if this introduces breaking changes and migration steps.
## How was this tested (or how to run locally)
- Tests added: list new tests or test files. `pytest -q tests/ -k <keyword>`
- Tests added: list new tests or test files.
- Manual checks / dataset runs performed.
- Instructions for the reviewer for reproducing with a quick example or CLI (if applicable)
- Instructions for the reviewer
Example:
- Ran the relevant tests:
```bash
pytest -q tests/ -k <keyword>
```
- Reproduce with a quick example or CLI (if applicable):
```bash
lerobot-train --some.option=true
```
## Checklist (required before merge)
@@ -29,7 +48,6 @@ Short, imperative summary (e.g., "fix(robots): handle None in sensor parser"). S
- [ ] All tests pass locally (`pytest`)
- [ ] Documentation updated
- [ ] CI is green
- [ ] Community Review: I have reviewed another contributor's open PR and linked it here: # (insert PR number/link)
## Reviewer notes

View File

@@ -1,951 +0,0 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Integration tests: build an isolated Docker image per benchmark and run a
# 1-episode smoke eval. Each benchmark gets its own image so incompatible
# dependency trees (e.g. hf-libero vs metaworld==3.0.0) can never collide.
#
# To add a new benchmark:
# 1. Add docker/Dockerfile.benchmark.<name> (install only lerobot[<name>])
# 2. Copy one of the jobs below and adjust the image name and eval command.
name: Benchmark Integration Tests
on:
# Run manually from the Actions tab
workflow_dispatch:
# Run every Monday at 02:00 UTC.
schedule:
- cron: "0 2 * * 1"
push:
branches:
- main
paths:
- "src/lerobot/envs/**"
- "src/lerobot/scripts/lerobot_eval.py"
- "docker/Dockerfile.benchmark.*"
- ".github/workflows/benchmark_tests.yml"
- "pyproject.toml"
pull_request:
branches:
- main
paths:
- "src/lerobot/envs/**"
- "src/lerobot/scripts/lerobot_eval.py"
- "docker/Dockerfile.benchmark.*"
- ".github/workflows/benchmark_tests.yml"
- "pyproject.toml"
permissions:
contents: read
env:
UV_VERSION: "0.8.0"
PYTHON_VERSION: "3.12"
# Cancel in-flight runs for the same branch/PR.
concurrency:
group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
cancel-in-progress: true
jobs:
# ── LIBERO ────────────────────────────────────────────────────────────────
# Isolated image: lerobot[libero] only (hf-libero, dm-control, mujoco chain)
libero-integration-test:
name: Libero — build image + 1-episode eval
runs-on:
group: aws-g6-4xlarge-plus
env:
HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
persist-credentials: false
lfs: true
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
with:
cache-binary: false
- name: Login to Docker Hub
if: ${{ env.DOCKERHUB_USERNAME != '' }}
uses: docker/login-action@v3 # zizmor: ignore[unpinned-uses]
with:
username: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
password: ${{ secrets.DOCKERHUB_LEROBOT_PASSWORD }}
env:
DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
# Build the benchmark-specific image. The Dockerfile separates dep-install
# from source-copy, so code-only changes skip the slow uv-sync layer
# when the runner has a warm Docker daemon cache.
- name: Build Libero benchmark image
uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
with:
context: .
file: docker/Dockerfile.benchmark.libero
push: false
load: true
tags: lerobot-benchmark-libero:ci
- name: Run Libero smoke eval (1 episode)
if: env.HF_USER_TOKEN != ''
run: |
# Named container (no --rm) so we can docker cp artifacts out.
# Output to /tmp inside the container — /artifacts doesn't exist
# and user_lerobot cannot create root-level dirs.
docker run --name libero-eval --gpus all \
--shm-size=4g \
-e HF_HOME=/tmp/hf \
-e HF_USER_TOKEN="${HF_USER_TOKEN}" \
-e HF_HUB_DOWNLOAD_TIMEOUT=300 \
lerobot-benchmark-libero:ci \
bash -c "
hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
lerobot-eval \
--policy.path=lerobot/smolvla_libero \
--env.type=libero \
--env.task=libero_spatial \
--eval.batch_size=1 \
--eval.n_episodes=1 \
--eval.use_async_envs=false \
--policy.device=cuda \
'--env.camera_name_mapping={\"agentview_image\": \"camera1\", \"robot0_eye_in_hand_image\": \"camera2\"}' \
--policy.empty_cameras=1 \
--output_dir=/tmp/eval-artifacts
python scripts/ci/extract_task_descriptions.py \
--env libero --task libero_spatial \
--output /tmp/eval-artifacts/task_descriptions.json
"
- name: Copy Libero artifacts from container
if: always()
run: |
mkdir -p /tmp/libero-artifacts
docker cp libero-eval:/tmp/eval-artifacts/. /tmp/libero-artifacts/ 2>/dev/null || true
docker rm -f libero-eval || true
- name: Parse Libero eval metrics
if: always()
run: |
python3 scripts/ci/parse_eval_metrics.py \
--artifacts-dir /tmp/libero-artifacts \
--env libero \
--task libero_spatial \
--policy lerobot/smolvla_libero
- name: Upload Libero rollout video
if: always()
uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
with:
name: libero-rollout-video
path: /tmp/libero-artifacts/videos/
if-no-files-found: warn
- name: Upload Libero eval metrics
if: always()
uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
with:
name: libero-metrics
path: /tmp/libero-artifacts/metrics.json
if-no-files-found: warn
# ── LIBERO TRAIN+EVAL SMOKE ──────────────────────────────────────────────
# Train SmolVLA for 1 step (batch_size=1, dataset episode 0 only) then
# immediately runs eval inside the training loop (eval_freq=1, 1 episode).
# Tests the full train→eval-within-training pipeline end-to-end.
- name: Run Libero train+eval smoke (1 step, eval_freq=1)
if: env.HF_USER_TOKEN != ''
run: |
docker run --name libero-train-smoke --gpus all \
--shm-size=4g \
-e HF_HOME=/tmp/hf \
-e HF_USER_TOKEN="${HF_USER_TOKEN}" \
-e HF_HUB_DOWNLOAD_TIMEOUT=300 \
lerobot-benchmark-libero:ci \
bash -c "
hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
accelerate launch --num_processes=1 \$(which lerobot-train) \
--policy.path=lerobot/smolvla_base \
--policy.load_vlm_weights=true \
--policy.scheduler_decay_steps=25000 \
--policy.freeze_vision_encoder=false \
--policy.train_expert_only=false \
--dataset.repo_id=lerobot/libero \
--dataset.episodes=[0] \
--dataset.use_imagenet_stats=false \
--env.type=libero \
--env.task=libero_spatial \
'--env.camera_name_mapping={\"agentview_image\": \"camera1\", \"robot0_eye_in_hand_image\": \"camera2\"}' \
--policy.empty_cameras=1 \
--output_dir=/tmp/train-smoke \
--steps=1 \
--batch_size=1 \
--eval_freq=1 \
--eval.n_episodes=1 \
--eval.batch_size=1 \
--eval.use_async_envs=false \
--save_freq=1 \
--policy.push_to_hub=false \
'--rename_map={\"observation.images.image\": \"observation.images.camera1\", \"observation.images.image2\": \"observation.images.camera2\"}'
"
- name: Copy Libero train-smoke artifacts from container
if: always()
run: |
mkdir -p /tmp/libero-train-smoke-artifacts
docker cp libero-train-smoke:/tmp/train-smoke/. /tmp/libero-train-smoke-artifacts/ 2>/dev/null || true
docker rm -f libero-train-smoke || true
- name: Upload Libero train-smoke eval video
if: always()
uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
with:
name: libero-train-smoke-video
path: /tmp/libero-train-smoke-artifacts/eval/
if-no-files-found: warn
# ── METAWORLD ─────────────────────────────────────────────────────────────
# Isolated image: lerobot[metaworld] only (metaworld==3.0.0, mujoco>=3 chain)
metaworld-integration-test:
name: MetaWorld — build image + 1-episode eval
runs-on:
group: aws-g6-4xlarge-plus
env:
HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
persist-credentials: false
lfs: true
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
with:
cache-binary: false
- name: Login to Docker Hub
if: ${{ env.DOCKERHUB_USERNAME != '' }}
uses: docker/login-action@v3 # zizmor: ignore[unpinned-uses]
with:
username: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
password: ${{ secrets.DOCKERHUB_LEROBOT_PASSWORD }}
env:
DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
- name: Build MetaWorld benchmark image
uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
with:
context: .
file: docker/Dockerfile.benchmark.metaworld
push: false
load: true
tags: lerobot-benchmark-metaworld:ci
- name: Run MetaWorld smoke eval (1 episode)
if: env.HF_USER_TOKEN != ''
run: |
docker run --name metaworld-eval --gpus all \
--shm-size=4g \
-e HF_HOME=/tmp/hf \
-e HF_USER_TOKEN="${HF_USER_TOKEN}" \
-e HF_HUB_DOWNLOAD_TIMEOUT=300 \
lerobot-benchmark-metaworld:ci \
bash -c "
hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
lerobot-eval \
--policy.path=lerobot/smolvla_metaworld \
--env.type=metaworld \
--env.task=metaworld-push-v3 \
--eval.batch_size=1 \
--eval.n_episodes=1 \
--eval.use_async_envs=false \
--policy.device=cuda \
'--rename_map={\"observation.image\": \"observation.images.camera1\"}' \
--policy.empty_cameras=2 \
--output_dir=/tmp/eval-artifacts
python scripts/ci/extract_task_descriptions.py \
--env metaworld --task metaworld-push-v3 \
--output /tmp/eval-artifacts/task_descriptions.json
"
- name: Copy MetaWorld artifacts from container
if: always()
run: |
mkdir -p /tmp/metaworld-artifacts
docker cp metaworld-eval:/tmp/eval-artifacts/. /tmp/metaworld-artifacts/ 2>/dev/null || true
docker rm -f metaworld-eval || true
- name: Parse MetaWorld eval metrics
if: always()
run: |
python3 scripts/ci/parse_eval_metrics.py \
--artifacts-dir /tmp/metaworld-artifacts \
--env metaworld \
--task metaworld-push-v3 \
--policy lerobot/smolvla_metaworld
- name: Upload MetaWorld rollout video
if: always()
uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
with:
name: metaworld-rollout-video
path: /tmp/metaworld-artifacts/videos/
if-no-files-found: warn
- name: Upload MetaWorld eval metrics
if: always()
uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
with:
name: metaworld-metrics
path: /tmp/metaworld-artifacts/metrics.json
if-no-files-found: warn
# ── ROBOTWIN 2.0 ──────────────────────────────────────────────────────────
# Isolated image: full RoboTwin 2.0 stack — SAPIEN, mplib, CuRobo,
# pytorch3d, + simulation assets (~4 GB).
# Build takes ~20 min on first run; subsequent runs hit the layer cache.
# Requires an NVIDIA GPU runner with CUDA 12.1 drivers.
robotwin-integration-test:
name: RoboTwin 2.0 — build image + 1-episode eval
runs-on:
group: aws-g6-4xlarge-plus
env:
HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}
ROBOTWIN_POLICY: lerobot/smolvla_robotwin
ROBOTWIN_TASKS: beat_block_hammer,click_bell,handover_block,stack_blocks_two,click_alarmclock,open_microwave,adjust_bottle,lift_pot,stamp_seal,turn_switch
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
persist-credentials: false
lfs: true
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
with:
cache-binary: false
- name: Login to Docker Hub
if: ${{ env.DOCKERHUB_USERNAME != '' }}
uses: docker/login-action@v3 # zizmor: ignore[unpinned-uses]
with:
username: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
password: ${{ secrets.DOCKERHUB_LEROBOT_PASSWORD }}
env:
DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
# Build the full-install image: SAPIEN, mplib, CuRobo, pytorch3d +
# simulation assets (~4 GB). Layer cache lives in the runner's local
# Docker daemon — reused across re-runs on the same machine.
- name: Build RoboTwin 2.0 benchmark image
uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
with:
context: .
file: docker/Dockerfile.benchmark.robotwin
push: false
load: true
tags: lerobot-benchmark-robotwin:ci
cache-from: type=local,src=/tmp/.buildx-cache-robotwin
cache-to: type=local,dest=/tmp/.buildx-cache-robotwin,mode=max
- name: Run RoboTwin 2.0 smoke eval (10 tasks, 1 episode each)
if: env.HF_USER_TOKEN != ''
run: |
# Named container (no --rm) so we can docker cp artifacts out.
docker run --name robotwin-eval --gpus all \
--shm-size=4g \
-e HF_HOME=/tmp/hf \
-e HF_USER_TOKEN="${HF_USER_TOKEN}" \
-e ROBOTWIN_POLICY="${ROBOTWIN_POLICY}" \
-e ROBOTWIN_TASKS="${ROBOTWIN_TASKS}" \
lerobot-benchmark-robotwin:ci \
bash -c "
hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
cd /opt/robotwin && lerobot-eval \
--policy.path=\"\$ROBOTWIN_POLICY\" \
--env.type=robotwin \
--env.task=\"\$ROBOTWIN_TASKS\" \
--env.max_parallel_tasks=5 \
--eval.batch_size=1 \
--eval.n_episodes=1 \
--eval.use_async_envs=false \
--policy.device=cuda \
'--rename_map={\"observation.images.head_camera\": \"observation.images.camera1\", \"observation.images.left_camera\": \"observation.images.camera2\", \"observation.images.right_camera\": \"observation.images.camera3\"}' \
--output_dir=/tmp/eval-artifacts
python /lerobot/scripts/ci/extract_task_descriptions.py \
--env robotwin \
--task \"\$ROBOTWIN_TASKS\" \
--output /tmp/eval-artifacts/task_descriptions.json
"
- name: Copy RoboTwin artifacts from container
if: always()
run: |
mkdir -p /tmp/robotwin-artifacts
docker cp robotwin-eval:/tmp/eval-artifacts/. /tmp/robotwin-artifacts/ 2>/dev/null || true
docker rm -f robotwin-eval || true
- name: Parse RoboTwin eval metrics
if: always()
run: |
python3 scripts/ci/parse_eval_metrics.py \
--artifacts-dir /tmp/robotwin-artifacts \
--env robotwin \
--task "${ROBOTWIN_TASKS}" \
--policy "${ROBOTWIN_POLICY}"
- name: Upload RoboTwin rollout video
if: always()
uses: actions/upload-artifact@v4
with:
name: robotwin-rollout-video
path: /tmp/robotwin-artifacts/videos/
if-no-files-found: warn
- name: Upload RoboTwin eval metrics
if: always()
uses: actions/upload-artifact@v4
with:
name: robotwin-metrics
path: /tmp/robotwin-artifacts/metrics.json
if-no-files-found: warn
# ── ROBOCASA365 ──────────────────────────────────────────────────────────
# Isolated image: robocasa + robosuite installed manually as editable
# clones (no `lerobot[robocasa]` extra — robocasa's setup.py pins
# `lerobot==0.3.3`, which would shadow this repo's lerobot).
robocasa-integration-test:
name: RoboCasa365 — build image + 1-episode eval
runs-on:
group: aws-g6-4xlarge-plus
env:
HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
persist-credentials: false
lfs: true
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
with:
cache-binary: false
- name: Login to Docker Hub
if: ${{ env.DOCKERHUB_USERNAME != '' }}
uses: docker/login-action@v3 # zizmor: ignore[unpinned-uses]
with:
username: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
password: ${{ secrets.DOCKERHUB_LEROBOT_PASSWORD }}
env:
DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
- name: Build RoboCasa365 benchmark image
uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
with:
context: .
file: docker/Dockerfile.benchmark.robocasa
push: false
load: true
tags: lerobot-benchmark-robocasa:ci
- name: Run RoboCasa365 smoke eval (10 atomic tasks, 1 episode each)
if: env.HF_USER_TOKEN != ''
run: |
docker run --name robocasa-eval --gpus all \
--shm-size=4g \
-e HF_HOME=/tmp/hf \
-e HF_USER_TOKEN="${HF_USER_TOKEN}" \
-e HF_HUB_DOWNLOAD_TIMEOUT=300 \
-e MUJOCO_GL=egl \
lerobot-benchmark-robocasa:ci \
bash -c "
hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
lerobot-eval \
--policy.path=lerobot/smolvla_robocasa \
--env.type=robocasa \
--env.task=CloseFridge,OpenCabinet,OpenDrawer,TurnOnMicrowave,TurnOffStove,CloseToasterOvenDoor,SlideDishwasherRack,TurnOnSinkFaucet,NavigateKitchen,TurnOnElectricKettle \
--env.max_parallel_tasks=5 \
--eval.batch_size=1 \
--eval.n_episodes=1 \
--eval.use_async_envs=false \
--policy.device=cuda \
'--rename_map={\"observation.images.robot0_agentview_left\": \"observation.images.camera1\", \"observation.images.robot0_eye_in_hand\": \"observation.images.camera2\", \"observation.images.robot0_agentview_right\": \"observation.images.camera3\"}' \
--output_dir=/tmp/eval-artifacts
python scripts/ci/extract_task_descriptions.py \
--env robocasa \
--task CloseFridge,OpenCabinet,OpenDrawer,TurnOnMicrowave,TurnOffStove,CloseToasterOvenDoor,SlideDishwasherRack,TurnOnSinkFaucet,NavigateKitchen,TurnOnElectricKettle \
--output /tmp/eval-artifacts/task_descriptions.json
"
- name: Copy RoboCasa365 artifacts from container
if: always()
run: |
mkdir -p /tmp/robocasa-artifacts
docker cp robocasa-eval:/tmp/eval-artifacts/. /tmp/robocasa-artifacts/ 2>/dev/null || true
docker rm -f robocasa-eval || true
- name: Parse RoboCasa365 eval metrics
if: always()
run: |
python3 scripts/ci/parse_eval_metrics.py \
--artifacts-dir /tmp/robocasa-artifacts \
--env robocasa \
--task atomic_smoke_10 \
--policy lerobot/smolvla_robocasa
- name: Upload RoboCasa365 rollout video
if: always()
uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
with:
name: robocasa-rollout-video
path: /tmp/robocasa-artifacts/videos/
if-no-files-found: warn
- name: Upload RoboCasa365 eval metrics
if: always()
uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
with:
name: robocasa-metrics
path: /tmp/robocasa-artifacts/metrics.json
if-no-files-found: warn
# ── ROBOCEREBRA ───────────────────────────────────────────────────────────
# Reuses the LIBERO simulator (libero_10 suite) with RoboCerebra camera
# defaults (image/wrist_image). The image is layered on
# huggingface/lerobot-gpu, which already ships [libero] as part of [all].
robocerebra-integration-test:
name: RoboCerebra — build image + 1-episode eval
runs-on:
group: aws-g6-4xlarge-plus
env:
HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
persist-credentials: false
lfs: true
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
with:
cache-binary: false
- name: Login to Docker Hub
if: ${{ env.DOCKERHUB_USERNAME != '' }}
uses: docker/login-action@v3 # zizmor: ignore[unpinned-uses]
with:
username: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
password: ${{ secrets.DOCKERHUB_LEROBOT_PASSWORD }}
env:
DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
- name: Build RoboCerebra benchmark image
uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
with:
context: .
file: docker/Dockerfile.benchmark.robocerebra
push: false
load: true
tags: lerobot-benchmark-robocerebra:ci
cache-from: type=local,src=/tmp/.buildx-cache-robocerebra
cache-to: type=local,dest=/tmp/.buildx-cache-robocerebra,mode=max
- name: Run RoboCerebra smoke eval (1 episode)
if: env.HF_USER_TOKEN != ''
run: |
docker run --name robocerebra-eval --gpus all \
--shm-size=4g \
-e HF_HOME=/tmp/hf \
-e HF_USER_TOKEN="${HF_USER_TOKEN}" \
-e HF_HUB_DOWNLOAD_TIMEOUT=300 \
-e LIBERO_DATA_FOLDER=/tmp/libero_data \
lerobot-benchmark-robocerebra:ci \
bash -c "
hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
lerobot-eval \
--policy.path=lerobot/smolvla_robocerebra \
--env.type=libero \
--env.task=libero_10 \
--env.fps=20 \
--env.obs_type=pixels_agent_pos \
--env.observation_height=256 \
--env.observation_width=256 \
'--env.camera_name_mapping={\"agentview_image\": \"image\", \"robot0_eye_in_hand_image\": \"wrist_image\"}' \
--eval.batch_size=1 \
--eval.n_episodes=1 \
--eval.use_async_envs=false \
--policy.device=cuda \
'--rename_map={\"observation.images.image\": \"observation.images.camera1\", \"observation.images.wrist_image\": \"observation.images.camera2\"}' \
--policy.empty_cameras=1 \
--output_dir=/tmp/eval-artifacts
python scripts/ci/extract_task_descriptions.py \
--env libero --task libero_10 \
--output /tmp/eval-artifacts/task_descriptions.json
"
- name: Copy RoboCerebra artifacts from container
if: always()
run: |
mkdir -p /tmp/robocerebra-artifacts
docker cp robocerebra-eval:/tmp/eval-artifacts/. /tmp/robocerebra-artifacts/ 2>/dev/null || true
docker rm -f robocerebra-eval || true
- name: Parse RoboCerebra eval metrics
if: always()
run: |
python3 scripts/ci/parse_eval_metrics.py \
--artifacts-dir /tmp/robocerebra-artifacts \
--env robocerebra \
--task libero_10 \
--policy lerobot/smolvla_robocerebra
- name: Upload RoboCerebra rollout video
if: always()
uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
with:
name: robocerebra-rollout-video
path: /tmp/robocerebra-artifacts/videos/
if-no-files-found: warn
- name: Upload RoboCerebra eval metrics
if: always()
uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
with:
name: robocerebra-metrics
path: /tmp/robocerebra-artifacts/metrics.json
if-no-files-found: warn
# ── ROBOMME ───────────────────────────────────────────────────────────────
# Isolated image: mani-skill/SAPIEN/Vulkan chain with gymnasium and numpy
# overrides (robomme can't be a pyproject extra due to numpy<2 pin).
robomme-integration-test:
name: RoboMME — build image + 1-episode eval
runs-on:
group: aws-g6-4xlarge-plus
env:
HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}
ROBOMME_POLICY: lerobot/smolvla_robomme
ROBOMME_TASKS: PickXtimes,BinFill,StopCube,MoveCube,InsertPeg,SwingXtimes,VideoUnmask,ButtonUnmask,PickHighlight,PatternLock
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
persist-credentials: false
lfs: true
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
with:
cache-binary: false
- name: Login to Docker Hub
if: ${{ env.DOCKERHUB_USERNAME != '' }}
uses: docker/login-action@v3 # zizmor: ignore[unpinned-uses]
with:
username: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
password: ${{ secrets.DOCKERHUB_LEROBOT_PASSWORD }}
env:
DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
- name: Build RoboMME benchmark image
uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
with:
context: .
file: docker/Dockerfile.benchmark.robomme
push: false
load: true
tags: lerobot-benchmark-robomme:ci
- name: Run RoboMME smoke eval (10 tasks, 1 episode each)
if: env.HF_USER_TOKEN != ''
run: |
docker run --name robomme-eval --gpus all \
--shm-size=4g \
-e HF_HOME=/tmp/hf \
-e HF_USER_TOKEN="${HF_USER_TOKEN}" \
-e HF_HUB_DOWNLOAD_TIMEOUT=300 \
-e ROBOMME_POLICY="${ROBOMME_POLICY}" \
-e ROBOMME_TASKS="${ROBOMME_TASKS}" \
lerobot-benchmark-robomme:ci \
bash -c "
hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
lerobot-eval \
--policy.path=\"\$ROBOMME_POLICY\" \
--env.type=robomme \
--env.task=\"\$ROBOMME_TASKS\" \
--env.dataset_split=test \
--env.task_ids=[0] \
--env.max_parallel_tasks=5 \
--eval.batch_size=1 \
--eval.n_episodes=1 \
--eval.use_async_envs=false \
--policy.device=cuda \
'--rename_map={\"observation.images.image\": \"observation.images.camera1\", \"observation.images.wrist_image\": \"observation.images.camera2\"}' \
--policy.empty_cameras=3 \
--output_dir=/tmp/eval-artifacts
python scripts/ci/extract_task_descriptions.py \
--env robomme --task \"\$ROBOMME_TASKS\" \
--output /tmp/eval-artifacts/task_descriptions.json
"
- name: Copy RoboMME artifacts from container
if: always()
run: |
mkdir -p /tmp/robomme-artifacts
docker cp robomme-eval:/tmp/eval-artifacts/. /tmp/robomme-artifacts/ 2>/dev/null || true
docker rm -f robomme-eval || true
- name: Parse RoboMME eval metrics
if: always()
run: |
python3 scripts/ci/parse_eval_metrics.py \
--artifacts-dir /tmp/robomme-artifacts \
--env robomme \
--task "${ROBOMME_TASKS}" \
--policy "${ROBOMME_POLICY}"
- name: Upload RoboMME rollout video
if: always()
uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
with:
name: robomme-rollout-video
path: /tmp/robomme-artifacts/videos/
if-no-files-found: warn
- name: Upload RoboMME eval metrics
if: always()
uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
with:
name: robomme-metrics
path: /tmp/robomme-artifacts/metrics.json
if-no-files-found: warn
# ── LIBERO-plus ───────────────────────────────────────────────────────────
# Isolated image: LIBERO-plus fork cloned into /home/user_lerobot on top of
# huggingface/lerobot-gpu (see docker/Dockerfile.benchmark.libero_plus).
libero-plus-integration-test:
name: LIBERO-plus — build image + 1-episode eval
runs-on:
group: aws-g6-4xlarge-plus
env:
HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}
LIBERO_PLUS_SUITE: libero_spatial
LIBERO_PLUS_POLICY: lerobot/smolvla_libero_plus
LIBERO_PLUS_TASK_IDS: "[0,100,260,500,1000,1500,2000,2400]"
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
persist-credentials: false
lfs: true
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
with:
cache-binary: false
- name: Login to Docker Hub
if: ${{ env.DOCKERHUB_USERNAME != '' }}
uses: docker/login-action@v3 # zizmor: ignore[unpinned-uses]
with:
username: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
password: ${{ secrets.DOCKERHUB_LEROBOT_PASSWORD }}
env:
DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
- name: Build LIBERO-plus benchmark image
uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
with:
context: .
file: docker/Dockerfile.benchmark.libero_plus
push: false
load: true
tags: lerobot-benchmark-libero-plus:ci
cache-from: type=local,src=/tmp/.buildx-cache-libero-plus
cache-to: type=local,dest=/tmp/.buildx-cache-libero-plus,mode=max
- name: Run LIBERO-plus smoke eval (1 episode)
if: env.HF_USER_TOKEN != ''
run: |
docker run --name libero-plus-eval --gpus all \
--shm-size=4g \
-e HF_HOME=/tmp/hf \
-e HF_USER_TOKEN="${HF_USER_TOKEN}" \
-e HF_HUB_DOWNLOAD_TIMEOUT=300 \
-e LIBERO_PLUS_SUITE="${LIBERO_PLUS_SUITE}" \
-e LIBERO_PLUS_POLICY="${LIBERO_PLUS_POLICY}" \
-e LIBERO_PLUS_TASK_IDS="${LIBERO_PLUS_TASK_IDS}" \
lerobot-benchmark-libero-plus:ci \
bash -c "
hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
lerobot-eval \
--policy.path=\"\$LIBERO_PLUS_POLICY\" \
--env.type=libero_plus \
--env.task=\"\$LIBERO_PLUS_SUITE\" \
--env.task_ids=\"\$LIBERO_PLUS_TASK_IDS\" \
--env.max_parallel_tasks=5 \
--eval.batch_size=1 \
--eval.n_episodes=1 \
--eval.use_async_envs=false \
--policy.device=cuda \
'--env.camera_name_mapping={\"agentview_image\": \"camera1\", \"robot0_eye_in_hand_image\": \"camera2\"}' \
--policy.empty_cameras=1 \
--output_dir=/tmp/eval-artifacts
python scripts/ci/extract_task_descriptions.py \
--env libero_plus --task \"\$LIBERO_PLUS_SUITE\" \
--output /tmp/eval-artifacts/task_descriptions.json
"
- name: Copy LIBERO-plus artifacts from container
if: always()
run: |
mkdir -p /tmp/libero-plus-artifacts
docker cp libero-plus-eval:/tmp/eval-artifacts/. /tmp/libero-plus-artifacts/ 2>/dev/null || true
docker rm -f libero-plus-eval || true
- name: Parse LIBERO-plus eval metrics
if: always()
run: |
python3 scripts/ci/parse_eval_metrics.py \
--artifacts-dir /tmp/libero-plus-artifacts \
--env libero_plus \
--task "${LIBERO_PLUS_SUITE}" \
--policy "${LIBERO_PLUS_POLICY}"
- name: Upload LIBERO-plus rollout video
if: always()
uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
with:
name: libero-plus-rollout-video
path: /tmp/libero-plus-artifacts/videos/
if-no-files-found: warn
- name: Upload LIBERO-plus eval metrics
if: always()
uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
with:
name: libero-plus-metrics
path: /tmp/libero-plus-artifacts/metrics.json
if-no-files-found: warn
# ── VLABENCH ─────────────────────────────────────────────────────────────
# Isolated image: lerobot[vlabench] only (VLABench, mujoco==3.2.2, dm-control chain)
vlabench-integration-test:
name: VLABench — build image + 1-episode eval
runs-on:
group: aws-g6-4xlarge-plus
env:
HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
persist-credentials: false
lfs: true
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
with:
cache-binary: false
- name: Login to Docker Hub
if: ${{ env.DOCKERHUB_USERNAME != '' }}
uses: docker/login-action@v3 # zizmor: ignore[unpinned-uses]
with:
username: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
password: ${{ secrets.DOCKERHUB_LEROBOT_PASSWORD }}
env:
DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
- name: Build VLABench benchmark image
uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
with:
context: .
file: docker/Dockerfile.benchmark.vlabench
push: false
load: true
tags: lerobot-benchmark-vlabench:ci
build-args: |
VLABENCH_ASSETS_REPO=lerobot/vlabench-assets
- name: Run VLABench smoke eval (10 tasks, 1 episode each)
if: env.HF_USER_TOKEN != ''
run: |
docker run --name vlabench-eval --gpus all \
--shm-size=4g \
-e HF_HOME=/tmp/hf \
-e HF_USER_TOKEN="${HF_USER_TOKEN}" \
-e HF_HUB_DOWNLOAD_TIMEOUT=300 \
-e MUJOCO_GL=egl \
lerobot-benchmark-vlabench:ci \
bash -c "
hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
lerobot-eval \
--policy.path=lerobot/smolvla_vlabench \
--env.type=vlabench \
--env.task=select_fruit,select_toy,select_book,select_painting,select_drink,select_ingredient,select_billiards,select_poker,add_condiment,insert_flower \
--env.episode_length=50 \
--env.max_parallel_tasks=5 \
--eval.batch_size=1 \
--eval.n_episodes=1 \
--eval.use_async_envs=false \
--policy.device=cuda \
'--rename_map={\"observation.images.image\": \"observation.images.camera1\", \"observation.images.second_image\": \"observation.images.camera2\", \"observation.images.wrist_image\": \"observation.images.camera3\"}' \
--output_dir=/tmp/eval-artifacts
python scripts/ci/extract_task_descriptions.py \
--env vlabench \
--task select_fruit,select_toy,select_book,select_painting,select_drink,select_ingredient,select_billiards,select_poker,add_condiment,insert_flower \
--output /tmp/eval-artifacts/task_descriptions.json
"
- name: Copy VLABench artifacts from container
if: always()
run: |
mkdir -p /tmp/vlabench-artifacts
docker cp vlabench-eval:/tmp/eval-artifacts/. /tmp/vlabench-artifacts/ 2>/dev/null || true
docker rm -f vlabench-eval || true
- name: Parse VLABench eval metrics
if: always()
run: |
python3 scripts/ci/parse_eval_metrics.py \
--artifacts-dir /tmp/vlabench-artifacts \
--env vlabench \
--task select_fruit,select_toy,select_book,select_painting,select_drink,select_ingredient,select_billiards,select_poker,add_condiment,insert_flower \
--policy lerobot/smolvla_vlabench
- name: Upload VLABench rollout video
if: always()
uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
with:
name: vlabench-rollout-video
path: /tmp/vlabench-artifacts/videos/
if-no-files-found: warn
- name: Upload VLABench eval metrics
if: always()
uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
with:
name: vlabench-metrics
path: /tmp/vlabench-artifacts/metrics.json
if-no-files-found: warn

View File

@@ -33,7 +33,7 @@ jobs:
github.event.workflow_run.event == 'pull_request' &&
github.event.workflow_run.conclusion == 'success' &&
github.repository == 'huggingface/lerobot'
uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@2430c1ec91d04667414e2fa31ecfc36c153ea391 # main
uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@90b4ee2c10b81b5c1a6367c4e6fc9e2fb510a7e3 # main
with:
package_name: lerobot
secrets:

View File

@@ -55,7 +55,7 @@ jobs:
github.repository == 'huggingface/lerobot'
permissions:
contents: read
uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@2430c1ec91d04667414e2fa31ecfc36c153ea391 # main
uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@90b4ee2c10b81b5c1a6367c4e6fc9e2fb510a7e3 # main
with:
commit_sha: ${{ github.sha }}
package: lerobot
@@ -78,7 +78,7 @@ jobs:
permissions:
contents: read
pull-requests: write
uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@2430c1ec91d04667414e2fa31ecfc36c153ea391 # main
uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@90b4ee2c10b81b5c1a6367c4e6fc9e2fb510a7e3 # main
with:
commit_sha: ${{ github.event.pull_request.head.sha }}
pr_number: ${{ github.event.number }}

View File

@@ -217,24 +217,6 @@ jobs:
- name: Run end-to-end tests
run: make test-end-to-end
slack-notification:
name: Slack Notification
needs: [cpu-tests, gpu-tests, upgrade-lock]
if: always() && needs.upgrade-lock.outputs.changed == 'true'
runs-on: ubuntu-latest
permissions:
contents: read
env:
CI_SLACK_CHANNEL: ${{ secrets.CI_SLACK_CHANNEL }}
steps:
- name: Post to a Slack channel
uses: huggingface/hf-workflows/.github/actions/post-slack@a88e7fa2eaee28de5a4d6142381b1fb792349b67 # main
with:
slack_channel: ${{ env.CI_SLACK_CHANNEL }}
title: "Results of the latest dependency tests (CPU + GPU)"
status: ${{ (needs.cpu-tests.result == 'success' && needs.gpu-tests.result == 'success') && 'success' || 'failure' }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
# This job creates or updates a PR with the upgraded lockfile
open-pr:
name: Open PR

View File

@@ -24,14 +24,14 @@ on:
env:
CLOSE_ISSUE_MESSAGE: >
This issue was closed because it has been stalled for 30 days with no activity.
This issue was closed because it has been stalled for 14 days with no activity.
Feel free to reopen if is still relevant, or to ping a collaborator if you have any questions.
CLOSE_PR_MESSAGE: >
This PR was closed because it has been stalled for 30 days with no activity.
This PR was closed because it has been stalled for 21 days with no activity.
Feel free to reopen if is still relevant, or to ping a collaborator if you have any questions.
WARN_ISSUE_MESSAGE: >
This issue has been automatically marked as stale because it has not had
recent activity (1 year). It will be closed if no further activity occurs.
recent activity (6 months). It will be closed if no further activity occurs.
Any change, comment or update to this issue will reset this count.
Thank you for your contributions.
WARN_PR_MESSAGE: >
@@ -59,10 +59,10 @@ jobs:
stale-pr-label: stale
exempt-issue-labels: never-stale
exempt-pr-labels: never-stale
days-before-issue-stale: 365
days-before-issue-close: 30
days-before-issue-stale: 180
days-before-issue-close: 14
days-before-pr-stale: 365
days-before-pr-close: 30
days-before-pr-close: 21
delete-branch: true
close-issue-message: ${{ env.CLOSE_ISSUE_MESSAGE }}
close-pr-message: ${{ env.CLOSE_PR_MESSAGE }}

View File

@@ -1,7 +1,5 @@
This file provides guidance to AI agents when working with code in this repository.
> **User-facing help → [`AGENT_GUIDE.md`](./AGENT_GUIDE.md)** (SO-101 setup, recording, picking a policy, training duration, eval — with copy-pasteable commands).
## Project Overview
LeRobot is a PyTorch-based library for real-world robotics, providing datasets, pretrained policies, and tools for training, evaluation, data collection, and robot control. It integrates with Hugging Face Hub for model/dataset sharing.

View File

@@ -1,410 +0,0 @@
# AGENT_GUIDE.md — LeRobot Helper for AI Agents & Users
This file is a practical, copy-paste-friendly companion for any AI agent (Cursor, Claude, ChatGPT, Codex, etc.) helping a user work with LeRobot. It complements [`AGENTS.md`](./AGENTS.md) (dev/contributor context) with **user-facing guidance**: how to start, what to train, how long, how to record, and how to calibrate an SO-101.
---
## 1. Start here — ask the user first (MANDATORY)
Before suggesting any command, an agent MUST ask the user at least these questions and wait for answers:
1. **What's your goal?** (e.g. "teach my SO-101 to fold a cloth", "train a policy on an existing HF dataset", "contribute a PR", "understand the codebase")
2. **What hardware do you have?**
- Robot: none / SO-100 / SO-101 / Koch / LeKiwi / Reachy / other
- Teleop: leader arm / phone / keyboard / gamepad / none
- Cameras: how many, resolution, fixed or moving?
3. **What machine will you train on?**
- GPU model + VRAM (e.g. "laptop 3060 6 GB", "RTX 4090 24 GB", "A100 80 GB", "CPU only")
- OS: macOS / Linux / Windows
4. **Skill level & time budget?** First time, some ML, experienced? Hours, days, a weekend?
5. **Do you already have a dataset?** Yes (HF repo id?) / no / want to record one
6. **How can I help right now?** (pick one concrete next step)
Only after you have answers, propose a concrete path. If something is ambiguous, ask again rather than guessing. Bias toward **the simplest thing that works** for the user's hardware and goal.
---
## 2. LeRobot in 60 seconds
LeRobot = **datasets + policies + envs + robot control**, unified by a small set of strong abstractions.
- **`LeRobotDataset`** — episode-aware dataset (video or images + actions + state), loadable from the Hub or disk.
- **Policies** (`ACT`, `Diffusion`, `SmolVLA`, `π0`, `π0.5`, `Wall-X`, `X-VLA`, `VQ-BeT`, `TD-MPC`, …) — all inherit `PreTrainedPolicy` and can be pushed/pulled from the Hub.
- **Processors** — small composable transforms between dataset → policy → robot.
- **Envs** (sim) and **Robots** (real) — same action/observation contract so code swaps cleanly.
- **CLI** — `lerobot-record`, `lerobot-train`, `lerobot-eval`, `lerobot-teleoperate`, `lerobot-calibrate`, `lerobot-find-port`, `lerobot-setup-motors`, `lerobot-replay`.
See [`AGENTS.md`](./AGENTS.md) for repo architecture.
---
## 3. Quickstart paths (pick one)
### Path A — "I have an SO-101 and want my first trained policy"
Go to §4 (SO-101 end-to-end), then §5 (data tips), then §6 (pick a policy — likely **ACT**), then §7 (how long), then §8 (eval).
### Path B — "No hardware, I want to train on an existing dataset"
Skip §4. Pick a policy in §6, pick a duration in §7, then run `lerobot-train` per §4.9 with a Hub `--dataset.repo_id` and an `--env.type` for eval. Finish with §8.
### Path C — "I just want to understand the codebase"
Read §2 above, then `AGENTS.md` "Architecture", then open `src/lerobot/policies/act/` and `src/lerobot/datasets/lerobot_dataset.py` as canonical examples.
---
## 4. SO-101 end-to-end cheat-sheet
Full details in [`docs/source/so101.mdx`](./docs/source/so101.mdx) and [`docs/source/il_robots.mdx`](./docs/source/il_robots.mdx). Minimum commands in order. Confirm arms are assembled + powered before issuing.
**4.1 Install**
```bash
pip install 'lerobot[feetech]' # SO-100/SO-101 motor stack
# pip install 'lerobot[all]' # everything
# pip install 'lerobot[aloha,pusht]' # specific features
# pip install 'lerobot[smolvla]' # add SmolVLA deps
git lfs install && git lfs pull
hf auth login # required to push datasets/policies
```
Contributors can alternatively use `uv sync --locked --extra feetech` (see `AGENTS.md`).
**4.2 Find USB ports** — run once per arm, unplug when prompted.
```bash
lerobot-find-port
```
macOS: `/dev/tty.usbmodem...`; Linux: `/dev/ttyACM0` (may need `sudo chmod 666 /dev/ttyACM0`).
**4.3 Setup motor IDs & baudrate** (one-time, per arm)
```bash
lerobot-setup-motors --robot.type=so101_follower --robot.port=<FOLLOWER_PORT>
lerobot-setup-motors --teleop.type=so101_leader --teleop.port=<LEADER_PORT>
```
**4.4 Calibrate** — center all joints, press Enter, sweep each joint through its full range. The `id` is the calibration key — reuse it everywhere.
```bash
lerobot-calibrate --robot.type=so101_follower --robot.port=<FOLLOWER_PORT> --robot.id=my_follower
lerobot-calibrate --teleop.type=so101_leader --teleop.port=<LEADER_PORT> --teleop.id=my_leader
```
**4.5 Teleoperate** (sanity check, no recording)
```bash
lerobot-teleoperate \
--robot.type=so101_follower --robot.port=<FOLLOWER_PORT> --robot.id=my_follower \
--teleop.type=so101_leader --teleop.port=<LEADER_PORT> --teleop.id=my_leader \
--robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
--display_data=true
```
> **Feetech timeout / comms error on SO-100 / SO-101?** Before touching software, check the **red motor LEDs** on the daisy chain.
>
> - **All steady red, gripper → base chain** → wiring OK.
> - **One or more motors dark / chain stops mid-way** → wiring issue: reseat the 3-pin cables, check the controller-board power supply, and make sure each motor is fully clicked in.
> - **LEDs blinking** → the motor is in an **error state**: usually overload (forcing a joint past its limit) **or wrong power supply voltage**. SO-100 / SO-101 ship in two variants — a **5 V / 7.4 V** build and a **12 V** build — they are NOT interchangeable. Using a 12 V PSU on a 5 V / 7.4 V arm (or vice-versa) will trip this error; confirm your motor variant before powering up.
>
> Most "timeout" errors are physical, not code.
**4.6 Record a dataset** — keys: **→** next, **←** redo, **ESC** finish & upload.
```bash
HF_USER=$(NO_COLOR=1 hf auth whoami | awk -F': *' 'NR==1 {print $2}')
lerobot-record \
--robot.type=so101_follower --robot.port=<FOLLOWER_PORT> --robot.id=my_follower \
--teleop.type=so101_leader --teleop.port=<LEADER_PORT> --teleop.id=my_leader \
--robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
--dataset.repo_id=${HF_USER}/my_task \
--dataset.single_task="<describe the task in one sentence>" \
--dataset.num_episodes=50 \
--dataset.episode_time_s=30 \
--dataset.reset_time_s=10 \
--display_data=true
```
**4.7 Visualize****always** do this before training. Look for missing frames, camera blur, unreachable targets, inconsistent object positions.
After upload: https://huggingface.co/spaces/lerobot/visualize_dataset → paste `${HF_USER}/my_task`. Works for **any LeRobot-formatted Hub dataset** — use it to scout other datasets, inspect episode quality, or debug your own data before retraining.
**4.8 Replay an episode** (sanity check)
```bash
lerobot-replay --robot.type=so101_follower --robot.port=<FOLLOWER_PORT> --robot.id=my_follower \
--dataset.repo_id=${HF_USER}/my_task --dataset.episode=0
```
**4.9 Train** (default: ACT — fastest, lowest memory). Apple silicon: `--policy.device=mps`. See §6/§7 for policy and duration.
```bash
lerobot-train \
--dataset.repo_id=${HF_USER}/my_task \
--policy.type=act \
--policy.device=cuda \
--output_dir=outputs/train/act_my_task \
--job_name=act_my_task \
--batch_size=8 \
--wandb.enable=true \
--policy.repo_id=${HF_USER}/act_my_task
```
**4.10 Evaluate on the real robot** — compare success rate to a teleoperated baseline.
```bash
lerobot-record \
--robot.type=so101_follower --robot.port=<FOLLOWER_PORT> --robot.id=my_follower \
--robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
--dataset.repo_id=${HF_USER}/eval_my_task \
--dataset.single_task="<same task description as training>" \
--dataset.num_episodes=10 \
--policy.path=${HF_USER}/act_my_task
```
---
## 5. Data collection tips (beginner → reliable policy)
Good data beats clever models. Adopt these defaults and deviate only with evidence.
### 5.1 Setup & ergonomics
- **Fix the rig and cameras** before touching the software. If the rig vibrates or the operator gets frustrated, fix that first — more bad data won't help.
- **Lighting matters more than resolution.** Diffuse, consistent light. Avoid moving shadows.
- **"Can you do the task from the camera view alone?"** If no, your cameras are wrong. Fix before recording.
- Enable **action interpolation** for rollouts when available for smoother trajectories.
### 5.2 Practice before you record
- Do 510 demos without recording. Build a deliberate, repeatable strategy.
- Hesitant or inconsistent demos teach the model hesitation.
### 5.3 Quality over speed
Deliberate, high-quality execution beats fast sloppy runs. Optimize for speed only **after** strategy is dialed in — never trade quality for it.
### 5.4 Consistency within and across episodes
Same grasp, approach vector, and timing. Coherent strategies are much easier to learn than wildly varying movements.
### 5.5 Start small, then extend (the golden rule)
- **First 50 episodes = constrained version** of the task: one object, fixed position, fixed camera setup, one operator.
- Train a quick ACT model. See what fails.
- **Then add diversity** along one axis at a time: more positions → more lighting → more objects → more operators.
- Don't try to collect the "perfect dataset" on day one. Iterate.
### 5.6 Policy choice for beginners
- **Laptop / first time / want results fast → ACT.** Works surprisingly well, trains fast even on a laptop GPU.
- **Bigger GPU / language-conditioned / multi-task → SmolVLA.** Unfreezing the vision encoder (see §7) is a big win here.
- Defer π0 / π0.5 / Wall-X / X-VLA until you have a proven ACT baseline and a 20+ GB GPU.
### 5.7 Recommended defaults for your first task
| Setting | Value |
| ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
| Episodes | **50** to start, scale to 100300 after first training |
| Episode length | 2045 s (shorter is fine for grasp/place) |
| Reset time | 10 s |
| FPS | 30 |
| Cameras | **2 cameras recommended**: 1 fixed front + 1 wrist. Multi-view often outperforms single-view. A single fixed camera also works to keep things simple. |
| Task description | Short, specific, action-phrased sentence |
### 5.8 Troubleshooting signal
- Policy fails at one specific stage → record 1020 more episodes **targeting that stage**.
- Policy flaps / oscillates → likely inconsistent demos, or need more training; re-record worst episodes (use **←** to redo).
- Policy ignores the object → camera framing or lighting issue, not a model issue.
See also: [What makes a good dataset](https://huggingface.co/blog/lerobot-datasets#what-makes-a-good-dataset).
---
## 6. Which policy should I train?
Match the policy to the user's **GPU memory** and **time budget**. Numbers below come from an internal profiling run (one training update per policy). They are **indicative only** — see caveats.
### 6.1 Profiling snapshot (indicative)
All policies typically train for **510 epochs** (see §7).
| Policy | Batch | Update (ms) | Peak GPU mem (GB) | Best for |
| ----------- | ----: | ----------: | ----------------: | ------------------------------------------------------------------------------------------------ |
| `act` | 4 | **83.9** | **0.94** | First-time users, laptops, single-task. Fast and reliable. |
| `diffusion` | 4 | 168.6 | 4.94 | Multi-modal action distributions; needs mid-range GPU. |
| `smolvla` | 1 | 357.8 | 3.93 | Language-conditioned, multi-task, small VLA. **Unfreeze vision encoder for big gains** (see §7). |
| `xvla` | 1 | 731.6 | 15.52 | Large VLA, multi-task. |
| `wall_x` | 1 | 716.5 | 15.95 | Large VLA with world-model objective. |
| `pi0` | 1 | 940.3 | 15.50 | Strong large VLA baseline (Physical Intelligence). |
| `pi05` | 1 | 1055.8 | 16.35 | Newer π policy; similar footprint to `pi0`. |
**Critical caveats:**
- **Optimizer:** measured with **SGD**. LeRobot's default is **AdamW**, which keeps extra optimizer state → **peak memory will be noticeably higher** with the default, especially for `pi0`, `pi05`, `wall_x`, `xvla`.
- **Batch size:** the large policies were profiled at batch 1. In practice use a **larger batch** for stable training (see §7.4). Memory scales roughly linearly with batch.
### 6.2 Decision rules
- **< 8 GB VRAM (laptop, 3060, M-series Mac):** → `act`. Maybe `diffusion` if you have ~68 GB free.
- **1216 GB VRAM (4070/4080, A4000):** → `smolvla` with defaults, or `act`/`diffusion` with larger batch. `pi0`/`pi05`/`wall_x`/`xvla` feasible only with small batch + gradient accumulation.
- **24+ GB VRAM (3090/4090/A5000):** → any policy. Prefer `smolvla` (unfrozen) for multi-task; `act` for single-task grasp-and-place (still often the best ROI). Could experiment with `pi0` or `pi05` or `xvla`
- **80 GB (A100/H100):** → any, with healthy batch. `pi05`, `xvla`, `wall_x` become comfortable.
- **CPU only:** → don't train here. Use Google Colab (see [`docs/source/notebooks.mdx`](./docs/source/notebooks.mdx)) or a rented GPU.
---
## 7. How long should I train?
Robotics imitation learning usually converges in a **few epochs over the dataset**, not hundreds of thousands of raw steps. Think **epochs first**, then translate to steps.
### 7.1 Rule of thumb
- **Typical total: 510 epochs.** Start at 5, eval, then decide if more helps.
- Very small datasets (< 30 episodes) may want slightly more epochs — but first, **collect more data**.
- VLAs with a pretrained vision backbone typically need **fewer** epochs than training from scratch.
### 7.2 Steps ↔ epochs conversion
```
total_frames = sum of frames over all episodes # e.g. 50 eps × 30 fps × 30 s ≈ 45,000
steps_per_epoch = ceil(total_frames / batch_size)
total_steps = epochs × steps_per_epoch
```
Examples for `--batch_size=8`:
| Dataset size | Frames | Steps / epoch | 5 epochs | 10 epochs |
| ----------------------- | ------: | ------------: | -------: | --------: |
| 50 eps × 30 s @ 30 fps | 45,000 | ~5,625 | 28k | 56k |
| 100 eps × 30 s @ 30 fps | 90,000 | ~11,250 | 56k | 113k |
| 300 eps × 30 s @ 30 fps | 270,000 | ~33,750 | 169k | 338k |
Pass the resulting total with `--steps=<N>`; eval at intermediate checkpoints (`outputs/train/.../checkpoints/`).
### 7.3 Per-policy starting points (single-task, ~50 episodes)
| Policy | Batch | Steps (first run) | Notes |
| -------------- | ----: | ----------------: | ----------------------------------------------------------------- |
| `act` | 816 | 30k80k | Usually converges under 50k for single-task. |
| `diffusion` | 816 | 80k150k | Benefits from longer training than ACT. |
| `smolvla` | 48 | 30k80k | Pretrained VLM → converges fast. |
| `pi0` / `pi05` | 14 | 30k80k | Memory-bound; use gradient accumulation for effective batch ≥ 16! |
### 7.4 Batch size guidance
- **Bigger batch is preferable** for stable gradients on teleop data.
- If GPU memory is the bottleneck, use **gradient accumulation** to raise _effective_ batch without raising peak memory.
- Scale **learning rate** gently with batch; most LeRobot defaults work fine for a 24× batch change.
### 7.5 Scale LR schedule & checkpoints with `--steps`
LeRobot's default schedulers (e.g. SmolVLA's cosine decay) use `scheduler_decay_steps=30_000`, which is sized for long training runs. When you shorten training (e.g. 5k10k steps on a small dataset), **scale the scheduler down to match** — otherwise the LR stays near the peak and never decays. Same for checkpoint frequency.
```bash
lerobot-train ... \
--steps=5000 \
--policy.scheduler_decay_steps=5000 \
--save_freq=5000
```
Rule of thumb: set `scheduler_decay_steps ≈ steps`, and `save_freq` to whatever granularity you want for eval (e.g. every 1k5k steps). Match `scheduler_warmup_steps` proportionally if your run is very short.
### 7.6 SmolVLA: unfreeze the vision encoder for real gains
SmolVLA ships with `freeze_vision_encoder=True`. Unfreezing usually **improves performance substantially** on specialized tasks, at the cost of more VRAM and slower steps. Enable with:
```bash
lerobot-train ... --policy.type=smolvla \
--policy.freeze_vision_encoder=false \
--policy.train_expert_only=false
```
### 7.7 Signals to stop / keep going
- Train loss plateaus → stop, save a Hub checkpoint.
- Train loss still dropping and you're under 10 epochs → keep going.
---
## 8. Evaluation & benchmarks
Two flavors of evaluation:
### 8.1 Real-robot eval (SO-101, etc.)
Reuse `lerobot-record` with `--policy.path` to run the trained policy on-robot and save the run as an eval dataset. Convention: prefix the dataset with `eval_`.
```bash
lerobot-record \
--robot.type=so101_follower --robot.port=<FOLLOWER_PORT> --robot.id=my_follower \
--robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
--dataset.repo_id=${HF_USER}/eval_my_task \
--dataset.single_task="<same task description used during training>" \
--dataset.num_episodes=10 \
--policy.path=${HF_USER}/act_my_task
```
Report success rate across episodes. Compare to a teleoperated baseline and to an earlier checkpoint to catch regressions.
### 8.2 Sim-benchmark eval
For policies trained on sim datasets (PushT, Aloha, LIBERO, MetaWorld, RoboCasa, …) use `lerobot-eval` against the matching `env.type`:
```bash
lerobot-eval \
--policy.path=${HF_USER}/diffusion_pusht \
--env.type=pusht \
--eval.n_episodes=50 \
--eval.batch_size=10 \
--policy.device=cuda
```
- Use `--policy.path=outputs/train/.../checkpoints/<step>/pretrained_model` for local checkpoints.
- `--eval.n_episodes` should be ≥ 50 for a stable success-rate estimate.
- Available envs live in `src/lerobot/envs/`. See [`docs/source/libero.mdx`](./docs/source/libero.mdx), [`metaworld.mdx`](./docs/source/metaworld.mdx), [`robocasa.mdx`](./docs/source/robocasa.mdx), [`vlabench.mdx`](./docs/source/vlabench.mdx) for specific benchmarks.
- To add a new benchmark, see [`docs/source/adding_benchmarks.mdx`](./docs/source/adding_benchmarks.mdx) and [`envhub.mdx`](./docs/source/envhub.mdx).
### 8.2b Dockerfiles for benchmark eval
Benchmark envs have native dependencies that are painful to install locally. The repo ships **pre-baked Dockerfiles** for each supported benchmark — use these to run `lerobot-eval` in a reproducible environment:
| Benchmark | Dockerfile |
| ----------- | -------------------------------------------------------------------------------------- |
| LIBERO | [`docker/Dockerfile.benchmark.libero`](./docker/Dockerfile.benchmark.libero) |
| LIBERO+ | [`docker/Dockerfile.benchmark.libero_plus`](./docker/Dockerfile.benchmark.libero_plus) |
| MetaWorld | [`docker/Dockerfile.benchmark.metaworld`](./docker/Dockerfile.benchmark.metaworld) |
| RoboCasa | [`docker/Dockerfile.benchmark.robocasa`](./docker/Dockerfile.benchmark.robocasa) |
| RoboCerebra | [`docker/Dockerfile.benchmark.robocerebra`](./docker/Dockerfile.benchmark.robocerebra) |
| RoboMME | [`docker/Dockerfile.benchmark.robomme`](./docker/Dockerfile.benchmark.robomme) |
| RoboTwin | [`docker/Dockerfile.benchmark.robotwin`](./docker/Dockerfile.benchmark.robotwin) |
| VLABench | [`docker/Dockerfile.benchmark.vlabench`](./docker/Dockerfile.benchmark.vlabench) |
Build and run (adapt to your benchmark):
```bash
docker build -f docker/Dockerfile.benchmark.robomme -t lerobot-bench-robomme .
docker run --gpus all --rm -it \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
lerobot-bench-robomme \
lerobot-eval --policy.path=<your_policy> --env.type=<env> --eval.n_episodes=50
```
See [`docker/README.md`](./docker/README.md) for base-image details.
### 8.3 Target success rates
Single-task grasp-and-place with 50 clean episodes: ACT should reach **> 70% success** on the training configuration. Less → data problem (see §5), not model problem. Expect a drop when generalizing to new positions — scale episodes or diversity to recover.
---
## 9. Further reading & resources
- **Getting started:** [`installation.mdx`](./docs/source/installation.mdx) · [`il_robots.mdx`](./docs/source/il_robots.mdx) · [What makes a good dataset](https://huggingface.co/blog/lerobot-datasets)
- **Per-policy docs:** browse [`docs/source/*.mdx`](./docs/source/) (policies, hardware, benchmarks, advanced training).
- **Community:** [Discord](https://discord.com/invite/s3KuuzsPFb) · [Hub `LeRobot` tag](https://huggingface.co/datasets?other=LeRobot) · [Dataset visualizer](https://huggingface.co/spaces/lerobot/visualize_dataset)
> Keep this file current. If you learn a rule that would prevent a class of user mistakes, add it here and in [`AGENTS.md`](./AGENTS.md).

View File

@@ -78,9 +78,6 @@ Use the templates for required fields and examples.
- **Issues:** Follow the [ticket template](https://github.com/huggingface/lerobot/blob/main/.github/ISSUE_TEMPLATE/bug-report.yml).
- **Pull requests:** Rebase on `upstream/main`, use a descriptive branch (don't work on `main`), run `pre-commit` and tests locally, and follow the [PR template](https://github.com/huggingface/lerobot/blob/main/.github/PULL_REQUEST_TEMPLATE.md).
> [!IMPORTANT]
> Community Review Policy: To help scale our efforts and foster a collaborative environment, we ask contributors to review at least one other person's open PR before their own receives attention. This shared responsibility multiplies our review capacity and helps everyone's code get merged faster!
Once you have submitted your PR and completed a peer review, a member of the LeRobot team will review your contribution.
One member of the LeRobot team will then review your contribution.
Thank you for contributing to LeRobot!

View File

@@ -1,4 +1,3 @@
include src/lerobot/templates/lerobot_modelcard_template.md
include src/lerobot/templates/lerobot_rewardmodel_modelcard_template.md
include src/lerobot/datasets/card_template.md
include src/lerobot/envs/metaworld_config.json

View File

@@ -178,9 +178,3 @@ test-smolvla-ete-eval:
--env.episode_length=5 \
--eval.n_episodes=1 \
--eval.batch_size=1
# E2E annotation pipeline smoke test against a tiny in-memory fixture
# dataset. Opt-in (not part of `make test-end-to-end`) and uses a stub VLM
# backend, so it does not require a real model checkpoint or GPU.
annotation-e2e:
uv run python -m tests.annotations.run_e2e_smoke

View File

@@ -1,42 +0,0 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Benchmark image for LIBERO integration tests.
# Extends the nightly GPU image (which already has all extras installed)
# with the PR's source code and LIBERO-specific asset setup.
#
# Build: docker build -f docker/Dockerfile.benchmark.libero -t lerobot-benchmark-libero .
# Run: docker run --gpus all --rm lerobot-benchmark-libero lerobot-eval ...
FROM huggingface/lerobot-gpu:latest
# Pre-download lerobot/libero-assets from HF Hub so nothing is fetched at
# runtime (which times out on CI). Point the libero config at the cached path.
# libero/libero/__init__.py calls input() when ~/.libero/config.yaml is missing,
# so we write the config before any libero import can happen.
RUN LIBERO_DIR=$(python -c \
"import importlib.util, os; s=importlib.util.find_spec('libero'); \
print(os.path.join(os.path.dirname(s.origin), 'libero'))") && \
mkdir -p /home/user_lerobot/.libero && \
python -c "\
from huggingface_hub import snapshot_download; \
snapshot_download(repo_id='lerobot/libero-assets', repo_type='dataset', \
local_dir='/home/user_lerobot/.libero/assets')" && \
printf "assets: /home/user_lerobot/.libero/assets\nbddl_files: ${LIBERO_DIR}/bddl_files\ndatasets: ${LIBERO_DIR}/../datasets\ninit_states: ${LIBERO_DIR}/init_files\n" \
> /home/user_lerobot/.libero/config.yaml
# Overlay the PR's source code on top of the nightly image.
COPY --chown=user_lerobot:user_lerobot . .
CMD ["/bin/bash"]

View File

@@ -1,84 +0,0 @@
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Benchmark image for LIBERO-plus integration tests.
# Extends the nightly GPU image (which has lerobot[all]) with the LIBERO-plus
# fork source + its 6.4 GB perturbation assets.
#
# Build: docker build -f docker/Dockerfile.benchmark.libero_plus -t lerobot-benchmark-libero-plus .
# Run: docker run --gpus all --rm lerobot-benchmark-libero-plus lerobot-eval ...
FROM huggingface/lerobot-gpu:latest
ENV MUJOCO_GL=egl
# unzip for the 6.4 GB assets.zip; the rest are LIBERO-plus build-time extras
# (wand / ImageMagick / fontconfig) not in the nightly base.
USER root
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
unzip libexpat1 libfontconfig1-dev libmagickwand-dev \
&& apt-get clean && rm -rf /var/lib/apt/lists/*
USER user_lerobot
# robosuite==1.4.1 is mandatory (the fork uses `single_arm_env` removed in
# v1.5+). The rest are LIBERO-plus runtime deps pulled from its setup.py.
# We install these explicitly instead of via the [libero_plus] extra because
# the extra's `libero @ git+...` dep installs as a namespace package and then
# clone and PYTHONPATH-override it below.
RUN uv pip install --no-cache \
"robosuite==1.4.1" \
"bddl==1.0.1" \
"easydict==1.13" \
"mujoco==3.7.0" \
"matplotlib==3.10.8" \
"Wand==0.6.13" \
"scikit-image==0.25.2" \
"gym==0.26.2"
# Clone LIBERO-plus and make it importable as `libero`. The nightly base has
# hf-libero (10 tasks) preinstalled via lerobot[libero]; uninstall it so
# Python resolves `import libero` to the 2402-task LIBERO-plus module instead.
# Pinned to the current upstream main SHA so benchmark builds stay reproducible.
ARG LIBERO_PLUS_SHA=4976dc3
ENV LIBERO_PLUS_ROOT=/home/user_lerobot/libero-plus/libero/libero
RUN git clone https://github.com/sylvestf/LIBERO-plus.git /home/user_lerobot/libero-plus \
&& git -C /home/user_lerobot/libero-plus checkout ${LIBERO_PLUS_SHA} \
&& cd /home/user_lerobot/libero-plus && uv pip install --no-cache --no-deps -e "." \
&& (uv pip uninstall hf-libero 2>/dev/null || true)
ENV PYTHONPATH="/home/user_lerobot/libero-plus:${PYTHONPATH}"
# Perturbation textures/scenes: bddl_base_domain.py resolves XMLs via
# DIR_PATH/../assets (package-relative, ignoring ~/.libero/config.yaml). All
# 2402 tasks reference files that ship only in Sylvest/LIBERO-plus's
# assets.zip (6.4 GB) under a deep author-internal prefix — extract and
# flatten it under ${LIBERO_PLUS_ROOT}/assets.
RUN python -c "\
from huggingface_hub import hf_hub_download; \
hf_hub_download(repo_id='Sylvest/LIBERO-plus', repo_type='dataset', \
filename='assets.zip', local_dir='/tmp/libero-plus-dl')" \
&& unzip -q /tmp/libero-plus-dl/assets.zip -d /tmp/libero-plus-dl/extract \
&& ASSETS_DIR=$(find /tmp/libero-plus-dl/extract -type d -name assets | head -1) \
&& mv "${ASSETS_DIR}" ${LIBERO_PLUS_ROOT}/assets \
&& rm -rf /tmp/libero-plus-dl
# Point ~/.libero/config.yaml at the clone so LIBERO-plus's imports are
# non-interactive (it calls input() when the config is missing).
RUN mkdir -p /home/user_lerobot/.libero \
&& printf "assets: ${LIBERO_PLUS_ROOT}/assets\nbddl_files: ${LIBERO_PLUS_ROOT}/bddl_files\ndatasets: ${LIBERO_PLUS_ROOT}/../datasets\ninit_states: ${LIBERO_PLUS_ROOT}/init_files\n" \
> /home/user_lerobot/.libero/config.yaml
# Overlay the PR's source code on top of the nightly image.
COPY --chown=user_lerobot:user_lerobot . .
CMD ["/bin/bash"]

View File

@@ -1,27 +0,0 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Benchmark image for MetaWorld integration tests.
# Extends the nightly GPU image (which already has all extras installed)
# with the PR's source code.
#
# Build: docker build -f docker/Dockerfile.benchmark.metaworld -t lerobot-benchmark-metaworld .
# Run: docker run --gpus all --rm lerobot-benchmark-metaworld lerobot-eval ...
FROM huggingface/lerobot-gpu:latest
# Overlay the PR's source code on top of the nightly image.
COPY --chown=user_lerobot:user_lerobot . .
CMD ["/bin/bash"]

View File

@@ -1,71 +0,0 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Benchmark image for RoboCasa365 integration tests.
# Extends the nightly GPU image (which already has all extras installed)
# with the PR's source code and RoboCasa-specific asset setup.
#
# Build: docker build -f docker/Dockerfile.benchmark.robocasa -t lerobot-benchmark-robocasa .
# Run: docker run --gpus all --rm lerobot-benchmark-robocasa lerobot-eval ...
FROM huggingface/lerobot-gpu:latest
# Install robocasa + robosuite as editable clones. pip-installing from git
# omits data files like robocasa/models/assets/box_links/box_links_assets.json
# (not declared in package_data), which download_kitchen_assets needs at import.
#
# `--no-deps` on robocasa is deliberate: its setup.py pins `lerobot==0.3.3`
# in install_requires, which would shadow the editable lerobot baked into
# this image. We install robocasa's actual runtime deps explicitly instead.
# Pinned SHAs for reproducible benchmark runs. Bump when you need an
# upstream fix; don't rely on `main`/`master` drift.
ARG ROBOCASA_SHA=56e355ccc64389dfc1b8a61a33b9127b975ba681
ARG ROBOSUITE_SHA=aaa8b9b214ce8e77e82926d677b4d61d55e577ab
RUN git clone https://github.com/robocasa/robocasa.git ~/robocasa && \
git -C ~/robocasa checkout ${ROBOCASA_SHA} && \
git clone https://github.com/ARISE-Initiative/robosuite.git ~/robosuite && \
git -C ~/robosuite checkout ${ROBOSUITE_SHA} && \
uv pip install --no-cache -e ~/robocasa --no-deps && \
uv pip install --no-cache -e ~/robosuite && \
uv pip install --no-cache \
"numpy==2.2.5" "numba==0.61.2" "scipy==1.15.3" "mujoco==3.3.1" \
"pygame==2.6.1" "Pillow==12.2.0" "opencv-python==4.13.0.92" \
"pyyaml==6.0.3" "pynput==1.8.1" "tqdm==4.67.3" "termcolor==3.3.0" \
"imageio==2.37.3" "h5py==3.16.0" "lxml==6.0.4" "hidapi==0.14.0.post4" \
"tianshou==0.4.10" "gymnasium==1.2.3"
# Set up robocasa macros and download kitchen assets. We need:
# - tex : base environment textures
# - tex_generative : AI-generated textures; kitchen fixture XMLs embed
# refs to generative_textures/wall/tex*.png
# unconditionally, so MjModel.from_xml_string fails
# at reset time without them (even if the env is
# constructed with generative_textures=None).
# - fixtures_lw : lightwheel kitchen fixtures (fridge, counters...)
# - objs_lw : lightwheel object meshes (stools, misc props)
# We skip the objaverse/aigen object packs (~30GB combined) by pairing
# this with --env.obj_registries=["lightwheel"] on the lerobot side.
# The download script prompts interactively, so pipe 'y' to auto-accept.
RUN python -m robocasa.scripts.setup_macros && \
yes y | python -m robocasa.scripts.download_kitchen_assets \
--type tex tex_generative fixtures_lw objs_lw
# Overlay the PR's source code on top of the nightly image.
COPY --chown=user_lerobot:user_lerobot . .
# Re-install lerobot editably so the new source (with RoboCasaEnv registration)
# replaces the stale package baked into the nightly image.
RUN uv pip install --no-cache --no-deps -e .
CMD ["/bin/bash"]

View File

@@ -1,43 +0,0 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Benchmark image for RoboCerebra integration tests.
# RoboCerebra reuses LIBERO's simulator (libero_10 suite) with a different
# rename_map, so this image is identical to the LIBERO benchmark image —
# extends the nightly GPU base with LIBERO assets + the PR's source code.
#
# Build: docker build -f docker/Dockerfile.benchmark.robocerebra -t lerobot-benchmark-robocerebra .
# Run: docker run --gpus all --rm lerobot-benchmark-robocerebra lerobot-eval ...
FROM huggingface/lerobot-gpu:latest
# Pre-download lerobot/libero-assets from HF Hub so nothing is fetched at
# runtime (which times out on CI). Point the libero config at the cached path.
# libero/libero/__init__.py calls input() when ~/.libero/config.yaml is missing,
# so we write the config before any libero import can happen.
RUN LIBERO_DIR=$(python -c \
"import importlib.util, os; s=importlib.util.find_spec('libero'); \
print(os.path.join(os.path.dirname(s.origin), 'libero'))") && \
mkdir -p /home/user_lerobot/.libero && \
python -c "\
from huggingface_hub import snapshot_download; \
snapshot_download(repo_id='lerobot/libero-assets', repo_type='dataset', \
local_dir='/home/user_lerobot/.libero/assets')" && \
printf "assets: /home/user_lerobot/.libero/assets\nbddl_files: ${LIBERO_DIR}/bddl_files\ndatasets: ${LIBERO_DIR}/../datasets\ninit_states: ${LIBERO_DIR}/init_files\n" \
> /home/user_lerobot/.libero/config.yaml
# Overlay the PR's source code on top of the nightly image.
COPY --chown=user_lerobot:user_lerobot . .
CMD ["/bin/bash"]

View File

@@ -1,56 +0,0 @@
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Benchmark image for RoboMME integration tests.
# Extends the nightly GPU image (which has lerobot[all]) with Vulkan system
# libs for ManiSkill/SAPIEN and the robomme extra. robomme isn't in [all]
# because mani-skill hard-pins gymnasium==0.29.1 and numpy<2.0.0 which
# conflict with lerobot's defaults; both are safe at runtime:
# - gymnasium 0.29.x has the same 5-tuple step() API as 1.x (since 0.26)
# - numpy 1.26.4 is API-compatible with lerobot's actual usage.
#
# Build: docker build -f docker/Dockerfile.benchmark.robomme -t lerobot-benchmark-robomme .
# Run: docker run --gpus all --rm lerobot-benchmark-robomme lerobot-eval ...
FROM huggingface/lerobot-gpu:latest
# NVIDIA Container Toolkit: expose Vulkan driver capability for headless rendering.
ENV NVIDIA_DRIVER_CAPABILITIES=all \
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/nvidia_icd.json
# ManiSkill/SAPIEN's renderer needs Vulkan, which isn't in the base image.
USER root
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
libvulkan1 libvulkan-dev mesa-vulkan-drivers \
&& mkdir -p /usr/share/vulkan/icd.d \
&& echo '{"file_format_version":"1.0.0","ICD":{"library_path":"libGLX_nvidia.so.0","api_version":"1.3.0"}}' \
> /usr/share/vulkan/icd.d/nvidia_icd.json \
&& apt-get clean && rm -rf /var/lib/apt/lists/*
USER user_lerobot
# Install smolvla + av-dep via the PR's pyproject, then layer robomme on top
# with gymnasium/numpy overrides. robomme isn't a pyproject extra because its
# mani-skill pin conflicts with lerobot's base numpy>=2 (see pyproject.toml).
COPY --chown=user_lerobot:user_lerobot setup.py pyproject.toml uv.lock README.md MANIFEST.in ./
RUN printf 'gymnasium==0.29.1\nnumpy==1.26.4\n' > /tmp/robomme_override.txt \
&& uv pip install --no-cache --override /tmp/robomme_override.txt \
-e ".[smolvla,av-dep]" \
"robomme @ git+https://github.com/RoboMME/robomme_benchmark.git@main" \
&& python -c "import robomme; print('robomme import OK')"
# Overlay the PR's source code on top of the nightly image.
COPY --chown=user_lerobot:user_lerobot . .
CMD ["/bin/bash"]

View File

@@ -1,138 +0,0 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Benchmark image for RoboTwin 2.0 integration tests.
# Extends the nightly GPU image with the RoboTwin simulator stack:
# sapien/mplib/pytorch3d + NVlabs CuRobo + embodiments.zip + objects.zip
# (~3.96 GB of assets; background_texture.zip ~11 GB skipped for smoke eval).
#
# Build: docker build -f docker/Dockerfile.benchmark.robotwin -t lerobot-benchmark-robotwin .
# Run: docker run --gpus all --rm lerobot-benchmark-robotwin \
# lerobot-eval --env.type=robotwin --env.task=beat_block_hammer ...
FROM huggingface/lerobot-gpu:latest
ENV NVIDIA_DRIVER_CAPABILITIES=all \
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/nvidia_icd.json \
ROBOTWIN_ROOT=/opt/robotwin
# The nightly base is CUDA -base (no compiler, no Vulkan loader). CuRobo's
# `pip install -e .` runs nvcc, and SAPIEN renders via Vulkan — add both.
USER root
# Pinned upstream SHA for reproducible benchmark runs. Bump when we need
# an upstream fix; don't rely on `main` drift.
ARG ROBOTWIN_SHA=0aeea2d669c0f8516f4d5785f0aa33ba812c14b4
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
cuda-nvcc-12-6 cuda-cudart-dev-12-6 \
libvulkan1 vulkan-tools \
&& mkdir -p /usr/share/vulkan/icd.d \
&& echo '{"file_format_version":"1.0.0","ICD":{"library_path":"libGLX_nvidia.so.0","api_version":"1.3.0"}}' \
> /usr/share/vulkan/icd.d/nvidia_icd.json \
&& git clone https://github.com/RoboTwin-Platform/RoboTwin.git ${ROBOTWIN_ROOT} \
&& git -C ${ROBOTWIN_ROOT} checkout ${ROBOTWIN_SHA} \
&& chown -R user_lerobot:user_lerobot ${ROBOTWIN_ROOT} \
&& apt-get clean && rm -rf /var/lib/apt/lists/*
USER user_lerobot
# RoboTwin runtime deps (av is already in the base via [av-dep]).
RUN uv pip install --no-cache \
"sapien==3.0.0b1" "mplib==0.2.1" "transforms3d==0.4.2" "trimesh==4.4.3" \
"open3d==0.19.0" "imageio==2.34.2" termcolor zarr pydantic h5py
# pytorch3d has no universal wheel; must be built from source (~10 min, cached).
RUN uv pip install --no-cache --no-build-isolation \
"git+https://github.com/facebookresearch/pytorch3d.git@stable"
# CuRobo — NVlabs motion generator; TORCH_CUDA_ARCH_LIST must be set or the
# build aborts on an empty arch list. RoboTwin's own installer pins v0.7.8,
# which still exposes the v1 API (`curobo.types.math`) that RoboTwin imports.
ARG CUROBO_REF=v0.7.8
RUN cd ${ROBOTWIN_ROOT}/envs \
&& git clone --branch ${CUROBO_REF} --depth 1 https://github.com/NVlabs/curobo.git \
&& cd curobo \
&& TORCH_CUDA_ARCH_LIST="7.0;7.5;8.0;8.6;8.9;9.0" \
uv pip install -e . --no-build-isolation --no-cache
# Upstream patches (mirror RoboTwin's script/_install.sh).
# These patches target the exact versions pinned above; re-check when upgrading.
# mplib==0.2.1: drop a broken `or collide` clause in planner.py.
# Safe to remove once mplib > 0.2.1 ships with the fix upstream.
# sapien==3.0.0b1: fix URDF loader encoding + .srdf extension check.
# Safe to remove once sapien > 3.0.0b1 ships with the fix upstream.
RUN python - <<'EOF'
import pathlib, re, site
for d in site.getsitepackages():
p = pathlib.Path(d) / "mplib" / "planner.py"
if p.exists():
p.write_text(re.sub(r"\bor collide\b", "", p.read_text(), count=1))
print(f"mplib patch applied: {p}")
p = pathlib.Path(d) / "sapien" / "wrapper" / "urdf_loader.py"
if p.exists():
src = p.read_text().replace(
"with open(srdf_path) as f:", 'with open(srdf_path, encoding="utf-8") as f:'
).replace('"srdf"', '".srdf"')
p.write_text(src)
print(f"sapien patch applied: {p}")
EOF
# Simulation assets from TianxingChen/RoboTwin2.0: embodiments (~220 MB) +
# objects (~3.74 GB). background_texture (~11 GB) is intentionally skipped.
# The dataset is public — no auth token needed.
RUN python - <<'EOF'
import os, pathlib, zipfile
from huggingface_hub import hf_hub_download
assets_dir = pathlib.Path(os.environ["ROBOTWIN_ROOT"]) / "assets"
assets_dir.mkdir(parents=True, exist_ok=True)
for fname in ("embodiments.zip", "objects.zip"):
local = hf_hub_download(
repo_id="TianxingChen/RoboTwin2.0",
repo_type="dataset",
filename=fname,
local_dir=str(assets_dir),
)
with zipfile.ZipFile(local, "r") as z:
z.extractall(str(assets_dir))
pathlib.Path(local).unlink()
EOF
WORKDIR ${ROBOTWIN_ROOT}
RUN python script/update_embodiment_config_path.py
ENV PYTHONPATH="${ROBOTWIN_ROOT}"
# Fail the image build early if the CuRobo package layout regresses. Importing
# RoboTwin's planner here is too eager because CuRobo constructs CUDA-backed
# defaults at import time, while Docker builds don't have access to an NVIDIA
# driver.
RUN python - <<'EOF'
from pathlib import Path
from curobo.types.math import Pose
planner_src = (Path("/opt/robotwin/envs/robot/planner.py")).read_text()
assert "from curobo.types.math import Pose as CuroboPose" in planner_src
print("CuRobo import OK:", Pose.__name__)
print("RoboTwin planner import references curobo.types.math")
EOF
# Return to the lerobot source directory (set by base image) before overlaying.
WORKDIR /lerobot
# Overlay the PR's source code on top of the nightly image.
COPY --chown=user_lerobot:user_lerobot . .
CMD ["/bin/bash"]

View File

@@ -1,99 +0,0 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Benchmark image for VLABench integration tests.
# Extends the nightly GPU image with the PR's source code and VLABench setup.
#
# Build: docker build -f docker/Dockerfile.benchmark.vlabench -t lerobot-benchmark-vlabench .
# Run: docker run --gpus all --rm lerobot-benchmark-vlabench lerobot-eval ...
FROM huggingface/lerobot-gpu:latest
# Install VLABench from GitHub (not on PyPI) and pin MuJoCo/dm-control.
# Shallow-clone without submodule recursion (nested SSH-only submodules fail in CI).
# Editable install (-e) because VLABench/utils/ has no __init__.py, so
# find_packages() omits it from wheels; editable mode uses the source tree directly.
# rrt-algorithms has the same packaging issue (rrt/ dir missing __init__.py).
# Patch: constant.py calls os.listdir on ~100 asset/obj/meshes/* dirs at import
# time. Guard the call so missing dirs return [] instead of crashing (in case
# the asset download is partial).
#
# Pinned upstream SHAs for reproducible benchmark runs. Bump when you need
# an upstream fix; don't rely on `main`/`develop` drift.
ARG VLABENCH_SHA=cf588fe60c0c7282174fe979f5913170cfe69017
ARG RRT_ALGORITHMS_SHA=e51d95ee489a225220d6ae2a764c4111f6ba7d85
RUN git clone https://github.com/OpenMOSS/VLABench.git ~/VLABench && \
git -C ~/VLABench checkout ${VLABENCH_SHA} && \
git clone https://github.com/motion-planning/rrt-algorithms.git ~/rrt-algorithms && \
git -C ~/rrt-algorithms checkout ${RRT_ALGORITHMS_SHA} && \
python3 -c "\
import pathlib; \
p = pathlib.Path.home() / 'VLABench/VLABench/configs/constant.py'; \
t = p.read_text(); \
p.write_text(t.replace( \
'subdirs = os.listdir(xml_dir)', \
'if not os.path.isdir(xml_dir): return []\n subdirs = os.listdir(xml_dir)'))" && \
uv pip install --no-cache -e ~/VLABench -e ~/rrt-algorithms \
mujoco==3.2.2 dm-control==1.0.22 \
open3d colorlog scikit-learn openai gdown
# Download VLABench mesh assets. Task configs reference object meshes
# (obj/meshes/fruit/, containers/basket/, tablewares/plates/, etc.); without
# them the task builder picks from an empty mesh list and crashes with
# IndexError at task-build time (random.choice([]) in config_manager.py).
#
# Preferred source: an HF Hub mirror. Set VLABENCH_ASSETS_REPO at build time
# (e.g. --build-arg VLABENCH_ASSETS_REPO=lerobot/vlabench-assets) and we'll
# snapshot_download the repo into VLABench's assets dir. This is the reliable
# path for CI — Google Drive frequently returns HTTP 429 ("Too many users have
# viewed or downloaded this file recently") on shared academic files.
#
# After download we *validate* that at least one XML exists under each
# task-critical subtree and fail the build loudly if not. Silent-empty asset
# dirs are the #1 cause of VLABench runtime crashes in CI, so we surface them
# here rather than after a 10-minute eval build.
#
# Fallback: VLABench's own gdown-based script. Best-effort only.
ARG VLABENCH_ASSETS_REPO=""
RUN ASSETS_DIR="$HOME/VLABench/VLABench/assets" && \
if [ -n "${VLABENCH_ASSETS_REPO}" ]; then \
echo "Downloading VLABench assets from HF Hub: ${VLABENCH_ASSETS_REPO}" && \
uv pip install --no-cache "huggingface_hub[hf_xet]>=0.26" && \
python -c "from huggingface_hub import snapshot_download; \
p = snapshot_download(repo_id='${VLABENCH_ASSETS_REPO}', repo_type='dataset', \
local_dir='${ASSETS_DIR}', allow_patterns=['obj/**', 'scenes/**']); \
print('snapshot_download returned:', p)"; \
else \
echo "No VLABENCH_ASSETS_REPO set — falling back to gdown" && \
python ~/VLABench/scripts/download_assets.py --choice all; \
fi && \
python -c "\
from pathlib import Path; \
import sys; \
root = Path('${ASSETS_DIR}'); \
checks = ['obj/meshes/tablewares/plates', 'obj/meshes/containers/basket', 'obj/meshes/fruit', 'obj/meshes/containers/tray']; \
failed = []; \
print(f'Validating VLABench assets under {root}'); \
[print(f' {c}: {len(list((root/c).rglob(\"*.xml\")))} XMLs') for c in checks]; \
[failed.append(c) for c in checks if not any((root/c).rglob('*.xml'))]; \
sys.exit(f'Empty asset dirs (no *.xml): {failed}') if failed else print('All asset dirs populated.')"
# Overlay the PR's source code on top of the nightly image.
COPY --chown=user_lerobot:user_lerobot . .
# Re-install lerobot editably so the new source (with VLABenchEnv registration
# and updated obs handling) replaces the stale package baked into the nightly image.
RUN uv pip install --no-cache --no-deps -e .
CMD ["/bin/bash"]

View File

@@ -18,8 +18,9 @@
# docker build -f docker/Dockerfile.internal -t lerobot-internal .
# Configure the base image for CI with GPU access
ARG CUDA_VERSION=12.6.3
ARG OS_VERSION=24.04
# TODO(Steven): Bump these versions
ARG CUDA_VERSION=12.4.1
ARG OS_VERSION=22.04
FROM nvidia/cuda:${CUDA_VERSION}-base-ubuntu${OS_VERSION}
# Define Python version argument
@@ -35,13 +36,16 @@ ENV DEBIAN_FRONTEND=noninteractive \
# Install Python, system dependencies, and uv (as root)
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential git curl \
libglib2.0-0 libgl1 libegl1 ffmpeg \
software-properties-common build-essential git curl \
libglib2.0-0 libgl1-mesa-glx libegl1-mesa ffmpeg \
libusb-1.0-0-dev speech-dispatcher libgeos-dev portaudio19-dev \
cmake pkg-config ninja-build \
python${PYTHON_VERSION} \
python${PYTHON_VERSION}-venv \
python${PYTHON_VERSION}-dev \
&& add-apt-repository -y ppa:deadsnakes/ppa \
&& apt-get update \
&& apt-get install -y --no-install-recommends \
python${PYTHON_VERSION} \
python${PYTHON_VERSION}-venv \
python${PYTHON_VERSION}-dev \
&& curl -LsSf https://astral.sh/uv/install.sh | sh \
&& mv /root/.local/bin/uv /usr/local/bin/uv \
&& useradd --create-home --shell /bin/bash user_lerobot \

View File

@@ -31,12 +31,8 @@
title: Porting Large Datasets
- local: using_dataset_tools
title: Using the Dataset Tools
- local: language_and_recipes
title: Language Columns and Recipes
- local: tools
title: Tools
- local: annotation_pipeline
title: Annotation Pipeline
- local: dataset_subtask
title: Using Subtasks in the Dataset
- local: streaming_video_encoding
title: Streaming Video Encoding
title: "Datasets"
@@ -51,8 +47,6 @@
title: π₀-FAST (Pi0Fast)
- local: pi05
title: π₀.₅ (Pi05)
- local: eo1
title: EO-1
- local: groot
title: NVIDIA GR00T N1.5
- local: xvla
@@ -67,8 +61,6 @@
title: SARM
title: "Reward Models"
- sections:
- local: inference
title: Policy Deployment (lerobot-rollout)
- local: async
title: Use Async Inference
- local: rtc
@@ -85,22 +77,10 @@
title: Adding a New Benchmark
- local: libero
title: LIBERO
- local: libero_plus
title: LIBERO-plus
- local: metaworld
title: Meta-World
- local: robotwin
title: RoboTwin 2.0
- local: robocasa
title: RoboCasa365
- local: robocerebra
title: RoboCerebra
- local: robomme
title: RoboMME
- local: envhub_isaaclab_arena
title: NVIDIA IsaacLab Arena Environments
- local: vlabench
title: VLABench
title: "Benchmarks"
- sections:
- local: introduction_processors

View File

@@ -1,198 +0,0 @@
# Annotation Pipeline
`lerobot-annotate` populates the two language columns introduced by the
[Language Columns and Recipes](./language_and_recipes) page —
`language_persistent` and `language_events` — directly into
`data/chunk-*/file-*.parquet`.
## What the pipeline produces
A vocabulary-discovery phase derives a small canonical wording, then three
modules write into a per-episode staging tree, then a single writer
rewrites the data shards in place:
| Style / atom | Column | Module |
| ------------------------------------------- | --------------------- | -------------- |
| `subtask` (Pi0.7-style "how, not what") | `language_persistent` | `plan` |
| `plan` (initial + refresh on interjection) | `language_persistent` | `plan` |
| `memory` (MEM-style compression) | `language_persistent` | `plan` |
| `task_aug` (rephrasings of canonical task) | `language_persistent` | `plan` |
| `interjection` | `language_events` | `interjections`|
| speech tool-call atom (`style=null`, `say`) | `language_events` | `interjections`|
| `vqa` (user / assistant pair) | `language_events` | `vqa` |
The `plan` module is constrained to a **canonical vocabulary** discovered
once per dataset by the `vocabulary` module (phase 0). It watches a few
sample episode videos (`--vocabulary.sample_episodes`, default `3`) and
asks the VLM to derive a small set of imperative subtask labels and
first-person memory milestones that recur across the demos. The VLM
picks the right number of entries itself based on what it sees in the
clips — short pick-and-place demos get ~6 subtask labels, longer
multi-step recipes get more. The result lands at
`meta/canonical_vocabulary.json` (human-readable / hand-editable) and
is reused on every subsequent run. The `plan` module then constrains
both subtask + memory generation to those exact strings — the
downstream low-level policy sees a small, repeatable target
distribution instead of thousands of LLM paraphrases. Disable with
`--vocabulary.enabled=False` to fall back to free-form generation.
The writer does **not** add a `tools` column to the parquet — the tool
catalog lives at `meta/info.json["tools"]` instead (see
[Tools](./tools)). After every annotation run the pipeline ensures the
canonical `say` schema is present in that list, preserving any tools the
user pre-declared.
If you want to declare additional tools for a dataset before annotation
runs, edit `meta/info.json["tools"]` directly — the pipeline preserves
anything already there. Implementations of those tools live under
`src/lerobot/tools/`; one file per tool, registered via
`TOOL_REGISTRY`. See the [Tools](./tools) doc for the authoring guide.
## Running locally
Install the extra and invoke the console script. Episode-level
concurrency comes from `--executor.episode_parallelism` (default 16);
that is the only knob the in-process executor exposes.
```bash
uv sync --extra annotations
uv run lerobot-annotate \
--root=/path/to/dataset \
--vlm.model_id=Qwen/Qwen2.5-VL-7B-Instruct
```
The pipeline attaches actual camera footage to every `plan` /
`interjections` / `vqa` prompt by default, decoded from the dataset's
first `observation.images.*` stream. Override with
`--vlm.camera_key=observation.images.<name>` to pin a specific
viewpoint. Datasets with no video tracks fall back to text-only prompts
automatically.
**The `plan` module sees the whole episode as one video block.** Subtask
decomposition gets a `{"type":"video", "video":[<frames>]}` block
covering the entire demonstration; Qwen-VL pools temporally on its own
and decides where to cut. There is no keyframe stride or count knob —
`--plan.max_video_frames` (default 128) only caps the frames packed
into the video block as a model-capacity bound. The `interjections`
module attaches a short window of frames straddling the interjection
timestamp. The `vqa` module grounds each VQA pair on a single frame —
its `--vqa.K` knob sets how many consecutive frames each emission tick
anchors, and every anchored frame gets its own VQA pair on that one
frame (there is no per-pair frame window).
## Running on Hugging Face Jobs
Distributed annotation is delegated to
[Hugging Face Jobs](https://huggingface.co/docs/hub/en/jobs). The repo
ships a launcher script you copy and edit for your dataset:
```bash
HF_TOKEN=hf_... uv run python examples/annotations/run_hf_job.py
```
[`examples/annotations/run_hf_job.py`](https://github.com/huggingface/lerobot/blob/main/examples/annotations/run_hf_job.py)
spawns one `h200x2` job that:
1. installs the branch under test plus the annotation extras,
2. boots two vllm servers (one per GPU) for the chosen model,
3. runs the `plan` / `interjections` / `vqa` modules across the dataset
via `lerobot-annotate`,
4. uploads the annotated dataset to `--push_to_hub`.
To target a different dataset, model, or hub repo, edit the `CMD` block
inside the script — every flag in there maps directly onto a CLI flag of
`lerobot-annotate` (see `lerobot-annotate --help` for the full list).
## Style-to-recipe consumer mapping
The pipeline's outputs are designed to be consumed by recipes (see
[Language Columns and Recipes](./language_and_recipes)) — typically:
- low-level / high-level / memory-update branches consume
`subtask`/`plan`/`memory` from `language_persistent`.
- An interjection-response branch consumes `interjection` events plus
the paired speech atom (merged into one assistant target turn via
`tool_calls_from`) and the same-timestamp `plan` refresh.
- A VQA branch consumes the `(vqa, user)` and `(vqa, assistant)` pairs
from `language_events`.
## Why the design splits state from events
Two things drive the scope:
1. **Persistent state vs exact-event split.** Persistent rows
(`subtask`, `plan`, `memory`) broadcast per episode and answer "what
state is in force at this frame?". Event rows (`interjection`, `vqa`,
speech) only appear on the exact frame whose timestamp matches the
emission. The pipeline writes timestamps taken straight from the
source parquet — no floating-point recomputation.
2. **One Qwen-VL pass.** All three modules share a single VLM client
(vLLM if available, transformers fallback) so the cost is one model
load per dataset, not three.
## Module independence and staged reruns
Each module writes its raw output to
`<root>/.annotate_staging/episode_{N:06d}/<module>.jsonl`. That makes
prompt iteration cheap — re-running one module overwrites only its own
JSONL file before the writer composes the final parquet. Modules can be
disabled via `--plan.enabled=false` (and likewise `--interjections.enabled`
/ `--vqa.enabled`) to
test them in isolation.
## Validation/report checks before final write
Before the writer runs, `StagingValidator` checks:
- exact frame-timestamp alignment for every event row;
- no orphan speech / interjection pairs;
- `plan` is refreshed at every interjection timestamp;
- `memory` rows fall on subtask boundaries (warning, not error);
- VQA assistant `content` parses as JSON in one of the
bbox / keypoint / count / attribute / spatial shapes;
- every row routes to the column dictated by `column_for_style(style)`.
Errors abort the writer (`--skip_validation=true` overrides for debugging).
## Paper inspirations per module
- **`plan` module — subtasks.** Hi Robot ([Shi 2025](https://arxiv.org/abs/2502.19417))
atom granularity ("pick up one piece of lettuce", "place bowl to box");
Pi0.7 ([Physical Intelligence 2025](https://pi.website/pi07)) "how, not
what" detail.
- **`plan` module — memory.** MEM ([Torne 2026](https://arxiv.org/abs/2603.03596))
compression directive: keep only minimal relevant information; functional
outcomes preserved, specific attributes dropped.
- **`interjections` module.** Hi Robot scenario taxonomy: negative task,
situated correction, specific constraint, preference. Speech is a
tool-call-only atom (`tool_calls=[{type:function, function:{name:"say",
arguments:{text:...}}}]`).
- **`vqa` module.** ECoT ([Zawalski 2024](https://arxiv.org/abs/2407.08693))
grounded features (bounding boxes in pixel `[x_min, y_min, x_max, y_max]`,
keypoints) and Steerable VLA Policies ([Zhao 2025](https://arxiv.org/abs/2509.07626))
multi-abstraction grounding. Pi0.7 also grounds answers across
multiple abstraction levels.
Future maintainers should adjust the prompt templates in
`src/lerobot/annotations/steerable_pipeline/prompts/` against these
references rather than rewriting from scratch.
## Compute and list-size estimates
Per episode, the pipeline issues O(`max_steps`) `plan`-module calls,
O(`max_interjections_per_episode`) `interjections`-module calls, and
O(`vqa_emission_hz × episode_seconds`) `vqa`-module calls. With defaults
(8 subtasks, 1 interjection, 1 Hz × 3 pairs) and 30-second episodes, that
is ~50 VLM calls per episode. `language_persistent` per episode is ~10s of
KB at most (parquet dictionary-encodes one entry per episode);
`language_events` is empty on most frames and is bounded by the number of
emissions, not `num_frames × num_emissions`.
## Reproducibility via seed and prompt hashes
`--seed` (default 1729) feeds the per-episode RNGs that select interjection
timestamps and VQA question types. Combined with the deterministic prompt
templates checked into `prompts/`, two runs at the same seed against the
same dataset and the same model checkpoint produce byte-identical staging
artifacts. Prompt edits are recorded by file hash; future tooling can pin
expected `(seed, prompt_hash)` pairs into the dataset card.

View File

@@ -0,0 +1,277 @@
# Using Subtasks in LeRobot Datasets
Subtask support in robotics datasets has proven effective in improving robot reasoning and understanding. Subtasks are particularly useful for:
- **Hierarchical policies**: Building policies that include subtask predictions to visualize robot reasoning in real time
- **Reward modeling**: Helping reward models understand task progression (e.g., SARM-style stage-aware reward models)
- **Task decomposition**: Breaking down complex manipulation tasks into atomic, interpretable steps
LeRobotDataset now supports subtasks as part of its dataset structure, alongside tasks.
## What are Subtasks?
While a **task** describes the overall goal (e.g., "Pick up the apple and place it in the basket"), **subtasks** break down the execution into finer-grained steps:
1. "Approach the apple"
2. "Grasp the apple"
3. "Lift the apple"
4. "Move to basket"
5. "Release the apple"
Each frame in the dataset can be annotated with its corresponding subtask, enabling models to learn and predict these intermediate stages.
<img
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/subtask-asset.png"
alt="An overview of subtask annotation showing how frames are labeled with intermediate subtask stages"
width="80%"
/>
<p>
<em>Figure: Overview of subtask annotation.</em>
</p>
**Reference:** _Subtask-learning based for robot self-assembly in flexible collaborative assembly in manufacturing_, Original Article, Published: 19 April 2022.
## Dataset Structure
Subtask information is stored in the dataset metadata:
```
my-dataset/
├── data/
│ └── ...
├── meta/
│ ├── info.json
│ ├── stats.json
│ ├── tasks.parquet
│ ├── subtasks.parquet # Subtask index → subtask string mapping
│ └── episodes/
│ └── ...
└── videos/
└── ...
```
### Subtasks Parquet File
The `meta/subtasks.parquet` file maps subtask indices to their natural language descriptions:
| subtask_index | subtask (index column) |
| ------------- | ---------------------- |
| 0 | "Approach the apple" |
| 1 | "Grasp the apple" |
| 2 | "Lift the apple" |
| ... | ... |
### Frame-Level Annotations
Each frame in the dataset can include a `subtask_index` field that references the subtasks parquet file:
```python
# Example frame data in the parquet file
{
"index": 42,
"timestamp": 1.4,
"episode_index": 0,
"task_index": 0,
"subtask_index": 2, # References "Lift the apple"
"observation.state": [...],
"action": [...],
}
```
## Annotating Datasets with Subtasks
We provide a HuggingFace Space for easily annotating any LeRobotDataset with subtasks:
**[https://huggingface.co/spaces/lerobot/annotate](https://huggingface.co/spaces/lerobot/annotate)**
After completing your annotation:
1. Click "Push to Hub" to upload your annotated dataset
2. You can also run the annotation space locally by following the instructions at [github.com/huggingface/lerobot-annotate](https://github.com/huggingface/lerobot-annotate)
## Loading Datasets with Subtasks
When you load a dataset with subtask annotations, the subtask information is automatically available:
```python
from lerobot.datasets import LeRobotDataset
# Load a dataset with subtask annotations
dataset = LeRobotDataset("jadechoghari/collect-fruit-annotated")
# Access a sample
sample = dataset[100]
# The sample includes both task and subtask information
print(sample["task"]) # "Collect the fruit"
print(sample["subtask"]) # "Grasp the apple"
print(sample["task_index"]) # tensor(0)
print(sample["subtask_index"]) # tensor(2)
```
### Checking for Subtask Support
You can check if a dataset has subtask annotations:
```python
# Check if subtasks are available
has_subtasks = (
"subtask_index" in dataset.features
and dataset.meta.subtasks is not None
)
if has_subtasks:
print(f"Dataset has {len(dataset.meta.subtasks)} unique subtasks")
print("Subtasks:", list(dataset.meta.subtasks.index))
```
## Using Subtasks for Training
### With the Tokenizer Processor
The `TokenizerProcessor` automatically handles subtask tokenization for Vision-Language Action (VLA) models:
```python
from lerobot.processor import TokenizerProcessorStep
# Create a tokenizer processor step
tokenizer_processor = TokenizerProcessorStep(
tokenizer_name_or_path="google/paligemma-3b-pt-224",
padding="max_length",
max_length=64,
)
# The processor will automatically tokenize subtasks if present in the batch
# and add them to the observation under:
# - "observation.subtask.tokens"
# - "observation.subtask.attention_mask"
```
When subtasks are available in the batch, the tokenizer processor adds:
- `observation.subtask.tokens`: Tokenized subtask text
- `observation.subtask.attention_mask`: Attention mask for the subtask tokens
### DataLoader with Subtasks
```python
import torch
from lerobot.datasets import LeRobotDataset
dataset = LeRobotDataset("jadechoghari/collect-fruit-annotated")
dataloader = torch.utils.data.DataLoader(
dataset,
batch_size=16,
shuffle=True,
)
for batch in dataloader:
# Access subtask information in the batch
subtasks = batch["subtask"] # List of subtask strings
subtask_indices = batch["subtask_index"] # Tensor of subtask indices
# Use for training hierarchical policies or reward models
print(f"Batch subtasks: {set(subtasks)}")
```
## Example Datasets with Subtask Annotations
Try loading a dataset with subtask annotations:
```python
from lerobot.datasets import LeRobotDataset
# Example dataset with subtask annotations
dataset = LeRobotDataset("jadechoghari/collect-fruit-annotated")
# Explore the subtasks
print("Available subtasks:")
for subtask_name in dataset.meta.subtasks.index:
print(f" - {subtask_name}")
# Get subtask distribution
subtask_counts = {}
for i in range(len(dataset)):
sample = dataset[i]
subtask = sample["subtask"]
subtask_counts[subtask] = subtask_counts.get(subtask, 0) + 1
print("\nSubtask distribution:")
for subtask, count in sorted(subtask_counts.items(), key=lambda x: -x[1]):
print(f" {subtask}: {count} frames")
```
## Use Cases
### 1. Hierarchical Policy Training
Train policies that predict both actions and current subtask:
```python
class HierarchicalPolicy(nn.Module):
def __init__(self, num_subtasks):
super().__init__()
self.action_head = nn.Linear(hidden_dim, action_dim)
self.subtask_head = nn.Linear(hidden_dim, num_subtasks)
def forward(self, observations):
features = self.encoder(observations)
actions = self.action_head(features)
subtask_logits = self.subtask_head(features)
return actions, subtask_logits
```
### 2. Stage-Aware Reward Modeling (SARM)
Build reward models that understand task progression:
```python
# SARM predicts:
# - Stage: Which subtask is being executed (discrete)
# - Progress: How far along the subtask (continuous 0-1)
class SARMRewardModel(nn.Module):
def forward(self, observations):
features = self.encoder(observations)
stage_logits = self.stage_classifier(features)
progress = self.progress_regressor(features)
return stage_logits, progress
```
### 3. Progress Visualization
Monitor robot execution by tracking subtask progression:
```python
def visualize_execution(model, observations):
for t, obs in enumerate(observations):
action, subtask_logits = model(obs)
predicted_subtask = subtask_names[subtask_logits.argmax()]
print(f"t={t}: Executing '{predicted_subtask}'")
```
## API Reference
### LeRobotDataset Properties
| Property | Type | Description |
| --------------------------- | ---------------------- | ------------------------------------------ |
| `meta.subtasks` | `pd.DataFrame \| None` | DataFrame mapping subtask names to indices |
| `features["subtask_index"]` | `dict` | Feature spec for subtask_index if present |
### Sample Keys
When subtasks are available, each sample includes:
| Key | Type | Description |
| --------------- | -------------- | ------------------------------------ |
| `subtask_index` | `torch.Tensor` | Integer index of the current subtask |
| `subtask` | `str` | Natural language subtask description |
## Related Resources
- [SARM Paper](https://arxiv.org/pdf/2509.25358) - Stage-Aware Reward Modeling for Long Horizon Robot Manipulation
- [LeRobot Annotate Space](https://huggingface.co/spaces/lerobot/annotate) - Interactive annotation tool
- [LeRobotDataset v3.0](./lerobot-dataset-v3) - Dataset format documentation

View File

@@ -1,168 +0,0 @@
# EO-1
EO-1 is a **Vision-Language-Action policy for robot control**. The LeRobot implementation integrates EO-1 with the standard LeRobot training, evaluation, processor interface.
## Model Overview
EO-1 uses a Qwen2.5-VL backbone for vision-language understanding and adds a continuous flow-matching action head for robot control. The policy formats each robot-control sample as a multimodal conversation: camera images are passed to Qwen2.5-VL, the robot state is represented with EO-1 state tokens, and the future action chunk is represented with EO-1 action tokens.
<img
src="https://huggingface.co/datasets/HaomingSong/lerobot-documentation-images/resolve/main/lerobot/eo_pipeline.png"
alt="An overview of EO-1"
width="85%"
/>
During training, EO-1 learns to denoise continuous action chunks at the action-token positions. During inference, it samples an action chunk, returns continuous actions, and executes `n_action_steps` from the chunk before sampling again.
### What the LeRobot Integration Covers
- Standard `policy.type=eo1` configuration through LeRobot
- Qwen2.5-VL image and text preprocessing through policy processors
- Continuous flow-matching action prediction
- Checkpoint save/load through LeRobot policy APIs
- Training with `lerobot-train` and evaluation with `lerobot-eval`
The broader EO-1 project also includes interleaved vision-text-action pretraining and multimodal reasoning workflows. This page focuses on the LeRobot robot-control policy path.
## Installation Requirements
1. Install LeRobot by following the [Installation Guide](./installation).
2. Install EO-1 dependencies by running:
```bash
pip install -e ".[eo1]"
```
3. If you want to train or evaluate on LIBERO, install the LIBERO dependencies too:
```bash
pip install -e ".[eo1,libero]"
```
EO-1 can use the standard PyTorch scaled-dot-product attention backend through `policy.attn_implementation=sdpa`. If your environment has a compatible `flash_attn` installation, you can request `policy.attn_implementation=flash_attention_2`.
## Data Requirements
EO-1 expects a LeRobot dataset with:
- At least one visual observation, for example `observation.images.image`
- `observation.state`
- `action`
- A language task instruction through the dataset `task` field
If your dataset uses different observation names, use `rename_map` to align them with the names expected by your training or evaluation setup.
## Usage
To use EO-1 in a LeRobot configuration, specify the policy type as:
```python
policy.type=eo1
```
By default, a new EO-1 policy initializes its backbone from:
```python
policy.vlm_base=Qwen/Qwen2.5-VL-3B-Instruct
```
Once a LeRobot-format EO-1 checkpoint is available, load it with:
```python
policy.path=your-org/your-eo1-checkpoint
```
## Training
### Training Command Example
```bash
lerobot-train \
--dataset.repo_id=your_org/your_dataset \
--policy.type=eo1 \
--policy.vlm_base=Qwen/Qwen2.5-VL-3B-Instruct \
--policy.dtype=bfloat16 \
--policy.attn_implementation=sdpa \
--policy.gradient_checkpointing=false \
--output_dir=./outputs/eo1_training \
--job_name=eo1_training \
--steps=300000 \
--batch_size=16 \
--policy.device=cuda
```
### Key Training Parameters
| Parameter | Default | Description |
| -------------------------------------- | ----------------------------- | ----------------------------------------------------------------------- |
| `policy.vlm_base` | `Qwen/Qwen2.5-VL-3B-Instruct` | Qwen2.5-VL checkpoint used to initialize a new policy |
| `policy.dtype` | `auto` | Backbone dtype request: `auto`, `bfloat16`, or `float32` |
| `policy.attn_implementation` | `None` | Optional Qwen attention backend, such as `sdpa` |
| `policy.gradient_checkpointing` | `false` | Reduces memory usage during training |
| `policy.chunk_size` | `8` | Number of future actions predicted per chunk |
| `policy.n_action_steps` | `8` | Number of actions consumed from a sampled chunk |
| `policy.num_denoise_steps` | `10` | Number of flow-matching denoising steps used during sampling |
| `policy.max_state_dim` | `32` | State padding dimension |
| `policy.max_action_dim` | `32` | Action padding dimension |
| `policy.force_fp32_autocast` | `true` | Keeps the flow head in fp32 even when the backbone uses mixed precision |
| `policy.supervise_padding_action_dims` | `true` | Controls whether padded action dimensions are supervised |
| `policy.supervise_padding_actions` | `true` | Controls whether padded future action rows are supervised |
## Evaluation
EO-1 can be evaluated through `lerobot-eval` once you have a LeRobot-format checkpoint:
```bash
lerobot-eval \
--policy.path=your-org/your-eo1-checkpoint \
--env.type=libero \
--env.task=libero_object \
--eval.batch_size=1 \
--eval.n_episodes=20
```
For datasets or environments whose camera names differ from the checkpoint configuration, pass a `rename_map`:
```bash
lerobot-eval \
--policy.path=your-org/your-eo1-checkpoint \
--env.type=libero \
--env.task=libero_object \
--rename_map='{"observation.images.image2":"observation.images.wrist_image"}'
```
## Configuration Notes
### Image Processing
EO-1 uses the Qwen2.5-VL processor. The `policy.image_min_pixels` and `policy.image_max_pixels` settings control the image resizing bounds before the visual tokens are passed into the backbone.
### State and Action Dimensions
The policy pads state and action vectors to `policy.max_state_dim` and `policy.max_action_dim` before the EO-1 flow head. Predictions are cropped back to the original action dimension before being returned by the policy.
### Attention Backend
Use `policy.attn_implementation=sdpa` for a portable setup. Use `flash_attention_2` only when `flash_attn` is installed and compatible with your environment.
## References
- [EO-1 project](https://github.com/EO-Robotics/EO1)
- [EO-1 paper](https://arxiv.org/abs/2508.21112)
- [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)
## Citation
```bibtex
@article{eo1,
title={EO-1: Interleaved Vision-Text-Action Pretraining for General Robot Control},
author={Delin Qu and Haoming Song and Qizhi Chen and Zhaoqing Chen and Xianqiang Gao and Xinyi Ye and Qi Lv and Modi Shi and Guanghui Ren and Cheng Ruan and Maoqing Yao and Haoran Yang and Jiacheng Bao and Bin Zhao and Dong Wang},
journal={arXiv preprint},
year={2025},
url={https://arxiv.org/abs/2508.21112}
}
```
## License
This LeRobot integration follows the **Apache 2.0 License** used by LeRobot. Check the upstream EO-1 model and dataset pages for the licenses of released EO-1 checkpoints and data.

View File

@@ -50,30 +50,30 @@ This process can be repeated iteratively: deploy, collect, fine-tune, repeat. Ea
### Teleoperator Requirements
The `lerobot-rollout --strategy.type=dagger` mode requires **teleoperators with active motors** that can:
The `examples/hil` HIL scripts require **teleoperators with active motors** that can:
- Enable/disable torque programmatically
- Move to target positions (to mirror the robot state when pausing)
**Compatible teleoperators:**
**Compatible teleoperators in the current `examples/hil` scripts:**
- `openarm_mini` - OpenArm Mini
- `so_leader` - SO100 / SO101 leader arm
> [!IMPORTANT]
> The provided commands default to `bi_openarm_follower` + `openarm_mini`.
> The provided `examples/hil` commands default to `bi_openarm_follower` + `openarm_mini`.
> `so_follower` + `so_leader` configs are also registered and can be used via CLI flags.
---
## Script
Use `lerobot-rollout` with `--strategy.type=dagger` for HIL data collection. Select the inference backend with `--inference.type=sync|rtc`:
A single script handles both synchronous and RTC-based inference. Toggle RTC with `--rtc.enabled=true`:
| Mode | Flag | Models |
| ------------------------ | ---------------------- | --------------------- |
| Standard (default) | _(no flag needed)_ | ACT, Diffusion Policy |
| Real-Time Chunking (RTC) | `--inference.type=rtc` | Pi0, Pi0.5, SmolVLA |
| Mode | Flag | Models |
| ------------------------ | -------------------- | --------------------- |
| Standard (default) | _(no flag needed)_ | ACT, Diffusion Policy |
| Real-Time Chunking (RTC) | `--rtc.enabled=true` | Pi0, Pi0.5, SmolVLA |
---
@@ -97,7 +97,7 @@ python src/lerobot/scripts/lerobot_train.py \
**Standard inference (ACT, Diffusion Policy):**
```bash
lerobot-rollout --strategy.type=dagger \
python examples/hil/hil_data_collection.py \
--robot.type=bi_openarm_follower \
--robot.left_arm_config.port=can1 \
--robot.left_arm_config.side=left \
@@ -108,10 +108,11 @@ lerobot-rollout --strategy.type=dagger \
--teleop.port_left=/dev/ttyACM0 \
--teleop.port_right=/dev/ttyACM1 \
--policy.path=outputs/pretrain/checkpoints/last/pretrained_model \
--dataset.repo_id=your-username/rollout_hil_dataset \
--dataset.repo_id=your-username/hil-dataset \
--dataset.single_task="Fold the T-shirt properly" \
--dataset.fps=30 \
--strategy.num_episodes=50 \
--dataset.episode_time_s=1000 \
--dataset.num_episodes=50 \
--interpolation_multiplier=2
```
@@ -120,11 +121,11 @@ lerobot-rollout --strategy.type=dagger \
For models with high inference latency, enable RTC for smooth execution:
```bash
lerobot-rollout --strategy.type=dagger \
--inference.type=rtc \
--inference.rtc.execution_horizon=20 \
--inference.rtc.max_guidance_weight=5.0 \
--inference.rtc.prefix_attention_schedule=LINEAR \
python examples/hil/hil_data_collection.py \
--rtc.enabled=true \
--rtc.execution_horizon=20 \
--rtc.max_guidance_weight=5.0 \
--rtc.prefix_attention_schedule=LINEAR \
--robot.type=bi_openarm_follower \
--robot.left_arm_config.port=can1 \
--robot.left_arm_config.side=left \
@@ -135,10 +136,11 @@ lerobot-rollout --strategy.type=dagger \
--teleop.port_left=/dev/ttyACM0 \
--teleop.port_right=/dev/ttyACM1 \
--policy.path=outputs/pretrain/checkpoints/last/pretrained_model \
--dataset.repo_id=your-username/rollout_hil_rtc_dataset \
--dataset.repo_id=your-username/hil-rtc-dataset \
--dataset.single_task="Fold the T-shirt properly" \
--dataset.fps=30 \
--strategy.num_episodes=50 \
--dataset.episode_time_s=1000 \
--dataset.num_episodes=50 \
--interpolation_multiplier=3
```
@@ -233,7 +235,7 @@ This HIL data collection approach builds on ideas from interactive imitation lea
- **HG-DAgger** (Kelly et al., 2019) made this practical for robotics: a human expert monitors the robot and only intervenes when needed, rather than labeling every state. The gating between autonomous and human control is exactly the pause → takeover → return-to-policy loop used in the scripts here.
- **RaC** (Hu et al., 2025) scales this loop to long-horizon tasks by explicitly decomposing interventions into **recovery** (teleoperating back to a good state) and **correction** (demonstrating the right behavior from there). This decomposition is the protocol followed by the DAgger strategy in `lerobot-rollout`.
- **RaC** (Hu et al., 2025) scales this loop to long-horizon tasks by explicitly decomposing interventions into **recovery** (teleoperating back to a good state) and **correction** (demonstrating the right behavior from there). This decomposition is the protocol followed by the HIL scripts in `examples/hil`.
- **π0.6/RECAP** (Physical Intelligence, 2025) applies the same iterative collect-and-finetune loop at scale with VLA models, showing that even large pretrained policies benefit substantially from targeted human corrections on their own failure modes. π0.6 is trained using RECAP.

View File

@@ -685,10 +685,6 @@ Example configuration for training the [reward classifier](https://huggingface.c
```json
{
"dataset": {
"repo_id": "hf_username/dataset_name",
"root": null
},
"policy": {
"type": "reward_classifier",
"model_name": "helper2424/resnet10",
@@ -709,28 +705,8 @@ Example configuration for training the [reward classifier](https://huggingface.c
"type": "VISUAL",
"shape": [3, 128, 128]
}
},
"push_to_hub": true,
"repo_id": "hf_username/model_repo"
},
"batch_size": 16,
"num_workers": 4,
"steps": 5000,
"log_freq": 10,
"eval_freq": 1000,
"save_freq": 1000,
"save_checkpoint": true,
"seed": 2,
"resume": false,
"optimizer": {
"grad_clip_norm": 10.0
},
"wandb": {
"enable": true,
"project": "reward-classifier",
"disable_artifact": false
},
"job_name": "reward-classifier"
}
}
}
```

View File

@@ -32,12 +32,6 @@ Once youve gathered enough trajectories, youll train a neural network to i
If you run into any issues at any point, jump into our [Discord community](https://discord.com/invite/s3KuuzsPFb) for support.
<Tip>
Want to quickly get the right commands for your setup? The [quickstart notebook](https://github.com/huggingface/lerobot/blob/main/examples/notebooks/quickstart.ipynb) [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/lerobot/blob/main/examples/notebooks/quickstart.ipynb) lets you configure your robot once and generates all the commands below ready to paste.
</Tip>
## Set up and Calibrate
If you haven't yet set up and calibrated your robot and teleop device, please do so by following the robot-specific tutorial.
@@ -509,42 +503,121 @@ hf upload ${HF_USER}/act_so101_test${CKPT} \
## Run inference and evaluate your policy
Use `lerobot-rollout` to deploy a trained policy on your robot. You can choose different strategies depending on your needs:
You can use the `record` script from [`lerobot-record`](https://github.com/huggingface/lerobot/blob/main/src/lerobot/scripts/lerobot_record.py) with a policy checkpoint as input, to run inference and evaluate your policy. For instance, run this command or API example to run inference and record 10 evaluation episodes:
<hfoptions id="eval">
<hfoption id="Base mode (no recording)">
<hfoption id="Command">
```bash
lerobot-rollout \
--strategy.type=base \
--policy.path=${HF_USER}/my_policy \
--robot.type=so100_follower \
--robot.port=/dev/ttyACM1 \
--robot.cameras="{ up: {type: opencv, index_or_path: /dev/video10, width: 640, height: 480, fps: 30}, side: {type: intelrealsense, serial_number_or_name: 233522074606, width: 640, height: 480, fps: 30}}" \
--task="Put lego brick into the transparent box" \
--duration=60
```
</hfoption>
<hfoption id="Sentry mode (with recording)">
```bash
lerobot-rollout \
--strategy.type=sentry \
--strategy.upload_every_n_episodes=5 \
--policy.path=${HF_USER}/my_policy \
lerobot-record \
--robot.type=so100_follower \
--robot.port=/dev/ttyACM1 \
--robot.cameras="{ up: {type: opencv, index_or_path: /dev/video10, width: 640, height: 480, fps: 30}, side: {type: intelrealsense, serial_number_or_name: 233522074606, width: 640, height: 480, fps: 30}}" \
--robot.id=my_awesome_follower_arm \
--display_data=false \
--dataset.repo_id=${HF_USER}/eval_so100 \
--dataset.single_task="Put lego brick into the transparent box" \
--duration=600
--dataset.streaming_encoding=true \
--dataset.encoder_threads=2 \
# --dataset.vcodec=auto \
# <- Teleop optional if you want to teleoperate in between episodes \
# --teleop.type=so100_leader \
# --teleop.port=/dev/ttyACM0 \
# --teleop.id=my_awesome_leader_arm \
--policy.path=${HF_USER}/my_policy
```
</hfoption>
<hfoption id="API example">
<!-- prettier-ignore-start -->
```python
from lerobot.cameras.opencv import OpenCVCameraConfig
from lerobot.datasets import LeRobotDataset
from lerobot.utils.feature_utils import hw_to_dataset_features
from lerobot.policies.act import ACTPolicy
from lerobot.policies import make_pre_post_processors
from lerobot.robots.so_follower import SO100Follower, SO100FollowerConfig
from lerobot.scripts.lerobot_record import record_loop
from lerobot.common.control_utils import init_keyboard_listener
from lerobot.utils.utils import log_say
from lerobot.utils.visualization_utils import init_rerun
NUM_EPISODES = 5
FPS = 30
EPISODE_TIME_SEC = 60
TASK_DESCRIPTION = "My task description"
HF_MODEL_ID = "<hf_username>/<model_repo_id>"
HF_DATASET_ID = "<hf_username>/<eval_dataset_repo_id>"
# Create the robot configuration
camera_config = {"front": OpenCVCameraConfig(index_or_path=0, width=640, height=480, fps=FPS)}
robot_config = SO100FollowerConfig(
port="/dev/tty.usbmodem58760434471", id="my_awesome_follower_arm", cameras=camera_config
)
# Initialize the robot
robot = SO100Follower(robot_config)
# Initialize the policy
policy = ACTPolicy.from_pretrained(HF_MODEL_ID)
# Configure the dataset features
action_features = hw_to_dataset_features(robot.action_features, "action")
obs_features = hw_to_dataset_features(robot.observation_features, "observation")
dataset_features = {**action_features, **obs_features}
# Create the dataset
dataset = LeRobotDataset.create(
repo_id=HF_DATASET_ID,
fps=FPS,
features=dataset_features,
robot_type=robot.name,
use_videos=True,
image_writer_threads=4,
)
# Initialize the keyboard listener and rerun visualization
_, events = init_keyboard_listener()
init_rerun(session_name="recording")
# Connect the robot
robot.connect()
preprocessor, postprocessor = make_pre_post_processors(
policy_cfg=policy,
pretrained_path=HF_MODEL_ID,
dataset_stats=dataset.meta.stats,
)
for episode_idx in range(NUM_EPISODES):
log_say(f"Running inference, recording eval episode {episode_idx + 1} of {NUM_EPISODES}")
# Run the policy inference loop
record_loop(
robot=robot,
events=events,
fps=FPS,
policy=policy,
preprocessor=preprocessor,
postprocessor=postprocessor,
dataset=dataset,
control_time_s=EPISODE_TIME_SEC,
single_task=TASK_DESCRIPTION,
display_data=True,
)
dataset.save_episode()
# Clean up
robot.disconnect()
dataset.push_to_hub()
```
<!-- prettier-ignore-end -->
</hfoption>
</hfoptions>
The `--strategy.type` flag selects the execution mode:
As you can see, it's almost the same command as previously used to record your training dataset. Two things changed:
- `base`: Autonomous rollout with no data recording (useful for quick evaluation)
- `sentry`: Continuous recording with auto-upload (useful for large-scale evaluation)
- `highlight`: Ring buffer recording with keystroke save (useful for capturing interesting events)
- `dagger`: Human-in-the-loop data collection (see [HIL Data Collection](./hil_data_collection))
All strategies support `--inference.type=rtc` for smooth execution with slow VLA models (Pi0, Pi0.5, SmolVLA).
1. There is an additional `--control.policy.path` argument which indicates the path to your policy checkpoint with (e.g. `outputs/train/eval_act_so101_test/checkpoints/last/pretrained_model`). You can also use the model repository if you uploaded a model checkpoint to the hub (e.g. `${HF_USER}/act_so101_test`).
2. The name of dataset begins by `eval` to reflect that you are running inference (e.g. `${HF_USER}/eval_act_so101_test`).

View File

@@ -1,261 +0,0 @@
# Policy Deployment (lerobot-rollout)
`lerobot-rollout` is the single CLI for deploying trained policies on real robots. It supports multiple execution strategies and inference backends, from quick evaluation to continuous recording and human-in-the-loop data collection.
## Quick Start
No extra dependencies are needed beyond your robot and policy extras.
```bash
lerobot-rollout \
--strategy.type=base \
--policy.path=lerobot/act_koch_real \
--robot.type=koch_follower \
--robot.port=/dev/ttyACM0 \
--task="pick up cube" \
--duration=30
```
This runs the policy for 30 seconds with no recording.
---
## Strategies
Select a strategy with `--strategy.type=<name>`. Each strategy defines a different control loop with its own recording and interaction semantics.
### Base (`--strategy.type=base`)
Autonomous policy execution with no data recording. Use this for quick evaluation, demos, or when you only need to observe the robot.
```bash
lerobot-rollout \
--strategy.type=base \
--policy.path=${HF_USER}/my_policy \
--robot.type=so100_follower \
--robot.port=/dev/ttyACM0 \
--robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
--task="Put lego brick into the box" \
--duration=60
```
| Flag | Description |
| ---------------- | ------------------------------------------------------ |
| `--duration` | Run time in seconds (0 = infinite) |
| `--task` | Task description passed to the policy |
| `--display_data` | Stream observations/actions to Rerun for visualization |
### Sentry (`--strategy.type=sentry`)
Continuous autonomous recording with periodic upload to the Hugging Face Hub. Episode boundaries are auto-computed from camera resolution and FPS so each saved episode produces a complete video file, keeping uploads efficient.
Policy state (hidden state, RTC queue) persists across episode boundaries: the robot does not reset between episodes.
```bash
lerobot-rollout \
--strategy.type=sentry \
--strategy.upload_every_n_episodes=5 \
--policy.path=${HF_USER}/my_policy \
--robot.type=so100_follower \
--robot.port=/dev/ttyACM0 \
--robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
--dataset.repo_id=${HF_USER}/rollout_eval_data \
--dataset.single_task="Put lego brick into the box" \
--duration=3600
```
| Flag | Description |
| -------------------------------------- | ----------------------------------------------------------- |
| `--strategy.upload_every_n_episodes` | Push to Hub every N episodes (default: 5) |
| `--strategy.target_video_file_size_mb` | Target video file size for episode rotation (default: auto) |
| `--dataset.repo_id` | **Required.** Hub repository for the recorded dataset |
| `--dataset.push_to_hub` | Whether to push to Hub on teardown (default: true) |
### Highlight (`--strategy.type=highlight`)
Autonomous rollout with on-demand recording via a memory-bounded ring buffer. The robot runs continuously while the buffer captures the last N seconds of telemetry. Press the save key to flush the buffer and start live recording; press it again to save the episode.
```bash
lerobot-rollout \
--strategy.type=highlight \
--strategy.ring_buffer_seconds=30 \
--strategy.save_key=s \
--strategy.push_key=h \
--policy.path=${HF_USER}/my_policy \
--robot.type=koch_follower \
--robot.port=/dev/ttyACM0 \
--dataset.repo_id=${HF_USER}/rollout_highlight_data \
--dataset.single_task="Pick up the red cube"
```
**Keyboard controls:**
| Key | Action |
| ------------------ | -------------------------------------------------------- |
| `s` (configurable) | Start recording (flushes buffer) / stop and save episode |
| `h` (configurable) | Push dataset to Hub |
| `ESC` | Stop the session |
| Flag | Description |
| -------------------------------------- | ---------------------------------------------- |
| `--strategy.ring_buffer_seconds` | Duration of buffered telemetry (default: 30) |
| `--strategy.ring_buffer_max_memory_mb` | Memory cap for the ring buffer (default: 2048) |
| `--strategy.save_key` | Key to toggle recording (default: `s`) |
| `--strategy.push_key` | Key to push to Hub (default: `h`) |
### DAgger (`--strategy.type=dagger`)
Human-in-the-loop data collection. Alternates between autonomous policy execution and human intervention via a teleoperator. Intervention frames are tagged with `intervention=True`. Requires a teleoperator (`--teleop.type`).
See the [Human-In-the-Loop Data Collection](./hil_data_collection) guide for a detailed walkthrough.
**Corrections-only mode** (default): Only human correction windows are recorded. Each correction becomes one episode.
```bash
lerobot-rollout \
--strategy.type=dagger \
--strategy.num_episodes=20 \
--policy.path=outputs/pretrain/checkpoints/last/pretrained_model \
--robot.type=bi_openarm_follower \
--teleop.type=openarm_mini \
--dataset.repo_id=${HF_USER}/rollout_hil_data \
--dataset.single_task="Fold the T-shirt"
```
**Continuous recording mode** (`--strategy.record_autonomous=true`): Both autonomous and correction frames are recorded with time-based episode rotation (same as Sentry).
```bash
lerobot-rollout \
--strategy.type=dagger \
--strategy.record_autonomous=true \
--strategy.num_episodes=50 \
--policy.path=${HF_USER}/my_policy \
--robot.type=so100_follower \
--robot.port=/dev/ttyACM0 \
--teleop.type=so101_leader \
--teleop.port=/dev/ttyACM1 \
--dataset.repo_id=${HF_USER}/rollout_dagger_data \
--dataset.single_task="Grasp the block"
```
**Keyboard controls** (default input device):
| Key | Action |
| ------- | ------------------------------------------- |
| `Space` | Pause / resume policy execution |
| `Tab` | Start / stop human correction |
| `Enter` | Push dataset to Hub (corrections-only mode) |
| `ESC` | Stop the session |
Foot pedal input is also supported via `--strategy.input_device=pedal`. Configure pedal codes with `--strategy.pedal.*` flags.
| Flag | Description |
| ------------------------------------ | ------------------------------------------------------- |
| `--strategy.num_episodes` | Number of correction episodes to record (default: 10) |
| `--strategy.record_autonomous` | Record autonomous frames too (default: false) |
| `--strategy.upload_every_n_episodes` | Push to Hub every N episodes (default: 5) |
| `--strategy.input_device` | Input device: `keyboard` or `pedal` (default: keyboard) |
| `--teleop.type` | **Required.** Teleoperator type |
---
## Inference Backends
Select a backend with `--inference.type=<name>`. All strategies work with both backends.
### Sync (default)
One policy call per control tick. The main loop blocks until the action is computed.
Works with all policies. No extra flags needed.
### Real-Time Chunking (`--inference.type=rtc`)
A background thread produces action chunks asynchronously. The main control loop polls for the next ready action while the policy computes the next chunk in parallel.
Use RTC with large, slow VLA models (Pi0, Pi0.5, SmolVLA) for smooth, continuous motion despite high inference latency.
```bash
lerobot-rollout \
--strategy.type=base \
--inference.type=rtc \
--inference.rtc.execution_horizon=10 \
--inference.rtc.max_guidance_weight=10.0 \
--policy.path=${HF_USER}/pi0_policy \
--robot.type=so100_follower \
--robot.port=/dev/ttyACM0 \
--robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
--task="Pick up the cube" \
--duration=60 \
--device=cuda
```
| Flag | Description |
| ------------------------------------------- | -------------------------------------------------------------- |
| `--inference.rtc.execution_horizon` | Steps to blend with previous chunk (default: varies by policy) |
| `--inference.rtc.max_guidance_weight` | Consistency enforcement strength (default: varies by policy) |
| `--inference.rtc.prefix_attention_schedule` | Blend schedule: `LINEAR`, `EXP`, `ONES`, `ZEROS` |
| `--inference.queue_threshold` | Max queue size before backpressure (default: 30) |
See the [Real-Time Chunking](./rtc) guide for details on tuning RTC parameters.
---
## Common Flags
| Flag | Description | Default |
| --------------------------------- | ----------------------------------------------------------------- | ------- |
| `--policy.path` | **Required.** HF Hub model ID or local checkpoint path | -- |
| `--robot.type` | **Required.** Robot type (e.g. `so100_follower`, `koch_follower`) | -- |
| `--robot.port` | Serial port for the robot | -- |
| `--robot.cameras` | Camera configuration (JSON dict) | -- |
| `--fps` | Control loop frequency | 30 |
| `--duration` | Run time in seconds (0 = infinite) | 0 |
| `--device` | Torch device (`cpu`, `cuda`, `mps`) | auto |
| `--task` | Task description (used when no dataset is provided) | -- |
| `--display_data` | Stream telemetry to Rerun visualization | false |
| `--display_ip` / `--display_port` | Remote Rerun server address | -- |
| `--interpolation_multiplier` | Action interpolation factor | 1 |
| `--use_torch_compile` | Enable `torch.compile` for inference | false |
| `--resume` | Resume a previous recording session | false |
| `--play_sounds` | Vocal synthesis for events | true |
---
## Programmatic Usage
For custom deployments (e.g. with kinematics processors), use the rollout module API directly:
```python
from lerobot.rollout import BaseStrategyConfig, RolloutConfig, build_rollout_context
from lerobot.rollout.inference import SyncInferenceConfig
from lerobot.rollout.strategies import BaseStrategy
from lerobot.utils.process import ProcessSignalHandler
cfg = RolloutConfig(
robot=my_robot_config,
policy=my_policy_config,
strategy=BaseStrategyConfig(),
inference=SyncInferenceConfig(),
fps=30,
duration=60,
task="my task",
)
signal_handler = ProcessSignalHandler(use_threads=True)
ctx = build_rollout_context(
cfg,
signal_handler.shutdown_event,
robot_action_processor=my_custom_action_processor, # optional
robot_observation_processor=my_custom_obs_processor, # optional
)
strategy = BaseStrategy(cfg.strategy)
try:
strategy.setup(ctx)
strategy.run(ctx)
finally:
strategy.teardown(ctx)
```
See `examples/so100_to_so100_EE/rollout.py` and `examples/phone_to_so100/rollout.py` for full examples with kinematics processors.

View File

@@ -1,147 +0,0 @@
# Language columns and recipes
Most LeRobot datasets ship with a single `task` string per episode — fine for
short, single-instruction skills, but not enough for the longer-horizon,
multi-modal robot policies the field is moving toward (high-level planning,
memory, interjections, VQA, tool use). To support those policies without
forking the dataset format, LeRobot extends `LeRobotDataset` with two optional
language columns and a small recipe layer that turns those rows into
chat-style training samples on the fly.
The design splits cleanly into three layers:
1. **Data in the dataset** — language annotations stored next to frames in
`data/chunk-*/file-*.parquet` as two optional columns (`language_persistent`
and `language_events`). Datasets without these columns keep their existing
behavior.
2. **Recipe** — a YAML file that declares which annotation rows to bind and
how to lay them out as chat turns (`role`, `content`, optional images,
optional tool calls). Recipes are pure config; no Python required to add a
new one.
3. **Training format** — at sample time, `RenderMessagesStep` resolves the
recipe against the per-frame annotations and emits HF-style `messages` plus
LeRobot-specific sidecars (`message_streams`, `target_message_indices`)
that policy processors consume.
This page describes each layer in turn.
## Layer 1 — language columns in the dataset
The two optional columns live next to frame data in
`data/chunk-*/file-*.parquet`:
- `language_persistent`: a list of rows broadcast across every frame in an episode for state that remains active, such as `subtask`, `plan`, and `memory`.
- `language_events`: a list of rows only on the exact frame where an event was emitted, such as `interjection`, `vqa`, and speech tool calls.
Both columns share the same row shape (event rows omit `timestamp` because the
frame the row sits on already provides it):
```text
role: string
content: string | null
style: string | null
timestamp: float64 # persistent rows only
camera: string | null # observation.images.* feature key, view-dependent rows only
tool_calls: list[Json] | null
```
The `camera` field tags rows whose `content` is grounded in a specific camera
view. Rows of view-dependent styles (`vqa` and `trace`) MUST set `camera` to
the matching `observation.images.*` feature key. Rows of every other style —
including `motion`, which describes robot-frame primitives in joint / Cartesian
terms — MUST leave `camera` as `null`. Pipeline writers and the validator
enforce this via `validate_camera_field(style, camera)`.
`meta/tasks.parquet` remains the canonical source for the task. The special `${task}` recipe binding always reads that task string and does not depend on language annotations.
### Architecture
The language stack itself has three internal modules backing layer 1:
1. `lerobot.datasets.language` defines the schema, style registry, and `column_for_style`.
2. `lerobot.datasets.language_render` resolves rows and renders messages.
3. `RenderMessagesStep` turns dataset samples into `messages`, `message_streams`, and `target_message_indices`.
`LeRobotDataset` stays recipe-agnostic. It passes `language_persistent` and `language_events` through when present, and unannotated datasets keep their existing behavior.
### Temporal semantics
Persistent styles are active after emission until replaced:
- `active_at(t, style=subtask)`
- `nth_prev(style=memory, offset=1)`
- `nth_next(style=subtask, offset=1)`
Event styles only exist on their exact timestamp:
- `emitted_at(t, style=interjection)`
- `emitted_at(t, style=vqa, role=user, camera=observation.images.top)`
- `emitted_at(t, role=assistant, tool_name=say)`
Exact event matching has no tolerance window, so writers must stamp event rows with frame timestamps from the parquet data.
### View-dependent resolution
For view-dependent styles (`vqa` and `trace`), the resolver gains a
`camera=` filter parallel to `role=` and `tool_name=`. Datasets with multiple
cameras typically emit one (`vqa`, `user`) + (`vqa`, `assistant`) pair per
camera at the same timestamp; without `camera=`, those resolvers see two
matches and raise an ambiguity error. Recipes consume each camera through its
own binding plus a matching image block, e.g.
```yaml
ask_vqa_top:
bindings:
vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.top)"
vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.top)"
messages:
- role: user
stream: high_level
if_present: vqa_query
content:
- { type: image, feature: observation.images.top }
- { type: text, text: "${vqa_query}" }
- {
role: assistant,
content: "${vqa}",
stream: high_level,
target: true,
if_present: vqa,
}
```
Add one such sub-recipe per camera the dataset records.
## Layer 2 — recipe anatomy
Recipes are YAML files backed by `TrainingRecipe` and `MessageTurn`. They
declare which annotation rows to pull (via `bindings`) and how to compose them
into chat turns (`messages`).
```yaml
messages:
- { role: user, content: "${task}", stream: high_level }
- { role: assistant, content: "${subtask}", stream: low_level, target: true }
```
A recipe can also branch into a weighted **blend** of sub-recipes. At sample
time, exactly one branch is selected deterministically from the sample index,
so different frames train different objectives (e.g. memory updates vs.
low-level execution vs. VQA) without any Python wiring.
## Layer 3 — training format
Rendered samples use HF-style chat messages plus LeRobot sidecars:
```python
sample["messages"]
sample["message_streams"]
sample["target_message_indices"]
```
The renderer does not apply a tokenizer chat template. Policy processors decide how to serialize the messages for their backbone, which keeps the same dataset usable across SmolVLA, Pi0.5, and any future VLM that expects OpenAI-style chat messages.
## Graceful absence
If both language columns are missing, `None`, or empty, `RenderMessagesStep` is a no-op.
If an event-scoped branch is selected on a frame without the required event row, rendering returns `None`, allowing a loader to retry another sample.

View File

@@ -1,188 +0,0 @@
# LIBERO-plus
LIBERO-plus is a **robustness benchmark** for Vision-Language-Action (VLA) models built on top of [LIBERO](./libero). It systematically stress-tests policies by applying **seven independent perturbation dimensions** to the original LIBERO task set, exposing failure modes that standard benchmarks miss.
- Paper: [In-depth Robustness Analysis of Vision-Language-Action Models](https://arxiv.org/abs/2510.13626)
- GitHub: [sylvestf/LIBERO-plus](https://github.com/sylvestf/LIBERO-plus)
- Dataset: [lerobot/libero_plus](https://huggingface.co/datasets/lerobot/libero_plus)
![An overview of the LIBERO-plus benchmark perturbation dimensions](https://github.com/sylvestf/LIBERO-plus/raw/main/static/images/libero-plus.jpg)
## Perturbation dimensions
LIBERO-plus creates ~10 000 task variants by perturbing each original LIBERO task along these axes:
| Dimension | What changes |
| --------------------- | ----------------------------------------------------- |
| Objects layout | Target position, presence of confounding objects |
| Camera viewpoints | Camera position, orientation, field-of-view |
| Robot initial states | Manipulator start pose |
| Language instructions | LLM-rewritten task description (paraphrase / synonym) |
| Light conditions | Intensity, direction, color, shadow |
| Background textures | Scene surface and object appearance |
| Sensor noise | Photometric distortions and image degradation |
## Available task suites
LIBERO-plus covers the same five suites as LIBERO:
| Suite | CLI name | Tasks | Max steps | Description |
| -------------- | ---------------- | ----- | --------- | -------------------------------------------------- |
| LIBERO-Spatial | `libero_spatial` | 10 | 280 | Tasks requiring reasoning about spatial relations |
| LIBERO-Object | `libero_object` | 10 | 280 | Tasks centered on manipulating different objects |
| LIBERO-Goal | `libero_goal` | 10 | 300 | Goal-conditioned tasks with changing targets |
| LIBERO-90 | `libero_90` | 90 | 400 | Short-horizon tasks from the LIBERO-100 collection |
| LIBERO-Long | `libero_10` | 10 | 520 | Long-horizon tasks from the LIBERO-100 collection |
<Tip warning={true}>
Installing LIBERO-plus **replaces** vanilla LIBERO — it uninstalls `hf-libero`
so that `import libero` resolves to the LIBERO-plus fork. You cannot have both
installed at the same time. To switch back to vanilla LIBERO, uninstall the
fork and reinstall with `pip install -e ".[libero]"`.
</Tip>
## Installation
### System dependencies (Linux only)
```bash
sudo apt install libexpat1 libfontconfig1-dev libmagickwand-dev
```
### Python package
```bash
pip install -e ".[libero]" "robosuite==1.4.1" bddl easydict mujoco wand scikit-image gym
git clone https://github.com/sylvestf/LIBERO-plus.git
cd LIBERO-plus && pip install --no-deps -e .
pip uninstall -y hf-libero # so `import libero` resolves to the fork
```
LIBERO-plus is installed from its GitHub fork rather than a pyproject extra — the fork ships as a namespace package that pip can't handle, so it must be cloned and added to `PYTHONPATH`. See `docker/Dockerfile.benchmark.libero_plus` for the canonical install. MuJoCo is required, so only Linux is supported.
<Tip>
Set the MuJoCo rendering backend before running evaluation:
```bash
export MUJOCO_GL=egl # headless / HPC / cloud
```
</Tip>
### Download LIBERO-plus assets
LIBERO-plus ships its extended asset pack separately. Download `assets.zip` from the [Hugging Face dataset](https://huggingface.co/datasets/Sylvest/LIBERO-plus/tree/main) and extract it into the LIBERO-plus package directory:
```bash
# After installing the package, find where it was installed:
python -c "import libero; print(libero.__file__)"
# Then extract assets.zip into <package_root>/libero/assets/
```
## Evaluation
### Default evaluation (recommended)
Evaluate across the four standard suites (10 episodes per task):
```bash
lerobot-eval \
--policy.path="your-policy-id" \
--env.type=libero_plus \
--env.task=libero_spatial,libero_object,libero_goal,libero_10 \
--eval.batch_size=1 \
--eval.n_episodes=10 \
--env.max_parallel_tasks=1
```
### Single-suite evaluation
Evaluate on one LIBERO-plus suite:
```bash
lerobot-eval \
--policy.path="your-policy-id" \
--env.type=libero_plus \
--env.task=libero_spatial \
--eval.batch_size=1 \
--eval.n_episodes=10
```
- `--env.task` picks the suite (`libero_spatial`, `libero_object`, etc.).
- `--env.task_ids` restricts to specific task indices (`[0]`, `[1,2,3]`, etc.). Omit to run all tasks in the suite.
- `--eval.batch_size` controls how many environments run in parallel.
- `--eval.n_episodes` sets how many episodes to run per task.
### Multi-suite evaluation
Benchmark a policy across multiple suites at once by passing a comma-separated list:
```bash
lerobot-eval \
--policy.path="your-policy-id" \
--env.type=libero_plus \
--env.task=libero_spatial,libero_object \
--eval.batch_size=1 \
--eval.n_episodes=10
```
### Control mode
LIBERO-plus supports two control modes — `relative` (default) and `absolute`. Different VLA checkpoints are trained with different action parameterizations, so make sure the mode matches your policy:
```bash
--env.control_mode=relative # or "absolute"
```
### Policy inputs and outputs
**Observations:**
- `observation.state` — 8-dim proprioceptive features (eef position, axis-angle orientation, gripper qpos)
- `observation.images.image` — main camera view (`agentview_image`), HWC uint8
- `observation.images.image2` — wrist camera view (`robot0_eye_in_hand_image`), HWC uint8
**Actions:**
- Continuous control in `Box(-1, 1, shape=(7,))` — 6D end-effector delta + 1D gripper
### Recommended evaluation episodes
For reproducible benchmarking, use **10 episodes per task** across all four standard suites (Spatial, Object, Goal, Long). This gives 400 total episodes and matches the protocol used for published results.
## Training
### Dataset
A LeRobot-format training dataset for LIBERO-plus is available at:
- [lerobot/libero_plus](https://huggingface.co/datasets/lerobot/libero_plus)
### Example training command
```bash
lerobot-train \
--policy.type=smolvla \
--policy.repo_id=${HF_USER}/smolvla_libero_plus \
--policy.load_vlm_weights=true \
--dataset.repo_id=lerobot/libero_plus \
--env.type=libero_plus \
--env.task=libero_spatial \
--output_dir=./outputs/ \
--steps=100000 \
--batch_size=4 \
--eval.batch_size=1 \
--eval.n_episodes=1 \
--eval_freq=1000
```
## Relationship to LIBERO
LIBERO-plus is a drop-in extension of LIBERO:
- Same Python gym interface (`LiberoEnv`, `LiberoProcessorStep`)
- Same camera names and observation/action format
- Same task suite names
- Installs under the same `libero` Python package name (different GitHub repo)
To use the original LIBERO benchmark, see [LIBERO](./libero) and use `--env.type=libero`.

View File

@@ -61,6 +61,17 @@ lerobot-eval \
--rename_map='{"observation.images.image": "observation.images.base_0_rgb", "observation.images.image2": "observation.images.left_wrist_0_rgb"}'
```
### Recording
`lerobot-record` also supports rename maps, nested under the dataset config:
```bash
lerobot-record \ # When running inference
--policy.path="<user>/smolVLA_finetuned" \
... \
--dataset.rename_map='{"observation.images.glove2": "observation.images.image"}'
```
## Alternative: edit the policy config directly
If you always use the same dataset or environment, you can **edit the policy's `config.json`** so its observation keys match your data source. Then no rename map is needed.
@@ -94,10 +105,10 @@ XVLA-base has three visual inputs and `empty_cameras=0` by default. Your dataset
## Quick reference
| Goal | What to do |
| --------------------------------------- | --------------------------------------------------------------------------- |
| Dataset keys ≠ policy keys | `--rename_map='{"dataset_key": "policy_key", ...}'` |
| Env keys ≠ policy keys (eval) | `--rename_map='{"env_key": "policy_key", ...}'` |
| Rollout with different keys (inference) | `--rename_map='{"source_key": "policy_key", ...}'`. |
| Fewer cameras than policy expects | `--policy.empty_cameras=N` (supported by PI0, PI05, PI0Fast, SmolVLA, XVLA) |
| Avoid passing a rename map | Edit the policy's `config.json` so its keys match your data source |
| Goal | What to do |
| ----------------------------------------- | --------------------------------------------------------------------------- |
| Dataset keys ≠ policy keys | `--rename_map='{"dataset_key": "policy_key", ...}'` |
| Env keys ≠ policy keys (eval) | `--rename_map='{"env_key": "policy_key", ...}'` |
| Recording with different keys (inference) | `--dataset.rename_map='{"source_key": "policy_key", ...}'`. |
| Fewer cameras than policy expects | `--policy.empty_cameras=N` (supported by PI0, PI05, PI0Fast, SmolVLA, XVLA) |
| Avoid passing a rename map | Edit the policy's `config.json` so its keys match your data source |

View File

@@ -1,188 +0,0 @@
# RoboCasa365
[RoboCasa365](https://robocasa.ai) is a large-scale simulation framework for training and benchmarking **generalist robots** in everyday kitchen tasks. It ships 365 diverse manipulation tasks across 2,500 kitchen environments, 3,200+ object assets and 600+ hours of human demonstration data, on a PandaOmron 12-DOF mobile manipulator (Franka arm on a holonomic base).
- Paper: [RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots](https://arxiv.org/abs/2406.02523)
- GitHub: [robocasa/robocasa](https://github.com/robocasa/robocasa)
- Project website: [robocasa.ai](https://robocasa.ai)
- Pretrained policy: [`lerobot/smolvla_robocasa`](https://huggingface.co/lerobot/smolvla_robocasa)
- Single-task dataset (CloseFridge): [`pepijn223/robocasa_CloseFridge`](https://huggingface.co/datasets/pepijn223/robocasa_CloseFridge)
<img
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/robocasa-banner.webp"
alt="RoboCasa365 benchmark overview"
width="85%"
/>
## Available tasks
RoboCasa365 organizes its 365 tasks into two families and three upstream benchmark groups that LeRobot exposes as first-class `--env.task` shortcuts:
| Family | Tasks | Description |
| --------- | ----- | ------------------------------------------------------------------------------- |
| Atomic | ~65 | Single-skill tasks: pick-and-place, door/drawer manipulation, appliance control |
| Composite | ~300 | Multi-step tasks across 60+ categories: cooking, cleaning, organizing, etc. |
**Atomic task examples:** `CloseFridge`, `OpenDrawer`, `OpenCabinet`, `TurnOnMicrowave`, `TurnOffStove`, `NavigateKitchen`, `PickPlaceCounterToStove`.
**Composite task categories:** baking, boiling, brewing, chopping, clearing table, defrosting food, loading dishwasher, making tea, microwaving food, washing dishes, and more.
`--env.task` accepts three forms:
- a single task name (`CloseFridge`)
- a comma-separated list (`CloseFridge,OpenBlenderLid,PickPlaceCoffee`)
- a benchmark-group shortcut — `atomic_seen`, `composite_seen`, `composite_unseen`, `pretrain50`, `pretrain100`, `pretrain200`, `pretrain300` — which auto-expands to the upstream task list and auto-sets the dataset `split` (`target` or `pretrain`).
## Installation
RoboCasa and its dependency `robosuite` are not published on PyPI, and RoboCasa's own `setup.py` hardcodes `lerobot==0.3.3`, which conflicts with this repo's `lerobot`. LeRobot therefore does **not** expose a `robocasa` extra — install the two packages manually as editable clones (using `--no-deps` on `robocasa` to skip its shadowed `lerobot` pin):
```bash
# After following the standard LeRobot installation instructions.
git clone https://github.com/robocasa/robocasa.git ~/robocasa
git clone https://github.com/ARISE-Initiative/robosuite.git ~/robosuite
pip install -e ~/robocasa --no-deps
pip install -e ~/robosuite
# Robocasa's runtime deps (the ones its setup.py would have pulled, minus
# the bad lerobot pin).
pip install numpy numba scipy mujoco pygame Pillow opencv-python \
pyyaml pynput tqdm termcolor imageio h5py lxml hidapi \
tianshou gymnasium
python -m robocasa.scripts.setup_macros
# Lightweight assets (lightwheel object meshes + textures). Enough for
# the default env out of the box.
python -m robocasa.scripts.download_kitchen_assets \
--type tex tex_generative fixtures_lw objs_lw
# Optional: full objaverse/aigen registries (~30GB) for richer object
# variety. Enable at eval time via --env.obj_registries (see below).
# python -m robocasa.scripts.download_kitchen_assets --type objs_objaverse
```
<Tip>
RoboCasa requires MuJoCo. Set the rendering backend before training or evaluation:
```bash
export MUJOCO_GL=egl # for headless servers (HPC, cloud)
```
</Tip>
### Object registries
By default the env samples objects only from the `lightwheel` registry (what `--type objs_lw` ships), which avoids a `Probabilities contain NaN` crash when the objaverse / aigen packs aren't on disk. If you've downloaded the full asset set, enable the full registry at runtime:
```bash
--env.obj_registries='[objaverse,lightwheel]'
```
## Evaluation
All eval snippets below mirror the CI command (see `.github/workflows/benchmark_tests.yml`). The `--rename_map` argument maps RoboCasa's native camera keys (`robot0_agentview_left` / `robot0_eye_in_hand` / `robot0_agentview_right`) onto the three-camera (`camera1` / `camera2` / `camera3`) input layout the released `smolvla_robocasa` policy was trained on.
### Single-task evaluation (recommended for quick iteration)
```bash
lerobot-eval \
--policy.path=lerobot/smolvla_robocasa \
--env.type=robocasa \
--env.task=CloseFridge \
--eval.batch_size=1 \
--eval.n_episodes=20 \
--eval.use_async_envs=false \
--policy.device=cuda \
'--rename_map={"observation.images.robot0_agentview_left": "observation.images.camera1", "observation.images.robot0_eye_in_hand": "observation.images.camera2", "observation.images.robot0_agentview_right": "observation.images.camera3"}'
```
### Multi-task evaluation
Pass a comma-separated list of tasks:
```bash
lerobot-eval \
--policy.path=lerobot/smolvla_robocasa \
--env.type=robocasa \
--env.task=CloseFridge,OpenCabinet,OpenDrawer,TurnOnMicrowave,TurnOffStove \
--eval.batch_size=1 \
--eval.n_episodes=20 \
--eval.use_async_envs=false \
--policy.device=cuda \
'--rename_map={"observation.images.robot0_agentview_left": "observation.images.camera1", "observation.images.robot0_eye_in_hand": "observation.images.camera2", "observation.images.robot0_agentview_right": "observation.images.camera3"}'
```
### Benchmark-group evaluation
Run an entire upstream group (e.g. all 18 `atomic_seen` tasks with `split=target`):
```bash
lerobot-eval \
--policy.path=lerobot/smolvla_robocasa \
--env.type=robocasa \
--env.task=atomic_seen \
--eval.batch_size=1 \
--eval.n_episodes=20 \
--eval.use_async_envs=false \
--policy.device=cuda \
'--rename_map={"observation.images.robot0_agentview_left": "observation.images.camera1", "observation.images.robot0_eye_in_hand": "observation.images.camera2", "observation.images.robot0_agentview_right": "observation.images.camera3"}'
```
### Recommended evaluation episodes
**20 episodes per task** for reproducible benchmarking. Matches the protocol used in published results.
## Policy inputs and outputs
**Observations** (raw RoboCasa camera names are preserved verbatim):
- `observation.state` — 16-dim proprioceptive state (base position, base quaternion, relative end-effector position, relative end-effector quaternion, gripper qpos)
- `observation.images.robot0_agentview_left` — left agent view, 256×256 HWC uint8
- `observation.images.robot0_eye_in_hand` — wrist camera view, 256×256 HWC uint8
- `observation.images.robot0_agentview_right` — right agent view, 256×256 HWC uint8
**Actions:**
- Continuous control in `Box(-1, 1, shape=(12,))` — base motion (4D) + control mode (1D) + end-effector position (3D) + end-effector rotation (3D) + gripper (1D).
## Training
### Single-task example
A ready-to-use single-task dataset is on the Hub:
[`pepijn223/robocasa_CloseFridge`](https://huggingface.co/datasets/pepijn223/robocasa_CloseFridge).
Fine-tune a SmolVLA base on `CloseFridge`:
```bash
lerobot-train \
--policy.type=smolvla \
--policy.repo_id=${HF_USER}/smolvla_robocasa_CloseFridge \
--policy.load_vlm_weights=true \
--policy.push_to_hub=true \
--dataset.repo_id=pepijn223/robocasa_CloseFridge \
--env.type=robocasa \
--env.task=CloseFridge \
--output_dir=./outputs/smolvla_robocasa_CloseFridge \
--steps=100000 \
--batch_size=4 \
--eval_freq=5000 \
--eval.batch_size=1 \
--eval.n_episodes=5 \
--save_freq=10000
```
Evaluate the resulting checkpoint:
```bash
lerobot-eval \
--policy.path=${HF_USER}/smolvla_robocasa_CloseFridge \
--env.type=robocasa \
--env.task=CloseFridge \
--eval.batch_size=1 \
--eval.n_episodes=20
```
## Reproducing published results
The released checkpoint [`lerobot/smolvla_robocasa`](https://huggingface.co/lerobot/smolvla_robocasa) is evaluated with the commands in the [Evaluation](#evaluation) section. CI runs a 10-atomic-task smoke eval (one episode each) on every PR touching the benchmark, picking fixture-centric tasks that don't require the objaverse asset pack.

View File

@@ -1,99 +0,0 @@
# RoboCerebra
[RoboCerebra](https://robocerebra-project.github.io/) is a long-horizon manipulation benchmark that evaluates **high-level reasoning, planning, and memory** in VLAs. Episodes chain multiple sub-goals with language-grounded intermediate instructions, built on top of LIBERO's simulator stack (MuJoCo + robosuite, Franka Panda 7-DOF).
- Paper: [RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation](https://arxiv.org/abs/2506.06677)
- Project website: [robocerebra-project.github.io](https://robocerebra-project.github.io/)
- Dataset: [`lerobot/robocerebra_unified`](https://huggingface.co/datasets/lerobot/robocerebra_unified) — LeRobot v3.0, 6,660 episodes / 571,116 frames at 20 fps, 1,728 language-grounded sub-tasks.
- Pretrained policy: [`lerobot/smolvla_robocerebra`](https://huggingface.co/lerobot/smolvla_robocerebra)
## Available tasks
RoboCerebra reuses LIBERO's simulator, so evaluation runs against the LIBERO `libero_10` long-horizon suite:
| Suite | CLI name | Tasks | Description |
| --------- | ----------- | ----- | ------------------------------------------------------------- |
| LIBERO-10 | `libero_10` | 10 | Long-horizon kitchen/living room tasks chaining 36 sub-goals |
Each RoboCerebra episode in the dataset is segmented into multiple sub-tasks with natural-language instructions, which the unified dataset exposes as independent supervision signals.
## Installation
RoboCerebra piggybacks on LIBERO, so the `libero` extra is all you need:
```bash
pip install -e ".[libero]"
```
<Tip>
RoboCerebra requires Linux (MuJoCo / robosuite). Set the rendering backend before training or evaluation:
```bash
export MUJOCO_GL=egl # for headless servers (HPC, cloud)
```
</Tip>
## Evaluation
RoboCerebra eval runs against LIBERO's `libero_10` suite with RoboCerebra's camera naming (`image` + `wrist_image`) and an extra empty-camera slot so a three-view-trained policy receives the expected input layout:
```bash
lerobot-eval \
--policy.path=lerobot/smolvla_robocerebra \
--env.type=libero \
--env.task=libero_10 \
--env.fps=20 \
--env.obs_type=pixels_agent_pos \
--env.observation_height=256 \
--env.observation_width=256 \
'--env.camera_name_mapping={"agentview_image": "image", "robot0_eye_in_hand_image": "wrist_image"}' \
--eval.batch_size=1 \
--eval.n_episodes=10 \
--eval.use_async_envs=false \
--policy.device=cuda \
'--rename_map={"observation.images.image": "observation.images.camera1", "observation.images.wrist_image": "observation.images.camera2"}' \
--policy.empty_cameras=1
```
### Recommended evaluation episodes
**10 episodes per task** across the `libero_10` suite (100 total) for reproducible benchmarking. Matches the protocol used in the RoboCerebra paper.
## Policy inputs and outputs
**Observations:**
- `observation.state` — 8-dim proprioceptive state (7 joint positions + gripper)
- `observation.images.image` — third-person view, 256×256 HWC uint8
- `observation.images.wrist_image` — wrist-mounted camera view, 256×256 HWC uint8
**Actions:**
- Continuous control in `Box(-1, 1, shape=(7,))` — end-effector delta (6D) + gripper (1D)
## Training
The unified dataset at [`lerobot/robocerebra_unified`](https://huggingface.co/datasets/lerobot/robocerebra_unified) exposes two RGB streams and language-grounded sub-task annotations:
| Feature | Shape | Description |
| -------------------------------- | ------------- | -------------------- |
| `observation.images.image` | (256, 256, 3) | Third-person view |
| `observation.images.wrist_image` | (256, 256, 3) | Wrist-mounted camera |
| `observation.state` | (8,) | Joint pos + gripper |
| `action` | (7,) | EEF delta + gripper |
Fine-tune a SmolVLA base on it:
```bash
lerobot-train \
--policy.path=lerobot/smolvla_base \
--dataset.repo_id=lerobot/robocerebra_unified \
--env.type=libero \
--env.task=libero_10 \
--output_dir=outputs/smolvla_robocerebra
```
## Reproducing published results
The released checkpoint [`lerobot/smolvla_robocerebra`](https://huggingface.co/lerobot/smolvla_robocerebra) was trained on `lerobot/robocerebra_unified` and evaluated with the command in the [Evaluation](#evaluation) section. CI runs the same command with `--eval.n_episodes=1` as a smoke test on every PR touching the benchmark.

View File

@@ -1,130 +0,0 @@
# RoboMME
[RoboMME](https://robomme.github.io) is a memory-augmented manipulation benchmark built on ManiSkill (SAPIEN). It evaluates a robot's ability to retain and use information across an episode — counting, object permanence, reference, and imitation.
- **16 tasks** across 4 memory-skill suites
- **1,600 training demos** (100 per task, 50 val, 50 test)
- **Dataset**: [`lerobot/robomme`](https://huggingface.co/datasets/lerobot/robomme) — LeRobot v3.0, 768K frames at 10 fps
- **Simulator**: ManiSkill / SAPIEN, Panda arm, Linux only
![RoboMME benchmark tasks overview](https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2603.04639/gradient.png)
## Tasks
| Suite | Tasks |
| --------------------------------- | ------------------------------------------------------------- |
| **Counting** (temporal memory) | BinFill, PickXtimes, SwingXtimes, StopCube |
| **Permanence** (spatial memory) | VideoUnmask, VideoUnmaskSwap, ButtonUnmask, ButtonUnmaskSwap |
| **Reference** (object memory) | PickHighlight, VideoRepick, VideoPlaceButton, VideoPlaceOrder |
| **Imitation** (procedural memory) | MoveCube, InsertPeg, PatternLock, RouteStick |
## Installation
> RoboMME requires **Linux** (ManiSkill/SAPIEN uses Vulkan rendering). Docker is recommended to isolate dependency conflicts.
### Native (Linux)
```bash
pip install --override <(printf 'gymnasium==0.29.1\nnumpy==1.26.4\n') \
-e '.[smolvla,av-dep]' \
'robomme @ git+https://github.com/RoboMME/robomme_benchmark.git@main'
```
> **Dependency note**: `mani-skill` (pulled by `robomme`) pins `gymnasium==0.29.1` and `numpy<2.0.0`, which conflict with lerobot's base `numpy>=2.0.0`. That's why `robomme` is not a pyproject extra — use the override install above, or the Docker approach below to avoid conflicts entirely.
### Docker (recommended)
```bash
# Build base image first (from repo root)
docker build -f docker/Dockerfile.eval-base -t lerobot-eval-base .
# Build RoboMME eval image (applies gymnasium + numpy pin overrides)
docker build -f docker/Dockerfile.benchmark.robomme -t lerobot-robomme .
```
The `docker/Dockerfile.benchmark.robomme` image overrides `gymnasium==0.29.1` and `numpy==1.26.4` after lerobot's install. Both versions are runtime-safe for lerobot's actual API usage.
## Running Evaluation
### Default (single task, single episode)
```bash
lerobot-eval \
--policy.path=<your_policy_repo> \
--env.type=robomme \
--env.task=PickXtimes \
--env.dataset_split=test \
--env.task_ids=[0] \
--eval.batch_size=1 \
--eval.n_episodes=1
```
### Multi-task evaluation
Evaluate multiple tasks in one run by comma-separating task names. Use `task_ids` to control which episodes are evaluated per task. Recommended: 50 episodes per task for the test split.
```bash
lerobot-eval \
--policy.path=<your_policy_repo> \
--env.type=robomme \
--env.task=PickXtimes,BinFill,StopCube,MoveCube,InsertPeg \
--env.dataset_split=test \
--env.task_ids=[0,1,2,3,4,5,6,7,8,9] \
--eval.batch_size=1 \
--eval.n_episodes=50
```
### Key CLI options for `env.type=robomme`
| Option | Default | Description |
| -------------------- | ------------- | -------------------------------------------------- |
| `env.task` | `PickXtimes` | Any of the 16 task names above (comma-separated) |
| `env.dataset_split` | `test` | `train`, `val`, or `test` |
| `env.action_space` | `joint_angle` | `joint_angle` (8-D) or `ee_pose` (7-D) |
| `env.episode_length` | `300` | Max steps per episode |
| `env.task_ids` | `null` | List of episode indices to evaluate (null = `[0]`) |
## Dataset
The dataset [`lerobot/robomme`](https://huggingface.co/datasets/lerobot/robomme) is in **LeRobot v3.0 format** and can be loaded directly:
```python
from lerobot.datasets.lerobot_dataset import LeRobotDataset
dataset = LeRobotDataset("lerobot/robomme")
```
### Dataset features
| Feature | Shape | Description |
| ------------------ | ------------- | ------------------------------- |
| `image` | (256, 256, 3) | Front camera RGB |
| `wrist_image` | (256, 256, 3) | Wrist camera RGB |
| `actions` | (8,) | Joint angles + gripper |
| `state` | (8,) | Joint positions + gripper state |
| `simple_subgoal` | str | High-level language annotation |
| `grounded_subgoal` | str | Grounded language annotation |
| `episode_index` | int | Episode ID |
| `frame_index` | int | Frame within episode |
### Feature key alignment (training)
The env wrapper exposes `pixels/image` and `pixels/wrist_image` as observation keys. The `features_map` in `RoboMMEEnv` maps these to `observation.images.image` and `observation.images.wrist_image` for the policy. State is exposed as `agent_pos` and maps to `observation.state`.
The dataset's `image` and `wrist_image` columns already align with the policy input keys, so no renaming is needed when fine-tuning.
## Action Spaces
| Type | Dim | Description |
| ------------- | --- | --------------------------------------------------------- |
| `joint_angle` | 8 | 7 joint angles + 1 gripper (1 closed, +1 open, absolute) |
| `ee_pose` | 7 | xyz + roll/pitch/yaw + gripper |
Set via `--env.action_space=joint_angle` (default) or `--env.action_space=ee_pose`.
## Platform Notes
- **Linux only**: ManiSkill requires SAPIEN/Vulkan. macOS and Windows are not supported.
- **GPU recommended**: Rendering is CPU-capable but slow; CUDA + Vulkan gives full speed.
- **gymnasium / numpy conflict**: See installation note above. Docker image handles this automatically.
- **ManiSkill fork**: `robomme` depends on a specific ManiSkill fork (`YinpeiDai/ManiSkill`), pulled in automatically via the `robomme` package.

View File

@@ -1,223 +0,0 @@
# RoboTwin 2.0
RoboTwin 2.0 is a **large-scale dual-arm manipulation benchmark** built on the SAPIEN physics engine. It provides a standardized evaluation protocol for bimanual robotic policies across 50 tasks (as of upstream `main`) with strong domain randomization (clutter, lighting, background, tabletop height, and language instructions).
- Paper: [RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation](https://arxiv.org/abs/2506.18088)
- GitHub: [RoboTwin-Platform/RoboTwin](https://github.com/RoboTwin-Platform/RoboTwin)
- Leaderboard: [robotwin-platform.github.io/leaderboard](https://robotwin-platform.github.io/leaderboard)
- Dataset: [lerobot/robotwin_unified](https://huggingface.co/datasets/lerobot/robotwin_unified)
![RoboTwin 2.0 benchmark overview](https://www.aitntnews.com/pictures/2025/7/8/9a7f79cb-5ba9-11f0-8581-fa163e47d677.png)
## Overview
| Property | Value |
| ------------- | -------------------------------------------------------- |
| Tasks | 50 dual-arm manipulation tasks |
| Robot | Aloha-AgileX bimanual (14 DOF, 7 per arm) |
| Action space | 14-dim joint-space, continuous in `[-1, 1]` |
| Cameras | `head_camera`, `left_camera`, `right_camera` |
| Simulator | SAPIEN (not MuJoCo) |
| Eval protocol | 100 episodes/task, 50 demo_clean demonstrations |
| Eval settings | **Easy** (`demo_clean`) and **Hard** (`demo_randomized`) |
## Available tasks
RoboTwin 2.0 ships 50 dual-arm manipulation tasks in its upstream `envs/` directory. The canonical list is the `ROBOTWIN_TASKS` tuple in `src/lerobot/envs/robotwin.py`, mirrored verbatim from the upstream repo. Example tasks:
| Task | CLI name | Category |
| ------------------------ | ------------------------ | ----------------- |
| Beat block with hammer | `beat_block_hammer` | Tool use |
| Click bell / alarm clock | `click_bell` | Precision press |
| Stack blocks (2 / 3) | `stack_blocks_two/three` | Stacking |
| Stack bowls (2 / 3) | `stack_bowls_two/three` | Stacking |
| Handover block / mic | `handover_block` | Bimanual coord. |
| Lift pot | `lift_pot` | Bimanual lift |
| Shake bottle | `shake_bottle` | Continuous motion |
| Turn switch | `turn_switch` | Articulated obj |
| Stamp seal | `stamp_seal` | Precision place |
| Scan object | `scan_object` | Mobile manip. |
Pass a comma-separated list to `--env.task` to run multiple tasks in a single eval sweep.
<Tip warning={true}>
`open_laptop` is currently broken upstream (its `check_success()` uses
`self.arm_tag`, which is only set inside the scripted-expert `play_once()`
path and therefore unavailable during normal policy eval). Avoid it until the
upstream bug is fixed, or patch the task to default `self.arm_tag = "left"` in
`load_actors()`.
</Tip>
## Dataset
The RoboTwin 2.0 dataset is available in **LeRobot v3.0 format** on the Hugging Face Hub:
```
lerobot/robotwin_unified
```
It contains over 100,000 pre-collected trajectories across all 50 tasks (79.6 GB, Apache 2.0 license). No format conversion is needed — it is already in the correct LeRobot v3.0 schema with video observations and action labels.
You can load it directly with the HF Datasets library:
```python
from datasets import load_dataset
ds = load_dataset("lerobot/robotwin_unified", split="train")
```
## Installation
RoboTwin 2.0 requires **Linux** with an NVIDIA GPU (CUDA 12.1 recommended). Installation takes approximately 20 minutes.
### 1. Create a conda environment
```bash
conda create -n robotwin python=3.10 -y
conda activate robotwin
```
### 2. Install LeRobot
```bash
git clone https://github.com/huggingface/lerobot.git
cd lerobot
pip install -e "."
```
### 3. Install RoboTwin 2.0
```bash
git clone https://github.com/RoboTwin-Platform/RoboTwin.git
cd RoboTwin
bash script/_install.sh
bash script/_download_assets.sh
```
The install script handles all Python dependencies including SAPIEN, CuRobo, mplib, and pytorch3d.
<Tip warning={true}>
If the automated install fails, install manually:
```bash
pip install -r requirements.txt
pip install "git+https://github.com/facebookresearch/pytorch3d.git@stable"
cd envs && git clone https://github.com/NVlabs/curobo.git && cd curobo
pip install -e . --no-build-isolation
```
Then apply the required mplib fix: in `mplib/planner.py` line 807, remove `or collide` from the conditional.
</Tip>
### 4. Add RoboTwin to PYTHONPATH
The RoboTwin task modules must be importable by LeRobot. From within the `RoboTwin/` directory:
```bash
export PYTHONPATH="${PYTHONPATH}:$(pwd)"
```
Add this to your shell profile to make it permanent.
## Evaluation
### Standard evaluation (recommended)
Evaluate a policy on a single task with the official protocol (100 episodes):
```bash
lerobot-eval \
--policy.path="your-hf-policy-id" \
--env.type=robotwin \
--env.task=beat_block_hammer \
--eval.batch_size=1 \
--eval.n_episodes=100
```
### Single-task quick check
```bash
lerobot-eval \
--policy.path="your-hf-policy-id" \
--env.type=robotwin \
--env.task=beat_block_hammer \
--eval.batch_size=1 \
--eval.n_episodes=5
```
### Multi-task sweep
Evaluate on several tasks in one run:
```bash
lerobot-eval \
--policy.path="your-hf-policy-id" \
--env.type=robotwin \
--env.task=beat_block_hammer,click_bell,handover_block,stack_blocks_two \
--eval.batch_size=1 \
--eval.n_episodes=100
```
### Full benchmark (all 50 tasks)
```bash
lerobot-eval \
--policy.path="your-hf-policy-id" \
--env.type=robotwin \
--env.task=adjust_bottle,beat_block_hammer,blocks_ranking_rgb,blocks_ranking_size,click_alarmclock,click_bell,dump_bin_bigbin,grab_roller,handover_block,handover_mic,hanging_mug,lift_pot,move_can_pot,move_pillbottle_pad,move_playingcard_away,move_stapler_pad,open_microwave,pick_diverse_bottles,pick_dual_bottles,place_a2b_left,place_a2b_right,place_bread_basket,place_bread_skillet,place_burger_fries,place_can_basket,place_cans_plasticbox,place_container_plate,place_dual_shoes,place_empty_cup,place_fan,place_mouse_pad,place_object_basket,place_object_scale,place_object_stand,place_phone_stand,place_shoe,press_stapler,put_bottles_dustbin,put_object_cabinet,rotate_qrcode,scan_object,shake_bottle,shake_bottle_horizontally,stack_blocks_three,stack_blocks_two,stack_bowls_three,stack_bowls_two,stamp_seal,turn_switch \
--eval.batch_size=1 \
--eval.n_episodes=100
```
<Tip>
`open_laptop` is intentionally omitted above because of the upstream
`self.arm_tag` bug (see the **Available tasks** section). Re-add it once the
upstream fix lands.
</Tip>
## Camera configuration
By default, all three cameras are included:
| Camera key | Description |
| -------------- | ------------------------------ |
| `head_camera` | Torso-mounted overhead view |
| `left_camera` | Left arm wrist-mounted camera |
| `right_camera` | Right arm wrist-mounted camera |
To use a subset of cameras, override `--env.camera_names`:
```bash
lerobot-eval \
--policy.path="your-hf-policy-id" \
--env.type=robotwin \
--env.task=beat_block_hammer \
--env.camera_names="head_camera,left_camera" \
--eval.batch_size=1 \
--eval.n_episodes=10
```
## Environment config reference
Key parameters for `RoboTwinEnvConfig`:
| Parameter | Default | Description |
| -------------------- | ---------------------------------------- | ---------------------------------- |
| `task` | `"beat_block_hammer"` | Comma-separated task name(s) |
| `fps` | `25` | Simulation FPS |
| `episode_length` | `300` | Max steps per episode |
| `obs_type` | `"pixels_agent_pos"` | `"pixels"` or `"pixels_agent_pos"` |
| `camera_names` | `"head_camera,left_camera,right_camera"` | Comma-separated active cameras |
| `observation_height` | `240` | Camera pixel height |
| `observation_width` | `320` | Camera pixel width |
## Leaderboard submission
Results can be submitted to the [RoboTwin 2.0 leaderboard](https://robotwin-platform.github.io/leaderboard). The official protocol requires:
- Training on 50 `demo_clean` demonstrations per task
- Evaluating 100 episodes per task
- Reporting success rate separately for **Easy** (`demo_clean`) and **Hard** (`demo_randomized`) settings
For submission instructions, refer to the [RoboTwin 2.0 documentation](https://robotwin-platform.github.io/doc/).

View File

@@ -34,7 +34,7 @@ pip install -e ".[smolvla]"
### Using RTC with Pi0
You can use `lerobot-rollout --strategy.type=base --inference.type=rtc` for RTC deployment on real robots.
You can find a complete reference implementation in [eval_with_real_robot.py](examples/rtc/eval_with_real_robot.py).
The snippet below provides a simplified pseudo-example of how RTC operates with Pi0 in your pipeline:
```python
@@ -137,12 +137,8 @@ The script generates a visualization of the denoising process, comparing standar
## Testing RTC with a Real Robot
```bash
lerobot-rollout \
--strategy.type=base \
python examples/rtc/eval_with_real_robot.py \
--policy.path=${HF_USERNAME}/policy_repo_id \
--inference.type=rtc \
--inference.rtc.execution_horizon=10 \
--inference.rtc.max_guidance_weight=10.0 \
--robot.type=so100_follower \
--robot.port=/dev/tty.usbmodem58FA0834591 \
--robot.cameras="{ gripper: {type: opencv, index_or_path: 1, width: 640, height: 480, fps: 30}, front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
@@ -182,7 +178,7 @@ visualizer = RTCDebugVisualizer()
# ... create plots
```
See `examples/rtc/eval_dataset.py` for a complete example of offline RTC visualization.
See `examples/rtc/eval_dataset.py` for a complete example of visualization.
## References

View File

@@ -46,7 +46,7 @@ This ensures identical task states map to consistent progress values, even acros
## Inputs and Targets (What the new code expects)
SARM is trained through its processor (`src/lerobot/rewards/sarm/processor_sarm.py`), which:
SARM is trained through its processor (`src/lerobot/policies/sarm/processor_sarm.py`), which:
- **Encodes** images and task text with CLIP (ViT-B/32) into `video_features` and `text_features`
- **Pads/truncates** robot state into `state_features` (up to `max_state_dim`)
@@ -347,7 +347,7 @@ Use `compute_rabc_weights.py` with `--visualize-only` to visualize model predict
<hfoption id="single_stage">
```bash
python -m lerobot.rewards.sarm.compute_rabc_weights \
python src/lerobot/policies/sarm/compute_rabc_weights.py \
--dataset-repo-id your-username/your-dataset \
--reward-model-path your-username/sarm-model \
--visualize-only \
@@ -360,7 +360,7 @@ python -m lerobot.rewards.sarm.compute_rabc_weights \
<hfoption id="dense_only">
```bash
python -m lerobot.rewards.sarm.compute_rabc_weights \
python src/lerobot/policies/sarm/compute_rabc_weights.py \
--dataset-repo-id your-username/your-dataset \
--reward-model-path your-username/sarm-model \
--visualize-only \
@@ -373,7 +373,7 @@ python -m lerobot.rewards.sarm.compute_rabc_weights \
<hfoption id="dual">
```bash
python -m lerobot.rewards.sarm.compute_rabc_weights \
python src/lerobot/policies/sarm/compute_rabc_weights.py \
--dataset-repo-id your-username/your-dataset \
--reward-model-path your-username/sarm-model \
--visualize-only \
@@ -429,7 +429,7 @@ The weighting follows **Equations 8-9** from the paper:
First, run the SARM model on all frames in your dataset to compute progress values:
```bash
python -m lerobot.rewards.sarm.compute_rabc_weights \
python src/lerobot/policies/sarm/compute_rabc_weights.py \
--dataset-repo-id your-username/your-dataset \
--reward-model-path your-username/sarm-model \
--head-mode sparse \
@@ -465,15 +465,15 @@ This script:
### Step 5b: Train Policy with RA-BC
Once you have the progress file, train your policy with RA-BC weighting. The progress file is auto-detected from the dataset path (`sarm_progress.parquet`) if not explicitly provided. Currently PI0, PI0.5 and SmolVLA are supported with RA-BC:
Once you have the progress file, train your policy with RA-BC weighting. The progress file is auto-detected from the dataset path (`sarm_progress.parquet`). Currently PI0, PI0.5 and SmolVLA are supported with RA-BC:
```bash
lerobot-train \
--dataset.repo_id=your-username/your-dataset \
--policy.type=pi0 \
--sample_weighting.type=rabc \
--sample_weighting.head_mode=sparse \
--sample_weighting.kappa=0.01 \
--use_rabc=true \
--rabc_head_mode=sparse \
--rabc_kappa=0.01 \
--output_dir=outputs/train/policy_rabc \
--batch_size=32 \
--steps=40000
@@ -488,13 +488,12 @@ The training script automatically:
**RA-BC Arguments:**
| Argument | Description | Default |
| ---------------------------------- | ------------------------------------------------------ | ----------------------- |
| `--sample_weighting.type` | Weighting strategy type (`rabc` or `uniform`) | `rabc` |
| `--sample_weighting.progress_path` | Path to progress parquet file | `sarm_progress.parquet` |
| `--sample_weighting.head_mode` | Which SARM head's progress to use: `sparse` or `dense` | `sparse` |
| `--sample_weighting.kappa` | Threshold κ for high-quality samples | `0.01` |
| `--sample_weighting.epsilon` | Small constant for numerical stability | `1e-6` |
| Argument | Description | Default |
| ---------------------- | ---------------------------------------------------------- | ---------------------------------- |
| `--use_rabc` | Enable RA-BC sample weighting | `false` |
| `--rabc_progress_path` | Path to progress parquet file (auto-detected from dataset) | `sarm_progress.parquet` in dataset |
| `--rabc_head_mode` | Which SARM head's progress to use: `sparse` or `dense` | `sparse` |
| `--rabc_kappa` | Threshold κ for high-quality samples | `0.01` |
### Tuning RA-BC Kappa
@@ -512,30 +511,30 @@ The `kappa` parameter is the threshold that determines which samples get full we
Monitor these WandB metrics during training:
| Metric | Healthy Range | Problem Indicator |
| ----------------------------- | ------------- | ------------------------- |
| `sample_weight_mean_weight` | 0.3 - 0.8 | ≈ 1.0 means kappa too low |
| `sample_weighting/delta_mean` | > 0 | Should be positive |
| `sample_weighting/delta_std` | > 0 | Variance in data quality |
| Metric | Healthy Range | Problem Indicator |
| ------------------ | ------------- | ------------------------- |
| `rabc_mean_weight` | 0.3 - 0.8 | ≈ 1.0 means kappa too low |
| `rabc_delta_mean` | > 0 | Should be positive |
| `rabc_delta_std` | > 0 | Variance in data quality |
**If `sample_weight_mean_weight ≈ 1.0`:** Your kappa is too low. Most samples have `delta > kappa` and bypass the soft-weighting entirely. RA-BC becomes equivalent to vanilla BC.
**If `rabc_mean_weight ≈ 1.0`:** Your kappa is too low. Most samples have `delta > kappa` and bypass the soft-weighting entirely. RA-BC becomes equivalent to vanilla BC.
**Setting kappa based on your data:**
The default `kappa=0.01` was tuned for the paper's T-shirt folding task (~90s episodes at 30fps). For your dataset, check the logged `sample_weighting/delta_mean` and `sample_weighting/delta_std`:
The default `kappa=0.01` was tuned for the paper's T-shirt folding task (~90s episodes at 30fps). For your dataset, check the logged `rabc_delta_mean` and `rabc_delta_std`:
```
# If delta_mean ≈ 0.03 and delta_std ≈ 0.02:
# Most deltas fall in range [0.01, 0.05]
# Option 1: Set kappa = delta_mean (medium selectivity)
--sample_weighting.kappa=0.03
--rabc_kappa=0.03
# Option 2: Set kappa = delta_mean + delta_std (high selectivity)
--sample_weighting.kappa=0.05
--rabc_kappa=0.05
# Option 3: Set kappa = delta_mean + 2*delta_std (very selective)
--sample_weighting.kappa=0.07
--rabc_kappa=0.07
```
**When RA-BC may not help:**
@@ -551,8 +550,8 @@ accelerate launch \
src/lerobot/scripts/lerobot_train.py \
--dataset.repo_id=your-username/your-dataset \
--policy.type=pi0 \
--sample_weighting.type=rabc \
--sample_weighting.kappa=0.01 \
--use_rabc=true \
--rabc_kappa=0.01 \
--output_dir=outputs/train/policy_rabc \
--batch_size=32 \
--steps=40000
@@ -577,7 +576,7 @@ accelerate launch \
### RA-BC
1. **Train SARM first**: RA-BC quality depends entirely on SARM quality
2. **Monitor `sample_weight_mean_weight`**: If it's ≈ 1.0, increase kappa (see [Tuning RA-BC Kappa](#tuning-ra-bc-kappa))
2. **Monitor `rabc_mean_weight`**: If it's ≈ 1.0, increase kappa (see [Tuning RA-BC Kappa](#tuning-ra-bc-kappa))
---

View File

@@ -1,210 +0,0 @@
# Tools
LeRobot v3.1 supports **tool calls** in policies — assistant messages can
emit structured invocations like `say(text="OK, starting now")` that the
runtime dispatches to a real implementation (TTS, controller, logger, …).
This page covers:
1. Where the tool catalog lives.
2. How the annotation pipeline produces tool-call atoms.
3. How to add your own tool.
## Where tools are declared
Two layers.
**The catalog** — a list of OpenAI-style function schemas — lives at
`meta/info.json["tools"]` on each dataset. Example:
```json
{
"features": { "...": "..." },
"tools": [
{
"type": "function",
"function": {
"name": "say",
"description": "Speak a short utterance to the user via the TTS executor.",
"parameters": {
"type": "object",
"properties": {
"text": {
"type": "string",
"description": "The verbatim text to speak."
}
},
"required": ["text"]
}
}
}
]
}
```
Read it via the dataset metadata accessor:
```python
from lerobot.datasets.dataset_metadata import LeRobotDatasetMetadata
meta = LeRobotDatasetMetadata(repo_id="pepijn/super_poulain_final_annotations")
tools = meta.tools # list[dict] — OpenAI tool schemas
```
If the dataset's `info.json` doesn't declare any tools, `meta.tools`
returns `DEFAULT_TOOLS` from `lerobot.datasets.language` — currently a
single-entry list with the canonical `say` schema. So unannotated
datasets and chat-template consumers keep working without any
configuration:
```python
prompt_str = tokenizer.apply_chat_template(
sample["messages"],
tools=meta.tools, # works either way
add_generation_prompt=False,
tokenize=False,
)
```
**The implementations** — runnable Python — will live under
`src/lerobot/tools/`, one file per tool. The runtime dispatcher and
the canonical `say` implementation (wrapping Kyutai's pocket-tts) are
not part of the catalog layer described here; today this layer ships
only the schema storage and the `DEFAULT_TOOLS` fallback constant.
## Per-row tool _invocations_
The catalog above describes _what can be called_. The actual _call_ — the
function name plus the argument values — is stored per-row, on the
assistant atoms in `language_events`:
```python
{
"role": "assistant",
"content": null,
"style": null,
"timestamp": 12.4,
"camera": null,
"tool_calls": [
{ "type": "function",
"function": { "name": "say", "arguments": { "text": "On it." } } }
]
}
```
Recipes splice these into rendered messages via `tool_calls_from`:
```yaml
user_interjection_response:
bindings:
speech: "emitted_at(t, role=assistant, tool_name=say)"
messages:
- { role: user, content: "${task}", stream: high_level }
- {
role: assistant,
content: "${current_plan}",
stream: high_level,
target: true,
tool_calls_from: speech,
}
```
The model's training target is one assistant turn that carries both the
plan text _and_ the `say` tool call. At inference, the runtime parses
the generated text back into structured `tool_calls` and dispatches to
the matching implementation.
## How to add your own tool
> **Note:** Steps 2 and 3 below describe the runtime layer
> (`src/lerobot/tools/`, the `Tool` protocol, `TOOL_REGISTRY`,
> `get_tools(meta)`) which is not part of the catalog layer shipped
> today — those modules don't yet exist in the tree. Step 1 alone is
> enough to make the tool visible to the chat template via
> `meta.tools` so the model can learn to _generate_ the call;
> executing the call at inference requires the runtime layer.
Three steps. Concrete example: a `record_observation` tool the policy
can call to capture an extra observation outside the regular control
loop.
### Step 1 — declare the schema
Add an entry under `meta/info.json["tools"]`. Either edit the file
directly on disk _before_ running the annotation pipeline (it'll be
preserved) or hand it to `lerobot-annotate` via a config flag.
```json
{
"tools": [
{ "type": "function", "function": { "name": "say", "...": "..." } },
{
"type": "function",
"function": {
"name": "record_observation",
"description": "Capture a high-resolution still image for the user.",
"parameters": {
"type": "object",
"properties": {
"label": {
"type": "string",
"description": "Short label for the saved image."
}
},
"required": ["label"]
}
}
}
]
}
```
The schema follows OpenAI's function-calling convention exactly, so the
chat template can render it natively.
### Step 2 — implement the call
Create `src/lerobot/tools/record_observation.py`:
```python
from .base import Tool
from typing import Any
RECORD_OBSERVATION_SCHEMA: dict[str, Any] = { "...": "..." } # mirrors the JSON above
class RecordObservationTool:
name = "record_observation"
schema = RECORD_OBSERVATION_SCHEMA
def __init__(self, schema: dict | None = None, output_dir: str = "."):
self.output_dir = output_dir
def call(self, arguments: dict) -> str:
label = arguments["label"]
# ... save the latest camera frame to <output_dir>/<label>.png ...
return f"saved {label}.png"
```
One file per tool keeps dependencies isolated — `record_observation`
might pull `pillow`, while `say` pulls `pocket-tts`. Users installing
only the tools they need avoid heavy transitive deps.
### Step 3 — register it
Add to `src/lerobot/tools/registry.py`:
```python
from .record_observation import RecordObservationTool
TOOL_REGISTRY["record_observation"] = RecordObservationTool
```
That's it. At runtime `get_tools(meta)` looks up each schema in
`meta.tools`, instantiates the matching registered class, and returns
a name → instance dict the dispatcher can route into.
If you want to use a tool _without_ writing an implementation (e.g. for
training-time chat-template formatting only), step 1 alone is enough —
the model still learns to _generate_ the call. Steps 2 and 3 are only
needed to actually _execute_ it at inference.

View File

@@ -274,8 +274,7 @@ python src/lerobot/scripts/lerobot_train.py \
Once trained, we recommend deploying policies using inference-time RTC:
```bash
lerobot-rollout \
--strategy.type=base \
python examples/rtc/eval_with_real_robot.py \
--policy.path=your-username/your-repo-id \
--policy.device=cuda \
--robot.type=unitree_g1 \
@@ -285,7 +284,7 @@ lerobot-rollout \
--task="task_description" \
--duration=1000 \
--fps=30 \
--inference.type=rtc
--rtc.enabled=true
```
---

View File

@@ -1,176 +0,0 @@
# VLABench
[VLABench](https://github.com/OpenMOSS/VLABench) is a large-scale benchmark for **language-conditioned robotic manipulation with long-horizon reasoning**. The upstream suite covers 100 task categories across 2,000+ objects and evaluates six dimensions of robot intelligence: mesh & texture understanding, spatial reasoning, world-knowledge transfer, semantic instruction comprehension, physical-law understanding, and long-horizon planning. Built on MuJoCo / dm_control with a Franka Panda 7-DOF arm. LeRobot exposes **43 of these tasks** through `--env.task` (21 primitives + 22 composites, see [Available tasks](#available-tasks) below).
- Paper: [VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning](https://arxiv.org/abs/2412.18194)
- GitHub: [OpenMOSS/VLABench](https://github.com/OpenMOSS/VLABench)
- Project website: [vlabench.github.io](https://vlabench.github.io)
- Pretrained policy: [`lerobot/smolvla_vlabench`](https://huggingface.co/lerobot/smolvla_vlabench)
<img
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/vlabench.png"
alt="VLABench benchmark overview"
width="85%"
/>
## Available tasks
VLABench ships two task suites covering **43 task categories** in LeRobot's `--env.task` surface:
| Suite | CLI name | Tasks | Description |
| --------- | ----------- | ----- | ---------------------------------------------------------------- |
| Primitive | `primitive` | 21 | Single / few-skill combinations (select, insert, physics QA) |
| Composite | `composite` | 22 | Multi-step reasoning and long-horizon planning (cook, rearrange) |
**Primitive tasks:** `select_fruit`, `select_toy`, `select_chemistry_tube`, `add_condiment`, `select_book`, `select_painting`, `select_drink`, `insert_flower`, `select_billiards`, `select_ingredient`, `select_mahjong`, `select_poker`, and physical-reasoning tasks (`density_qa`, `friction_qa`, `magnetism_qa`, `reflection_qa`, `simple_cuestick_usage`, `simple_seesaw_usage`, `sound_speed_qa`, `thermal_expansion_qa`, `weight_qa`).
**Composite tasks:** `cluster_billiards`, `cluster_book`, `cluster_drink`, `cluster_toy`, `cook_dishes`, `cool_drink`, `find_unseen_object`, `get_coffee`, `hammer_nail`, `heat_food`, `make_juice`, `play_mahjong`, `play_math_game`, `play_poker`, `play_snooker`, `rearrange_book`, `rearrange_chemistry_tube`, `set_dining_table`, `set_study_table`, `store_food`, `take_chemistry_experiment`, `use_seesaw_complex`.
`--env.task` accepts three forms:
- a single task name (`select_fruit`)
- a comma-separated list (`select_fruit,heat_food`)
- a suite shortcut (`primitive`, `composite`, or `primitive,composite`)
## Installation
VLABench is **not on PyPI** — its only distribution is the [OpenMOSS/VLABench](https://github.com/OpenMOSS/VLABench) GitHub repo — so LeRobot does not expose a `vlabench` extra. Install it manually as an editable clone, alongside the MuJoCo / dm_control pins VLABench needs, then fetch the mesh assets:
```bash
# After following the standard LeRobot installation instructions.
git clone https://github.com/OpenMOSS/VLABench.git ~/VLABench
git clone https://github.com/motion-planning/rrt-algorithms.git ~/rrt-algorithms
pip install -e ~/VLABench -e ~/rrt-algorithms
pip install "mujoco==3.2.2" "dm-control==1.0.22" \
open3d colorlog scikit-learn openai gdown
python ~/VLABench/scripts/download_assets.py
```
<Tip>
VLABench requires Linux (`sys_platform == 'linux'`) and Python 3.10+. Set the MuJoCo rendering backend before running:
```bash
export MUJOCO_GL=egl # for headless servers (HPC, cloud)
```
</Tip>
## Evaluation
All eval snippets below mirror the command CI runs (see `.github/workflows/benchmark_tests.yml`). The `--rename_map` argument maps VLABench's `image` / `second_image` / `wrist_image` camera keys onto the three-camera (`camera1` / `camera2` / `camera3`) input layout the released `smolvla_vlabench` policy was trained on.
### Single-task evaluation (recommended for quick iteration)
```bash
lerobot-eval \
--policy.path=lerobot/smolvla_vlabench \
--env.type=vlabench \
--env.task=select_fruit \
--eval.batch_size=1 \
--eval.n_episodes=10 \
--eval.use_async_envs=false \
--policy.device=cuda \
'--rename_map={"observation.images.image": "observation.images.camera1", "observation.images.second_image": "observation.images.camera2", "observation.images.wrist_image": "observation.images.camera3"}'
```
### Multi-task evaluation
Pass a comma-separated list of tasks:
```bash
lerobot-eval \
--policy.path=lerobot/smolvla_vlabench \
--env.type=vlabench \
--env.task=select_fruit,select_toy,add_condiment,heat_food \
--eval.batch_size=1 \
--eval.n_episodes=10 \
--eval.use_async_envs=false \
--policy.device=cuda \
'--rename_map={"observation.images.image": "observation.images.camera1", "observation.images.second_image": "observation.images.camera2", "observation.images.wrist_image": "observation.images.camera3"}'
```
### Suite-wide evaluation
Run an entire suite (all 21 primitives or all 22 composites):
```bash
lerobot-eval \
--policy.path=lerobot/smolvla_vlabench \
--env.type=vlabench \
--env.task=primitive \
--eval.batch_size=1 \
--eval.n_episodes=10 \
--eval.use_async_envs=false \
--policy.device=cuda \
--env.max_parallel_tasks=1 \
'--rename_map={"observation.images.image": "observation.images.camera1", "observation.images.second_image": "observation.images.camera2", "observation.images.wrist_image": "observation.images.camera3"}'
```
Or both suites:
```bash
lerobot-eval \
--policy.path=lerobot/smolvla_vlabench \
--env.type=vlabench \
--env.task=primitive,composite \
--eval.batch_size=1 \
--eval.n_episodes=10 \
--eval.use_async_envs=false \
--policy.device=cuda \
--env.max_parallel_tasks=1 \
'--rename_map={"observation.images.image": "observation.images.camera1", "observation.images.second_image": "observation.images.camera2", "observation.images.wrist_image": "observation.images.camera3"}'
```
### Recommended evaluation episodes
**10 episodes per task** for reproducible benchmarking (210 total for the full primitive suite, 220 for composite). Matches the protocol in the VLABench paper.
## Policy inputs and outputs
**Observations:**
- `observation.state` — 7-dim end-effector state (position xyz + Euler xyz + gripper)
- `observation.images.image` — front camera, 480×480 HWC uint8
- `observation.images.second_image` — second camera, 480×480 HWC uint8
- `observation.images.wrist_image` — wrist camera, 480×480 HWC uint8
**Actions:**
- Continuous control in `Box(-1, 1, shape=(7,))` — 3D position + 3D Euler orientation + 1D gripper.
## Training
### Datasets
Pre-collected VLABench datasets in LeRobot format on the Hub:
- [`VLABench/vlabench_primitive_ft_lerobot_video`](https://huggingface.co/datasets/VLABench/vlabench_primitive_ft_lerobot_video) — 5,000 episodes, 128 tasks, 480×480 images.
- [`VLABench/vlabench_composite_ft_lerobot_video`](https://huggingface.co/datasets/VLABench/vlabench_composite_ft_lerobot_video) — 5,977 episodes, 167 tasks, 224×224 images.
### Example training command
Fine-tune a SmolVLA base on the primitive suite:
```bash
lerobot-train \
--policy.type=smolvla \
--policy.repo_id=${HF_USER}/smolvla_vlabench_primitive \
--policy.load_vlm_weights=true \
--policy.push_to_hub=true \
--dataset.repo_id=VLABench/vlabench_primitive_ft_lerobot_video \
--env.type=vlabench \
--env.task=select_fruit \
--output_dir=./outputs/smolvla_vlabench_primitive \
--steps=100000 \
--batch_size=4 \
--eval_freq=5000 \
--eval.batch_size=1 \
--eval.n_episodes=1 \
--save_freq=10000
```
## Reproducing published results
The released checkpoint [`lerobot/smolvla_vlabench`](https://huggingface.co/lerobot/smolvla_vlabench) was trained on the primitive-suite dataset above and is evaluated with the [Single-task](#single-task-evaluation-recommended-for-quick-iteration) / [Suite-wide](#suite-wide-evaluation) commands. CI runs a 10-primitive-task smoke eval (one episode each) on every PR touching the benchmark.

View File

@@ -220,7 +220,7 @@ REAL_DIM = 12
# Postprocessing: Trim 20D predictions to 12D for deployment
```
See the [action_hub.py](https://github.com/huggingface/lerobot/blob/main/src/lerobot/policies/xvla/action_hub.py) implementation for details.
See the [action_hub.py](/home/jade_choghari/robot/lerobot/src/lerobot/policies/xvla/action_hub.py) implementation for details.
#### Auto Action Mode (Recommended)
@@ -519,9 +519,9 @@ If you use X-VLA in your research, please cite:
- [X-VLA Paper](https://arxiv.org/pdf/2510.10274)
- [LeRobot Documentation](https://github.com/huggingface/lerobot)
- [Action Registry Implementation](https://github.com/huggingface/lerobot/blob/main/src/lerobot/policies/xvla/action_hub.py)
- [Processor Implementation](https://github.com/huggingface/lerobot/blob/main/src/lerobot/policies/xvla/processor_xvla.py)
- [Model Configuration](https://github.com/huggingface/lerobot/blob/main/src/lerobot/policies/xvla/configuration_xvla.py)
- [Action Registry Implementation](https://github.com/huggingface/lerobot/src/lerobot/policies/xvla/action_hub.py)
- [Processor Implementation](https://github.com/huggingface/lerobot/src/lerobot/policies/xvla/processor_xvla.py)
- [Model Configuration](https://github.com/huggingface/lerobot/src/lerobot/policies/xvla/configuration_xvla.py)
## Contributing

View File

@@ -1,84 +0,0 @@
#!/usr/bin/env python
"""Launch ``lerobot-annotate`` on a Hugging Face job (vllm + Qwen3.6 MoE).
Spawns one ``h200x2`` job that:
1. installs this branch of ``lerobot`` plus the annotation extras,
2. boots two vllm servers (one per GPU) with Qwen3.6-35B-A3B-FP8,
3. discovers the dataset's canonical subtask + memory vocabulary
from the first 3 sample episodes (phase 0),
4. runs the plan / interjections / vqa modules across the dataset
(subtasks + memory are constrained to the canonical vocabulary),
5. uploads the annotated dataset to ``--dest_repo_id`` (when set)
or back to ``--repo_id``.
Usage:
HF_TOKEN=hf_... uv run python examples/annotations/run_hf_job.py
Adjust ``CMD`` below to point at your own dataset / target hub repo.
"""
import os
from huggingface_hub import get_token, run_job
token = os.environ.get("HF_TOKEN") or get_token()
if not token:
raise RuntimeError("No HF token. Run `huggingface-cli login` or `export HF_TOKEN=hf_...`")
CMD = (
"apt-get update -qq && apt-get install -y -qq git ffmpeg && "
"pip install --no-deps "
"'lerobot @ git+https://github.com/huggingface/lerobot.git@feat/language-annotation-pipeline' && "
"pip install --upgrade-strategy only-if-needed "
"datasets pyarrow av jsonlines draccus gymnasium torchcodec mergedeep pyyaml-include toml typing-inspect "
"openai && "
"export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 && "
"export VLLM_VIDEO_BACKEND=pyav && "
"lerobot-annotate "
"--repo_id=imstevenpmwork/super_poulain_draft "
"--dest_repo_id=pepijn223/super_poulain_vocab "
"--push_to_hub=true "
"--vlm.backend=openai "
"--vlm.model_id=Qwen/Qwen3.6-35B-A3B-FP8 "
"--vlm.parallel_servers=2 "
"--vlm.num_gpus=2 "
'--vlm.serve_command="vllm serve Qwen/Qwen3.6-35B-A3B-FP8 '
"--tensor-parallel-size 1 --max-model-len 32768 "
'--gpu-memory-utilization 0.8 --uvicorn-log-level warning --port {port}" '
"--vlm.serve_ready_timeout_s=1800 "
"--vlm.client_concurrency=128 "
"--vlm.max_new_tokens=512 "
"--vlm.temperature=0.7 "
"--executor.episode_parallelism=16 "
"--vlm.chat_template_kwargs='{\"enable_thinking\": false}' "
"--vlm.camera_key=observation.images.wrist "
# Phase 0 — canonical vocabulary discovery from the first N sample
# episodes. The VLM picks the right number of subtask + memory
# entries itself from what it sees; the resulting
# meta/canonical_vocabulary.json constrains every subtask + memory
# string to a small repeatable target distribution.
"--vocabulary.sample_episodes=3 "
# Phase 1 — plan module (subtasks + plan + memory + task_aug).
"--plan.frames_per_second=1.0 "
"--plan.use_video_url=true "
"--plan.use_video_url_fps=1.0 "
"--plan.derive_task_from_video=always "
"--plan.n_task_rephrasings=30 "
# Phase 2 — interjections + speech.
"--interjections.max_interjections_per_episode=6 "
# Phase 4 — general VQA.
"--vqa.K=3 "
"--vqa.vqa_emission_hz=1.0"
)
job = run_job(
image="vllm/vllm-openai:latest",
command=["bash", "-c", CMD],
flavor="h200x2",
secrets={"HF_TOKEN": token},
timeout="2h",
)
print(f"Job URL: {job.url}")
print(f"Job ID: {job.id}")

View File

@@ -69,7 +69,7 @@ class ComputeProgressShards(PipelineStep):
import torch
from tqdm import tqdm
from lerobot.rewards.sarm.compute_rabc_weights import (
from lerobot.policies.sarm.compute_rabc_weights import (
generate_all_frame_indices,
interpolate_progress,
load_sarm_resources,

File diff suppressed because it is too large Load Diff

226
examples/hil/hil_utils.py Normal file
View File

@@ -0,0 +1,226 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Shared utilities for Human-in-the-Loop data collection scripts."""
import logging
import time
from dataclasses import dataclass, field
from pathlib import Path
from lerobot.common.control_utils import is_headless
from lerobot.processor import (
IdentityProcessorStep,
RobotAction,
RobotObservation,
RobotProcessorPipeline,
observation_to_transition,
robot_action_observation_to_transition,
transition_to_observation,
transition_to_robot_action,
)
from lerobot.robots import Robot
from lerobot.teleoperators import Teleoperator
from lerobot.utils.robot_utils import precise_sleep
logger = logging.getLogger(__name__)
@dataclass
class HILDatasetConfig:
repo_id: str
single_task: str
root: str | Path | None = None
fps: int = 30
episode_time_s: float = 120
num_episodes: int = 50
video: bool = True
push_to_hub: bool = True
private: bool = False
tags: list[str] | None = None
num_image_writer_processes: int = 0
num_image_writer_threads_per_camera: int = 4
video_encoding_batch_size: int = 1
vcodec: str = "auto"
streaming_encoding: bool = True
encoder_queue_maxsize: int = 30
encoder_threads: int | None = None
rename_map: dict[str, str] = field(default_factory=dict)
def teleop_has_motor_control(teleop: Teleoperator) -> bool:
"""Check if teleoperator has motor control capabilities."""
return all(hasattr(teleop, attr) for attr in ("enable_torque", "disable_torque", "write_goal_positions"))
def teleop_disable_torque(teleop: Teleoperator) -> None:
"""Disable teleop torque if supported."""
if hasattr(teleop, "disable_torque"):
teleop.disable_torque()
def teleop_enable_torque(teleop: Teleoperator) -> None:
"""Enable teleop torque if supported."""
if hasattr(teleop, "enable_torque"):
teleop.enable_torque()
def teleop_smooth_move_to(teleop: Teleoperator, target_pos: dict, duration_s: float = 2.0, fps: int = 50):
"""Smoothly move teleop to target position if motor control is available."""
if not teleop_has_motor_control(teleop):
logger.warning("Teleop does not support motor control - cannot mirror robot position")
return
teleop_enable_torque(teleop)
current = teleop.get_action()
steps = max(int(duration_s * fps), 1)
for step in range(steps + 1):
t = step / steps
interp = {}
for k in current:
if k in target_pos:
interp[k] = current[k] * (1 - t) + target_pos[k] * t
else:
interp[k] = current[k]
teleop.write_goal_positions(interp)
time.sleep(1 / fps)
def init_keyboard_listener():
"""Initialize keyboard listener with HIL controls."""
events = {
"exit_early": False,
"rerecord_episode": False,
"stop_recording": False,
"policy_paused": False,
"correction_active": False,
"resume_policy": False,
"in_reset": False,
"start_next_episode": False,
}
if is_headless():
logger.warning("Headless environment - keyboard controls unavailable")
return None, events
from pynput import keyboard
def on_press(key):
try:
if events["in_reset"]:
if key in [keyboard.Key.space, keyboard.Key.right]:
logger.info("[HIL] Starting next episode...")
events["start_next_episode"] = True
elif hasattr(key, "char") and key.char == "c":
events["start_next_episode"] = True
elif key == keyboard.Key.esc:
logger.info("[HIL] ESC - Stop recording, pushing to hub...")
events["stop_recording"] = True
events["start_next_episode"] = True
else:
if key == keyboard.Key.space:
if not events["policy_paused"] and not events["correction_active"]:
logger.info("[HIL] PAUSED - Press 'c' to take control or 'p' to resume policy")
events["policy_paused"] = True
elif hasattr(key, "char") and key.char == "c":
if events["policy_paused"] and not events["correction_active"]:
logger.info("[HIL] Taking control...")
events["start_next_episode"] = True
elif hasattr(key, "char") and key.char == "p":
if events["policy_paused"] or events["correction_active"]:
logger.info("[HIL] Resuming policy...")
events["resume_policy"] = True
elif key == keyboard.Key.right:
logger.info("[HIL] End episode")
events["exit_early"] = True
elif key == keyboard.Key.left:
logger.info("[HIL] Re-record episode")
events["rerecord_episode"] = True
events["exit_early"] = True
elif key == keyboard.Key.esc:
logger.info("[HIL] ESC - Stop recording...")
events["stop_recording"] = True
events["exit_early"] = True
except Exception as e:
logger.info(f"Key error: {e}")
listener = keyboard.Listener(on_press=on_press)
listener.start()
return listener, events
def make_identity_processors():
"""Create identity processors for recording."""
teleop_proc = RobotProcessorPipeline[tuple[RobotAction, RobotObservation], RobotAction](
steps=[IdentityProcessorStep()],
to_transition=robot_action_observation_to_transition,
to_output=transition_to_robot_action,
)
obs_proc = RobotProcessorPipeline[RobotObservation, RobotObservation](
steps=[IdentityProcessorStep()],
to_transition=observation_to_transition,
to_output=transition_to_observation,
)
return teleop_proc, obs_proc
def reset_loop(robot: Robot, teleop: Teleoperator, events: dict, fps: int):
"""Reset period where human repositions environment."""
logger.info("[HIL] RESET")
events["in_reset"] = True
events["start_next_episode"] = False
obs = robot.get_observation()
robot_pos = {k: v for k, v in obs.items() if k.endswith(".pos") and k in robot.observation_features}
teleop_smooth_move_to(teleop, robot_pos, duration_s=2.0, fps=50)
logger.info("Press any key to enable teleoperation")
while not events["start_next_episode"] and not events["stop_recording"]:
precise_sleep(0.05)
if events["stop_recording"]:
return
events["start_next_episode"] = False
teleop_disable_torque(teleop)
logger.info("Teleop enabled - press any key to start episode")
while not events["start_next_episode"] and not events["stop_recording"]:
loop_start = time.perf_counter()
action = teleop.get_action()
robot.send_action(action)
precise_sleep(1 / fps - (time.perf_counter() - loop_start))
events["in_reset"] = False
events["start_next_episode"] = False
events["exit_early"] = False
events["policy_paused"] = False
events["correction_active"] = False
events["resume_policy"] = False
def print_controls(rtc: bool = False):
"""Print control instructions."""
mode = "Human-in-the-Loop Data Collection" + (" (RTC)" if rtc else "")
logger.info(
"%s\n Controls:\n"
" SPACE - Pause policy\n"
" c - Take control\n"
" p - Resume policy after pause/correction\n"
" → - End episode\n"
" ESC - Stop and push to hub",
mode,
)

View File

@@ -14,21 +14,17 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import time
from lerobot.common.control_utils import init_keyboard_listener, predict_action
from lerobot.common.control_utils import init_keyboard_listener
from lerobot.datasets import LeRobotDataset
from lerobot.policies import make_pre_post_processors
from lerobot.policies.act import ACTPolicy
from lerobot.policies.utils import make_robot_action
from lerobot.processor import make_default_processors
from lerobot.robots.lekiwi import LeKiwiClient, LeKiwiClientConfig
from lerobot.scripts.lerobot_record import record_loop
from lerobot.utils.constants import ACTION, OBS_STR
from lerobot.utils.feature_utils import build_dataset_frame, hw_to_dataset_features
from lerobot.utils.robot_utils import precise_sleep
from lerobot.utils.feature_utils import hw_to_dataset_features
from lerobot.utils.utils import log_say
from lerobot.utils.visualization_utils import init_rerun, log_rerun_data
from lerobot.utils.visualization_utils import init_rerun
NUM_EPISODES = 2
FPS = 30
@@ -39,9 +35,6 @@ HF_DATASET_ID = "<hf_username>/<eval_dataset_repo_id>"
def main():
# NOTE: For production policy deployment, use `lerobot-rollout` CLI instead.
# This script provides a self-contained example for educational purposes.
# Create the robot configuration & robot
robot_config = LeKiwiClientConfig(remote_ip="172.18.134.136", id="lekiwi")
@@ -90,67 +83,43 @@ def main():
raise ValueError("Robot is not connected!")
print("Starting evaluate loop...")
control_interval = 1 / FPS
recorded_episodes = 0
while recorded_episodes < NUM_EPISODES and not events["stop_recording"]:
log_say(f"Running inference, recording eval episode {recorded_episodes} of {NUM_EPISODES}")
# Inline evaluation loop: predict actions and send to robot
timestamp = 0
start_episode_t = time.perf_counter()
while timestamp < EPISODE_TIME_SEC:
start_loop_t = time.perf_counter()
if events["exit_early"]:
events["exit_early"] = False
break
# Get robot observation
obs = robot.get_observation()
obs_processed = robot_observation_processor(obs)
observation_frame = build_dataset_frame(dataset.features, obs_processed, prefix=OBS_STR)
# Predict action using the policy
action_tensor = predict_action(
observation=observation_frame,
policy=policy,
device=policy.config.device,
preprocessor=preprocessor,
postprocessor=postprocessor,
use_amp=policy.config.device.type == "cuda",
task=TASK_DESCRIPTION,
robot_type=robot.name,
)
# Convert policy output to robot action dict
action_values = make_robot_action(action_tensor, dataset.features)
# Process and send action to robot
robot_action_to_send = robot_action_processor((action_values, obs))
robot.send_action(robot_action_to_send)
# Write to dataset
action_frame = build_dataset_frame(dataset.features, action_values, prefix=ACTION)
frame = {**observation_frame, **action_frame, "task": TASK_DESCRIPTION}
dataset.add_frame(frame)
log_rerun_data(observation=obs_processed, action=action_values)
dt_s = time.perf_counter() - start_loop_t
sleep_time_s = control_interval - dt_s
if sleep_time_s < 0:
logging.warning(
f"Evaluate loop is running slower ({1 / dt_s:.1f} Hz) than the target FPS ({FPS} Hz)."
)
precise_sleep(max(sleep_time_s, 0.0))
timestamp = time.perf_counter() - start_episode_t
# Main record loop
record_loop(
robot=robot,
events=events,
fps=FPS,
policy=policy,
preprocessor=preprocessor, # Pass the pre and post policy processors
postprocessor=postprocessor,
dataset=dataset,
control_time_s=EPISODE_TIME_SEC,
single_task=TASK_DESCRIPTION,
display_data=True,
teleop_action_processor=teleop_action_processor,
robot_action_processor=robot_action_processor,
robot_observation_processor=robot_observation_processor,
)
# Reset the environment if not stopping or re-recording
if not events["stop_recording"] and (
(recorded_episodes < NUM_EPISODES - 1) or events["rerecord_episode"]
):
log_say("Reset the environment")
log_say("Waiting for environment reset, press right arrow key when ready...")
record_loop(
robot=robot,
events=events,
fps=FPS,
control_time_s=EPISODE_TIME_SEC,
single_task=TASK_DESCRIPTION,
display_data=True,
teleop_action_processor=teleop_action_processor,
robot_action_processor=robot_action_processor,
robot_observation_processor=robot_observation_processor,
)
if events["rerecord_episode"]:
log_say("Re-record episode")

View File

@@ -45,6 +45,9 @@ def main():
leader_arm = SO100Leader(leader_arm_config)
keyboard = KeyboardTeleop(keyboard_config)
# TODO(Steven): Update this example to use pipelines
teleop_action_processor, robot_action_processor, robot_observation_processor = make_default_processors()
# Configure the dataset features
action_features = hw_to_dataset_features(robot.action_features, ACTION)
obs_features = hw_to_dataset_features(robot.observation_features, OBS_STR)
@@ -74,10 +77,6 @@ def main():
if not robot.is_connected or not leader_arm.is_connected or not keyboard.is_connected:
raise ValueError("Robot or teleop is not connected!")
teleop_action_processor, robot_action_processor, robot_observation_processor = (
make_default_processors()
)
print("Starting record loop...")
recorded_episodes = 0
while recorded_episodes < NUM_EPISODES and not events["stop_recording"]:
@@ -88,14 +87,14 @@ def main():
robot=robot,
events=events,
fps=FPS,
teleop_action_processor=teleop_action_processor,
robot_action_processor=robot_action_processor,
robot_observation_processor=robot_observation_processor,
dataset=dataset,
teleop=[leader_arm, keyboard],
control_time_s=EPISODE_TIME_SEC,
single_task=TASK_DESCRIPTION,
display_data=True,
teleop_action_processor=teleop_action_processor,
robot_action_processor=robot_action_processor,
robot_observation_processor=robot_observation_processor,
)
# Reset the environment if not stopping or re-recording
@@ -107,13 +106,13 @@ def main():
robot=robot,
events=events,
fps=FPS,
teleop_action_processor=teleop_action_processor,
robot_action_processor=robot_action_processor,
robot_observation_processor=robot_observation_processor,
teleop=[leader_arm, keyboard],
control_time_s=RESET_TIME_SEC,
single_task=TASK_DESCRIPTION,
display_data=True,
teleop_action_processor=teleop_action_processor,
robot_action_processor=robot_action_processor,
robot_observation_processor=robot_observation_processor,
)
if events["rerecord_episode"]:

View File

@@ -1,77 +0,0 @@
# !/usr/bin/env python
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Run a trained policy on LeKiwi without recording (base rollout).
Uses the rollout engine's :class:`BaseStrategy` (autonomous execution,
no dataset) with :class:`SyncInferenceConfig` (inline policy call per
control tick). For a CLI entry point with the same capabilities plus
recording, upload, and human-in-the-loop variants, see ``lerobot-rollout``.
"""
from lerobot.configs import PreTrainedConfig
from lerobot.robots.lekiwi import LeKiwiClientConfig
from lerobot.rollout import BaseStrategyConfig, RolloutConfig, build_rollout_context
from lerobot.rollout.inference import SyncInferenceConfig
from lerobot.rollout.strategies import BaseStrategy
from lerobot.utils.process import ProcessSignalHandler
from lerobot.utils.utils import init_logging
FPS = 30
DURATION_SEC = 60
TASK_DESCRIPTION = "My task description"
HF_MODEL_ID = "<hf_username>/<model_repo_id>"
def main():
init_logging()
# Robot: LeKiwi client — make sure lekiwi_host is already running on the robot.
robot_config = LeKiwiClientConfig(remote_ip="172.18.134.136", id="lekiwi")
# Policy: load the pretrained config. ``pretrained_path`` is read downstream
# by ``build_rollout_context`` to reload the full model.
policy_config = PreTrainedConfig.from_pretrained(HF_MODEL_ID)
policy_config.pretrained_path = HF_MODEL_ID
# Assemble the rollout config: base strategy (no recording) + sync inference.
cfg = RolloutConfig(
robot=robot_config,
policy=policy_config,
strategy=BaseStrategyConfig(),
inference=SyncInferenceConfig(),
fps=FPS,
duration=DURATION_SEC,
task=TASK_DESCRIPTION,
)
# Graceful Ctrl-C: the strategy loop exits when shutdown_event is set.
signal_handler = ProcessSignalHandler(use_threads=True)
# Build the context (connects robot, loads policy, wires the inference strategy).
# No custom processors here — LeKiwi runs on raw joint features.
ctx = build_rollout_context(cfg, signal_handler.shutdown_event)
strategy = BaseStrategy(cfg.strategy)
try:
strategy.setup(ctx)
strategy.run(ctx)
finally:
strategy.teardown(ctx)
if __name__ == "__main__":
main()

View File

@@ -1,342 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 🤗 LeRobot Quickstart\n",
"\n",
"Calibration → teleoperation → data collection → training → evaluation.\n",
"\n",
"Install the required dependencies: `pip install -e .[notebook,dataset,training,viz,hardware]`.\n",
"\n",
"**How to use:**\n",
"1. Edit the **Configuration** cell with your settings.\n",
"2. Run all cells (`Run All`).\n",
"3. Each section prints a ready-to-paste terminal command - copy it and run it.\n",
"\n",
"Each setup is different, please refer to the [LeRobot documentation](https://huggingface.co/docs/lerobot/il_robots) for more details on each step and available options. <br>\n",
"Feel free to make this notebook your own and adapt it to your needs!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## Utils"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def _cameras_arg(cameras: dict) -> str:\n",
" if not cameras:\n",
" return \"\"\n",
" entries = [f\"{n}: {{{', '.join(f'{k}: {v}' for k, v in cfg.items())}}}\" for n, cfg in cameras.items()]\n",
" return \"{ \" + \", \".join(entries) + \" }\"\n",
"\n",
"\n",
"def print_cmd(*parts: str) -> None:\n",
" \"\"\"Print a shell command with line continuations, skipping empty parts.\"\"\"\n",
" non_empty = [p for p in parts if p]\n",
" print(\" \\\\\\n \".join(non_empty))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## Configuration\n",
"\n",
"Edit this cell, then **Run All** to generate all commands below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Robot (follower) - run `lerobot-find-port` to discover the port\n",
"ROBOT_TYPE = \"so101_follower\"\n",
"ROBOT_PORT = \"/dev/ttyACM0\"\n",
"ROBOT_ID = \"my_follower_arm\"\n",
"\n",
"# Teleop (leader) - run `lerobot-find-port` to discover the port\n",
"TELEOP_TYPE = \"so101_leader\"\n",
"TELEOP_PORT = \"/dev/ttyACM1\"\n",
"TELEOP_ID = \"my_leader_arm\"\n",
"\n",
"# Cameras - set to {} to disable\n",
"# Run `lerobot-find-cameras opencv` to list available cameras and their indices\n",
"CAMERAS = {\n",
" \"top\": {\"type\": \"opencv\", \"index_or_path\": 2, \"width\": 640, \"height\": 480, \"fps\": 30},\n",
" \"wrist\": {\"type\": \"opencv\", \"index_or_path\": 4, \"width\": 640, \"height\": 480, \"fps\": 30},\n",
"}\n",
"\n",
"# Dataset\n",
"HF_USER = \"your_hf_username\" # `huggingface-cli whoami` to find your username\n",
"DATASET_NAME = \"my_so101_dataset\"\n",
"TASK_DESCRIPTION = \"pick and place the block\"\n",
"NUM_EPISODES = 10\n",
"\n",
"# Training\n",
"POLICY_TYPE = \"act\" # act, diffusion, smolvla, ...\n",
"POLICY_DEVICE = \"cuda\" # cuda / cpu / mps\n",
"TRAIN_STEPS = 10_000\n",
"SAVE_FREQ = 2_000\n",
"OUTPUT_DIR = f\"outputs/train/{DATASET_NAME}\"\n",
"\n",
"# Inference - Hub repo ID or local checkpoint path\n",
"# e.g. set to f\"{OUTPUT_DIR}/checkpoints/last\" to use a local checkpoint\n",
"POLICY_PATH = f\"{HF_USER}/{DATASET_NAME}_{POLICY_TYPE}\"\n",
"LAST_CHECKPOINT_PATH = f\"{OUTPUT_DIR}/checkpoints/last\"\n",
"\n",
"# Derived\n",
"DATASET_REPO_ID = f\"{HF_USER}/{DATASET_NAME}\"\n",
"DATASET_ROOT = f\"data/{DATASET_NAME}\"\n",
"POLICY_REPO_ID = f\"{HF_USER}/{DATASET_NAME}_{POLICY_TYPE}\"\n",
"EVAL_REPO_ID = f\"{HF_USER}/eval_{DATASET_NAME}\"\n",
"CAMERAS_ARG = _cameras_arg(CAMERAS)\n",
"CAMERAS_FLAG = f'--robot.cameras=\"{CAMERAS_ARG}\"' if CAMERAS_ARG else \"\"\n",
"\n",
"print(f\"Robot : {ROBOT_TYPE} @ {ROBOT_PORT}\")\n",
"print(f\"Teleop : {TELEOP_TYPE} @ {TELEOP_PORT}\")\n",
"print(f\"Cameras: {list(CAMERAS) or 'none'}\")\n",
"print(f\"Dataset: {DATASET_REPO_ID} ({NUM_EPISODES} episodes) saved to {DATASET_ROOT}\")\n",
"print(f\"Policy : {POLICY_TYPE} -> {POLICY_REPO_ID}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## 1. Calibration\n",
"\n",
"Run once per arm before first use."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Follower\n",
"print_cmd(\n",
" \"lerobot-calibrate\",\n",
" f\"--robot.type={ROBOT_TYPE}\",\n",
" f\"--robot.port={ROBOT_PORT}\",\n",
" f\"--robot.id={ROBOT_ID}\",\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Leader\n",
"print_cmd(\n",
" \"lerobot-calibrate\",\n",
" f\"--teleop.type={TELEOP_TYPE}\",\n",
" f\"--teleop.port={TELEOP_PORT}\",\n",
" f\"--teleop.id={TELEOP_ID}\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## 2. Teleoperation\n",
"\n",
"See the [teleoperation docs](https://huggingface.co/docs/lerobot/il_robots#teleoperate) and the [cameras guide](https://huggingface.co/docs/lerobot/cameras) for more options."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print_cmd(\n",
" \"lerobot-teleoperate\",\n",
" f\"--robot.type={ROBOT_TYPE}\",\n",
" f\"--robot.port={ROBOT_PORT}\",\n",
" f\"--robot.id={ROBOT_ID}\",\n",
" CAMERAS_FLAG,\n",
" f\"--teleop.type={TELEOP_TYPE}\",\n",
" f\"--teleop.port={TELEOP_PORT}\",\n",
" f\"--teleop.id={TELEOP_ID}\",\n",
" \"--display_data=true\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## 3. Record Dataset\n",
"\n",
"See the [recording docs](https://huggingface.co/docs/lerobot/il_robots#record-a-dataset) for tips on gathering good data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print_cmd(\n",
" \"lerobot-record\",\n",
" f\"--robot.type={ROBOT_TYPE}\",\n",
" f\"--robot.port={ROBOT_PORT}\",\n",
" f\"--robot.id={ROBOT_ID}\",\n",
" CAMERAS_FLAG,\n",
" f\"--teleop.type={TELEOP_TYPE}\",\n",
" f\"--teleop.port={TELEOP_PORT}\",\n",
" f\"--teleop.id={TELEOP_ID}\",\n",
" f\"--dataset.repo_id={DATASET_REPO_ID}\",\n",
" f\"--dataset.num_episodes={NUM_EPISODES}\",\n",
" f'--dataset.single_task=\"{TASK_DESCRIPTION}\"',\n",
" \"--dataset.streaming_encoding=true\",\n",
" \"--display_data=true\",\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Resume a previously interrupted recording session\n",
"print_cmd(\n",
" \"lerobot-record\",\n",
" f\"--robot.type={ROBOT_TYPE}\",\n",
" f\"--robot.port={ROBOT_PORT}\",\n",
" f\"--robot.id={ROBOT_ID}\",\n",
" CAMERAS_FLAG,\n",
" f\"--teleop.type={TELEOP_TYPE}\",\n",
" f\"--teleop.port={TELEOP_PORT}\",\n",
" f\"--teleop.id={TELEOP_ID}\",\n",
" f\"--dataset.repo_id={DATASET_REPO_ID}\",\n",
" f\"--dataset.root={DATASET_ROOT}\",\n",
" f\"--dataset.num_episodes={NUM_EPISODES}\",\n",
" f'--dataset.single_task=\"{TASK_DESCRIPTION}\"',\n",
" \"--dataset.streaming_encoding=true\",\n",
" \"--display_data=true\",\n",
" \"--resume=true\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## 4. Train Policy\n",
"\n",
"See the [training docs](https://huggingface.co/docs/lerobot/il_robots#train-a-policy) for configuration options and tips."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print_cmd(\n",
" \"lerobot-train\",\n",
" f\"--dataset.repo_id={DATASET_REPO_ID}\",\n",
" f\"--policy.type={POLICY_TYPE}\",\n",
" f\"--policy.device={POLICY_DEVICE}\",\n",
" f\"--policy.repo_id={POLICY_REPO_ID}\",\n",
" f\"--output_dir={OUTPUT_DIR}\",\n",
" f\"--steps={TRAIN_STEPS}\",\n",
" f\"--save_freq={SAVE_FREQ}\",\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Resume a previously interrupted training session\n",
"print_cmd(\n",
" \"lerobot-train\",\n",
" f\"--config_path={LAST_CHECKPOINT_PATH}/pretrained_model/train_config.json\",\n",
" \"--resume=true\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## 5. Inference\n",
"\n",
"Uses `POLICY_PATH` from the Configuration cell (defaults to the Hub repo ID). You can also put there the `LAST_CHECKPOINT_PATH`.\n",
"\n",
"See the [inference docs](https://huggingface.co/docs/lerobot/il_robots#run-inference-and-evaluate-your-policy) for details."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print_cmd(\n",
" \"lerobot-record\",\n",
" f\"--policy.path={POLICY_PATH}\",\n",
" f\"--robot.type={ROBOT_TYPE}\",\n",
" f\"--robot.port={ROBOT_PORT}\",\n",
" f\"--robot.id={ROBOT_ID}\",\n",
" CAMERAS_FLAG,\n",
" f\"--teleop.type={TELEOP_TYPE}\",\n",
" f\"--teleop.port={TELEOP_PORT}\",\n",
" f\"--teleop.id={TELEOP_ID}\",\n",
" f\"--dataset.repo_id={EVAL_REPO_ID}\",\n",
" f\"--dataset.num_episodes={NUM_EPISODES}\",\n",
" f'--dataset.single_task=\"{TASK_DESCRIPTION}\"',\n",
" \"--dataset.streaming_encoding=true\",\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "lerobot (3.12.3)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@@ -14,17 +14,13 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import time
from lerobot.cameras.opencv import OpenCVCameraConfig
from lerobot.common.control_utils import init_keyboard_listener, predict_action
from lerobot.common.control_utils import init_keyboard_listener
from lerobot.configs import FeatureType, PolicyFeature
from lerobot.datasets import LeRobotDataset, aggregate_pipeline_dataset_features, create_initial_features
from lerobot.model.kinematics import RobotKinematics
from lerobot.policies import make_pre_post_processors
from lerobot.policies.act import ACTPolicy
from lerobot.policies.utils import make_robot_action
from lerobot.processor import (
RobotProcessorPipeline,
make_default_teleop_action_processor,
@@ -38,12 +34,11 @@ from lerobot.robots.so_follower.robot_kinematic_processor import (
ForwardKinematicsJointsToEE,
InverseKinematicsEEToJoints,
)
from lerobot.scripts.lerobot_record import record_loop
from lerobot.types import RobotAction, RobotObservation
from lerobot.utils.constants import ACTION, OBS_STR
from lerobot.utils.feature_utils import build_dataset_frame, combine_feature_dicts
from lerobot.utils.robot_utils import precise_sleep
from lerobot.utils.feature_utils import combine_feature_dicts
from lerobot.utils.utils import log_say
from lerobot.utils.visualization_utils import init_rerun, log_rerun_data
from lerobot.utils.visualization_utils import init_rerun
NUM_EPISODES = 5
FPS = 30
@@ -54,9 +49,6 @@ HF_DATASET_ID = "<hf_username>/<dataset_repo_id>"
def main():
# NOTE: For production policy deployment, use `lerobot-rollout` CLI instead.
# This script provides a self-contained example for educational purposes.
# Create the robot configuration & robot
camera_config = {"front": OpenCVCameraConfig(index_or_path=0, width=640, height=480, fps=FPS)}
robot_config = SO100FollowerConfig(
@@ -151,67 +143,43 @@ def main():
raise ValueError("Robot is not connected!")
print("Starting evaluate loop...")
control_interval = 1 / FPS
episode_idx = 0
for episode_idx in range(NUM_EPISODES):
log_say(f"Running inference, recording eval episode {episode_idx + 1} of {NUM_EPISODES}")
# Inline evaluation loop: predict actions and send to robot
timestamp = 0
start_episode_t = time.perf_counter()
while timestamp < EPISODE_TIME_SEC:
start_loop_t = time.perf_counter()
if events["exit_early"]:
events["exit_early"] = False
break
# Get robot observation
obs = robot.get_observation()
obs_processed = robot_joints_to_ee_pose_processor(obs)
observation_frame = build_dataset_frame(dataset.features, obs_processed, prefix=OBS_STR)
# Predict action using the policy
action_tensor = predict_action(
observation=observation_frame,
policy=policy,
device=policy.config.device,
preprocessor=preprocessor,
postprocessor=postprocessor,
use_amp=policy.config.device.type == "cuda",
task=TASK_DESCRIPTION,
robot_type=robot.name,
)
# Convert policy output to robot action dict
action_values = make_robot_action(action_tensor, dataset.features)
# Process and send action to robot (EE -> joints via IK)
robot_action_to_send = robot_ee_to_joints_processor((action_values, obs))
robot.send_action(robot_action_to_send)
# Write to dataset
action_frame = build_dataset_frame(dataset.features, action_values, prefix=ACTION)
frame = {**observation_frame, **action_frame, "task": TASK_DESCRIPTION}
dataset.add_frame(frame)
log_rerun_data(observation=obs_processed, action=action_values)
dt_s = time.perf_counter() - start_loop_t
sleep_time_s = control_interval - dt_s
if sleep_time_s < 0:
logging.warning(
f"Evaluate loop is running slower ({1 / dt_s:.1f} Hz) than the target FPS ({FPS} Hz)."
)
precise_sleep(max(sleep_time_s, 0.0))
timestamp = time.perf_counter() - start_episode_t
# Main record loop
record_loop(
robot=robot,
events=events,
fps=FPS,
policy=policy,
preprocessor=preprocessor, # Pass the pre and post policy processors
postprocessor=postprocessor,
dataset=dataset,
control_time_s=EPISODE_TIME_SEC,
single_task=TASK_DESCRIPTION,
display_data=True,
teleop_action_processor=make_default_teleop_action_processor(),
robot_action_processor=robot_ee_to_joints_processor,
robot_observation_processor=robot_joints_to_ee_pose_processor,
)
# Reset the environment if not stopping or re-recording
if not events["stop_recording"] and (
(episode_idx < NUM_EPISODES - 1) or events["rerecord_episode"]
):
log_say("Reset the environment")
log_say("Waiting for environment reset, press right arrow key when ready...")
record_loop(
robot=robot,
events=events,
fps=FPS,
control_time_s=EPISODE_TIME_SEC,
single_task=TASK_DESCRIPTION,
display_data=True,
teleop_action_processor=make_default_teleop_action_processor(),
robot_action_processor=robot_ee_to_joints_processor,
robot_observation_processor=robot_joints_to_ee_pose_processor,
)
if events["rerecord_episode"]:
log_say("Re-record episode")
@@ -222,6 +190,7 @@ def main():
# Save episode
dataset.save_episode()
episode_idx += 1
finally:
# Clean up
log_say("Stop recording")

View File

@@ -65,15 +65,14 @@ def main():
robot = SO100Follower(robot_config)
phone = Phone(teleop_config)
# NOTE: It is highly recommended to use the urdf in the SO-ARM100 repo:
# https://github.com/TheRobotStudio/SO-ARM100/blob/main/Simulation/SO101/so101_new_calib.urdf
# NOTE: It is highly recommended to use the urdf in the SO-ARM100 repo: https://github.com/TheRobotStudio/SO-ARM100/blob/main/Simulation/SO101/so101_new_calib.urdf
kinematics_solver = RobotKinematics(
urdf_path="./SO101/so101_new_calib.urdf",
target_frame_name="gripper_frame_link",
joint_names=list(robot.bus.motors.keys()),
)
# Build pipeline to convert phone action to EE action (with gripper velocity mapped to joint).
# Build pipeline to convert phone action to EE action
phone_to_robot_ee_pose_processor = RobotProcessorPipeline[
tuple[RobotAction, RobotObservation], RobotAction
](
@@ -95,7 +94,7 @@ def main():
to_output=transition_to_robot_action,
)
# Build pipeline to convert EE action to joints action (IK).
# Build pipeline to convert EE action to joints action
robot_ee_to_joints_processor = RobotProcessorPipeline[tuple[RobotAction, RobotObservation], RobotAction](
steps=[
InverseKinematicsEEToJoints(
@@ -108,7 +107,7 @@ def main():
to_output=transition_to_robot_action,
)
# Build pipeline to convert joint observation to EE observation (FK).
# Build pipeline to convert joint observation to EE observation
robot_joints_to_ee_pose = RobotProcessorPipeline[RobotObservation, RobotObservation](
steps=[
ForwardKinematicsJointsToEE(
@@ -119,12 +118,13 @@ def main():
to_output=transition_to_observation,
)
# Create the dataset, deriving features from the pipelines so the on-disk schema
# matches exactly what the pipelines produce at runtime.
# Create the dataset
dataset = LeRobotDataset.create(
repo_id=HF_REPO_ID,
fps=FPS,
features=combine_feature_dicts(
# Run the feature contract of the pipelines
# This tells you how the features would look like after the pipeline steps
aggregate_pipeline_dataset_features(
pipeline=phone_to_robot_ee_pose_processor,
initial_features=create_initial_features(action=phone.action_features),
@@ -163,14 +163,14 @@ def main():
robot=robot,
events=events,
fps=FPS,
teleop_action_processor=phone_to_robot_ee_pose_processor,
robot_action_processor=robot_ee_to_joints_processor,
robot_observation_processor=robot_joints_to_ee_pose,
teleop=phone,
dataset=dataset,
control_time_s=EPISODE_TIME_SEC,
single_task=TASK_DESCRIPTION,
display_data=True,
teleop_action_processor=phone_to_robot_ee_pose_processor,
robot_action_processor=robot_ee_to_joints_processor,
robot_observation_processor=robot_joints_to_ee_pose,
)
# Reset the environment if not stopping or re-recording
@@ -182,13 +182,13 @@ def main():
robot=robot,
events=events,
fps=FPS,
teleop_action_processor=phone_to_robot_ee_pose_processor,
robot_action_processor=robot_ee_to_joints_processor,
robot_observation_processor=robot_joints_to_ee_pose,
teleop=phone,
control_time_s=RESET_TIME_SEC,
single_task=TASK_DESCRIPTION,
display_data=True,
teleop_action_processor=phone_to_robot_ee_pose_processor,
robot_action_processor=robot_ee_to_joints_processor,
robot_observation_processor=robot_joints_to_ee_pose,
)
if events["rerecord_episode"]:

View File

@@ -1,126 +0,0 @@
# !/usr/bin/env python
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Run a trained EE-space policy on SO100 (phone-trained) without recording.
Mirrors ``examples/so100_to_so100_EE/rollout.py`` — the model was trained
with phone teleoperation in EE space, so at deployment we only need the
joint↔EE conversion on the robot side; the phone is not used.
Uses :class:`BaseStrategy` (no recording) + :class:`SyncInferenceConfig`
(inline policy call). For recording during rollout, switch to Sentry,
Highlight, or DAgger via ``lerobot-rollout --strategy.type=...``.
"""
from lerobot.cameras.opencv import OpenCVCameraConfig
from lerobot.configs import PreTrainedConfig
from lerobot.model.kinematics import RobotKinematics
from lerobot.processor import (
RobotProcessorPipeline,
observation_to_transition,
robot_action_observation_to_transition,
transition_to_observation,
transition_to_robot_action,
)
from lerobot.robots.so_follower import SO100Follower, SO100FollowerConfig
from lerobot.robots.so_follower.robot_kinematic_processor import (
ForwardKinematicsJointsToEE,
InverseKinematicsEEToJoints,
)
from lerobot.rollout import BaseStrategyConfig, RolloutConfig, build_rollout_context
from lerobot.rollout.inference import SyncInferenceConfig
from lerobot.rollout.strategies import BaseStrategy
from lerobot.types import RobotAction, RobotObservation
from lerobot.utils.process import ProcessSignalHandler
from lerobot.utils.utils import init_logging
FPS = 30
DURATION_SEC = 60
TASK_DESCRIPTION = "My task description"
HF_MODEL_ID = "<hf_username>/<model_repo_id>"
def main():
init_logging()
camera_config = {"front": OpenCVCameraConfig(index_or_path=0, width=640, height=480, fps=FPS)}
robot_config = SO100FollowerConfig(
port="/dev/tty.usbmodem58760434471",
id="my_awesome_follower_arm",
cameras=camera_config,
use_degrees=True,
)
# Peek at motor names once to build the kinematic solver.
temp_robot = SO100Follower(robot_config)
motor_names = list(temp_robot.bus.motors.keys())
kinematics_solver = RobotKinematics(
urdf_path="./SO101/so101_new_calib.urdf",
target_frame_name="gripper_frame_link",
joint_names=motor_names,
)
robot_joints_to_ee_pose_processor = RobotProcessorPipeline[RobotObservation, RobotObservation](
steps=[ForwardKinematicsJointsToEE(kinematics=kinematics_solver, motor_names=motor_names)],
to_transition=observation_to_transition,
to_output=transition_to_observation,
)
robot_ee_to_joints_processor = RobotProcessorPipeline[tuple[RobotAction, RobotObservation], RobotAction](
steps=[
InverseKinematicsEEToJoints(
kinematics=kinematics_solver,
motor_names=motor_names,
initial_guess_current_joints=True,
),
],
to_transition=robot_action_observation_to_transition,
to_output=transition_to_robot_action,
)
policy_config = PreTrainedConfig.from_pretrained(HF_MODEL_ID)
policy_config.pretrained_path = HF_MODEL_ID
cfg = RolloutConfig(
robot=robot_config,
policy=policy_config,
strategy=BaseStrategyConfig(),
inference=SyncInferenceConfig(),
fps=FPS,
duration=DURATION_SEC,
task=TASK_DESCRIPTION,
)
signal_handler = ProcessSignalHandler(use_threads=True)
ctx = build_rollout_context(
cfg,
signal_handler.shutdown_event,
robot_action_processor=robot_ee_to_joints_processor,
robot_observation_processor=robot_joints_to_ee_pose_processor,
)
strategy = BaseStrategy(cfg.strategy)
try:
strategy.setup(ctx)
strategy.run(ctx)
finally:
strategy.teardown(ctx)
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,673 @@
#!/usr/bin/env python
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Demo script showing how to use Real-Time Chunking (RTC) with action chunking policies on real robots.
This script demonstrates:
1. Creating a robot and policy (SmolVLA, Pi0, etc.) with RTC
2. Consuming actions from the policy while the robot executes
3. Periodically requesting new action chunks in the background using threads
4. Managing action buffers and timing for real-time operation
For simulation environments, see eval_with_simulation.py
Usage:
# Run RTC with Real robot with RTC
uv run examples/rtc/eval_with_real_robot.py \
--policy.path=<USER>/smolvla_check_rtc_last3 \
--policy.device=mps \
--rtc.enabled=true \
--rtc.execution_horizon=20 \
--robot.type=so100_follower \
--robot.port=/dev/tty.usbmodem58FA0834591 \
--robot.id=so100_follower \
--robot.cameras="{ gripper: {type: opencv, index_or_path: 1, width: 640, height: 480, fps: 30}, front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
--task="Move green small object into the purple platform" \
--duration=120
# Run RTC with Real robot without RTC
uv run examples/rtc/eval_with_real_robot.py \
--policy.path=<USER>/smolvla_check_rtc_last3 \
--policy.device=mps \
--rtc.enabled=false \
--robot.type=so100_follower \
--robot.port=/dev/tty.usbmodem58FA0834591 \
--robot.id=so100_follower \
--robot.cameras="{ gripper: {type: opencv, index_or_path: 1, width: 640, height: 480, fps: 30}, front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
--task="Move green small object into the purple platform" \
--duration=120
# Run RTC with Real robot with pi0.5 policy
uv run examples/rtc/eval_with_real_robot.py \
--policy.path=<USER>/pi05_check_rtc \
--policy.device=mps \
--rtc.enabled=true \
--rtc.execution_horizon=20 \
--robot.type=so100_follower \
--robot.port=/dev/tty.usbmodem58FA0834591 \
--robot.id=so100_follower \
--robot.cameras="{ gripper: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}, front: {type: opencv, index_or_path: 1, width: 640, height: 480, fps: 30}}" \
--task="Move green small object into the purple platform" \
--duration=120
# Run RTC with bi_openarm_follower (dual-arm OpenArms) and pi0.5 policy
python examples/rtc/eval_with_real_robot.py \
--policy.path=lerobot-data-collection/folding_final \
--robot.type=bi_openarm_follower \
--robot.cameras='{left_wrist: {type: opencv, index_or_path: "/dev/video4", width: 1280, height: 720, fps: 30}, base: {type: opencv, index_or_path: "/dev/video2", width: 640, height: 480, fps: 30}, right_wrist: {type: opencv, index_or_path: "/dev/video0", width: 1280, height: 720, fps: 30}}' \
--robot.left_arm_config.port=can0 \
--robot.left_arm_config.side=left \
--robot.left_arm_config.can_interface=socketcan \
--robot.left_arm_config.disable_torque_on_disconnect=true \
--robot.left_arm_config.max_relative_target=8.0 \
--robot.right_arm_config.port=can1 \
--robot.right_arm_config.side=right \
--robot.right_arm_config.can_interface=socketcan \
--robot.right_arm_config.disable_torque_on_disconnect=true \
--robot.right_arm_config.max_relative_target=8.0 \
--task="Fold the T-shirt properly" \
--fps=30 \
--duration=2000 \
--interpolation_multiplier=3 \
--rtc.enabled=true \
--rtc.execution_horizon=20 \
--rtc.max_guidance_weight=5.0 \
--rtc.prefix_attention_schedule=LINEAR \
--device=cuda
"""
import logging
import math
import sys
import time
import traceback
from dataclasses import dataclass, field
from threading import Event, Lock, Thread
import torch
from torch import Tensor
from lerobot.cameras.opencv import OpenCVCameraConfig # noqa: F401
from lerobot.cameras.realsense import RealSenseCameraConfig # noqa: F401
from lerobot.cameras.zmq import ZMQCameraConfig # noqa: F401
from lerobot.configs import PreTrainedConfig, RTCAttentionSchedule, parser
from lerobot.policies import get_policy_class, make_pre_post_processors
from lerobot.policies.rtc import ActionInterpolator, ActionQueue, LatencyTracker, RTCConfig
from lerobot.processor import (
NormalizerProcessorStep,
RelativeActionsProcessorStep,
TransitionKey,
create_transition,
make_default_robot_action_processor,
make_default_robot_observation_processor,
to_relative_actions,
)
from lerobot.rl.process import ProcessSignalHandler
from lerobot.robots import ( # noqa: F401
Robot,
RobotConfig,
bi_openarm_follower,
bi_so_follower,
koch_follower,
so_follower,
unitree_g1,
)
from lerobot.robots.utils import make_robot_from_config
from lerobot.utils.constants import OBS_IMAGES, OBS_STATE
from lerobot.utils.feature_utils import build_dataset_frame, hw_to_dataset_features
from lerobot.utils.hub import HubMixin
from lerobot.utils.utils import init_logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class RobotWrapper:
def __init__(self, robot: Robot):
self.robot = robot
self.lock = Lock()
def get_observation(self) -> dict[str, Tensor]:
with self.lock:
return self.robot.get_observation()
def send_action(self, action: Tensor):
with self.lock:
self.robot.send_action(action)
def observation_features(self) -> list[str]:
with self.lock:
return self.robot.observation_features
def action_features(self) -> list[str]:
with self.lock:
return self.robot.action_features
@dataclass
class RTCDemoConfig(HubMixin):
"""Configuration for RTC demo with action chunking policies and real robots."""
# Policy configuration
policy: PreTrainedConfig | None = None
# Robot configuration
robot: RobotConfig | None = None
# RTC configuration
rtc: RTCConfig = field(
default_factory=lambda: RTCConfig(
execution_horizon=10,
max_guidance_weight=1.0,
prefix_attention_schedule=RTCAttentionSchedule.EXP,
)
)
# Demo parameters
duration: float = 30.0 # Duration to run the demo (seconds)
fps: float = 10.0 # Action execution frequency (Hz)
interpolation_multiplier: int = 1 # Control rate multiplier (1=off, 2=2x, 3=3x)
# Compute device
device: str | None = None # Device to run on (cuda, cpu, auto)
# Get new actions horizon. The amount of executed steps after which will be requested new actions.
# It should be higher than inference delay + execution horizon.
action_queue_size_to_get_new_actions: int = 30
# Task to execute
task: str = field(default="", metadata={"help": "Task to execute"})
# Torch compile configuration
use_torch_compile: bool = field(
default=False,
metadata={"help": "Use torch.compile for faster inference (PyTorch 2.0+)"},
)
torch_compile_backend: str = field(
default="inductor",
metadata={"help": "Backend for torch.compile (inductor, aot_eager, cudagraphs)"},
)
torch_compile_mode: str = field(
default="default",
metadata={"help": "Compilation mode (default, reduce-overhead, max-autotune)"},
)
torch_compile_disable_cudagraphs: bool = field(
default=True,
metadata={
"help": "Disable CUDA graphs in torch.compile. Required due to in-place tensor "
"operations in denoising loop (x_t += dt * v_t) which cause tensor aliasing issues."
},
)
def __post_init__(self):
# HACK: We parse again the cli args here to get the pretrained path if there was one.
policy_path = parser.get_path_arg("policy")
if policy_path:
cli_overrides = parser.get_cli_overrides("policy")
self.policy = PreTrainedConfig.from_pretrained(policy_path, cli_overrides=cli_overrides)
self.policy.pretrained_path = policy_path
else:
raise ValueError("Policy path is required")
# Validate that robot configuration is provided
if self.robot is None:
raise ValueError("Robot configuration must be provided")
@classmethod
def __get_path_fields__(cls) -> list[str]:
"""This enables the parser to load config from the policy using `--policy.path=local/dir`"""
return ["policy"]
def is_image_key(k: str) -> bool:
return k.startswith(OBS_IMAGES)
def _reanchor_relative_rtc_prefix(
prev_actions_absolute: Tensor,
current_state: Tensor,
relative_step: RelativeActionsProcessorStep,
normalizer_step: NormalizerProcessorStep | None,
policy_device: torch.device | str,
) -> Tensor:
"""Convert absolute leftovers into model-space for relative-action RTC policies.
When a policy uses relative actions, the RTC prefix (leftover actions from
the previous chunk) is stored in absolute space. Before feeding it back to
the policy we need to re-express it relative to the *current* robot state
and then re-normalize.
"""
state = current_state.detach().cpu()
if state.dim() == 1:
state = state.unsqueeze(0)
action_cpu = prev_actions_absolute.detach().cpu()
mask = relative_step._build_mask(action_cpu.shape[-1])
relative_actions = to_relative_actions(action_cpu, state, mask)
transition = create_transition(action=relative_actions)
if normalizer_step is not None:
transition = normalizer_step(transition)
return transition[TransitionKey.ACTION].to(policy_device)
def get_actions(
policy,
robot: RobotWrapper,
robot_observation_processor,
action_queue: ActionQueue,
shutdown_event: Event,
cfg: RTCDemoConfig,
):
"""Thread function to request action chunks from the policy.
Args:
policy: The policy instance (SmolVLA, Pi0, etc.)
robot: The robot instance for getting observations
robot_observation_processor: Processor for raw robot observations
action_queue: Queue to put new action chunks
shutdown_event: Event to signal shutdown
cfg: Demo configuration
"""
try:
logger.info("[GET_ACTIONS] Starting get actions thread")
latency_tracker = LatencyTracker() # Track latency of action chunks
fps = cfg.fps
time_per_chunk = 1.0 / fps
# Only keep .pos joints + camera streams if the policy was trained on positions,
# not the full pos/vel/torque state the robot exposes.
observation_features_hw = {
key: value
for key, value in robot.observation_features().items()
if key.endswith(".pos") or isinstance(value, tuple)
}
dataset_features = hw_to_dataset_features(observation_features_hw, "observation")
policy_device = policy.config.device
# Load preprocessor and postprocessor from pretrained files
# The stats are embedded in the processor .safetensors files
logger.info(f"[GET_ACTIONS] Loading preprocessor/postprocessor from {cfg.policy.pretrained_path}")
preprocessor, postprocessor = make_pre_post_processors(
policy_cfg=cfg.policy,
pretrained_path=cfg.policy.pretrained_path,
dataset_stats=None, # Will load from pretrained processor files
preprocessor_overrides={
"device_processor": {"device": cfg.policy.device},
},
)
logger.info("[GET_ACTIONS] Preprocessor/postprocessor loaded successfully with embedded stats")
relative_step = next(
(s for s in preprocessor.steps if isinstance(s, RelativeActionsProcessorStep) and s.enabled),
None,
)
normalizer_step = next(
(s for s in preprocessor.steps if isinstance(s, NormalizerProcessorStep)),
None,
)
if relative_step is not None:
if relative_step.action_names is None:
cfg_names = getattr(cfg.policy, "action_feature_names", None)
if cfg_names:
relative_step.action_names = list(cfg_names)
else:
relative_step.action_names = [
k for k in robot.robot.action_features if k.endswith(".pos")
]
logger.info("[GET_ACTIONS] Relative actions enabled: will re-anchor RTC prefix")
get_actions_threshold = cfg.action_queue_size_to_get_new_actions
if not cfg.rtc.enabled:
get_actions_threshold = 0
while not shutdown_event.is_set():
if action_queue.qsize() <= get_actions_threshold:
current_time = time.perf_counter()
action_index_before_inference = action_queue.get_action_index()
prev_actions = action_queue.get_left_over()
inference_latency = latency_tracker.max()
inference_delay = math.ceil(inference_latency / time_per_chunk)
obs = robot.get_observation()
# Apply robot observation processor
obs_processed = robot_observation_processor(obs)
obs_with_policy_features = build_dataset_frame(
dataset_features, obs_processed, prefix="observation"
)
for name in obs_with_policy_features:
obs_with_policy_features[name] = torch.from_numpy(obs_with_policy_features[name])
if "image" in name:
obs_with_policy_features[name] = (
obs_with_policy_features[name].type(torch.float32) / 255
)
obs_with_policy_features[name] = (
obs_with_policy_features[name].permute(2, 0, 1).contiguous()
)
obs_with_policy_features[name] = obs_with_policy_features[name].unsqueeze(0)
obs_with_policy_features[name] = obs_with_policy_features[name].to(policy_device)
obs_with_policy_features["task"] = [cfg.task] # Task should be a list, not a string!
obs_with_policy_features["robot_type"] = (
robot.robot.name if hasattr(robot.robot, "name") else ""
)
preproceseded_obs = preprocessor(obs_with_policy_features)
# Re-anchor leftover actions for relative-action policies.
# We need the *postprocessed* (absolute) leftover, not the original
# (normalized/relative) one that get_left_over() returns.
if (
prev_actions is not None
and relative_step is not None
and OBS_STATE in obs_with_policy_features
):
with action_queue.lock:
if action_queue.queue is not None:
prev_actions_abs = action_queue.queue[action_queue.last_index :].clone()
else:
prev_actions_abs = None
if prev_actions_abs is not None and prev_actions_abs.numel() > 0:
prev_actions = _reanchor_relative_rtc_prefix(
prev_actions_absolute=prev_actions_abs,
current_state=obs_with_policy_features[OBS_STATE],
relative_step=relative_step,
normalizer_step=normalizer_step,
policy_device=policy_device,
)
# Generate actions WITH RTC
actions = policy.predict_action_chunk(
preproceseded_obs,
inference_delay=inference_delay,
prev_chunk_left_over=prev_actions,
)
# Store original actions (before postprocessing) for RTC
original_actions = actions.squeeze(0).clone()
postprocessed_actions = postprocessor(actions)
postprocessed_actions = postprocessed_actions.squeeze(0)
new_latency = time.perf_counter() - current_time
new_delay = math.ceil(new_latency / time_per_chunk)
latency_tracker.add(new_latency)
if cfg.action_queue_size_to_get_new_actions < cfg.rtc.execution_horizon + new_delay:
logger.warning(
"[GET_ACTIONS] cfg.action_queue_size_to_get_new_actions Too small, It should be higher than inference delay + execution horizon."
)
action_queue.merge(
original_actions, postprocessed_actions, new_delay, action_index_before_inference
)
else:
# Small sleep to prevent busy waiting
time.sleep(0.1)
logger.info("[GET_ACTIONS] get actions thread shutting down")
except Exception as e:
logger.error(f"[GET_ACTIONS] Fatal exception in get_actions thread: {e}")
logger.error(traceback.format_exc())
sys.exit(1)
def actor_control(
robot: RobotWrapper,
robot_action_processor,
action_queue: ActionQueue,
shutdown_event: Event,
cfg: RTCDemoConfig,
):
"""Thread function to execute actions on the robot.
Args:
robot: The robot instance
action_queue: Queue to get actions from
shutdown_event: Event to signal shutdown
cfg: Demo configuration
"""
try:
logger.info("[ACTOR] Starting actor thread")
action_keys = [k for k in robot.action_features() if k.endswith(".pos")]
action_count = 0
interpolator = ActionInterpolator(multiplier=cfg.interpolation_multiplier)
action_interval = interpolator.get_control_interval(cfg.fps)
while not shutdown_event.is_set():
start_time = time.perf_counter()
if interpolator.needs_new_action():
new_action = action_queue.get()
if new_action is not None:
interpolator.add(new_action.cpu())
action = interpolator.get()
if action is not None:
action = action.cpu()
action_dict = {key: action[i].item() for i, key in enumerate(action_keys)}
action_processed = robot_action_processor((action_dict, None))
robot.send_action(action_processed)
action_count += 1
dt_s = time.perf_counter() - start_time
time.sleep(max(0, (action_interval - dt_s) - 0.001))
logger.info(f"[ACTOR] Actor thread shutting down. Total actions executed: {action_count}")
except Exception as e:
logger.error(f"[ACTOR] Fatal exception in actor_control thread: {e}")
logger.error(traceback.format_exc())
sys.exit(1)
def _apply_torch_compile(policy, cfg: RTCDemoConfig):
"""Apply torch.compile to the policy's predict_action_chunk method.
Args:
policy: Policy instance to compile
cfg: Configuration containing torch compile settings
Returns:
Policy with compiled predict_action_chunk method
"""
# PI models handle their own compilation
if policy.type == "pi05" or policy.type == "pi0":
return policy
try:
# Check if torch.compile is available (PyTorch 2.0+)
if not hasattr(torch, "compile"):
logger.warning(
f"torch.compile is not available. Requires PyTorch 2.0+. "
f"Current version: {torch.__version__}. Skipping compilation."
)
return policy
logger.info("Applying torch.compile to predict_action_chunk...")
logger.info(f" Backend: {cfg.torch_compile_backend}")
logger.info(f" Mode: {cfg.torch_compile_mode}")
logger.info(f" Disable CUDA graphs: {cfg.torch_compile_disable_cudagraphs}")
# Compile the predict_action_chunk method
# - CUDA graphs disabled to prevent tensor aliasing from in-place ops (x_t += dt * v_t)
compile_kwargs = {
"backend": cfg.torch_compile_backend,
"mode": cfg.torch_compile_mode,
}
# Disable CUDA graphs if requested (prevents tensor aliasing issues)
if cfg.torch_compile_disable_cudagraphs:
compile_kwargs["options"] = {"triton.cudagraphs": False}
original_method = policy.predict_action_chunk
compiled_method = torch.compile(original_method, **compile_kwargs)
policy.predict_action_chunk = compiled_method
logger.info("✓ Successfully compiled predict_action_chunk")
except Exception as e:
logger.error(f"Failed to apply torch.compile: {e}")
logger.warning("Continuing without torch.compile")
return policy
@parser.wrap()
def demo_cli(cfg: RTCDemoConfig):
"""Main entry point for RTC demo with draccus configuration."""
# Initialize logging
init_logging()
logger.info(f"Using device: {cfg.device}")
# Setup signal handler for graceful shutdown
signal_handler = ProcessSignalHandler(use_threads=True, display_pid=False)
shutdown_event = signal_handler.shutdown_event
policy = None
robot = None
get_actions_thread = None
actor_thread = None
policy_class = get_policy_class(cfg.policy.type)
# Load config and set compile_model for pi0/pi05 models
config = PreTrainedConfig.from_pretrained(cfg.policy.pretrained_path)
if cfg.policy.type == "pi05" or cfg.policy.type == "pi0":
config.compile_model = cfg.use_torch_compile
if config.use_peft:
from peft import PeftConfig, PeftModel
peft_pretrained_path = cfg.policy.pretrained_path
peft_config = PeftConfig.from_pretrained(peft_pretrained_path)
policy = policy_class.from_pretrained(
pretrained_name_or_path=peft_config.base_model_name_or_path, config=config
)
policy = PeftModel.from_pretrained(policy, peft_pretrained_path, config=peft_config)
else:
policy = policy_class.from_pretrained(cfg.policy.pretrained_path, config=config)
# Turn on RTC
policy.config.rtc_config = cfg.rtc
# Init RTC processort, as by default if RTC disabled in the config
# The processor won't be created
policy.init_rtc_processor()
assert policy.name in ["smolvla", "pi05", "pi0"], "Only smolvla, pi05, and pi0 are supported for RTC"
policy = policy.to(cfg.device)
policy.eval()
# Apply torch.compile to predict_action_chunk method if enabled
if cfg.use_torch_compile:
policy = _apply_torch_compile(policy, cfg)
# Create robot
logger.info(f"Initializing robot: {cfg.robot.type}")
robot = make_robot_from_config(cfg.robot)
robot.connect()
robot_wrapper = RobotWrapper(robot)
# Create robot observation processor
robot_observation_processor = make_default_robot_observation_processor()
robot_action_processor = make_default_robot_action_processor()
# Create action queue for communication between threads
action_queue = ActionQueue(cfg.rtc)
# Start chunk requester thread
get_actions_thread = Thread(
target=get_actions,
args=(policy, robot_wrapper, robot_observation_processor, action_queue, shutdown_event, cfg),
daemon=True,
name="GetActions",
)
get_actions_thread.start()
logger.info("Started get actions thread")
# Start action executor thread
actor_thread = Thread(
target=actor_control,
args=(robot_wrapper, robot_action_processor, action_queue, shutdown_event, cfg),
daemon=True,
name="Actor",
)
actor_thread.start()
logger.info("Started actor thread")
logger.info("Started stop by duration thread")
# Main thread monitors for duration or shutdown
logger.info(f"Running demo for {cfg.duration} seconds...")
start_time = time.time()
while not shutdown_event.is_set() and (time.time() - start_time) < cfg.duration:
time.sleep(10)
# Log queue status periodically
if int(time.time() - start_time) % 5 == 0:
logger.info(f"[MAIN] Action queue size: {action_queue.qsize()}")
if time.time() - start_time > cfg.duration:
break
logger.info("Demo duration reached or shutdown requested")
# Signal shutdown
shutdown_event.set()
# Wait for threads to finish
if get_actions_thread and get_actions_thread.is_alive():
logger.info("Waiting for chunk requester thread to finish...")
get_actions_thread.join()
if actor_thread and actor_thread.is_alive():
logger.info("Waiting for action executor thread to finish...")
actor_thread.join()
# Cleanup robot
if robot:
robot.disconnect()
logger.info("Robot disconnected")
logger.info("Cleanup completed")
if __name__ == "__main__":
demo_cli()
logging.info("RTC demo finished")

View File

@@ -14,17 +14,13 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import time
from lerobot.cameras.opencv import OpenCVCameraConfig
from lerobot.common.control_utils import init_keyboard_listener, predict_action
from lerobot.common.control_utils import init_keyboard_listener
from lerobot.configs import FeatureType, PolicyFeature
from lerobot.datasets import LeRobotDataset, aggregate_pipeline_dataset_features, create_initial_features
from lerobot.model.kinematics import RobotKinematics
from lerobot.policies import make_pre_post_processors
from lerobot.policies.act import ACTPolicy
from lerobot.policies.utils import make_robot_action
from lerobot.processor import (
RobotProcessorPipeline,
make_default_teleop_action_processor,
@@ -38,12 +34,11 @@ from lerobot.robots.so_follower.robot_kinematic_processor import (
ForwardKinematicsJointsToEE,
InverseKinematicsEEToJoints,
)
from lerobot.scripts.lerobot_record import record_loop
from lerobot.types import RobotAction, RobotObservation
from lerobot.utils.constants import ACTION, OBS_STR
from lerobot.utils.feature_utils import build_dataset_frame, combine_feature_dicts
from lerobot.utils.robot_utils import precise_sleep
from lerobot.utils.feature_utils import combine_feature_dicts
from lerobot.utils.utils import log_say
from lerobot.utils.visualization_utils import init_rerun, log_rerun_data
from lerobot.utils.visualization_utils import init_rerun
NUM_EPISODES = 5
FPS = 30
@@ -54,9 +49,6 @@ HF_DATASET_ID = "<hf_username>/<dataset_repo_id>"
def main():
# NOTE: For production policy deployment, use `lerobot-rollout` CLI instead.
# This script provides a self-contained example for educational purposes.
# Create the robot configuration & robot
camera_config = {"front": OpenCVCameraConfig(index_or_path=0, width=640, height=480, fps=FPS)}
robot_config = SO100FollowerConfig(
@@ -151,67 +143,43 @@ def main():
raise ValueError("Robot is not connected!")
print("Starting evaluate loop...")
control_interval = 1 / FPS
episode_idx = 0
for episode_idx in range(NUM_EPISODES):
log_say(f"Running inference, recording eval episode {episode_idx + 1} of {NUM_EPISODES}")
# Inline evaluation loop: predict actions and send to robot
timestamp = 0
start_episode_t = time.perf_counter()
while timestamp < EPISODE_TIME_SEC:
start_loop_t = time.perf_counter()
if events["exit_early"]:
events["exit_early"] = False
break
# Get robot observation
obs = robot.get_observation()
obs_processed = robot_joints_to_ee_pose_processor(obs)
observation_frame = build_dataset_frame(dataset.features, obs_processed, prefix=OBS_STR)
# Predict action using the policy
action_tensor = predict_action(
observation=observation_frame,
policy=policy,
device=policy.config.device,
preprocessor=preprocessor,
postprocessor=postprocessor,
use_amp=policy.config.device.type == "cuda",
task=TASK_DESCRIPTION,
robot_type=robot.name,
)
# Convert policy output to robot action dict
action_values = make_robot_action(action_tensor, dataset.features)
# Process and send action to robot (EE -> joints via IK)
robot_action_to_send = robot_ee_to_joints_processor((action_values, obs))
robot.send_action(robot_action_to_send)
# Write to dataset
action_frame = build_dataset_frame(dataset.features, action_values, prefix=ACTION)
frame = {**observation_frame, **action_frame, "task": TASK_DESCRIPTION}
dataset.add_frame(frame)
log_rerun_data(observation=obs_processed, action=action_values)
dt_s = time.perf_counter() - start_loop_t
sleep_time_s = control_interval - dt_s
if sleep_time_s < 0:
logging.warning(
f"Evaluate loop is running slower ({1 / dt_s:.1f} Hz) than the target FPS ({FPS} Hz)."
)
precise_sleep(max(sleep_time_s, 0.0))
timestamp = time.perf_counter() - start_episode_t
# Main record loop
record_loop(
robot=robot,
events=events,
fps=FPS,
policy=policy,
preprocessor=preprocessor, # Pass the pre and post policy processors
postprocessor=postprocessor,
dataset=dataset,
control_time_s=EPISODE_TIME_SEC,
single_task=TASK_DESCRIPTION,
display_data=True,
teleop_action_processor=make_default_teleop_action_processor(),
robot_action_processor=robot_ee_to_joints_processor,
robot_observation_processor=robot_joints_to_ee_pose_processor,
)
# Reset the environment if not stopping or re-recording
if not events["stop_recording"] and (
(episode_idx < NUM_EPISODES - 1) or events["rerecord_episode"]
):
log_say("Reset the environment")
log_say("Waiting for environment reset, press right arrow key when ready...")
record_loop(
robot=robot,
events=events,
fps=FPS,
control_time_s=EPISODE_TIME_SEC,
single_task=TASK_DESCRIPTION,
display_data=True,
teleop_action_processor=make_default_teleop_action_processor(),
robot_action_processor=robot_ee_to_joints_processor,
robot_observation_processor=robot_joints_to_ee_pose_processor,
)
if events["rerecord_episode"]:
log_say("Re-record episode")
@@ -222,6 +190,7 @@ def main():
# Save episode
dataset.save_episode()
episode_idx += 1
finally:
# Clean up
log_say("Stop recording")

View File

@@ -62,20 +62,21 @@ def main():
follower = SO100Follower(follower_config)
leader = SO100Leader(leader_config)
# NOTE: It is highly recommended to use the urdf in the SO-ARM100 repo:
# https://github.com/TheRobotStudio/SO-ARM100/blob/main/Simulation/SO101/so101_new_calib.urdf
# NOTE: It is highly recommended to use the urdf in the SO-ARM100 repo: https://github.com/TheRobotStudio/SO-ARM100/blob/main/Simulation/SO101/so101_new_calib.urdf
follower_kinematics_solver = RobotKinematics(
urdf_path="./SO101/so101_new_calib.urdf",
target_frame_name="gripper_frame_link",
joint_names=list(follower.bus.motors.keys()),
)
# NOTE: It is highly recommended to use the urdf in the SO-ARM100 repo: https://github.com/TheRobotStudio/SO-ARM100/blob/main/Simulation/SO101/so101_new_calib.urdf
leader_kinematics_solver = RobotKinematics(
urdf_path="./SO101/so101_new_calib.urdf",
target_frame_name="gripper_frame_link",
joint_names=list(leader.bus.motors.keys()),
)
# Build pipeline to convert follower joints to EE observation.
# Build pipeline to convert follower joints to EE observation
follower_joints_to_ee = RobotProcessorPipeline[RobotObservation, RobotObservation](
steps=[
ForwardKinematicsJointsToEE(
@@ -86,7 +87,7 @@ def main():
to_output=transition_to_observation,
)
# Build pipeline to convert leader joints to EE action.
# Build pipeline to convert leader joints to EE action
leader_joints_to_ee = RobotProcessorPipeline[tuple[RobotAction, RobotObservation], RobotAction](
steps=[
ForwardKinematicsJointsToEE(
@@ -97,9 +98,9 @@ def main():
to_output=transition_to_robot_action,
)
# Build pipeline to convert EE action to follower joints (with safety bounds).
# Build pipeline to convert EE action to follower joints
ee_to_follower_joints = RobotProcessorPipeline[tuple[RobotAction, RobotObservation], RobotAction](
steps=[
[
EEBoundsAndSafety(
end_effector_bounds={"min": [-1.0, -1.0, -1.0], "max": [1.0, 1.0, 1.0]},
max_ee_step_m=0.10,
@@ -114,12 +115,13 @@ def main():
to_output=transition_to_robot_action,
)
# Create the dataset, deriving features from the pipelines so the on-disk schema
# matches exactly what the pipelines produce at runtime.
# Create the dataset
dataset = LeRobotDataset.create(
repo_id=HF_REPO_ID,
fps=FPS,
features=combine_feature_dicts(
# Run the feature contract of the pipelines
# This tells you how the features would look like after the pipeline steps
aggregate_pipeline_dataset_features(
pipeline=leader_joints_to_ee,
initial_features=create_initial_features(action=leader.action_features),
@@ -142,7 +144,7 @@ def main():
# Initialize the keyboard listener and rerun visualization
listener, events = init_keyboard_listener()
init_rerun(session_name="recording_so100_ee")
init_rerun(session_name="recording_phone")
try:
if not leader.is_connected or not follower.is_connected:
@@ -158,14 +160,14 @@ def main():
robot=follower,
events=events,
fps=FPS,
teleop_action_processor=leader_joints_to_ee,
robot_action_processor=ee_to_follower_joints,
robot_observation_processor=follower_joints_to_ee,
teleop=leader,
dataset=dataset,
control_time_s=EPISODE_TIME_SEC,
single_task=TASK_DESCRIPTION,
display_data=True,
teleop_action_processor=leader_joints_to_ee,
robot_action_processor=ee_to_follower_joints,
robot_observation_processor=follower_joints_to_ee,
)
# Reset the environment if not stopping or re-recording
@@ -177,13 +179,13 @@ def main():
robot=follower,
events=events,
fps=FPS,
teleop_action_processor=leader_joints_to_ee,
robot_action_processor=ee_to_follower_joints,
robot_observation_processor=follower_joints_to_ee,
teleop=leader,
control_time_s=RESET_TIME_SEC,
single_task=TASK_DESCRIPTION,
display_data=True,
teleop_action_processor=leader_joints_to_ee,
robot_action_processor=ee_to_follower_joints,
robot_observation_processor=follower_joints_to_ee,
)
if events["rerecord_episode"]:

View File

@@ -1,134 +0,0 @@
# !/usr/bin/env python
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Run a trained EE-space policy on SO100 without recording (base rollout).
Uses the rollout engine's :class:`BaseStrategy` (autonomous execution,
no dataset) with :class:`SyncInferenceConfig` (inline policy call per
control tick). The custom observation/action processors convert between
joint space (robot hardware) and end-effector space (policy I/O) via
forward/inverse kinematics.
"""
from lerobot.cameras.opencv import OpenCVCameraConfig
from lerobot.configs import PreTrainedConfig
from lerobot.model.kinematics import RobotKinematics
from lerobot.processor import (
RobotProcessorPipeline,
observation_to_transition,
robot_action_observation_to_transition,
transition_to_observation,
transition_to_robot_action,
)
from lerobot.robots.so_follower import SO100Follower, SO100FollowerConfig
from lerobot.robots.so_follower.robot_kinematic_processor import (
ForwardKinematicsJointsToEE,
InverseKinematicsEEToJoints,
)
from lerobot.rollout import BaseStrategyConfig, RolloutConfig, build_rollout_context
from lerobot.rollout.inference import SyncInferenceConfig
from lerobot.rollout.strategies import BaseStrategy
from lerobot.types import RobotAction, RobotObservation
from lerobot.utils.process import ProcessSignalHandler
from lerobot.utils.utils import init_logging
FPS = 30
DURATION_SEC = 60
TASK_DESCRIPTION = "My task description"
HF_MODEL_ID = "<hf_username>/<model_repo_id>"
def main():
init_logging()
# Robot configuration — the rollout engine will connect it inside build_rollout_context.
camera_config = {"front": OpenCVCameraConfig(index_or_path=0, width=640, height=480, fps=FPS)}
robot_config = SO100FollowerConfig(
port="/dev/tty.usbmodem5A460814411",
id="my_awesome_follower_arm",
cameras=camera_config,
use_degrees=True,
)
# Kinematic solver: we need the motor-name list, so peek at the robot once.
# (The rollout engine owns the connected instance; we only use this for introspection.)
temp_robot = SO100Follower(robot_config)
motor_names = list(temp_robot.bus.motors.keys())
# NOTE: It is highly recommended to use the urdf in the SO-ARM100 repo:
# https://github.com/TheRobotStudio/SO-ARM100/blob/main/Simulation/SO101/so101_new_calib.urdf
kinematics_solver = RobotKinematics(
urdf_path="./SO101/so101_new_calib.urdf",
target_frame_name="gripper_frame_link",
joint_names=motor_names,
)
# Joint-space observation → EE-space observation (consumed by the policy).
robot_joints_to_ee_pose_processor = RobotProcessorPipeline[RobotObservation, RobotObservation](
steps=[ForwardKinematicsJointsToEE(kinematics=kinematics_solver, motor_names=motor_names)],
to_transition=observation_to_transition,
to_output=transition_to_observation,
)
# EE-space action (produced by the policy) → joint-space action (sent to robot).
robot_ee_to_joints_processor = RobotProcessorPipeline[tuple[RobotAction, RobotObservation], RobotAction](
steps=[
InverseKinematicsEEToJoints(
kinematics=kinematics_solver,
motor_names=motor_names,
initial_guess_current_joints=True,
),
],
to_transition=robot_action_observation_to_transition,
to_output=transition_to_robot_action,
)
# Policy config (full model is loaded inside build_rollout_context).
policy_config = PreTrainedConfig.from_pretrained(HF_MODEL_ID)
policy_config.pretrained_path = HF_MODEL_ID
cfg = RolloutConfig(
robot=robot_config,
policy=policy_config,
strategy=BaseStrategyConfig(),
inference=SyncInferenceConfig(),
fps=FPS,
duration=DURATION_SEC,
task=TASK_DESCRIPTION,
)
signal_handler = ProcessSignalHandler(use_threads=True)
# Pass the EE kinematic processors via kwargs; the defaults (identity) would
# otherwise skip the joint↔EE conversion and the policy would receive the
# wrong observation/action space.
ctx = build_rollout_context(
cfg,
signal_handler.shutdown_event,
robot_action_processor=robot_ee_to_joints_processor,
robot_observation_processor=robot_joints_to_ee_pose_processor,
)
strategy = BaseStrategy(cfg.strategy)
try:
strategy.setup(ctx)
strategy.run(ctx)
finally:
strategy.teardown(ctx)
if __name__ == "__main__":
main()

View File

@@ -10,7 +10,7 @@ from lerobot.datasets import LeRobotDataset
from lerobot.envs.configs import HILSerlProcessorConfig, HILSerlRobotEnvConfig
from lerobot.policies import SACConfig
from lerobot.policies.sac.modeling_sac import SACPolicy
from lerobot.rewards.classifier.modeling_classifier import Classifier
from lerobot.policies.sac.reward_model.modeling_classifier import Classifier
from lerobot.rl.buffer import ReplayBuffer
from lerobot.rl.gym_manipulator import make_robot_env
from lerobot.robots.so_follower import SO100FollowerConfig

View File

@@ -1,7 +1,7 @@
import torch
from lerobot.datasets import LeRobotDataset
from lerobot.rewards import RewardClassifierConfig, make_reward_model, make_reward_pre_post_processors
from lerobot.policies import RewardClassifierConfig, make_policy, make_pre_post_processors
def main():
@@ -22,10 +22,10 @@ def main():
model_name="microsoft/resnet-18",
)
# Make reward model, preprocessor, and optimizer
reward_model = make_reward_model(config, dataset_stats=dataset.meta.stats)
optimizer = config.get_optimizer_preset().build(reward_model.parameters())
preprocessor, _ = make_reward_pre_post_processors(config, dataset_stats=dataset.meta.stats)
# Make policy, preprocessor, and optimizer
policy = make_policy(config, ds_meta=dataset.meta)
optimizer = config.get_optimizer_preset().build(policy.parameters())
preprocessor, _ = make_pre_post_processors(policy_cfg=config, dataset_stats=dataset.meta.stats)
classifier_id = "<user>/reward_classifier_hil_serl_example"
@@ -42,7 +42,7 @@ def main():
batch = preprocessor(batch)
# Forward pass
loss, output_dict = reward_model.forward(batch)
loss, output_dict = policy.forward(batch)
# Backward pass and optimization
optimizer.zero_grad()
@@ -58,8 +58,8 @@ def main():
print("Training finished!")
# You can now save the trained reward model.
reward_model.push_to_hub(classifier_id)
# You can now save the trained policy.
policy.push_to_hub(classifier_id)
if __name__ == "__main__":

View File

@@ -95,7 +95,7 @@ dependencies = [
# ── Feature-scoped extras ──────────────────────────────────
dataset = [
"datasets>=4.7.0,<5.0.0",
"datasets>=4.0.0,<5.0.0",
"pandas>=2.0.0,<3.0.0", # NOTE: Transitive dependency of datasets
"pyarrow>=21.0.0,<30.0.0", # NOTE: Transitive dependency of datasets
"lerobot[av-dep]",
@@ -108,9 +108,9 @@ training = [
"wandb>=0.24.0,<0.25.0",
]
hardware = [
"lerobot[pynput-dep]",
"lerobot[pyserial-dep]",
"lerobot[deepdiff-dep]",
"pynput>=1.7.8,<1.9.0",
"pyserial>=3.5,<4.0",
"deepdiff>=7.0.1,<9.0.0",
]
viz = [
"rerun-sdk>=0.24.0,<0.27.0",
@@ -128,7 +128,7 @@ dataset_viz = ["lerobot[dataset]", "lerobot[viz]"]
av-dep = ["av>=15.0.0,<16.0.0"]
pygame-dep = ["pygame>=2.5.1,<2.7.0"]
placo-dep = ["placo>=0.9.6,<0.9.17"]
transformers-dep = ["transformers>=5.4.0,<5.6.0"]
transformers-dep = ["transformers==5.3.0"] # TODO(Steven): https://github.com/huggingface/lerobot/pull/3249
grpcio-dep = ["grpcio==1.73.1", "protobuf>=6.31.1,<6.32.0"]
can-dep = ["python-can>=4.2.0,<5.0.0"]
peft-dep = ["peft>=0.18.0,<1.0.0"]
@@ -136,14 +136,10 @@ scipy-dep = ["scipy>=1.14.0,<2.0.0"]
diffusers-dep = ["diffusers>=0.27.2,<0.36.0"]
qwen-vl-utils-dep = ["qwen-vl-utils>=0.0.11,<0.1.0"]
matplotlib-dep = ["matplotlib>=3.10.3,<4.0.0", "contourpy>=1.3.0,<2.0.0"] # NOTE: Explicitly listing contourpy helps the resolver converge faster.
pyserial-dep = ["pyserial>=3.5,<4.0"]
deepdiff-dep = ["deepdiff>=7.0.1,<9.0.0"]
pynput-dep = ["pynput>=1.7.8,<1.9.0"]
pyzmq-dep = ["pyzmq>=26.2.1,<28.0.0"]
# Motors
feetech = ["feetech-servo-sdk>=1.0.0,<2.0.0", "lerobot[pyserial-dep]", "lerobot[deepdiff-dep]"]
dynamixel = ["dynamixel-sdk>=3.7.31,<3.9.0", "lerobot[pyserial-dep]", "lerobot[deepdiff-dep]"]
feetech = ["feetech-servo-sdk>=1.0.0,<2.0.0"]
dynamixel = ["dynamixel-sdk>=3.7.31,<3.9.0"]
damiao = ["lerobot[can-dep]"]
robstride = ["lerobot[can-dep]"]
@@ -151,11 +147,10 @@ robstride = ["lerobot[can-dep]"]
openarms = ["lerobot[damiao]"]
gamepad = ["lerobot[pygame-dep]", "hidapi>=0.14.0,<0.15.0"]
hopejr = ["lerobot[feetech]", "lerobot[pygame-dep]"]
lekiwi = ["lerobot[feetech]", "lerobot[pyzmq-dep]"]
lekiwi = ["lerobot[feetech]", "pyzmq>=26.2.1,<28.0.0"]
unitree_g1 = [
# "unitree-sdk2==1.0.1",
"lerobot[pyzmq-dep]",
"lerobot[pyserial-dep]",
"pyzmq>=26.2.1,<28.0.0",
"onnxruntime>=1.16.0,<2.0.0",
"onnx>=1.16.0,<2.0.0",
"meshcat>=0.3.0,<0.4.0",
@@ -194,28 +189,14 @@ groot = [
]
sarm = ["lerobot[transformers-dep]", "pydantic>=2.0.0,<3.0.0", "faker>=33.0.0,<35.0.0", "lerobot[matplotlib-dep]", "lerobot[qwen-vl-utils-dep]"]
xvla = ["lerobot[transformers-dep]"]
eo1 = ["lerobot[transformers-dep]", "lerobot[qwen-vl-utils-dep]"]
hilserl = ["lerobot[transformers-dep]", "gym-hil>=0.1.13,<0.2.0", "lerobot[grpcio-dep]", "lerobot[placo-dep]"]
# Features
async = ["lerobot[grpcio-dep]", "lerobot[matplotlib-dep]"]
peft = ["lerobot[transformers-dep]", "lerobot[peft-dep]"]
# Annotation pipeline (lerobot-annotate). vllm is the preferred backend
# on Linux, with a transformers fallback elsewhere; openai is the default
# backend and talks to any OpenAI-compatible server (``vllm serve`` /
# ``transformers serve`` / hosted endpoints). Distributed execution is
# delegated to Hugging Face Jobs (see examples/annotations/run_hf_job.py).
annotations = [
"lerobot[dataset]",
"lerobot[transformers-dep]",
"openai>=1.40,<2.0",
"vllm>=0.6.0,<1.0.0; sys_platform == 'linux'",
]
# Development
dev = ["pre-commit>=3.7.0,<5.0.0", "debugpy>=1.8.1,<1.9.0", "lerobot[grpcio-dep]", "grpcio-tools==1.73.1", "mypy>=1.19.1", "ruff>=0.14.1", "lerobot[notebook]"]
notebook = ["jupyter>=1.0.0,<2.0.0", "ipykernel>=6.0.0,<7.0.0"]
dev = ["pre-commit>=3.7.0,<5.0.0", "debugpy>=1.8.1,<1.9.0", "lerobot[grpcio-dep]", "grpcio-tools==1.73.1", "mypy>=1.19.1", "ruff>=0.14.1"]
test = ["pytest>=8.1.0,<9.0.0", "pytest-timeout>=2.4.0,<3.0.0", "pytest-cov>=5.0.0,<8.0.0", "mock-serial>=0.0.1,<0.1.0 ; sys_platform != 'win32'"]
video_benchmark = ["scikit-image>=0.23.2,<0.26.0", "pandas>=2.2.2,<2.4.0"]
@@ -225,20 +206,6 @@ aloha = ["lerobot[dataset]", "gym-aloha>=0.1.2,<0.2.0", "lerobot[scipy-dep]"]
pusht = ["lerobot[dataset]", "gym-pusht>=0.1.5,<0.2.0", "pymunk>=6.6.0,<7.0.0"] # TODO: Fix pymunk version in gym-pusht instead
libero = ["lerobot[dataset]", "lerobot[transformers-dep]", "hf-libero>=0.1.3,<0.2.0; sys_platform == 'linux'", "lerobot[scipy-dep]"]
metaworld = ["lerobot[dataset]", "metaworld==3.0.0", "lerobot[scipy-dep]"]
# NOTE: vlabench is NOT exposed as a `lerobot` extra. Its only distribution
# is the OpenMOSS/VLABench GitHub repo (package name `VLABench`, no PyPI
# release), so any `vlabench>=X` pip spec is unresolvable. Install it
# manually alongside MuJoCo / dm-control — see docs/source/vlabench.mdx
# for the recipe.
# NOTE: robomme is NOT a pyproject extra — mani-skill hard-pins numpy<2
# which conflicts with lerobot's numpy>=2 base pin, so the two trees can't
# resolve into a single env. Install it only in the RoboMME Docker image
# via `uv pip install --override` (see docker/Dockerfile.benchmark.robomme).
# NOTE: robocasa is NOT exposed as a `lerobot` extra. Its setup.py pins
# `lerobot==0.3.3` in install_requires, which cyclically shadows our own
# workspace `lerobot` and makes the graph unsolvable under any resolver
# (uv, pip). Install it manually alongside robosuite — see
# docs/source/robocasa.mdx for the recipe.
# All
all = [
@@ -302,12 +269,10 @@ lerobot-find-joint-limits="lerobot.scripts.lerobot_find_joint_limits:main"
lerobot-imgtransform-viz="lerobot.scripts.lerobot_imgtransform_viz:main"
lerobot-edit-dataset="lerobot.scripts.lerobot_edit_dataset:main"
lerobot-setup-can="lerobot.scripts.lerobot_setup_can:main"
lerobot-annotate="lerobot.scripts.lerobot_annotate:main"
lerobot-rollout="lerobot.scripts.lerobot_rollout:main"
# ---------------- Tool Configurations ----------------
[tool.setuptools.package-data]
lerobot = ["envs/*.json", "annotations/steerable_pipeline/prompts/*.txt"]
lerobot = ["envs/*.json"]
[tool.setuptools.packages.find]
where = ["src"]

View File

@@ -1,207 +0,0 @@
#!/usr/bin/env python3
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Extract natural-language task descriptions for a benchmark suite.
Runs inside the benchmark Docker container (where the env library is installed)
immediately after lerobot-eval, writing a JSON file that parse_eval_metrics.py
picks up and embeds in metrics.json.
Output format: {"<suite>_<task_idx>": "<nl instruction>", ...}
Usage:
python scripts/ci/extract_task_descriptions.py \\
--env libero --task libero_spatial \\
--output /tmp/eval-artifacts/task_descriptions.json
"""
from __future__ import annotations
import argparse
import json
import re
import sys
from pathlib import Path
# LIBERO-plus derives task.language by space-joining the perturbation-variant
# filename (grab_language_from_filename in libero/libero/benchmark/__init__.py),
# so non-_language_ variants inherit a trailing metadata blob like
# "view 0 0 100 0 0 initstate 0 noise 45" or "add 16". Strip those tokens so
# the description matches the base instruction used in the training dataset.
_LIBERO_PERTURBATION_TAIL_RE = re.compile(
r"(?:\s(?:view|initstate|noise|add|tb|table|light|level)(?:\s\d+)+)+$"
)
def _strip_libero_perturbation_tail(instruction: str) -> str:
return _LIBERO_PERTURBATION_TAIL_RE.sub("", instruction).strip()
def _libero_descriptions(task_suite: str) -> dict[str, str]:
from libero.libero import benchmark # type: ignore[import-untyped]
suite_dict = benchmark.get_benchmark_dict()
if task_suite not in suite_dict:
print(
f"[extract_task_descriptions] Unknown LIBERO suite '{task_suite}'. "
f"Available: {list(suite_dict.keys())}",
file=sys.stderr,
)
return {}
suite = suite_dict[task_suite]()
return {
f"{task_suite}_{i}": _strip_libero_perturbation_tail(suite.get_task(i).language)
for i in range(suite.n_tasks)
}
def _metaworld_descriptions(task_name: str) -> dict[str, str]:
# MetaWorld tasks don't expose a separate NL description attribute;
# use a cleaned version of the task name as the description.
label = task_name.removeprefix("metaworld-").replace("-", " ").strip()
return {f"{task_name}_0": label}
def _robotwin_descriptions(task_names: str) -> dict[str, str]:
"""Return descriptions for each requested RoboTwin task. Reads
`description/task_instruction/<task>.json` from the RoboTwin clone
(cwd is /opt/robotwin in CI). Falls back to the task name if missing."""
out: dict[str, str] = {}
root = Path("description/task_instruction")
for name in (t.strip() for t in task_names.split(",") if t.strip()):
desc_file = root / f"{name}.json"
desc = name.replace("_", " ")
if desc_file.is_file():
data = json.loads(desc_file.read_text())
full = data.get("full_description") or desc
# Strip the schema placeholders ({A}, {a}) — keep the sentence readable.
desc = full.replace("<", "").replace(">", "")
out[f"{name}_0"] = desc
return out
def _robocasa_descriptions(task_spec: str) -> dict[str, str]:
"""For each task in the comma-separated list, emit a cleaned-name label.
RoboCasa episodes carry their language instruction in the env's
`ep_meta['lang']`, populated per reset. Pulling it requires spinning
up the full kitchen env per task (~seconds each); we use the task
name as the key here and let the eval's episode info carry the
actual instruction.
"""
out: dict[str, str] = {}
for task in (t.strip() for t in task_spec.split(",") if t.strip()):
# Split CamelCase into words: "CloseFridge" → "close fridge".
label = "".join(f" {c.lower()}" if c.isupper() else c for c in task).strip()
out[f"{task}_0"] = label or task
return out
_ROBOMME_DESCRIPTIONS = {
"BinFill": "Fill the target bin with the correct number of cubes",
"PickXtimes": "Pick the indicated cube the specified number of times",
"SwingXtimes": "Swing the object the specified number of times",
"StopCube": "Grasp and stop the moving cube",
"VideoUnmask": "Pick the cube shown in the reference video",
"VideoUnmaskSwap": "Pick the cube matching the reference video after a swap",
"ButtonUnmask": "Press the button indicated by the reference",
"ButtonUnmaskSwap": "Press the correct button after objects are swapped",
"PickHighlight": "Pick the highlighted cube",
"VideoRepick": "Repick the cube shown in the reference video",
"VideoPlaceButton": "Place the cube on the button shown in the video",
"VideoPlaceOrder": "Place cubes in the order shown in the video",
"MoveCube": "Move the cube to the target location",
"InsertPeg": "Insert the peg into the target hole",
"PatternLock": "Unlock the pattern by pressing buttons in sequence",
"RouteStick": "Route the stick through the required waypoints",
}
def _robomme_descriptions(task_names: str, task_ids: list[int] | None = None) -> dict[str, str]:
"""Return descriptions for each requested RoboMME task. Keys match the
video filename pattern `<task>_<task_id>` used by the eval script."""
if task_ids is None:
task_ids = [0]
out: dict[str, str] = {}
for name in (t.strip() for t in task_names.split(",") if t.strip()):
desc = _ROBOMME_DESCRIPTIONS.get(name, name)
for tid in task_ids:
out[f"{name}_{tid}"] = desc
return out
def _vlabench_descriptions(task_spec: str) -> dict[str, str]:
"""For each task in the comma-separated list, emit a cleaned-name label.
VLABench tasks carry language instructions on their dm_control task
object, but pulling them requires loading the full env per task
(~seconds each). The CI smoke-eval already captures the instruction
inside its episode info; this mapping is just enough to key
`metrics.json` by `<task>_0`.
"""
out: dict[str, str] = {}
for task in (t.strip() for t in task_spec.split(",") if t.strip()):
out[f"{task}_0"] = task.replace("_", " ").strip()
return out
def main() -> int:
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument("--env", required=True, help="Environment family (libero, metaworld, ...)")
parser.add_argument("--task", required=True, help="Task/suite name (e.g. libero_spatial)")
parser.add_argument(
"--task-ids",
type=str,
default=None,
help="Comma-separated task IDs (e.g. '0,1,2'). Default: [0]",
)
parser.add_argument("--output", required=True, help="Path to write task_descriptions.json")
args = parser.parse_args()
task_ids: list[int] | None = None
if args.task_ids:
task_ids = [int(x.strip()) for x in args.task_ids.split(",")]
descriptions: dict[str, str] = {}
try:
if args.env == ("libero", "libero_plus"):
descriptions = _libero_descriptions(args.task)
elif args.env == "metaworld":
descriptions = _metaworld_descriptions(args.task)
elif args.env == "robotwin":
descriptions = _robotwin_descriptions(args.task)
elif args.env == "robocasa":
descriptions = _robocasa_descriptions(args.task)
elif args.env == "robomme":
descriptions = _robomme_descriptions(args.task, task_ids=task_ids)
elif args.env == "vlabench":
descriptions = _vlabench_descriptions(args.task)
else:
print(
f"[extract_task_descriptions] No description extractor for env '{args.env}'.",
file=sys.stderr,
)
except Exception as exc:
print(f"[extract_task_descriptions] Warning: {exc}", file=sys.stderr)
out_path = Path(args.output)
out_path.parent.mkdir(parents=True, exist_ok=True)
out_path.write_text(json.dumps(descriptions, indent=2))
print(f"[extract_task_descriptions] {len(descriptions)} descriptions → {out_path}")
return 0
if __name__ == "__main__":
sys.exit(main())

View File

@@ -1,147 +0,0 @@
#!/usr/bin/env python3
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Parse lerobot-eval output into a small metrics.json artifact.
Reads eval_info.json written by lerobot-eval --output_dir and extracts the
key metrics needed by the health dashboard. Handles both single-task and
multi-task eval output formats.
NOTE: This script runs on the bare CI runner (not inside Docker), so it
must use only Python stdlib modules. Do not add third-party imports.
Usage:
python scripts/ci/parse_eval_metrics.py \\
--artifacts-dir /tmp/libero-artifacts \\
--env libero \\
--task libero_spatial \\
--policy pepijn223/smolvla_libero
Writes <artifacts-dir>/metrics.json. The CI workflow then uploads this file
as a GitHub Actions artifact named "<env>-metrics".
"""
from __future__ import annotations
import argparse
import json
import math
import sys
from pathlib import Path
def _safe_float(v: float | int | None) -> float | None:
if v is None:
return None
f = float(v)
return None if math.isnan(f) else f
def _safe_int(v: float | int | None) -> int | None:
if v is None:
return None
f = float(v)
return None if math.isnan(f) else int(f)
def _extract_metrics(info: dict) -> tuple[float | None, int | None, float | None, float | None]:
"""Extract (pc_success, n_episodes, avg_sum_reward, eval_s) from eval_info.json.
Handles two output shapes:
- Single-task: {"aggregated": {"pc_success": 80.0, ...}}
- Multi-task: {"overall": {"pc_success": 80.0, "n_episodes": 5, ...}}
"""
for key in ("aggregated", "overall"):
if key not in info:
continue
agg = info[key]
pc = agg.get("pc_success")
n = agg.get("n_episodes")
reward = agg.get("avg_sum_reward")
eval_s = agg.get("eval_s")
if pc is not None and not math.isnan(pc):
return (
float(pc),
_safe_int(n),
_safe_float(reward),
_safe_float(eval_s),
)
return None, None, None, None
def main() -> int:
parser = argparse.ArgumentParser(
description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter
)
parser.add_argument("--artifacts-dir", required=True, help="Path to the mounted artifacts volume")
parser.add_argument("--env", required=True, help="Environment name (e.g. libero)")
parser.add_argument("--task", required=True, help="Task name (e.g. libero_spatial)")
parser.add_argument("--policy", required=True, help="Policy hub path (e.g. pepijn223/smolvla_libero)")
args = parser.parse_args()
artifacts_dir = Path(args.artifacts_dir)
eval_info_path = artifacts_dir / "eval_info.json"
pc_success: float | None = None
n_episodes: int | None = None
avg_sum_reward: float | None = None
eval_s: float | None = None
if eval_info_path.exists():
try:
info = json.loads(eval_info_path.read_text())
pc_success, n_episodes, avg_sum_reward, eval_s = _extract_metrics(info)
except (json.JSONDecodeError, KeyError, TypeError) as exc:
print(f"[parse_eval_metrics] Warning: could not parse eval_info.json: {exc}", file=sys.stderr)
else:
print(
f"[parse_eval_metrics] Warning: {eval_info_path} not found — eval may have failed.",
file=sys.stderr,
)
task_descriptions: dict[str, str] = {}
task_desc_path = artifacts_dir / "task_descriptions.json"
if task_desc_path.exists():
try:
task_descriptions = json.loads(task_desc_path.read_text())
except json.JSONDecodeError as exc:
print(
f"[parse_eval_metrics] Warning: could not parse task_descriptions.json: {exc}",
file=sys.stderr,
)
metrics = {
"env": args.env,
"task": args.task,
"policy": args.policy,
"pc_success": pc_success,
"n_episodes": n_episodes,
"avg_sum_reward": avg_sum_reward,
"eval_s": eval_s,
"task_descriptions": task_descriptions,
}
out_path = artifacts_dir / "metrics.json"
out_path.write_text(json.dumps(metrics, indent=2))
print(f"[parse_eval_metrics] Written: {out_path}")
print(json.dumps(metrics, indent=2))
return 0
if __name__ == "__main__":
sys.exit(main())

View File

@@ -1,15 +0,0 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

View File

@@ -1,50 +0,0 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Steerable annotation pipeline producing ``language_persistent`` and
``language_events`` columns for LeRobot datasets.
The pipeline is decomposed into three independently runnable modules whose
outputs are staged per-episode before a final parquet rewrite:
- :mod:`.modules.plan_subtasks_memory` (the ``plan`` module) — persistent styles
- :mod:`.modules.interjections_and_speech` (the ``interjections`` module) — event styles + speech
- :mod:`.modules.general_vqa` (the ``vqa`` module) — event-style VQA pairs
"""
from .config import AnnotationPipelineConfig
from .validator import StagingValidator, ValidationReport
from .vocabulary import (
VOCABULARY_FILENAME,
Vocabulary,
VocabularyDiscoveryModule,
load_vocabulary,
save_vocabulary,
vocabulary_path,
)
from .writer import LanguageColumnsWriter
__all__ = [
"VOCABULARY_FILENAME",
"AnnotationPipelineConfig",
"LanguageColumnsWriter",
"StagingValidator",
"ValidationReport",
"Vocabulary",
"VocabularyDiscoveryModule",
"load_vocabulary",
"save_vocabulary",
"vocabulary_path",
]

View File

@@ -1,251 +0,0 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import annotations
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
@dataclass
class VocabularyConfig:
"""Phase 0 — dataset-level canonical vocabulary discovery.
Watches the first ``sample_episodes`` episode videos and asks the VLM
to derive a small canonical vocabulary (subtask labels + memory
milestones) that every episode in the dataset will reuse. The VLM
decides the count itself from what it sees in the clips — short
pick-and-place demos get ~6 labels, longer multi-step recipes more.
The output lands at ``meta/canonical_vocabulary.json`` and feeds
phase 1's subtask + memory generation as both a prompt-side
constraint and a post-VLM validation gate.
Why this exists: free-form LLM rephrasing per episode produces near-
unique subtask strings, which makes the downstream low-level policy's
conditioning effectively noise — at inference the policy generates a
*new* paraphrase the action expert has never seen and produces tiny
cautious actions. Forcing every episode onto the same small set of
canonical strings gives the action expert dense supervision per
string and a small target distribution to learn against.
Set ``enabled=False`` to fall back to free-form generation (original
behaviour). ``reuse_existing=True`` keeps a hand-edited vocabulary
file from being clobbered on re-runs.
"""
enabled: bool = True
sample_episodes: int = 3
max_video_frames_per_episode: int = 32
# When True (default), an existing meta/canonical_vocabulary.json is
# loaded as-is and no VLM call is made — lets operators hand-edit the
# file. Set False to always rediscover from the sample episodes.
reuse_existing: bool = True
@dataclass
class PlanConfig:
"""``plan`` module: plan + subtasks + memory + task augmentation.
The ``plan`` module attaches the whole episode as one Qwen-VL video
block; ``max_video_frames`` only caps the frames packed in (a
model-capacity bound, not an annotation-logic knob).
"""
enabled: bool = True
# Number of ``task_aug`` rephrasings emitted at ``t=0``. The renderer's
# ``${task}`` binding rotates among them per ``sample_idx``. ``0`` disables.
n_task_rephrasings: int = 10
# When to derive the task from the video instead of using
# ``record.episode_task``: ``off``, ``if_short`` (short / placeholder /
# missing canonical task), or ``always``. The derived task replaces the
# canonical one for every ``plan``-module prompt; ``meta/tasks.parquet``
# is never modified.
derive_task_from_video: str = "if_short"
derive_task_min_words: int = 3
# Frame sampling for the subtask-decomposition prompt.
frames_per_second: float = 1.0
max_video_frames: int = 128
min_subtask_seconds: float = 1.5
plan_max_steps: int = 8
# When True (and backend supports it, e.g. ``openai``), the ``plan``
# module sends a ``video_url`` block pointing at a per-episode mp4
# subclip and lets the server sample frames at ``use_video_url_fps``.
use_video_url: bool = False
use_video_url_fps: float = 1.0
@dataclass
class InterjectionsConfig:
"""``interjections`` module: interjections + paired speech."""
enabled: bool = True
# Each interjection emits a paired ``(interjection, speech)`` event row
# and triggers a ``plan`` refresh at the same timestamp via the
# ``plan`` module.
max_interjections_per_episode: int = 3
interjection_min_t: float = 2.0
# Visual context attached to the interjection prompt: a short window
# of frames centered on the chosen timestamp so the VLM sees the
# ongoing motion rather than a single frozen frame.
interjection_window_seconds: float = 2.0
interjection_window_frames: int = 4
@dataclass
class VqaConfig:
"""``vqa`` module: general VQA."""
enabled: bool = True
vqa_emission_hz: float = 1.0
K: int = 1
"""How many *consecutive* frames each emission tick anchors a VQA pair
to. The VLM grounds its answer (bbox / keypoint coordinates, count, …)
against the *first* anchored frame's image, so anchoring K>1 frames
copies that same answer onto later frames where the scene has already
moved — stale labels. Default ``1``: a VQA pair lands on exactly its
emission frame, no temporal smear. Raise it only to trade label
precision for more (noisier) VQA frames."""
question_types: tuple[str, ...] = ("bbox", "keypoint", "count", "attribute", "spatial")
@dataclass
class VlmConfig:
"""Shared Qwen-VL client configuration."""
# One of ``vllm``, ``transformers``, ``openai``, or ``stub`` (tests).
# ``openai`` talks to a local OpenAI-compatible server; the CLI
# auto-spawns one when ``auto_serve=True``.
backend: str = "openai"
model_id: str = "Qwen/Qwen3.6-35B-A3B-FP8"
# OpenAI-compatible server endpoint; ``EMPTY`` works for local servers.
api_base: str = "http://localhost:8000/v1"
api_key: str = "EMPTY"
# When True with ``backend=openai``, the CLI probes ``api_base`` and
# spawns a server if none answers (default: ``transformers serve``).
# Set to False to fail fast when pointing at a remote endpoint.
auto_serve: bool = True
serve_port: int = 8000
# Override the auto-serve command. ``{port}`` is substituted per replica
# when ``parallel_servers > 1``.
serve_command: str | None = None
# Run multiple independent inference servers for round-robin client
# routing (each pinned to a GPU via ``CUDA_VISIBLE_DEVICES`` and bound
# to ``serve_port + i``). ``num_gpus=0`` means one GPU per replica.
parallel_servers: int = 1
num_gpus: int = 0
client_concurrency: int = 16
serve_ready_timeout_s: float = 600.0
max_new_tokens: int = 512
temperature: float = 0.2
json_mode: bool = True
batch_size: int = 4
tensor_parallel_size: int = 1
# Fraction of GPU memory vllm allocates for weights + KV cache.
gpu_memory_utilization: float = 0.9
# Cap context length (None = model default). On 80 GB H100 a 30B BF16
# model often needs <= 8192 to leave KV-cache headroom.
max_model_len: int | None = None
trust_remote_code: bool = False
# Override the camera stream used for keyframe attachment. None picks
# the first ``observation.images.*`` key the dataset declares.
camera_key: str | None = None
# Forwarded as ``extra_body.chat_template_kwargs`` on every chat call;
# use to pass model-specific flags such as ``{"enable_thinking": false}``.
chat_template_kwargs: dict[str, Any] | None = None
@dataclass
class ExecutorConfig:
"""Executor settings.
Distributed execution is provided by Hugging Face Jobs (see
``examples/annotation/run_hf_job.py``); this config only controls
intra-process episode concurrency.
"""
# Episodes processed concurrently within each module phase. Each
# in-flight episode dispatches 3-5 dependent VLM calls, so this is the
# main knob for saturating ``parallel_servers`` and ``client_concurrency``.
episode_parallelism: int = 16
@dataclass
class AnnotationPipelineConfig:
"""Top-level config for ``lerobot-annotate``.
The writer rewrites ``data/chunk-*/file-*.parquet`` in place. Multiple
revisions of the same dataset live in separate copies.
"""
# Hub dataset id. Used as the download source when ``root`` is unset,
# and as the destination repo when ``push_to_hub`` is enabled and
# ``dest_repo_id`` is unset.
repo_id: str | None = None
# Optional separate Hub dataset id to push the annotated result to. When
# unset, ``push_to_hub`` uploads back to ``repo_id`` (annotate in place);
# when set, the source ``repo_id`` is left untouched.
dest_repo_id: str | None = None
root: Path | None = None
# Defaults to ``<root>/.annotate_staging/`` when unset.
staging_dir: Path | None = None
seed: int = 1729
vocabulary: VocabularyConfig = field(default_factory=VocabularyConfig)
plan: PlanConfig = field(default_factory=PlanConfig)
interjections: InterjectionsConfig = field(default_factory=InterjectionsConfig)
vqa: VqaConfig = field(default_factory=VqaConfig)
vlm: VlmConfig = field(default_factory=VlmConfig)
executor: ExecutorConfig = field(default_factory=ExecutorConfig)
skip_validation: bool = False
only_episodes: tuple[int, ...] | None = None
# Keyframe decode backend. When unset, the pipeline decodes with the
# ffmpeg CLI: it decodes AV1 and runs each decode as an isolated child
# process, which is both crash-safe and safe under the concurrent
# decode the executor performs (torchcodec is not thread-safe and
# SIGSEGVs there). Set to ``"torchcodec"`` or ``"pyav"`` to pin an
# in-process decoder when its build is known thread-safe.
video_backend: str | None = None
# When True, upload the annotated dataset to the Hugging Face Hub:
# to ``dest_repo_id`` if set, otherwise back to ``repo_id``. One of
# the two must be set for this to take effect.
push_to_hub: bool = False
push_private: bool = False
push_commit_message: str | None = None
def resolved_staging_dir(self, root: Path) -> Path:
return self.staging_dir if self.staging_dir is not None else root / ".annotate_staging"

View File

@@ -1,322 +0,0 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""In-process executor that runs the annotation phases.
The executor plans **seven phases** in the dependency order from the plan:
phase 0: vocabulary discovery — derive a small canonical vocabulary
from the first few sample-episode videos (subtask labels +
memory milestones) and persist it next to the dataset; the
``plan`` module then constrains every per-episode generation
to those strings, so the downstream policy sees a small,
repeatable conditioning distribution
phase 1: ``plan`` module (plan + subtasks + memory)
phase 2: ``interjections`` module (interjections + speech)
phase 3: ``plan`` plan-update pass — re-runs plan emission at every
interjection timestamp produced by phase 2
phase 4: ``vqa`` module (VQA)
phase 5: validator
phase 6: writer
Phase 3 is why the ``plan`` module must be re-entered after the
``interjections`` module — to refresh ``plan`` rows at interjection
timestamps.
Distributed execution is provided by Hugging Face Jobs (see
``examples/annotations/run_hf_job.py``); the runner inside the job
invokes ``lerobot-annotate`` which uses this in-process executor.
Episode-level concurrency is controlled by
``ExecutorConfig.episode_parallelism``.
"""
from __future__ import annotations
import logging
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass
from pathlib import Path
from typing import Any
from .config import AnnotationPipelineConfig
from .reader import EpisodeRecord, iter_episodes
from .staging import EpisodeStaging
from .validator import StagingValidator
from .writer import LanguageColumnsWriter
logger = logging.getLogger(__name__)
@dataclass
class PhaseResult:
"""Summary of one pipeline phase across all episodes."""
name: str
episodes_processed: int
episodes_skipped: int
@dataclass
class PipelineRunSummary:
"""Aggregated result returned by :meth:`Executor.run`."""
phases: list[PhaseResult]
written_paths: list[Path]
validation_report: Any # ValidationReport, kept Any to avoid import cycle
@dataclass
class Executor:
"""Run all six phases over a dataset root in-process.
Episode-level concurrency comes from ``ExecutorConfig.episode_parallelism``
(a thread pool); cluster-level concurrency comes from running this
executor inside a Hugging Face Job. Tests construct the executor
directly with stub modules.
"""
config: AnnotationPipelineConfig
plan: Any # PlanSubtasksMemoryModule
interjections: Any # InterjectionsAndSpeechModule
vqa: Any # GeneralVqaModule
writer: LanguageColumnsWriter
validator: StagingValidator
vocabulary: Any = None # VocabularyDiscoveryModule | None
def run(self, root: Path) -> PipelineRunSummary:
records = list(iter_episodes(root, only_episodes=self.config.only_episodes))
n = len(records)
if n == 0:
raise ValueError(f"No episodes found under {root}/data/")
print(f"[annotate] {n} episodes total", flush=True)
staging_dir = self.config.resolved_staging_dir(root)
staging_dir.mkdir(parents=True, exist_ok=True)
phases: list[PhaseResult] = []
# Phase 0: vocabulary discovery. Mutates ``self.plan.vocabulary``
# so subsequent per-episode plan calls see the canonical labels.
phases.append(self._run_vocabulary_phase(records, root))
# Phase 1: ``plan`` module (plan + subtasks + memory)
phases.append(self._run_module_phase("plan", records, staging_dir, self.plan))
# Phase 2: ``interjections`` module (interjections + speech). It
# reads the ``plan`` module's subtask rows from the same staging
# tree to ground the interjection prompt in the correct local subtask.
phases.append(self._run_module_phase("interjections", records, staging_dir, self.interjections))
# Phase 3: ``plan`` plan-update pass at interjection timestamps.
phases.append(self._run_plan_update_phase(records, staging_dir))
# Phase 4: ``vqa`` module (VQA)
phases.append(self._run_module_phase("vqa", records, staging_dir, self.vqa))
print("[annotate] running validator...", flush=True)
report = self.validator.validate(records, staging_dir)
if not report.ok and not self.config.skip_validation:
raise RuntimeError(f"Staging validation failed: {report.summary()}")
print(f"[annotate] validator: {report.summary()}", flush=True)
print(f"[annotate] writing parquet shards into {root}/data/...", flush=True)
written = self.writer.write_all(records, staging_dir, root)
print(f"[annotate] wrote {len(written)} shard(s); pipeline complete", flush=True)
# Keep meta/info.json aligned with the parquet schema we just wrote.
# Idempotent and additive: existing user metadata is preserved.
self._ensure_annotation_metadata_in_info(root)
return PipelineRunSummary(phases=phases, written_paths=written, validation_report=report)
@staticmethod
def _ensure_annotation_metadata_in_info(root: Path) -> None:
"""Write language features and canonical tools to ``meta/info.json``.
``LanguageColumnsWriter`` adds ``language_persistent`` and
``language_events`` to parquet shards. The metadata must advertise
those columns too, otherwise non-streaming ``LeRobotDataset`` loads
cast against the old schema and fail on the extra parquet columns.
"""
from lerobot.datasets.io_utils import load_info, write_info # noqa: PLC0415
from lerobot.datasets.language import SAY_TOOL_SCHEMA, language_feature_info # noqa: PLC0415
info_path = root / "meta" / "info.json"
if not info_path.exists():
return
try:
info = load_info(root)
except Exception as exc: # noqa: BLE001
print(f"[annotate] could not read {info_path}: {exc}", flush=True)
return
changed = False
merged_features = {**info.features, **language_feature_info()}
if merged_features != info.features:
info.features = merged_features
changed = True
existing = info.tools or []
names = {(t.get("function") or {}).get("name") for t in existing if isinstance(t, dict)}
if SAY_TOOL_SCHEMA["function"]["name"] not in names:
info.tools = [*existing, SAY_TOOL_SCHEMA]
changed = True
if changed:
write_info(info, root)
print(
"[annotate] meta/info.json: "
f"language_features={list(language_feature_info())}, "
f"tools={[t['function']['name'] for t in (info.tools or [])]}",
flush=True,
)
def _run_vocabulary_phase(
self, records: list[EpisodeRecord], root: Path
) -> PhaseResult:
"""Discover (or load) the canonical vocabulary, wire it into ``self.plan``.
Returns a ``PhaseResult`` whose ``episodes_processed`` is the number
of sample episodes consulted (0 when disabled or no VLM call was
needed); ``episodes_skipped`` is always ``0`` because vocabulary is
a once-per-dataset artifact, not a per-episode product.
"""
from .vocabulary import load_vocabulary, save_vocabulary # noqa: PLC0415
if self.vocabulary is None or not getattr(self.vocabulary, "enabled", False):
print(
"[annotate] phase=vocabulary skipped (module disabled or unset)",
flush=True,
)
return PhaseResult(name="vocabulary", episodes_processed=0, episodes_skipped=0)
existing = load_vocabulary(root)
if existing is not None and self.config.vocabulary.reuse_existing:
print(
f"[annotate] phase=vocabulary reusing {root / 'meta' / 'canonical_vocabulary.json'} "
f"({len(existing.subtasks)} subtask labels, "
f"{len(existing.memory_milestones)} memory milestones)",
flush=True,
)
self.plan.vocabulary = existing
return PhaseResult(name="vocabulary", episodes_processed=0, episodes_skipped=0)
sample_n = max(1, min(int(self.config.vocabulary.sample_episodes), len(records)))
print(
f"[annotate] phase=vocabulary discovering from {sample_n} sample episode(s)...",
flush=True,
)
t0 = time.time()
vocab = self.vocabulary.discover(records[:sample_n], existing=existing)
if vocab is None:
print(
"[annotate] phase=vocabulary returned no vocabulary — "
"plan module will fall back to free-form generation",
flush=True,
)
return PhaseResult(name="vocabulary", episodes_processed=0, episodes_skipped=0)
save_path = save_vocabulary(root, vocab)
print(
f"[annotate] phase=vocabulary wrote {save_path} "
f"({len(vocab.subtasks)} subtask labels, "
f"{len(vocab.memory_milestones)} memory milestones) in "
f"{time.time() - t0:.1f}s",
flush=True,
)
self.plan.vocabulary = vocab
return PhaseResult(name="vocabulary", episodes_processed=sample_n, episodes_skipped=0)
def _run_module_phase(
self,
name: str,
records: list[EpisodeRecord],
staging_dir: Path,
module: Any,
) -> PhaseResult:
if not module.enabled:
print(f"[annotate] phase={name} skipped (module disabled)", flush=True)
return PhaseResult(name=name, episodes_processed=0, episodes_skipped=len(records))
n = len(records)
parallelism = max(1, min(self.config.executor.episode_parallelism, n))
print(
f"[annotate] phase={name} starting on {n} episode(s) (parallelism={parallelism})",
flush=True,
)
t0 = time.time()
def _do(idx_record: tuple[int, EpisodeRecord]) -> tuple[int, int, float]:
i, record = idx_record
ep_start = time.time()
staging = EpisodeStaging(staging_dir, record.episode_index)
module.run_episode(record, staging)
return i, record.episode_index, time.time() - ep_start
processed = 0
if parallelism == 1:
for i, record in enumerate(records, 1):
_, ep_idx, elapsed = _do((i, record))
processed += 1
print(
f"[annotate] {name} episode {i}/{n} (idx={ep_idx}) done in {elapsed:.1f}s",
flush=True,
)
else:
with ThreadPoolExecutor(max_workers=parallelism) as pool:
futures = [pool.submit(_do, (i, r)) for i, r in enumerate(records, 1)]
for fut in as_completed(futures):
i, ep_idx, elapsed = fut.result()
processed += 1
print(
f"[annotate] {name} episode {processed}/{n} "
f"(idx={ep_idx}, submit_order={i}) done in {elapsed:.1f}s",
flush=True,
)
total = time.time() - t0
print(f"[annotate] phase={name} complete: {processed}/{n} in {total:.1f}s", flush=True)
return PhaseResult(name=name, episodes_processed=processed, episodes_skipped=0)
def _run_plan_update_phase( # noqa: PLR0915
self, records: list[EpisodeRecord], staging_dir: Path
) -> PhaseResult:
"""Re-emit ``plan`` rows at each timestamp the ``interjections`` module produced.
The ``plan`` module owns the prompt; the ``interjections`` module
produced the timestamps. This phase therefore calls back into the
``plan`` module with the interjection timestamps so its existing
prompt path is reused.
"""
if not self.plan.enabled or not self.interjections.enabled:
return PhaseResult(
name="plan_update", episodes_processed=0, episodes_skipped=len(records)
)
processed = 0
for record in records:
staging = EpisodeStaging(staging_dir, record.episode_index)
interjection_rows = [
row for row in staging.read("interjections") if row.get("style") == "interjection"
]
interjection_times = [float(row["timestamp"]) for row in interjection_rows]
interjection_texts = [str(row.get("content") or "") for row in interjection_rows]
if interjection_times:
self.plan.run_plan_updates(record, staging, interjection_times, interjection_texts)
processed += 1
# Episodes without any interjections are skipped (no plan refresh
# needed); count them so the summary's processed+skipped == total.
return PhaseResult(
name="plan_update",
episodes_processed=processed,
episodes_skipped=len(records) - processed,
)

View File

@@ -1,483 +0,0 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Keyframe extraction for the annotation pipeline.
Modules attach decoded camera frames to their VLM prompts so the model can
ground subtask decomposition, interjection scenarios, and VQA in actual
visual content. The pipeline shares one provider across modules and one
episode at a time, with a small per-episode cache so multiple modules
querying the same timestamp pay decode cost once.
"""
from __future__ import annotations
import logging
import threading
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Protocol
import PIL.Image
import torch
from lerobot.datasets.video_utils import decode_video_frames
from .reader import EpisodeRecord
logger = logging.getLogger(__name__)
class FrameProvider(Protocol):
"""Decodes camera frames at episode-relative timestamps."""
@property
def camera_keys(self) -> list[str]:
"""All ``observation.images.*`` feature keys this provider can decode."""
def frames_at(
self,
record: EpisodeRecord,
timestamps: list[float],
camera_key: str | None = None,
) -> list[Any]:
"""Return one decoded frame per timestamp from ``camera_key`` (or default).
Frames are ``torch.Tensor`` (``C, H, W`` uint8) — the shape
:func:`lerobot.datasets.video_utils.decode_video_frames` returns.
:func:`to_image_blocks` converts them to PIL only at the VLM-message
boundary.
Empty list if the camera is unavailable. ``camera_key=None`` falls back
to the provider's default camera so existing single-camera callers
(the ``plan`` and ``interjections`` modules) keep working unchanged.
"""
def video_for_episode(
self,
record: EpisodeRecord,
max_frames: int,
camera_key: str | None = None,
) -> list[Any]:
"""Return up to ``max_frames`` decoded frames covering the whole episode.
Sampling is uniform across the episode duration. Frames are
``torch.Tensor`` (``C, H, W`` uint8); :func:`to_video_block` wraps
them into one ``{"type":"video", "video":<list>}`` block for a
Qwen-VL-compatible model that pools temporally itself. Empty list if
no camera available.
"""
@dataclass
class _NullProvider:
"""No-op provider used when the dataset has no video keys or in tests."""
@property
def camera_keys(self) -> list[str]:
return []
def frames_at(
self,
record: EpisodeRecord,
timestamps: list[float],
camera_key: str | None = None,
) -> list[Any]:
return []
def video_for_episode(
self,
record: EpisodeRecord,
max_frames: int,
camera_key: str | None = None,
) -> list[Any]:
return []
def null_provider() -> FrameProvider:
return _NullProvider()
@dataclass
class VideoFrameProvider:
"""Decodes frames from the dataset's ``observation.images.*`` streams.
By default the *first* camera key is used for the ``plan`` module
(subtask decomposition) and the ``interjections`` module (interjection
scenarios) — those prompts care about *what is happening*, not which
angle. The ``vqa`` module instead iterates over every camera in
:attr:`camera_keys` so each frame's
grounded answer (bbox/keypoint/...) is tagged with the camera it was
grounded against.
``camera_key`` overrides the default-camera choice but does not restrict
:attr:`camera_keys`. Pass ``camera_key`` explicitly to ``frames_at`` /
``video_for_episode`` to read a non-default stream.
Caches up to ``cache_size`` decoded frames per process to keep
co-timestamped ``interjections`` + ``plan`` plan-update calls cheap.
"""
root: Path
camera_key: str | None = None
tolerance_s: float = 1e-2
cache_size: int = 256
# Keyframe decode backend. ``None`` uses the ffmpeg CLI — the
# concurrency- and crash-safe default for the pipeline's threaded
# decode. Set to ``"torchcodec"`` or ``"pyav"`` to pin an in-process
# decoder when the build is known thread-safe.
video_backend: str | None = None
_meta: Any = field(default=None, init=False, repr=False)
_cache: dict = field(default_factory=dict, init=False, repr=False)
_camera_keys: list[str] = field(default_factory=list, init=False, repr=False)
# Pipeline runs the three module phases under a ThreadPoolExecutor (see
# ``ExecutorConfig.episode_parallelism``); guard the dict cache and the
# one-shot warn flag against concurrent updates from worker threads.
_lock: threading.Lock = field(default_factory=threading.Lock, init=False, repr=False)
def __post_init__(self) -> None:
from lerobot.datasets.dataset_metadata import LeRobotDatasetMetadata # noqa: PLC0415
self._meta = LeRobotDatasetMetadata(repo_id="local", root=self.root)
# ``camera_keys`` covers both image- and video-stored cameras and is
# always defined on the metadata (``[]`` in the worst case), so it is
# the single source we need here.
keys = list(self._meta.camera_keys)
# Last-resort fallback: if metadata didn't surface anything but the
# caller explicitly named a camera (``--vlm.camera_key=...``), trust
# them — the key is by definition known to exist on the dataset.
if not keys and self.camera_key:
keys = [self.camera_key]
self._camera_keys = keys
if self.camera_key is None:
self.camera_key = keys[0] if keys else None
@property
def camera_keys(self) -> list[str]:
"""All ``observation.images.*`` keys available on this dataset."""
return list(self._camera_keys)
def frames_at(
self,
record: EpisodeRecord,
timestamps: list[float],
camera_key: str | None = None,
) -> list[Any]:
target = camera_key if camera_key is not None else self.camera_key
if not timestamps or target is None:
return []
out: list[Any] = []
misses: list[float] = []
miss_indices: list[int] = []
with self._lock:
for i, ts in enumerate(timestamps):
key = (record.episode_index, target, round(float(ts), 6))
cached = self._cache.get(key)
if cached is not None:
out.append(cached)
else:
out.append(None)
misses.append(float(ts))
miss_indices.append(i)
if misses:
decoded = self._decode(record.episode_index, misses, target)
# ``_decode`` returns exactly one frame per requested timestamp,
# or an empty list if decoding failed wholesale. A partial list
# would mean a frame/timestamp misalignment, so only pair them up
# when the counts match (``strict=True`` then guards regressions).
if len(decoded) == len(miss_indices):
with self._lock:
for i, frame in zip(miss_indices, decoded, strict=True):
out[i] = frame
key = (record.episode_index, target, round(float(timestamps[i]), 6))
if len(self._cache) >= self.cache_size:
self._cache.pop(next(iter(self._cache)))
self._cache[key] = frame
# filter out any None left over from decode failures
return [frame for frame in out if frame is not None]
def video_for_episode(
self,
record: EpisodeRecord,
max_frames: int,
camera_key: str | None = None,
) -> list[Any]:
"""Return up to ``max_frames`` frames uniformly sampled across the episode.
The whole episode duration is covered; the model picks subtask
boundaries from the temporal pooling it does internally. Frames are
``torch.Tensor`` (see :meth:`frames_at`).
"""
target = camera_key if camera_key is not None else self.camera_key
if max_frames <= 0 or target is None or not record.frame_timestamps:
return []
n_frames = min(max_frames, len(record.frame_timestamps))
if n_frames == len(record.frame_timestamps):
timestamps = list(record.frame_timestamps)
else:
t0 = record.frame_timestamps[0]
t_last = record.frame_timestamps[-1]
if t_last <= t0:
timestamps = [float(t0)] * n_frames
else:
step = (t_last - t0) / (n_frames - 1) if n_frames > 1 else 0.0
timestamps = [float(t0 + i * step) for i in range(n_frames)]
return self.frames_at(record, timestamps, camera_key=target)
def episode_clip_path(self, record: EpisodeRecord, cache_dir: Path) -> Path | None:
"""Extract the episode's subclip to ``cache_dir/ep_{idx:06d}.mp4``.
Returns ``None`` if the dataset has no video tracks. Skips
re-extract when the cached clip already exists. Re-encodes to
H.264 (libx264) so the resulting mp4 is decodable by every
downstream video processor — stream-copy would inherit the
source codec (often AV1 in modern LeRobot datasets), which
vllm's libav build cannot decode.
"""
import subprocess # noqa: PLC0415
if self.camera_key is None:
return None
cache_dir.mkdir(parents=True, exist_ok=True)
out_path = cache_dir / f"ep_{record.episode_index:06d}.mp4"
if out_path.exists() and out_path.stat().st_size > 0:
return out_path
ep = self._meta.episodes[record.episode_index]
from_timestamp = float(ep[f"videos/{self.camera_key}/from_timestamp"])
to_timestamp = float(ep[f"videos/{self.camera_key}/to_timestamp"])
src = self.root / self._meta.get_video_file_path(record.episode_index, self.camera_key)
cmd = [
"ffmpeg",
"-y",
"-loglevel",
"error",
"-ss",
f"{from_timestamp:.3f}",
"-to",
f"{to_timestamp:.3f}",
"-i",
str(src),
"-c:v",
"libx264",
"-preset",
"ultrafast",
"-crf",
"23",
"-pix_fmt",
"yuv420p",
"-an",
str(out_path),
]
try:
subprocess.run(cmd, check=True, timeout=300)
except (subprocess.CalledProcessError, subprocess.TimeoutExpired, FileNotFoundError):
return None
return out_path if out_path.exists() and out_path.stat().st_size > 0 else None
def _decode(self, episode_index: int, timestamps: list[float], camera_key: str) -> list[Any]:
"""Decode ``timestamps`` from the episode's video as ``(C, H, W)`` tensors.
Delegates to :func:`lerobot.datasets.video_utils.decode_video_frames`
(torchcodec by default, PyAV fallback) rather than a bespoke decoder.
Returns one frame per requested timestamp, or ``[]`` if decoding
failed wholesale — callers treat ``[]`` as "no frames available".
"""
ep = self._meta.episodes[episode_index]
from_timestamp = ep[f"videos/{camera_key}/from_timestamp"]
shifted = [from_timestamp + ts for ts in timestamps]
video_path = self.root / self._meta.get_video_file_path(episode_index, camera_key)
# Default to the ffmpeg CLI. The pipeline decodes under a 16-wide
# ThreadPoolExecutor and the in-process decoders are unsafe there:
# torchcodec is not thread-safe and SIGSEGVs under concurrent decode
# (a crash no try/except can catch), PyAV can likewise segfault on
# AV1, and lerobot's ``pyav`` backend routes through the removed
# ``torchvision.io.VideoReader``. ``_decode_frames_ffmpeg`` shells
# out per frame: each decode is an isolated child process, so it is
# both crash-safe and concurrency-safe. ``video_backend`` can pin
# ``torchcodec`` / ``pyav`` explicitly for callers that know their
# build is safe.
chain = [self.video_backend] if self.video_backend else ["ffmpeg"]
exc: Exception | None = None
for backend in chain:
try:
if backend == "ffmpeg":
return _decode_frames_ffmpeg(video_path, shifted)
if backend in ("pyav", "av"):
return _decode_frames_av(video_path, shifted)
# Stacked ``(N, C, H, W)`` uint8 tensor; one row per timestamp.
decoded = decode_video_frames(
video_path, shifted, self.tolerance_s, backend=backend, return_uint8=True
)
return list(decoded)
except Exception as e: # noqa: PERF203
exc = e
# Every backend raised. Log loudly the first time so a silent
# vqa-module no-op (every prompt skipped because frames_at returned
# []) is debuggable from the job log instead of post-hoc parquet
# inspection. Subsequent failures stay quiet.
with self._lock:
already_warned = getattr(self, "_warned_decode_fail", False)
if not already_warned:
self._warned_decode_fail = True
if not already_warned:
logger.warning(
"VideoFrameProvider._decode failed for episode=%s camera=%s "
"video_path=%s backends=%s: %s",
episode_index,
camera_key,
video_path,
chain,
exc,
exc_info=exc,
)
return []
def make_frame_provider(
root: Path, camera_key: str | None = None, video_backend: str | None = None
) -> FrameProvider:
"""Build a :class:`VideoFrameProvider` if videos are present, else null."""
try:
provider = VideoFrameProvider(root=root, camera_key=camera_key, video_backend=video_backend)
except Exception:
return null_provider()
if provider.camera_key is None:
return null_provider()
return provider
def _decode_frames_ffmpeg(video_path: Path, timestamps: list[float]) -> list[Any]:
"""Decode the frames nearest to ``timestamps`` via the ffmpeg CLI.
Runs one ``ffmpeg`` process per timestamp, seeking with ``-ss`` and
piping a single PNG to stdout. Unlike the in-process decoders this
survives a hostile container: a full ffmpeg build decodes AV1 (the codec
modern LeRobot datasets use) where torchcodec raises and PyAV can
SIGSEGV, and a crash stays isolated to the child process — a non-zero
exit is a catchable error, not a segfault of the whole job. Returns one
``(C, H, W)`` uint8 tensor per timestamp.
"""
import io # noqa: PLC0415
import subprocess # noqa: PLC0415
import numpy as np # noqa: PLC0415
frames: list[Any] = []
for ts in timestamps:
proc = subprocess.run(
[
"ffmpeg", "-nostdin", "-loglevel", "error",
"-ss", f"{max(ts, 0.0):.3f}",
"-i", str(video_path),
"-frames:v", "1",
"-f", "image2pipe", "-vcodec", "png", "pipe:1",
],
capture_output=True,
check=True,
timeout=120,
)
if not proc.stdout:
raise RuntimeError(f"ffmpeg returned no frame for t={ts:.3f}s of {video_path}")
img = PIL.Image.open(io.BytesIO(proc.stdout)).convert("RGB")
frames.append(torch.from_numpy(np.asarray(img).copy()).permute(2, 0, 1).contiguous())
return frames
def _decode_frames_av(video_path: Path, timestamps: list[float]) -> list[Any]:
"""Decode the frames nearest to ``timestamps`` using PyAV directly.
lerobot's ``decode_video_frames(backend="pyav")`` routes through
``torchvision.io.VideoReader``, removed in torchvision 0.23+. This helper
talks to the ``av`` package directly. Note PyAV can SIGSEGV on AV1
streams in some builds — prefer ``_decode_frames_ffmpeg`` as the default
fallback; this stays available behind ``video_backend="pyav"``. Returns
one ``(C, H, W)`` uint8 tensor per timestamp.
"""
import av # noqa: PLC0415
first_ts = min(timestamps)
last_ts = max(timestamps)
loaded_frames: list[torch.Tensor] = []
loaded_ts: list[float] = []
with av.open(str(video_path)) as container:
stream = container.streams.video[0]
# Seek to the keyframe at or before the first requested timestamp.
offset = max(int(first_ts / stream.time_base), 0) if stream.time_base else 0
container.seek(offset, stream=stream, backward=True, any_frame=False)
for idx, frame in enumerate(container.decode(stream)):
ts = frame.time
if ts is None:
ts = float(frame.pts * stream.time_base) if frame.pts is not None else float(idx)
loaded_ts.append(ts)
loaded_frames.append(
torch.from_numpy(frame.to_ndarray(format="rgb24")).permute(2, 0, 1).contiguous()
)
if ts >= last_ts:
break
if not loaded_frames:
raise RuntimeError(f"PyAV decoded no frames from {video_path}")
ts_tensor = torch.tensor(loaded_ts)
return [loaded_frames[int(torch.argmin((ts_tensor - q).abs()))] for q in timestamps]
def _frame_to_pil(frame: Any) -> Any:
"""Materialise a decoded frame as a ``PIL.Image`` for the VLM message.
Frames flow through the provider as ``torch.Tensor`` (``C, H, W`` uint8,
straight from :func:`decode_video_frames`); PIL is only created here, at
the VLM-message boundary, because the chat backends expect PIL images /
data URLs. Non-tensor inputs (e.g. test stubs) pass through untouched.
"""
if not isinstance(frame, torch.Tensor):
return frame
array = frame.detach().cpu()
if array.ndim == 3 and array.shape[0] in (1, 3):
array = array.permute(1, 2, 0) # (C, H, W) -> (H, W, C)
if array.shape[-1] == 1:
array = array.squeeze(-1)
return PIL.Image.fromarray(array.to(torch.uint8).numpy())
def to_image_blocks(frames: list[Any]) -> list[dict[str, Any]]:
"""Convert decoded frames to Qwen-VL-compatible image content blocks."""
return [{"type": "image", "image": _frame_to_pil(frame)} for frame in frames]
def to_video_block(frames: list[Any]) -> list[dict[str, Any]]:
"""Wrap a list of decoded frames as one Qwen-VL video block.
Returns ``[]`` when the list is empty, so the caller can splat the result
into a content array without a separate emptiness check.
"""
if not frames:
return []
return [{"type": "video", "video": [_frame_to_pil(frame) for frame in frames]}]
def to_video_url_block(url: str | None, fps: float = 2.0) -> list[dict[str, Any]]:
"""Wrap a video file URL as one ``video_url`` block.
Used by the ``openai`` backend (transformers serve / vllm serve /
ktransformers serve), where the server handles frame sampling.
Returns ``[]`` when ``url`` is ``None`` so the caller can splat.
"""
if not url:
return []
return [{"type": "video_url", "video_url": {"url": url}, "fps": fps}]

View File

@@ -1,25 +0,0 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .general_vqa import GeneralVqaModule
from .interjections_and_speech import InterjectionsAndSpeechModule
from .plan_subtasks_memory import PlanSubtasksMemoryModule
__all__ = [
"GeneralVqaModule",
"InterjectionsAndSpeechModule",
"PlanSubtasksMemoryModule",
]

View File

@@ -1,228 +0,0 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""``vqa`` module: general VQA at a timed cadence.
Every ``1/hz`` seconds an emission tick fires; each tick anchors ``K``
consecutive frames, and every anchored frame gets its own VQA pair. Each
pair is grounded on that single anchor frame — there is no per-pair frame
window. For datasets with multiple cameras, every anchored frame produces
one ``(vqa, user)`` + ``(vqa, assistant)`` pair *per camera*: each pair is
generated against that camera's frame and stamped with the matching
``camera`` field on the emitted rows. The resolver disambiguates via
``camera=...``; recipes that consume VQA do so through one sub-recipe
per camera (see ``recipes/pi05_hirobot.yaml``).
Within a single (frame, camera) we still emit at most one ``(vqa, user)``
and one ``(vqa, assistant)`` row, so the resolver contract stays scalar.
Question types covered (per the plan's ``vqa`` table): bbox, keypoint,
count, attribute, spatial. The assistant's ``content`` is a JSON string
whose schema depends on the question type. Malformed JSON triggers one
retry inside :meth:`VlmClient.generate_json`.
"""
from __future__ import annotations
import json
import logging
import random
from collections.abc import Sequence
from dataclasses import dataclass, field
from typing import Any
from ..config import VqaConfig
from ..frames import FrameProvider, null_provider, to_image_blocks
from ..prompts import load as load_prompt
from ..reader import EpisodeRecord
from ..staging import EpisodeStaging
from ..validator import classify_vqa_answer
from ..vlm_client import VlmClient
def _emission_anchor_indices(frame_timestamps: Sequence[float], hz: float, k: int) -> list[int]:
"""Return the relative frame indices to anchor VQA emissions to.
For each emission tick (every ``1/hz`` seconds), we anchor ``k``
consecutive frames starting at the tick. Ticks fall on the nearest
available source frame timestamp.
"""
if hz <= 0 or k <= 0 or not frame_timestamps:
return []
t0 = frame_timestamps[0]
t_last = frame_timestamps[-1]
period = 1.0 / hz
indices: list[int] = []
t = t0
while t <= t_last + 1e-9:
# find the index of the nearest frame to t
nearest_i = min(range(len(frame_timestamps)), key=lambda i: abs(frame_timestamps[i] - t))
for offset in range(k):
j = nearest_i + offset
if j >= len(frame_timestamps):
break
if not indices or indices[-1] != j:
indices.append(j)
t += period
# dedupe while preserving order
seen: set[int] = set()
deduped: list[int] = []
for i in indices:
if i in seen:
continue
seen.add(i)
deduped.append(i)
return deduped
@dataclass
class GeneralVqaModule:
"""Emit grounded VQA pairs at a timed cadence."""
vlm: VlmClient
config: VqaConfig
seed: int = 1729
frame_provider: FrameProvider = field(default_factory=null_provider)
@property
def enabled(self) -> bool:
return self.config.enabled
def run_episode(self, record: EpisodeRecord, staging: EpisodeStaging) -> None:
if not record.frame_timestamps:
staging.write("vqa", [])
return
rng = random.Random(f"{self.seed}:{record.episode_index}:vqa")
anchor_idx = _emission_anchor_indices(
record.frame_timestamps, self.config.vqa_emission_hz, self.config.K
)
cameras = self._target_cameras()
if not cameras:
# No camera available — emit nothing rather than producing
# untagged rows that would fail validation. Surface a loud one-
# time warning so this is never silently a no-op.
if not getattr(self, "_warned_no_camera", False):
logging.getLogger(__name__).warning(
"vqa module found no cameras on the frame provider — "
"every episode will emit zero VQA rows. Check that the "
"dataset declares observation.images.* features in "
"meta/info.json; passing --vlm.camera_key=<key> at the "
"CLI now also seeds the cameras list as a fallback."
)
self._warned_no_camera = True
staging.write("vqa", [])
return
# Build all messages first (one per (frame, camera)), then issue them
# as a single batched generate_json call so the client can fan them
# out concurrently.
per_call: list[tuple[float, str, str, list[dict[str, Any]]]] = []
for idx in anchor_idx:
ts = float(record.frame_timestamps[idx])
qtype = rng.choice(self.config.question_types)
for camera in cameras:
messages = self._build_messages(record, qtype, ts, camera)
# Skip cameras that decoded to zero frames at this ts: no point
# asking the VLM to ground a bbox without an image.
if not _has_image_block(messages):
continue
per_call.append((ts, camera, qtype, messages))
if not per_call:
staging.write("vqa", [])
return
results = self.vlm.generate_json([m for _, _, _, m in per_call])
rows: list[dict[str, Any]] = []
for (ts, camera, _qtype, _messages), result in zip(per_call, results, strict=True):
qa = self._postprocess(result)
if qa is None:
continue
question, answer = qa
rows.append(
{
"role": "user",
"content": question,
"style": "vqa",
"timestamp": ts,
"camera": camera,
"tool_calls": None,
}
)
rows.append(
{
"role": "assistant",
"content": json.dumps(answer, sort_keys=True),
"style": "vqa",
"timestamp": ts,
"camera": camera,
"tool_calls": None,
}
)
staging.write("vqa", rows)
def _target_cameras(self) -> list[str]:
"""Return the cameras the ``vqa`` module should iterate per anchored frame.
Defaults to every camera the provider exposes. Datasets with no
cameras (or test/null providers) yield an empty list, which makes
``run_episode`` a no-op.
"""
return list(getattr(self.frame_provider, "camera_keys", []) or [])
def _build_messages(
self,
record: EpisodeRecord,
question_type: str,
frame_timestamp: float,
camera_key: str,
) -> list[dict[str, Any]]:
prompt = load_prompt("module_3_vqa").format(
episode_task=record.episode_task,
question_type=question_type,
)
images = self.frame_provider.frames_at(
record, [frame_timestamp], camera_key=camera_key
)
content = [*to_image_blocks(images), {"type": "text", "text": prompt}]
return [{"role": "user", "content": content}]
def _postprocess(self, result: Any) -> tuple[str, dict[str, Any]] | None:
if not isinstance(result, dict):
return None
question = result.get("question")
answer = result.get("answer")
if not isinstance(question, str) or not question.strip():
return None
if not isinstance(answer, dict):
return None
# The validator will enforce shape; here we just sanity-check that the
# answer matches *some* known shape so we can drop garbage early.
if classify_vqa_answer(answer) is None:
return None
return question.strip(), answer
def _has_image_block(messages: list[dict[str, Any]]) -> bool:
"""Return True if any user content block is a populated image block."""
for msg in messages:
content = msg.get("content")
if not isinstance(content, list):
continue
for block in content:
if isinstance(block, dict) and block.get("type") == "image":
return True
return False

View File

@@ -1,210 +0,0 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""``interjections`` module: interjections + paired speech (EVENT styles + speech atoms).
Two sub-passes:
1. At ``t=0``, emit ONLY a speech tool-call atom (acknowledgement of the
canonical task). No interjection row — the canonical task is already the
user utterance from ``meta/tasks.parquet``.
2. For mid-episode interruptions, emit a co-timestamped pair:
{role:user, style:interjection, content:<text>}
speech atom (role:assistant, style:None, tool_calls=[say(...)])
Both rows go in ``language_events`` at the same timestamp.
The ``plan`` module's :meth:`run_plan_updates` reuses this module's
interjection timestamps to refresh the ``plan`` row at the same instant.
"""
from __future__ import annotations
import random
from collections.abc import Sequence
from dataclasses import dataclass, field
from typing import Any
from ..config import InterjectionsConfig
from ..frames import FrameProvider, null_provider, to_image_blocks
from ..prompts import load as load_prompt
from ..reader import EpisodeRecord, reconstruct_subtask_spans, snap_to_frame
from ..staging import EpisodeStaging
from ..vlm_client import VlmClient
from ..writer import speech_atom
@dataclass
class InterjectionsAndSpeechModule:
"""Generate task-start speech and mid-episode interjection/speech pairs."""
vlm: VlmClient
config: InterjectionsConfig
seed: int = 1729
frame_provider: FrameProvider = field(default_factory=null_provider)
@property
def enabled(self) -> bool:
return self.config.enabled
def run_episode(self, record: EpisodeRecord, staging: EpisodeStaging) -> None:
rows: list[dict[str, Any]] = []
if record.frame_timestamps:
t0 = float(record.frame_timestamps[0])
initial = self._initial_speech(record)
if initial:
rows.append(speech_atom(t0, initial))
# Pull the ``plan`` module's subtask spans for this episode so the
# interjection prompt can ground itself in the actual current
# subtask at each chosen timestamp. The ``plan`` module ran first.
episode_end_t = float(record.frame_timestamps[-1]) if record.frame_timestamps else None
subtask_spans = reconstruct_subtask_spans(staging.read("plan"), episode_end_t=episode_end_t)
rows.extend(self._mid_episode_interjections(record, subtask_spans))
staging.write("interjections", rows)
@staticmethod
def _subtask_at(spans: Sequence[dict[str, Any]], t: float) -> str | None:
current: str | None = None
for span in spans:
if float(span["start"]) <= t:
current = span.get("text")
else:
break
return current
def _initial_speech(self, record: EpisodeRecord) -> str | None:
prompt = load_prompt("module_2_initial_speech").format(
episode_task=record.episode_task,
)
messages = [{"role": "user", "content": [{"type": "text", "text": prompt}]}]
result = self.vlm.generate_json([messages])[0]
if isinstance(result, dict) and isinstance(result.get("text"), str):
text = result["text"].strip()
if text:
return text
return None
def _mid_episode_interjections(
self,
record: EpisodeRecord,
subtask_spans: Sequence[dict[str, Any]],
) -> list[dict[str, Any]]:
"""Generate interjections aligned with the actual demo trajectory.
Teleop data is frozen — the robot already executed every step in
the video. A *counterfactual* interjection like "actually skip
the wipe" contradicts what then happens in the video, which is
what qwen36moe-10/11 surfaced as low-quality interjections.
Instead, anchor every interjection at a subtask boundary and
write it as a natural user request for the *upcoming* subtask.
The robot's visible next behavior IS the interjection's effect,
so the training signal stays consistent: interjection text →
plan refresh → action stream all line up.
"""
if self.config.max_interjections_per_episode <= 0:
return []
if len(subtask_spans) < 2:
# Need at least one transition (subtask 0 → subtask 1).
return []
# Deterministic per-episode RNG so reruns are stable across SLURM jobs.
rng = random.Random(f"{self.seed}:{record.episode_index}:interjection")
# Boundaries: the start time of every subtask except the first
# (which is just t0 and is covered by the initial-task speech atom).
boundaries: list[tuple[float, str, str]] = []
for i in range(1, len(subtask_spans)):
ts = float(subtask_spans[i]["start"])
if ts < self.config.interjection_min_t:
continue
prev_text = (subtask_spans[i - 1].get("text") or "").strip()
next_text = (subtask_spans[i].get("text") or "").strip()
if not next_text:
continue
boundaries.append((ts, prev_text, next_text))
if not boundaries:
return []
n = min(self.config.max_interjections_per_episode, len(boundaries))
chosen = sorted(rng.sample(boundaries, n), key=lambda b: b[0])
out: list[dict[str, Any]] = []
for t, prev_subtask, next_subtask in chosen:
t_snap = snap_to_frame(t, record.frame_timestamps)
# Window straddles the boundary so the VLM sees the end of the
# previous subtask and the start of the next one — same
# conditioning the policy will see at training time.
window_ts = self._window_timestamps(t_snap, record.frame_timestamps)
prompt = load_prompt("module_2_interjection").format(
episode_task=record.episode_task,
prev_subtask=prev_subtask or "(starting from initial state)",
next_subtask=next_subtask,
timestamp=t_snap,
window_seconds=self.config.interjection_window_seconds,
)
images = self.frame_provider.frames_at(record, window_ts)
content = [*to_image_blocks(images), {"type": "text", "text": prompt}]
messages = [{"role": "user", "content": content}]
result = self.vlm.generate_json([messages])[0]
if not isinstance(result, dict):
continue
interjection_text = result.get("interjection")
speech_text = result.get("speech")
if not isinstance(interjection_text, str) or not interjection_text.strip():
continue
if not isinstance(speech_text, str) or not speech_text.strip():
continue
out.append(
{
"role": "user",
"content": interjection_text.strip(),
"style": "interjection",
"timestamp": t_snap,
"tool_calls": None,
}
)
out.append(speech_atom(t_snap, speech_text.strip()))
return out
def _window_timestamps(self, t_anchor: float, frame_timestamps: Sequence[float]) -> list[float]:
"""Return a small set of frame timestamps centered on ``t_anchor``.
The window straddles the subtask boundary the interjection sits
on: roughly half the frames cover the end of the previous
subtask, half cover the start of the next one. The VLM therefore
sees BOTH what just finished AND what's about to start, which is
the conditioning we need to write a natural "now please do X"
request that matches the visible upcoming behavior.
"""
if not frame_timestamps:
return [t_anchor]
n = max(1, int(self.config.interjection_window_frames))
if n == 1:
return [t_anchor]
window = float(self.config.interjection_window_seconds)
step = window / max(1, n - 1)
# Center the window on the anchor so half lands before, half after.
start_offset = -window / 2.0
targets = [t_anchor + start_offset + step * i for i in range(n)]
last_ts = float(frame_timestamps[-1])
snapped: list[float] = []
seen: set[float] = set()
for tgt in targets:
clamped = min(last_ts, max(0.0, tgt))
t = snap_to_frame(clamped, frame_timestamps)
if t not in seen:
seen.add(t)
snapped.append(t)
return snapped or [t_anchor]

View File

@@ -1,617 +0,0 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""``plan`` module: subtask decomposition + plan + memory (PERSISTENT styles)."""
from __future__ import annotations
import logging
from collections.abc import Sequence
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
from ..config import PlanConfig
from ..frames import (
FrameProvider,
VideoFrameProvider,
null_provider,
to_video_block,
to_video_url_block,
)
from ..prompts import load as load_prompt
from ..reader import EpisodeRecord, reconstruct_subtask_spans, snap_to_frame
from ..staging import EpisodeStaging
from ..vlm_client import VlmClient
from ..vocabulary import Vocabulary
logger = logging.getLogger(__name__)
@dataclass
class PlanSubtasksMemoryModule:
"""Generate subtask spans, plan, and memory rows.
All output is persistent (lives in ``language_persistent``):
- ``subtask`` rows: one per span, stamped at the span's *start* timestamp
(snapped to an exact frame).
- ``plan`` rows: emitted at ``t=0``; refreshed at every interjection
timestamp via :meth:`run_plan_updates` (called by the executor after
the ``interjections`` module completes).
- ``memory`` rows: emitted at each subtask boundary (= subtask start
timestamp from the second subtask onward).
"""
vlm: VlmClient
config: PlanConfig
frame_provider: FrameProvider = field(default_factory=null_provider)
vocabulary: Vocabulary | None = None
"""When set, the module constrains subtask + memory generation to the
canonical strings in ``vocabulary``. Phase 0 (vocabulary discovery)
populates this once per dataset; ``None`` falls back to free-form
generation (original behaviour)."""
@property
def enabled(self) -> bool:
return self.config.enabled
def run_episode(self, record: EpisodeRecord, staging: EpisodeStaging) -> None:
rows: list[dict[str, Any]] = []
# Resolve the task that drives every other ``plan``-module prompt.
# May be the canonical ``record.episode_task`` (default), or a fresh
# description derived from the video when the canonical task is
# empty / placeholder / forced-off (see PlanConfig.derive_task_*).
effective_task = self._resolve_effective_task(record)
# ``task_aug`` rows at t=0 (role=user), one per rephrasing — the
# message renderer rotates ``${task}`` deterministically through
# them so the policy sees diverse phrasings during training.
t0 = float(record.frame_timestamps[0]) if record.frame_timestamps else 0.0
if self.config.n_task_rephrasings > 0 and effective_task:
rephrasings = self._generate_task_rephrasings(effective_task, n=self.config.n_task_rephrasings)
# Always include the effective task itself as the first variant
# so the rotation is guaranteed to cover the source-of-truth
# phrasing, not just synthetic alternatives.
seen: set[str] = set()
ordered = [effective_task, *rephrasings]
for phrasing in ordered:
key = phrasing.strip()
if not key or key in seen:
continue
seen.add(key)
rows.append(
{
"role": "user",
"content": key,
"style": "task_aug",
"timestamp": t0,
"tool_calls": None,
}
)
subtask_spans = self._generate_subtasks(record, task=effective_task)
# subtask rows
for span in subtask_spans:
rows.append(
{
"role": "assistant",
"content": span["text"],
"style": "subtask",
"timestamp": snap_to_frame(span["start"], record.frame_timestamps),
"tool_calls": None,
}
)
# Plan rows at every subtask boundary — including t=0 (start of
# the first subtask). Because the plan is just a numbered list
# of *still-todo* subtasks, re-emitting at each boundary makes
# the active plan shrink as work progresses: at frame t the
# rendered ``${plan}`` is the most recent emission, which
# contains exactly the subtasks that started at or after the
# current span. Saves the runtime from having to derive
# "what's still left" at inference time.
for span in subtask_spans:
boundary_t = snap_to_frame(span["start"], record.frame_timestamps)
plan_text = self._generate_plan(
record, subtask_spans, refresh_t=boundary_t, task=effective_task
)
if plan_text is not None:
rows.append(
{
"role": "assistant",
"content": plan_text,
"style": "plan",
"timestamp": float(boundary_t),
"tool_calls": None,
}
)
# memory rows at every subtask boundary except the very first start
prior_memory = ""
for i, span in enumerate(subtask_spans[1:], start=1):
completed = subtask_spans[i - 1]["text"]
remaining = [s["text"] for s in subtask_spans[i:]]
mem_text = self._generate_memory(record, prior_memory, completed, remaining, task=effective_task)
if mem_text:
ts = snap_to_frame(span["start"], record.frame_timestamps)
rows.append(
{
"role": "assistant",
"content": mem_text,
"style": "memory",
"timestamp": ts,
"tool_calls": None,
}
)
prior_memory = mem_text
staging.write("plan", rows)
# ------------------------------------------------------------------
# Task derivation + rephrasings
# ------------------------------------------------------------------
_PLACEHOLDER_TASKS: frozenset[str] = frozenset(
{
"debug",
"test",
"tbd",
"todo",
"n/a",
"na",
"untitled",
"unnamed",
"default",
"placeholder",
}
)
def _resolve_effective_task(self, record: EpisodeRecord) -> str:
"""Decide which task string drives the ``plan`` module for this episode.
Returns the user-supplied ``record.episode_task`` unless
``derive_task_from_video`` says otherwise (see config docstring).
Falls back gracefully to the canonical task if video derivation
fails.
"""
canonical = (record.episode_task or "").strip()
mode = (self.config.derive_task_from_video or "off").strip().lower()
if mode == "always":
derived = self._derive_task_from_video(record)
return derived or canonical
if mode == "if_short" and self._task_seems_bad(canonical):
derived = self._derive_task_from_video(record)
if derived:
return derived
return canonical
def _task_seems_bad(self, task: str) -> bool:
if not task:
return True
if len(task.split()) < int(self.config.derive_task_min_words):
return True
return task.lower() in self._PLACEHOLDER_TASKS
# ------------------------------------------------------------------
# VLM call helpers (factored out: every ``plan``-module prompt below follows
# the same "build messages → single VLM call → pull a named field"
# shape, only differing in field name + post-processing).
# ------------------------------------------------------------------
def _vlm_field(self, messages: list[dict[str, Any]], field: str) -> Any:
"""Run a single VLM call and return ``result[field]`` or ``None``.
Centralizes the ``vlm.generate_json([m])[0]`` + ``isinstance(dict)``
dance every prompt-call site needs.
"""
result = self.vlm.generate_json([messages])[0]
if isinstance(result, dict):
return result.get(field)
return None
@staticmethod
def _text_message(text: str) -> list[dict[str, Any]]:
"""One-shot text-only user message wrapped for ``generate_json``."""
return [{"role": "user", "content": [{"type": "text", "text": text}]}]
def _video_message(self, record: EpisodeRecord, prompt: str) -> list[dict[str, Any]]:
"""User message combining the episode video block with ``prompt``."""
content = [*self._episode_video_block(record), {"type": "text", "text": prompt}]
return [{"role": "user", "content": content}]
def _derive_task_from_video(self, record: EpisodeRecord) -> str | None:
"""Ask the VLM "what is this video about" with no task hint at all."""
text = self._vlm_field(self._video_message(record, load_prompt("module_1_video_task")), "task")
return text.strip() if isinstance(text, str) and text.strip() else None
def _generate_task_rephrasings(self, base_task: str, *, n: int) -> list[str]:
"""Generate ``n`` text-only paraphrases of ``base_task``."""
if n <= 0 or not base_task:
return []
prompt = load_prompt("module_1_task_rephrasings").format(base_task=base_task, n=n)
raw = self._vlm_field(self._text_message(prompt), "rephrasings")
if not isinstance(raw, list):
return []
out = [item.strip().strip('"').strip("'") for item in raw if isinstance(item, str)]
return [s for s in out if s][:n]
def _episode_video_block(self, record: EpisodeRecord) -> list[dict[str, Any]]:
"""Same video block ``_generate_subtasks`` builds — extracted helper."""
if not record.frame_timestamps:
return []
if self.config.use_video_url and isinstance(self.frame_provider, VideoFrameProvider):
cache_dir = Path(self.frame_provider.root) / ".annotate_staging" / ".video_clips"
clip = self.frame_provider.episode_clip_path(record, cache_dir)
return (
to_video_url_block(f"file://{clip}", fps=self.config.use_video_url_fps)
if clip is not None
else []
)
episode_duration = record.frame_timestamps[-1] - record.frame_timestamps[0]
target_count = max(1, int(round(episode_duration * self.config.frames_per_second)))
target_count = min(target_count, self.config.max_video_frames)
video_frames = self.frame_provider.video_for_episode(record, target_count)
return to_video_block(video_frames)
def run_plan_updates(
self,
record: EpisodeRecord,
staging: EpisodeStaging,
interjection_times: Sequence[float],
interjection_texts: Sequence[str] | None = None,
) -> None:
"""Append additional ``plan`` rows at every interjection timestamp.
Plans refresh ONLY on user interjections — subtask generation
runs ~1 Hz at inference, but plan re-emission is event-driven.
Now also forwards the interjection's own text into the prompt so
the refreshed plan can actually reflect the user's correction
(the previous version told the model "an interjection happened"
without telling it what the user said).
"""
existing = staging.read("plan")
# Pass the episode's last frame timestamp so the final subtask
# span is closed (otherwise its ``end`` equals its ``start``,
# zero duration, and the "current subtask at refresh_t" lookup
# in ``_generate_plan`` misses any refresh that lands inside it).
episode_end_t = float(record.frame_timestamps[-1]) if record.frame_timestamps else None
spans = reconstruct_subtask_spans(existing, episode_end_t=episode_end_t)
already_planned: set[float] = {float(r["timestamp"]) for r in existing if r.get("style") == "plan"}
new_rows = list(existing)
texts: list[str | None] = (
[None] * len(interjection_times)
if interjection_texts is None
else [str(t) if t else None for t in interjection_texts]
)
for raw_t, inter_text in zip(interjection_times, texts, strict=True):
t = snap_to_frame(raw_t, record.frame_timestamps)
if t in already_planned:
continue
already_planned.add(t)
plan_text = self._generate_plan(record, spans, refresh_t=t, interjection=inter_text)
if plan_text is not None:
new_rows.append(
{
"role": "assistant",
"content": plan_text,
"style": "plan",
"timestamp": t,
"tool_calls": None,
}
)
staging.write("plan", new_rows)
def _generate_subtasks(self, record: EpisodeRecord, *, task: str | None = None) -> list[dict[str, Any]]:
if record.row_count == 0 or not record.frame_timestamps:
return []
episode_duration = record.frame_timestamps[-1] - record.frame_timestamps[0]
prompt = load_prompt("module_1_subtasks").format(
episode_task=(task if task is not None else record.episode_task),
min_subtask_seconds=self.config.min_subtask_seconds,
max_steps=self.config.plan_max_steps,
episode_duration=f"{episode_duration:.3f}",
vocabulary_block=self._subtask_vocabulary_block(),
)
messages = self._video_message(record, prompt)
spans = self._vlm_field(messages, "subtasks")
# When a vocabulary is in force, do a single targeted retry if
# any returned subtask is off-vocab — strict exact-match only,
# no fuzzy snapping. The retry includes the offending strings
# and the full canonical list so the VLM can correct itself.
if self.vocabulary is not None and self.vocabulary.subtasks and spans:
invalid = self._invalid_subtasks(spans)
if invalid:
logger.info(
"episode %d: VLM emitted %d off-vocab subtask(s) (%s); retrying once",
record.episode_index,
len(invalid),
invalid,
)
retry_msg = self._build_subtask_retry_message(messages, invalid)
retried = self._vlm_field(retry_msg, "subtasks")
if retried:
spans = retried
if not spans:
return []
# clamp to [t0, t_last] and sort
t0 = record.frame_timestamps[0]
t_last = record.frame_timestamps[-1]
cleaned: list[dict[str, Any]] = []
for span in spans:
try:
start = float(span["start"])
end = float(span["end"])
text = str(span["text"]).strip()
except (KeyError, ValueError, TypeError):
continue
start = max(t0, min(start, t_last))
end = max(t0, min(end, t_last))
if end < start:
start, end = end, start
if not text:
continue
text = self._canonicalize_subtask(text)
if not text:
continue
cleaned.append({"text": text, "start": start, "end": end})
cleaned.sort(key=lambda s: s["start"])
cleaned = self._dedupe_starts_to_distinct_frames(cleaned, record)
if self.vocabulary is not None and self.vocabulary.subtasks and not cleaned:
logger.warning(
"episode %d: every VLM subtask was off-vocab even after retry — "
"episode left empty (extend meta/canonical_vocabulary.json to "
"cover the missing phase)",
record.episode_index,
)
return cleaned
@staticmethod
def _dedupe_starts_to_distinct_frames(
spans: list[dict[str, Any]], record: EpisodeRecord
) -> list[dict[str, Any]]:
"""Bump same-frame subtask starts onto distinct frames.
Two consecutive VLM spans whose ``start`` rounds to the same
source frame (after :func:`snap_to_frame`) would otherwise emit
two ``style=subtask`` rows at the identical persistent
timestamp. The training-time renderer's ``active_at(t,
style=subtask)`` resolver can't disambiguate that and raises
``Ambiguous resolver for style='subtask'``.
Walk the (sorted-by-start) spans, snap each to its frame, and
if the snapped frame is already taken push the span onto the
next unused frame so both subtasks survive on distinct
timestamps. If the episode ends before a free frame is found,
the trailing span is dropped with a warning — better than
poisoning the render.
"""
if not spans:
return spans
frames = record.frame_timestamps
if not frames:
return spans
used: set[float] = set()
out: list[dict[str, Any]] = []
for span in spans:
ts = snap_to_frame(span["start"], frames)
if ts in used:
next_ts = next((f for f in frames if f > ts and f not in used), None)
if next_ts is None:
logger.warning(
"episode %d: subtask %r snapped to occupied frame "
"%.3f and no free later frame exists — dropping",
record.episode_index,
span.get("text"),
ts,
)
continue
ts = next_ts
used.add(ts)
new_span = {**span, "start": ts}
if float(new_span.get("end", ts)) < ts:
new_span["end"] = ts
out.append(new_span)
return out
# ------------------------------------------------------------------
# Canonical-vocabulary helpers
# ------------------------------------------------------------------
def _subtask_vocabulary_block(self) -> str:
"""Bullet-list of canonical subtasks the VLM must pick from.
Returns an empty string when no vocabulary is configured —
``module_1_subtasks.txt`` then falls back to its free-form
rules (original behaviour).
"""
if self.vocabulary is None or not self.vocabulary.subtasks:
return ""
bullets = "\n".join(f"- {s}" for s in self.vocabulary.subtasks)
return (
"You MUST choose each subtask label verbatim from this canonical "
"vocabulary — pick the closest match for each phase of the demo, "
"and reuse the SAME string every time that phase recurs. The "
"low-level policy is conditioned on these exact strings; any "
"novel paraphrase you invent will make its conditioning OOD.\n"
"Canonical subtask labels:\n"
f"{bullets}\n\n"
)
def _memory_vocabulary_block(self) -> str:
"""Bullet-list of canonical memory milestones the VLM must pick from."""
if self.vocabulary is None or not self.vocabulary.memory_milestones:
return ""
bullets = "\n".join(f"- {m}" for m in self.vocabulary.memory_milestones)
return (
"Compose the memory by picking ONLY from this canonical milestone "
"list — append a milestone (or rewrite the running memory to "
"compress past ones) using these exact phrases. Do not invent new "
"wording: every paraphrase weakens the downstream conditioning.\n"
"Canonical memory milestones:\n"
f"{bullets}\n\n"
)
_NORMALIZE_STRIP_TOKENS: frozenset[str] = frozenset({"the", "a", "an"})
def _canonicalize_subtask(self, text: str) -> str:
"""Validate ``text`` against the canonical vocabulary; no fuzzy snap.
Without a vocabulary, the original text passes through. With a
vocabulary, accept the span only if its normalised form (lower-
cased, articles stripped, whitespace collapsed) matches a
canonical entry exactly — the canonical wording is returned so
the supervised string is byte-identical across episodes.
Off-vocab spans are dropped (empty string). Upstream
``_generate_subtasks`` triggers a targeted retry before reaching
the drop path; this function never snaps or warps a span into
a different label.
"""
if self.vocabulary is None or not self.vocabulary.subtasks:
return text.strip()
normalised = self._normalize(text)
if not normalised:
return ""
for candidate in self.vocabulary.subtasks:
if self._normalize(candidate) == normalised:
return candidate
return ""
@classmethod
def _normalize(cls, text: str) -> str:
"""Lowercase, strip articles, collapse whitespace, drop punctuation."""
words = [
w.strip(".,:;\"'!?()")
for w in text.lower().replace(",", " ").split()
]
return " ".join(w for w in words if w and w not in cls._NORMALIZE_STRIP_TOKENS)
def _invalid_subtasks(self, spans: list[dict[str, Any]]) -> list[str]:
"""Return the unique off-vocab subtask strings the VLM produced."""
seen: list[str] = []
for span in spans:
text = str((span or {}).get("text") or "").strip()
if not text:
continue
if self._canonicalize_subtask(text):
continue
if text not in seen:
seen.append(text)
return seen
def _build_subtask_retry_message(
self, original_messages: list[dict[str, Any]], invalid: list[str]
) -> list[dict[str, Any]]:
"""Compose a one-shot correction prompt naming the off-vocab strings."""
assert self.vocabulary is not None
canonical = "\n".join(f"- {s}" for s in self.vocabulary.subtasks)
invalid_list = "\n".join(f"- {s!r}" for s in invalid)
correction = (
"Your previous response included subtask labels that are NOT in "
"the canonical vocabulary:\n"
f"{invalid_list}\n\n"
"Re-emit the same segmentation (same number of spans, same start/end "
"timestamps where they were valid) but replace every off-vocab "
"label with the EXACT canonical string for that phase, copied "
"verbatim from this list:\n"
f"{canonical}\n\n"
"Strict rules:\n"
"- Output strings must be byte-for-byte identical to entries above.\n"
"- No articles, no adverbs, no extra words.\n"
"- If a phase truly has no canonical match, omit that span entirely.\n"
"Return the same JSON shape as before."
)
# Append the correction as an additional user turn; the model
# sees the original prompt + its prior output is implied by the
# conversation context (the VLM client is stateless, so we
# re-send the original content plus this correction).
retry_messages = [
{
"role": m.get("role", "user"),
"content": (
m.get("content")
if isinstance(m.get("content"), str)
else list(m.get("content") or [])
),
}
for m in original_messages
]
retry_messages.append({"role": "user", "content": correction})
return retry_messages
def _generate_plan(
self,
record: EpisodeRecord, # noqa: ARG002 (kept for signature stability)
subtask_spans: Sequence[dict[str, Any]],
*,
refresh_t: float | None = None,
interjection: str | None = None, # noqa: ARG002
task: str | None = None, # noqa: ARG002
) -> str | None:
"""Deterministic plan = numbered list of *still-todo* subtasks.
Previously this called the VLM with a prompt that asked it to
compress the subtasks into a "compact hierarchical plan". That
produced longer-than-necessary plans, cost an extra VLM round-trip
per episode (plus one per interjection on refresh), and could
diverge from the actual subtask sequence the model is going to
execute. Replacing it with a plain summarisation keeps the plan
tightly aligned with the upcoming subtasks and removes the VLM
call entirely.
Layout — short imperative fragments prefixed by "N. ":
1. <subtask 1>
2. <subtask 2>
...
On a refresh at ``refresh_t`` (called from ``run_plan_updates``
on interjection events, and from ``run_episode`` at every subtask
boundary), only subtasks whose start is at or after ``refresh_t``
are included — the plan shrinks as work progresses, so it always
describes what's left.
"""
if not subtask_spans:
return None
remaining = [
s
for s in subtask_spans
if refresh_t is None or float(s.get("start", 0.0)) >= float(refresh_t)
]
if not remaining:
# Past the last subtask boundary on a late refresh — nothing
# left to plan; emit None so the caller skips the row.
return None
return "\n".join(
f"{i}. {span.get('text', '').strip()}" for i, span in enumerate(remaining, start=1)
)
def _generate_memory(
self,
record: EpisodeRecord,
prior_memory: str,
completed: str,
remaining: Sequence[str],
*,
task: str | None = None,
) -> str:
prompt = load_prompt("module_1_memory").format(
episode_task=(task if task is not None else record.episode_task),
prior_memory=prior_memory or "(none)",
completed_subtask=completed,
remaining_subtasks=", ".join(remaining) if remaining else "(none)",
vocabulary_block=self._memory_vocabulary_block(),
)
memory = self._vlm_field(self._text_message(prompt), "memory")
return memory.strip() if isinstance(memory, str) else ""

View File

@@ -1,33 +0,0 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Prompt templates loaded as plain text.
One file per use site. Templates use ``str.format(**vars)`` substitution; we
intentionally avoid jinja2 here so the templates remain inspectable in
plain editors and roundtrip cleanly through ``ruff format``.
"""
from __future__ import annotations
from pathlib import Path
_DIR = Path(__file__).parent
def load(name: str) -> str:
"""Read prompt template ``name.txt`` from the ``prompts/`` directory."""
path = _DIR / f"{name}.txt"
return path.read_text(encoding="utf-8")

View File

@@ -1,53 +0,0 @@
You are inspecting {n_episodes} sample episode video(s) from a teleoperated
robot dataset. Every episode in the dataset performs the SAME task; the
user originally asked: "{episode_task}".
Watch all the clips and produce a SHORT canonical vocabulary that every
episode in this dataset will reuse. The downstream low-level policy is
conditioned on these strings — duplicate phrasings (e.g. "grasp blue
cube" vs "pick up the blue cube") would destroy the conditioning, so
pick one wording per concept and reuse it everywhere.
Decide how many entries each list needs YOURSELF based on what you see —
the smallest set that still covers every recurring phase in the demos.
A simple two-object pick-and-place might need ~6 subtask labels and 2
memory milestones; a long multi-step recipe needs more. Err on the side
of FEWER — extra entries that don't recur across episodes weaken the
conditioning.
You output two lists:
1. `subtasks`: imperative, telegraphic commands the robot can execute.
- Verb-first. Drop articles, adverbs, qualifiers.
- Consistent object nouns (if the task says "cube", every subtask says
"cube" — never "block" / "object").
- Atomic — one skill per subtask (gripper-open events, contact, regrasps,
transitions all become cut points).
- Each label must recur across the demos. If you see a motion only
once across all sample clips, it probably isn't a canonical phase.
- Good: "move to blue cube", "grasp blue cube", "lift blue cube",
"place blue cube in box", "release blue cube", "retract arm".
- Bad: "the robot arm moves towards the blue cube" (third person,
too long), "carefully pick up the cube" (adverb, article),
"carrying the yellow cube over the green basket" (gerund — should
be imperative "transport yellow cube to green basket").
2. `memory_milestones`: first-person past-tense sentences the running
memory composes from. Each subtask phase that produces a lasting
change should have a milestone; transient motions (move, retract)
should NOT.
- First person, past tense. Start with "I".
- One sentence. Functional outcome only — no grasp / motion detail.
- Good: "I picked up the blue cube.", "I placed the blue cube in
the green box.", "I wiped the counter."
- Bad: "The robot arm grasped the blue cube." (third person),
"I carefully grasped the blue cube with the parallel gripper."
(irrelevant detail), "I moved towards the blue cube." (transient
motion — should be omitted, not memorialised).
Output strictly valid JSON of shape:
{{
"subtasks": ["<verb phrase>", ...],
"memory_milestones": ["I <past-tense sentence>.", ...]
}}

View File

@@ -1,36 +0,0 @@
You are updating the robot's compressed semantic memory at the boundary of
a completed subtask.
Reference (verbatim from MEM, Torne 2026):
"Remove or compress information in the language memory whenever
appropriate. Keep ONLY the minimal set of relevant information for future
task execution. Specific object attributes (colors, precise quantities of
each item) get discarded when their details won't affect subsequent
actions. Functional outcomes (where items went, how many) are preserved."
Episode task: "{episode_task}"
Previous memory: {prior_memory}
Just-completed subtask: "{completed_subtask}"
Remaining subtasks (for relevance judgement only): {remaining_subtasks}
{vocabulary_block}Write the memory as a short FIRST-PERSON, PAST-TENSE narrative of what the
robot has accomplished so far — the running story it would tell itself.
Authoring rules:
- First person, past tense. Every sentence starts with "I": "I picked
up...", "I opened...", "I moved to...".
- One or two short sentences. Extend the previous memory with the
just-completed subtask; do not rewrite it from scratch.
- Keep WHAT happened (functional outcomes — where items went, how many),
drop HOW (grasp details, motions).
- Compress completed steps and drop object attributes (colors, exact
counts) once they no longer affect the remaining subtasks.
Example (MEM, Torne 2026):
Before: "I prepared the pot and got the potatoes, milk, and butter. I
moved to the drawer."
After: "I prepared the pot and got the ingredients. I opened the
drawer with the masher."
Output strictly valid JSON:
{{ "memory": "<one or two short first-person past-tense sentences>" }}

View File

@@ -1,46 +0,0 @@
You are labeling a teleoperated robot demonstration.
The user originally asked: "{episode_task}"
You are shown the entire demonstration as a single video. Watch the
whole clip, then segment it into a list of consecutive atomic subtasks
the robot performs.
{vocabulary_block}Authoring rules — Hi Robot atom granularity, pi0.7-style short prompts:
- Each subtask = one atomic skill the low-level policy can execute.
- Write each subtask as an IMPERATIVE COMMAND, starting with a verb:
move, reach, pick up, grasp, place, put, push, pull, open, close,
turn, press, lift, insert, pour...
- Keep it SHORT — a verb phrase, not a sentence. Drop articles
("the", "a") and adverbs ("carefully", "slowly"). Add a "how"
detail (which hand, which grasp point) ONLY when it is needed to
disambiguate.
- NEVER use third person. Never write "the robot", "the arm", "the
gripper moves", "it picks up" — the robot is implied. Command it,
do not describe it.
- Use the exact object nouns from the task above. If the task says
"cube", every subtask says "cube" — never switch to "block". If it
says "box", never switch to "bin"/"container". Keep vocabulary
consistent across the whole episode.
- Good: "move to blue cube", "grasp blue cube", "lift blue cube",
"place blue cube in box", "open drawer", "release yellow cube".
- Bad: "the robot arm moves towards the blue cube" (third person,
too long), "carefully pick up the cube" (adverb, article),
"release the yellow block" ("block" when the task said "cube").
- Subtasks are non-overlapping and cover the full episode in order.
Choose the cut points yourself based on what you see in the video
(gripper open/close events, contact, regrasps, transitions).
- Each subtask spans at least {min_subtask_seconds} seconds.
- Do not exceed {max_steps} subtasks total.
- Every subtask's [start_time, end_time] must lie within
[0.0, {episode_duration}] seconds.
Output strictly valid JSON of shape:
{{
"subtasks": [
{{"text": "<short imperative verb phrase>", "start": <float>, "end": <float>}},
...
]
}}

View File

@@ -1,32 +0,0 @@
You are generating training data for a Hi Robot-style policy. We need
{n} alternative phrasings of the same robot task so the policy sees
diverse user prompts during training instead of the same canonical
string repeated every frame.
Original task:
"{base_task}"
Generate exactly {n} alternative phrasings of the same task. Vary:
- formality (casual / polite / curt)
- verbosity (mostly short imperative; occasional polite request)
- word choice (synonyms, different verbs)
- sentence structure (imperative / question / suggestion)
Hard rules:
- Each phrasing MUST preserve the exact meaning of the original task.
Do not change which object is involved, the destination, or the
action. Do not add extra steps. Do not invent new objects.
- Each phrasing must be a short phrase or sentence, plain prose, no
markdown, no quotes, no list numbers.
- Phrasings must be distinct — no near-duplicates.
- Output exactly {n} entries.
Output strictly valid JSON:
{{
"rephrasings": [
"<phrasing 1>",
"<phrasing 2>",
...
]
}}

View File

@@ -1,17 +0,0 @@
The video above shows a robot manipulation episode in full. Look at
the entire video and describe in ONE concise sentence what the robot
is doing.
Rules:
- One sentence, in natural English, like a user instruction.
- Capture the goal of the demonstration, not low-level motions.
Example: "place the yellow cube into the red bin" — not "move the
end-effector down 5cm and close the gripper".
- 4 to 15 words. Plain prose, no markdown, no bullets, no quotes.
- Do not invent objects or actions that aren't visible.
- Do not output anything other than the JSON object below.
Output strictly valid JSON:
{{
"task": "<single concise sentence describing what the robot does in this video>"
}}

View File

@@ -1,12 +0,0 @@
The user just asked the robot: "{episode_task}".
Generate a short verbal acknowledgement the robot would speak back before
beginning the task. Style: compact, confident, friendly.
Examples (Hi Robot, Shi 2025): "Sure, I won't put cheese on it.",
"OK, starting with the sponge.", "Got it.".
Prefer very short replies: "Got it.", "On it.", "OK."
Output strictly valid JSON:
{{ "text": "<the spoken acknowledgement>" }}

View File

@@ -1,46 +0,0 @@
You are generating training data for a Hi Robot-style hierarchical
robot policy. The robot in this demonstration has ALREADY executed
every step shown in the video — we cannot retroactively change the
action stream. To keep training data consistent with the video, the
"interjection" must align with what the robot is *about to do next* in
the demonstration, framed as a natural mid-task user request.
The episode's overall task: "{episode_task}".
The images above show roughly {window_seconds:.1f} seconds straddling a
subtask boundary in the demonstration:
- Subtask the robot just finished: "{prev_subtask}"
- Subtask the robot is about to start: "{next_subtask}"
- Time into episode: {timestamp:.2f}s
Write ONE compact interjection the user would naturally say at this
moment to prompt / confirm / encourage the robot to do "{next_subtask}".
Keep it like a mid-task coaching cue, not a full instruction paragraph.
Also write the robot's compact verbal acknowledgement.
Hard rules:
- The interjection MUST be consistent with the next subtask. The user
cannot ask for something different from what the robot then does in
the video. If you're tempted to say "actually skip X" or "do Y
instead", DO NOT — those would contradict the demonstration.
- The interjection must reference an object, location, or action that
is plausible given the visible scene and the next subtask text.
- One short phrase or sentence each. Conversational, not robotic.
- Prefer direct cues: "{next_subtask}, please."; "Now {next_subtask}."
- Keep robot speech very short: "OK.", "On it.", "Doing that."
Style examples (vary the phrasing — don't reuse these verbatim):
- "Now go ahead and {next_subtask}."
- "Great, can you {next_subtask} next?"
- "{next_subtask}, please."
- "Before you continue, please {next_subtask}."
- "Looking good — {next_subtask} now."
- "Okay, {next_subtask}."
Output strictly valid JSON:
{{
"interjection": "<short cue from the user, asking for the next subtask>",
"speech": "<short robot acknowledgement>"
}}

View File

@@ -1,32 +0,0 @@
You are generating a frame-grounded visual question/answer pair for
chain-of-thought training. Reference: ECoT (Zawalski 2024) and Steerable
Policies — both train policies on grounded features such as bounding box
pixel coordinates, keypoints, counts, attributes, and spatial relations.
The frame shows a robot working on: "{episode_task}".
Question types and the EXACT answer JSON shape required for each:
bbox => {{"detections": [{{"label": "<obj>", "bbox_format": "xyxy",
"bbox": [x1, y1, x2, y2]}}, ...]}}
bbox is in pixel coordinates (x_min, y_min, x_max, y_max).
ECoT example: "a white cup [124, 25, 176, 113]".
keypoint => {{"label": "<point>", "point_format": "xy",
"point": [x, y]}}
count => {{"label": "<obj>", "count": <int>,
"note": "<optional short note>"}}
attribute => {{"label": "<obj>", "attribute": "<color|shape|state|...>",
"value": "<observed value>"}}
spatial => {{"subject": "<obj>", "relation": "<left_of|right_of|on|in|"
"above|below|near>", "object": "<obj>"}}
Generate a question of type "{question_type}". Output strictly valid JSON:
{{
"question": "<short, frame-grounded question>",
"answer": <object whose shape matches the schema above>
}}

View File

@@ -1,274 +0,0 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Datatrove-shaped reader.
The reader walks ``data/chunk-*/file-*.parquet`` and yields one record per
episode containing:
- ``episode_index``: int
- ``frame_timestamps``: tuple[float, ...]
- ``frame_indices``: tuple[int, ...]
- ``episode_task``: str (canonical task from ``meta/tasks.parquet``)
- ``data_path``: pathlib.Path of the source parquet shard
- ``frames_df``: pandas.DataFrame slice for the episode (only loaded on demand)
This shape lets each module operate per-episode without loading all parquet
rows into memory at once.
"""
from __future__ import annotations
from collections.abc import Iterator, Sequence
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
import pyarrow.parquet as pq
from lerobot.datasets.io_utils import load_tasks
from lerobot.datasets.utils import DEFAULT_TASKS_PATH
@dataclass
class EpisodeRecord:
"""Per-episode record yielded by the reader."""
episode_index: int
episode_task: str
frame_timestamps: tuple[float, ...]
frame_indices: tuple[int, ...]
data_path: Path
row_offset: int # row offset within the parquet file where this episode starts
row_count: int # number of rows for this episode
# Memoized parquet slice — populated on first ``frames_df()`` call so
# repeat queries from different modules don't re-read the whole shard.
_frames_df_cache: Any = field(default=None, init=False, repr=False, compare=False)
def frames_df(self): # type: ignore[no-untyped-def]
"""Lazy-load the pandas slice for this episode (memoized)."""
if self._frames_df_cache is None:
import pandas as pd # noqa: PLC0415 - deferred for optional dataset extra
table = pq.read_table(self.data_path)
df: pd.DataFrame = table.to_pandas()
self._frames_df_cache = df.iloc[self.row_offset : self.row_offset + self.row_count].reset_index(
drop=True
)
return self._frames_df_cache
def reconstruct_subtask_spans(
rows: Sequence[dict[str, Any]],
*,
episode_end_t: float | None = None,
) -> list[dict[str, Any]]:
"""Turn ``style="subtask"`` rows into ``{text, start, end}`` spans.
Each span's ``end`` is the next span's ``start``. The final span's
``end`` defaults to its own ``start`` (zero-duration) — pass
``episode_end_t`` to extend it to the episode's last frame instead,
which is what downstream consumers (memory, interjection boundary
selection) expect.
Used by the ``plan`` module (plan-update pass) and the
``interjections`` module (interjection anchoring), which both need the
same span shape.
"""
sorted_rows = sorted(
(r for r in rows if r.get("style") == "subtask"),
key=lambda r: float(r["timestamp"]),
)
spans: list[dict[str, Any]] = []
for r in sorted_rows:
t = float(r["timestamp"])
if spans:
spans[-1]["end"] = t
spans.append({"text": r.get("content") or "", "start": t, "end": t})
if spans and episode_end_t is not None and float(episode_end_t) > spans[-1]["start"]:
spans[-1]["end"] = float(episode_end_t)
return spans
def snap_to_frame(t: float, frame_timestamps: Sequence[float]) -> float:
"""Snap an arbitrary float to the nearest exact source frame timestamp.
Modules use this when emitting event-style rows so the row's
timestamp matches a real parquet frame: event rows must land on an
exact frame, otherwise the per-frame event lookup the writer does
would never match them.
"""
if not frame_timestamps:
return float(t)
nearest = min(frame_timestamps, key=lambda f: abs(f - t))
return float(nearest)
def _load_tasks_lookup(root: Path) -> dict[int, str]:
"""Map ``task_index -> task`` from ``meta/tasks.parquet``.
Returns an empty dict when the file is absent — the task description is
derived later from the video if needed. Reuses the library-level
:func:`lerobot.datasets.io_utils.load_tasks`, which returns the tasks
frame indexed by task string with a ``task_index`` column.
"""
if not (root / DEFAULT_TASKS_PATH).exists():
return {}
tasks = load_tasks(root)
return {int(idx): str(task) for task, idx in zip(tasks.index, tasks["task_index"], strict=True)}
def iter_episodes(root: Path, *, only_episodes: tuple[int, ...] | None = None) -> Iterator[EpisodeRecord]:
"""Yield :class:`EpisodeRecord` for every episode under ``root/data/``.
Episodes are yielded in ascending ``episode_index`` order. The reader does
not assume a specific chunk/file layout: it scans every ``*.parquet``
under ``data/`` and groups by ``episode_index``.
"""
tasks = _load_tasks_lookup(root)
data_dir = root / "data"
parquet_files = sorted(data_dir.rglob("*.parquet"))
only_set = set(only_episodes) if only_episodes is not None else None
for path in parquet_files:
yield from _iter_one_path(path, tasks, only_set)
def _iter_one_path(path: Path, tasks: dict[int, str], only_set: set[int] | None) -> Iterator[EpisodeRecord]:
table = pq.read_table(path)
names = table.column_names
if "episode_index" not in names:
return
episode_col = table.column("episode_index").to_pylist()
timestamp_col = (
table.column("timestamp").to_pylist() if "timestamp" in names else [0.0] * len(episode_col)
)
frame_col = (
table.column("frame_index").to_pylist() if "frame_index" in names else list(range(len(episode_col)))
)
task_col = table.column("task_index").to_pylist() if "task_index" in names else None
def _build(
ep: int,
start: int,
end: int,
task_idx: int | None,
ts_buf: list[float],
fi_buf: list[int],
) -> EpisodeRecord | None:
if only_set is not None and ep not in only_set:
return None
task = tasks.get(task_idx, "") if task_idx is not None else ""
return EpisodeRecord(
episode_index=ep,
episode_task=task,
frame_timestamps=tuple(ts_buf),
frame_indices=tuple(fi_buf),
data_path=path,
row_offset=start,
row_count=end - start,
)
cur_ep: int | None = None
start_offset = 0
ts_buf: list[float] = []
fi_buf: list[int] = []
cur_task_idx: int | None = None
for i, ep in enumerate(episode_col):
if cur_ep is None:
cur_ep = ep
start_offset = i
ts_buf = [timestamp_col[i]]
fi_buf = [frame_col[i]]
cur_task_idx = task_col[i] if task_col is not None else None
continue
if ep != cur_ep:
rec = _build(cur_ep, start_offset, i, cur_task_idx, ts_buf, fi_buf)
if rec is not None:
yield rec
cur_ep = ep
start_offset = i
ts_buf = [timestamp_col[i]]
fi_buf = [frame_col[i]]
cur_task_idx = task_col[i] if task_col is not None else None
else:
ts_buf.append(timestamp_col[i])
fi_buf.append(frame_col[i])
if cur_ep is not None:
rec = _build(cur_ep, start_offset, len(episode_col), cur_task_idx, ts_buf, fi_buf)
if rec is not None:
yield rec
def gather_data_paths(root: Path) -> list[Path]:
"""Return every ``data/chunk-*/file-*.parquet`` path under ``root``."""
return sorted((root / "data").rglob("*.parquet"))
def episode_offsets_per_path(path: Path) -> dict[int, tuple[int, int]]:
"""Return ``{episode_index: (row_offset, row_count)}`` for one parquet."""
table = pq.read_table(path, columns=["episode_index"])
episode_col = table.column("episode_index").to_pylist()
out: dict[int, tuple[int, int]] = {}
cur_ep: int | None = None
start = 0
for i, ep in enumerate(episode_col):
if cur_ep is None:
cur_ep = ep
start = i
continue
if ep != cur_ep:
out[cur_ep] = (start, i - start)
cur_ep = ep
start = i
if cur_ep is not None:
out[cur_ep] = (start, len(episode_col) - start)
return out
def keyframe_indices(record: EpisodeRecord, k: int) -> list[int]:
"""Return ``k`` evenly spaced row indices into the episode (relative)."""
n = record.row_count
if k <= 0 or n == 0:
return []
if k >= n:
return list(range(n))
step = (n - 1) / (k - 1) if k > 1 else 0.0
return [int(round(i * step)) for i in range(k)] if k > 1 else [n // 2]
def lookup_data_path(root: Path, episode_index: int) -> tuple[Path, int, int] | None:
"""Find the parquet file containing ``episode_index`` and its slice bounds."""
for path in gather_data_paths(root):
offsets = episode_offsets_per_path(path)
if episode_index in offsets:
start, count = offsets[episode_index]
return path, start, count
return None
def episode_frame_timestamps(root: Path, episode_index: int) -> tuple[Any, list[float]]:
"""Return the parquet path and per-frame timestamps for ``episode_index``."""
found = lookup_data_path(root, episode_index)
if found is None:
raise ValueError(f"Episode {episode_index} not found under {root}/data/")
path, start, count = found
table = pq.read_table(path, columns=["timestamp"])
timestamps = table.column("timestamp").to_pylist()[start : start + count]
return path, [float(t) for t in timestamps]

View File

@@ -1,104 +0,0 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Per-episode staging.
Each module writes its raw output as a JSONL file under
``<staging_dir>/episode_{ep:06d}/<module>.jsonl``. The writer reads back this
staging tree and partitions rows into the two language columns.
JSONL is preferred over parquet here because the staging artifact is meant to
be human-inspectable, easy to diff between prompt iterations, and trivially
appended to. The final dataset format is parquet; staging is just an
intermediate.
"""
from __future__ import annotations
import json
from collections.abc import Iterable, Iterator
from dataclasses import dataclass
from pathlib import Path
from typing import Any
ModuleName = str
_MODULES: tuple[ModuleName, ...] = (
"plan",
"interjections",
"vqa",
)
@dataclass
class EpisodeStaging:
"""Filesystem layout for a single episode's staged module outputs."""
root: Path
episode_index: int
@property
def episode_dir(self) -> Path:
return self.root / f"episode_{self.episode_index:06d}"
def path_for(self, module: ModuleName) -> Path:
if module not in _MODULES:
raise ValueError(f"Unknown module {module!r}; expected one of {_MODULES}")
return self.episode_dir / f"{module}.jsonl"
def write(self, module: ModuleName, rows: Iterable[dict[str, Any]]) -> Path:
path = self.path_for(module)
path.parent.mkdir(parents=True, exist_ok=True)
# Atomic replace: a crash mid-write would otherwise leave a
# half-written JSONL file that ``read()`` would then fail to
# parse. Write to a sibling .tmp and rename so the target path
# only ever points at a complete file.
tmp_path = path.with_suffix(path.suffix + ".tmp")
with tmp_path.open("w", encoding="utf-8") as f:
for row in rows:
f.write(json.dumps(row, ensure_ascii=False, sort_keys=True))
f.write("\n")
tmp_path.replace(path)
return path
def read(self, module: ModuleName) -> list[dict[str, Any]]:
path = self.path_for(module)
if not path.exists():
return []
out: list[dict[str, Any]] = []
with path.open(encoding="utf-8") as f:
for line in f:
line = line.strip()
if line:
out.append(json.loads(line))
return out
def read_all(self) -> dict[ModuleName, list[dict[str, Any]]]:
return {m: self.read(m) for m in _MODULES}
def has(self, module: ModuleName) -> bool:
return self.path_for(module).exists()
def iter_staged_episodes(root: Path) -> Iterator[int]:
"""Yield episode indices for which any staging artifact exists."""
if not root.exists():
return
for child in sorted(root.iterdir()):
if child.is_dir() and child.name.startswith("episode_"):
try:
yield int(child.name.removeprefix("episode_"))
except ValueError:
continue

View File

@@ -1,334 +0,0 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Pre-write validation against staged outputs.
Runs after all three modules have written their per-episode artifacts but
*before* the writer rewrites parquet shards. The validator never touches
parquet; it only inspects the staging tree and the source frame timestamps
exposed by :class:`EpisodeRecord`.
Checks (per the plan's "Intermediate staging and validation" section):
- exact timestamp alignment against source frame timestamps
- no orphan speech / interjection pairs
- plan / memory emission consistency (events have a paired persistent row)
- VQA assistant ``content`` is valid JSON (one of bbox / keypoint / count /
attribute / spatial)
- every row maps to its correct column under :func:`column_for_style`
"""
from __future__ import annotations
import json
import logging
from collections.abc import Iterable, Sequence
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
from lerobot.datasets.language import (
LANGUAGE_EVENTS,
LANGUAGE_PERSISTENT,
column_for_style,
is_view_dependent_style,
validate_camera_field,
)
from .reader import EpisodeRecord
from .staging import EpisodeStaging
logger = logging.getLogger(__name__)
@dataclass
class ValidationReport:
"""Outcome of one validation pass across all episodes."""
errors: list[str] = field(default_factory=list)
warnings: list[str] = field(default_factory=list)
episodes_checked: int = 0
@property
def ok(self) -> bool:
return not self.errors
def add_error(self, message: str) -> None:
self.errors.append(message)
def add_warning(self, message: str) -> None:
self.warnings.append(message)
def summary(self) -> str:
return f"checked={self.episodes_checked} errors={len(self.errors)} warnings={len(self.warnings)}"
VQA_ANSWER_SHAPES: dict[str, set[str]] = {
"bbox": {"detections"},
"keypoint": {"label", "point_format", "point"},
"count": {"label", "count"},
"attribute": {"label", "attribute", "value"},
"spatial": {"subject", "relation", "object"},
}
def classify_vqa_answer(payload: Any) -> str | None:
"""Best-effort classification of a VQA answer payload to a question type."""
if not isinstance(payload, dict):
return None
keys = set(payload.keys())
for kind, required in VQA_ANSWER_SHAPES.items():
if required.issubset(keys):
return kind
return None
@dataclass
class StagingValidator:
"""Walks the staging tree and produces a :class:`ValidationReport`."""
timestamp_atol: float = 0.0 # exact-match by default
dataset_camera_keys: tuple[str, ...] | None = None
"""Known ``observation.images.*`` keys on the dataset. When set, the
validator additionally enforces that every view-dependent row's
``camera`` field references one of these keys. Pass ``None`` (default)
to skip that cross-check (e.g. in unit tests with no real dataset)."""
def validate(
self,
records: Sequence[EpisodeRecord],
staging_dir: Path,
) -> ValidationReport:
report = ValidationReport()
for record in records:
self._validate_episode(record, staging_dir, report)
report.episodes_checked += 1
return report
def _validate_episode(
self,
record: EpisodeRecord,
staging_dir: Path,
report: ValidationReport,
) -> None:
staging = EpisodeStaging(staging_dir, record.episode_index)
staged = staging.read_all()
all_rows: list[dict[str, Any]] = []
for module_name, rows in staged.items():
for row in rows:
row = {**row, "_module": module_name}
all_rows.append(row)
frame_ts = set(record.frame_timestamps)
events: list[dict[str, Any]] = []
persistent: list[dict[str, Any]] = []
for row in all_rows:
self._check_column_routing(row, report, record.episode_index)
self._check_camera_field(
row, report, record.episode_index, self.dataset_camera_keys
)
if column_for_style(row.get("style")) == LANGUAGE_PERSISTENT:
persistent.append(row)
else:
events.append(row)
for row in events:
self._check_event_timestamp_alignment(row, frame_ts, report, record.episode_index)
self._check_speech_interjection_pairs(events, report, record.episode_index)
self._check_plan_memory_consistency(persistent, events, report, record.episode_index)
self._check_vqa_json(events, report, record.episode_index)
self._check_vqa_uniqueness_per_frame_camera(events, report, record.episode_index)
def _check_camera_field(
self,
row: dict[str, Any],
report: ValidationReport,
episode_index: int,
dataset_camera_keys: Sequence[str] | None,
) -> None:
"""Enforce the camera invariant + that the key matches the dataset's cameras."""
style = row.get("style")
camera = row.get("camera")
try:
validate_camera_field(style, camera)
except ValueError as exc:
report.add_error(
f"ep={episode_index} module={row.get('_module')}: {exc}"
)
return
if (
is_view_dependent_style(style)
and dataset_camera_keys
and camera not in dataset_camera_keys
):
report.add_error(
f"ep={episode_index} module={row.get('_module')}: camera {camera!r} on style "
f"{style!r} is not one of the dataset's video keys {sorted(dataset_camera_keys)!r}"
)
def _check_vqa_uniqueness_per_frame_camera(
self,
events: Iterable[dict[str, Any]],
report: ValidationReport,
episode_index: int,
) -> None:
"""Ensure at most one (vqa, user) and one (vqa, assistant) per (t, camera)."""
counts: dict[tuple[float, str, str], int] = {}
for row in events:
if row.get("style") != "vqa":
continue
ts = row.get("timestamp")
camera = row.get("camera")
role = row.get("role")
if ts is None or camera is None or role is None:
continue # other validators flag these
key = (float(ts), str(camera), str(role))
counts[key] = counts.get(key, 0) + 1
for (ts, camera, role), n in counts.items():
if n > 1:
report.add_error(
f"ep={episode_index}: {n} duplicate vqa rows at t={ts} "
f"camera={camera!r} role={role!r}; expected at most one per (t, camera, role)"
)
def _check_column_routing(
self,
row: dict[str, Any],
report: ValidationReport,
episode_index: int,
) -> None:
style = row.get("style")
module = row.get("_module")
try:
target_col = column_for_style(style)
except ValueError:
report.add_error(f"ep={episode_index} module={module}: unknown style {style!r}")
return
if module == "plan" and target_col != LANGUAGE_PERSISTENT:
report.add_error(
f"ep={episode_index} module=plan emitted style {style!r} that routes to {target_col} (must be persistent)"
)
if module in {"interjections", "vqa"} and target_col != LANGUAGE_EVENTS:
report.add_error(
f"ep={episode_index} module={module} emitted style {style!r} that routes to {target_col} (must be events)"
)
def _check_event_timestamp_alignment(
self,
row: dict[str, Any],
frame_ts: set[float],
report: ValidationReport,
episode_index: int,
) -> None:
ts = row.get("timestamp")
if ts is None:
report.add_error(f"ep={episode_index}: event row missing timestamp: {row!r}")
return
if self.timestamp_atol == 0.0:
if float(ts) not in frame_ts:
report.add_error(
f"ep={episode_index}: event row timestamp {ts!r} does not match any source frame timestamp"
)
else:
if not any(abs(float(ts) - f) <= self.timestamp_atol for f in frame_ts):
report.add_error(
f"ep={episode_index}: event row timestamp {ts!r} not within {self.timestamp_atol}s of any frame"
)
def _check_speech_interjection_pairs(
self,
events: Iterable[dict[str, Any]],
report: ValidationReport,
episode_index: int,
) -> None:
speech_ts: dict[float, int] = {}
interjection_ts: dict[float, int] = {}
for row in events:
ts = row.get("timestamp")
if ts is None:
continue
ts_f = float(ts)
if row.get("style") is None and row.get("role") == "assistant":
speech_ts[ts_f] = speech_ts.get(ts_f, 0) + 1
if row.get("style") == "interjection":
interjection_ts[ts_f] = interjection_ts.get(ts_f, 0) + 1
for ts in interjection_ts:
if ts not in speech_ts:
report.add_error(f"ep={episode_index}: interjection at t={ts} has no paired speech atom")
def _check_plan_memory_consistency(
self,
persistent: Sequence[dict[str, Any]],
events: Sequence[dict[str, Any]],
report: ValidationReport,
episode_index: int,
) -> None:
plan_ts = sorted({float(r["timestamp"]) for r in persistent if r.get("style") == "plan"})
memory_ts = sorted({float(r["timestamp"]) for r in persistent if r.get("style") == "memory"})
subtask_ts = sorted({float(r["timestamp"]) for r in persistent if r.get("style") == "subtask"})
interjection_ts = sorted(
{
float(r["timestamp"])
for r in events
if r.get("style") == "interjection" and r.get("timestamp") is not None
}
)
if persistent and not plan_ts:
report.add_warning(f"ep={episode_index}: persistent rows present but no plan emitted")
# every interjection should have a same-timestamp plan refresh
for ts in interjection_ts:
if ts not in set(plan_ts):
report.add_error(
f"ep={episode_index}: interjection at t={ts} has no co-timestamped plan update"
)
# memory should be emitted at subtask boundaries (subset relation)
if memory_ts and subtask_ts:
mem_set = set(memory_ts)
sub_set = set(subtask_ts)
stray = sorted(mem_set - sub_set)
if stray:
report.add_warning(f"ep={episode_index}: memory rows at {stray} not at any subtask boundary")
def _check_vqa_json(
self,
events: Iterable[dict[str, Any]],
report: ValidationReport,
episode_index: int,
) -> None:
for row in events:
if row.get("style") != "vqa" or row.get("role") != "assistant":
continue
content = row.get("content")
if content is None:
report.add_error(
f"ep={episode_index}: VQA assistant row at t={row.get('timestamp')} has null content"
)
continue
try:
payload = json.loads(content)
except (TypeError, ValueError) as exc:
report.add_error(
f"ep={episode_index}: VQA assistant content not valid JSON at t={row.get('timestamp')}: {exc}"
)
continue
shape = classify_vqa_answer(payload)
if shape is None:
report.add_error(
f"ep={episode_index}: VQA assistant payload at t={row.get('timestamp')} does not match any known shape: keys={list(payload) if isinstance(payload, dict) else type(payload).__name__}"
)

View File

@@ -1,703 +0,0 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Shared Qwen-VL client.
The pipeline uses a single shared VLM across modules. vLLM is preferred when
available (high throughput, JSON-guided decoding); transformers is the
fallback. A ``stub`` backend is used for unit tests so fixtures never call
into a real model.
The client speaks one method, :meth:`VlmClient.generate_json`, which:
- accepts a list of OpenAI/HF-style multimodal messages,
- requests JSON output (``json_mode=True`` enables guided decoding when the
backend supports it),
- batches requests transparently,
- and reprompts once on a JSON parse failure with an inline correction
message before raising.
"""
from __future__ import annotations
import atexit
import base64
import io
import json
import os
import shlex
import signal
import subprocess
import sys
import threading
import time
import urllib.request
from collections.abc import Callable, Sequence
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Any, Protocol
from .config import VlmConfig
class VlmClient(Protocol):
"""Protocol every backend must implement."""
def generate_json(
self,
messages_batch: Sequence[Sequence[dict[str, Any]]],
*,
max_new_tokens: int | None = None,
temperature: float | None = None,
) -> list[Any]:
"""Generate one JSON-decoded response per messages list."""
@dataclass
class StubVlmClient:
"""Deterministic stub used in unit tests.
A test passes a callable that maps the *last user message text* (or, if
that is empty, the full message list) to a JSON-serializable response.
"""
responder: Callable[[Sequence[dict[str, Any]]], Any]
def generate_json(
self,
messages_batch: Sequence[Sequence[dict[str, Any]]],
*,
max_new_tokens: int | None = None,
temperature: float | None = None,
) -> list[Any]:
return [self.responder(list(messages)) for messages in messages_batch]
def _strip_to_json(text: str) -> Any:
text = text.strip()
# Strip <think>...</think> blocks (Qwen3 Thinking style)
while "<think>" in text and "</think>" in text:
start = text.find("<think>")
end = text.find("</think>", start) + len("</think>")
text = (text[:start] + text[end:]).strip()
# Strip ```json ... ``` fences from chat-tuned backbones
if text.startswith("```"):
first = text.find("\n")
last = text.rfind("```")
if first != -1 and last != -1 and last > first:
text = text[first + 1 : last].strip()
try:
return json.loads(text)
except (ValueError, json.JSONDecodeError):
pass
# Fall back to extracting the first balanced {...} block.
obj_text = _extract_first_json_object(text)
if obj_text is None:
raise json.JSONDecodeError("No JSON object found", text, 0)
return json.loads(obj_text)
def _extract_first_json_object(text: str) -> str | None:
"""Return the first balanced ``{...}`` substring, ignoring braces in
string literals. Returns ``None`` if no balanced block is found."""
start = text.find("{")
if start < 0:
return None
depth = 0
in_string = False
escape = False
for i in range(start, len(text)):
ch = text[i]
if escape:
escape = False
continue
if ch == "\\":
escape = True
continue
# Note: ``escape`` is always False here — the ``if escape`` branch
# above already handled and reset it.
if ch == '"':
in_string = not in_string
continue
if in_string:
continue
if ch == "{":
depth += 1
elif ch == "}":
depth -= 1
if depth == 0:
return text[start : i + 1]
return None
@dataclass
class _GenericTextClient:
"""Wraps any text-generation callable in JSON-mode + one-retry semantics."""
generate_text: Callable[[Sequence[Sequence[dict[str, Any]]], int, float], list[str]]
config: VlmConfig
def generate_json(
self,
messages_batch: Sequence[Sequence[dict[str, Any]]],
*,
max_new_tokens: int | None = None,
temperature: float | None = None,
) -> list[Any]:
max_tok = max_new_tokens if max_new_tokens is not None else self.config.max_new_tokens
temp = temperature if temperature is not None else self.config.temperature
raw = self.generate_text(messages_batch, max_tok, temp)
out: list[Any] = []
for messages, text in zip(messages_batch, raw, strict=True):
try:
out.append(_strip_to_json(text))
continue
except (ValueError, json.JSONDecodeError):
pass
retry = list(messages) + [
{"role": "assistant", "content": text},
{
"role": "user",
"content": (
"Your previous reply was not valid JSON. "
"Reply with strictly valid JSON, no prose, no fences."
),
},
]
retry_text = self.generate_text([retry], max_tok, temp)[0]
try:
out.append(_strip_to_json(retry_text))
except (ValueError, json.JSONDecodeError):
# After retry: log preview and return None instead of crashing
# the whole pipeline. Modules treat None as "skip".
preview = retry_text.strip().replace("\n", " ")[:200]
print(
f"[vlm] WARNING: failed to parse JSON after retry; preview: {preview!r}",
flush=True,
)
out.append(None)
return out
def make_vlm_client(config: VlmConfig) -> VlmClient:
"""Build the shared VLM client per the configured backend.
For ``stub``, callers should construct :class:`StubVlmClient` directly with
a responder callable. ``stub`` here is rejected to make accidental misuse
obvious.
"""
if config.backend == "stub":
raise ValueError(
"Use StubVlmClient(...) directly for the stub backend; make_vlm_client builds real clients."
)
if config.backend == "vllm":
return _make_vllm_client(config)
if config.backend == "transformers":
return _make_transformers_client(config)
if config.backend == "openai":
return _make_openai_client(config)
raise ValueError(f"Unknown VLM backend: {config.backend!r}")
def _make_vllm_client(config: VlmConfig) -> VlmClient:
try:
from vllm import LLM, SamplingParams # type: ignore[import-not-found]
except ImportError as exc:
raise ImportError(
"vllm is required for backend='vllm'. Install with `pip install lerobot[annotations]`."
) from exc
# Workaround for cuDNN 9.x + torch 2.8 conv3d regression that surfaces
# as CUDNN_STATUS_NOT_INITIALIZED in Qwen-VL vision-tower patch
# embedders. Setting LEROBOT_DISABLE_CUDNN=1 forces native PyTorch
# convolution kernels — slower but functional.
if os.environ.get("LEROBOT_DISABLE_CUDNN", "").lower() in {"1", "true", "yes"}:
import torch as _torch # noqa: PLC0415 - optional GPU dep, deferred
_torch.backends.cudnn.enabled = False
llm_kwargs: dict[str, Any] = {
"model": config.model_id,
"tensor_parallel_size": config.tensor_parallel_size,
"gpu_memory_utilization": config.gpu_memory_utilization,
"trust_remote_code": config.trust_remote_code,
}
if config.max_model_len is not None:
llm_kwargs["max_model_len"] = config.max_model_len
llm = LLM(**llm_kwargs)
def _gen(batch: Sequence[Sequence[dict[str, Any]]], max_tok: int, temp: float) -> list[str]:
# ``guided_decoding`` would speed up parsing but its API differs across
# vllm releases (dict vs GuidedDecodingParams). The _GenericTextClient
# wrapper already has a one-retry JSON-recovery path, so we skip it.
params = SamplingParams(max_tokens=max_tok, temperature=temp)
# ``llm.chat`` handles chat-template application + multimodal input
# extraction (image/video blocks) internally, which ``llm.generate``
# does not.
outputs = llm.chat([list(m) for m in batch], params)
return [o.outputs[0].text for o in outputs]
return _GenericTextClient(_gen, config)
def _make_transformers_client(config: VlmConfig) -> VlmClient:
try:
import torch # type: ignore[import-not-found]
import transformers # type: ignore[import-not-found]
from transformers import AutoProcessor # type: ignore[import-not-found]
except ImportError as exc:
raise ImportError("transformers + torch are required for backend='transformers'.") from exc
auto_cls = getattr(transformers, "AutoModelForImageTextToText", None) or getattr(
transformers, "AutoModelForVision2Seq", None
)
if auto_cls is None:
raise ImportError(
"Neither AutoModelForImageTextToText nor AutoModelForVision2Seq is available in this "
"transformers version. Install transformers>=4.45 (which has AutoModelForImageTextToText) "
"for VL models."
)
processor = AutoProcessor.from_pretrained(config.model_id, trust_remote_code=config.trust_remote_code)
use_accelerate = os.environ.get("LEROBOT_TRANSFORMERS_DEVICE_MAP", "manual") != "manual"
# ``device_map='auto'`` triggers a known std::bad_alloc on the Qwen3-VL
# post-load dispatch path (the alloc fails in accelerate's hook setup
# even with TBs of host RAM). Default to manual: load on CPU with
# ``low_cpu_mem_usage=True``, then ``.to("cuda")``. Set
# ``LEROBOT_TRANSFORMERS_DEVICE_MAP=auto`` to opt back into the old path.
if use_accelerate:
model = auto_cls.from_pretrained(
config.model_id,
torch_dtype="auto",
device_map="auto",
low_cpu_mem_usage=True,
trust_remote_code=config.trust_remote_code,
)
else:
import torch as _torch # noqa: PLC0415 - optional GPU dep, deferred
model = auto_cls.from_pretrained(
config.model_id,
torch_dtype=_torch.bfloat16,
low_cpu_mem_usage=True,
trust_remote_code=config.trust_remote_code,
)
model = model.to("cuda")
model.eval()
def _gen(batch: Sequence[Sequence[dict[str, Any]]], max_tok: int, temp: float) -> list[str]:
outs: list[str] = []
for messages in batch:
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[text], return_tensors="pt").to(model.device)
with torch.no_grad():
gen = model.generate(
**inputs,
max_new_tokens=max_tok,
temperature=temp,
do_sample=temp > 0.0,
)
decoded = processor.batch_decode(
gen[:, inputs["input_ids"].shape[-1] :], skip_special_tokens=True
)[0]
outs.append(decoded)
return outs
return _GenericTextClient(_gen, config)
def _make_openai_client(config: VlmConfig) -> VlmClient:
"""Backend that talks to any OpenAI-compatible server.
Compatible with ``vllm serve``, ``transformers serve``,
``ktransformers serve``, and hosted endpoints. By default the server
is expected to be already running. Set ``auto_serve=True`` to have
this client spawn one (default: ``transformers serve``), wait until
it's ready, and tear it down on process exit.
Image blocks ``{"type":"image", "image":<PIL.Image>}`` are
auto-converted to ``image_url`` data-URLs. Video blocks
``{"type":"video", "video":[<PIL>...]}`` are forwarded as
multi-frame ``video_url`` items where supported.
"""
try:
from openai import OpenAI # type: ignore[import-not-found]
except ImportError as exc:
raise ImportError(
"openai package is required for backend='openai'. Install with `pip install openai`."
) from exc
api_base = config.api_base
api_key = config.api_key
auto_serve = config.auto_serve
api_bases: list[str] = [api_base]
print(
f"[lerobot-annotate] backend=openai model={config.model_id} "
f"api_base={api_base} auto_serve={auto_serve}",
flush=True,
)
if auto_serve:
if config.parallel_servers > 1:
print(
f"[lerobot-annotate] spawning {config.parallel_servers} parallel servers",
flush=True,
)
api_bases = _spawn_parallel_inference_servers(config)
elif _server_is_up(api_base):
print(f"[lerobot-annotate] reusing server already up at {api_base}", flush=True)
else:
print("[lerobot-annotate] no server reachable; spawning one", flush=True)
api_base = _spawn_inference_server(config)
api_bases = [api_base]
print(f"[lerobot-annotate] server ready at {api_base}", flush=True)
clients = [OpenAI(base_url=base, api_key=api_key) for base in api_bases]
# round-robin counter for parallel mode
rr_counter = {"i": 0}
# ``mm_processor_kwargs`` is a vllm-specific extra; transformers serve
# rejects it with HTTP 422. Send it only when explicitly opted in via
# an env var (e.g. ``LEROBOT_OPENAI_SEND_MM_KWARGS=1`` for vllm).
send_mm_kwargs = os.environ.get("LEROBOT_OPENAI_SEND_MM_KWARGS", "").lower() in {"1", "true", "yes"}
rr_lock = threading.Lock()
def _one_call(messages: Sequence[dict[str, Any]], max_tok: int, temp: float) -> str:
api_messages, mm_kwargs = _to_openai_messages(messages)
kwargs: dict[str, Any] = {
"model": config.model_id,
"messages": api_messages,
"max_tokens": max_tok,
"temperature": temp,
}
extra_body: dict[str, Any] = {}
if send_mm_kwargs and mm_kwargs:
extra_body["mm_processor_kwargs"] = {**mm_kwargs, "do_sample_frames": True}
if config.chat_template_kwargs:
extra_body["chat_template_kwargs"] = config.chat_template_kwargs
if extra_body:
kwargs["extra_body"] = extra_body
with rr_lock:
chosen = clients[rr_counter["i"] % len(clients)]
rr_counter["i"] += 1
response = chosen.chat.completions.create(**kwargs)
return response.choices[0].message.content or ""
def _gen(batch: Sequence[Sequence[dict[str, Any]]], max_tok: int, temp: float) -> list[str]:
if len(batch) <= 1 or config.client_concurrency <= 1:
return [_one_call(messages, max_tok, temp) for messages in batch]
# Parallel fan-out — vllm batches these on the server side.
max_workers = min(config.client_concurrency, len(batch))
with ThreadPoolExecutor(max_workers=max_workers) as pool:
futures = [pool.submit(_one_call, messages, max_tok, temp) for messages in batch]
return [f.result() for f in futures]
return _GenericTextClient(_gen, config)
def _spawn_parallel_inference_servers(config: VlmConfig) -> list[str]:
"""Spawn ``config.parallel_servers`` independent vllm replicas.
Each replica:
- is pinned to a single GPU via ``CUDA_VISIBLE_DEVICES``
- listens on ``serve_port + i``
- is shut down via the same atexit hook as the single-server path
Returns the list of ``api_base`` URLs the client should round-robin
across.
"""
n = config.parallel_servers
api_bases: list[str] = []
procs: list[subprocess.Popen] = []
ready_events: list[threading.Event] = []
# Multiple readiness signals — uvicorn's own banner is suppressed at
# ``--uvicorn-log-level warning``, so we also accept vllm's own
# "Starting vLLM API server" line and the route-listing line. The
# HTTP probe below is the ultimate fallback.
ready_markers = (
"Uvicorn running",
"Application startup complete",
"Starting vLLM API server",
"Available routes are",
)
# Single lock for all server-stream threads so multibyte chars from
# different servers don't interleave and tear UTF-8 sequences.
print_lock = threading.Lock()
base_cmd = config.serve_command or (
f"vllm serve {shlex.quote(config.model_id)} "
f"--tensor-parallel-size 1 "
f"--max-model-len {config.max_model_len or 32768} "
f"--uvicorn-log-level warning"
)
num_gpus = config.num_gpus if config.num_gpus > 0 else n
for i in range(n):
port = config.serve_port + i
gpu = i % num_gpus
env = os.environ.copy()
env["CUDA_VISIBLE_DEVICES"] = str(gpu)
cmd = base_cmd.replace("{port}", str(port)) if "{port}" in base_cmd else f"{base_cmd} --port {port}"
api_base = f"http://localhost:{port}/v1"
api_bases.append(api_base)
print(f"[server-{i}] launching on GPU {gpu} port {port}: {cmd}", flush=True)
proc = subprocess.Popen(
shlex.split(cmd),
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT,
text=True,
bufsize=1,
env=env,
)
procs.append(proc)
ready = threading.Event()
ready_events.append(ready)
def _stream(idx: int, p: subprocess.Popen, ev: threading.Event) -> None:
# Read whole lines and emit each line atomically under the
# shared print_lock so output from N servers stays readable.
assert p.stdout is not None
for line in iter(p.stdout.readline, ""):
with print_lock:
sys.stdout.write(f"[server-{idx}] {line}")
if not line.endswith(("\n", "\r")):
sys.stdout.write("\n")
sys.stdout.flush()
if any(m in line for m in ready_markers):
ev.set()
threading.Thread(target=_stream, args=(i, proc, ready), daemon=True).start()
def _probe(idx: int, base: str, ev: threading.Event, p: subprocess.Popen) -> None:
while not ev.is_set() and p.poll() is None:
if _server_is_up(base):
print(f"[server-{idx}] ready (http probe)", flush=True)
ev.set()
return
time.sleep(2)
threading.Thread(target=_probe, args=(i, api_base, ready, proc), daemon=True).start()
def _shutdown() -> None:
for i, p in enumerate(procs):
if p.poll() is None:
print(f"[server-{i}] stopping pid={p.pid}", flush=True)
p.send_signal(signal.SIGINT)
for p in procs:
try:
p.wait(timeout=15)
except subprocess.TimeoutExpired:
p.kill()
p.wait(timeout=5)
atexit.register(_shutdown)
deadline = time.monotonic() + config.serve_ready_timeout_s
while any(not ev.is_set() for ev in ready_events) and time.monotonic() < deadline:
for i, p in enumerate(procs):
if p.poll() is not None:
raise RuntimeError(
f"[server-{i}] inference server exited unexpectedly with rc={p.returncode}"
)
time.sleep(2)
if any(not ev.is_set() for ev in ready_events):
raise RuntimeError(f"[server] not all replicas became ready within {config.serve_ready_timeout_s}s")
print(f"[lerobot-annotate] all {n} servers ready: {api_bases}", flush=True)
return api_bases
def _server_is_up(api_base: str) -> bool:
"""Return True if ``api_base/models`` answers 200 within 2 seconds."""
url = api_base.rstrip("/") + "/models"
# ``api_base`` is the user-configured local-server URL we just spawned
# or the user passed in via ``--vlm.api_base``; the bandit B310 warning
# is for arbitrary user-controlled URLs with file:/ schemes which
# cannot reach this code path.
try:
with urllib.request.urlopen(url, timeout=2) as resp: # noqa: S310 # nosec B310
return resp.status == 200
except Exception: # noqa: BLE001
return False
def _spawn_inference_server(config: VlmConfig) -> str:
"""Spawn ``transformers serve`` (or ``serve_command``), wait until it
accepts ``/v1/models``, and register a shutdown hook.
Streams the server's stdout/stderr to the parent terminal in
real-time on a background thread so users can see model-load
progress and errors as they happen.
Returns the full ``api_base`` URL the OpenAI client should use.
"""
cmd = config.serve_command
if not cmd:
cmd = (
f"transformers serve {shlex.quote(config.model_id)} "
f"--port {config.serve_port} --continuous-batching"
)
api_base = f"http://localhost:{config.serve_port}/v1"
print(f"[server] launching: {cmd}", flush=True)
proc = subprocess.Popen(
shlex.split(cmd),
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT,
text=True,
bufsize=1,
)
# Watch the server output for the uvicorn readiness banner. This is
# more reliable than polling /v1/models because transformers serve
# rescans its cache on every model-list request, which can exceed
# the urllib timeout and trigger an infinite probe loop.
ready_event = threading.Event()
# See _spawn_parallel_inference_servers for why we accept these.
ready_markers = (
"Uvicorn running",
"Application startup complete",
"Starting vLLM API server",
"Available routes are",
)
def _probe() -> None:
while not ready_event.is_set() and proc.poll() is None:
if _server_is_up(api_base):
print("[server] ready (http probe)", flush=True)
ready_event.set()
return
time.sleep(2)
threading.Thread(target=_probe, daemon=True).start()
def _stream_output() -> None:
# Read raw chunks instead of iterating lines so tqdm progress
# bars (which overwrite using \r) flush in real time.
assert proc.stdout is not None
buf = ""
prefix_started = False
while True:
ch = proc.stdout.read(1)
if ch == "":
# process exited; flush any tail
if buf:
sys.stdout.write(buf)
sys.stdout.flush()
return
if not prefix_started:
sys.stdout.write("[server] ")
prefix_started = True
sys.stdout.write(ch)
sys.stdout.flush()
buf += ch
if ch in ("\n", "\r"):
if any(marker in buf for marker in ready_markers):
ready_event.set()
buf = ""
prefix_started = False
threading.Thread(target=_stream_output, daemon=True).start()
def _shutdown() -> None:
if proc.poll() is None:
print(f"[server] stopping pid={proc.pid}", flush=True)
proc.send_signal(signal.SIGINT)
try:
proc.wait(timeout=15)
except subprocess.TimeoutExpired:
proc.kill()
proc.wait(timeout=5)
atexit.register(_shutdown)
deadline = time.monotonic() + config.serve_ready_timeout_s
while time.monotonic() < deadline:
if proc.poll() is not None:
raise RuntimeError(
f"[server] inference server exited unexpectedly with rc={proc.returncode}. "
f"See [server] log lines above for the cause."
)
if ready_event.wait(timeout=2):
return api_base
proc.terminate()
raise RuntimeError(f"[server] did not become ready within {config.serve_ready_timeout_s}s")
def _to_openai_messages(
messages: Sequence[dict[str, Any]],
) -> tuple[list[dict[str, Any]], dict[str, Any]]:
"""Convert internal messages to OpenAI chat format.
Returns ``(api_messages, mm_kwargs)``. Multimodal-processor kwargs
(``fps`` from ``video_url`` blocks) are extracted out so the caller
can pass them via ``extra_body.mm_processor_kwargs`` rather than
inside the content blocks (which transformers serve rejects).
File-URL video blocks are inlined as base64 data URLs.
"""
out_messages: list[dict[str, Any]] = []
mm_kwargs: dict[str, Any] = {}
for message in messages:
content = message.get("content")
if not isinstance(content, list):
out_messages.append({"role": message["role"], "content": content})
continue
out_blocks: list[dict[str, Any]] = []
for block in content:
block_type = block.get("type") if isinstance(block, dict) else None
if block_type == "text":
out_blocks.append({"type": "text", "text": block.get("text", "")})
elif block_type == "image":
out_blocks.append(
{"type": "image_url", "image_url": {"url": _pil_to_data_url(block["image"])}}
)
elif block_type == "video":
frames = block.get("video", [])
for img in frames:
out_blocks.append({"type": "image_url", "image_url": {"url": _pil_to_data_url(img)}})
elif block_type == "video_url":
video_url = dict(block["video_url"])
url = video_url.get("url", "")
if url.startswith("file://"):
video_url["url"] = _file_to_data_url(url[len("file://") :])
out_blocks.append({"type": "video_url", "video_url": video_url})
fps = block.get("fps")
if fps is not None:
mm_kwargs["fps"] = fps
else:
out_blocks.append(block)
out_messages.append({"role": message["role"], "content": out_blocks})
return out_messages, mm_kwargs
def _file_to_data_url(path: str) -> str:
"""Read a local video file and return a base64 ``data:video/mp4`` URL."""
with open(path, "rb") as f:
b64 = base64.b64encode(f.read()).decode("ascii")
return f"data:video/mp4;base64,{b64}"
def _pil_to_data_url(image: Any) -> str:
"""Encode a PIL.Image as a base64 data URL."""
buf = io.BytesIO()
image.save(buf, format="PNG")
b64 = base64.b64encode(buf.getvalue()).decode("ascii")
return f"data:image/png;base64,{b64}"
def _messages_to_prompt(messages: Sequence[dict[str, Any]]) -> Any:
"""Pass-through hook used by the vllm backend.
vllm exposes its own multimodal entry points that vary by version; for the
base flow we simply forward the raw message list and let the caller's
custom backend handle templating. Real deployments override this.
"""
return list(messages)

View File

@@ -1,222 +0,0 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Dataset-level canonical vocabulary discovery (Phase 0).
The downstream consumer of these annotations is a low-level action expert
conditioned on the ``subtask`` string. Free-form per-episode LLM rephrasing
gives near-unique strings per occurrence, which collapses the action
expert's conditioning to noise and makes runtime subtask-paraphrase drift
catastrophic. The Hi-Robot / π0.6-MEM recipe ships a small canonical
vocabulary per environment (~10 strings) that every episode reuses; this
module derives that vocabulary automatically from the first few episode
videos and persists it next to the dataset.
Pipeline-level flow:
Phase 0 (here): watch N sample episodes → produce vocabulary.json
Phase 1 (plan module): reuse vocabulary on every episode, both as
prompt-side constraint *and* post-VLM validation
The vocabulary is JSON, lives at ``<root>/meta/canonical_vocabulary.json``,
and is human-inspectable / hand-editable — if the discovered set is wrong,
operators edit the file and re-run the pipeline without phase 0.
"""
from __future__ import annotations
import json
import logging
from collections.abc import Sequence
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
from .config import VocabularyConfig
from .frames import FrameProvider, null_provider, to_video_block
from .prompts import load as load_prompt
from .reader import EpisodeRecord
from .vlm_client import VlmClient
logger = logging.getLogger(__name__)
VOCABULARY_FILENAME = "canonical_vocabulary.json"
@dataclass
class Vocabulary:
"""Canonical phrasings shared across every episode of one dataset.
Both lists are strict: per-episode subtask + memory generation pick
from these strings only; the downstream policy then has a small,
repeatable target distribution to learn instead of thousands of
LLM paraphrases.
"""
subtasks: tuple[str, ...]
"""Imperative subtask labels — what the low-level policy is conditioned
on. Verb-first, telegraphic, consistent object nouns. Example:
``("move to blue cube", "grasp blue cube", "lift blue cube",
"place blue cube in box", "retract arm")``.
"""
memory_milestones: tuple[str, ...]
"""First-person past-tense milestone sentences — building blocks for
the running memory string. Example: ``("I picked up the blue cube.",
"I placed the blue cube in the green box.")``. Each milestone maps
1:1 onto a completed subtask phase; ``memory_at_step_k`` is the
concatenation of milestones for completed phases.
"""
def to_json(self) -> dict[str, list[str]]:
return {
"subtasks": list(self.subtasks),
"memory_milestones": list(self.memory_milestones),
}
@classmethod
def from_json(cls, payload: dict[str, Any]) -> Vocabulary:
subtasks = tuple(
str(s).strip() for s in (payload.get("subtasks") or []) if str(s).strip()
)
memory_milestones = tuple(
str(s).strip() for s in (payload.get("memory_milestones") or []) if str(s).strip()
)
return cls(subtasks=subtasks, memory_milestones=memory_milestones)
def is_empty(self) -> bool:
return not self.subtasks and not self.memory_milestones
def vocabulary_path(root: Path) -> Path:
"""Return the canonical on-disk location for the vocabulary file."""
return root / "meta" / VOCABULARY_FILENAME
def load_vocabulary(root: Path) -> Vocabulary | None:
"""Read ``<root>/meta/canonical_vocabulary.json`` if present.
Returns ``None`` when the file does not exist — callers fall back to
free-form (unconstrained) subtask + memory generation, preserving the
pipeline's behaviour on datasets that never ran phase 0.
"""
path = vocabulary_path(root)
if not path.exists():
return None
try:
payload = json.loads(path.read_text(encoding="utf-8"))
except (OSError, json.JSONDecodeError) as exc:
logger.warning("could not read %s: %s — proceeding without vocabulary", path, exc)
return None
if not isinstance(payload, dict):
logger.warning("%s is not a JSON object — ignoring", path)
return None
vocab = Vocabulary.from_json(payload)
if vocab.is_empty():
return None
return vocab
def save_vocabulary(root: Path, vocab: Vocabulary) -> Path:
"""Atomically persist ``vocab`` to ``<root>/meta/canonical_vocabulary.json``."""
path = vocabulary_path(root)
path.parent.mkdir(parents=True, exist_ok=True)
tmp = path.with_suffix(path.suffix + ".tmp")
tmp.write_text(
json.dumps(vocab.to_json(), indent=2, ensure_ascii=False) + "\n",
encoding="utf-8",
)
tmp.replace(path)
return path
@dataclass
class VocabularyDiscoveryModule:
"""Derive a dataset-level canonical vocabulary from sample episodes.
Phase 0 of the executor: pulls ``config.sample_episodes`` episode
videos, packs them into one Qwen-VL multi-video prompt, and asks the
model to enumerate the small set of canonical subtask labels +
memory milestones that recur across them. The output is persisted
to ``meta/canonical_vocabulary.json`` and consumed by phase 1.
"""
vlm: VlmClient
config: VocabularyConfig
frame_provider: FrameProvider = field(default_factory=null_provider)
@property
def enabled(self) -> bool:
return self.config.enabled
def discover(
self,
records: Sequence[EpisodeRecord],
*,
existing: Vocabulary | None = None,
) -> Vocabulary | None:
"""Run vocabulary discovery against the first N sample episodes.
``existing`` short-circuits the VLM call when ``config.reuse_existing``
is True and an on-disk vocabulary is already present — keeps re-runs
cheap and lets operators hand-edit the file without it getting
overwritten.
"""
if existing is not None and self.config.reuse_existing:
logger.info(
"vocabulary: reusing existing (%d subtasks, %d memory milestones)",
len(existing.subtasks),
len(existing.memory_milestones),
)
return existing
sample = list(records[: max(1, int(self.config.sample_episodes))])
if not sample:
return None
task_hint = next((r.episode_task for r in sample if r.episode_task), "")
prompt = load_prompt("module_0_vocabulary").format(
episode_task=task_hint or "(unspecified)",
n_episodes=len(sample),
)
# Pack one video block per sample episode so the VLM sees the
# variation across episodes (different starting poses, different
# object placements) rather than overfitting to one trajectory.
content: list[dict[str, Any]] = []
for record in sample:
video_frames = self.frame_provider.video_for_episode(
record, int(self.config.max_video_frames_per_episode)
)
if video_frames:
content.extend(to_video_block(video_frames))
content.append({"type": "text", "text": prompt})
messages = [{"role": "user", "content": content}]
result = self.vlm.generate_json([messages])[0]
if not isinstance(result, dict):
logger.warning("vocabulary: VLM did not return a JSON object — skipping")
return None
vocab = Vocabulary.from_json(result)
if vocab.is_empty():
logger.warning("vocabulary: VLM returned an empty vocabulary — skipping")
return None
logger.info(
"vocabulary: discovered %d subtask labels + %d memory milestones from %d episodes",
len(vocab.subtasks),
len(vocab.memory_milestones),
len(sample),
)
return vocab

View File

@@ -1,356 +0,0 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Final parquet rewrite.
For every episode the writer:
1. reads the staged module outputs,
2. partitions them into a persistent slice (PERSISTENT_STYLES) and an event
slice (EVENT_ONLY_STYLES + style=None tool-call atoms),
3. sorts each slice deterministically,
4. broadcasts the persistent slice across every frame in the episode,
5. for each frame, materializes the sublist of event rows whose timestamp
exactly equals that frame's timestamp,
6. drops the legacy ``subtask_index`` column,
7. writes the parquet shard back in place.
The writer does NOT add a dataset-level ``tools`` column. Tool *calls* are
emitted per-row via the existing ``tool_calls`` field on the v3.1 row
struct for every speech atom. The tool *schema* (the description
of the ``say`` function and its parameters) is a fixed code constant —
``SAY_TOOL_SCHEMA`` below — and downstream chat-template consumers import
it directly rather than reading a redundant per-row column.
Invariants enforced here (and re-checked by the validator):
- per-episode persistent slice is byte-identical across every frame;
- ``language_events`` rows on a frame all have ``timestamp == frame_ts``
(timestamps come straight from the source parquet — never recomputed);
- every row passes ``column_for_style(style)``.
"""
from __future__ import annotations
import logging
from collections import defaultdict
from collections.abc import Iterable, Sequence
from dataclasses import dataclass
from pathlib import Path
from typing import Any
import pyarrow as pa
import pyarrow.parquet as pq
from lerobot.datasets.language import (
EVENT_ONLY_STYLES,
LANGUAGE_EVENTS,
LANGUAGE_PERSISTENT,
PERSISTENT_STYLES,
column_for_style,
validate_camera_field,
)
from .reader import EpisodeRecord
from .staging import EpisodeStaging
logger = logging.getLogger(__name__)
# Tool schema constants live in lerobot.datasets.language — single
# source of truth. Re-exported here so existing imports
# (``from lerobot.annotations.steerable_pipeline.writer import SAY_TOOL_SCHEMA``)
# keep working.
from lerobot.datasets.language import DEFAULT_TOOLS, SAY_TOOL_SCHEMA # noqa: F401, E402
def _row_persistent_sort_key(row: dict[str, Any]) -> tuple:
return (float(row["timestamp"]), row.get("style") or "", row.get("role") or "")
def _row_event_sort_key(row: dict[str, Any]) -> tuple:
# events are bucketed per-frame, but within a frame we still want determinism
return (
row.get("style") or "",
row.get("role") or "",
row.get("camera") or "",
)
def _normalize_persistent_row(row: dict[str, Any]) -> dict[str, Any]:
"""Coerce a staged row into the persistent column's struct shape."""
style = row.get("style")
if style not in PERSISTENT_STYLES:
raise ValueError(
f"persistent slice contains row with non-persistent style {style!r}; "
"row would be misrouted under column_for_style()"
)
if "timestamp" not in row:
raise ValueError(f"persistent row missing timestamp: {row!r}")
if "role" not in row:
# Surface a friendly error from the writer rather than letting
# the raw KeyError bubble out of the dict access below — modules
# are expected to always emit ``role``, but the validator
# currently doesn't check this so a future bug would otherwise
# be hard to triage.
raise ValueError(f"persistent row missing role: {row!r}")
camera = row.get("camera")
validate_camera_field(style, camera)
return {
"role": str(row["role"]),
"content": None if row.get("content") is None else str(row["content"]),
"style": style,
"timestamp": float(row["timestamp"]),
"camera": None if camera is None else str(camera),
"tool_calls": _normalize_tool_calls(row.get("tool_calls")),
}
def _normalize_event_row(row: dict[str, Any]) -> dict[str, Any]:
"""Coerce a staged row into the event column's struct shape (no timestamp)."""
style = row.get("style")
if style is not None and style not in EVENT_ONLY_STYLES:
raise ValueError(
f"event slice contains row with style {style!r}; expected None or one of {EVENT_ONLY_STYLES}"
)
if column_for_style(style) != LANGUAGE_EVENTS:
raise ValueError(f"event row with style {style!r} would not route to language_events")
if "role" not in row:
raise ValueError(f"event row missing role: {row!r}")
camera = row.get("camera")
validate_camera_field(style, camera)
return {
"role": str(row["role"]),
"content": None if row.get("content") is None else str(row["content"]),
"style": style,
"camera": None if camera is None else str(camera),
"tool_calls": _normalize_tool_calls(row.get("tool_calls")),
}
def _normalize_tool_calls(value: Any) -> list[Any] | None:
if value is None:
return None
if not isinstance(value, list):
raise ValueError(f"tool_calls must be a list or None, got {type(value).__name__}")
return list(value)
def _validate_atom_invariants(row: dict[str, Any]) -> None:
"""At-least-one of content/tool_calls; style=None implies tool_calls."""
has_content = row.get("content") is not None
has_tools = row.get("tool_calls") is not None
if not (has_content or has_tools):
raise ValueError(f"row has neither content nor tool_calls: {row!r}")
if row.get("style") is None and not has_tools:
raise ValueError(f"style=None requires tool_calls: {row!r}")
def _validate_speech_atom(row: dict[str, Any]) -> None:
"""Speech atoms: role=assistant, style=None, content=None, say tool call."""
if row.get("style") is not None:
return # not a speech atom
if row.get("role") != "assistant":
raise ValueError(f"speech atom must have role=assistant: {row!r}")
if row.get("content") is not None:
raise ValueError(f"speech atom must have content=null: {row!r}")
tool_calls = row.get("tool_calls")
if not tool_calls or not isinstance(tool_calls, list):
raise ValueError(f"speech atom must have non-empty tool_calls list: {row!r}")
first = tool_calls[0]
if not isinstance(first, dict):
raise ValueError(f"speech atom tool_calls[0] must be a dict: {row!r}")
if first.get("type") != "function":
raise ValueError(f"speech atom tool_calls[0].type must be 'function': {row!r}")
fn = first.get("function") or {}
if fn.get("name") != "say":
raise ValueError(f"speech atom tool_calls[0].function.name must be 'say': {row!r}")
args = fn.get("arguments") or {}
if not isinstance(args, dict) or "text" not in args or not isinstance(args["text"], str):
raise ValueError(f"speech atom must carry 'text' string in arguments: {row!r}")
@dataclass
class LanguageColumnsWriter:
"""Rewrite ``data/chunk-*/file-*.parquet`` with the two language columns."""
drop_existing_subtask_index: bool = True
def write_all(
self,
records: Sequence[EpisodeRecord],
staging_dir: Path,
root: Path,
) -> list[Path]:
episodes_by_path: dict[Path, list[EpisodeRecord]] = defaultdict(list)
for record in records:
episodes_by_path[record.data_path].append(record)
written: list[Path] = []
for path, eps in episodes_by_path.items():
self._rewrite_one(path, eps, staging_dir, root)
written.append(path)
return written
def _rewrite_one(
self,
path: Path,
episodes: Sequence[EpisodeRecord],
staging_dir: Path,
root: Path,
) -> None:
table = pq.read_table(path)
n_rows = table.num_rows
# Ensure we cover every episode in the file. Episodes that don't have
# staging artifacts are passed through with empty annotation lists —
# this keeps the writer idempotent and safe for partial reruns.
staged_per_ep: dict[int, dict[str, list[dict[str, Any]]]] = {}
for record in episodes:
staging = EpisodeStaging(staging_dir, record.episode_index)
staged_per_ep[record.episode_index] = staging.read_all()
persistent_by_ep: dict[int, list[dict[str, Any]]] = {}
events_by_ep_ts: dict[int, dict[float, list[dict[str, Any]]]] = {}
for ep_index, ep_staged in staged_per_ep.items():
persistent_rows: list[dict[str, Any]] = []
event_rows: list[dict[str, Any]] = [] # carry timestamp until bucketed
for _module_name, rows in ep_staged.items():
for row in rows:
style = row.get("style")
if column_for_style(style) == LANGUAGE_PERSISTENT:
persistent_rows.append(row)
else:
event_rows.append(row)
persistent_rows.sort(key=_row_persistent_sort_key)
normalized_persistent = []
for r in persistent_rows:
_validate_atom_invariants(r)
_validate_speech_atom(r)
normalized_persistent.append(_normalize_persistent_row(r))
persistent_by_ep[ep_index] = normalized_persistent
buckets: dict[float, list[dict[str, Any]]] = defaultdict(list)
for r in event_rows:
_validate_atom_invariants(r)
_validate_speech_atom(r)
ts = float(r["timestamp"])
buckets[ts].append(_normalize_event_row(r))
for ts in list(buckets.keys()):
buckets[ts].sort(key=_row_event_sort_key)
events_by_ep_ts[ep_index] = buckets
episode_col = (
table.column("episode_index").to_pylist() if "episode_index" in table.column_names else None
)
ts_col = table.column("timestamp").to_pylist() if "timestamp" in table.column_names else None
if episode_col is None or ts_col is None:
raise ValueError(f"{path} is missing 'episode_index' or 'timestamp' — required by the writer.")
per_row_persistent: list[list[dict[str, Any]]] = []
per_row_events: list[list[dict[str, Any]]] = []
for i in range(n_rows):
ep = episode_col[i]
ts = float(ts_col[i])
per_row_persistent.append(persistent_by_ep.get(ep, []))
buckets = events_by_ep_ts.get(ep, {})
per_row_events.append(buckets.get(ts, []))
new_table = self._materialize_table(
table, per_row_persistent, per_row_events, drop_old=self.drop_existing_subtask_index
)
# Atomic replace: write to a sibling tmp path and rename so a crash
# mid-write can't leave a half-written shard that ``pq.read_table``
# would then fail to open. ``Path.replace`` is atomic on POSIX +
# Windows when source and target sit on the same filesystem.
tmp_path = path.with_suffix(path.suffix + ".tmp")
pq.write_table(new_table, tmp_path)
tmp_path.replace(path)
def _materialize_table(
self,
table: pa.Table,
persistent: list[list[dict[str, Any]]],
events: list[list[dict[str, Any]]],
*,
drop_old: bool,
) -> pa.Table:
cols = []
names = []
for name in table.column_names:
if drop_old and name == "subtask_index":
continue
if name in (LANGUAGE_PERSISTENT, LANGUAGE_EVENTS):
continue # we'll re-add canonical versions
# Strip any legacy ``tools`` column previously emitted by older
# writers — the schema no longer uses it (constant lives in
# SAY_TOOL_SCHEMA / DEFAULT_TOOLS).
if name == "tools":
continue
cols.append(table.column(name))
names.append(name)
# We let pyarrow infer struct/list schema rather than passing the
# canonical type from `lerobot.datasets.language` directly: that type
# uses `pa.json_()` for the `tool_calls` element type, which
# `pa.array(..., type=...)` cannot materialize from Python lists on
# current pyarrow versions. The inferred schema round-trips through
# parquet and `LeRobotDataset` correctly — `tests/datasets/test_language.py`
# exercises the same flow.
persistent_arr = pa.array(persistent)
events_arr = pa.array(events)
cols.extend([persistent_arr, events_arr])
names.extend([LANGUAGE_PERSISTENT, LANGUAGE_EVENTS])
return pa.Table.from_arrays(cols, names=names)
def speech_atom(timestamp: float, text: str) -> dict[str, Any]:
"""Build a canonical speech tool-call atom for the events column."""
return {
"role": "assistant",
"content": None,
"style": None,
"timestamp": float(timestamp),
"camera": None,
"tool_calls": [
{
"type": "function",
"function": {
"name": "say",
"arguments": {"text": text},
},
}
],
}
def normalize_rows_for_writer(
rows: Iterable[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
"""Helper used by tests/validators to partition a flat row list into
(persistent_rows, event_rows) using ``column_for_style``.
"""
persistent: list[dict[str, Any]] = []
events: list[dict[str, Any]] = []
for row in rows:
if column_for_style(row.get("style")) == LANGUAGE_PERSISTENT:
persistent.append(row)
else:
events.append(row)
return persistent, events

View File

@@ -33,7 +33,7 @@ import cv2 # type: ignore # TODO: add type stubs for OpenCV
import numpy as np # type: ignore # TODO: add type stubs for numpy
from lerobot.utils.decorators import check_if_not_connected
from lerobot.utils.import_utils import _reachy2_sdk_available, require_package
from lerobot.utils.import_utils import _reachy2_sdk_available
if TYPE_CHECKING or _reachy2_sdk_available:
from reachy2_sdk.media.camera import CameraView
@@ -76,7 +76,6 @@ class Reachy2Camera(Camera):
Args:
config: The configuration settings for the camera.
"""
require_package("reachy2_sdk", extra="reachy2")
super().__init__(config)
self.config = config

View File

@@ -17,21 +17,18 @@ Provides the RealSenseCamera class for capturing frames from Intel RealSense cam
"""
import logging
import sys
import time
from threading import Event, Lock, Thread
from typing import TYPE_CHECKING, Any
from typing import Any
import cv2 # type: ignore # TODO: add type stubs for OpenCV
import numpy as np # type: ignore # TODO: add type stubs for numpy
from numpy.typing import NDArray # type: ignore # TODO: add type stubs for numpy.typing
from lerobot.utils.import_utils import _pyrealsense2_available, require_package
if TYPE_CHECKING or _pyrealsense2_available:
import pyrealsense2 as rs
else:
rs = None
try:
import pyrealsense2 as rs # type: ignore # TODO: add type stubs for pyrealsense2
except Exception as e:
logging.info(f"Could not import realsense: {e}")
from lerobot.utils.decorators import check_if_already_connected, check_if_not_connected
from lerobot.utils.errors import DeviceNotConnectedError
@@ -42,7 +39,6 @@ from ..utils import get_cv2_rotation
from .configuration_realsense import RealSenseCameraConfig
logger = logging.getLogger(__name__)
pkg_name = "pyrealsense2-macosx" if sys.platform == "darwin" else "pyrealsense2"
class RealSenseCamera(Camera):
@@ -116,7 +112,7 @@ class RealSenseCamera(Camera):
Args:
config: The configuration settings for the camera.
"""
require_package(pkg_name, extra="intelrealsense", import_name="pyrealsense2")
super().__init__(config)
self.config = config

View File

@@ -28,19 +28,12 @@ import json
import logging
import time
from threading import Event, Lock, Thread
from typing import TYPE_CHECKING, Any
from typing import Any
import cv2
import numpy as np
from numpy.typing import NDArray
from lerobot.utils.import_utils import _zmq_available, require_package
if TYPE_CHECKING or _zmq_available:
import zmq
else:
zmq = None
from lerobot.utils.decorators import check_if_already_connected, check_if_not_connected
from lerobot.utils.errors import DeviceNotConnectedError
@@ -81,8 +74,8 @@ class ZMQCamera(Camera):
"""
def __init__(self, config: ZMQCameraConfig):
require_package("pyzmq", extra="pyzmq-dep", import_name="zmq")
super().__init__(config)
import zmq
self.config = config
self.server_address = config.server_address
@@ -124,6 +117,8 @@ class ZMQCamera(Camera):
logger.info(f"Connecting to {self}...")
try:
import zmq
self.context = zmq.Context()
self.socket = self.context.socket(zmq.SUB)
self.socket.setsockopt_string(zmq.SUBSCRIBE, "")
@@ -185,8 +180,11 @@ class ZMQCamera(Camera):
try:
message = self.socket.recv_string()
except zmq.Again as e:
raise TimeoutError(f"{self} timeout after {self.timeout_ms}ms") from e
except Exception as e:
# zmq is lazy-imported in connect(), so check by name to avoid a top-level import
if type(e).__name__ == "Again":
raise TimeoutError(f"{self} timeout after {self.timeout_ms}ms") from e
raise
# Decode JSON message
data = json.loads(message)

View File

@@ -28,12 +28,6 @@ import numpy as np
import torch
from lerobot.policies import PreTrainedPolicy, prepare_observation_for_inference
from lerobot.utils.import_utils import _deepdiff_available, require_package
if TYPE_CHECKING or _deepdiff_available:
from deepdiff import DeepDiff
else:
DeepDiff = None
if TYPE_CHECKING:
from lerobot.datasets import LeRobotDataset
@@ -223,7 +217,10 @@ def sanity_check_dataset_robot_compatibility(
Raises:
ValueError: If any of the checked metadata fields do not match.
"""
require_package("deepdiff", extra="deepdiff-dep")
from lerobot.utils.import_utils import require_package
require_package("deepdiff", extra="hardware")
from deepdiff import DeepDiff
from lerobot.utils.constants import DEFAULT_FEATURES

View File

@@ -41,12 +41,8 @@ def cfg_to_group(
return tag
return tag[:max_tag_length]
if cfg.is_reward_model_training:
trainable_tag = f"reward_model:{cfg.reward_model.type}"
else:
trainable_tag = f"policy:{cfg.policy.type}"
lst = [
trainable_tag,
f"policy:{cfg.policy.type}",
f"seed:{cfg.seed}",
]
if cfg.dataset is not None:

View File

@@ -21,10 +21,8 @@ are intentionally NOT re-exported here to avoid circular dependencies
Import them directly: ``from lerobot.configs.train import TrainPipelineConfig``
"""
from .dataset import DatasetRecordConfig
from .default import DatasetConfig, EvalConfig, PeftConfig, WandBConfig
from .policies import PreTrainedConfig
from .recipe import MessageTurn, TrainingRecipe, load_recipe
from .types import (
FeatureType,
NormalizationMode,
@@ -41,13 +39,9 @@ __all__ = [
"PolicyFeature",
"RTCAttentionSchedule",
# Config classes
"DatasetRecordConfig",
"DatasetConfig",
"EvalConfig",
"MessageTurn",
"PeftConfig",
"PreTrainedConfig",
"TrainingRecipe",
"WandBConfig",
"load_recipe",
]

View File

@@ -1,80 +0,0 @@
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Shared dataset recording configuration used by both ``lerobot-record`` and ``lerobot-rollout``."""
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
@dataclass
class DatasetRecordConfig:
# Dataset identifier. By convention it should match '{hf_username}/{dataset_name}' (e.g. `lerobot/test`).
repo_id: str = ""
# A short but accurate description of the task performed during the recording (e.g. "Pick the Lego block and drop it in the box on the right.")
single_task: str = ""
# Root directory where the dataset will be stored (e.g. 'dataset/path'). If None, defaults to $HF_LEROBOT_HOME/repo_id.
root: str | Path | None = None
# Limit the frames per second.
fps: int = 30
# Number of seconds for data recording for each episode.
episode_time_s: int | float = 60
# Number of seconds for resetting the environment after each episode.
reset_time_s: int | float = 60
# Number of episodes to record.
num_episodes: int = 50
# Encode frames in the dataset into video
video: bool = True
# Upload dataset to Hugging Face hub.
push_to_hub: bool = True
# Upload on private repository on the Hugging Face hub.
private: bool = False
# Add tags to your dataset on the hub.
tags: list[str] | None = None
# Number of subprocesses handling the saving of frames as PNG. Set to 0 to use threads only;
# set to ≥1 to use subprocesses, each using threads to write images. The best number of processes
# and threads depends on your system. We recommend 4 threads per camera with 0 processes.
# If fps is unstable, adjust the thread count. If still unstable, try using 1 or more subprocesses.
num_image_writer_processes: int = 0
# Number of threads writing the frames as png images on disk, per camera.
# Too many threads might cause unstable teleoperation fps due to main thread being blocked.
# Not enough threads might cause low camera fps.
num_image_writer_threads_per_camera: int = 4
# Number of episodes to record before batch encoding videos
# Set to 1 for immediate encoding (default behavior), or higher for batched encoding
video_encoding_batch_size: int = 1
# Video codec for encoding videos. Options: 'h264', 'hevc', 'libsvtav1', 'auto',
# or hardware-specific: 'h264_videotoolbox', 'h264_nvenc', 'h264_vaapi', 'h264_qsv'.
# Use 'auto' to auto-detect the best available hardware encoder.
vcodec: str = "libsvtav1"
# Enable streaming video encoding: encode frames in real-time during capture instead
# of writing PNG images first. Makes save_episode() near-instant. More info in the documentation: https://huggingface.co/docs/lerobot/streaming_video_encoding
streaming_encoding: bool = False
# Maximum number of frames to buffer per camera when using streaming encoding.
# ~1s buffer at 30fps. Provides backpressure if the encoder can't keep up.
encoder_queue_maxsize: int = 30
# Number of threads per encoder instance. None = auto (codec default).
# Lower values reduce CPU usage, maps to 'lp' (via svtav1-params) for libsvtav1 and 'threads' for h264/hevc..
encoder_threads: int | None = None
def stamp_repo_id(self) -> None:
"""Append a date-time tag to ``repo_id`` so each recording session gets a unique name.
Must be called explicitly at dataset *creation* time — not on resume,
where the existing ``repo_id`` (already stamped) must be preserved.
"""
if self.repo_id:
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
self.repo_id = f"{self.repo_id}_{timestamp}"

View File

@@ -35,9 +35,6 @@ class DatasetConfig:
revision: str | None = None
use_imagenet_stats: bool = True
video_backend: str = field(default_factory=get_safe_default_codec)
# When True, video frames are returned as uint8 tensors (0-255) instead of float32 (0.0-1.0).
# This reduces memory and speeds up DataLoader IPC. The training pipeline handles the conversion.
return_uint8: bool = False
streaming: bool = False
def __post_init__(self) -> None:

View File

@@ -1,206 +0,0 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import annotations
import re
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Literal, get_args
MessageRole = Literal["user", "assistant", "system", "tool"]
MessageStream = Literal["high_level", "low_level"]
DEFAULT_BINDINGS = {
"subtask": "active_at(t, style=subtask)",
"memory": "active_at(t, style=memory)",
"plan": "active_at(t, style=plan)",
"speech": "emitted_at(t, role=assistant, tool_name=say)",
"interjection": "emitted_at(t, style=interjection)",
"vqa": "emitted_at(t, style=vqa, role=assistant)",
"vqa_query": "emitted_at(t, style=vqa, role=user)",
}
PLACEHOLDER_RE = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")
"""``${name}`` placeholder pattern used by both recipe binding-reference
discovery (here) and rendered-message substitution (in ``language_render``)."""
_VALID_ROLES = frozenset(get_args(MessageRole))
_VALID_STREAMS = frozenset(get_args(MessageStream))
@dataclass
class MessageTurn:
"""A single chat-style turn in a recipe template.
``content`` may be a plain string, a list of HF-style multimodal blocks, or
``None`` when ``tool_calls_from`` supplies tool-call payloads instead.
``stream`` tags the turn for downstream filtering, ``target`` flags it as a
training target, and ``if_present`` skips the turn when the named binding
resolves to ``None``.
"""
role: MessageRole
content: str | list[dict[str, Any]] | None = None
stream: MessageStream | None = None
target: bool = False
if_present: str | None = None
tool_calls_from: str | None = None
def __post_init__(self) -> None:
"""Validate role, stream, and content after dataclass construction."""
if self.role not in _VALID_ROLES:
raise ValueError(f"Unsupported message role: {self.role!r}")
# ``stream`` is typed Optional only so the dataclass can keep its
# field ordering, but recipes must always tag every turn with a
# stream — the renderer's ``_validate_rendered`` would reject
# ``None`` later on. Fail at construction so the bad recipe is
# caught at YAML load time rather than at the first sample.
if self.stream is None:
raise ValueError(
f"MessageTurn(role={self.role!r}) is missing a stream — "
f"every turn must declare one of {sorted(_VALID_STREAMS)}."
)
if self.stream not in _VALID_STREAMS:
raise ValueError(f"Unsupported message stream: {self.stream!r}")
if self.content is None and self.tool_calls_from is None:
raise ValueError("MessageTurn.content is required unless tool_calls_from is set.")
if self.content is not None and not isinstance(self.content, (str, list)):
raise TypeError("MessageTurn.content must be a string, a list of HF-style blocks, or None.")
if isinstance(self.content, list):
for block in self.content:
if not isinstance(block, dict) or "type" not in block:
raise ValueError(
"Multimodal content blocks must be HF-style dictionaries with a type key."
)
@classmethod
def from_dict(cls, data: dict[str, Any]) -> MessageTurn:
"""Construct a :class:`MessageTurn` from a plain dictionary."""
return cls(**data)
@dataclass
class TrainingRecipe:
"""A recipe describing how to render training samples from language rows.
A recipe is either a *message recipe* (``messages`` plus optional
``bindings``) or a *blend recipe* (``blend`` mapping names to weighted
sub-recipes). ``weight`` is only meaningful inside a blend.
"""
messages: list[MessageTurn] | None = None
bindings: dict[str, str] | None = None
blend: dict[str, TrainingRecipe] | None = None
weight: float | None = None
def __post_init__(self) -> None:
"""Validate that exactly one of ``messages`` or ``blend`` is set."""
if self.messages is not None and self.blend is not None:
raise ValueError("TrainingRecipe must set only one of messages or blend.")
if self.messages is None and self.blend is None:
raise ValueError("TrainingRecipe must set one of messages or blend.")
if self.messages is not None:
self._validate_message_recipe()
if self.blend is not None:
self._validate_blend_recipe()
@classmethod
def from_dict(cls, data: dict[str, Any]) -> TrainingRecipe:
"""Construct a :class:`TrainingRecipe` from a nested dictionary."""
data = dict(data)
if data.get("messages") is not None:
data["messages"] = [
turn if isinstance(turn, MessageTurn) else MessageTurn.from_dict(turn)
for turn in data["messages"]
]
if data.get("blend") is not None:
data["blend"] = {
name: recipe if isinstance(recipe, TrainingRecipe) else cls.from_dict(recipe)
for name, recipe in data["blend"].items()
}
return cls(**data)
@classmethod
def from_yaml(cls, path: str | Path) -> TrainingRecipe:
"""Load a :class:`TrainingRecipe` from a YAML file at ``path``."""
import yaml # type: ignore[import-untyped]
with open(path) as f:
data = yaml.safe_load(f)
if not isinstance(data, dict):
raise ValueError(f"Recipe YAML must contain a mapping at the top level: {path}")
return cls.from_dict(data)
def _validate_message_recipe(self) -> None:
"""Ensure every templated binding is known and at least one turn is a target."""
assert self.messages is not None
known_bindings = set(DEFAULT_BINDINGS) | set(self.bindings or {}) | {"task"}
for turn in self.messages:
missing = self._referenced_bindings(turn) - known_bindings
if missing:
raise ValueError(f"MessageTurn references unknown binding(s): {sorted(missing)}")
if not any(turn.target for turn in self.messages):
raise ValueError("Message recipes must contain at least one target turn.")
def _validate_blend_recipe(self) -> None:
"""Ensure each blend component is a non-empty, weighted message recipe."""
assert self.blend is not None
if not self.blend:
raise ValueError("Blend recipes must contain at least one component.")
for name, recipe in self.blend.items():
if recipe.blend is not None:
raise ValueError(f"Blend component {name!r} cannot itself define a blend.")
if recipe.messages is None:
raise ValueError(f"Blend component {name!r} must define messages.")
if recipe.weight is None:
raise ValueError(f"Blend component {name!r} must define weight.")
if recipe.weight <= 0:
raise ValueError(f"Blend component {name!r} must have a positive weight.")
def _referenced_bindings(self, turn: MessageTurn) -> set[str]:
"""Return the binding names that ``turn`` references via placeholders or attributes."""
names: set[str] = set()
if turn.if_present is not None:
names.add(turn.if_present)
if turn.tool_calls_from is not None:
names.add(turn.tool_calls_from)
names.update(_placeholders_in_content(turn.content))
return names
def _placeholders_in_content(content: str | list[dict[str, Any]] | None) -> set[str]:
"""Return the set of ``${name}`` placeholders found anywhere in ``content``."""
if content is None:
return set()
if isinstance(content, str):
return set(PLACEHOLDER_RE.findall(content))
names: set[str] = set()
for block in content:
for value in block.values():
if isinstance(value, str):
names.update(PLACEHOLDER_RE.findall(value))
return names
def load_recipe(path: str | Path) -> TrainingRecipe:
"""Load a :class:`TrainingRecipe` from a YAML file at ``path``."""
return TrainingRecipe.from_yaml(path)

View File

@@ -1,163 +0,0 @@
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import abc
import builtins
import json
import logging
import os
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, TypeVar
import draccus
from huggingface_hub import hf_hub_download
from huggingface_hub.constants import CONFIG_NAME
from huggingface_hub.errors import HfHubHTTPError
from lerobot.configs.types import PolicyFeature
from lerobot.optim.optimizers import OptimizerConfig
from lerobot.optim.schedulers import LRSchedulerConfig
from lerobot.utils.device_utils import auto_select_torch_device, is_torch_device_available
from lerobot.utils.hub import HubMixin
T = TypeVar("T", bound="RewardModelConfig")
logger = logging.getLogger(__name__)
@dataclass
class RewardModelConfig(draccus.ChoiceRegistry, HubMixin, abc.ABC):
"""Base configuration for reward models.
Args:
input_features: A dictionary defining the PolicyFeature of the input data for the reward. The key represents
the input data name, and the value is PolicyFeature, which consists of FeatureType and shape attributes.
output_features: A dictionary defining the PolicyFeature of the output data for the reward. The key represents
the output data name, and the value is PolicyFeature, which consists of FeatureType and shape attributes.
"""
# Reuses PolicyFeature
input_features: dict[str, PolicyFeature] = field(default_factory=dict)
output_features: dict[str, PolicyFeature] = field(default_factory=dict)
device: str | None = None
pretrained_path: str | None = None
push_to_hub: bool = False
repo_id: str | None = None
# Hub metadata
license: str | None = None
tags: list[str] | None = None
private: bool | None = None
def __post_init__(self) -> None:
if not self.device or not is_torch_device_available(self.device):
auto_device = auto_select_torch_device()
logger.warning(f"Device '{self.device}' is not available. Switching to '{auto_device}'.")
self.device = auto_device.type
@property
def type(self) -> str:
choice_name = self.get_choice_name(self.__class__)
if not isinstance(choice_name, str):
raise TypeError(f"Expected string from get_choice_name, got {type(choice_name)}")
return choice_name
@property
def observation_delta_indices(self) -> list | None: # type: ignore[type-arg]
return None
@property
def action_delta_indices(self) -> list | None: # type: ignore[type-arg]
return None
@property
def reward_delta_indices(self) -> list | None: # type: ignore[type-arg]
return None
@abc.abstractmethod
def get_optimizer_preset(self) -> OptimizerConfig:
raise NotImplementedError
def get_scheduler_preset(self) -> LRSchedulerConfig | None:
return None
def validate_features(self) -> None:
pass
def _save_pretrained(self, save_directory: Path) -> None:
with open(save_directory / CONFIG_NAME, "w") as f, draccus.config_type("json"):
draccus.dump(self, f, indent=4)
@classmethod
def from_pretrained(
cls: builtins.type[T],
pretrained_name_or_path: str | Path,
*,
force_download: bool = False,
resume_download: bool | None = None,
proxies: dict[Any, Any] | None = None,
token: str | bool | None = None,
cache_dir: str | Path | None = None,
local_files_only: bool = False,
revision: str | None = None,
**reward_kwargs: Any,
) -> T:
model_id = str(pretrained_name_or_path)
config_file: str | None = None
if Path(model_id).is_dir():
if CONFIG_NAME in os.listdir(model_id):
config_file = os.path.join(model_id, CONFIG_NAME)
else:
logger.error(f"{CONFIG_NAME} not found in {Path(model_id).resolve()}")
else:
try:
config_file = hf_hub_download(
repo_id=model_id,
filename=CONFIG_NAME,
revision=revision,
cache_dir=cache_dir,
force_download=force_download,
proxies=proxies,
resume_download=resume_download,
token=token,
local_files_only=local_files_only,
)
except HfHubHTTPError as e:
raise FileNotFoundError(
f"{CONFIG_NAME} not found on the HuggingFace Hub in {model_id}"
) from e
if config_file is None:
raise FileNotFoundError(f"{CONFIG_NAME} not found in {model_id}")
# HACK: Parse the original config to get the config subclass, so that we can
# apply cli overrides.
with draccus.config_type("json"):
orig_config = draccus.parse(cls, config_file, args=[])
with open(config_file) as f:
config = json.load(f)
config.pop("type", None)
with tempfile.NamedTemporaryFile("w+", delete=False, suffix=".json") as f:
json.dump(config, f)
config_file = f.name
cli_overrides = reward_kwargs.pop("cli_overrides", [])
with draccus.config_type("json"):
return draccus.parse(orig_config.__class__, config_file, args=cli_overrides)

View File

@@ -13,9 +13,7 @@
# limitations under the License.
import builtins
import datetime as dt
import json
import os
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
@@ -28,57 +26,18 @@ from lerobot import envs
from lerobot.configs import parser
from lerobot.optim import LRSchedulerConfig, OptimizerConfig
from lerobot.utils.hub import HubMixin
from lerobot.utils.sample_weighting import SampleWeightingConfig
from .default import DatasetConfig, EvalConfig, PeftConfig, WandBConfig
from .policies import PreTrainedConfig
from .rewards import RewardModelConfig
TRAIN_CONFIG_NAME = "train_config.json"
def _migrate_legacy_rabc_fields(config: dict[str, Any]) -> dict[str, Any] | None:
"""Return migrated payload for legacy RA-BC fields, or None when no migration is needed."""
legacy_fields = (
"use_rabc",
"rabc_progress_path",
"rabc_kappa",
"rabc_epsilon",
"rabc_head_mode",
)
if not any(key in config for key in legacy_fields):
return None
migrated_config = dict(config)
use_rabc = bool(migrated_config.pop("use_rabc", False))
rabc_progress_path = migrated_config.pop("rabc_progress_path", None)
rabc_kappa = migrated_config.pop("rabc_kappa", None)
rabc_epsilon = migrated_config.pop("rabc_epsilon", None)
rabc_head_mode = migrated_config.pop("rabc_head_mode", None)
# New configs may already define sample_weighting explicitly. In that case,
# legacy fields are ignored after being stripped from the payload.
if migrated_config.get("sample_weighting") is None and use_rabc:
sample_weighting: dict[str, Any] = {"type": "rabc"}
if rabc_progress_path is not None:
sample_weighting["progress_path"] = rabc_progress_path
if rabc_kappa is not None:
sample_weighting["kappa"] = rabc_kappa
if rabc_epsilon is not None:
sample_weighting["epsilon"] = rabc_epsilon
if rabc_head_mode is not None:
sample_weighting["head_mode"] = rabc_head_mode
migrated_config["sample_weighting"] = sample_weighting
return migrated_config
@dataclass
class TrainPipelineConfig(HubMixin):
dataset: DatasetConfig
env: envs.EnvConfig | None = None
policy: PreTrainedConfig | None = None
reward_model: RewardModelConfig | None = None
# Set `dir` to where you would like to save all of the run outputs. If you run another training session
# with the same value for `dir` its contents will be overwritten unless you set `resume` to true.
output_dir: Path | None = None
@@ -97,8 +56,6 @@ class TrainPipelineConfig(HubMixin):
# Number of workers for the dataloader.
num_workers: int = 4
batch_size: int = 8
prefetch_factor: int = 4
persistent_workers: bool = True
steps: int = 100_000
eval_freq: int = 20_000
log_freq: int = 200
@@ -113,41 +70,27 @@ class TrainPipelineConfig(HubMixin):
wandb: WandBConfig = field(default_factory=WandBConfig)
peft: PeftConfig | None = None
# Sample weighting configuration (e.g., for RA-BC training)
sample_weighting: SampleWeightingConfig | None = None
# RA-BC (Reward-Aligned Behavior Cloning) parameters
use_rabc: bool = False # Enable reward-weighted training
rabc_progress_path: str | None = None # Path to precomputed SARM progress parquet file
rabc_kappa: float = 0.01 # Hard threshold for high-quality samples
rabc_epsilon: float = 1e-6 # Small constant for numerical stability
rabc_head_mode: str | None = "sparse" # For dual-head models: "sparse" or "dense"
# Rename map for the observation to override the image and state keys
rename_map: dict[str, str] = field(default_factory=dict)
checkpoint_path: Path | None = field(init=False, default=None)
@property
def is_reward_model_training(self) -> bool:
"""True when the config targets a reward model rather than a policy."""
return self.reward_model is not None
@property
def trainable_config(self) -> PreTrainedConfig | RewardModelConfig:
"""Return whichever config (policy or reward_model) is active."""
if self.is_reward_model_training:
return self.reward_model # type: ignore[return-value]
return self.policy # type: ignore[return-value]
def validate(self) -> None:
# HACK: We parse again the cli args here to get the pretrained paths if there was some.
policy_path = parser.get_path_arg("policy")
reward_model_path = parser.get_path_arg("reward_model")
if reward_model_path:
cli_overrides = parser.get_cli_overrides("reward_model")
self.reward_model = RewardModelConfig.from_pretrained(
reward_model_path, cli_overrides=cli_overrides
)
self.reward_model.pretrained_path = str(Path(reward_model_path))
elif policy_path:
if policy_path:
# Only load the policy config
cli_overrides = parser.get_cli_overrides("policy")
self.policy = PreTrainedConfig.from_pretrained(policy_path, cli_overrides=cli_overrides)
self.policy.pretrained_path = Path(policy_path)
elif self.resume:
# The entire train config is already loaded, we just need to get the checkpoint dir
config_path = parser.parse_arg("config_path")
if not config_path:
raise ValueError(
@@ -163,22 +106,18 @@ class TrainPipelineConfig(HubMixin):
policy_dir = Path(config_path).parent
if self.policy is not None:
self.policy.pretrained_path = policy_dir
if self.reward_model is not None:
self.reward_model.pretrained_path = str(policy_dir)
self.checkpoint_path = policy_dir.parent
if self.policy is None and self.reward_model is None:
if self.policy is None:
raise ValueError(
"Neither policy nor reward_model is configured. "
"Please specify one with `--policy.path` or `--reward_model.path`."
"Policy is not configured. Please specify a pretrained policy with `--policy.path`."
)
active_cfg = self.trainable_config
if not self.job_name:
if self.env is None:
self.job_name = f"{active_cfg.type}"
self.job_name = f"{self.policy.type}"
else:
self.job_name = f"{self.env.type}_{active_cfg.type}"
self.job_name = f"{self.env.type}_{self.policy.type}"
if not self.resume and isinstance(self.output_dir, Path) and self.output_dir.is_dir():
raise FileExistsError(
@@ -196,16 +135,26 @@ class TrainPipelineConfig(HubMixin):
if not self.use_policy_training_preset and (self.optimizer is None or self.scheduler is None):
raise ValueError("Optimizer and Scheduler must be set when the policy presets are not used.")
elif self.use_policy_training_preset and not self.resume:
self.optimizer = active_cfg.get_optimizer_preset()
self.scheduler = active_cfg.get_scheduler_preset()
self.optimizer = self.policy.get_optimizer_preset()
self.scheduler = self.policy.get_scheduler_preset()
if hasattr(active_cfg, "push_to_hub") and active_cfg.push_to_hub and not active_cfg.repo_id:
raise ValueError("'repo_id' argument missing. Please specify it to push the model to the hub.")
if self.policy.push_to_hub and not self.policy.repo_id:
raise ValueError(
"'policy.repo_id' argument missing. Please specify it to push the model to the hub."
)
if self.use_rabc and not self.rabc_progress_path:
# Auto-detect from dataset path
repo_id = self.dataset.repo_id
if self.dataset.root:
self.rabc_progress_path = str(Path(self.dataset.root) / "sarm_progress.parquet")
else:
self.rabc_progress_path = f"hf://datasets/{repo_id}/sarm_progress.parquet"
@classmethod
def __get_path_fields__(cls) -> list[str]:
"""Keys for draccus pretrained-path loading."""
return ["policy", "reward_model"]
"""This enables the parser to load config from the policy using `--policy.path=local/dir`"""
return ["policy"]
def to_dict(self) -> dict[str, Any]:
return draccus.encode(self) # type: ignore[no-any-return] # because of the third-party library draccus uses Any as the return type
@@ -256,15 +205,6 @@ class TrainPipelineConfig(HubMixin):
) from e
cli_args = kwargs.pop("cli_args", [])
if config_file is not None:
with open(config_file) as f:
config = json.load(f)
migrated_config = _migrate_legacy_rabc_fields(config)
if migrated_config is not None:
with tempfile.NamedTemporaryFile("w+", delete=False, suffix=".json") as f:
json.dump(migrated_config, f)
config_file = f.name
with draccus.config_type("json"):
return draccus.parse(cls, config_file, args=cli_args)

View File

@@ -37,14 +37,6 @@ from .dataset_tools import (
from .factory import make_dataset, resolve_delta_timestamps
from .image_writer import safe_stop_image_writer
from .io_utils import load_episodes, write_stats
from .language import (
EVENT_ONLY_STYLES,
LANGUAGE_EVENTS,
LANGUAGE_PERSISTENT,
PERSISTENT_STYLES,
STYLE_REGISTRY,
column_for_style,
)
from .lerobot_dataset import LeRobotDataset
from .multi_dataset import MultiLeRobotDataset
from .pipeline_features import aggregate_pipeline_dataset_features, create_initial_features
@@ -61,15 +53,10 @@ __all__ = [
"CODEBASE_VERSION",
"DEFAULT_EPISODES_PATH",
"DEFAULT_QUANTILES",
"EVENT_ONLY_STYLES",
"EpisodeAwareSampler",
"LANGUAGE_EVENTS",
"LANGUAGE_PERSISTENT",
"LeRobotDataset",
"LeRobotDatasetMetadata",
"MultiLeRobotDataset",
"PERSISTENT_STYLES",
"STYLE_REGISTRY",
"StreamingLeRobotDataset",
"VideoEncodingManager",
"add_features",
@@ -79,7 +66,6 @@ __all__ = [
"convert_image_to_video_dataset",
"create_initial_features",
"create_lerobot_dataset_card",
"column_for_style",
"delete_episodes",
"get_feature_stats",
"load_episodes",

View File

@@ -97,8 +97,8 @@ def update_data_df(df, src_meta, dst_meta):
pd.DataFrame: Updated DataFrame with adjusted indices.
"""
df["episode_index"] = df["episode_index"] + dst_meta.info.total_episodes
df["index"] = df["index"] + dst_meta.info.total_frames
df["episode_index"] = df["episode_index"] + dst_meta.info["total_episodes"]
df["index"] = df["index"] + dst_meta.info["total_frames"]
src_task_names = src_meta.tasks.index.take(df["task_index"].to_numpy())
df["task_index"] = dst_meta.tasks.loc[src_task_names, "task_index"].to_numpy()
@@ -225,9 +225,9 @@ def update_meta_data(
# Clean up temporary columns
df = df.drop(columns=["_orig_chunk", "_orig_file"])
df["dataset_from_index"] = df["dataset_from_index"] + dst_meta.info.total_frames
df["dataset_to_index"] = df["dataset_to_index"] + dst_meta.info.total_frames
df["episode_index"] = df["episode_index"] + dst_meta.info.total_episodes
df["dataset_from_index"] = df["dataset_from_index"] + dst_meta.info["total_frames"]
df["dataset_to_index"] = df["dataset_to_index"] + dst_meta.info["total_frames"]
df["episode_index"] = df["episode_index"] + dst_meta.info["total_episodes"]
return df
@@ -237,8 +237,8 @@ def aggregate_datasets(
aggr_repo_id: str,
roots: list[Path] | None = None,
aggr_root: Path | None = None,
data_files_size_in_mb: int | None = None,
video_files_size_in_mb: int | None = None,
data_files_size_in_mb: float | None = None,
video_files_size_in_mb: float | None = None,
chunk_size: int | None = None,
):
"""Aggregates multiple LeRobot datasets into a single unified dataset.
@@ -313,8 +313,8 @@ def aggregate_datasets(
# to avoid interference between different source datasets
data_idx.pop("src_to_dst", None)
dst_meta.info.total_episodes += src_meta.total_episodes
dst_meta.info.total_frames += src_meta.total_frames
dst_meta.info["total_episodes"] += src_meta.total_episodes
dst_meta.info["total_frames"] += src_meta.total_frames
finalize_aggregation(dst_meta, all_metadata)
logging.info("Aggregation complete.")
@@ -640,10 +640,14 @@ def finalize_aggregation(aggr_meta, all_metadata):
write_tasks(aggr_meta.tasks, aggr_meta.root)
logging.info("write info")
aggr_meta.info.total_tasks = len(aggr_meta.tasks)
aggr_meta.info.total_episodes = sum(m.total_episodes for m in all_metadata)
aggr_meta.info.total_frames = sum(m.total_frames for m in all_metadata)
aggr_meta.info.splits = {"train": f"0:{sum(m.total_episodes for m in all_metadata)}"}
aggr_meta.info.update(
{
"total_tasks": len(aggr_meta.tasks),
"total_episodes": sum(m.total_episodes for m in all_metadata),
"total_frames": sum(m.total_frames for m in all_metadata),
"splits": {"train": f"0:{sum(m.total_episodes for m in all_metadata)}"},
}
)
write_info(aggr_meta.info, aggr_meta.root)
logging.info("write stats")

View File

@@ -512,7 +512,7 @@ def compute_episode_stats(
ep_stats = {}
for key, data in episode_data.items():
if features[key]["dtype"] in {"string", "language"}:
if features[key]["dtype"] == "string":
continue
if features[key]["dtype"] in ["image", "video"]:

View File

@@ -34,13 +34,16 @@ from .io_utils import (
load_episodes,
load_info,
load_stats,
load_subtasks,
load_tasks,
write_info,
write_json,
write_stats,
write_tasks,
)
from .utils import (
DEFAULT_EPISODES_PATH,
INFO_PATH,
check_version_compatibility,
get_safe_version,
has_legacy_hub_download_metadata,
@@ -174,6 +177,7 @@ class LeRobotDatasetMetadata:
self.info = load_info(self.root)
check_version_compatibility(self.repo_id, self._version, CODEBASE_VERSION)
self.tasks = load_tasks(self.root)
self.subtasks = load_subtasks(self.root)
self.episodes = load_episodes(self.root)
self.stats = load_stats(self.root)
@@ -224,7 +228,7 @@ class LeRobotDatasetMetadata:
@property
def _version(self) -> packaging.version.Version:
"""Codebase version used to create this dataset."""
return packaging.version.parse(self.info.codebase_version)
return packaging.version.parse(self.info["codebase_version"])
def get_data_file_path(self, ep_index: int) -> Path:
"""Return the relative parquet file path for the given episode index.
@@ -279,27 +283,27 @@ class LeRobotDatasetMetadata:
@property
def data_path(self) -> str:
"""Formattable string for the parquet files."""
return self.info.data_path
return self.info["data_path"]
@property
def video_path(self) -> str | None:
"""Formattable string for the video files."""
return self.info.video_path
return self.info["video_path"]
@property
def robot_type(self) -> str | None:
"""Robot type used in recording this dataset."""
return self.info.robot_type
return self.info["robot_type"]
@property
def fps(self) -> int:
"""Frames per second used during data collection."""
return self.info.fps
return self.info["fps"]
@property
def features(self) -> dict[str, dict]:
"""All features contained in the dataset."""
return self.info.features
return self.info["features"]
@property
def image_keys(self) -> list[str]:
@@ -316,39 +320,6 @@ class LeRobotDatasetMetadata:
"""Keys to access visual modalities (regardless of their storage method)."""
return [key for key, ft in self.features.items() if ft["dtype"] in ["video", "image"]]
@property
def has_language_columns(self) -> bool:
"""Return ``True`` if the dataset declares any language column.
Used to gate language-aware code paths (collate, render step) so
unannotated datasets keep PyTorch's default collate behavior.
"""
from .language import LANGUAGE_COLUMNS # noqa: PLC0415 (avoid circular import)
return any(col in self.features for col in LANGUAGE_COLUMNS)
@property
def tools(self) -> list[dict]:
"""OpenAI-style tool schemas declared by this dataset.
Read from ``meta/info.json["tools"]``. Returns a copy, so callers
can mutate the result safely. Falls back to
:data:`lerobot.datasets.language.DEFAULT_TOOLS` (the canonical
``say`` schema) when the dataset doesn't declare any — that way
unannotated datasets and chat-template consumers
(``apply_chat_template(messages, tools=meta.tools)``) keep
working out of the box.
Implementations live under :mod:`lerobot.tools` (one file per
tool); see ``docs/source/tools.mdx`` for the authoring guide.
"""
from .language import DEFAULT_TOOLS # noqa: PLC0415 (avoid circular import)
declared = self.info.tools
if declared:
return [dict(t) for t in declared]
return [dict(t) for t in DEFAULT_TOOLS]
@property
def names(self) -> dict[str, list | dict]:
"""Names of the various dimensions of vector modalities."""
@@ -362,32 +333,32 @@ class LeRobotDatasetMetadata:
@property
def total_episodes(self) -> int:
"""Total number of episodes available."""
return self.info.total_episodes
return self.info["total_episodes"]
@property
def total_frames(self) -> int:
"""Total number of frames saved in this dataset."""
return self.info.total_frames
return self.info["total_frames"]
@property
def total_tasks(self) -> int:
"""Total number of different tasks performed in this dataset."""
return self.info.total_tasks
return self.info["total_tasks"]
@property
def chunks_size(self) -> int:
"""Max number of files per chunk."""
return self.info.chunks_size
return self.info["chunks_size"]
@property
def data_files_size_in_mb(self) -> int:
"""Max size of data file in mega bytes."""
return self.info.data_files_size_in_mb
return self.info["data_files_size_in_mb"]
@property
def video_files_size_in_mb(self) -> int:
"""Max size of video file in mega bytes."""
return self.info.video_files_size_in_mb
return self.info["video_files_size_in_mb"]
def get_task_index(self, task: str) -> int | None:
"""
@@ -531,10 +502,10 @@ class LeRobotDatasetMetadata:
self._save_episode_metadata(episode_dict)
# Update info
self.info.total_episodes += 1
self.info.total_frames += episode_length
self.info.total_tasks = len(self.tasks)
self.info.splits = {"train": f"0:{self.info.total_episodes}"}
self.info["total_episodes"] += 1
self.info["total_frames"] += episode_length
self.info["total_tasks"] = len(self.tasks)
self.info["splits"] = {"train": f"0:{self.info['total_episodes']}"}
write_info(self.info, self.root)
@@ -553,7 +524,7 @@ class LeRobotDatasetMetadata:
for key in video_keys:
if not self.features[key].get("info", None):
video_path = self.root / self.video_path.format(video_key=key, chunk_index=0, file_index=0)
self.info.features[key]["info"] = get_video_info(video_path)
self.info["features"][key]["info"] = get_video_info(video_path)
def update_chunk_settings(
self,
@@ -575,17 +546,17 @@ class LeRobotDatasetMetadata:
if chunks_size is not None:
if chunks_size <= 0:
raise ValueError(f"chunks_size must be positive, got {chunks_size}")
self.info.chunks_size = chunks_size
self.info["chunks_size"] = chunks_size
if data_files_size_in_mb is not None:
if data_files_size_in_mb <= 0:
raise ValueError(f"data_files_size_in_mb must be positive, got {data_files_size_in_mb}")
self.info.data_files_size_in_mb = data_files_size_in_mb
self.info["data_files_size_in_mb"] = data_files_size_in_mb
if video_files_size_in_mb is not None:
if video_files_size_in_mb <= 0:
raise ValueError(f"video_files_size_in_mb must be positive, got {video_files_size_in_mb}")
self.info.video_files_size_in_mb = video_files_size_in_mb
self.info["video_files_size_in_mb"] = video_files_size_in_mb
# Update the info file on disk
write_info(self.info, self.root)
@@ -664,6 +635,7 @@ class LeRobotDatasetMetadata:
_validate_feature_names(features)
obj.tasks = None
obj.subtasks = None
obj.episodes = None
obj.stats = None
obj.info = create_empty_dataset_info(
@@ -681,7 +653,7 @@ class LeRobotDatasetMetadata:
f"Features contain video keys {obj.video_keys}, but 'use_videos' is set to False. "
"Either remove video features from the features dict, or set 'use_videos=True'."
)
write_info(obj.info, obj.root)
write_json(obj.info, obj.root / INFO_PATH)
obj.revision = None
obj._pq_writer = None
obj.latest_episode = None

Some files were not shown because too many files have changed in this diff Show More