Compare commits

...

111 Commits

Author SHA1 Message Date
CarolinePascal
e6e54391bd fix(missing method): adding missing add_microphones_recordings method in LeRobotDataset 2026-04-29 16:17:40 +02:00
CarolinePascal
a201b33d20 chore(typos): fixing uncaught typos 2026-04-29 16:13:03 +02:00
CarolinePascal
9d42de328e Merge branch 'main' into feat/audio_dataset 2026-04-01 19:09:09 +02:00
Pepijn
15934d8d08 feat(policies): add relative action support for pi0, pi0.5, and pi0_fast (#2970)
* Add option for pi family models to train with relative actions (relative to state)

* formatting

* add recomputation of stats and option to compute delta stats

* normalzie after delta conversion

* only recompute state for stats

* calulate chunk based stats

* sample 100k

* load from parquet

* sample 1m

* stats per chunck

* fix

* use quantiles

* stats for entire dataset

* fix

* max 1m frames

* compute before dist

* fix multi gpu processor bug

* Fix RTC with delta actions and OpenArms motor_type wiring

* feat: align pi0_fast delta actions with pi0/pi05 and add RTC integration tests

- Add delta_exclude_joints and action_feature_names to PI0FastConfig
- Move to_absolute_actions from modeling to processor pipeline for pi0_fast
- Add delta action detection and logging to eval_with_real_robot.py
- Add delta actions documentation to pi0 and pi05 READMEs
- Fix ruff lint issues in test_delta_actions.py
- Add test_rtc_delta_actions.py (24 tests) covering:
  - ActionQueue with delta vs absolute actions
  - RTC denoise step with delta leftovers
  - Full pipeline roundtrip (delta → RTC → absolute)
  - State rebasing approximation bounds
  - Non-delta policy compatibility
  - Multi-chunk consistency

* chore: clean up test comments, add OpenPI attribution, remove debug logging

- Replace decorative comment separators in test files with plain section headers
- Add attribution comments for 1e-6 epsilon in normalize_processor.py (from OpenPI)
- Remove debug logging blocks from lerobot_train.py

* refactor: extract compute_delta_action_stats into compute_stats.py

Move the ~70-line inline delta action stats block from lerobot_train.py
into a dedicated function in compute_stats.py, where all other stats
computation already lives. The training script now calls it in 6 lines.

* refactor: remove unused get_processed_left_over from ActionQueue

This method was never called outside of tests. Leftover actions for RTC
guidance are always retrieved via get_left_over() (delta/original space).

* revert: remove logging-only changes from eval_with_real_robot.py

The delta actions detection helper and log message added no functional
value — the script already handles delta policies correctly via the
processor pipeline.

* refactor: use ACTION/OBS_STATE constants instead of hardcoded strings

Replace hardcoded "action" and "observation.state" with ACTION and
OBS_STATE from utils.constants in compute_stats.py, dataset_tools.py,
and lerobot_train.py.

* style: remove stray blank lines in training loop

* refactor: move delta action stats to preprocessing step, remove on-the-fly computation

- Remove on-the-fly compute_delta_action_stats from lerobot_train.py
- Rewrite recompute_stats to delegate action stats to compute_delta_action_stats
  (chunk-based sampling matching what the model sees during training)
- Add chunk_size parameter to recompute_stats for delta action computation
- Add delta actions documentation to pi0.mdx and pi05.mdx

* feat: add recompute_stats CLI operation to lerobot-edit-dataset

* fix(tests): relax quantile normalization test tolerance for 1e-6 epsilon

* chore: remove agents_memory/pr_details.md from repo

* refactor: rename delta actions to relative actions throughout

What OpenPI calls "DeltaActions" is actually UMI's "relative trajectory"
representation: each action in the chunk is an offset from the current
state, not from the previous action. This avoids error accumulation.

Renamed across all source, tests, docs, and CLI:
- DeltaActionsProcessorStep → RelativeActionsProcessorStep
- to_delta_actions → to_relative_actions
- use_delta_actions → use_relative_actions
- delta_exclude_joints → relative_exclude_joints
- compute_delta_action_stats → compute_relative_action_stats
- delta_action_processor.py → relative_action_processor.py
- test_delta_actions.py → test_relative_actions.py

Kept as-is: AbsoluteActionsProcessorStep (converts TO absolute),
registry ID "delta_actions_processor" (backward compat), and unrelated
delta references (IK pipeline, Robosuite, RA-BC metrics, gym envs).

* docs: add Action Representations guide

Dedicated page explaining absolute, relative, and delta actions with
numerical examples, joint vs EE space, and how to use kinematics
pipelines and the relative action processor. References UMI paper
(Chi et al., 2024) for the terminology.

* docs: remove redundant OpenPI naming note from action representations

* docs: remove opinionated OpenPI reference from delta actions section

* docs: replace ASCII diagram with UMI paper figure

* docs: remove OpenPI reference from action representations

* docs: use HF-hosted image instead of local asset

* docs: clarify figure attribution

* revert: restore original normalization epsilon behavior

The 1e-6 unconditional epsilon change perturbed all normalized values,
breaking backward compatibility tests. The original approach (1e-8 eps
for MEAN_STD, conditional torch.where for QUANTILES) already handles
division by zero correctly without affecting non-degenerate cases.

* fix: restore delta_action_processor.py used by phone/RL teleop

The rename commit incorrectly deleted delta_action_processor.py and
duplicated its classes into relative_action_processor.py. Restore the
original file and import from it instead.

* fix(processor): address PR #2970 review comments

- Remove shebang from relative_action_processor.py (library module, not script)
- Add device alignment in to_relative_actions/to_absolute_actions so _last_state
  on CPU doesn't cause cross-device errors when actions are on CUDA
- Rename delta_step → relative_step in AbsoluteActionsProcessorStep for naming
  consistency; update factory.py, all processor files, and tests
- Expand _reconnect_relative_absolute_steps docstring to explain why post-hoc
  rewiring is needed after deserialization
- Fix off-by-one in compute_stats.py: sample_upper_bound = total_frames - chunk_size + 1
  so last valid start index is included and total_frames == chunk_size is not rejected
- Remove redundant NOTE comment in processor_pi05.py (duplicated two lines below)
- Fix pi0_fast processor ordering: move relative_step before NormalizerProcessorStep
  so normalizer sees delta actions (matching pi0/pi05); flip postprocessor to
  unnormalize → absolute accordingly. Relative stats are now required for all pi models
- Revert use_relative_joint_actions_aloha → use_delta_joint_actions_aloha in
  configuration_smolvla.py (preserve existing public API)
- Update action_representations.mdx: add missing joint to 6-DOF example, fix
  'based on a figure', clarify pi family ordering, add RTC compatibility section

* update rtc link

* feat: compute relative action stats over full dataset with optional parallelism

Remove the 100k sample cap from compute_relative_action_stats and process
all valid chunks. Vectorize with numpy (pre-load actions/states, fancy
indexing + broadcasting) for a large speedup over the per-index HF dataset
loop. Add num_workers param for thread-based parallelism (numpy releases
the GIL). Update docs to show --push_to_hub for recompute_stats.

* style: apply ruff formatting to compute_stats.py

* testing on real robot

* style: fix ruff format and remove redundant .keys() calls
2026-04-01 12:59:12 +02:00
Jai Kumaar Ratadia
9300352876 Fix SO-101 assembly instruction order to match videos (#3242)
* Fix SO-101 assembly instruction order to match videos

Motor horn installation steps were listed after placing motors
into the housing, but the assembly videos show installing horns
first. Reordered steps to match the videos, which is also the
easier approach since horns are harder to attach once the motor
is seated. Added missing detail that bottom horns don't require
screws.

* Update docs/source/so101.mdx

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Jai Kumaar Ratadia <jaikumaarratadia@gmail.com>

---------

Signed-off-by: Jai Kumaar Ratadia <jaikumaarratadia@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com>
2026-03-31 12:16:34 +02:00
Steven Palma
720cf8e3a0 Revert "fix(deps): breaking change from transformers 5.4.0" (#3249)
* Revert "fix(deps): breaking change from transformers 5.4.0 (#3231)"

This reverts commit 07502868e5.

* chore(dependecies): pin transformers to 5.3.0 temporarily
2026-03-30 19:11:41 +02:00
Steven Palma
5d4fdf5088 feat(scripts): add transformers version (#3248)
* feat(scripts): add transformers and torch version

* chore(scripts): remove pytorch
2026-03-30 16:33:17 +02:00
四七
3b185f7f9d fix(datasets): remove unreachable timestamp branch in add_frame (#3163)
* fix(datasets): remove unreachable timestamp branch in add_frame and document caller contract

- Remove dead code: frame.pop("timestamp") branch in add_frame() could never
  execute because validate_frame() raises ValueError for any DEFAULT_FEATURES
  key (including timestamp) before we reach that line.
- Expand add_frame() docstring: explicitly document that timestamp and
  frame_index must NOT be passed by the caller.
- Add explanatory comment in validate_frame(): clarifies why DEFAULT_FEATURES
  are excluded from expected_features, preventing future re-introduction of
  the dead branch.

The dead branch originated in #1200, which fixed a shape-(1,) mismatch for a
code path that was subsequently made unreachable by a refactor of validate_frame.

* chore(datasets): narrow PR scope

* fix(datasets): move add_frame timestamp cleanup to dataset_writer
2026-03-28 11:37:57 +01:00
Bryson Jones
2e069b1c47 Feature/add multitask diffusion transformer policy implementation (#2545)
* Add multitask diffusion transformer policy

Add multitask diffusion transformer policy

* expand the observation encoder to support differnt size encoders for vision and text

* add RoPE attention module as this is shown to help training dynamics and generation quality for DiTs

* update readme and citations for multitask dit policy

* remove dino vision encoder and simplify text and vision encoders by removing inheritance structure

* adjust factory comment

* update docstring for multitask dit policy processor file

* simplify config for multitask dit by merging and flattening everything, then adding comments to denote where some parameters are only used for specific objectives

* add references to the modeling file comments

* merge all modules files into the main modeling file

* add torch.no_grad decorators

* split up select action return statement

* remove redundant asserts

* add tutorial to training with multi_task_dit

* fix bugs when testing on hardware

* remove environment state conditioning

* update typo in test instruction comment

* add processor tests to multitask dit tests

* move policy to top of file

* use constants for indexing into batches and remove env state references

* remove the base classes since we don't need to be able to extend

* fix nit formatting in generate actions fcn

* reformat and clean up tutorial for multitask dit policy

* add more descriptions and depth to multitask dit tutorial

* note origins of each training objective

* rename config param for multiple vision encoders

* refactor code to perform task tokenization in the processor instead of in the modeling code for multitask dit

* add multitask dit to toc for docs

* add conditional transformers import to match all other policies that use transformers lib

* add test handling for multitask dit when transformers isnt available

* skip tests without transformers

* remove cropping of images smaller than the crop size

* add kwargs arg to multitask dit constructor

* add wallx dep conflict management for multitask dit policy

* use hyphens for cleanliness in pyproject.toml

* add conflict management to pyproject toml for pi conflict for mtdp as well

* update tests script to not use unnecessary uv sync call which resolves dependencies that do not need to run. This drastically reduces CI run time

* revert fast tests edits

* update docs and readme files, fixing some typos and adding multitask dit to readme

* chore(dependencies): upgrade transformers + hggingface-hub + peft + scipy

* chore(dependencies): bump pi0 family to transformers v5

* chore(dependencies): bump wall x to transformers v5

* chore(dependencies): bump gr00t to transformers v5

* chore(style): fix pre-commit

* fix(policy): xvla forced_bos_token missing

* test(rl): skip ci tests for resnet10

* Fix: full pi models support for transformer v5 (#2967)

* fix(pi): remove loss truncation

* fix(pi): remove state padding before tokenization

* fix(pi): fix image padding value

* fix from_pretrain

* add transformer v5 changes

* remove reference

* more fixes

* make it work

* add support for rest of pi family

* add pifast work

* more changes

* more changes

* more cleanup

* fix torch params

* dtype fix

* torch compile

* embed mismatch fix

* revert groot

* more nit fixes

* remove unused classes

* more fixes

* revert

* nit

* torch dtype warning fix

* but back dynamic renaming

* add tie embedding

---------

Co-authored-by: Yufei Sun <skieyfly@gmail.com>

* chore: fix XVLA in transformers v5 (#3006)

* test(policies): enable wall x CI testing

* style(test): pre-commit check

* style(test): pre-commit

---------

Signed-off-by: Bryson Jones <63133702+brysonjones@users.noreply.github.com>
Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com>
Co-authored-by: Steven Palma <imstevenpmwork@ieee.org>
Co-authored-by: Jade Choghari <chogharijade@gmail.com>
Co-authored-by: Yufei Sun <skieyfly@gmail.com>
Co-authored-by: Steven Palma <steven.palma@huggingface.co>
2026-03-28 00:41:26 +01:00
Steven Palma
4e45acca52 fix(dataset): use revision-safe Hub cache for downloaded datasets (#3233)
* refactor(dataset): enhance dataset root directory handling and introduce hub cache support

- Updated DatasetConfig and LeRobotDatasetMetadata to clarify root directory behavior and introduce a dedicated hub cache for downloads.
- Refactored LeRobotDataset and StreamingLeRobotDataset to utilize the new hub cache and improve directory management.
- Added tests to ensure correct behavior when using the hub cache and handling different revisions without a specified root directory.

* refactor(dataset): improve root directory handling in LeRobotDataset

- Updated LeRobotDataset to store the requested root path separately from the actual root path.
- Adjusted metadata loading to use the requested root, enhancing clarity and consistency in directory management.

* refactor(dataset): minor improvements for hub cache support

* chore(datasets): guard in resume + assertion test

---------

Co-authored-by: AdilZouitine <adilzouitinegm@gmail.com>
Co-authored-by: mickaelChen <mickael.chen.levinson@gmail.com>
2026-03-27 22:21:55 +01:00
Maxime Ellerbach
975d89b38d chore(docs): add more guidance to bring your own policies tutorial (#3230)
* chore(docs): add more guidance to bring your own policies tutorial

* removing normalization to avoid confusion with processors

* trailing whitespace

* Update docs/source/bring_your_own_policies.mdx

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net>

* Update docs/source/bring_your_own_policies.mdx

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net>

* adding get optim params and predict_action chunk

* removing extra quotes

---------

Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net>
2026-03-27 21:25:37 +01:00
Maxime Ellerbach
07502868e5 fix(deps): breaking change from transformers 5.4.0 (#3231)
* fix(deps): breaking change from transformers 5.4.0

* Update src/lerobot/policies/xvla/modeling_florence2.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net>

* Update src/lerobot/policies/wall_x/qwen_model/qwen2_5_vl_moe.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net>

* removing dataclass

* bumping transformers 5.4.0

---------

Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-03-27 21:25:12 +01:00
Reece O'Mahoney
aa9cc9bd43 fix(logging): suppress noisy httpx INFO logs (#3173)
Set httpx logger level to WARNING in init_logging to prevent
HTTP request traces from flooding the terminal during train and
eval scripts.

Co-authored-by: Steven Palma <imstevenpmwork@ieee.org>
2026-03-26 21:05:15 +01:00
Steven Palma
123495250b refactor(dataset): split LeRobotDataset into DatasetReader & DatasetWriter (+ API cleanup) (#3180)
* refactor(dataset): split reader and writer

* chore(dataset): remove proxys

* refactor(dataset): better reader & writer encapsulation

* refactor(datasets): clean API + reduce leaky implementations

* refactor(dataset): API cleaning for writer, reader and meta

* refactor(dataset): expose writer & reader + other minor improvements

* refactor(dataset): improve teardown routine

* refactor(dataset): add hf_dataset property at the facade level

* chore(dataset): add init for datasset module

* docs(dataset): add docstrings for public API of the dataset classes

* tests(dataset): add tests for new classes

* fix(dataset): remove circular dependecy
2026-03-26 19:09:25 +01:00
Jade Choghari
017ff73fbf chore(docs): add rename map and empty cam guide (#3065)
* add blog/guide

* add to tree

* chore(docs): rephrase rename_map docs for clarity and simplicity

---------

Co-authored-by: Steven Palma <steven.palma@huggingface.co>
Co-authored-by: Steven Palma <imstevenpmwork@ieee.org>
2026-03-23 13:57:53 -07:00
Praedico
f90db58c15 docs(async): fix GitHub issues link (#3186) 2026-03-19 22:32:07 -07:00
Altman
e64fa667c3 fix(vqbet): use in-place fill_ to avoid overwriting DDP GPU buffers with CPU tensors (#3128)
* fix(vqbet): use in-place fill_ to avoid overwriting DDP GPU buffers with CPU tensors

When VQ discretization phase completes, the code was overwriting
register_buffer('discretized') and register_buffer('freeze_codebook')
with torch.tensor(True), which is created on CPU. DDP then fails in
_sync_buffers() with: RuntimeError: No backend type associated with
device type cpu. Fix by updating the buffers in-place with .fill_(True)
so device and registration are preserved.

Made-with: Cursor

* test(vqbet): add regression test for in-place buffer update during discretization

Verifies that discretize() updates the 'discretized' and 'freeze_codebook'
registered buffers in-place (via fill_()) rather than replacing them with new
CPU tensors. The test checks data_ptr() identity and that the tensors remain
registered buffers after the call. This prevents regressions of the DDP fix.

Made-with: Cursor

* test(vqbet): add GPU regression test to verify buffers stay on CUDA after discretize()

Directly catches the original DDP failure mode: when buffers are replaced with
torch.tensor(True) they land on CPU, causing NCCL to raise 'No backend type
associated with device type cpu' in _sync_buffers(). The GPU test places the
model on cuda:0 and asserts both buffers remain on CUDA after discretization.

Made-with: Cursor

* test(vqbet): simplify to single device-check test in test_policies.py

Per reviewer feedback: remove the separate test file and replace the two
CPU/GPU tests (with data_ptr checks) with a single focused test in
tests/policies/test_policies.py that only asserts the registered buffers
remain on the model device after discretize(). Uses DEVICE from tests/utils.py
so it runs on whatever device the CI/user selects (cpu, cuda, mps).

Made-with: Cursor

* style: fix import order in test_policies.py to pass ruff/pre-commit checks

Made-with: Cursor

---------

Co-authored-by: Zhan DiJia <2476100824@example.com>
Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
2026-03-18 13:24:07 +01:00
Khalil Meftah
d9ec3a6fa2 Fix/earth rover dataset features (#3088)
* docs(earthrover): update EarthRover Mini Plus dataset features and descriptions

* refactor(teleop): rename rover action keys to linear_velocity/angular_velocity

* fix(earthrover): align observation and action features with frodobots/berkeley-frodobots-lerobot-7k

* chore: address PR review comments

* ci: retrigger checks
2026-03-17 18:33:53 +01:00
Steven Palma
d90e4bcfd3 refactor(dataset): modular files (#3171)
* refactor(dataset): modular files

* refactor(dataset): update imports across the codebase
2026-03-15 23:58:09 -07:00
Steven Palma
9d3b62aa61 chore(dataset): basic house-keeping (#3170) 2026-03-15 22:12:09 -07:00
CarolinePascal
8b9451b585 fix(mypy): fixing mypy tests 2026-01-20 12:25:55 +01:00
CarolinePascal
ab4903e752 tests(clean): cleaning up audio tests 2026-01-20 12:25:55 +01:00
CarolinePascal
538cea6dbc fix(race condition): fixing race condition in stop_recording (wait for proper stop before cleaning) 2026-01-20 12:25:55 +01:00
CarolinePascal
5cd3572713 chore(busy_wait): renaming busy_wait into precise_sleep 2026-01-20 12:25:55 +01:00
CarolinePascal
3399513e5e feat(control loop): adding logs for control loop delays 2026-01-20 12:25:55 +01:00
CarolinePascal
32fc4015ee chore(headers): fixing headers 2026-01-20 12:25:55 +01:00
CarolinePascal
cc72c813bf fix(process init timeout): adding proper support when waiting for record process initialization 2026-01-20 12:25:55 +01:00
CarolinePascal
606f31a86e feat(tactile becnhmark): adding illustrative tactile sensor benchmark 2026-01-20 12:25:55 +01:00
CarolinePascal
4933c9dcc7 chore(touchlab imports): adding imports for touchlab sensors 2026-01-20 12:25:52 +01:00
CarolinePascal
7e25385024 feat(touchlab tactile sensors): adding initial support for TouchLab tactile sensors 2026-01-20 12:25:01 +01:00
CarolinePascal
cc70bff74d fix(increasing DTS): fix issue with non strictly increasing DTS in media files concatenation 2026-01-20 12:25:01 +01:00
CarolinePascal
9f50913b9c fix(typos): fixing typos in audio handling in LeRobotDataset 2026-01-20 12:25:01 +01:00
CarolinePascal
4eb7694d47 test(rerun audio): adding tests for audio visualization with rerun 2026-01-20 12:25:01 +01:00
CarolinePascal
edb5559b5b fix(rerun audio): fixing rerun visualization for audio 2026-01-20 12:24:56 +01:00
CarolinePascal
552ec76195 feat(audio dataset conversion): adding support for audio in dataset v2.1 -> v3.0 conversion scipts 2026-01-20 12:24:00 +01:00
CarolinePascal
e75340b473 feat(aggregate audio): adding support for audio in dataset aggregation functions 2026-01-20 12:24:00 +01:00
CarolinePascal
2a4c223ec7 feat(parametrized audio processor): adding parameters for AudioProcessorStep definition 2026-01-20 12:24:00 +01:00
CarolinePascal
1ee4d84f07 fix(audio visualization): making audio visualization robust to multiple channel audio data 2026-01-20 12:24:00 +01:00
CarolinePascal
6bd40ca219 fix(__str__): adding missing __str__ method in PortAudioMicrophone 2026-01-20 12:24:00 +01:00
CarolinePascal
b879cf3d04 feat(shared array): removing queues copy and flush delays with a SharedArray inter-process communication 2026-01-20 12:24:00 +01:00
CarolinePascal
bd9e5c1a64 fix(is_writing): adding a correct definition to is_writing based on output file opening and closing 2026-01-20 12:24:00 +01:00
CarolinePascal
9271a0c900 fix(Microphone): adding proper logger definition, timeout for end of recording monitoring and failsafe __del__ method 2026-01-20 12:24:00 +01:00
CarolinePascal
af2f044f5a feat(rolling vstack): opting for an inplace copy efficient implementation of the rolling vstack for the audio buffer 2026-01-20 12:24:00 +01:00
CarolinePascal
0caba222ef fix(typos): fixing typos and missing imports 2026-01-20 12:23:58 +01:00
CarolinePascal
6d73f5bfe6 test(Microphone): removing unittest.TestCase class architecture to add tests parametrization on multiprocessing/multithreading use 2026-01-20 12:23:22 +01:00
CarolinePascal
ef8f40c21b test(LeRobotDataset): add missing test and support for audio frames addition 2026-01-20 12:23:19 +01:00
CarolinePascal
0232879245 test(Microphone): adding missing testsand support for float sample rate 2026-01-20 12:20:28 +01:00
CarolinePascal
2726b4e865 fix(hw-dataset): adding missing support for audio in hw-to-dataset functions 2026-01-20 12:20:28 +01:00
CarolinePascal
e126d35249 test(Microphone): adding tests for the PortAudioMicrophone class 2026-01-20 12:20:27 +01:00
CarolinePascal
d7ae8cd699 docs(mic utils): adding complete docstrings for Microphone utils functions and removing unused utils function 2026-01-20 12:20:27 +01:00
CarolinePascal
2f96d8bf76 fix(Microphone): adding missing properties and argument in Microphone base class 2026-01-20 12:20:27 +01:00
CarolinePascal
e129c71b4f feat(sounddevice SDK): adding SDK interface and support for sounddevice dependency injection 2026-01-20 12:20:27 +01:00
CarolinePascal
a02d70389d feat(clear-less start): adding support for synchronized recording start without delaying queue clearings 2026-01-20 12:20:27 +01:00
CarolinePascal
0d4922ce49 refactor(properties): making microphones properties more robust and adding proper checks on state changes 2026-01-20 12:20:27 +01:00
CarolinePascal
eaeff78924 style(names): renaming attributes names for better clarity 2026-01-20 12:20:27 +01:00
CarolinePascal
e2f3982e2c style(config validation): storing microphone config validation in dedicated methods 2026-01-20 12:20:27 +01:00
CarolinePascal
a73ac2bdbb [skip-ci] style(clean): cleaning up audio benchmark 2026-01-20 12:20:27 +01:00
CarolinePascal
95de732e55 [skip-ci] feat(multiprocess audio stream): running input audio stream in a separate process to avoid extensive CPU usage impacts 2026-01-20 12:20:27 +01:00
CarolinePascal
b2383236ca [skip-ci] fix(audio buffers): fill audio buffers before entering the control loop to avoid first audio chunk reading delay 2026-01-20 12:20:27 +01:00
CarolinePascal
4b98cc25c8 [skip-ci] fix(async read): remove async read from LeKiwi to avoid delays 2026-01-20 12:20:27 +01:00
CarolinePascal
90780c4de8 [skip-ci] feat(portaudio timeout): decresing portaudio stream timeout for low latency 2026-01-20 12:20:27 +01:00
CarolinePascal
6f6e046c53 [skip-ci] feat(audio rerun): adding support for multiple channels visualization with rerun 2026-01-20 12:20:27 +01:00
CarolinePascal
8cd64eaad1 fix(resnet weights): fix default ResNet weights for audio 2026-01-20 12:20:27 +01:00
CarolinePascal
e620395416 feat(init audio buffer): adding a separate parameter for the initial audio buffer size 2026-01-20 12:20:27 +01:00
CarolinePascal
0fbcbcdb2e feat(audio ACT): removing normalization and pretrained weights because it does not really make sense 2026-01-20 12:20:26 +01:00
CarolinePascal
674f5dfd75 test(audio frame): fixing decoded audio frame shape 2026-01-20 12:20:26 +01:00
CarolinePascal
7d430c8067 style(return types): adding missing return types 2026-01-20 12:20:26 +01:00
CarolinePascal
5f114c1d74 feat(init audio buffers): adding correct audio buffer initialization with actually recorded background noise instead of pure silence 2026-01-20 12:20:26 +01:00
CarolinePascal
ad01ef19f4 fix(audio buffers): add security crop to avoid audio buffer overfilling 2026-01-20 12:20:26 +01:00
CarolinePascal
59e8f4572c [skip-ci] fix(typo): fixing microphone key typo in audio benchmark 2026-01-20 12:20:26 +01:00
CarolinePascal
97e91698fb [skip-ci] fix(online audio chunks): adding missing observation copy when adding audio buffer 2026-01-20 12:20:26 +01:00
CarolinePascal
af0294198a [skip ci] feat(visualize audio): adding audio recordings visualization in rerun 2026-01-20 12:20:22 +01:00
CarolinePascal
421fdcce96 [skip ci] fix(audio latency): setting microphone recording latency to low as default 2026-01-20 12:12:59 +01:00
CarolinePascal
bb63ad9715 fix(online audio chunks): querying audio chunks as a shifting widow over previous audio samples to match the default audio chunk size 2026-01-20 12:12:59 +01:00
CarolinePascal
3c90a79c57 feat(audio in ACT): adding audio features support in ACT using mel-spectrogram representation 2026-01-20 12:12:59 +01:00
CarolinePascal
8e29c530ed fix(pytorch audio format): switching to pytorch's default channel first format for audio 2026-01-20 12:12:59 +01:00
CarolinePascal
b573b7a052 fix(audio decoding): fixing edge cases where the requested audio chunk starts before the beginning of the recording 2026-01-20 12:12:59 +01:00
CarolinePascal
926184110b feat(audio in policies): adding audio as a input feature in policies 2026-01-20 12:12:59 +01:00
CarolinePascal
bf8ede852d fix(busy wait): removing busy_wait on MacOS for python>3.11 2026-01-20 12:12:59 +01:00
CarolinePascal
f73db4394b fix(audio chunks): querying audio chunks in the past rather than in the future 2026-01-20 12:12:59 +01:00
CarolinePascal
bff91f9927 feat(torchcodec): setting torchcodec as default as the new official release supports audio decoding 2026-01-20 12:12:59 +01:00
CarolinePascal
6d726266fd fix(audio load file): adding missing dimension when loading mono audio data 2026-01-20 12:12:59 +01:00
CarolinePascal
2962330bb1 style(imports): simplify soundfile imports 2026-01-20 12:12:59 +01:00
CarolinePascal
067993bb11 fix(typos): fixing typos 2026-01-20 12:12:58 +01:00
CarolinePascal
e4dd00c8f5 fix(audio feature shape): fixing audio feature shape ordering (frames first, channels second) 2026-01-20 12:12:58 +01:00
CarolinePascal
e714ff22e2 [skip ci] doc(benchmark): adding microphones recording benchmark 2026-01-20 12:12:58 +01:00
CarolinePascal
3bbd161cfd [skip ci] feat(audio recording): adding new asyn start_recording, stop_recording and read functions to avoid for loop delays 2026-01-20 12:12:58 +01:00
CarolinePascal
6d7be63f59 [skip ci] fix(wav format): fix default WAV format for audio storage 2026-01-20 12:12:58 +01:00
CarolinePascal
b9d0dfb9a2 [skip ci] refactor(Microphone class): properly rename Microphone class to PortAudioMicrophone to avoid conflicts with the eponym base class 2026-01-20 12:12:58 +01:00
CarolinePascal
dce483060f [skip ci] feat(audio recording): handle folder creation in start_recording directly 2026-01-20 12:12:58 +01:00
CarolinePascal
c32b9182d9 [skip ci] feat(torchcodec): adding support for torchcodec audio decoding 2026-01-20 12:12:58 +01:00
CarolinePascal
a4d4ef0e7f fix: fixing typos 2026-01-20 12:12:58 +01:00
CarolinePascal
9a5c96b2b1 fix: fixing issue with microphone channels numbering and status recovery on stop_recording 2026-01-20 12:12:58 +01:00
CarolinePascal
0a6ca58299 fix: adding proper definition for "total_audio" in LeRobotDataset 2026-01-20 12:12:58 +01:00
CarolinePascal
688195fc46 docs: add methods descriptions and comments on tricky parts 2026-01-20 12:12:58 +01:00
CarolinePascal
99eb0bbafc Adding last missing audio features in LeRobotDataset 2026-01-20 12:12:58 +01:00
CarolinePascal
16de8b3f19 Adding support for audio data recording and broadcasting for LeKiwi 2026-01-20 12:12:55 +01:00
CarolinePascal
580008663b Adding flag for file writting recording case 2026-01-20 12:12:07 +01:00
CarolinePascal
52c424c5eb Adding multiprocessing support for audio recording 2026-01-20 12:12:07 +01:00
CarolinePascal
836195e59c Renamming sampling rate to sample rate for consistency 2026-01-20 12:12:07 +01:00
CarolinePascal
be09a59e05 Adding audio tests 2026-01-20 12:12:03 +01:00
CarolinePascal
373a169bd2 Fixing sounddevice stream active state recovery and adding corresponding exceptions 2026-01-20 12:10:45 +01:00
CarolinePascal
00536c6c5b Adding missing features for audio frames verification and stats 2026-01-20 12:10:45 +01:00
CarolinePascal
cdd3a859ef Adding pytorch compatible conversion for audio 2026-01-20 12:10:45 +01:00
CarolinePascal
5276fc0d6f Remove variable audio recordings data types (will be converted to float32 anyway) 2026-01-20 12:10:45 +01:00
CarolinePascal
6a2882f978 Adding audio frames reading capability 2026-01-20 12:10:37 +01:00
CarolinePascal
8874547353 Adding microphone recording in control loop 2026-01-20 12:10:12 +01:00
CarolinePascal
2864caad80 Adding audio modality in LeRobotDatasets 2026-01-20 12:10:12 +01:00
CarolinePascal
d998660aa1 Integrate microphones in Robot class 2026-01-20 12:10:09 +01:00
CarolinePascal
7e5f3b35e9 Implementing basic integration of microphones using soundfile and sounddevice 2026-01-20 12:08:27 +01:00
CarolinePascal
01fea7c407 Adding audio dependencies 2026-01-20 12:08:27 +01:00
169 changed files with 15388 additions and 3419 deletions

View File

@@ -100,11 +100,11 @@ lerobot-train \
--dataset.repo_id=lerobot/aloha_mobile_cabinet
```
| Category | Models |
| -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **Imitation Learning** | [ACT](./docs/source/policy_act_README.md), [Diffusion](./docs/source/policy_diffusion_README.md), [VQ-BeT](./docs/source/policy_vqbet_README.md) |
| **Reinforcement Learning** | [HIL-SERL](./docs/source/hilserl.mdx), [TDMPC](./docs/source/policy_tdmpc_README.md) & QC-FQL (coming soon) |
| **VLAs Models** | [Pi0Fast](./docs/source/pi0fast.mdx), [Pi0.5](./docs/source/pi05.mdx), [GR00T N1.5](./docs/source/policy_groot_README.md), [SmolVLA](./docs/source/policy_smolvla_README.md), [XVLA](./docs/source/xvla.mdx) |
| Category | Models |
| -------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Imitation Learning** | [ACT](./docs/source/policy_act_README.md), [Diffusion](./docs/source/policy_diffusion_README.md), [VQ-BeT](./docs/source/policy_vqbet_README.md), [Multitask DiT Policy](./docs/source/policy_multi_task_dit_README.md) |
| **Reinforcement Learning** | [HIL-SERL](./docs/source/hilserl.mdx), [TDMPC](./docs/source/policy_tdmpc_README.md) & QC-FQL (coming soon) |
| **VLAs Models** | [Pi0Fast](./docs/source/pi0fast.mdx), [Pi0.5](./docs/source/pi05.mdx), [GR00T N1.5](./docs/source/policy_groot_README.md), [SmolVLA](./docs/source/policy_smolvla_README.md), [XVLA](./docs/source/xvla.mdx) |
Similarly to the hardware, you can easily implement your own policy & leverage LeRobot's data collection, training, and visualization tools, and share your model to the HF Hub

View File

@@ -0,0 +1,219 @@
#!/usr/bin/env python
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
from soundfile import read
from lerobot.microphones.configs import MicrophoneConfig
from lerobot.microphones.portaudio import PortAudioMicrophone, PortAudioMicrophoneConfig
from lerobot.microphones.utils import (
async_microphones_start_recording,
async_microphones_stop_recording,
make_microphones_from_configs,
)
from lerobot.utils.robot_utils import (
precise_sleep,
)
def main(
microphones_configs: dict[str, MicrophoneConfig],
audio_chunks_number: int,
audio_chunks_duration: float,
repetitions: int,
multiprocessing: bool = False,
):
recording_dir = Path("outputs/audio_benchmark")
recording_dir.mkdir(parents=True, exist_ok=True)
# Create microphones
microphones = make_microphones_from_configs(microphones_configs)
# Connect microphones
for microphone in microphones.values():
microphone.connect()
all_audio_chunks = []
for i in range(repetitions):
print(f"Repetition {i + 1}/{repetitions}...")
# Create audio chunks
audio_chunks = {}
for microphone_key in microphones:
audio_chunks.update({microphone_key: []})
# Start recording
async_microphones_start_recording(
microphones,
output_files=[
recording_dir / f"{microphone_key}_recording_{i}.wav" for microphone_key in microphones
],
multiprocessing=multiprocessing,
)
# Record audio chunks
for j in range(audio_chunks_number):
precise_sleep(audio_chunks_duration)
for microphone_key, microphone in microphones.items():
audio_chunk = microphone.read()
print(f"{microphone_key} - repetition {i} - chunk {j} - samples {audio_chunk.shape[0]}")
audio_chunks[microphone_key].append(audio_chunk)
# Stop recording
async_microphones_stop_recording(microphones)
for microphone_key in microphones:
audio_chunks[microphone_key] = np.concatenate(audio_chunks[microphone_key], axis=0)
all_audio_chunks.append(audio_chunks)
# Disconnect microphones
for microphone in microphones.values():
microphone.disconnect()
# Compute statistics
cmap = plt.get_cmap("tab10")
_, ax = plt.subplots(nrows=repetitions, ncols=len(microphones))
chunk_length = np.zeros((repetitions, len(microphones)))
record_length = np.zeros((repetitions, len(microphones)))
for i in range(repetitions):
for j, (microphone_key, microphone) in enumerate(microphones.items()):
# Get recorded audio chunks
recorded_audio_chunks = all_audio_chunks[i][microphone_key]
# Load recorded file
recorded_data, _ = read(recording_dir / f"{microphone_key}_recording_{i}.wav")
if recorded_data.ndim == 1:
recorded_data = np.expand_dims(recorded_data, axis=1)
record_length[i, j] = recorded_data.shape[0]
chunk_length[i, j] = recorded_audio_chunks.shape[0]
for k, (chunk_data, record_data) in enumerate(
zip(recorded_audio_chunks.T, recorded_data.T, strict=False)
):
# Plot audio chunks and recorded data
ax[i, j].plot(
np.arange(0, len(chunk_data)) / microphone.sample_rate,
chunk_data,
label=f"audio chunks - channel {k}",
color=cmap(2 * k),
)
ax[i, j].plot(
np.arange(0, len(record_data)) / microphone.sample_rate,
record_data,
label=f"recorded data - channel {k}",
linestyle="dashed",
color=cmap(2 * k + 1),
)
# Plot absolute difference (errors should be located at the end of the recordings)
if recorded_data.shape[0] - recorded_audio_chunks.shape[0] > 0:
chunk_data = np.append(
chunk_data, np.zeros(int(recorded_data.shape[0] - recorded_audio_chunks.shape[0]))
)
else:
record_data = np.append(
record_data, np.zeros(int(-recorded_data.shape[0] + recorded_audio_chunks.shape[0]))
)
ax[i, j].plot(
np.arange(0, len(record_data)) / microphone.sample_rate,
np.abs(chunk_data - record_data),
label=f"differences - channel {k}",
color="red",
linestyle="dotted",
)
ax[i, j].set_title(f"{microphone_key} - repetition {i}")
ax[i, j].legend()
plt.show()
# Print statistics
differences = record_length - chunk_length
for i, (microphone_key, microphone) in enumerate(microphones.items()):
print(
f"Average recorded duration for {microphone_key} : {np.mean(record_length[:, i]) / microphone.sample_rate:.3f} seconds"
)
print(
f"Average chunk duration for {microphone_key} : {np.mean(chunk_length[:, i]) / microphone.sample_rate:.3f} seconds"
)
print(f"Average difference for {microphone_key} : {np.mean(differences[:, i]):.3f} samples")
print(
f"Average difference for {microphone_key} : {np.mean(differences[:, i]) / microphone.sample_rate:.3f} seconds"
)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"--microphones_indices",
type=int,
nargs="+",
default=[microphone["index"] for microphone in PortAudioMicrophone.find_microphones()],
)
parser.add_argument(
"--microphones_sample_rate",
type=float,
nargs="+",
default=[None] * len(PortAudioMicrophone.find_microphones()),
)
parser.add_argument(
"--microphones_channels",
type=int,
nargs="+",
default=[None] * len(PortAudioMicrophone.find_microphones()),
)
parser.add_argument("--audio_chunks_number", type=int, default=2)
parser.add_argument(
"--audio_chunks_duration",
type=float,
default=1.0,
)
parser.add_argument(
"--repetitions",
type=int,
default=2,
)
parser.add_argument(
"--multiprocessing",
action="store_true",
)
args = vars(parser.parse_args())
args["microphones_configs"] = {}
for index, sample_rate, channels in zip(
args["microphones_indices"],
args["microphones_sample_rate"],
args["microphones_channels"],
strict=False,
):
microphone_config = PortAudioMicrophoneConfig(
microphone_index=index,
sample_rate=sample_rate,
channels=channels,
)
args["microphones_configs"].update({f"microphone_{index}": microphone_config})
args.pop("microphones_indices")
args.pop("microphones_sample_rate")
args.pop("microphones_channels")
main(**args)

View File

@@ -0,0 +1,137 @@
#!/usr/bin/env python
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
from pathlib import Path
import numpy as np
import soundfile as sf
from lerobot.microphones.configs import MicrophoneConfig
from lerobot.microphones.touchlab import TouchLabSensorConfig
from lerobot.microphones.utils import (
async_microphones_start_recording,
async_microphones_stop_recording,
make_microphones_from_configs,
)
from lerobot.utils.robot_utils import (
precise_sleep,
)
def main(
sensors_configs: dict[str, MicrophoneConfig],
multiprocessing: bool = False,
):
recording_dir = Path("outputs/tactile_benchmark")
recording_dir.mkdir(parents=True, exist_ok=True)
# Create microphones
sensors = make_microphones_from_configs(sensors_configs)
# Connect microphones
for sensor in sensors.values():
sensor.connect()
# Create audio chunks
data_chunks = {}
for sensor_key in sensors:
data_chunks.update({sensor_key: []})
# Start recording
async_microphones_start_recording(
sensors,
output_files=[recording_dir / f"{sensor_key}_recording.wav" for sensor_key in sensors],
multiprocessing=multiprocessing,
)
# Record audio chunks
precise_sleep(10.0)
for sensor_key, sensor in sensors.items():
data_chunk = sensor.read()
print(f"{sensor_key} - samples {data_chunk.shape[0]}")
data_chunks[sensor_key].append(data_chunk)
# Stop recording
async_microphones_stop_recording(sensors)
for sensor_key in sensors:
data_chunks[sensor_key] = np.concatenate(data_chunks[sensor_key], axis=0)
# Disconnect microphones
for sensor in sensors.values():
sensor.disconnect()
for sensor_key in sensors:
data, sample_rate = sf.read(recording_dir / f"{sensor_key}_recording.wav")
print(f"{sensor_key} - samples {data.shape[0]}")
print(f"{sensor_key} - sample rate {sample_rate}")
print(f"{sensor_key} - data {data}")
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"--sensors_ports",
type=str,
nargs="+",
)
parser.add_argument(
"--sensors_baud_rate",
type=int,
nargs="+",
)
parser.add_argument(
"--sensors_sample_rate",
type=int,
nargs="+",
)
parser.add_argument(
"--sensors_channels",
type=int,
nargs="+",
)
parser.add_argument(
"--multiprocessing",
action="store_true",
)
args = vars(parser.parse_args())
args["sensors_configs"] = {}
for port, baud_rate, sample_rate, channels in zip(
args["sensors_ports"],
args["sensors_baud_rate"],
args["sensors_sample_rate"],
args["sensors_channels"],
strict=False,
):
if isinstance(channels, int):
channels = [channels]
sensor_config = TouchLabSensorConfig(
sensor_port=port,
baud_rate=baud_rate,
sample_rate=sample_rate,
channels=channels,
)
args["sensors_configs"].update({f"sensor_{port}": sensor_config})
args.pop("sensors_ports")
args.pop("sensors_baud_rate")
args.pop("sensors_sample_rate")
args.pop("sensors_channels")
main(**args)

View File

@@ -19,6 +19,8 @@
title: Multi GPU training
- local: peft_training
title: Training with PEFT (e.g., LoRA)
- local: rename_map
title: Using Rename Map and Empty Cameras
title: "Tutorials"
- sections:
- local: lerobot-dataset-v3
@@ -47,6 +49,8 @@
title: NVIDIA GR00T N1.5
- local: xvla
title: X-VLA
- local: multi_task_dit
title: Multitask DiT Policy
- local: walloss
title: WALL-OSS
title: "Policies"
@@ -83,6 +87,8 @@
title: Processors for Robots and Teleoperators
- local: env_processor
title: Environment Processors
- local: action_representations
title: Action Representations
title: "Robot Processors"
- sections:
- local: so101

View File

@@ -0,0 +1,223 @@
# Action Representations
This guide explains the different ways robot actions can be represented in LeRobot, how they relate to each other, and when to use each one.
## Joint Space vs End-Effector Space
Before discussing action representations, it helps to understand the two coordinate spaces actions can live in.
### Joint Space
Joint-space actions directly specify target positions for each motor. For a 6-DOF arm with a gripper, a joint-space action might look like:
```
action = [shoulder_pan: 45.0, shoulder_lift: -20.0, elbow: -30.0, wrist_pitch: 10.0, wrist_roll: 0.0, wrist_yaw: 5.0, gripper: 0.8]
```
Joint space is the default in LeRobot. It is simple, requires no kinematics model, and maps directly to motor commands. Most beginner setups (SO-100, Koch) use joint-space actions.
### End-Effector (EE) Space
End-effector-space actions specify the desired position and orientation of the robot's tool tip (gripper) in Cartesian coordinates:
```
action = [x: 0.25, y: -0.10, z: 0.15, wx: 0.0, wy: 0.0, wz: 0.1, gripper: 0.8]
```
EE space is more intuitive for tasks like pick-and-place because it directly describes where the gripper should go, but it requires a kinematics model (URDF) to convert between EE poses and joint angles.
### Converting Between Spaces
LeRobot provides processor steps for converting between joint and EE spaces using forward and inverse kinematics. These are built on top of `RobotKinematics`, which loads a URDF model of your robot.
```python
from lerobot.model.kinematics import RobotKinematics
from lerobot.robots.so_follower.robot_kinematic_processor import (
ForwardKinematicsJointsToEE,
InverseKinematicsEEToJoints,
)
kinematics = RobotKinematics(
urdf_path="./SO101/so101_new_calib.urdf",
target_frame_name="gripper_frame_link",
joint_names=["shoulder", "elbow", "wrist_pitch", "wrist_roll", "wrist_yaw"],
)
# Joints → EE (for observations: "where is my gripper?")
fk_step = ForwardKinematicsJointsToEE(kinematics=kinematics, motor_names=[...])
# EE → Joints (for actions: "move my gripper here")
ik_step = InverseKinematicsEEToJoints(kinematics=kinematics, motor_names=[...])
```
See [`examples/so100_to_so100_EE/`](https://github.com/huggingface/lerobot/tree/main/examples/so100_to_so100_EE) for a complete working example of recording, replaying, and evaluating with EE-space actions on an SO-100 arm.
## Absolute, Relative, and Delta Actions
Regardless of whether you work in joint space or EE space, the action values can be expressed in three different ways. The terminology follows [UMI (Chi et al., 2024)](https://arxiv.org/abs/2402.10329).
### Absolute Actions (LeRobot default)
Each action specifies the target position directly.
**Example** (joint space, chunk of 4):
```
current_state = [45.0, -30.0, 10.0]
action_chunk = [
[46.0, -29.0, 11.0], # go to 46, -29, 11
[47.5, -27.0, 12.0], # go to 47.5, -27, 12
[49.0, -25.0, 13.5], # go to 49, -25, 13.5
[50.0, -24.0, 15.0], # go to 50, -24, 15
]
```
Each value is a target position in the robot's coordinate frame. Simple and direct, but requires a consistent global coordinate frame. This is the default in LeRobot.
### Relative Actions (used by OpenPI / pi0)
Each action in the chunk is an offset from the **current state at the moment of prediction**. All actions in the chunk share the same reference point:
```
current_state = [45.0, -30.0, 10.0]
relative_chunk = [
[1.0, 1.0, 1.0], # +1 from current → target 46, -29, 11
[2.5, 3.0, 2.0], # +2.5 from current → target 47.5, -27, 12
[4.0, 5.0, 3.5], # +4 from current → target 49, -25, 13.5
[5.0, 6.0, 5.0], # +5 from current → target 50, -24, 15
]
```
The conversion is straightforward: `relative = absolute - current_state`. To recover absolute: `absolute = relative + current_state`.
**Why use relative actions?** The model learns to predict offsets centered around zero, which is easier to normalize and leads to more stable training. Because every chunk references the same current state, there is no error accumulation across chunks.
### Delta Actions (sequential differences)
Each action is an offset from the **previous action** (or from the current state for the first step):
```
current_state = [45.0, -30.0, 10.0]
delta_chunk = [
[1.0, 1.0, 1.0], # current → 46, -29, 11
[1.5, 2.0, 1.0], # previous action → 47.5, -27, 12
[1.5, 2.0, 1.5], # previous action → 49, -25, 13.5
[1.0, 1.0, 1.5], # previous action → 50, -24, 15
]
```
Here each step is relative to the one before it. To recover absolute positions you must sum all previous deltas, which means errors accumulate over time. UMI explicitly argues against this representation for this reason.
### Visual Comparison
The figure below (based on a figure from [UMI, Chi et al., 2024](https://arxiv.org/abs/2402.10329)) illustrates the key difference. With **relative trajectory**, every action in the chunk points back to the same origin (current state), so a new inference step cleanly resets the reference. With **delta**, each action depends on the previous one, so errors accumulate. **Absolute** actions require a consistent global coordinate frame.
<img
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/action_representations_umi.png"
alt="Relative Trajectory as Action Representation (UMI, Chi et al., 2024)"
width="85%"
/>
## Using Relative Actions in LeRobot
LeRobot provides `RelativeActionsProcessorStep` to convert between absolute and relative actions inside the processor pipeline. This is how pi0, pi0.5, and pi0_fast support relative actions.
> **Note:** All pi models (pi0, pi0.5, pi0*fast) apply relative conversion \_before* normalization (`relative → normalize`), so the normalizer always sees delta (relative) values. This means **relative action stats are required** for all of them when training with `use_relative_actions=true`. In pi0_fast the `RelativeActionsProcessorStep` only modifies the action — the state observation is unchanged — so `NormalizerProcessorStep` still runs before the state tokenizer and the tokenizer continues to receive normalized state as expected.
### How it works
During **training** (preprocessing), actions are converted from absolute to relative before the model sees them:
```
raw absolute action → RelativeActionsProcessorStep → normalize → model
```
During **inference** (postprocessing), model predictions are converted back to absolute before being sent to the robot:
```
model output → unnormalize → AbsoluteActionsProcessorStep → robot
```
The `AbsoluteActionsProcessorStep` reads the cached current state from its paired `RelativeActionsProcessorStep`, so the two must be wired together (handled automatically by the policy factory).
### Enabling relative actions for the pi family (pi0, pi0.5, pi0_fast)
**Step 1**: Precompute relative action statistics for your dataset:
```bash
lerobot-edit-dataset \
--repo_id your_dataset \
--operation.type recompute_stats \
--operation.relative_action true \
--operation.chunk_size 50 \
--operation.relative_exclude_joints "['gripper']"
```
**Step 2**: Train with relative actions enabled:
```bash
lerobot-train \
--dataset.repo_id=your_dataset \
--policy.type=pi0 \
--policy.use_relative_actions=true \
--policy.relative_exclude_joints='["gripper"]'
```
The `relative_exclude_joints` parameter specifies joints that should remain in absolute space. For example, gripper commands are typically binary (open/close) and don't benefit from relative encoding.
### Combining relative actions with RTC
[RTC](https://arxiv.org/abs/2506.07339) runs policy inference at high frequency and sends actions to the robot as they are predicted rather than waiting for a full chunk. Relative actions and RTC are fully compatible: because every chunk in relative mode references the **same** current state (captured at the start of inference), each predicted action in the chunk remains a valid offset even if the robot has already moved. No special handling is needed — `RelativeActionsProcessorStep` caches the state once per inference call and `AbsoluteActionsProcessorStep` applies it to every action in the streamed output.
### Combining relative actions with EE space
Relative actions work in both joint space and EE space. For example, if your dataset stores EE actions, relative encoding converts them to offsets from the current EE pose:
```
current_ee_state = [x: 0.25, y: -0.10, z: 0.15, gripper: 0.8]
absolute_ee_chunk = [
[0.26, -0.09, 0.16, 0.8],
[0.28, -0.07, 0.18, 0.8],
]
relative_ee_chunk = [
[0.01, 0.01, 0.01, 0.0], # offset from current EE pose
[0.03, 0.03, 0.03, 0.0], # offset from current EE pose
]
```
## Processing Pipeline Summary
Here is how the different processors compose. Each arrow is a processor step, and they can be chained in a `RobotProcessorPipeline` or `PolicyProcessorPipeline`:
```
┌─────────────────────────────────────────┐
Action Space │ Joint Space ←──IK──→ EE Space │
│ ForwardKinematicsJointsToEE │
│ InverseKinematicsEEToJoints │
└─────────────────────────────────────────┘
┌─────────────────────────────────────────┐
Representation │ Absolute ←────→ Relative │
│ RelativeActionsProcessorStep (pre) │
│ AbsoluteActionsProcessorStep (post) │
└─────────────────────────────────────────┘
┌─────────────────────────────────────────┐
Normalization │ Raw ←────→ Normalized │
│ NormalizerProcessorStep (pre) │
│ UnnormalizerProcessorStep (post) │
└─────────────────────────────────────────┘
```
A typical training preprocessor might chain: `raw absolute joint actions → relative → normalize`. A typical inference postprocessor: `unnormalize → absolute → (optionally IK to joints)`.
## References
- [Universal Manipulation Interface (UMI)](https://arxiv.org/abs/2402.10329) - Chi et al., 2024. Defines the relative trajectory action representation and compares it with absolute and delta actions.
- [Introduction to Processors](./introduction_processors) - How processor pipelines work in LeRobot.
- [`examples/so100_to_so100_EE/`](https://github.com/huggingface/lerobot/tree/main/examples/so100_to_so100_EE) - Complete example of recording and evaluating with EE-space actions.

View File

@@ -310,4 +310,4 @@ Asynchronous inference represents a significant advancement in real-time robotic
- **Universal Compatibility**: Works with all LeRobot-supported policies, from lightweight ACT models to vision-language models like SmolVLA
Start experimenting with the default parameters, monitor your action queue sizes, and iteratively refine your setup to achieve optimal performance for your specific use case.
If you want to discuss this further, hop into our [Discord community](https://discord.gg/s3KuuzsPFb), or open an issue on our [GitHub repository](https://github.com/lerobot/lerobot/issues).
If you want to discuss this further, hop into our [Discord community](https://discord.gg/s3KuuzsPFb), or open an issue on our [GitHub repository](https://github.com/huggingface/lerobot/issues).

View File

@@ -41,13 +41,15 @@ requires = # your-build-system
## Step 2: Define the Policy Configuration
Create a configuration class that inherits from `PreTrainedConfig` and registers your policy type:
Create a configuration class that inherits from [`PreTrainedConfig`](https://github.com/huggingface/lerobot/blob/main/src/lerobot/configs/policies.py) and registers your policy type:
Here is a template to get you started, customize the parameters and methods as needed for your policy's architecture and training requirements.
```python
# configuration_my_custom_policy.py
from dataclasses import dataclass, field
from lerobot.configs.policies import PreTrainedConfig
from lerobot.configs.types import NormalizationMode
from lerobot.optim.optimizers import AdamWConfig
from lerobot.optim.schedulers import CosineDecayWithWarmupSchedulerConfig
@PreTrainedConfig.register_subclass("my_custom_policy")
@dataclass
@@ -61,22 +63,56 @@ class MyCustomPolicyConfig(PreTrainedConfig):
hidden_dim: Hidden dimension for the policy network
# Add your policy-specific parameters here
"""
# ...PreTrainedConfig fields...
pass
horizon: int = 50
n_action_steps: int = 50
hidden_dim: int = 256
optimizer_lr: float = 1e-4
optimizer_weight_decay: float = 1e-4
def __post_init__(self):
super().__post_init__()
# Add any validation logic here
if self.n_action_steps > self.horizon:
raise ValueError("n_action_steps cannot exceed horizon")
def validate_features(self) -> None:
"""Validate input/output feature compatibility."""
# Implement validation logic for your policy's requirements
pass
if not self.image_features:
raise ValueError("MyCustomPolicy requires at least one image feature.")
if self.action_feature is None:
raise ValueError("MyCustomPolicy requires 'action' in output_features.")
def get_optimizer_preset(self) -> AdamWConfig:
return AdamWConfig(lr=self.optimizer_lr, weight_decay=self.optimizer_weight_decay)
def get_scheduler_preset(self):
return None
@property
def observation_delta_indices(self) -> list[int] | None:
"""Relative timestep offsets the dataset loader provides per observation.
Return `None` for single-frame policies. For temporal policies that consume
multiple past or future frames, return a list of offsets, e.g. `[-20, -10, 0, 10]` for
3 past frames at stride 10 and 1 future frame at stride 10.
"""
return None
@property
def action_delta_indices(self) -> list[int]:
"""Relative timestep offsets for the action chunk the dataset loader returns.
"""
return list(range(self.horizon))
@property
def reward_delta_indices(self) -> None:
return None
```
## Step 3: Implement the Policy Class
Create your policy implementation by inheriting from LeRobot's base `PreTrainedPolicy` class:
Create your policy implementation by inheriting from [`PreTrainedPolicy`](https://github.com/huggingface/lerobot/blob/main/src/lerobot/policies/pretrained.py):
```python
# modeling_my_custom_policy.py
@@ -85,38 +121,74 @@ import torch.nn as nn
from typing import Any
from lerobot.policies.pretrained import PreTrainedPolicy
from lerobot.utils.constants import ACTION
from .configuration_my_custom_policy import MyCustomPolicyConfig
class MyCustomPolicy(PreTrainedPolicy):
config_class = MyCustomPolicyConfig
config_class = MyCustomPolicyConfig # must match the string in @register_subclass
name = "my_custom_policy"
def __init__(self, config: MyCustomPolicyConfig, dataset_stats: dict[str, Any] = None):
super().__init__(config, dataset_stats)
config.validate_features() # not called automatically by the base class
self.config = config
self.model = ... # your nn.Module here
def reset(self):
"""Reset episode state."""
...
def get_optim_params(self) -> dict:
"""Return parameters to pass to the optimizer (e.g. with per-group lr/wd)."""
return {"params": self.parameters()}
def predict_action_chunk(self, batch: dict[str, torch.Tensor], **kwargs) -> torch.Tensor:
"""Return the full action chunk (B, chunk_size, action_dim) for the current observation."""
...
def select_action(self, batch: dict[str, torch.Tensor], **kwargs) -> torch.Tensor:
"""Return a single action for the current timestep (called at inference)."""
...
def forward(self, batch: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
"""Compute the training loss.
`batch["action_is_pad"]` is a bool mask of shape (B, horizon) that marks
timesteps padded because the episode ended before `horizon` steps, you
can exclude those from your loss.
"""
actions = batch[ACTION]
action_is_pad = batch.get("action_is_pad")
...
return {"loss": ...}
```
## Step 4: Add Data Processors
Create processor functions:
Create processor functions. For a concrete reference, see [processor_act.py](https://github.com/huggingface/lerobot/blob/main/src/lerobot/policies/act/processor_act.py) or [processor_diffusion.py](https://github.com/huggingface/lerobot/blob/main/src/lerobot/policies/diffusion/processor_diffusion.py).
```python
# processor_my_custom_policy.py
from typing import Any
import torch
from lerobot.processor import PolicyAction, PolicyProcessorPipeline
def make_my_custom_policy_pre_post_processors(
config,
dataset_stats: dict[str, dict[str, torch.Tensor]] | None = None,
) -> tuple[
PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
PolicyProcessorPipeline[PolicyAction, PolicyAction],
]:
"""Create preprocessing and postprocessing functions for your policy."""
pass # Define your preprocessing and postprocessing logic here
preprocessor = ... # build your PolicyProcessorPipeline for inputs
postprocessor = ... # build your PolicyProcessorPipeline for outputs
return preprocessor, postprocessor
```
**Important - function naming:** LeRobot discovers your processor by name. The function **must** be called `make_{policy_name}_pre_post_processors` (matching the string you passed to `@PreTrainedConfig.register_subclass`).
## Step 5: Package Initialization
Expose your classes in the package's `__init__.py`:

View File

@@ -204,22 +204,26 @@ Replace `your_username/dataset_name` with your Hugging Face username and a name
Your dataset includes:
**Your Actions (2 things)**:
**Your Actions (2 features)**:
- How much you moved forward/backward
- How much you turned left/right
- `linear_velocity`: How much you moved forward/backward
- `angular_velocity`: How much you turned left/right
**Robot Observations (12 things)**:
**Robot Observations (24 features)**:
- Front camera video
- Rear camera video
- Current speed
- Battery level
- Which way the robot is facing
- GPS location (latitude, longitude, signal strength)
- Orientation
- GPS (latitude, longitude, signal strength)
- Network signal strength
- Vibration level
- Lamp status (on/off)
- Lamp state (on/off)
- Accelerometer (x, y, z)
- Gyroscope (x, y, z)
- Magnetometer (x, y, z)
- Wheel RPMs (4 wheels)
### Where Your Data Goes

View File

@@ -424,7 +424,7 @@ robot = SO100Follower(robot_config)
robot.connect()
dataset = LeRobotDataset("<hf_username>/<dataset_repo_id>", episodes=[episode_idx])
actions = dataset.hf_dataset.select_columns("action")
actions = dataset.select_columns("action")
log_say(f"Replaying episode {episode_idx}")
for idx in range(dataset.num_frames):

View File

@@ -0,0 +1,340 @@
# Multitask DiT Policy
Multitask Diffusion Transformer (DiT) Policy is an evolution of the original Diffusion Policy architecture, which leverages a large DiT with text and vision conditioning for multitask robot learning. This implementation supports both diffusion and flow matching objectives for action generation, enabling robots to perform diverse manipulation tasks conditioned on language instructions.
## Model Overview
The model uses:
- **CLIP Vision Encoder**: Processes RGB images from multiple camera views
- **CLIP Text Encoder**: Encodes language task instructions (frozen weights with learnable projection)
- **Diffusion Transformer**: Predicts action sequences conditioned on observations and language
- **Two Objectives**: Supports both diffusion (DDPM/DDIM) and flow matching for action generation
This model is exciting because you can achieve extremely high dexterity, competitive with multi-billion parameter
VLAs, with only ~450M parameters and significantly less training.
## Installation Requirements
Multitask DiT Policy has additional dependencies. Install it with:
```bash
pip install lerobot[multi_task_dit]
```
This will install all necessary dependencies including the HuggingFace Transformers library for CLIP models.
## Usage
To use Multitask DiT in your LeRobot configuration, specify the policy type as:
```python
policy.type=multi_task_dit
```
## Training
### Basic Training Command
Here's a complete training command for training Multitask DiT on your dataset:
```bash
lerobot-train \
--dataset.repo_id=YOUR_DATASET \
--output_dir=./outputs/multitask_dit_training \
--batch_size=32 \
--steps=5000 \
--save_freq=500 \
--log_freq=100 \
--policy.type=multi_task_dit \
--policy.device=cuda \
--policy.repo_id="HF_USER/multitask-dit-your-robot" \
--wandb.enable=true
```
### Recommended Hyperparameters and Dataset Details (30Hz Control Frequency)
For reliable performance, start with these suggested default hyperparameters:
```bash
lerobot-train \
--dataset.repo_id=YOUR_DATASET \
--output_dir=./outputs/mutitask_dit_training \
--batch_size=320 \
--steps=30000 \
--policy.type=multi_task_dit \
--policy.device=cuda \
--policy.horizon=32 \
--policy.n_action_steps=24 \
--policy.objective=diffusion \
--policy.noise_scheduler_type=DDPM \
--policy.num_train_timesteps=100 \
--policy.repo_id="HF_USER/multitask-dit-your-robot" \
--wandb.enable=true
```
**Key Parameters:**
- **Batch Size**: 192-320 - If you have access to a GPU that can support this, you will get the best training dynamics
- **Horizon**: 32 - number of action steps to predict, ~1.0 sec at 30Hz
- **n_action_steps**: 24 - ~0.8 seconds at 30Hz
- **Objective**: `diffusion` - start with diffusion and experiment with flow matching if generation quality is poor
- **Training Steps**: >30k steps recommended for a single task
### Training Configuration Parameters
#### Objective Selection
Choose between diffusion and flow matching:
```bash
# Diffusion objective (default)
--policy.objective=diffusion \
--policy.noise_scheduler_type=DDPM \ # or "DDIM"
--policy.num_train_timesteps=100 \
--policy.num_inference_steps=10 \ # For faster inference
--policy.beta_schedule=squaredcos_cap_v2 \ # Noise schedule type
--policy.prediction_type=epsilon \ # "epsilon" (predict noise) or "sample" (predict clean)
--policy.clip_sample=true \ # Clip samples during denoising
--policy.clip_sample_range=1.0 # Clipping range [-x, x]
# Flow matching objective
--policy.objective=flow_matching \
--policy.timestep_sampling_strategy=beta \ # or "uniform" | the beta sampling strategy performance appears much better in practice
--policy.num_integration_steps=100 \
--policy.integration_method=euler \ # or "rk4"
--policy.sigma_min=0.0 # Minimum noise in flow interpolation path
```
#### Transformer Architecture
Adjust model capacity based on dataset size:
```bash
# Small datasets (< 100 examples)
--policy.num_layers=4 \
--policy.hidden_dim=512 \
--policy.num_heads=8 # should ideally be hidden_dim // 64
# Medium datasets (100-5k examples) - default
--policy.num_layers=6 \
--policy.hidden_dim=512 \
--policy.num_heads=8 # should ideally be hidden_dim // 64
# Large datasets (> 5k examples)
--policy.num_layers=8 \
--policy.hidden_dim=512 \
--policy.num_heads=8 # should ideally be hidden_dim // 64
```
**Positional Encoding Options:**
The model supports two positional encoding methods for action sequences:
```bash
# Rotary Position Embedding (RoPE) - default, recommended
--policy.use_rope=true \
--policy.rope_base=10000.0 # Base frequency for RoPE
# Absolute positional encoding
--policy.use_positional_encoding=true # Disables RoPE when true
```
**Other Transformer Parameters:**
```bash
--policy.dropout=0.1 # Dropout rate for DiT blocks (0.0-1.0)
--policy.timestep_embed_dim=256 # Timestep embedding dimension
```
#### Vision Encoder Configuration
```bash
# Use different CLIP model for more expressivity at the cost of inference time
# experiment with larger or smaller models depending on the complexity of your tasks and size of dataset
--policy.vision_encoder_name=openai/clip-vit-large-patch14
# Use separate vision encoder per camera
# This may be useful when cameras have significantly different characteristics, but
# be wary of increased VRAM footprint.
--policy.use_separate_rgb_encoder_per_camera=true
# Image preprocessing
--policy.image_resize_shape=[XXX,YYY] \ # you may need to resize your images for inference speed ups
--policy.image_crop_shape=[224,224] \
--policy.image_crop_is_random=true # Random during training, center at inference
```
#### Text Encoder Configuration
```bash
# Use different CLIP text encoder model
# same as vision: experiment with larger or smaller models depending on the
# complexity of your tasks and size of dataset
--policy.text_encoder_name=openai/clip-vit-large-patch14
```
#### Learning Rate Configuration
The vision encoder uses a separate learning rate multiplier, where 1/10th is suggested to be the ideal staritng point:
```bash
--policy.optimizer_lr=2e-5 \
--policy.vision_encoder_lr_multiplier=0.1 # Vision encoder LR = 0.1 * optimizer_lr
```
### Training Tuning Guidelines
#### 1. Flow Matching with Beta Sampling
The original diffusion implementation here is based on the work described in [TRI's LBM paper](https://arxiv.org/abs/2507.05331)
Additionally, we have implemented a flow-matching objective, which is described at a high-level in [Boston Dynamics blog post](https://bostondynamics.com/blog/large-behavior-models-atlas-find-new-footing/).
Consider testing the flow-matching objective and evaluating performance differences for your task:
```bash
--policy.objective=flow_matching \
--policy.timestep_sampling_strategy=beta \
--policy.timestep_sampling_alpha=1.5 \
--policy.timestep_sampling_beta=1.0 \
--policy.timestep_sampling_s=0.999
```
This hasn't been shown to be a silver bullet across every user case, but it occasionally results in smoother and more consistent actions.
#### 2. Number of Transformer Layers
Match model capacity to your dataset size:
- **Small datasets** (< 100 examples): Reduce to 4 layers
- **Large datasets** (> 5k examples): Increase to 8 layers
#### 3. `horizon` Tuning
The model can be sensitive to the horizon you choose. Start with around a 1 second horizon based on your control frequency:
- **30 Hz frequency**: `horizon=30`
- **10 Hz frequency**: `horizon=10`
Then experiment with increasing from there. The horizon determines how far into the future the model predicts actions.
#### 4. `n_action_steps` Sensitivity
The model can also be very sensitive to `n_action_steps`. Start with it being around 0.8 seconds based on your control frequency and tune from there:
- **Lower values**: More reactive but potentially less stable for long-horizon tasks
- **Higher values**: Better for long-horizon execution but open-loop failures are limited in their recovery
### Inference Tuning
For faster inference, use DDIM with fewer sampling steps:
```bash
--policy.noise_scheduler_type=DDIM \
--policy.num_inference_steps=10
```
### Resuming Training
To resume training from a checkpoint:
```bash
lerobot-train \
--config_path=./outputs/mutitask_dit_training/checkpoints/last/pretrained_model/train_config.json \
--resume=true
```
The checkpoint directory should contain `model.safetensors` and `config.json` files (saved automatically during training). When resuming, the configuration is loaded from the checkpoint, so you don't need to specify other parameters.
## Common Failure Modes and Debugging
Training these models can be finicky. Here are common failure modes and debugging approaches:
### Idling / No Motion
The model may "collapse" during inference, resulting in static or no motion. This can occur when:
1. **Insufficient training data**: If you only have 20-50 examples, try to roughly double your dataset size. Once you have above 300 examples, if you're still seeing this, the task may be too complex.
2. **Multiple similar tasks**: When your dataset contains multiple similar tasks (e.g., picking up 2 different objects), the model may rely too heavily on language conditioning which might not be rich enough.
**Debugging tips:**
- Increase dataset size (double until you get to over 300 examples)
- Train for longer, up to 100k steps, even when the loss flatlines
- Check if the model is receiving proper language instructions or increase diversity of instruction
### Executing the Wrong Task
Sometimes the robot will completely ignore your instruction and perform some other task. This generally only happens if you have trained on multiple tasks.
**Potential causes:**
- Language instruction ambiguity
- Insufficient task-specific training data
- Model confusion between similar tasks in the multitask dataset
**Debugging tips:**
- Verify language instruction specificity, especially if descriptions are similar between multiple tasks
- Check task distribution in your training dataset and add weighting to the failing/ignored task
- Consider task-specific fine-tuning
### Training Instability
If training loss is unstable or diverging:
- Try adjusting learning rate between `1e-5` and `3e-4`
- Increase batch size if possible
- Check that your dataset normalization is correct
- Verify image preprocessing is working correctly
## Performance Considerations
### GPU Requirements
- **Inference**: At least an RTX 5070 Ti (or equivalent GPU) is recommended for reasonable speed performance
- **Training**: A GPU with enough VRAM to load batch sizes of >64 is ideal, which will vary depending on the number of image observations, etc
### Batch Size Recommendations
- **Minimum**: 64 (less than this may result in unstable training)
- **Recommended**: 256-320 (best performance, requires larger GPU)
## Example: Training on Custom Dataset
Here's a complete example training on a custom dataset:
```bash
lerobot-train \
--dataset.repo_id=YOUR_DATASET \
--output_dir=./outputs/mutitask_dit_training \
--batch_size=320 \
--steps=30000 \
--save_freq=1000 \
--log_freq=100 \
--eval_freq=1000 \
--policy.type=multi_task_dit \
--policy.device=cuda \
--policy.horizon=32 \
--policy.n_action_steps=24 \
--policy.objective=diffusion \
--policy.noise_scheduler_type=DDPM \
--policy.num_layers=6 \
--policy.hidden_dim=512 \
--policy.vision_encoder_name=openai/clip-vit-base-patch16 \
--policy.image_resize_shape=[320,240] \
--policy.image_crop_shape=[224,224] \
--policy.repo_id="HF_USER/multitask-dit-your-robot" \
--wandb.enable=true \
--wandb.project=multitask_dit
```
## References
For more details on the technical implementation and architecture, see:
- [A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation](https://arxiv.org/abs/2507.05331)
- [Large Behavior Models and Atlas Find New Footing](https://bostondynamics.com/blog/large-behavior-models-atlas-find-new-footing/)
- [Dissecting and Open-Sourcing Multitask Diffusion Transformer Policy](https://brysonkjones.substack.com/p/dissecting-and-open-sourcing-multitask-diffusion-transformer-policy)

View File

@@ -91,6 +91,46 @@ lerobot-train \
**💡 Tip**: Setting `train_expert_only=true` freezes the VLM and trains only the action expert and projections, allowing finetuning with reduced memory usage.
## Relative Actions
By default, π₀ predicts absolute actions. You can enable **relative actions** so the model predicts offsets relative to the current robot state. This can improve training stability for certain setups.
To use relative actions, first recompute your dataset stats in relative space via the CLI:
```bash
lerobot-edit-dataset \
--repo_id your_dataset \
--operation.type recompute_stats \
--operation.relative_action true \
--operation.chunk_size 50 \
--operation.relative_exclude_joints "['gripper']" \
--push_to_hub true
```
Or equivalently in Python:
```python
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.datasets.dataset_tools import recompute_stats
dataset = LeRobotDataset("your_dataset")
recompute_stats(dataset, relative_action=True, chunk_size=50, relative_exclude_joints=["gripper"])
dataset.push_to_hub()
```
The `chunk_size` should match your policy's `chunk_size` (default 50 for π₀). `relative_exclude_joints` lists joint names that should remain in absolute space (e.g. gripper commands). Use `--push_to_hub true` to upload the updated stats to the Hub.
Then train with relative actions enabled:
```bash
lerobot-train \
--dataset.repo_id=your_dataset \
--policy.type=pi0 \
--policy.use_relative_actions=true \
--policy.relative_exclude_joints='["gripper"]' \
...
```
## License
This model follows the **Apache 2.0 License**, consistent with the original [OpenPI repository](https://github.com/Physical-Intelligence/openpi).

View File

@@ -97,6 +97,46 @@ python src/lerobot/datasets/v30/augment_dataset_quantile_stats.py \
Or train pi05 with this normalization mapping: `--policy.normalization_mapping='{"ACTION": "MEAN_STD", "STATE": "MEAN_STD", "VISUAL": "IDENTITY"}'`
## Relative Actions
By default, π₀.₅ predicts absolute actions. You can enable **relative actions** so the model predicts offsets relative to the current robot state. This can improve training stability for certain setups.
To use relative actions, first recompute your dataset stats in relative space via the CLI:
```bash
lerobot-edit-dataset \
--repo_id your_dataset \
--operation.type recompute_stats \
--operation.relative_action true \
--operation.chunk_size 50 \
--operation.relative_exclude_joints "['gripper']" \
--push_to_hub true
```
Or equivalently in Python:
```python
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.datasets.dataset_tools import recompute_stats
dataset = LeRobotDataset("your_dataset")
recompute_stats(dataset, relative_action=True, chunk_size=50, relative_exclude_joints=["gripper"])
dataset.push_to_hub()
```
The `chunk_size` should match your policy's `chunk_size` (default 50 for π₀.₅). `relative_exclude_joints` lists joint names that should remain in absolute space (e.g. gripper commands). Use `--push_to_hub true` to upload the updated stats to the Hub.
Then train with relative actions enabled:
```bash
lerobot-train \
--dataset.repo_id=your_dataset \
--policy.type=pi05 \
--policy.use_relative_actions=true \
--policy.relative_exclude_joints='["gripper"]' \
...
```
## Performance Results
### Libero Benchmark Results

View File

@@ -0,0 +1,37 @@
# Multitask DiT Policy
## Citation
If you use this work, please cite the following works:
```bibtex
@misc{jones2025multitaskditpolicy,
author = {Bryson Jones},
title = {Dissecting and Open-Sourcing Multitask Diffusion Transformer Policy},
year = {2025},
url = {https://brysonkjones.substack.com/p/dissecting-and-open-sourcing-multitask-diffusion-transformer-policy},
note = {Blog post}
}
```
```bibtex
@misc{trilbmteam2025carefulexaminationlargebehaviormodels,
author = {TRI LBM Team},
title = {A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation},
year = {2025},
eprint = {arXiv:2507.05331},
archivePrefix = {arXiv},
primaryClass = {cs.RO},
url = {https://arxiv.org/abs/2507.05331}
}
```
```bibtex
@misc{bostondynamics2025largebehaviormodelsatlas,
author = {Boston Dynamics and TRI Research Team},
title = {Large Behavior Models and Atlas Find New Footing},
year = {2025},
url = {https://bostondynamics.com/blog/large-behavior-models-atlas-find-new-footing/},
note = {Blog post}
}
```

114
docs/source/rename_map.mdx Normal file
View File

@@ -0,0 +1,114 @@
# Rename Map and Empty Cameras
When you train, evaluate, or record with a robot policy, your **dataset** or **environment** provides observations under one set of keys (e.g. `observation.images.front`, `observation.images.eagle`), while your **policy** expects another (e.g. `observation.images.image`, `observation.images.image2`). The **rename map** bridges that gap without changing the policy or data source.
> **Scope:** The rename map only renames **observation** keys (images and state). Action keys are not affected.
## Why observation keys don't always match
Policies have a fixed set of **input feature names** baked into their pretrained config. For example:
- [pi0fast-libero](https://huggingface.co/lerobot/pi0fast-libero) expects `observation.images.base_0_rgb` and `observation.images.left_wrist_0_rgb`.
- [xvla-base](https://huggingface.co/lerobot/xvla-base) expects `observation.images.image`, `observation.images.image2`, and `observation.images.image3`.
Your dataset might use different names entirely (e.g. `observation.images.front`, `observation.images.eagle`, `observation.images.glove`), and your eval environment might use yet another set. Rather than editing the policy config or renaming columns in the dataset, you pass a **rename map**: a JSON dictionary that maps source keys to the keys the policy expects. Renaming happens inside the preprocessor pipeline, so the policy always sees its expected keys.
## Using the rename map
Pass the mapping as a JSON string on the command line. The convention is always:
```
--rename_map='{"source_key": "policy_key", ...}'
```
where **source_key** is what the dataset or environment provides, and **policy_key** is what the policy expects.
Only listed keys are renamed; everything else passes through unchanged. Order of entries doesn't matter.
Supported policies: **PI0**, **PI05**, **PI0Fast**, **SmolVLA**, and **XVLA**.
### Training
Suppose you fine-tune [lerobot/xvla-base](https://huggingface.co/lerobot/xvla-base) on a dataset with images under `observation.images.front`, `observation.images.eagle`, and `observation.images.glove`. XVLA expects `observation.images.image`, `observation.images.image2`, and `observation.images.image3`:
```bash
lerobot-train \
--dataset.repo_id=YOUR_DATASET \
--output_dir=./outputs/xvla_training \
--job_name=xvla_training \
--policy.path="lerobot/xvla-base" \
--policy.repo_id="HF_USER/xvla-your-robot" \
--policy.dtype=bfloat16 \
--policy.action_mode=auto \
--steps=20000 \
--policy.device=cuda \
--policy.freeze_vision_encoder=false \
--policy.freeze_language_encoder=false \
--policy.train_policy_transformer=true \
--policy.train_soft_prompts=true \
--rename_map='{"observation.images.front": "observation.images.image", "observation.images.eagle": "observation.images.image2", "observation.images.glove": "observation.images.image3"}'
```
### Evaluation
A policy that expects `observation.images.base_0_rgb` and `observation.images.left_wrist_0_rgb` (e.g. [pi0fast-libero](https://huggingface.co/lerobot/pi0fast-libero)), but the LIBERO environment returns `observation.images.image` and `observation.images.image2`:
```bash
lerobot-eval \
--policy.path=lerobot/pi0fast-libero \
--env.type=libero \
... \
--rename_map='{"observation.images.image": "observation.images.base_0_rgb", "observation.images.image2": "observation.images.left_wrist_0_rgb"}'
```
### Recording
`lerobot-record` also supports rename maps, nested under the dataset config:
```bash
lerobot-record \ # When running inference
--policy.path="<user>/smolVLA_finetuned" \
... \
--dataset.rename_map='{"observation.images.glove2": "observation.images.image"}'
```
## Alternative: edit the policy config directly
If you always use the same dataset or environment, you can **edit the policy's `config.json`** so its observation keys match your data source. Then no rename map is needed.
The tradeoff: modifying the policy config ties it to one data source. A rename map keeps one policy usable across many datasets and environments.
## Empty cameras: fewer views than the policy expects
Some policies are built for a fixed number of image inputs. If your dataset has fewer cameras, you can set **`empty_cameras`** in the policy config instead of modifying the model architecture.
### How it works
Setting `empty_cameras=N` adds N placeholder image features to the policy config, named:
```
observation.images.empty_camera_0
observation.images.empty_camera_1
...
```
At runtime, these keys have no corresponding data in the batch. The policy fills them with masked dummy tensors (padded with `-1` for SigLIP-based vision encoders, with a zero attention mask), so the extra image slots are effectively ignored during training and inference.
### Example
XVLA-base has three visual inputs and `empty_cameras=0` by default. Your dataset only has two cameras:
1. Set `--policy.empty_cameras=1`.
2. The config adds a third key: `observation.images.empty_camera_0`.
3. Use the rename map for your two real cameras as usual.
4. The third slot is masked out — no fake images needed in your dataset.
## Quick reference
| Goal | What to do |
| ----------------------------------------- | --------------------------------------------------------------------------- |
| Dataset keys ≠ policy keys | `--rename_map='{"dataset_key": "policy_key", ...}'` |
| Env keys ≠ policy keys (eval) | `--rename_map='{"env_key": "policy_key", ...}'` |
| Recording with different keys (inference) | `--dataset.rename_map='{"source_key": "policy_key", ...}'`. |
| Fewer cameras than policy expects | `--policy.empty_cameras=N` (supported by PI0, PI05, PI0Fast, SmolVLA, XVLA) |
| Avoid passing a rename map | Edit the policy's `config.json` so its keys match your data source |

View File

@@ -236,10 +236,10 @@ It is advisable to install one 3-pin cable in the motor after placing them befor
### Joint 1
- Install both motor horns. Secure the top horn with a M3x6mm screw. No screws are required for the bottom horn.
- Place the first motor into the base.
- Fasten the motor with 4 M2x6mm screws (smallest screws). Two from the top and two from the bottom.
- Slide over the first motor holder and fasten it using two M2x6mm screws (one on each side).
- Install both motor horns, securing the top horn with a M3x6mm screw.
- Attach the shoulder part.
- Tighten the shoulder part with 4 M3x6mm screws on top and 4 M3x6mm screws on the bottom
- Add the shoulder motor holder.
@@ -255,9 +255,9 @@ It is advisable to install one 3-pin cable in the motor after placing them befor
### Joint 2
- Install both motor horns. Secure the top horn with a M3x6mm screw. No screws are required for the bottom horn.
- Slide the second motor in from the top.
- Fasten the second motor with 4 M2x6mm screws.
- Attach both motor horns to motor 2, again use the M3x6mm horn screw.
- Attach the upper arm with 4 M3x6mm screws on each side.
<div class="video-container">
@@ -271,8 +271,8 @@ It is advisable to install one 3-pin cable in the motor after placing them befor
### Joint 3
- Insert motor 3 and fasten using 4 M2x6mm screws
- Attach both motor horns to motor 3 and secure one again with a M3x6mm horn screw.
- Install both motor horns. Secure the top horn with a M3x6mm screw. No screws are required for the bottom horn.
- Insert motor 3 and fasten using 4 M2x6mm screws.
- Connect the forearm to motor 3 using 4 M3x6mm screws on each side.
<div class="video-container">
@@ -286,9 +286,10 @@ It is advisable to install one 3-pin cable in the motor after placing them befor
### Joint 4
- Install both motor horns. Secure the top horn with a M3x6mm screw. No screws are required for the bottom horn.
- Slide over motor holder 4.
- Slide in motor 4.
- Fasten motor 4 with 4 M2x6mm screws and attach its motor horns, use a M3x6mm horn screw.
- Fasten motor 4 with 4 M2x6mm screws.
<div class="video-container">
<video controls width="600">
@@ -321,7 +322,7 @@ It is advisable to install one 3-pin cable in the motor after placing them befor
- Attach the gripper to motor 5, attach it to the motor horn on the wrist using 4 M3x6mm screws.
- Insert the gripper motor and secure it with 2 M2x6mm screws on each side.
- Attach the motor horns and again use a M3x6mm horn screw.
- Install both motor horns on the gripper motor. Secure the top horn with a M3x6mm screw; no screws are required for the bottom horn.
- Install the gripper claw and secure it with 4 M3x6mm screws on both sides.
<div class="video-container">

View File

@@ -78,7 +78,7 @@ def replay(cfg: ReplayConfig):
robot = make_robot_from_config(cfg.robot)
dataset = LeRobotDataset(cfg.dataset.repo_id, root=cfg.dataset.root, episodes=[cfg.dataset.episode])
actions = dataset.hf_dataset.select_columns(ACTION)
actions = dataset.select_columns(ACTION)
robot.connect()
try:

View File

@@ -32,7 +32,8 @@ import torch
from huggingface_hub import HfApi
import lerobot
from lerobot.datasets.lerobot_dataset import LeRobotDataset, LeRobotDatasetMetadata
from lerobot.datasets.dataset_metadata import LeRobotDatasetMetadata
from lerobot.datasets.lerobot_dataset import LeRobotDataset
def main():
@@ -87,9 +88,8 @@ def main():
# The previous metadata class is contained in the 'meta' attribute of the dataset:
print(dataset.meta)
# LeRobotDataset actually wraps an underlying Hugging Face dataset
# (see https://huggingface.co/docs/datasets for more information).
print(dataset.hf_dataset)
# You can inspect the dataset using its repr:
print(dataset)
# LeRobot datasets also subclasses PyTorch datasets so you can do everything you know and love from working
# with the latter, like iterating through the dataset.

View File

@@ -14,8 +14,8 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from lerobot.datasets.feature_utils import hw_to_dataset_features
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.datasets.utils import hw_to_dataset_features
from lerobot.policies.act.modeling_act import ACTPolicy
from lerobot.policies.factory import make_pre_post_processors
from lerobot.processor import make_default_processors

View File

@@ -14,8 +14,8 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from lerobot.datasets.feature_utils import hw_to_dataset_features
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.datasets.utils import hw_to_dataset_features
from lerobot.processor import make_default_processors
from lerobot.robots.lekiwi.config_lekiwi import LeKiwiClientConfig
from lerobot.robots.lekiwi.lekiwi_client import LeKiwiClient

View File

@@ -35,9 +35,7 @@ def main():
# Fetch the dataset to replay
dataset = LeRobotDataset("<hf_username>/<dataset_repo_id>", episodes=[EPISODE_IDX])
# Filter dataset to only include frames from the specified episode since episodes are chunked in dataset V3.0
episode_frames = dataset.hf_dataset.filter(lambda x: x["episode_index"] == EPISODE_IDX)
actions = episode_frames.select_columns(ACTION)
actions = dataset.select_columns(ACTION)
# Connect to the robot
robot.connect()
@@ -48,7 +46,7 @@ def main():
print("Starting replay loop...")
log_say(f"Replaying episode {EPISODE_IDX}")
for idx in range(len(episode_frames)):
for idx in range(dataset.num_frames):
t0 = time.perf_counter()
# Get recorded action from dataset

View File

@@ -43,12 +43,13 @@ def main():
keyboard.connect()
# Init rerun viewer
init_rerun(session_name="lekiwi_teleop")
init_rerun(session_name="lekiwi_teleop", robot=robot, reset_time=True)
if not robot.is_connected or not leader_arm.is_connected or not keyboard.is_connected:
raise ValueError("Robot or teleop is not connected!")
print("Starting teleop loop...")
start = time.perf_counter()
while True:
t0 = time.perf_counter()
@@ -69,7 +70,7 @@ def main():
_ = robot.send_action(action)
# Visualize
log_rerun_data(observation=observation, action=action)
log_rerun_data(observation=observation, action=action, log_time=time.perf_counter() - start)
precise_sleep(max(1.0 / FPS - (time.perf_counter() - t0), 0.0))

View File

@@ -16,9 +16,9 @@
from lerobot.cameras.opencv.configuration_opencv import OpenCVCameraConfig
from lerobot.configs.types import FeatureType, PolicyFeature
from lerobot.datasets.feature_utils import combine_feature_dicts
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.datasets.pipeline_features import aggregate_pipeline_dataset_features, create_initial_features
from lerobot.datasets.utils import combine_feature_dicts
from lerobot.model.kinematics import RobotKinematics
from lerobot.policies.act.modeling_act import ACTPolicy
from lerobot.policies.factory import make_pre_post_processors

View File

@@ -15,9 +15,9 @@
# limitations under the License.
from lerobot.cameras.opencv.configuration_opencv import OpenCVCameraConfig
from lerobot.datasets.feature_utils import combine_feature_dicts
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.datasets.pipeline_features import aggregate_pipeline_dataset_features, create_initial_features
from lerobot.datasets.utils import combine_feature_dicts
from lerobot.model.kinematics import RobotKinematics
from lerobot.processor import RobotProcessorPipeline
from lerobot.processor.converters import (

View File

@@ -67,9 +67,7 @@ def main():
# Fetch the dataset to replay
dataset = LeRobotDataset(HF_REPO_ID, episodes=[EPISODE_IDX])
# Filter dataset to only include frames from the specified episode since episodes are chunked in dataset V3.0
episode_frames = dataset.hf_dataset.filter(lambda x: x["episode_index"] == EPISODE_IDX)
actions = episode_frames.select_columns(ACTION)
actions = dataset.select_columns(ACTION)
# Connect to the robot
robot.connect()
@@ -80,7 +78,7 @@ def main():
print("Starting replay loop...")
log_say(f"Replaying episode {EPISODE_IDX}")
for idx in range(len(episode_frames)):
for idx in range(dataset.num_frames):
t0 = time.perf_counter()
# Get recorded action from dataset

View File

@@ -90,12 +90,13 @@ def main():
teleop_device.connect()
# Init rerun viewer
init_rerun(session_name="phone_so100_teleop")
init_rerun(session_name="phone_so100_teleop", robot=robot, reset_time=True)
if not robot.is_connected or not teleop_device.is_connected:
raise ValueError("Robot or teleop is not connected!")
print("Starting teleop loop. Move your phone to teleoperate the robot...")
start = time.perf_counter()
while True:
t0 = time.perf_counter()
@@ -112,7 +113,7 @@ def main():
_ = robot.send_action(joint_action)
# Visualize
log_rerun_data(observation=phone_obs, action=joint_action)
log_rerun_data(observation=phone_obs, action=joint_action, log_time=time.perf_counter() - start)
precise_sleep(max(1.0 / FPS - (time.perf_counter() - t0), 0.0))

View File

@@ -22,7 +22,8 @@ from pathlib import Path
import numpy as np
import tensorflow_datasets as tfds
from lerobot.datasets.lerobot_dataset import LeRobotDataset, LeRobotDatasetMetadata
from lerobot.datasets.dataset_metadata import LeRobotDatasetMetadata
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.utils.utils import get_elapsed_time_in_days_hours_minutes_seconds
DROID_SHARDS = 2048

View File

@@ -26,7 +26,7 @@ from huggingface_hub import HfApi
from huggingface_hub.constants import REPOCARD_NAME
from port_droid import DROID_SHARDS
from lerobot.datasets.lerobot_dataset import CODEBASE_VERSION, LeRobotDatasetMetadata
from lerobot.datasets.dataset_metadata import CODEBASE_VERSION, LeRobotDatasetMetadata
from lerobot.datasets.utils import create_lerobot_dataset_card
from lerobot.utils.utils import init_logging
@@ -155,7 +155,7 @@ class UploadDataset(PipelineStep):
from datasets.utils.tqdm import disable_progress_bars
from huggingface_hub import CommitOperationAdd, preupload_lfs_files
from lerobot.datasets.lerobot_dataset import LeRobotDatasetMetadata
from lerobot.datasets.dataset_metadata import LeRobotDatasetMetadata
from lerobot.utils.utils import init_logging
init_logging()

View File

@@ -113,8 +113,9 @@ from lerobot.configs import parser
from lerobot.configs.default import DatasetConfig
from lerobot.configs.policies import PreTrainedConfig
from lerobot.configs.types import RTCAttentionSchedule
from lerobot.datasets.dataset_metadata import LeRobotDatasetMetadata
from lerobot.datasets.factory import resolve_delta_timestamps
from lerobot.datasets.lerobot_dataset import LeRobotDataset, LeRobotDatasetMetadata
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.policies.factory import get_policy_class, make_pre_post_processors
from lerobot.policies.rtc.configuration_rtc import RTCConfig
from lerobot.policies.rtc.debug_visualizer import RTCDebugVisualizer

View File

@@ -63,6 +63,26 @@ Usage:
--robot.cameras="{ gripper: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}, front: {type: opencv, index_or_path: 1, width: 640, height: 480, fps: 30}}" \
--task="Move green small object into the purple platform" \
--duration=120
# Run RTC with bi_openarm_follower (dual-arm OpenArms) and pi0.5 policy
python examples/rtc/eval_with_real_robot.py \
--policy.path=lerobot-data-collection/folding_final \
--robot.type=bi_openarm_follower \
--robot.cameras='{left_wrist: {type: opencv, index_or_path: "/dev/video4", width: 1280, height: 720, fps: 30}, base: {type: opencv, index_or_path: "/dev/video2", width: 640, height: 480, fps: 30}, right_wrist: {type: opencv, index_or_path: "/dev/video0", width: 1280, height: 720, fps: 30}}' \
--robot.left_arm_config.port=can1 \
--robot.left_arm_config.side=left \
--robot.left_arm_config.can_interface=socketcan \
--robot.right_arm_config.port=can0 \
--robot.right_arm_config.side=right \
--robot.right_arm_config.can_interface=socketcan \
--task="Fold the T-shirt properly" \
--fps=30 \
--duration=2000 \
--rtc.enabled=true \
--rtc.execution_horizon=20 \
--rtc.max_guidance_weight=5.0 \
--rtc.prefix_attention_schedule=LINEAR \
--device=cuda
"""
import logging
@@ -82,26 +102,34 @@ from lerobot.cameras.zmq.configuration_zmq import ZMQCameraConfig # noqa: F401
from lerobot.configs import parser
from lerobot.configs.policies import PreTrainedConfig
from lerobot.configs.types import RTCAttentionSchedule
from lerobot.datasets.utils import build_dataset_frame, hw_to_dataset_features
from lerobot.datasets.feature_utils import build_dataset_frame, hw_to_dataset_features
from lerobot.policies.factory import get_policy_class, make_pre_post_processors
from lerobot.policies.rtc.action_queue import ActionQueue
from lerobot.policies.rtc.configuration_rtc import RTCConfig
from lerobot.policies.rtc.latency_tracker import LatencyTracker
from lerobot.processor import (
NormalizerProcessorStep,
RelativeActionsProcessorStep,
TransitionKey,
create_transition,
)
from lerobot.processor.factory import (
make_default_robot_action_processor,
make_default_robot_observation_processor,
)
from lerobot.processor.relative_action_processor import to_relative_actions
from lerobot.rl.process import ProcessSignalHandler
from lerobot.robots import ( # noqa: F401
Robot,
RobotConfig,
bi_openarm_follower,
bi_so_follower,
koch_follower,
so_follower,
unitree_g1,
)
from lerobot.robots.utils import make_robot_from_config
from lerobot.utils.constants import OBS_IMAGES
from lerobot.utils.constants import OBS_IMAGES, OBS_STATE
from lerobot.utils.hub import HubMixin
from lerobot.utils.utils import init_logging
@@ -212,6 +240,35 @@ def is_image_key(k: str) -> bool:
return k.startswith(OBS_IMAGES)
def _reanchor_relative_rtc_prefix(
prev_actions_absolute: Tensor,
current_state: Tensor,
relative_step: RelativeActionsProcessorStep,
normalizer_step: NormalizerProcessorStep | None,
policy_device: torch.device | str,
) -> Tensor:
"""Convert absolute leftovers into model-space for relative-action RTC policies.
When a policy uses relative actions, the RTC prefix (leftover actions from
the previous chunk) is stored in absolute space. Before feeding it back to
the policy we need to re-express it relative to the *current* robot state
and then re-normalize.
"""
state = current_state.detach().cpu()
if state.dim() == 1:
state = state.unsqueeze(0)
action_cpu = prev_actions_absolute.detach().cpu()
mask = relative_step._build_mask(action_cpu.shape[-1])
relative_actions = to_relative_actions(action_cpu, state, mask)
transition = create_transition(action=relative_actions)
if normalizer_step is not None:
transition = normalizer_step(transition)
return transition[TransitionKey.ACTION].to(policy_device)
def get_actions(
policy,
robot: RobotWrapper,
@@ -237,7 +294,15 @@ def get_actions(
fps = cfg.fps
time_per_chunk = 1.0 / fps
dataset_features = hw_to_dataset_features(robot.observation_features(), "observation")
# Only keep .pos joints + camera streams if the policy was trained on positions,
# not the full pos/vel/torque state the robot exposes.
observation_features_hw = {
key: value
for key, value in robot.observation_features().items()
if key.endswith(".pos") or isinstance(value, tuple)
}
dataset_features = hw_to_dataset_features(observation_features_hw, "observation")
policy_device = policy.config.device
# Load preprocessor and postprocessor from pretrained files
@@ -255,6 +320,25 @@ def get_actions(
logger.info("[GET_ACTIONS] Preprocessor/postprocessor loaded successfully with embedded stats")
relative_step = next(
(s for s in preprocessor.steps if isinstance(s, RelativeActionsProcessorStep) and s.enabled),
None,
)
normalizer_step = next(
(s for s in preprocessor.steps if isinstance(s, NormalizerProcessorStep)),
None,
)
if relative_step is not None:
if relative_step.action_names is None:
cfg_names = getattr(cfg.policy, "action_feature_names", None)
if cfg_names:
relative_step.action_names = list(cfg_names)
else:
relative_step.action_names = [
k for k in robot.robot.action_features if k.endswith(".pos")
]
logger.info("[GET_ACTIONS] Relative actions enabled: will re-anchor RTC prefix")
get_actions_threshold = cfg.action_queue_size_to_get_new_actions
if not cfg.rtc.enabled:
@@ -297,6 +381,28 @@ def get_actions(
preproceseded_obs = preprocessor(obs_with_policy_features)
# Re-anchor leftover actions for relative-action policies.
# We need the *postprocessed* (absolute) leftover, not the original
# (normalized/relative) one that get_left_over() returns.
if (
prev_actions is not None
and relative_step is not None
and OBS_STATE in obs_with_policy_features
):
with action_queue.lock:
if action_queue.queue is not None:
prev_actions_abs = action_queue.queue[action_queue.last_index :].clone()
else:
prev_actions_abs = None
if prev_actions_abs is not None and prev_actions_abs.numel() > 0:
prev_actions = _reanchor_relative_rtc_prefix(
prev_actions_absolute=prev_actions_abs,
current_state=obs_with_policy_features[OBS_STATE],
relative_step=relative_step,
normalizer_step=normalizer_step,
policy_device=policy_device,
)
# Generate actions WITH RTC
actions = policy.predict_action_chunk(
preproceseded_obs,
@@ -352,6 +458,8 @@ def actor_control(
try:
logger.info("[ACTOR] Starting actor thread")
action_keys = [k for k in robot.action_features() if k.endswith(".pos")]
action_count = 0
action_interval = 1.0 / cfg.fps
@@ -363,7 +471,7 @@ def actor_control(
if action is not None:
action = action.cpu()
action_dict = {key: action[i].item() for i, key in enumerate(robot.action_features())}
action_dict = {key: action[i].item() for i, key in enumerate(action_keys)}
action_processed = robot_action_processor((action_dict, None))
robot.send_action(action_processed)

View File

@@ -16,9 +16,9 @@
from lerobot.cameras.opencv.configuration_opencv import OpenCVCameraConfig
from lerobot.configs.types import FeatureType, PolicyFeature
from lerobot.datasets.feature_utils import combine_feature_dicts
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.datasets.pipeline_features import aggregate_pipeline_dataset_features, create_initial_features
from lerobot.datasets.utils import combine_feature_dicts
from lerobot.model.kinematics import RobotKinematics
from lerobot.policies.act.modeling_act import ACTPolicy
from lerobot.policies.factory import make_pre_post_processors

View File

@@ -16,9 +16,9 @@
from lerobot.cameras.opencv.configuration_opencv import OpenCVCameraConfig
from lerobot.datasets.feature_utils import combine_feature_dicts
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.datasets.pipeline_features import aggregate_pipeline_dataset_features, create_initial_features
from lerobot.datasets.utils import combine_feature_dicts
from lerobot.model.kinematics import RobotKinematics
from lerobot.processor import RobotProcessorPipeline
from lerobot.processor.converters import (

View File

@@ -68,9 +68,7 @@ def main():
# Fetch the dataset to replay
dataset = LeRobotDataset(HF_REPO_ID, episodes=[EPISODE_IDX])
# Filter dataset to only include frames from the specified episode since episodes are chunked in dataset V3.0
episode_frames = dataset.hf_dataset.filter(lambda x: x["episode_index"] == EPISODE_IDX)
actions = episode_frames.select_columns(ACTION)
actions = dataset.select_columns(ACTION)
# Connect to the robot
robot.connect()
@@ -81,7 +79,7 @@ def main():
print("Starting replay loop...")
log_say(f"Replaying episode {EPISODE_IDX}")
for idx in range(len(episode_frames)):
for idx in range(dataset.num_frames):
t0 = time.perf_counter()
# Get recorded action from dataset

View File

@@ -95,9 +95,10 @@ def main():
leader.connect()
# Init rerun viewer
init_rerun(session_name="so100_so100_EE_teleop")
init_rerun(session_name="so100_so100_EE_teleop", robot=follower, reset_time=True)
print("Starting teleop loop...")
start = time.perf_counter()
while True:
t0 = time.perf_counter()
@@ -117,7 +118,9 @@ def main():
_ = follower.send_action(follower_joints_act)
# Visualize
log_rerun_data(observation=leader_ee_act, action=follower_joints_act)
log_rerun_data(
observation=leader_ee_act, action=follower_joints_act, log_time=time.perf_counter() - start
)
precise_sleep(max(1.0 / FPS - (time.perf_counter() - t0), 0.0))

View File

@@ -19,8 +19,9 @@ from pathlib import Path
import torch
from lerobot.configs.types import FeatureType
from lerobot.datasets.lerobot_dataset import LeRobotDataset, LeRobotDatasetMetadata
from lerobot.datasets.utils import dataset_to_policy_features
from lerobot.datasets.dataset_metadata import LeRobotDatasetMetadata
from lerobot.datasets.feature_utils import dataset_to_policy_features
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.policies.diffusion.configuration_diffusion import DiffusionConfig
from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy
from lerobot.policies.factory import make_pre_post_processors

View File

@@ -20,9 +20,9 @@ from pathlib import Path
import torch
from lerobot.configs.types import FeatureType
from lerobot.datasets.lerobot_dataset import LeRobotDatasetMetadata
from lerobot.datasets.dataset_metadata import LeRobotDatasetMetadata
from lerobot.datasets.feature_utils import dataset_to_policy_features
from lerobot.datasets.streaming_dataset import StreamingLeRobotDataset
from lerobot.datasets.utils import dataset_to_policy_features
from lerobot.policies.act.configuration_act import ACTConfig
from lerobot.policies.act.modeling_act import ACTPolicy
from lerobot.policies.factory import make_pre_post_processors

View File

@@ -5,8 +5,9 @@ from pathlib import Path
import torch
from lerobot.configs.types import FeatureType
from lerobot.datasets.lerobot_dataset import LeRobotDataset, LeRobotDatasetMetadata
from lerobot.datasets.utils import dataset_to_policy_features
from lerobot.datasets.dataset_metadata import LeRobotDatasetMetadata
from lerobot.datasets.feature_utils import dataset_to_policy_features
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.policies.act.configuration_act import ACTConfig
from lerobot.policies.act.modeling_act import ACTPolicy
from lerobot.policies.factory import make_pre_post_processors

View File

@@ -1,7 +1,7 @@
import torch
from lerobot.cameras.opencv.configuration_opencv import OpenCVCameraConfig
from lerobot.datasets.lerobot_dataset import LeRobotDatasetMetadata
from lerobot.datasets.dataset_metadata import LeRobotDatasetMetadata
from lerobot.policies.act.modeling_act import ACTPolicy
from lerobot.policies.factory import make_pre_post_processors
from lerobot.policies.utils import build_inference_frame, make_robot_action

View File

@@ -5,8 +5,9 @@ from pathlib import Path
import torch
from lerobot.configs.types import FeatureType
from lerobot.datasets.lerobot_dataset import LeRobotDataset, LeRobotDatasetMetadata
from lerobot.datasets.utils import dataset_to_policy_features
from lerobot.datasets.dataset_metadata import LeRobotDatasetMetadata
from lerobot.datasets.feature_utils import dataset_to_policy_features
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.policies.diffusion.configuration_diffusion import DiffusionConfig
from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy
from lerobot.policies.factory import make_pre_post_processors

View File

@@ -1,7 +1,7 @@
import torch
from lerobot.cameras.opencv.configuration_opencv import OpenCVCameraConfig
from lerobot.datasets.lerobot_dataset import LeRobotDatasetMetadata
from lerobot.datasets.dataset_metadata import LeRobotDatasetMetadata
from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy
from lerobot.policies.factory import make_pre_post_processors
from lerobot.policies.utils import build_inference_frame, make_robot_action

View File

@@ -1,7 +1,7 @@
import torch
from lerobot.cameras.opencv.configuration_opencv import OpenCVCameraConfig
from lerobot.datasets.utils import hw_to_dataset_features
from lerobot.datasets.feature_utils import hw_to_dataset_features
from lerobot.policies.factory import make_pre_post_processors
from lerobot.policies.pi0.modeling_pi0 import PI0Policy
from lerobot.policies.utils import build_inference_frame, make_robot_action

View File

@@ -6,8 +6,8 @@ from queue import Empty, Full
import torch
import torch.optim as optim
from lerobot.datasets.feature_utils import hw_to_dataset_features
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.datasets.utils import hw_to_dataset_features
from lerobot.envs.configs import HILSerlProcessorConfig, HILSerlRobotEnvConfig
from lerobot.policies.sac.configuration_sac import SACConfig
from lerobot.policies.sac.modeling_sac import SACPolicy

View File

@@ -1,7 +1,7 @@
import torch
from lerobot.cameras.opencv.configuration_opencv import OpenCVCameraConfig
from lerobot.datasets.utils import hw_to_dataset_features
from lerobot.datasets.feature_utils import hw_to_dataset_features
from lerobot.policies.factory import make_pre_post_processors
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy
from lerobot.policies.utils import build_inference_frame, make_robot_action

View File

@@ -99,7 +99,7 @@ dependencies = [
# Common
pygame-dep = ["pygame>=2.5.1,<2.7.0"]
placo-dep = ["placo>=0.9.6,<0.9.17"]
transformers-dep = ["transformers>=5.3.0,<6.0.0"]
transformers-dep = ["transformers==5.3.0"] # TODO(Steven): https://github.com/huggingface/lerobot/pull/3249
grpcio-dep = ["grpcio==1.73.1", "protobuf>=6.31.1,<6.32.0"]
can-dep = ["python-can>=4.2.0,<5.0.0"]
peft-dep = ["peft>=0.18.0,<1.0.0"]
@@ -145,6 +145,7 @@ wallx = [
]
pi = ["lerobot[transformers-dep]", "lerobot[scipy-dep]"]
smolvla = ["lerobot[transformers-dep]", "num2words>=0.5.14,<0.6.0", "accelerate>=1.7.0,<2.0.0", "safetensors>=0.4.3,<1.0.0"]
multi_task_dit = ["lerobot[transformers-dep]"]
groot = [
"lerobot[transformers-dep]",
"lerobot[peft]",
@@ -163,6 +164,7 @@ hilserl = ["lerobot[transformers-dep]", "gym-hil>=0.1.13,<0.2.0", "lerobot[grpci
# Features
async = ["lerobot[grpcio-dep]", "lerobot[matplotlib-dep]"]
peft = ["lerobot[transformers-dep]", "lerobot[peft-dep]"]
audio = ["sounddevice>=0.5.1,<0.6.0", "soundfile>=0.13.1,<0.14.0", "librosa>=0.11.0,<0.12.0", "torchaudio>=2.6.0,<2.10.0"]
# Development
dev = ["pre-commit>=3.7.0,<5.0.0", "debugpy>=1.8.1,<1.9.0", "lerobot[grpcio-dep]", "grpcio-tools==1.73.1", "mypy>=1.19.1"]
@@ -197,6 +199,7 @@ all = [
"lerobot[xvla]",
"lerobot[hilserl]",
"lerobot[async]",
"lerobot[audio]",
"lerobot[dev]",
"lerobot[test]",
"lerobot[video_benchmark]",

View File

@@ -29,6 +29,7 @@ Example:
print(lerobot.available_policies_per_env)
print(lerobot.available_robots)
print(lerobot.available_cameras)
print(lerobot.available_microphones)
print(lerobot.available_motors)
```
@@ -174,6 +175,12 @@ available_cameras = [
"intelrealsense",
]
# lists all available microphones from `lerobot/microphones`
available_microphones = [
"portaudio",
"touchlab",
]
# lists all available motors from `lerobot/motors`
available_motors = [
"dynamixel",

View File

@@ -23,7 +23,7 @@ from typing import Any
import torch
from lerobot.configs.types import PolicyFeature
from lerobot.datasets.utils import build_dataset_frame, hw_to_dataset_features
from lerobot.datasets.feature_utils import build_dataset_frame, hw_to_dataset_features
# NOTE: Configs need to be loaded for the client to be able to instantiate the policy config
from lerobot.policies import ( # noqa: F401

View File

@@ -49,6 +49,8 @@ import torch
from lerobot.cameras.opencv.configuration_opencv import OpenCVCameraConfig # noqa: F401
from lerobot.cameras.realsense.configuration_realsense import RealSenseCameraConfig # noqa: F401
from lerobot.microphones.portaudio.configuration_portaudio import PortAudioMicrophoneConfig # noqa: F401
from lerobot.microphones.touchlab.configuration_touchlab import TouchLabSensorConfig # noqa: F401
from lerobot.robots import ( # noqa: F401
Robot,
RobotConfig,

View File

@@ -27,7 +27,8 @@ class DatasetConfig:
# "dataset_index" into the returned item. The index mapping is made according to the order in which the
# datasets are provided.
repo_id: str
# Root directory where the dataset will be stored (e.g. 'dataset/path'). If None, defaults to $HF_LEROBOT_HOME/repo_id.
# Root directory for a concrete local dataset tree (e.g. 'dataset/path'). If None, local datasets are
# looked up under $HF_LEROBOT_HOME/repo_id and Hub downloads use a revision-safe cache under $HF_LEROBOT_HOME/hub.
root: str | None = None
episodes: list[int] | None = None
image_transforms: ImageTransformsConfig = field(default_factory=ImageTransformsConfig)
@@ -36,6 +37,16 @@ class DatasetConfig:
video_backend: str = field(default_factory=get_safe_default_codec)
streaming: bool = False
def __post_init__(self) -> None:
if self.episodes is not None:
if any(ep < 0 for ep in self.episodes):
raise ValueError(
f"Episode indices must be non-negative, got: {[ep for ep in self.episodes if ep < 0]}"
)
if len(self.episodes) != len(set(self.episodes)):
duplicates = sorted({ep for ep in self.episodes if self.episodes.count(ep) > 1})
raise ValueError(f"Episode indices contain duplicates: {duplicates}")
@dataclass
class WandBConfig:

View File

@@ -151,6 +151,12 @@ class PreTrainedConfig(draccus.ChoiceRegistry, HubMixin, abc.ABC): # type: igno
return {}
return {key: ft for key, ft in self.input_features.items() if ft.type is FeatureType.VISUAL}
@property
def audio_features(self) -> dict[str, PolicyFeature]:
if not self.input_features:
return {}
return {key: ft for key, ft in self.input_features.items() if ft.type is FeatureType.AUDIO}
@property
def action_feature(self) -> PolicyFeature | None:
if not self.output_features:

View File

@@ -20,6 +20,7 @@ from enum import Enum
class FeatureType(str, Enum):
STATE = "STATE"
VISUAL = "VISUAL"
AUDIO = "AUDIO"
ENV = "ENV"
ACTION = "ACTION"
REWARD = "REWARD"

View File

@@ -746,7 +746,8 @@ def save_annotations_to_dataset(
dataset_path: Path, annotations: dict[int, SubtaskAnnotation], fps: int, prefix: str = "sparse"
):
"""Save annotations to LeRobot dataset parquet format."""
from lerobot.datasets.utils import DEFAULT_EPISODES_PATH, load_episodes
from lerobot.datasets.io_utils import load_episodes
from lerobot.datasets.utils import DEFAULT_EPISODES_PATH
episodes_dataset = load_episodes(dataset_path)
if not episodes_dataset or len(episodes_dataset) == 0:
@@ -840,7 +841,7 @@ def generate_auto_sparse_annotations(
def load_annotations_from_dataset(dataset_path: Path, prefix: str = "sparse") -> dict[int, SubtaskAnnotation]:
"""Load annotations from LeRobot dataset parquet files."""
from lerobot.datasets.utils import load_episodes
from lerobot.datasets.io_utils import load_episodes
episodes_dataset = load_episodes(dataset_path)
if not episodes_dataset or len(episodes_dataset) == 0:

View File

@@ -0,0 +1,33 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team.
# All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from lerobot.datasets.dataset_metadata import LeRobotDatasetMetadata
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.datasets.multi_dataset import MultiLeRobotDataset
from lerobot.datasets.sampler import EpisodeAwareSampler
from lerobot.datasets.streaming_dataset import StreamingLeRobotDataset
from lerobot.datasets.transforms import ImageTransforms, ImageTransformsConfig
__all__ = [
"EpisodeAwareSampler",
"ImageTransforms",
"ImageTransformsConfig",
"LeRobotDataset",
"LeRobotDatasetMetadata",
"MultiLeRobotDataset",
"StreamingLeRobotDataset",
]

View File

@@ -24,24 +24,28 @@ import pandas as pd
import tqdm
from lerobot.datasets.compute_stats import aggregate_stats
from lerobot.datasets.lerobot_dataset import LeRobotDatasetMetadata
from lerobot.datasets.dataset_metadata import LeRobotDatasetMetadata
from lerobot.datasets.feature_utils import get_hf_features_from_features
from lerobot.datasets.io_utils import (
get_file_size_in_mb,
get_parquet_file_size_in_mb,
to_parquet_with_hf_images,
write_info,
write_stats,
write_tasks,
)
from lerobot.datasets.utils import (
DEFAULT_AUDIO_FILE_SIZE_IN_MB,
DEFAULT_AUDIO_PATH,
DEFAULT_CHUNK_SIZE,
DEFAULT_DATA_FILE_SIZE_IN_MB,
DEFAULT_DATA_PATH,
DEFAULT_EPISODES_PATH,
DEFAULT_VIDEO_FILE_SIZE_IN_MB,
DEFAULT_VIDEO_PATH,
get_file_size_in_mb,
get_hf_features_from_features,
get_parquet_file_size_in_mb,
to_parquet_with_hf_images,
update_chunk_file_indices,
write_info,
write_stats,
write_tasks,
)
from lerobot.datasets.video_utils import concatenate_video_files, get_video_duration_in_s
from lerobot.datasets.video_utils import concatenate_media_files, get_media_duration_in_s
def validate_all_metadata(all_metadata: list[LeRobotDatasetMetadata]):
@@ -110,6 +114,7 @@ def update_meta_data(
meta_idx,
data_idx,
videos_idx,
audios_idx,
):
"""Updates metadata DataFrame with new chunk, file, and timestamp indices.
@@ -125,7 +130,7 @@ def update_meta_data(
meta_idx: Dictionary containing current metadata chunk and file indices.
data_idx: Dictionary containing current data chunk and file indices.
videos_idx: Dictionary containing current video indices and timestamps.
audios_idx: Dictionary containing current audio indices and timestamps.
Returns:
pd.DataFrame: Updated DataFrame with adjusted indices and timestamps.
"""
@@ -223,6 +228,36 @@ def update_meta_data(
# Clean up temporary columns
df = df.drop(columns=["_orig_chunk", "_orig_file"])
for key, audio_idx in audios_idx.items():
# Store original audio file indices before updating
orig_chunk_col = f"audio/{key}/chunk_index"
orig_file_col = f"audio/{key}/file_index"
df["_orig_chunk"] = df[orig_chunk_col].copy()
df["_orig_file"] = df[orig_file_col].copy()
# Update chunk and file indices to point to destination
df[orig_chunk_col] = audio_idx["chunk"]
df[orig_file_col] = audio_idx["file"]
# Apply per-source-file timestamp offsets
src_to_offset = audio_idx.get("src_to_offset", {})
if src_to_offset:
# Apply offset based on original source file
for idx in df.index:
src_key = (df.at[idx, "_orig_chunk"], df.at[idx, "_orig_file"])
offset = src_to_offset.get(src_key, 0)
df.at[idx, f"audio/{key}/from_timestamp"] += offset
df.at[idx, f"audio/{key}/to_timestamp"] += offset
else:
# Fallback to simple offset (for backward compatibility)
df[f"audio/{key}/from_timestamp"] = (
df[f"audio/{key}/from_timestamp"] + audio_idx["latest_duration"]
)
df[f"audio/{key}/to_timestamp"] = df[f"audio/{key}/to_timestamp"] + audio_idx["latest_duration"]
# Clean up temporary columns
df = df.drop(columns=["_orig_chunk", "_orig_file"])
df["dataset_from_index"] = df["dataset_from_index"] + dst_meta.info["total_frames"]
df["dataset_to_index"] = df["dataset_to_index"] + dst_meta.info["total_frames"]
df["episode_index"] = df["episode_index"] + dst_meta.info["total_episodes"]
@@ -237,6 +272,7 @@ def aggregate_datasets(
aggr_root: Path | None = None,
data_files_size_in_mb: float | None = None,
video_files_size_in_mb: float | None = None,
audio_files_size_in_mb: float | None = None,
chunk_size: int | None = None,
):
"""Aggregates multiple LeRobot datasets into a single unified dataset.
@@ -254,6 +290,7 @@ def aggregate_datasets(
aggr_root: Optional root path for the aggregated dataset.
data_files_size_in_mb: Maximum size for data files in MB (defaults to DEFAULT_DATA_FILE_SIZE_IN_MB)
video_files_size_in_mb: Maximum size for video files in MB (defaults to DEFAULT_VIDEO_FILE_SIZE_IN_MB)
audio_files_size_in_mb: Maximum size for audio files in MB (defaults to DEFAULT_AUDIO_FILE_SIZE_IN_MB)
chunk_size: Maximum number of files per chunk (defaults to DEFAULT_CHUNK_SIZE)
"""
logging.info("Start aggregate_datasets")
@@ -262,6 +299,8 @@ def aggregate_datasets(
data_files_size_in_mb = DEFAULT_DATA_FILE_SIZE_IN_MB
if video_files_size_in_mb is None:
video_files_size_in_mb = DEFAULT_VIDEO_FILE_SIZE_IN_MB
if audio_files_size_in_mb is None:
audio_files_size_in_mb = DEFAULT_AUDIO_FILE_SIZE_IN_MB
if chunk_size is None:
chunk_size = DEFAULT_CHUNK_SIZE
@@ -274,6 +313,7 @@ def aggregate_datasets(
)
fps, robot_type, features = validate_all_metadata(all_metadata)
video_keys = [key for key in features if features[key]["dtype"] == "video"]
audio_keys = [key for key in features if features[key]["dtype"] == "audio"]
dst_meta = LeRobotDatasetMetadata.create(
repo_id=aggr_repo_id,
@@ -285,6 +325,7 @@ def aggregate_datasets(
chunks_size=chunk_size,
data_files_size_in_mb=data_files_size_in_mb,
video_files_size_in_mb=video_files_size_in_mb,
audio_files_size_in_mb=audio_files_size_in_mb,
)
logging.info("Find all tasks")
@@ -298,14 +339,18 @@ def aggregate_datasets(
videos_idx = {
key: {"chunk": 0, "file": 0, "latest_duration": 0, "episode_duration": 0} for key in video_keys
}
audios_idx = {
key: {"chunk": 0, "file": 0, "latest_duration": 0, "episode_duration": 0} for key in audio_keys
}
dst_meta.episodes = {}
for src_meta in tqdm.tqdm(all_metadata, desc="Copy data and videos"):
videos_idx = aggregate_videos(src_meta, dst_meta, videos_idx, video_files_size_in_mb, chunk_size)
audios_idx = aggregate_audio(src_meta, dst_meta, audios_idx, audio_files_size_in_mb, chunk_size)
data_idx = aggregate_data(src_meta, dst_meta, data_idx, data_files_size_in_mb, chunk_size)
meta_idx = aggregate_metadata(src_meta, dst_meta, meta_idx, data_idx, videos_idx)
meta_idx = aggregate_metadata(src_meta, dst_meta, meta_idx, data_idx, videos_idx, audios_idx)
# Clear the src_to_dst mapping after processing each source dataset
# to avoid interference between different source datasets
@@ -373,7 +418,7 @@ def aggregate_videos(src_meta, dst_meta, videos_idx, video_files_size_in_mb, chu
file_index=file_idx,
)
src_duration = get_video_duration_in_s(src_path)
src_duration = get_media_duration_in_s(src_path, media_type="video")
dst_key = (chunk_idx, file_idx)
if not dst_path.exists():
@@ -412,7 +457,7 @@ def aggregate_videos(src_meta, dst_meta, videos_idx, video_files_size_in_mb, chu
current_dst_duration = dst_file_durations.get(dst_key, 0)
videos_idx[key]["src_to_offset"][(src_chunk_idx, src_file_idx)] = current_dst_duration
videos_idx[key]["src_to_dst"][(src_chunk_idx, src_file_idx)] = dst_key
concatenate_video_files(
concatenate_media_files(
[dst_path, src_path],
dst_path,
)
@@ -427,6 +472,101 @@ def aggregate_videos(src_meta, dst_meta, videos_idx, video_files_size_in_mb, chu
return videos_idx
def aggregate_audio(src_meta, dst_meta, audios_idx, audio_files_size_in_mb, chunk_size):
"""Aggregates audio files from a source dataset into the destination dataset.
Handles audio file concatenation and rotation based on file size limits.
Creates new audio files when size limits are exceeded.
Args:
src_meta: Source dataset metadata.
dst_meta: Destination dataset metadata.
audio_idx: Dictionary tracking audio chunk and file indices.
audio_files_size_in_mb: Maximum size for audio files in MB (defaults to DEFAULT_AUDIO_FILE_SIZE_IN_MB)
chunk_size: Maximum number of files per chunk (defaults to DEFAULT_CHUNK_SIZE)
Returns:
dict: Updated audio_idx with current chunk and file indices.
"""
for key in audios_idx:
audios_idx[key]["episode_duration"] = 0
# Track offset for each source (chunk, file) pair
audios_idx[key]["src_to_offset"] = {}
for key, audio_idx in audios_idx.items():
unique_chunk_file_pairs = {
(chunk, file)
for chunk, file in zip(
src_meta.episodes[f"audio/{key}/chunk_index"],
src_meta.episodes[f"audio/{key}/file_index"],
strict=False,
)
}
unique_chunk_file_pairs = sorted(unique_chunk_file_pairs)
chunk_idx = audio_idx["chunk"]
file_idx = audio_idx["file"]
current_offset = audio_idx["latest_duration"]
for src_chunk_idx, src_file_idx in unique_chunk_file_pairs:
src_path = src_meta.root / DEFAULT_AUDIO_PATH.format(
audio_key=key,
chunk_index=src_chunk_idx,
file_index=src_file_idx,
)
dst_path = dst_meta.root / DEFAULT_AUDIO_PATH.format(
audio_key=key,
chunk_index=chunk_idx,
file_index=file_idx,
)
src_duration = get_media_duration_in_s(src_path, media_type="audio")
if not dst_path.exists():
# Store offset before incrementing
audios_idx[key]["src_to_offset"][(src_chunk_idx, src_file_idx)] = current_offset
dst_path.parent.mkdir(parents=True, exist_ok=True)
shutil.copy(str(src_path), str(dst_path))
audios_idx[key]["episode_duration"] += src_duration
current_offset += src_duration
continue
# Check file sizes before appending
src_size = get_file_size_in_mb(src_path)
dst_size = get_file_size_in_mb(dst_path)
if dst_size + src_size >= audio_files_size_in_mb:
# Rotate to a new file, this source becomes start of new destination
# So its offset should be 0
audios_idx[key]["src_to_offset"][(src_chunk_idx, src_file_idx)] = 0
chunk_idx, file_idx = update_chunk_file_indices(chunk_idx, file_idx, chunk_size)
dst_path = dst_meta.root / DEFAULT_AUDIO_PATH.format(
audio_key=key,
chunk_index=chunk_idx,
file_index=file_idx,
)
dst_path.parent.mkdir(parents=True, exist_ok=True)
shutil.copy(str(src_path), str(dst_path))
# Reset offset for next file
current_offset = src_duration
else:
# Append to existing video file - use current accumulated offset
audios_idx[key]["src_to_offset"][(src_chunk_idx, src_file_idx)] = current_offset
concatenate_media_files(
[dst_path, src_path],
dst_path,
)
current_offset += src_duration
audios_idx[key]["episode_duration"] += src_duration
audios_idx[key]["chunk"] = chunk_idx
audios_idx[key]["file"] = file_idx
return audios_idx
def aggregate_data(src_meta, dst_meta, data_idx, data_files_size_in_mb, chunk_size):
"""Aggregates data chunks from a source dataset into the destination dataset.
@@ -499,7 +639,7 @@ def aggregate_data(src_meta, dst_meta, data_idx, data_files_size_in_mb, chunk_si
return data_idx
def aggregate_metadata(src_meta, dst_meta, meta_idx, data_idx, videos_idx):
def aggregate_metadata(src_meta, dst_meta, meta_idx, data_idx, videos_idx, audios_idx):
"""Aggregates metadata from a source dataset into the destination dataset.
Reads source metadata files, updates all indices and timestamps,
@@ -511,6 +651,7 @@ def aggregate_metadata(src_meta, dst_meta, meta_idx, data_idx, videos_idx):
meta_idx: Dictionary tracking metadata chunk and file indices.
data_idx: Dictionary tracking data chunk and file indices.
videos_idx: Dictionary tracking video indices and timestamps.
audios_idx: Dictionary tracking audio indices and timestamps.
Returns:
dict: Updated meta_idx with current chunk and file indices.
@@ -534,6 +675,7 @@ def aggregate_metadata(src_meta, dst_meta, meta_idx, data_idx, videos_idx):
meta_idx,
data_idx,
videos_idx,
audios_idx,
)
meta_idx, _ = append_or_create_parquet_file(
@@ -550,7 +692,8 @@ def aggregate_metadata(src_meta, dst_meta, meta_idx, data_idx, videos_idx):
# Increment latest_duration by the total duration added from this source dataset
for k in videos_idx:
videos_idx[k]["latest_duration"] += videos_idx[k]["episode_duration"]
for k in audios_idx:
audios_idx[k]["latest_duration"] += audios_idx[k]["episode_duration"]
return meta_idx

View File

@@ -0,0 +1,275 @@
#!/usr/bin/env python
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
from pathlib import Path
import av
import torch
import torchaudio
import torchcodec
from numpy import ceil
CHANNELS_LAYOUTS_MAPPING = {
1: "mono",
2: "stereo",
3: "2.1",
4: "3.1",
5: "4.1",
6: "5.1",
7: "6.1",
8: "7.1",
16: "hexadecagonal",
24: "22.2",
}
def decode_audio(
audio_path: Path | str,
timestamps: list[float],
duration: float,
start_time_s: float | None = 0.0,
backend: str | None = "torchcodec",
) -> torch.Tensor:
"""
Decodes audio using the specified backend.
Args:
audio_path (Path): Path to the audio file.
timestamps (list[float]): List of (starting) timestamps to extract audio chunks.
duration (float): Duration of the audio chunks in seconds.
backend (str, optional): Backend to use for decoding. Defaults to "torchcodec".
Returns:
torch.Tensor: Decoded audio chunks.
Currently supports torchaudio.
"""
if backend == "torchcodec":
return decode_audio_torchcodec(audio_path, timestamps, duration, start_time_s)
elif backend == "torchaudio":
return decode_audio_torchaudio(audio_path, timestamps, duration, start_time_s)
else:
raise ValueError(f"Unsupported video backend: {backend}")
def decode_audio_torchcodec(
audio_path: Path | str,
timestamps: list[float],
duration: float,
start_time_s: float | None = 0.0,
log_loaded_timestamps: bool = False,
) -> torch.Tensor:
# TODO(CarolinePascal) : add channels selection
audio_decoder = torchcodec.decoders.AudioDecoder(audio_path)
audio_sample_rate = audio_decoder.metadata.sample_rate
audio_channels = audio_decoder.metadata.num_channels
# TODO(CarolinePascal) : assert ts < total record duration
audio_chunks = []
timestamps = [
timestamp + start_time_s for timestamp in timestamps
] # Add an offset of start_time_s to each timestamp
for ts in timestamps:
current_audio_chunk = audio_decoder.get_samples_played_in_range(
start_seconds=max(0.0, ts - duration), stop_seconds=ts
)
current_audio_chunk_data = current_audio_chunk.data
# Case where the requested audio chunk starts before the beginning of the audio stream
if ts - duration < 0:
# No useful audio sample has been recorded
if ts < 1 / audio_sample_rate:
# TODO(CarolinePascal) : add low level white noise instead of zeros ?
current_audio_chunk_data = torch.zeros(
(audio_channels, int(ceil(duration * audio_sample_rate)))
)
# At least one useful audio sample has been recorded
else:
# Pad the beginning of the audio chunk with zeros
# TODO(CarolinePascal) : add low level white noise instead of zeros ?
current_audio_chunk_data = torch.nn.functional.pad(
current_audio_chunk_data,
(int(ceil((duration - ts) * audio_sample_rate)), 0, 0, 0), # left, right, top, bottom
)
if log_loaded_timestamps:
logging.info(
f"audio chunk loaded at timestamp={current_audio_chunk.pts_seconds:.4f} with duration={current_audio_chunk.duration_seconds:.4f}"
)
audio_chunks.append(current_audio_chunk_data)
audio_chunks = torch.stack(audio_chunks)
assert len(timestamps) == len(audio_chunks)
return audio_chunks
def decode_audio_torchaudio(
audio_path: Path | str,
timestamps: list[float],
duration: float,
start_time_s: float | None = 0.0,
log_loaded_timestamps: bool = False,
) -> torch.Tensor:
# TODO(CarolinePascal) : add channels selection
audio_path = str(audio_path)
reader = torchaudio.io.StreamReader(src=audio_path)
audio_sample_rate = reader.get_src_stream_info(reader.default_audio_stream).sample_rate
audio_channels = reader.get_src_stream_info(reader.default_audio_stream).num_channels
# TODO(CarolinePascal) : assert ts < total record duration
# TODO(CarolinePascal) : sort timestamps ?
reader.add_basic_audio_stream(
frames_per_chunk=int(ceil(duration * audio_sample_rate)), # Too much is better than not enough
buffer_chunk_size=-1, # No dropping frames
format="fltp", # Format as float32
)
audio_chunks = []
timestamps = [
timestamp + start_time_s for timestamp in timestamps
] # Add an offset of start_time_s to each timestamp
for ts in timestamps:
reader.seek(max(0.0, ts - duration)) # Default to closest audio sample. Needs to be non-negative !
status = reader.fill_buffer()
if status != 0:
# Should not happen, but just in case
logging.warning("Audio stream reached end of recording before decoding desired timestamps.")
current_audio_chunk = reader.pop_chunks()[0]
current_audio_chunk_data = current_audio_chunk.t() # Channel first format
# Case where the requested audio chunk starts before the beginning of the audio stream
if ts - duration < 0:
# No useful audio sample has been recorded
if ts < 1 / audio_sample_rate:
current_audio_chunk_data = torch.zeros(
(audio_channels, int(ceil(duration * audio_sample_rate)))
)
# At least one useful audio sample has been recorded
else:
# Remove the superfluous last samples of the audio chunk
current_audio_chunk_data = current_audio_chunk_data[:, : int(ceil(ts * audio_sample_rate))]
# Pad the beginning of the audio chunk with zeros
# TODO(CarolinePascal) : add low level white noise instead of zeros ?
current_audio_chunk_data = torch.nn.functional.pad(
current_audio_chunk_data,
(int(ceil((duration - ts) * audio_sample_rate)), 0, 0, 0), # left, right, top, bottom
)
if log_loaded_timestamps:
logging.info(
f"audio chunk loaded at starting timestamp={current_audio_chunk['pts']:.4f} with duration={len(current_audio_chunk) / audio_sample_rate:.4f}"
)
audio_chunks.append(current_audio_chunk_data)
audio_chunks = torch.stack(audio_chunks)
assert len(timestamps) == len(audio_chunks)
return audio_chunks
def encode_audio(
input_path: Path | str,
output_path: Path | str,
codec: str = "aac", # TODO(CarolinePascal) : investigate Fraunhofer FDK AAC (libfdk_aac) codec and and constant (file size control) /variable (quality control) bitrate options
bit_rate: int | None = None,
sample_rate: int | None = None,
log_level: int | None = av.logging.ERROR,
overwrite: bool = False,
) -> None:
"""Encodes an audio file using ffmpeg."""
output_path = Path(output_path)
output_path.parent.mkdir(parents=True, exist_ok=overwrite)
# Set logging level
if log_level is not None:
# "While less efficient, it is generally preferable to modify logging with Pythons logging"
logging.getLogger("libav").setLevel(log_level)
# Open input file
with av.open(str(input_path), "r") as input:
input_stream = input.streams.audio[0] # Assuming the first stream is the audio stream to be encoded
# Define sub-sampling options
if sample_rate is None:
sample_rate = input_stream.rate
# Create and open output file (overwrite by default)
with av.open(str(output_path), "w") as output:
output_stream = output.add_stream(
codec, rate=sample_rate, layout=CHANNELS_LAYOUTS_MAPPING[input_stream.channels]
)
if bit_rate is not None:
output_stream.bit_rate = bit_rate
# Loop through input WAV packets and encode them
for input_frame in input.decode(
input_stream
): # This step handles both demuxing and decoding under the hood
packet = output_stream.encode(input_frame)
if packet:
output.mux(packet)
# Flush the encoder
packet = output_stream.encode()
if packet:
output.mux(packet)
# Reset logging level
if log_level is not None:
av.logging.restore_default_callback()
if not output_path.exists():
raise OSError(f"Audio encoding did not work. File not found: {output_path}.")
def get_audio_info(video_path: Path | str) -> dict:
# Set logging level
logging.getLogger("libav").setLevel(av.logging.ERROR)
# Getting audio stream information
audio_info = {}
with av.open(str(video_path), "r") as audio_file:
try:
audio_stream = audio_file.streams.audio[0]
except IndexError:
# Reset logging level
av.logging.restore_default_callback()
return {"has_audio": False}
audio_info["audio.channels"] = audio_stream.channels
audio_info["audio.codec"] = audio_stream.codec.canonical_name
# In an ideal loseless case : bit depth x sample rate x channels = bit rate.
# In an actual compressed case, the bit rate is set according to the compression level : the lower the bit rate, the more compression is applied.
audio_info["audio.bit_rate"] = audio_stream.bit_rate
audio_info["audio.sample_rate"] = audio_stream.sample_rate # Number of samples per second
# In an ideal loseless case : fixed number of bits per sample.
# In an actual compressed case : variable number of bits per sample (often reduced to match a given depth rate).
audio_info["audio.bit_depth"] = audio_stream.format.bits
audio_info["audio.channel_layout"] = audio_stream.layout.name
audio_info["has_audio"] = True
# Reset logging level
av.logging.restore_default_callback()
return audio_info

View File

@@ -1,56 +0,0 @@
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import packaging.version
V30_MESSAGE = """
The dataset you requested ({repo_id}) is in {version} format.
We introduced a new format since v3.0 which is not backward compatible with v2.1.
Please, update your dataset to the new format using this command:
```
python -m lerobot.scripts.convert_dataset_v21_to_v30 --repo-id={repo_id}
```
If you already have a converted version uploaded to the hub, then this error might be because of
an older version in your local cache. Consider deleting the cached version and retrying.
If you encounter a problem, contact LeRobot maintainers on [Discord](https://discord.com/invite/s3KuuzsPFb)
or open an [issue on GitHub](https://github.com/huggingface/lerobot/issues/new/choose).
"""
FUTURE_MESSAGE = """
The dataset you requested ({repo_id}) is only available in {version} format.
As we cannot ensure forward compatibility with it, please update your current version of lerobot.
"""
class CompatibilityError(Exception): ...
class BackwardCompatibilityError(CompatibilityError):
def __init__(self, repo_id: str, version: packaging.version.Version):
if version.major == 2 and version.minor == 1:
message = V30_MESSAGE.format(repo_id=repo_id, version=version)
else:
raise NotImplementedError(
"Contact the maintainer on [Discord](https://discord.com/invite/s3KuuzsPFb)."
)
super().__init__(message)
class ForwardCompatibilityError(CompatibilityError):
def __init__(self, repo_id: str, version: packaging.version.Version):
message = FUTURE_MESSAGE.format(repo_id=repo_id, version=version)
super().__init__(message)

View File

@@ -13,9 +13,13 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import annotations
import logging
import numpy as np
from lerobot.datasets.utils import load_image_as_numpy
from lerobot.datasets.io_utils import load_audio_from_path, load_image_as_numpy
DEFAULT_QUANTILES = [0.01, 0.10, 0.50, 0.90, 0.99]
@@ -245,6 +249,20 @@ def sample_images(image_paths: list[str]) -> np.ndarray:
return images
def sample_audio_from_path(audio_path: str) -> np.ndarray:
"""Samples audio data from an audio recording stored in a WAV file."""
data = load_audio_from_path(audio_path)
sampled_indices = sample_indices(len(data))
return data[sampled_indices]
def sample_audio_from_data(data: np.ndarray) -> np.ndarray:
"""Samples audio data from an audio recording stored in a numpy array."""
sampled_indices = sample_indices(len(data))
return data[sampled_indices]
def _reshape_stats_by_axis(
stats: dict[str, np.ndarray],
axis: int | tuple[int, ...] | None,
@@ -512,6 +530,13 @@ def compute_episode_stats(
ep_ft_array = sample_images(data)
axes_to_reduce = (0, 2, 3)
keepdims = True
elif features[key]["dtype"] == "audio":
try:
ep_ft_array = sample_audio_from_path(data[0])
except TypeError: # Should only be triggered for LeKiwi robot, for which audio is stored chunk by chunk in a visual frame-like manner
ep_ft_array = sample_audio_from_data(data)
axes_to_reduce = 0
keepdims = True
else:
ep_ft_array = data
axes_to_reduce = 0
@@ -624,3 +649,141 @@ def aggregate_stats(stats_list: list[dict[str, dict]]) -> dict[str, dict[str, np
aggregated_stats[key] = aggregate_feature_stats(stats_with_key)
return aggregated_stats
def _get_valid_chunk_starts(episode_indices: np.ndarray, chunk_size: int) -> np.ndarray:
"""Return all start indices where a chunk of ``chunk_size`` stays within one episode."""
total = len(episode_indices)
if total < chunk_size:
return np.array([], dtype=np.int64)
max_start = total - chunk_size
starts = np.arange(max_start + 1)
valid = episode_indices[starts] == episode_indices[starts + chunk_size - 1]
return starts[valid]
def _compute_relative_chunk_batch(
start_indices: np.ndarray,
all_actions: np.ndarray,
all_states: np.ndarray,
chunk_size: int,
relative_mask: np.ndarray,
) -> np.ndarray:
"""Vectorised relative-action computation for a batch of start indices.
Returns an ``(N * chunk_size, action_dim)`` float32 array.
"""
if len(start_indices) == 0:
return np.empty((0, all_actions.shape[1]), dtype=np.float32)
offsets = np.arange(chunk_size)
frame_idx = start_indices[:, None] + offsets[None, :]
chunks = all_actions[frame_idx].copy()
states = all_states[start_indices]
mask_dim = len(relative_mask)
chunks[:, :, :mask_dim] -= states[:, None, :mask_dim] * relative_mask[None, None, :]
return chunks.reshape(-1, all_actions.shape[1])
def compute_relative_action_stats(
hf_dataset,
features: dict,
chunk_size: int,
exclude_joints: list[str] | None = None,
num_workers: int = 0,
) -> dict[str, np.ndarray]:
"""Compute normalization statistics for relative actions over the full dataset.
Iterates *all* valid action chunks (within single episodes), converts them to
relative actions (action current_state), and computes per-dimension
statistics suitable for normalization.
Args:
hf_dataset: The underlying HuggingFace dataset with "action",
"observation.state", and "episode_index" columns.
features: Dataset feature metadata (must contain "action" with "shape"
and optionally "names").
chunk_size: Number of consecutive frames per action chunk.
exclude_joints: Joint names whose dimensions should remain absolute
(not converted to relative actions).
num_workers: Number of parallel threads for computation. Values ≤1
mean single-threaded. Numpy releases the GIL so threads give
real parallelism here.
Returns:
Statistics dict with keys "mean", "std", "min", "max", "q01", …, "q99".
Raises:
ValueError: If the dataset has fewer frames than ``chunk_size``.
RuntimeError: If no valid (single-episode) chunks are found.
"""
from lerobot.processor.relative_action_processor import RelativeActionsProcessorStep
if exclude_joints is None:
exclude_joints = []
action_dim = features[ACTION]["shape"][0]
action_names = features.get(ACTION, {}).get("names")
mask_step = RelativeActionsProcessorStep(
enabled=True,
exclude_joints=exclude_joints,
action_names=action_names,
)
relative_mask = np.array(mask_step._build_mask(action_dim), dtype=np.float32)
logging.info("Loading action/state data for relative action stats...")
all_actions = np.array(hf_dataset[ACTION], dtype=np.float32)
all_states = np.array(hf_dataset[OBS_STATE], dtype=np.float32)
episode_indices = np.array(hf_dataset["episode_index"])
valid_starts = _get_valid_chunk_starts(episode_indices, chunk_size)
if len(valid_starts) == 0:
raise RuntimeError(
f"No valid chunks found (total_frames={len(episode_indices)}, chunk_size={chunk_size})"
)
effective_workers = max(num_workers, 1)
logging.info(
f"Computing relative action stats from {len(valid_starts)} chunks "
f"(chunk_size={chunk_size}, workers={effective_workers})"
)
batch_size = 50_000
batches = [valid_starts[i : i + batch_size] for i in range(0, len(valid_starts), batch_size)]
running_stats = RunningQuantileStats()
if num_workers > 1:
from concurrent.futures import ThreadPoolExecutor, as_completed
with ThreadPoolExecutor(max_workers=num_workers) as pool:
futures = [
pool.submit(
_compute_relative_chunk_batch,
batch,
all_actions,
all_states,
chunk_size,
relative_mask,
)
for batch in batches
]
for future in as_completed(futures):
running_stats.update(future.result())
else:
for batch in batches:
running_stats.update(
_compute_relative_chunk_batch(batch, all_actions, all_states, chunk_size, relative_mask)
)
stats = running_stats.get_statistics()
excluded_dims = int(len(relative_mask) - relative_mask.sum())
total_frames = len(valid_starts) * chunk_size
logging.info(
f"Relative action stats ({len(valid_starts)} chunks, {total_frames} frames): "
f"relative_dims={int(relative_mask.sum())}/{len(relative_mask)} (excluded={excluded_dims}), "
f"mean={np.abs(stats['mean']).mean():.4f}, std={stats['std'].mean():.4f}, "
f"q01={stats['q01'].mean():.4f}, q99={stats['q99'].mean():.4f}"
)
return stats

View File

@@ -0,0 +1,720 @@
#!/usr/bin/env python
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import contextlib
from pathlib import Path
import numpy as np
import packaging.version
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from huggingface_hub import snapshot_download
from lerobot.datasets.audio_utils import get_audio_info
from lerobot.datasets.compute_stats import aggregate_stats
from lerobot.datasets.feature_utils import _validate_feature_names, create_empty_dataset_info
from lerobot.datasets.io_utils import (
get_file_size_in_mb,
load_episodes,
load_info,
load_stats,
load_subtasks,
load_tasks,
write_info,
write_json,
write_stats,
write_tasks,
)
from lerobot.datasets.utils import (
DEFAULT_EPISODES_PATH,
DEFAULT_FEATURES,
DEFAULT_INITIAL_AUDIO_BUFFER_DURATION,
INFO_PATH,
check_version_compatibility,
flatten_dict,
get_safe_version,
has_legacy_hub_download_metadata,
is_valid_version,
update_chunk_file_indices,
)
from lerobot.datasets.video_utils import get_video_info
from lerobot.utils.constants import HF_LEROBOT_HOME, HF_LEROBOT_HUB_CACHE
CODEBASE_VERSION = "v3.0"
class LeRobotDatasetMetadata:
"""Metadata container for a LeRobot dataset.
Manages the ``info.json``, ``stats.json``, ``tasks.parquet``, and
``episodes/`` parquet files that describe a dataset's structure, content,
and statistics.
"""
def __init__(
self,
repo_id: str,
root: str | Path | None = None,
revision: str | None = None,
force_cache_sync: bool = False,
metadata_buffer_size: int = 10,
):
"""Load or download metadata for an existing LeRobot dataset.
Attempts to load metadata from local disk. If files are missing or
``force_cache_sync`` is ``True``, downloads the ``meta/`` directory from
the Hub.
Args:
repo_id: Repository identifier (e.g. ``'lerobot/aloha_sim'``).
root: Local directory for the dataset. When provided, Hub downloads
are materialized directly into this directory. When omitted,
existing local datasets are still looked up under
``$HF_LEROBOT_HOME/{repo_id}``, but Hub downloads use a
revision-safe snapshot cache under
``$HF_LEROBOT_HOME/hub``.
revision: Git revision (branch, tag, or commit hash). Defaults to
the current codebase version.
force_cache_sync: If ``True``, re-download metadata from the Hub
even when local files exist.
metadata_buffer_size: Number of episode metadata records to buffer
in memory before flushing to parquet.
"""
self.repo_id = repo_id
self.revision = revision if revision else CODEBASE_VERSION
self._requested_root = Path(root) if root is not None else None
self.root = self._requested_root if self._requested_root is not None else HF_LEROBOT_HOME / repo_id
self._pq_writer = None
self.latest_episode = None
self._metadata_buffer: list[dict] = []
self._metadata_buffer_size = metadata_buffer_size
self._finalized = False
try:
if force_cache_sync or (
self._requested_root is None and has_legacy_hub_download_metadata(self.root)
):
raise FileNotFoundError
self._load_metadata()
except (FileNotFoundError, NotADirectoryError):
if is_valid_version(self.revision):
self.revision = get_safe_version(self.repo_id, self.revision)
self._pull_from_repo(allow_patterns="meta/")
self._load_metadata()
def _flush_metadata_buffer(self) -> None:
"""Write all buffered episode metadata to parquet file."""
if not hasattr(self, "_metadata_buffer") or len(self._metadata_buffer) == 0:
return
combined_dict = {}
for episode_dict in self._metadata_buffer:
for key, value in episode_dict.items():
if key not in combined_dict:
combined_dict[key] = []
# Extract value and serialize numpy arrays
# because PyArrow's from_pydict function doesn't support numpy arrays
val = value[0] if isinstance(value, list) else value
combined_dict[key].append(val.tolist() if isinstance(val, np.ndarray) else val)
first_ep = self._metadata_buffer[0]
chunk_idx = first_ep["meta/episodes/chunk_index"][0]
file_idx = first_ep["meta/episodes/file_index"][0]
table = pa.Table.from_pydict(combined_dict)
if not self._pq_writer:
path = Path(self.root / DEFAULT_EPISODES_PATH.format(chunk_index=chunk_idx, file_index=file_idx))
path.parent.mkdir(parents=True, exist_ok=True)
self._pq_writer = pq.ParquetWriter(
path, schema=table.schema, compression="snappy", use_dictionary=True
)
self._pq_writer.write_table(table)
self.latest_episode = self._metadata_buffer[-1]
self._metadata_buffer.clear()
def _close_writer(self) -> None:
"""Close and cleanup the parquet writer if it exists."""
self._flush_metadata_buffer()
writer = getattr(self, "_pq_writer", None)
if writer is not None:
writer.close()
self._pq_writer = None
def finalize(self) -> None:
"""Flush metadata buffer and close the parquet writer.
Idempotent — safe to call multiple times.
"""
if getattr(self, "_finalized", False):
return
self._close_writer()
self._finalized = True
def __del__(self):
"""Safety net: flush and close parquet writer on garbage collection."""
# During interpreter shutdown, referenced objects may already be collected.
with contextlib.suppress(Exception):
self.finalize()
def _load_metadata(self):
self.info = load_info(self.root)
check_version_compatibility(self.repo_id, self._version, CODEBASE_VERSION)
self.tasks = load_tasks(self.root)
self.subtasks = load_subtasks(self.root)
self.episodes = load_episodes(self.root)
self.stats = load_stats(self.root)
def _pull_from_repo(
self,
allow_patterns: list[str] | str | None = None,
ignore_patterns: list[str] | str | None = None,
) -> None:
if self._requested_root is None:
self.root = Path(
snapshot_download(
self.repo_id,
repo_type="dataset",
revision=self.revision,
cache_dir=HF_LEROBOT_HUB_CACHE,
allow_patterns=allow_patterns,
ignore_patterns=ignore_patterns,
)
)
return
self._requested_root.mkdir(exist_ok=True, parents=True)
snapshot_download(
self.repo_id,
repo_type="dataset",
revision=self.revision,
local_dir=self._requested_root,
allow_patterns=allow_patterns,
ignore_patterns=ignore_patterns,
)
self.root = self._requested_root
@property
def url_root(self) -> str:
"""Hugging Face Hub URL root for this dataset."""
return f"hf://datasets/{self.repo_id}"
@property
def _version(self) -> packaging.version.Version:
"""Codebase version used to create this dataset."""
return packaging.version.parse(self.info["codebase_version"])
def get_data_file_path(self, ep_index: int) -> Path:
"""Return the relative parquet file path for the given episode index.
Args:
ep_index: Zero-based episode index.
Returns:
Path to the parquet file containing this episode's data.
Raises:
IndexError: If ``ep_index`` is out of range.
"""
if self.episodes is None:
self.episodes = load_episodes(self.root)
if ep_index >= len(self.episodes):
raise IndexError(
f"Episode index {ep_index} out of range. Episodes: {len(self.episodes) if self.episodes else 0}"
)
ep = self.episodes[ep_index]
chunk_idx = ep["data/chunk_index"]
file_idx = ep["data/file_index"]
fpath = self.data_path.format(chunk_index=chunk_idx, file_index=file_idx)
return Path(fpath)
def get_video_file_path(self, ep_index: int, vid_key: str) -> Path:
"""Return the relative video file path for the given episode and video key.
Args:
ep_index: Zero-based episode index.
vid_key: Feature key identifying the video stream
(e.g. ``'observation.images.laptop'``).
Returns:
Path to the video file containing this episode's frames.
Raises:
IndexError: If ``ep_index`` is out of range.
"""
if self.episodes is None:
self.episodes = load_episodes(self.root)
if ep_index >= len(self.episodes):
raise IndexError(
f"Episode index {ep_index} out of range. Episodes: {len(self.episodes) if self.episodes else 0}"
)
ep = self.episodes[ep_index]
chunk_idx = ep[f"videos/{vid_key}/chunk_index"]
file_idx = ep[f"videos/{vid_key}/file_index"]
fpath = self.video_path.format(video_key=vid_key, chunk_index=chunk_idx, file_index=file_idx)
return Path(fpath)
def get_audio_file_path(self, ep_index: int, audio_key: str) -> Path:
"""Return the relative audio file path for the given episode and audio key.
Args:
ep_index: Zero-based episode index.
audio_key: Feature key identifying the audio stream
(e.g. ``'observation.audio.microphone'``).
Returns:
Path to the audio file containing this episode's audio.
Raises:
IndexError: If ``ep_index`` is out of range.
"""
if self.episodes is None:
self.episodes = load_episodes(self.root)
if ep_index >= len(self.episodes):
raise IndexError(
f"Episode index {ep_index} out of range. Episodes: {len(self.episodes) if self.episodes else 0}"
)
ep = self.episodes[ep_index]
chunk_idx = ep[f"audio/{audio_key}/chunk_index"]
file_idx = ep[f"audio/{audio_key}/file_index"]
fpath = self.audio_path.format(audio_key=audio_key, chunk_index=chunk_idx, file_index=file_idx)
return Path(fpath)
@property
def data_path(self) -> str:
"""Formattable string for the parquet files."""
return self.info["data_path"]
@property
def video_path(self) -> str | None:
"""Formattable string for the video files."""
return self.info["video_path"]
@property
def audio_path(self) -> str | None:
"""Formattable string for the audio files."""
return self.info["audio_path"]
@property
def robot_type(self) -> str | None:
"""Robot type used in recording this dataset."""
return self.info["robot_type"]
@property
def fps(self) -> int:
"""Frames per second used during data collection."""
return self.info["fps"]
@property
def features(self) -> dict[str, dict]:
"""All features contained in the dataset."""
return self.info["features"]
@property
def image_keys(self) -> list[str]:
"""Keys to access visual modalities stored as images."""
return [key for key, ft in self.features.items() if ft["dtype"] == "image"]
@property
def video_keys(self) -> list[str]:
"""Keys to access visual modalities stored as videos."""
return [key for key, ft in self.features.items() if ft["dtype"] == "video"]
@property
def camera_keys(self) -> list[str]:
"""Keys to access visual modalities (regardless of their storage method)."""
return [key for key, ft in self.features.items() if ft["dtype"] in ["video", "image"]]
@property
def audio_keys(self) -> list[str]:
"""Keys to access audio modalities."""
return [key for key, ft in self.features.items() if ft["dtype"] == "audio"]
@property
def names(self) -> dict[str, list | dict]:
"""Names of the various dimensions of vector modalities."""
return {key: ft["names"] for key, ft in self.features.items()}
@property
def shapes(self) -> dict:
"""Shapes for the different features."""
return {key: tuple(ft["shape"]) for key, ft in self.features.items()}
@property
def total_episodes(self) -> int:
"""Total number of episodes available."""
return self.info["total_episodes"]
@property
def total_frames(self) -> int:
"""Total number of frames saved in this dataset."""
return self.info["total_frames"]
@property
def total_tasks(self) -> int:
"""Total number of different tasks performed in this dataset."""
return self.info["total_tasks"]
@property
def chunks_size(self) -> int:
"""Max number of files per chunk."""
return self.info["chunks_size"]
@property
def data_files_size_in_mb(self) -> int:
"""Max size of data file in mega bytes."""
return self.info["data_files_size_in_mb"]
@property
def video_files_size_in_mb(self) -> int:
"""Max size of video file in mega bytes."""
return self.info["video_files_size_in_mb"]
@property
def audio_files_size_in_mb(self) -> int:
"""Max size of audio file in mega bytes."""
return self.info["audio_files_size_in_mb"]
def get_task_index(self, task: str) -> int | None:
"""
Given a task in natural language, returns its task_index if the task already exists in the dataset,
otherwise return None.
"""
if task in self.tasks.index:
return int(self.tasks.loc[task].task_index)
else:
return None
def save_episode_tasks(self, tasks: list[str]):
"""Register tasks for the current episode and persist to disk.
New tasks that do not already exist in the dataset are assigned
sequential task indices and appended to the tasks parquet file.
Args:
tasks: List of unique task descriptions in natural language.
Raises:
ValueError: If ``tasks`` contains duplicates.
"""
if len(set(tasks)) != len(tasks):
raise ValueError(f"Tasks are not unique: {tasks}")
if self.tasks is None:
new_tasks = tasks
task_indices = range(len(tasks))
self.tasks = pd.DataFrame({"task_index": task_indices}, index=pd.Index(tasks, name="task"))
else:
new_tasks = [task for task in tasks if task not in self.tasks.index]
new_task_indices = range(len(self.tasks), len(self.tasks) + len(new_tasks))
for task_idx, task in zip(new_task_indices, new_tasks, strict=False):
self.tasks.loc[task] = task_idx
if len(new_tasks) > 0:
# Update on disk
write_tasks(self.tasks, self.root)
def _save_episode_metadata(self, episode_dict: dict) -> None:
"""Buffer episode metadata and write to parquet in batches for efficiency.
This function accumulates episode metadata in a buffer and flushes it when the buffer
reaches the configured size. This reduces I/O overhead by writing multiple episodes
at once instead of one row at a time.
Notes: We both need to update parquet files and HF dataset:
- `pandas` loads parquet file in RAM
- `datasets` relies on a memory mapping from pyarrow (no RAM). It either converts parquet files to a pyarrow cache on disk,
or loads directly from pyarrow cache.
"""
# Convert to list format for each value
episode_dict = {key: [value] for key, value in episode_dict.items()}
num_frames = episode_dict["length"][0]
if self.latest_episode is None:
# Initialize indices and frame count for a new dataset made of the first episode data
chunk_idx, file_idx = 0, 0
if self.episodes is not None and len(self.episodes) > 0:
# It means we are resuming recording, so we need to load the latest episode
# Update the indices to avoid overwriting the latest episode
chunk_idx = self.episodes[-1]["meta/episodes/chunk_index"]
file_idx = self.episodes[-1]["meta/episodes/file_index"]
latest_num_frames = self.episodes[-1]["dataset_to_index"]
episode_dict["dataset_from_index"] = [latest_num_frames]
episode_dict["dataset_to_index"] = [latest_num_frames + num_frames]
# When resuming, move to the next file
chunk_idx, file_idx = update_chunk_file_indices(chunk_idx, file_idx, self.chunks_size)
else:
episode_dict["dataset_from_index"] = [0]
episode_dict["dataset_to_index"] = [num_frames]
episode_dict["meta/episodes/chunk_index"] = [chunk_idx]
episode_dict["meta/episodes/file_index"] = [file_idx]
else:
chunk_idx = self.latest_episode["meta/episodes/chunk_index"][0]
file_idx = self.latest_episode["meta/episodes/file_index"][0]
latest_path = (
self.root / DEFAULT_EPISODES_PATH.format(chunk_index=chunk_idx, file_index=file_idx)
if self._pq_writer is None
else self._pq_writer.where
)
if Path(latest_path).exists():
latest_size_in_mb = get_file_size_in_mb(Path(latest_path))
latest_num_frames = self.latest_episode["episode_index"][0]
av_size_per_frame = latest_size_in_mb / latest_num_frames if latest_num_frames > 0 else 0.0
if latest_size_in_mb + av_size_per_frame * num_frames >= self.data_files_size_in_mb:
# Size limit is reached, flush buffer and prepare new parquet file
self._flush_metadata_buffer()
chunk_idx, file_idx = update_chunk_file_indices(chunk_idx, file_idx, self.chunks_size)
self._close_writer()
# Update the existing pandas dataframe with new row
episode_dict["meta/episodes/chunk_index"] = [chunk_idx]
episode_dict["meta/episodes/file_index"] = [file_idx]
episode_dict["dataset_from_index"] = [self.latest_episode["dataset_to_index"][0]]
episode_dict["dataset_to_index"] = [self.latest_episode["dataset_to_index"][0] + num_frames]
# Add to buffer
self._metadata_buffer.append(episode_dict)
self.latest_episode = episode_dict
if len(self._metadata_buffer) >= self._metadata_buffer_size:
self._flush_metadata_buffer()
def save_episode(
self,
episode_index: int,
episode_length: int,
episode_tasks: list[str],
episode_stats: dict[str, dict],
episode_metadata: dict,
) -> None:
"""Persist episode metadata, update dataset info, and aggregate stats.
Writes the episode's metadata to the buffered parquet writer, increments
the total episode/frame counters in ``info.json``, and merges the
episode's statistics into the running dataset statistics.
Args:
episode_index: Zero-based index of the episode being saved.
episode_length: Number of frames in this episode.
episode_tasks: List of task descriptions for this episode.
episode_stats: Per-feature statistics for this episode.
episode_metadata: Additional metadata (chunk/file indices, frame
ranges, video timestamps, etc.).
"""
episode_dict = {
"episode_index": episode_index,
"tasks": episode_tasks,
"length": episode_length,
}
episode_dict.update(episode_metadata)
episode_dict.update(flatten_dict({"stats": episode_stats}))
self._save_episode_metadata(episode_dict)
# Update info
self.info["total_episodes"] += 1
self.info["total_frames"] += episode_length
self.info["total_tasks"] = len(self.tasks)
self.info["splits"] = {"train": f"0:{self.info['total_episodes']}"}
write_info(self.info, self.root)
self.stats = aggregate_stats([self.stats, episode_stats]) if self.stats is not None else episode_stats
write_stats(self.stats, self.root)
def update_video_info(self, video_key: str | None = None) -> None:
"""
Warning: this function writes info from first episode videos, implicitly assuming that all videos have
been encoded the same way. Also, this means it assumes the first episode exists.
"""
if video_key is not None and video_key not in self.video_keys:
raise ValueError(f"Video key {video_key} not found in dataset")
video_keys = [video_key] if video_key is not None else self.video_keys
for key in video_keys:
if not self.features[key].get("info", None):
video_path = self.root / self.video_path.format(video_key=key, chunk_index=0, file_index=0)
self.info["features"][key]["info"] = get_video_info(video_path)
def update_audio_info(self, audio_key: str | None = None) -> None:
"""
Warning: this function writes info from first episode audio, implicitly assuming that all audio have
been encoded the same way. Also, this means it assumes the first episode exists.
"""
if audio_key is not None and audio_key not in self.audio_keys:
raise ValueError(f"Audio key {audio_key} not found in dataset")
audio_keys = [audio_key] if audio_key is not None else self.audio_keys
for key in audio_keys:
if not self.features[key].get("info", None):
audio_path = self.root / self.audio_path.format(audio_key=key, chunk_index=0, file_index=0)
self.info["features"][key]["info"] = get_audio_info(audio_path)
self.info["features"][key]["info"]["start_time_s"] = DEFAULT_INITIAL_AUDIO_BUFFER_DURATION
def update_chunk_settings(
self,
chunks_size: int | None = None,
data_files_size_in_mb: int | None = None,
video_files_size_in_mb: int | None = None,
audio_files_size_in_mb: int | None = None,
) -> None:
"""Update chunk and file size settings after dataset creation.
This allows users to customize storage organization without modifying the constructor.
These settings control how episodes are chunked and how large files can grow before
creating new ones.
Args:
chunks_size: Maximum number of files per chunk directory. If None, keeps current value.
data_files_size_in_mb: Maximum size for data parquet files in MB. If None, keeps current value.
video_files_size_in_mb: Maximum size for video files in MB. If None, keeps current value.
audio_files_size_in_mb: Maximum size for audio files in MB. If None, keeps current value.
"""
if chunks_size is not None:
if chunks_size <= 0:
raise ValueError(f"chunks_size must be positive, got {chunks_size}")
self.info["chunks_size"] = chunks_size
if data_files_size_in_mb is not None:
if data_files_size_in_mb <= 0:
raise ValueError(f"data_files_size_in_mb must be positive, got {data_files_size_in_mb}")
self.info["data_files_size_in_mb"] = data_files_size_in_mb
if video_files_size_in_mb is not None:
if video_files_size_in_mb <= 0:
raise ValueError(f"video_files_size_in_mb must be positive, got {video_files_size_in_mb}")
self.info["video_files_size_in_mb"] = video_files_size_in_mb
if audio_files_size_in_mb is not None:
if audio_files_size_in_mb <= 0:
raise ValueError(f"audio_files_size_in_mb must be positive, got {audio_files_size_in_mb}")
self.info["audio_files_size_in_mb"] = audio_files_size_in_mb
# Update the info file on disk
write_info(self.info, self.root)
def get_chunk_settings(self) -> dict[str, int]:
"""Get current chunk and file size settings.
Returns:
Dict containing chunks_size, data_files_size_in_mb, video_files_size_in_mb, and audio_files_size_in_mb.
"""
return {
"chunks_size": self.chunks_size,
"data_files_size_in_mb": self.data_files_size_in_mb,
"video_files_size_in_mb": self.video_files_size_in_mb,
"audio_files_size_in_mb": self.audio_files_size_in_mb,
}
def __repr__(self):
feature_keys = list(self.features)
return (
f"{self.__class__.__name__}({{\n"
f" Repository ID: '{self.repo_id}',\n"
f" Total episodes: '{self.total_episodes}',\n"
f" Total frames: '{self.total_frames}',\n"
f" Features: '{feature_keys}',\n"
"})',\n"
)
@classmethod
def create(
cls,
repo_id: str,
fps: int,
features: dict,
robot_type: str | None = None,
root: str | Path | None = None,
use_videos: bool = True,
metadata_buffer_size: int = 10,
chunks_size: int | None = None,
data_files_size_in_mb: int | None = None,
video_files_size_in_mb: int | None = None,
audio_files_size_in_mb: int | None = None,
) -> "LeRobotDatasetMetadata":
"""Create metadata for a new LeRobot dataset from scratch.
Initializes the ``info.json`` file on disk with the provided feature
schema and dataset settings. No episode data is written yet.
Args:
repo_id: Repository identifier (e.g. ``'user/my_dataset'``).
fps: Frames per second used during data collection.
features: Feature specification dict mapping feature names to their
type/shape metadata.
robot_type: Optional robot type string stored in metadata.
root: Local directory for the dataset. Defaults to
``$HF_LEROBOT_HOME/{repo_id}``. Must not already exist.
use_videos: If ``True``, visual modalities are encoded as MP4 videos.
metadata_buffer_size: Number of episode metadata records to buffer
before flushing to parquet.
chunks_size: Max number of files per chunk directory. ``None`` uses
the default.
data_files_size_in_mb: Max parquet file size in MB. ``None`` uses the
default.
video_files_size_in_mb: Max video file size in MB. ``None`` uses the
default.
Returns:
A new :class:`LeRobotDatasetMetadata` instance.
"""
obj = cls.__new__(cls)
obj.repo_id = repo_id
obj._requested_root = Path(root) if root is not None else None
obj.root = obj._requested_root if obj._requested_root is not None else HF_LEROBOT_HOME / repo_id
obj.root.mkdir(parents=True, exist_ok=False)
features = {**features, **DEFAULT_FEATURES}
_validate_feature_names(features)
obj.tasks = None
obj.subtasks = None
obj.episodes = None
obj.stats = None
obj.info = create_empty_dataset_info(
CODEBASE_VERSION,
fps,
features,
use_videos,
robot_type,
chunks_size,
data_files_size_in_mb,
video_files_size_in_mb,
audio_files_size_in_mb,
)
if len(obj.video_keys) > 0 and not use_videos:
raise ValueError(
f"Features contain video keys {obj.video_keys}, but 'use_videos' is set to False. "
"Either remove video features from the features dict, or set 'use_videos=True'."
)
write_json(obj.info, obj.root / INFO_PATH)
obj.revision = None
obj._pq_writer = None
obj.latest_episode = None
obj._metadata_buffer = []
obj._metadata_buffer_size = metadata_buffer_size
obj._finalized = False
return obj

View File

@@ -0,0 +1,329 @@
#!/usr/bin/env python
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Private reader component for LeRobotDataset. Handles random-access reading (HF dataset, delta indices, video decoding)."""
from collections.abc import Callable
from pathlib import Path
import datasets
import torch
from lerobot.datasets.audio_utils import decode_audio
from lerobot.datasets.dataset_metadata import LeRobotDatasetMetadata
from lerobot.datasets.feature_utils import (
check_delta_timestamps,
get_delta_indices,
get_hf_features_from_features,
)
from lerobot.datasets.utils import DEFAULT_AUDIO_CHUNK_DURATION
from lerobot.datasets.io_utils import (
hf_transform_to_torch,
load_nested_dataset,
)
from lerobot.datasets.video_utils import decode_video_frames
class DatasetReader:
"""Encapsulates read-side state and methods for LeRobotDataset.
Owns: hf_dataset, _absolute_to_relative_idx, delta_indices.
"""
def __init__(
self,
meta: LeRobotDatasetMetadata,
root: Path,
episodes: list[int] | None,
tolerance_s: float,
video_backend: str,
delta_timestamps: dict[str, list[float]] | None,
image_transforms: Callable | None,
):
"""Initialize the reader with metadata, filtering, and transform config.
The HF dataset is not loaded here — call :meth:`try_load` or
:meth:`load_and_activate` afterward.
Args:
meta: Dataset metadata instance.
root: Local dataset root directory.
episodes: Optional list of episode indices to select. ``None``
means all episodes.
tolerance_s: Timestamp synchronization tolerance in seconds.
video_backend: Video decoding backend identifier.
delta_timestamps: Optional dict mapping feature keys to lists of
relative timestamp offsets for temporal context windows.
image_transforms: Optional torchvision v2 transform applied to
visual features.
"""
self._meta = meta
self.root = root
self.episodes = episodes
self._tolerance_s = tolerance_s
self._video_backend = video_backend
self._image_transforms = image_transforms
self.hf_dataset: datasets.Dataset | None = None
self._absolute_to_relative_idx: dict[int, int] | None = None
# Setup delta_indices (doesn't depend on hf_dataset)
self.delta_indices = None
if delta_timestamps is not None:
check_delta_timestamps(delta_timestamps, meta.fps, tolerance_s)
self.delta_indices = get_delta_indices(delta_timestamps, meta.fps)
def try_load(self) -> bool:
"""Attempt to load from local cache. Returns True if data is sufficient."""
try:
self.hf_dataset = self._load_hf_dataset()
except (FileNotFoundError, NotADirectoryError):
self.hf_dataset = None
return False
if not self._check_cached_episodes_sufficient():
self.hf_dataset = None
return False
self._build_index_mapping()
return True
def load_and_activate(self) -> None:
"""Load HF dataset from disk and build index mapping. Call after data is on disk."""
self.hf_dataset = self._load_hf_dataset()
self._build_index_mapping()
def _build_index_mapping(self) -> None:
"""Build absolute-to-relative index mapping from loaded hf_dataset."""
self._absolute_to_relative_idx = None
if self.episodes is not None and self.hf_dataset is not None:
self._absolute_to_relative_idx = {
abs_idx.item() if isinstance(abs_idx, torch.Tensor) else abs_idx: rel_idx
for rel_idx, abs_idx in enumerate(self.hf_dataset["index"])
}
@property
def num_frames(self) -> int:
"""Number of frames in selected episodes."""
if self.episodes is not None and self.hf_dataset is not None:
return len(self.hf_dataset)
return self._meta.total_frames
@property
def num_episodes(self) -> int:
"""Number of episodes selected."""
return len(self.episodes) if self.episodes is not None else self._meta.total_episodes
def _load_hf_dataset(self) -> datasets.Dataset:
"""hf_dataset contains all the observations, states, actions, rewards, etc."""
features = get_hf_features_from_features(self._meta.features)
hf_dataset = load_nested_dataset(self.root / "data", features=features, episodes=self.episodes)
hf_dataset.set_transform(hf_transform_to_torch)
return hf_dataset
def _check_cached_episodes_sufficient(self) -> bool:
"""Check if the cached dataset contains all requested episodes and their video and audio files."""
if self.hf_dataset is None or len(self.hf_dataset) == 0:
return False
available_episodes = {
ep_idx.item() if isinstance(ep_idx, torch.Tensor) else ep_idx
for ep_idx in self.hf_dataset.unique("episode_index")
}
if self.episodes is None:
requested_episodes = set(range(self._meta.total_episodes))
else:
requested_episodes = set(self.episodes)
if not requested_episodes.issubset(available_episodes):
return False
if len(self._meta.video_keys) > 0:
for ep_idx in requested_episodes:
for vid_key in self._meta.video_keys:
video_path = self.root / self._meta.get_video_file_path(ep_idx, vid_key)
if not video_path.exists():
return False
if len(self._meta.audio_keys) > 0:
for ep_idx in requested_episodes:
for audio_key in self._meta.audio_keys:
audio_path = self.root / self._meta.get_compressed_audio_file_path(ep_idx, audio_key)
if not audio_path.exists():
return False
return True
def get_episodes_file_paths(self) -> list[Path]:
"""Return deduplicated file paths (data + video) for selected episodes.
Used to build the ``allow_patterns`` list for ``snapshot_download``.
"""
episodes = self.episodes if self.episodes is not None else list(range(self._meta.total_episodes))
fpaths = [str(self._meta.get_data_file_path(ep_idx)) for ep_idx in episodes]
if len(self._meta.video_keys) > 0:
video_files = [
str(self._meta.get_video_file_path(ep_idx, vid_key))
for vid_key in self._meta.video_keys
for ep_idx in episodes
]
fpaths += video_files
if len(self._meta.audio_keys) > 0:
audio_files = [
str(self._meta.get_compressed_audio_file_path(ep_idx, audio_key))
for audio_key in self._meta.audio_keys
for ep_idx in episodes
]
fpaths += audio_files
# episodes are stored in the same files, so we return unique paths only
fpaths = list(set(fpaths))
return fpaths
def _get_query_indices(
self, abs_idx: int, ep_idx: int
) -> tuple[dict[str, list[int]], dict[str, torch.Tensor]]:
"""Compute query indices for delta timestamps."""
ep = self._meta.episodes[ep_idx]
ep_start = ep["dataset_from_index"]
ep_end = ep["dataset_to_index"]
query_indices = {
key: [max(ep_start, min(ep_end - 1, abs_idx + delta)) for delta in delta_idx]
for key, delta_idx in self.delta_indices.items()
}
padding = {
f"{key}_is_pad": torch.BoolTensor(
[(abs_idx + delta < ep_start) | (abs_idx + delta >= ep_end) for delta in delta_idx]
)
for key, delta_idx in self.delta_indices.items()
}
return query_indices, padding
def _get_query_timestamps(
self,
current_ts: float,
query_indices: dict[str, list[int]] | None = None,
) -> dict[str, list[float]]:
query_timestamps = {}
for key in self._meta.video_keys + self._meta.audio_keys:
if query_indices is not None and key in query_indices:
if self._absolute_to_relative_idx is not None:
relative_indices = [self._absolute_to_relative_idx[idx] for idx in query_indices[key]]
timestamps = self.hf_dataset[relative_indices]["timestamp"]
else:
timestamps = self.hf_dataset[query_indices[key]]["timestamp"]
query_timestamps[key] = torch.stack(timestamps).tolist()
else:
query_timestamps[key] = [current_ts]
return query_timestamps
def _query_hf_dataset(self, query_indices: dict[str, list[int]]) -> dict:
"""Query dataset for indices across keys, skipping video and audio keys."""
result: dict = {}
for key, q_idx in query_indices.items():
if key in self._meta.video_keys or key in self._meta.audio_keys:
continue
relative_indices = (
q_idx
if self._absolute_to_relative_idx is None
else [self._absolute_to_relative_idx[idx] for idx in q_idx]
)
try:
result[key] = torch.stack(self.hf_dataset[key][relative_indices])
except (KeyError, TypeError, IndexError):
result[key] = torch.stack(self.hf_dataset[relative_indices][key])
return result
def _query_videos(self, query_timestamps: dict[str, list[float]], ep_idx: int) -> dict[str, torch.Tensor]:
"""Note: When using data workers (e.g. DataLoader with num_workers>0), do not call this function
in the main process (e.g. by using a second Dataloader with num_workers=0). It will result in a
Segmentation Fault.
"""
ep = self._meta.episodes[ep_idx]
item = {}
for vid_key, query_ts in query_timestamps.items():
from_timestamp = ep[f"videos/{vid_key}/from_timestamp"]
shifted_query_ts = [from_timestamp + ts for ts in query_ts]
video_path = self.root / self._meta.get_video_file_path(ep_idx, vid_key)
frames = decode_video_frames(video_path, shifted_query_ts, self._tolerance_s, self._video_backend)
item[vid_key] = frames.squeeze(0)
return item
# TODO(CarolinePascal): add variable query durations
def _query_audio(
self, query_timestamps: dict[str, list[float]], query_duration: float, ep_idx: int
) -> dict[str, torch.Tensor]:
ep = self.meta.episodes[ep_idx]
item = {}
for audio_key, query_ts in query_timestamps.items():
# Episodes are stored sequentially on a single mp4 to reduce the number of files.
# Thus we load the start timestamp of the episode on this mp4 and,
# shift the query timestamp accordingly.
from_timestamp = ep[f"audio/{audio_key}/from_timestamp"]
shifted_query_ts = [from_timestamp + ts for ts in query_ts]
audio_path = self.root / self.meta.get_audio_file_path(ep_idx, audio_key)
start_time_s = self.meta.features[audio_key]["info"].get("start_time_s", 0.0)
audio_chunk = decode_audio(
audio_path, shifted_query_ts, query_duration, start_time_s, self.audio_backend
)
item[audio_key] = audio_chunk.squeeze(0)
return item
def get_item(self, idx) -> dict:
"""Core __getitem__ logic. Assumes hf_dataset is loaded.
``idx`` is a *relative* index into the (possibly episode-filtered)
HF dataset, **not** the absolute frame index stored in the ``index``
column. The absolute index is retrieved from the row itself.
"""
item = self.hf_dataset[idx]
ep_idx = item["episode_index"].item()
abs_idx = item["index"].item()
query_indices = None
if self.delta_indices is not None:
query_indices, padding = self._get_query_indices(abs_idx, ep_idx)
query_result = self._query_hf_dataset(query_indices)
item = {**item, **padding}
for key, val in query_result.items():
item[key] = val
if len(self._meta.video_keys) > 0 or len(self._meta.audio_keys) > 0:
current_ts = item["timestamp"].item()
query_timestamps = self._get_query_timestamps(current_ts, query_indices)
video_frames = self._query_videos(query_timestamps, ep_idx)
audio_chunks = self._query_audio(query_timestamps, DEFAULT_AUDIO_CHUNK_DURATION, ep_idx)
item = {**video_frames, **audio_chunks, **item}
if self._image_transforms is not None:
image_keys = self._meta.camera_keys
for cam in image_keys:
item[cam] = self._image_transforms(item[cam])
# Add task as a string
task_idx = item["task_index"].item()
item["task"] = self._meta.tasks.iloc[task_idx].name
# add subtask information if available
if "subtask_index" in self._meta.features and self._meta.subtasks is not None:
subtask_idx = item["subtask_index"].item()
item["subtask"] = self._meta.subtasks.iloc[subtask_idx].name
return item

View File

@@ -37,23 +37,30 @@ import torch
from tqdm import tqdm
from lerobot.datasets.aggregate import aggregate_datasets
from lerobot.datasets.compute_stats import aggregate_stats
from lerobot.datasets.lerobot_dataset import LeRobotDataset, LeRobotDatasetMetadata
from lerobot.datasets.compute_stats import (
aggregate_stats,
compute_episode_stats,
compute_relative_action_stats,
)
from lerobot.datasets.dataset_metadata import LeRobotDatasetMetadata
from lerobot.datasets.io_utils import (
get_parquet_file_size_in_mb,
load_episodes,
write_info,
write_stats,
write_tasks,
)
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.datasets.utils import (
DATA_DIR,
DEFAULT_CHUNK_SIZE,
DEFAULT_DATA_FILE_SIZE_IN_MB,
DEFAULT_DATA_PATH,
DEFAULT_EPISODES_PATH,
get_parquet_file_size_in_mb,
load_episodes,
update_chunk_file_indices,
write_info,
write_stats,
write_tasks,
)
from lerobot.datasets.video_utils import encode_video_frames, get_video_info
from lerobot.utils.constants import HF_LEROBOT_HOME, OBS_IMAGE
from lerobot.utils.constants import ACTION, HF_LEROBOT_HOME, OBS_IMAGE, OBS_STATE
def _load_episode_with_stats(src_dataset: LeRobotDataset, episode_idx: int) -> dict:
@@ -888,7 +895,7 @@ def _copy_and_reindex_episodes_metadata(
total_frames += src_episode["length"]
dst_meta._close_writer()
dst_meta.finalize()
dst_meta.info.update(
{
@@ -915,7 +922,8 @@ def _write_parquet(df: pd.DataFrame, path: Path, meta: LeRobotDatasetMetadata) -
This ensures images are properly embedded and the file can be loaded correctly by HF datasets.
"""
from lerobot.datasets.utils import embed_images, get_hf_features_from_features
from lerobot.datasets.feature_utils import get_hf_features_from_features
from lerobot.datasets.io_utils import embed_images
hf_features = get_hf_features_from_features(meta.features)
ep_dataset = datasets.Dataset.from_dict(df.to_dict(orient="list"), features=hf_features, split="train")
@@ -1529,6 +1537,114 @@ def modify_tasks(
return dataset
def recompute_stats(
dataset: LeRobotDataset,
skip_image_video: bool = True,
relative_action: bool = False,
relative_exclude_joints: list[str] | None = None,
chunk_size: int = 50,
num_workers: int = 0,
) -> LeRobotDataset:
"""Recompute stats.json from scratch by iterating all episodes.
Args:
dataset: The LeRobotDataset to recompute stats for.
skip_image_video: If True (default), only recompute stats for numeric features
(action, state, etc.) and keep existing image/video stats unchanged.
relative_action: If True, compute action stats in relative space by
iterating all valid action chunks and subtracting the current state.
This matches the normalization distribution the model sees during
training with ``use_relative_actions=True``.
relative_exclude_joints: Joint names to exclude from relative conversion when
relative_action=True. These dims keep absolute stats.
chunk_size: Action chunk size used for relative stats computation. Should match
``policy.chunk_size``. Only used when ``relative_action=True``.
num_workers: Number of parallel threads for relative action stats computation.
Values ≤1 mean single-threaded. Only used when ``relative_action=True``.
Returns:
The same dataset with updated stats.
"""
features = dataset.meta.features
meta_keys = {"index", "episode_index", "task_index", "frame_index", "timestamp"}
numeric_features = {
k: v
for k, v in features.items()
if v["dtype"] not in ["image", "video", "string"] and k not in meta_keys
}
if skip_image_video:
features_to_compute = numeric_features
else:
features_to_compute = {
k: v for k, v in features.items() if v["dtype"] != "string" and k not in meta_keys
}
# When relative_action is enabled, compute action stats via chunk-based sampling
# (matching what the model sees during training) and skip action in the
# per-episode pass below.
relative_action_stats = None
if relative_action and ACTION in features and OBS_STATE in features:
if relative_exclude_joints is None:
relative_exclude_joints = ["gripper"]
relative_action_stats = compute_relative_action_stats(
hf_dataset=dataset.hf_dataset,
features=features,
chunk_size=chunk_size,
exclude_joints=relative_exclude_joints,
num_workers=num_workers,
)
features_to_compute.pop(ACTION, None)
logging.info(f"Recomputing stats for features: {list(features_to_compute.keys())}")
data_dir = dataset.root / DATA_DIR
parquet_files = sorted(data_dir.glob("*/*.parquet"))
if not parquet_files:
raise ValueError(f"No parquet files found in {data_dir}")
all_episode_stats = []
numeric_keys = [k for k, v in features_to_compute.items() if v["dtype"] not in ["image", "video"]]
for parquet_path in tqdm(parquet_files, desc="Computing stats from data files"):
df = pd.read_parquet(parquet_path)
for ep_idx in sorted(df["episode_index"].unique()):
ep_df = df[df["episode_index"] == ep_idx]
episode_data = {}
for key in numeric_keys:
if key in ep_df.columns:
values = ep_df[key].values
if hasattr(values[0], "__len__"):
episode_data[key] = np.stack(values)
else:
episode_data[key] = np.array(values)
ep_stats = compute_episode_stats(episode_data, features_to_compute)
all_episode_stats.append(ep_stats)
if features_to_compute and not all_episode_stats:
logging.warning("No episode stats computed")
return dataset
new_stats = aggregate_stats(all_episode_stats) if all_episode_stats else {}
if relative_action_stats is not None:
new_stats[ACTION] = relative_action_stats
# Merge: keep existing stats for features we didn't recompute
if dataset.meta.stats:
for key, value in dataset.meta.stats.items():
if key not in new_stats:
new_stats[key] = value
write_stats(new_stats, dataset.root)
dataset.meta.stats = new_stats
logging.info("Stats recomputed successfully")
return dataset
def convert_image_to_video_dataset(
dataset: LeRobotDataset,
output_dir: Path | None = None,

View File

@@ -0,0 +1,844 @@
#!/usr/bin/env python
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Private writer component for LeRobotDataset. Handles sequential recording (episode buffer, ParquetWriter, image writer, video encoding)."""
from __future__ import annotations
import concurrent.futures
import contextlib
import logging
import shutil
import tempfile
from pathlib import Path
import datasets
import numpy as np
import pandas as pd
import PIL.Image
import pyarrow.parquet as pq
import torch
from lerobot.datasets.audio_utils import encode_audio
from lerobot.datasets.compute_stats import compute_episode_stats
from lerobot.datasets.dataset_metadata import LeRobotDatasetMetadata
from lerobot.datasets.feature_utils import (
get_hf_features_from_features,
validate_episode_buffer,
validate_frame,
)
from lerobot.datasets.image_writer import AsyncImageWriter, write_image
from lerobot.datasets.io_utils import (
embed_images,
get_file_size_in_mb,
load_episodes,
write_info,
)
from lerobot.datasets.utils import (
DEFAULT_EPISODES_PATH,
DEFAULT_IMAGE_PATH,
DEFAULT_RAW_AUDIO_PATH,
update_chunk_file_indices,
)
from lerobot.datasets.video_utils import (
StreamingVideoEncoder,
concatenate_media_files,
encode_video_frames,
get_media_duration_in_s,
)
from lerobot.microphones.microphone import Microphone
from lerobot.microphones.utils import async_microphones_start_recording
logger = logging.getLogger(__name__)
def _encode_video_worker(
video_key: str,
episode_index: int,
root: Path,
fps: int,
vcodec: str = "libsvtav1",
encoder_threads: int | None = None,
) -> Path:
temp_path = Path(tempfile.mkdtemp(dir=root)) / f"{video_key}_{episode_index:03d}.mp4"
fpath = DEFAULT_IMAGE_PATH.format(image_key=video_key, episode_index=episode_index, frame_index=0)
img_dir = (root / fpath).parent
encode_video_frames(
img_dir, temp_path, fps, vcodec=vcodec, overwrite=True, encoder_threads=encoder_threads
)
shutil.rmtree(img_dir)
return temp_path
class DatasetWriter:
"""Encapsulates write-side state and methods for LeRobotDataset.
Owns: episode_buffer, image_writer, _pq_writer (ParquetWriter), _latest_episode,
_current_file_start_frame, _streaming_encoder, _episodes_since_last_encoding, _recorded_frames.
"""
def __init__(
self,
meta: LeRobotDatasetMetadata,
root: Path,
vcodec: str,
encoder_threads: int | None,
batch_encoding_size: int,
streaming_encoder: StreamingVideoEncoder | None = None,
initial_frames: int = 0,
):
"""Initialize the writer with metadata, codec, and encoding config.
Args:
meta: Dataset metadata instance (used for feature schema, chunk
settings, and episode persistence).
root: Local dataset root directory.
vcodec: Video codec for encoding (e.g. ``'libsvtav1'``, ``'h264'``).
encoder_threads: Threads per encoder instance. ``None`` for auto.
batch_encoding_size: Number of episodes to accumulate before
batch-encoding videos.
streaming_encoder: Optional pre-built :class:`StreamingVideoEncoder`
for real-time encoding. ``None`` disables streaming mode.
initial_frames: Starting frame count (non-zero when resuming).
"""
self._meta = meta
self._root = root
self._vcodec = vcodec
self._encoder_threads = encoder_threads
self._batch_encoding_size = batch_encoding_size
self._streaming_encoder = streaming_encoder
# Writer state
self.image_writer: AsyncImageWriter | None = None
self.episode_buffer: dict = self._create_episode_buffer()
self._pq_writer: pq.ParquetWriter | None = None
self._latest_episode: dict | None = None
self._current_file_start_frame: int | None = None
self._episodes_since_last_encoding: int = 0
self._recorded_frames: int = initial_frames
self._finalized = False
def _create_episode_buffer(self, episode_index: int | None = None) -> dict:
current_ep_idx = self._meta.total_episodes if episode_index is None else episode_index
ep_buffer = {}
ep_buffer["size"] = 0
ep_buffer["task"] = []
for key in self._meta.features:
ep_buffer[key] = current_ep_idx if key == "episode_index" else []
return ep_buffer
def _get_image_file_path(self, episode_index: int, image_key: str, frame_index: int) -> Path:
fpath = DEFAULT_IMAGE_PATH.format(
image_key=image_key, episode_index=episode_index, frame_index=frame_index
)
return self._root / fpath
def _get_image_file_dir(self, episode_index: int, image_key: str) -> Path:
return self._get_image_file_path(episode_index, image_key, frame_index=0).parent
def _get_raw_audio_file_path(self, episode_index: int, audio_key: str) -> Path:
fpath = DEFAULT_RAW_AUDIO_PATH.format(audio_key=audio_key, episode_index=episode_index)
return self._root / fpath
def _save_image(
self, image: torch.Tensor | np.ndarray | PIL.Image.Image, fpath: Path, compress_level: int = 1
) -> None:
if self.image_writer is None:
if isinstance(image, torch.Tensor):
image = image.cpu().numpy()
write_image(image, fpath, compress_level=compress_level)
else:
self.image_writer.save_image(image=image, fpath=fpath, compress_level=compress_level)
def add_frame(self, frame: dict) -> None:
"""
Add a single frame to the current episode buffer.
Apart from images written to a temporary directory, nothing is written to disk
until ``save_episode()`` is called.
The caller must provide all user-defined features plus ``"task"``, and must
not provide ``"timestamp"`` or ``"frame_index"``; those are computed
automatically.
"""
# Convert torch to numpy if needed
for name in frame:
if isinstance(frame[name], torch.Tensor):
frame[name] = frame[name].numpy()
validate_frame(frame, self._meta.features)
if self.episode_buffer is None:
self.episode_buffer = self._create_episode_buffer()
# Automatically add frame_index and timestamp to episode buffer
frame_index = self.episode_buffer["size"]
timestamp = frame_index / self._meta.fps
self.episode_buffer["frame_index"].append(frame_index)
self.episode_buffer["timestamp"].append(timestamp)
self.episode_buffer["task"].append(frame.pop("task"))
# Start streaming encoder on first frame of episode
if frame_index == 0 and self._streaming_encoder is not None:
self._streaming_encoder.start_episode(
video_keys=list(self._meta.video_keys),
temp_dir=self._root,
)
# Add frame features to episode_buffer
for key in frame:
if key not in self._meta.features:
raise ValueError(
f"An element of the frame is not in the features. '{key}' not in '{self._meta.features.keys()}'."
)
if self._meta.features[key]["dtype"] == "video" and self._streaming_encoder is not None:
self._streaming_encoder.feed_frame(key, frame[key])
self.episode_buffer[key].append(None)
elif self._meta.features[key]["dtype"] in ["image", "video"]:
img_path = self._get_image_file_path(
episode_index=self.episode_buffer["episode_index"], image_key=key, frame_index=frame_index
)
if frame_index == 0:
img_path.parent.mkdir(parents=True, exist_ok=True)
compress_level = 1 if self._meta.features[key]["dtype"] == "video" else 6
self._save_image(frame[key], img_path, compress_level)
self.episode_buffer[key].append(str(img_path))
elif self._meta.features[key]["dtype"] == "audio":
if (
self._meta.robot_type == "lekiwi"
): # Raw data storage should only be triggered for LeKiwi robot, for which audio is stored chunk by chunk in a visual frame-like manner
self.episode_buffer[key].append(frame[key])
else: # Otherwise, only the audio file path is stored in the episode buffer
if frame_index == 0:
audio_path = self._get_raw_audio_file_path(
episode_index=self.episode_buffer["episode_index"], audio_key=key
)
self.episode_buffer[key].append(str(audio_path))
else:
self.episode_buffer[key].append(frame[key])
self.episode_buffer["size"] += 1
def add_microphone_recording(self, microphone_key: str, microphone: Microphone) -> None:
"""
Starts recording audio data provided by the microphone and directly writes it in a .wav file.
"""
audio_file = self._get_raw_audio_file_path(self._meta.total_episodes, "observation.audio." + microphone_key)
microphone.start_recording(output_file=audio_file)
def add_microphones_recordings(self, microphones: dict[str, Microphone]) -> None:
"""
Starts recording audio data provided by multiple microphones and directly writes it in appropriate .wav files.
"""
output_files = []
for microphone_key in microphones:
output_files.append(
self._get_raw_audio_file_path(self._meta.total_episodes, "observation.audio." + microphone_key)
)
async_microphones_start_recording(microphones, output_files)
def save_episode(
self,
episode_data: dict | None = None,
parallel_encoding: bool = True,
) -> None:
"""Save the current episode in self.episode_buffer to disk."""
episode_buffer = episode_data if episode_data is not None else self.episode_buffer
validate_episode_buffer(episode_buffer, self._meta.total_episodes, self._meta.features)
# size and task are special cases that won't be added to hf_dataset
episode_length = episode_buffer.pop("size")
tasks = episode_buffer.pop("task")
episode_tasks = list(set(tasks))
episode_index = episode_buffer["episode_index"]
episode_buffer["index"] = np.arange(self._meta.total_frames, self._meta.total_frames + episode_length)
episode_buffer["episode_index"] = np.full((episode_length,), episode_index)
# Update tasks and task indices with new tasks if any
self._meta.save_episode_tasks(episode_tasks)
# Given tasks in natural language, find their corresponding task indices
episode_buffer["task_index"] = np.array([self._meta.get_task_index(task) for task in tasks])
for key, ft in self._meta.features.items():
if key in ["index", "episode_index", "task_index"] or ft["dtype"] in ["image", "video"]:
continue
elif ft["dtype"] == "audio":
if (
self._meta.robot_type == "lekiwi"
): # Raw data storage should only be triggered for LeKiwi robot, for which audio is stored chunk by chunk in a visual frame-like manner
episode_buffer[key] = np.concatenate(episode_buffer[key], axis=0)
continue
episode_buffer[key] = np.stack(episode_buffer[key])
# Wait for image writer to end, so that episode stats over images can be computed
self._wait_image_writer()
has_video_keys = len(self._meta.video_keys) > 0
has_audio_keys = len(self._meta.audio_keys) > 0
use_streaming = self._streaming_encoder is not None and has_video_keys
use_batched_encoding = self._batch_encoding_size > 1
if use_streaming:
non_video_buffer = {
k: v
for k, v in episode_buffer.items()
if self._meta.features.get(k, {}).get("dtype") not in ("video",)
}
non_video_features = {k: v for k, v in self._meta.features.items() if v["dtype"] != "video"}
ep_stats = compute_episode_stats(non_video_buffer, non_video_features)
else:
ep_stats = compute_episode_stats(episode_buffer, self._meta.features)
ep_metadata = self._save_episode_data(episode_buffer)
if use_streaming:
streaming_results = self._streaming_encoder.finish_episode()
for video_key in self._meta.video_keys:
temp_path, video_stats = streaming_results[video_key]
if video_stats is not None:
ep_stats[video_key] = {
k: v if k == "count" else np.squeeze(v.reshape(1, -1, 1, 1) / 255.0, axis=0)
for k, v in video_stats.items()
}
ep_metadata.update(self._save_episode_video(video_key, episode_index, temp_path=temp_path))
elif (has_video_keys or has_audio_keys) and not use_batched_encoding:
num_cameras = len(self._meta.video_keys)
if parallel_encoding and num_cameras > 1:
with concurrent.futures.ProcessPoolExecutor(max_workers=num_cameras) as executor:
future_to_key = {
executor.submit(
_encode_video_worker,
video_key,
episode_index,
self._root,
self._meta.fps,
self._vcodec,
self._encoder_threads,
): video_key
for video_key in self._meta.video_keys
}
results = {}
for future in concurrent.futures.as_completed(future_to_key):
video_key = future_to_key[future]
try:
temp_path = future.result()
results[video_key] = temp_path
except Exception as exc:
logger.error(f"Video encoding failed for {video_key}: {exc}")
raise exc
for video_key in self._meta.video_keys:
temp_path = results[video_key]
ep_metadata.update(
self._save_episode_video(video_key, episode_index, temp_path=temp_path)
)
else:
for video_key in self._meta.video_keys:
ep_metadata.update(self._save_episode_video(video_key, episode_index))
# TODO(Caroline): add parallel encoding for audio as well
for audio_key in self._meta.audio_keys:
ep_metadata.update(self._save_episode_audio(audio_key, episode_index))
# `meta.save_episode` need to be executed after encoding the videos
self._meta.save_episode(episode_index, episode_length, episode_tasks, ep_stats, ep_metadata)
if (has_video_keys or has_audio_keys) and use_batched_encoding:
self._episodes_since_last_encoding += 1
if self._episodes_since_last_encoding == self._batch_encoding_size:
start_ep = self._meta.total_episodes - self._batch_encoding_size
end_ep = self._meta.total_episodes
if has_video_keys:
self._batch_save_episode_video(start_ep, end_ep)
if has_audio_keys:
self._batch_save_episode_audio(start_ep, end_ep)
self._episodes_since_last_encoding = 0
if episode_data is None:
self.clear_episode_buffer(
delete_images=len(self._meta.image_keys) > 0, delete_audio=len(self._meta.audio_keys) > 0
)
def _batch_save_episode_video(self, start_episode: int, end_episode: int | None = None) -> None:
"""Batch save videos for multiple episodes."""
if end_episode is None:
end_episode = self._meta.total_episodes
logger.info(
f"Batch encoding {self._batch_encoding_size} videos for episodes {start_episode} to {end_episode - 1}"
)
chunk_idx = self._meta.episodes[start_episode]["data/chunk_index"]
file_idx = self._meta.episodes[start_episode]["data/file_index"]
episode_df_path = self._root / DEFAULT_EPISODES_PATH.format(
chunk_index=chunk_idx, file_index=file_idx
)
episode_df = pd.read_parquet(episode_df_path)
for ep_idx in range(start_episode, end_episode):
logger.info(f"Encoding videos for episode {ep_idx}")
if (
self._meta.episodes[ep_idx]["data/chunk_index"] != chunk_idx
or self._meta.episodes[ep_idx]["data/file_index"] != file_idx
):
episode_df.to_parquet(episode_df_path)
self._meta.episodes = load_episodes(self._root)
chunk_idx = self._meta.episodes[ep_idx]["data/chunk_index"]
file_idx = self._meta.episodes[ep_idx]["data/file_index"]
episode_df_path = self._root / DEFAULT_EPISODES_PATH.format(
chunk_index=chunk_idx, file_index=file_idx
)
episode_df = pd.read_parquet(episode_df_path)
video_ep_metadata = {}
for video_key in self._meta.video_keys:
video_ep_metadata.update(self._save_episode_video(video_key, ep_idx))
video_ep_metadata.pop("episode_index")
video_ep_df = pd.DataFrame(video_ep_metadata, index=[ep_idx]).convert_dtypes(
dtype_backend="pyarrow"
)
episode_df = episode_df.combine_first(video_ep_df)
episode_df.to_parquet(episode_df_path)
self._meta.episodes = load_episodes(self._root)
def _batch_save_episode_audio(self, start_episode: int, end_episode: int | None = None) -> None:
"""
Batch save audio for multiple episodes.
Args:
start_episode: Starting episode index (inclusive)
end_episode: Ending episode index (exclusive). If None, encodes all episodes from start_episode to the current episode.
"""
if end_episode is None:
end_episode = self._meta.total_episodes
logging.info(
f"Batch encoding {self.batch_encoding_size} audio for episodes {start_episode} to {end_episode - 1}"
)
chunk_idx = self._meta.episodes[start_episode]["data/chunk_index"]
file_idx = self._meta.episodes[start_episode]["data/file_index"]
episode_df_path = self._root / DEFAULT_EPISODES_PATH.format(chunk_index=chunk_idx, file_index=file_idx)
episode_df = pd.read_parquet(episode_df_path)
for ep_idx in range(start_episode, end_episode):
logging.info(f"Encoding audio for episode {ep_idx}")
if (
self._meta.episodes[ep_idx]["data/chunk_index"] != chunk_idx
or self._meta.episodes[ep_idx]["data/file_index"] != file_idx
):
# The current episode is in a new chunk or file.
# Save previous episode dataframe and update the Hugging Face dataset by reloading it.
episode_df.to_parquet(episode_df_path)
self._meta.episodes = load_episodes(self._root)
# Load new episode dataframe
chunk_idx = self._meta.episodes[ep_idx]["data/chunk_index"]
file_idx = self._meta.episodes[ep_idx]["data/file_index"]
episode_df_path = self._root / DEFAULT_EPISODES_PATH.format(
chunk_index=chunk_idx, file_index=file_idx
)
episode_df = pd.read_parquet(episode_df_path)
# Save the current episode's video metadata to the dataframe
audio_ep_metadata = {}
for audio_key in self._meta.audio_keys:
audio_ep_metadata.update(self._save_episode_audio(audio_key, ep_idx))
audio_ep_metadata.pop("episode_index")
audio_ep_df = pd.DataFrame(audio_ep_metadata, index=[ep_idx]).convert_dtypes(
dtype_backend="pyarrow"
) # allows NaN values along with integers
episode_df = episode_df.combine_first(audio_ep_df)
episode_df.to_parquet(episode_df_path)
self._meta.episodes = load_episodes(self._root)
def _save_episode_data(self, episode_buffer: dict) -> dict:
"""Save episode data to a parquet file."""
# Use metadata features as the authoritative schema
hf_features = get_hf_features_from_features(self._meta.features)
ep_dict = {key: episode_buffer[key] for key in hf_features}
ep_dataset = datasets.Dataset.from_dict(ep_dict, features=hf_features, split="train")
ep_dataset = embed_images(ep_dataset)
ep_num_frames = len(ep_dataset)
if self._latest_episode is None:
chunk_idx, file_idx = 0, 0
global_frame_index = 0
self._current_file_start_frame = 0
if self._meta.episodes is not None and len(self._meta.episodes) > 0:
latest_ep = self._meta.episodes[-1]
global_frame_index = latest_ep["dataset_to_index"]
chunk_idx = latest_ep["data/chunk_index"]
file_idx = latest_ep["data/file_index"]
chunk_idx, file_idx = update_chunk_file_indices(chunk_idx, file_idx, self._meta.chunks_size)
self._current_file_start_frame = global_frame_index
else:
latest_ep = self._latest_episode
chunk_idx = latest_ep["data/chunk_index"]
file_idx = latest_ep["data/file_index"]
global_frame_index = latest_ep["index"][-1] + 1
latest_path = self._root / self._meta.data_path.format(chunk_index=chunk_idx, file_index=file_idx)
latest_size_in_mb = get_file_size_in_mb(latest_path)
frames_in_current_file = global_frame_index - self._current_file_start_frame
av_size_per_frame = (
latest_size_in_mb / frames_in_current_file if frames_in_current_file > 0 else 0
)
if latest_size_in_mb + av_size_per_frame * ep_num_frames >= self._meta.data_files_size_in_mb:
chunk_idx, file_idx = update_chunk_file_indices(chunk_idx, file_idx, self._meta.chunks_size)
self.close_writer()
self._current_file_start_frame = global_frame_index
ep_dict["data/chunk_index"] = chunk_idx
ep_dict["data/file_index"] = file_idx
path = self._root / self._meta.data_path.format(chunk_index=chunk_idx, file_index=file_idx)
path.parent.mkdir(parents=True, exist_ok=True)
table = ep_dataset.with_format("arrow")[:]
if not self._pq_writer:
self._pq_writer = pq.ParquetWriter(
path, schema=table.schema, compression="snappy", use_dictionary=True
)
self._pq_writer.write_table(table)
metadata = {
"data/chunk_index": chunk_idx,
"data/file_index": file_idx,
"dataset_from_index": global_frame_index,
"dataset_to_index": global_frame_index + ep_num_frames,
}
self._latest_episode = {**ep_dict, **metadata}
self._recorded_frames += ep_num_frames
return metadata
def _save_episode_video(
self,
video_key: str,
episode_index: int,
temp_path: Path | None = None,
) -> dict:
if temp_path is None:
ep_path = self._encode_temporary_episode_video(video_key, episode_index)
else:
ep_path = temp_path
ep_size_in_mb = get_file_size_in_mb(ep_path)
ep_duration_in_s = get_media_duration_in_s(ep_path, media_type="video")
if (
episode_index == 0
or self._meta.latest_episode is None
or f"videos/{video_key}/chunk_index" not in self._meta.latest_episode
):
chunk_idx, file_idx = 0, 0
if self._meta.episodes is not None and len(self._meta.episodes) > 0:
old_chunk_idx = self._meta.episodes[-1][f"videos/{video_key}/chunk_index"]
old_file_idx = self._meta.episodes[-1][f"videos/{video_key}/file_index"]
chunk_idx, file_idx = update_chunk_file_indices(
old_chunk_idx, old_file_idx, self._meta.chunks_size
)
latest_duration_in_s = 0.0
new_path = self._root / self._meta.video_path.format(
video_key=video_key, chunk_index=chunk_idx, file_index=file_idx
)
new_path.parent.mkdir(parents=True, exist_ok=True)
shutil.move(str(ep_path), str(new_path))
else:
latest_ep = self._meta.latest_episode
chunk_idx = latest_ep[f"videos/{video_key}/chunk_index"][0]
file_idx = latest_ep[f"videos/{video_key}/file_index"][0]
latest_path = self._root / self._meta.video_path.format(
video_key=video_key, chunk_index=chunk_idx, file_index=file_idx
)
latest_size_in_mb = get_file_size_in_mb(latest_path)
latest_duration_in_s = latest_ep[f"videos/{video_key}/to_timestamp"][0]
if latest_size_in_mb + ep_size_in_mb >= self._meta.video_files_size_in_mb:
chunk_idx, file_idx = update_chunk_file_indices(chunk_idx, file_idx, self._meta.chunks_size)
new_path = self._root / self._meta.video_path.format(
video_key=video_key, chunk_index=chunk_idx, file_index=file_idx
)
new_path.parent.mkdir(parents=True, exist_ok=True)
shutil.move(str(ep_path), str(new_path))
latest_duration_in_s = 0.0
else:
concatenate_media_files(
[latest_path, ep_path],
latest_path,
)
# Remove temporary directory
shutil.rmtree(str(ep_path.parent))
# Update video info (only needed when first episode is encoded)
if episode_index == 0:
self._meta.update_video_info(video_key)
write_info(self._meta.info, self._meta.root)
metadata = {
"episode_index": episode_index,
f"videos/{video_key}/chunk_index": chunk_idx,
f"videos/{video_key}/file_index": file_idx,
f"videos/{video_key}/from_timestamp": latest_duration_in_s,
f"videos/{video_key}/to_timestamp": latest_duration_in_s + ep_duration_in_s,
}
return metadata
def _encode_temporary_episode_audio(self, audio_key: str, episode_index: int) -> Path:
"""
Use ffmpeg to convert raw audio files into m4a audio files.
Note: `encode_episode_audio` is a blocking call. Making it asynchronous shouldn't speedup encoding,
since audio encoding with ffmpeg is already using multithreading.
"""
temp_path = Path(tempfile.mkdtemp(dir=self._root)) / f"{audio_key}_{episode_index:03d}.m4a"
raw_audio_file = self._get_raw_audio_file_path(episode_index, audio_key)
encode_audio(raw_audio_file, temp_path, overwrite=True)
raw_audio_file.unlink()
return temp_path
def _save_episode_audio(self, audio_key: str, episode_index: int) -> dict:
# Encode episode audio into a temporary audio file
ep_path = self._encode_temporary_episode_audio(audio_key, episode_index)
ep_size_in_mb = get_file_size_in_mb(ep_path)
ep_duration_in_s = get_media_duration_in_s(ep_path, media_type="audio")
if (
episode_index == 0
or self._meta.latest_episode is None
or f"audio/{audio_key}/chunk_index" not in self._meta.latest_episode
):
# Initialize indices for a new dataset made of the first episode data
chunk_idx, file_idx = 0, 0
if self._meta.episodes is not None and len(self._meta.episodes) > 0:
# It means we are resuming recording, so we need to load the latest episode
# Update the indices to avoid overwriting the latest episode
old_chunk_idx = self._meta.episodes[-1][f"audio/{audio_key}/chunk_index"]
old_file_idx = self._meta.episodes[-1][f"audio/{audio_key}/file_index"]
chunk_idx, file_idx = update_chunk_file_indices(
old_chunk_idx, old_file_idx, self._meta.chunks_size
)
latest_duration_in_s = 0.0
new_path = self._root / self._meta.audio_path.format(
audio_key=audio_key, chunk_index=chunk_idx, file_index=file_idx
)
new_path.parent.mkdir(parents=True, exist_ok=True)
shutil.move(str(ep_path), str(new_path))
else:
# Retrieve information from the latest updated audio file using latest_episode
latest_ep = self._meta.latest_episode
chunk_idx = latest_ep[f"audio/{audio_key}/chunk_index"][0]
file_idx = latest_ep[f"audio/{audio_key}/file_index"][0]
latest_path = self._root / self._meta.audio_path.format(
audio_key=audio_key, chunk_index=chunk_idx, file_index=file_idx
)
latest_size_in_mb = get_file_size_in_mb(latest_path)
latest_duration_in_s = latest_ep[f"audio/{audio_key}/to_timestamp"][0]
if latest_size_in_mb + ep_size_in_mb >= self._meta.audio_files_size_in_mb:
# Move temporary episode audio to a new audio file in the dataset
chunk_idx, file_idx = update_chunk_file_indices(chunk_idx, file_idx, self._meta.chunks_size)
new_path = self._root / self._meta.audio_path.format(
audio_key=audio_key, chunk_index=chunk_idx, file_index=file_idx
)
new_path.parent.mkdir(parents=True, exist_ok=True)
shutil.move(str(ep_path), str(new_path))
latest_duration_in_s = 0.0
else:
# Update latest audio file
concatenate_media_files(
[latest_path, ep_path],
latest_path,
)
# Remove temporary directory
shutil.rmtree(str(ep_path.parent))
# Update audio info (only needed when first episode is encoded since it reads from episode 0)
if episode_index == 0:
self._meta.update_audio_info(audio_key)
write_info(self._meta.info, self._meta.root) # ensure audio info always written properly
metadata = {
"episode_index": episode_index,
f"audio/{audio_key}/chunk_index": chunk_idx,
f"audio/{audio_key}/file_index": file_idx,
f"audio/{audio_key}/from_timestamp": latest_duration_in_s,
f"audio/{audio_key}/to_timestamp": latest_duration_in_s + ep_duration_in_s,
}
return metadata
def clear_episode_buffer(self, delete_images: bool = True, delete_audio: bool = True) -> None:
"""Discard the current episode buffer and optionally delete temp images.
Args:
delete_images: If ``True``, remove temporary image directories
written for the current episode.
"""
# Cancel streaming encoder if active
if self._streaming_encoder is not None:
self._streaming_encoder.cancel_episode()
if delete_images:
if self.image_writer is not None:
self._wait_image_writer()
episode_index = self.episode_buffer["episode_index"]
# episode_index is `int` when freshly created, but becomes `np.ndarray` after
# save_episode() mutates the buffer. Handle both types here.
if isinstance(episode_index, np.ndarray):
episode_index = episode_index.item() if episode_index.size == 1 else episode_index[0]
for cam_key in self._meta.image_keys:
img_dir = self._get_image_file_dir(episode_index, cam_key)
if img_dir.is_dir():
shutil.rmtree(img_dir)
if delete_audio:
episode_index = self.episode_buffer["episode_index"]
if isinstance(episode_index, np.ndarray):
episode_index = episode_index.item() if episode_index.size == 1 else episode_index[0]
for audio_key in self._meta.audio_keys:
audio_file = self._get_raw_audio_file_path(episode_index, audio_key)
if audio_file.is_file():
audio_file.unlink()
self.episode_buffer = self._create_episode_buffer()
def start_image_writer(self, num_processes: int = 0, num_threads: int = 4) -> None:
"""Start an :class:`AsyncImageWriter` for background image persistence.
Args:
num_processes: Number of subprocesses. ``0`` means threads only.
num_threads: Number of threads per process.
"""
if isinstance(self.image_writer, AsyncImageWriter):
logger.warning(
"You are starting a new AsyncImageWriter that is replacing an already existing one in the dataset."
)
self.image_writer = AsyncImageWriter(
num_processes=num_processes,
num_threads=num_threads,
)
def stop_image_writer(self) -> None:
"""Stop the image writer (needed before pickling the dataset for DataLoader)."""
if self.image_writer is not None:
self.image_writer.stop()
self.image_writer = None
def _wait_image_writer(self) -> None:
"""Wait for asynchronous image writer to finish."""
if self.image_writer is not None:
self.image_writer.wait_until_done()
def _encode_temporary_episode_video(self, video_key: str, episode_index: int) -> Path:
"""Use ffmpeg to convert frames stored as png into mp4 videos."""
return _encode_video_worker(
video_key, episode_index, self._root, self._meta.fps, self._vcodec, self._encoder_threads
)
def close_writer(self) -> None:
"""Close and cleanup the parquet writer if it exists."""
if self._pq_writer is not None:
self._pq_writer.close()
self._pq_writer = None
def flush_pending_videos(self) -> None:
"""Flush any pending video encoding (streaming or batch).
For streaming encoding: closes the encoder.
For batch encoding: encodes any remaining episodes that haven't been batch-encoded yet.
"""
if self._streaming_encoder is not None:
self._streaming_encoder.close()
elif self._episodes_since_last_encoding > 0:
start_ep = self._meta.total_episodes - self._episodes_since_last_encoding
end_ep = self._meta.total_episodes
logger.info(
f"Encoding remaining {self._episodes_since_last_encoding} episodes, "
f"from episode {start_ep} to {end_ep - 1}"
)
self._batch_save_episode_video(start_ep, end_ep)
def cancel_pending_videos(self) -> None:
"""Cancel any in-progress streaming encoding without flushing."""
if self._streaming_encoder is not None:
self._streaming_encoder.cancel_episode()
def cleanup_interrupted_episode(self, episode_index: int) -> None:
"""Remove temporary image and audio directories for an interrupted episode."""
for key in self._meta.video_keys:
img_dir = self._get_image_file_path(
episode_index=episode_index, image_key=key, frame_index=0
).parent
if img_dir.exists():
logger.debug(
f"Cleaning up interrupted episode images for episode {episode_index}, camera {key}"
)
shutil.rmtree(img_dir)
for key in self._meta.audio_keys:
audio_file = self._get_raw_audio_file_path(episode_index=episode_index, audio_key=key)
if audio_file.exists():
logger.debug(
f"Cleaning up interrupted episode audio for episode {episode_index}, microphone {key}"
)
audio_file.unlink()
def finalize(self) -> None:
"""Flush all pending work and release all resources.
Idempotent — safe to call multiple times.
"""
if getattr(self, "_finalized", False):
return
# 1. Wait for async image writes to complete, then stop
if self.image_writer is not None:
self.image_writer.wait_until_done()
self.image_writer.stop()
self.image_writer = None
# 2. Flush pending video encoding (streaming or batch)
self.flush_pending_videos()
# 3. Close own parquet writer
self.close_writer()
# 4. Finalize metadata (idempotent)
self._meta.finalize()
self._finalized = True
def __del__(self):
"""Safety net: release resources on garbage collection."""
# During interpreter shutdown, referenced objects may already be collected.
with contextlib.suppress(Exception):
self.finalize()

View File

@@ -20,11 +20,9 @@ import torch
from lerobot.configs.policies import PreTrainedConfig
from lerobot.configs.train import TrainPipelineConfig
from lerobot.datasets.lerobot_dataset import (
LeRobotDataset,
LeRobotDatasetMetadata,
MultiLeRobotDataset,
)
from lerobot.datasets.dataset_metadata import LeRobotDatasetMetadata
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.datasets.multi_dataset import MultiLeRobotDataset
from lerobot.datasets.streaming_dataset import StreamingLeRobotDataset
from lerobot.datasets.transforms import ImageTransforms
from lerobot.utils.constants import ACTION, OBS_PREFIX, REWARD

View File

@@ -0,0 +1,612 @@
#!/usr/bin/env python
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from pprint import pformat
from typing import Any
import datasets
import numpy as np
from PIL import Image as PILImage
from lerobot.configs.types import FeatureType, PolicyFeature
from lerobot.datasets.utils import (
DEFAULT_AUDIO_FILE_SIZE_IN_MB,
DEFAULT_AUDIO_PATH,
DEFAULT_CHUNK_SIZE,
DEFAULT_DATA_FILE_SIZE_IN_MB,
DEFAULT_DATA_PATH,
DEFAULT_FEATURES,
DEFAULT_VIDEO_FILE_SIZE_IN_MB,
DEFAULT_VIDEO_PATH,
)
from lerobot.utils.constants import ACTION, OBS_ENV_STATE, OBS_STR
from lerobot.utils.utils import is_valid_numpy_dtype_string
def get_hf_features_from_features(features: dict) -> datasets.Features:
"""Convert a LeRobot features dictionary to a `datasets.Features` object.
Args:
features (dict): A LeRobot-style feature dictionary.
Returns:
datasets.Features: The corresponding Hugging Face `datasets.Features` object.
Raises:
ValueError: If a feature has an unsupported shape.
"""
hf_features = {}
for key, ft in features.items():
if ft["dtype"] == "video" or ft["dtype"] == "audio":
continue
elif ft["dtype"] == "image":
hf_features[key] = datasets.Image()
elif ft["shape"] == (1,):
hf_features[key] = datasets.Value(dtype=ft["dtype"])
elif len(ft["shape"]) == 1:
hf_features[key] = datasets.Sequence(
length=ft["shape"][0], feature=datasets.Value(dtype=ft["dtype"])
)
elif len(ft["shape"]) == 2:
hf_features[key] = datasets.Array2D(shape=ft["shape"], dtype=ft["dtype"])
elif len(ft["shape"]) == 3:
hf_features[key] = datasets.Array3D(shape=ft["shape"], dtype=ft["dtype"])
elif len(ft["shape"]) == 4:
hf_features[key] = datasets.Array4D(shape=ft["shape"], dtype=ft["dtype"])
elif len(ft["shape"]) == 5:
hf_features[key] = datasets.Array5D(shape=ft["shape"], dtype=ft["dtype"])
else:
raise ValueError(f"Corresponding feature is not valid: {ft}")
return datasets.Features(hf_features)
def _validate_feature_names(features: dict[str, dict]) -> None:
"""Validate that feature names do not contain invalid characters.
Args:
features (dict): The LeRobot features dictionary.
Raises:
ValueError: If any feature name contains '/'.
"""
invalid_features = {name: ft for name, ft in features.items() if "/" in name}
if invalid_features:
raise ValueError(f"Feature names should not contain '/'. Found '/' in '{invalid_features}'.")
def hw_to_dataset_features(
hw_features: dict[str, type | tuple], prefix: str, use_video: bool = True
) -> dict[str, dict]:
"""Convert hardware-specific features to a LeRobot dataset feature dictionary.
This function takes a dictionary describing hardware outputs (like joint states
or camera image shapes) and formats it into the standard LeRobot feature
specification.
Args:
hw_features (dict): Dictionary mapping feature names to their type (float for
joints) or shape (tuple for images).
prefix (str): The prefix to add to the feature keys (e.g., "observation"
or "action").
use_video (bool): If True, image features are marked as "video", otherwise "image".
Returns:
dict: A LeRobot features dictionary.
"""
features = {}
joint_fts = {
key: ftype
for key, ftype in hw_features.items()
if ftype is float or (isinstance(ftype, PolicyFeature) and ftype.type != FeatureType.VISUAL)
}
cam_fts = {
key: shape for key, shape in hw_features.items() if isinstance(shape, tuple) and len(shape) == 3
}
mic_fts = {
key: shape for key, shape in hw_features.items() if isinstance(shape, tuple) and len(shape) == 2
}
if joint_fts and prefix == ACTION:
features[prefix] = {
"dtype": "float32",
"shape": (len(joint_fts),),
"names": list(joint_fts),
}
if joint_fts and prefix == OBS_STR:
features[f"{prefix}.state"] = {
"dtype": "float32",
"shape": (len(joint_fts),),
"names": list(joint_fts),
}
for key, shape in cam_fts.items():
features[f"{prefix}.images.{key}"] = {
"dtype": "video" if use_video else "image",
"shape": shape,
"names": ["height", "width", "channels"],
}
for key, parameters in mic_fts.items():
features[f"{prefix}.audio.{key}"] = {
"dtype": "audio",
"shape": (len(parameters[1]),),
"names": ["channels"],
"info": {"sample_rate": parameters[0]},
}
_validate_feature_names(features)
return features
def build_dataset_frame(
ds_features: dict[str, dict], values: dict[str, Any], prefix: str
) -> dict[str, np.ndarray]:
"""Construct a single data frame from raw values based on dataset features.
A "frame" is a dictionary containing all the data for a single timestep,
formatted as numpy arrays according to the feature specification.
Args:
ds_features (dict): The LeRobot dataset features dictionary.
values (dict): A dictionary of raw values from the hardware/environment.
prefix (str): The prefix to filter features by (e.g., "observation"
or "action").
Returns:
dict: A dictionary representing a single frame of data.
"""
frame = {}
for key, ft in ds_features.items():
if key in DEFAULT_FEATURES or not key.startswith(prefix):
continue
elif ft["dtype"] == "float32" and len(ft["shape"]) == 1:
frame[key] = np.array([values[name] for name in ft["names"]], dtype=np.float32)
elif ft["dtype"] in ["image", "video"]:
frame[key] = values[key.removeprefix(f"{prefix}.images.")]
elif ft["dtype"] == "audio":
frame[key] = values[key.removeprefix(f"{prefix}.audio.")]
return frame
def dataset_to_policy_features(features: dict[str, dict]) -> dict[str, PolicyFeature]:
"""Convert dataset features to policy features.
This function transforms the dataset's feature specification into a format
that a policy can use, classifying features by type (e.g., visual, state,
action) and ensuring correct shapes (e.g., channel-first for images).
Args:
features (dict): The LeRobot dataset features dictionary.
Returns:
dict: A dictionary mapping feature keys to `PolicyFeature` objects.
Raises:
ValueError: If an image feature does not have a 3D shape.
"""
# TODO(aliberts): Implement "type" in dataset features and simplify this
policy_features = {}
for key, ft in features.items():
shape = ft["shape"]
if ft["dtype"] in ["image", "video"]:
type = FeatureType.VISUAL
if len(shape) != 3:
raise ValueError(f"Number of dimensions of {key} != 3 (shape={shape})")
names = ft["names"]
# Backward compatibility for "channel" which is an error introduced in LeRobotDataset v2.0 for ported datasets.
if names[2] in ["channel", "channels"]: # (h, w, c) -> (c, h, w)
shape = (shape[2], shape[0], shape[1])
elif ft["dtype"] == "audio":
type = FeatureType.AUDIO
if len(shape) != 2:
raise ValueError(f"Number of dimensions of {key} != 2 (shape={shape})")
elif key == OBS_ENV_STATE:
type = FeatureType.ENV
elif key.startswith(OBS_STR):
type = FeatureType.STATE
elif key.startswith(ACTION):
type = FeatureType.ACTION
else:
continue
policy_features[key] = PolicyFeature(
type=type,
shape=shape,
)
return policy_features
def combine_feature_dicts(*dicts: dict) -> dict:
"""Merge LeRobot grouped feature dicts.
- For 1D numeric specs (dtype not image/video/string) with "names": we merge the names and recompute the shape.
- For others (e.g. `observation.images.*`), the last one wins (if they are identical).
Args:
*dicts: A variable number of LeRobot feature dictionaries to merge.
Returns:
dict: A single merged feature dictionary.
Raises:
ValueError: If there's a dtype mismatch for a feature being merged.
"""
out: dict = {}
for d in dicts:
for key, value in d.items():
if not isinstance(value, dict):
out[key] = value
continue
dtype = value.get("dtype")
shape = value.get("shape")
is_vector = (
dtype not in ("image", "video", "string")
and isinstance(shape, tuple)
and len(shape) == 1
and "names" in value
)
if is_vector:
# Initialize or retrieve the accumulating dict for this feature key
target = out.setdefault(key, {"dtype": dtype, "names": [], "shape": (0,)})
# Ensure consistent data types across merged entries
if "dtype" in target and dtype != target["dtype"]:
raise ValueError(f"dtype mismatch for '{key}': {target['dtype']} vs {dtype}")
# Merge feature names: append only new ones to preserve order without duplicates
seen = set(target["names"])
for n in value["names"]:
if n not in seen:
target["names"].append(n)
seen.add(n)
# Recompute the shape to reflect the updated number of features
target["shape"] = (len(target["names"]),)
else:
# For images/videos and non-1D entries: override with the latest definition
out[key] = value
return out
def create_empty_dataset_info(
codebase_version: str,
fps: int,
features: dict,
use_videos: bool,
robot_type: str | None = None,
chunks_size: int | None = None,
data_files_size_in_mb: int | None = None,
video_files_size_in_mb: int | None = None,
audio_files_size_in_mb: int | None = None,
) -> dict:
"""Create a template dictionary for a new dataset's `info.json`.
Args:
codebase_version (str): The version of the LeRobot codebase.
fps (int): The frames per second of the data.
features (dict): The LeRobot features dictionary for the dataset.
use_videos (bool): Whether the dataset will store videos.
robot_type (str | None): The type of robot used, if any.
chunks_size (int | None): The number of files per chunk.
data_files_size_in_mb (int | None): The maximum size per data file in MB.
video_files_size_in_mb (int | None): The maximum size per video file in MB.
audio_files_size_in_mb (int | None): The maximum size per audio file in MB.
Returns:
dict: A dictionary with the initial dataset metadata.
"""
return {
"codebase_version": codebase_version,
"robot_type": robot_type,
"total_episodes": 0,
"total_frames": 0,
"total_tasks": 0,
"chunks_size": chunks_size or DEFAULT_CHUNK_SIZE,
"data_files_size_in_mb": data_files_size_in_mb or DEFAULT_DATA_FILE_SIZE_IN_MB,
"video_files_size_in_mb": video_files_size_in_mb or DEFAULT_VIDEO_FILE_SIZE_IN_MB,
"audio_files_size_in_mb": audio_files_size_in_mb or DEFAULT_AUDIO_FILE_SIZE_IN_MB,
"fps": fps,
"splits": {},
"data_path": DEFAULT_DATA_PATH,
"video_path": DEFAULT_VIDEO_PATH if use_videos else None,
"audio_path": DEFAULT_AUDIO_PATH,
"features": features,
}
def check_delta_timestamps(
delta_timestamps: dict[str, list[float]], fps: int, tolerance_s: float, raise_value_error: bool = True
) -> bool:
"""Check if delta timestamps are multiples of 1/fps +/- tolerance.
This ensures that adding these delta timestamps to any existing timestamp in
the dataset will result in a value that aligns with the dataset's frame rate.
Args:
delta_timestamps (dict): A dictionary where values are lists of time
deltas in seconds.
fps (int): The frames per second of the dataset.
tolerance_s (float): The allowed tolerance in seconds.
raise_value_error (bool): If True, raises an error on failure.
Returns:
bool: True if all deltas are valid, False otherwise.
Raises:
ValueError: If any delta is outside the tolerance and `raise_value_error` is True.
"""
outside_tolerance = {}
for key, delta_ts in delta_timestamps.items():
within_tolerance = [abs(ts * fps - round(ts * fps)) / fps <= tolerance_s for ts in delta_ts]
if not all(within_tolerance):
outside_tolerance[key] = [
ts for ts, is_within in zip(delta_ts, within_tolerance, strict=True) if not is_within
]
if len(outside_tolerance) > 0:
if raise_value_error:
raise ValueError(
f"""
The following delta_timestamps are found outside of tolerance range.
Please make sure they are multiples of 1/{fps} +/- tolerance and adjust
their values accordingly.
\n{pformat(outside_tolerance)}
"""
)
return False
return True
def get_delta_indices(delta_timestamps: dict[str, list[float]], fps: int) -> dict[str, list[int]]:
"""Convert delta timestamps in seconds to delta indices in frames.
Args:
delta_timestamps (dict): A dictionary of time deltas in seconds.
fps (int): The frames per second of the dataset.
Returns:
dict: A dictionary of frame delta indices.
"""
delta_indices = {}
for key, delta_ts in delta_timestamps.items():
delta_indices[key] = [round(d * fps) for d in delta_ts]
return delta_indices
def validate_frame(frame: dict, features: dict) -> None:
# DEFAULT_FEATURES (timestamp, frame_index, episode_index, index, task_index) are
# auto-populated by the recording pipeline (add_frame / save_episode) and must not
# be supplied by the caller. Excluding them here means any frame dict that contains
# these keys will be rejected as extra features.
expected_features = set(features) - set(DEFAULT_FEATURES)
actual_features = set(frame)
# task is a special required field that's not part of regular features
if "task" not in actual_features:
raise ValueError("Feature mismatch in `frame` dictionary:\nMissing features: {'task'}\n")
# Remove task from actual_features for regular feature validation
actual_features_for_validation = actual_features - {"task"}
error_message = validate_features_presence(actual_features_for_validation, expected_features)
common_features = actual_features_for_validation & expected_features
for name in common_features:
error_message += validate_feature_dtype_and_shape(name, features[name], frame[name])
if error_message:
raise ValueError(error_message)
def validate_features_presence(actual_features: set[str], expected_features: set[str]) -> str:
"""Check for missing or extra features in a frame.
Args:
actual_features (set[str]): The set of feature names present in the frame.
expected_features (set[str]): The set of feature names expected in the frame.
Returns:
str: An error message string if there's a mismatch, otherwise an empty string.
"""
error_message = ""
missing_features = expected_features - actual_features
extra_features = actual_features - expected_features
if missing_features or extra_features:
error_message += "Feature mismatch in `frame` dictionary:\n"
if missing_features:
error_message += f"Missing features: {missing_features}\n"
if extra_features:
error_message += f"Extra features: {extra_features}\n"
return error_message
def validate_feature_dtype_and_shape(
name: str, feature: dict, value: np.ndarray | PILImage.Image | str
) -> str:
"""Validate the dtype and shape of a single feature's value.
Args:
name (str): The name of the feature.
feature (dict): The feature specification from the LeRobot features dictionary.
value: The value of the feature to validate.
Returns:
str: An error message if validation fails, otherwise an empty string.
Raises:
NotImplementedError: If the feature dtype is not supported for validation.
"""
expected_dtype = feature["dtype"]
expected_shape = feature["shape"]
if is_valid_numpy_dtype_string(expected_dtype):
return validate_feature_numpy_array(name, expected_dtype, expected_shape, value)
elif expected_dtype in ["image", "video"]:
return validate_feature_image_or_video(name, expected_shape, value)
elif expected_dtype == "audio":
return validate_feature_audio(name, expected_shape, value)
elif expected_dtype == "string":
return validate_feature_string(name, value)
else:
raise NotImplementedError(f"The feature dtype '{expected_dtype}' is not implemented yet.")
def validate_feature_numpy_array(
name: str, expected_dtype: str, expected_shape: list[int], value: np.ndarray
) -> str:
"""Validate a feature that is expected to be a numpy array.
Args:
name (str): The name of the feature.
expected_dtype (str): The expected numpy dtype as a string.
expected_shape (list[int]): The expected shape.
value (np.ndarray): The numpy array to validate.
Returns:
str: An error message if validation fails, otherwise an empty string.
"""
error_message = ""
if isinstance(value, np.ndarray):
actual_dtype = value.dtype
actual_shape = value.shape
if actual_dtype != np.dtype(expected_dtype):
error_message += f"The feature '{name}' of dtype '{actual_dtype}' is not of the expected dtype '{expected_dtype}'.\n"
if actual_shape != expected_shape:
error_message += f"The feature '{name}' of shape '{actual_shape}' does not have the expected shape '{expected_shape}'.\n"
else:
error_message += f"The feature '{name}' is not a 'np.ndarray'. Expected type is '{expected_dtype}', but type '{type(value)}' provided instead.\n"
return error_message
def validate_feature_image_or_video(
name: str, expected_shape: list[str], value: np.ndarray | PILImage.Image
) -> str:
"""Validate a feature that is expected to be an image or video frame.
Accepts `np.ndarray` (channel-first or channel-last) or `PIL.Image.Image`.
Args:
name (str): The name of the feature.
expected_shape (list[str]): The expected shape (C, H, W).
value: The image data to validate.
Returns:
str: An error message if validation fails, otherwise an empty string.
"""
# Note: The check of pixels range ([0,1] for float and [0,255] for uint8) is done by the image writer threads.
error_message = ""
if isinstance(value, np.ndarray):
actual_shape = value.shape
c, h, w = expected_shape
if len(actual_shape) != 3 or (actual_shape != (c, h, w) and actual_shape != (h, w, c)):
error_message += f"The feature '{name}' of shape '{actual_shape}' does not have the expected shape '{(c, h, w)}' or '{(h, w, c)}'.\n"
elif isinstance(value, PILImage.Image):
pass
else:
error_message += f"The feature '{name}' is expected to be of type 'PIL.Image' or 'np.ndarray' channel first or channel last, but type '{type(value)}' provided instead.\n"
return error_message
def validate_feature_audio(name: str, expected_shape: list[str], value: np.ndarray):
"""Validate a feature that is expected to be an audio frame.
Args:
name (str): The name of the feature.
expected_shape (list[str]): The expected shape (C,).
value: The audio data to validate.
Returns:
str: An error message if validation fails, otherwise an empty string.
"""
error_message = ""
if isinstance(value, np.ndarray):
actual_shape = value.shape
c = expected_shape
if (len(actual_shape) != 2 and len(actual_shape) != 1) or actual_shape[-1] != c[
-1
]: # The number of frames might be different
error_message += (
f"The feature '{name}' of shape '{actual_shape}' does not have the expected shape '{c}'.\n"
)
else:
error_message += f"The feature '{name}' is expected to be of type 'np.ndarray', but type '{type(value)}' provided instead.\n"
return error_message
def validate_feature_string(name: str, value: str) -> str:
"""Validate a feature that is expected to be a string.
Args:
name (str): The name of the feature.
value (str): The value to validate.
Returns:
str: An error message if validation fails, otherwise an empty string.
"""
if not isinstance(value, str):
return f"The feature '{name}' is expected to be of type 'str', but type '{type(value)}' provided instead.\n"
return ""
def validate_episode_buffer(episode_buffer: dict, total_episodes: int, features: dict) -> None:
"""Validate the episode buffer before it's written to disk.
Ensures the buffer has the required keys, contains at least one frame, and
has features consistent with the dataset's specification.
Args:
episode_buffer (dict): The buffer containing data for a single episode.
total_episodes (int): The current total number of episodes in the dataset.
features (dict): The LeRobot features dictionary for the dataset.
Raises:
ValueError: If the buffer is invalid.
NotImplementedError: If the episode index is manually set and doesn't match.
"""
if "size" not in episode_buffer:
raise ValueError("size key not found in episode_buffer")
if "task" not in episode_buffer:
raise ValueError("task key not found in episode_buffer")
if episode_buffer["episode_index"] != total_episodes:
# TODO(aliberts): Add option to use existing episode_index
raise NotImplementedError(
"You might have manually provided the episode_buffer with an episode_index that doesn't "
"match the total number of episodes already in the dataset. This is not supported for now."
)
if episode_buffer["size"] == 0:
raise ValueError("You must add one or several frames with `add_frame` before calling `add_episode`.")
buffer_keys = set(episode_buffer.keys()) - {"task", "size"}
if not buffer_keys == set(features):
raise ValueError(
f"Features from `episode_buffer` don't match the ones in `features`."
f"In episode_buffer not in features: {buffer_keys - set(features)}"
f"In features not in episode_buffer: {set(features) - buffer_keys}"
)

View File

@@ -13,6 +13,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import multiprocessing
import queue
import threading
@@ -22,6 +23,8 @@ import numpy as np
import PIL.Image
import torch
logger = logging.getLogger(__name__)
def safe_stop_image_writer(func):
def wrapper(*args, **kwargs):
@@ -29,10 +32,10 @@ def safe_stop_image_writer(func):
return func(*args, **kwargs)
except Exception as e:
dataset = kwargs.get("dataset")
image_writer = getattr(dataset, "image_writer", None) if dataset else None
if image_writer is not None:
print("Waiting for image writer to terminate...")
image_writer.stop()
writer = getattr(dataset, "writer", None) if dataset else None
if writer is not None and writer.image_writer is not None:
logger.warning("Waiting for image writer to terminate...")
writer.image_writer.stop()
raise e
return wrapper
@@ -89,8 +92,7 @@ def write_image(image: np.ndarray | PIL.Image.Image, fpath: Path, compress_level
PIL.Image.Image object.
Side Effects:
Prints an error message to the console if the image writing process
fails for any reason.
Logs an error message if the image writing process fails for any reason.
"""
try:
if isinstance(image, np.ndarray):
@@ -101,7 +103,7 @@ def write_image(image: np.ndarray | PIL.Image.Image, fpath: Path, compress_level
raise TypeError(f"Unsupported image type: {type(image)}")
img.save(fpath, compress_level=compress_level)
except Exception as e:
print(f"Error writing image {fpath}: {e}")
logger.error("Error writing image %s: %s", fpath, e)
def worker_thread_loop(queue: queue.Queue):

View File

@@ -0,0 +1,361 @@
#!/usr/bin/env python
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
from pathlib import Path
from typing import Any
import datasets
import numpy as np
import pandas
import pandas as pd
import pyarrow.dataset as pa_ds
import pyarrow.parquet as pq
import soundfile as sf
import torch
from datasets import Dataset
from datasets.table import embed_table_storage
from PIL import Image as PILImage
from torchvision import transforms
from lerobot.datasets.utils import (
DEFAULT_DATA_FILE_SIZE_IN_MB,
DEFAULT_EPISODES_PATH,
DEFAULT_SUBTASKS_PATH,
DEFAULT_TASKS_PATH,
EPISODES_DIR,
INFO_PATH,
STATS_PATH,
flatten_dict,
serialize_dict,
unflatten_dict,
)
from lerobot.utils.utils import SuppressProgressBars
def get_parquet_file_size_in_mb(parquet_path: str | Path) -> float:
metadata = pq.read_metadata(parquet_path)
total_uncompressed_size = 0
for row_group in range(metadata.num_row_groups):
rg_metadata = metadata.row_group(row_group)
for column in range(rg_metadata.num_columns):
col_metadata = rg_metadata.column(column)
total_uncompressed_size += col_metadata.total_uncompressed_size
return total_uncompressed_size / (1024**2)
def get_hf_dataset_size_in_mb(hf_ds: Dataset) -> int:
return hf_ds.data.nbytes // (1024**2)
def load_nested_dataset(
pq_dir: Path, features: datasets.Features | None = None, episodes: list[int] | None = None
) -> Dataset:
"""Find parquet files in provided directory {pq_dir}/chunk-xxx/file-xxx.parquet
Convert parquet files to pyarrow memory mapped in a cache folder for efficient RAM usage
Concatenate all pyarrow references to return HF Dataset format
Args:
pq_dir: Directory containing parquet files
features: Optional features schema to ensure consistent loading of complex types like images
episodes: Optional list of episode indices to filter. Uses PyArrow predicate pushdown for efficiency.
"""
paths = sorted(pq_dir.glob("*/*.parquet"))
if len(paths) == 0:
raise FileNotFoundError(f"Provided directory does not contain any parquet file: {pq_dir}")
with SuppressProgressBars():
# We use .from_parquet() memory-mapped loading for efficiency
filters = pa_ds.field("episode_index").isin(episodes) if episodes is not None else None
return Dataset.from_parquet([str(path) for path in paths], filters=filters, features=features)
def get_parquet_num_frames(parquet_path: str | Path) -> int:
metadata = pq.read_metadata(parquet_path)
return metadata.num_rows
def get_file_size_in_mb(file_path: Path) -> float:
"""Get file size on disk in megabytes.
Args:
file_path (Path): Path to the file.
"""
file_size_bytes = file_path.stat().st_size
return file_size_bytes / (1024**2)
def embed_images(dataset: datasets.Dataset) -> datasets.Dataset:
"""Embed image bytes into the dataset table before saving to Parquet.
This function prepares a Hugging Face dataset for serialization by converting
image objects into an embedded format that can be stored in Arrow/Parquet.
Args:
dataset (datasets.Dataset): The input dataset, possibly containing image features.
Returns:
datasets.Dataset: The dataset with images embedded in the table storage.
"""
# Embed image bytes into the table before saving to parquet
format = dataset.format
dataset = dataset.with_format("arrow")
dataset = dataset.map(embed_table_storage, batched=False)
dataset = dataset.with_format(**format)
return dataset
def load_json(fpath: Path) -> Any:
"""Load data from a JSON file.
Args:
fpath (Path): Path to the JSON file.
Returns:
Any: The data loaded from the JSON file.
"""
with open(fpath) as f:
return json.load(f)
def write_json(data: dict, fpath: Path) -> None:
"""Write data to a JSON file.
Creates parent directories if they don't exist.
Args:
data (dict): The dictionary to write.
fpath (Path): The path to the output JSON file.
"""
fpath.parent.mkdir(exist_ok=True, parents=True)
with open(fpath, "w") as f:
json.dump(data, f, indent=4, ensure_ascii=False)
def write_info(info: dict, local_dir: Path) -> None:
write_json(info, local_dir / INFO_PATH)
def load_info(local_dir: Path) -> dict:
"""Load dataset info metadata from its standard file path.
Also converts shape lists to tuples for consistency.
Args:
local_dir (Path): The root directory of the dataset.
Returns:
dict: The dataset information dictionary.
"""
info = load_json(local_dir / INFO_PATH)
for ft in info["features"].values():
ft["shape"] = tuple(ft["shape"])
return info
def write_stats(stats: dict, local_dir: Path) -> None:
"""Serialize and write dataset statistics to their standard file path.
Args:
stats (dict): The statistics dictionary (can contain tensors/numpy arrays).
local_dir (Path): The root directory of the dataset.
"""
serialized_stats = serialize_dict(stats)
write_json(serialized_stats, local_dir / STATS_PATH)
def cast_stats_to_numpy(stats: dict) -> dict[str, dict[str, np.ndarray]]:
"""Recursively cast numerical values in a stats dictionary to numpy arrays.
Args:
stats (dict): The statistics dictionary.
Returns:
dict: The statistics dictionary with values cast to numpy arrays.
"""
stats = {key: np.array(value) for key, value in flatten_dict(stats).items()}
return unflatten_dict(stats)
def load_stats(local_dir: Path) -> dict[str, dict[str, np.ndarray]] | None:
"""Load dataset statistics and cast numerical values to numpy arrays.
Returns None if the stats file doesn't exist.
Args:
local_dir (Path): The root directory of the dataset.
Returns:
A dictionary of statistics or None if the file is not found.
"""
if not (local_dir / STATS_PATH).exists():
return None
stats = load_json(local_dir / STATS_PATH)
return cast_stats_to_numpy(stats)
def write_tasks(tasks: pandas.DataFrame, local_dir: Path) -> None:
path = local_dir / DEFAULT_TASKS_PATH
path.parent.mkdir(parents=True, exist_ok=True)
tasks.to_parquet(path)
def load_tasks(local_dir: Path) -> pandas.DataFrame:
tasks = pd.read_parquet(local_dir / DEFAULT_TASKS_PATH)
tasks.index.name = "task"
return tasks
def load_subtasks(local_dir: Path) -> pandas.DataFrame | None:
"""Load subtasks from subtasks.parquet if it exists."""
subtasks_path = local_dir / DEFAULT_SUBTASKS_PATH
if subtasks_path.exists():
return pd.read_parquet(subtasks_path)
return None
def write_episodes(episodes: Dataset, local_dir: Path) -> None:
"""Write episode metadata to a parquet file in the LeRobot v3.0 format.
This function writes episode-level metadata to a single parquet file.
Used primarily during dataset conversion (v2.1 → v3.0) and in test fixtures.
Args:
episodes: HuggingFace Dataset containing episode metadata
local_dir: Root directory where the dataset will be stored
"""
episode_size_mb = get_hf_dataset_size_in_mb(episodes)
if episode_size_mb > DEFAULT_DATA_FILE_SIZE_IN_MB:
raise NotImplementedError(
f"Episodes dataset is too large ({episode_size_mb} MB) to write to a single file. "
f"The current limit is {DEFAULT_DATA_FILE_SIZE_IN_MB} MB. "
"This function only supports single-file episode metadata. "
)
fpath = local_dir / DEFAULT_EPISODES_PATH.format(chunk_index=0, file_index=0)
fpath.parent.mkdir(parents=True, exist_ok=True)
episodes.to_parquet(fpath)
def load_episodes(local_dir: Path) -> datasets.Dataset:
episodes = load_nested_dataset(local_dir / EPISODES_DIR)
# Select episode features/columns containing references to episode data and videos
# (e.g. tasks, dataset_from_index, dataset_to_index, data/chunk_index, data/file_index, etc.)
# This is to speedup access to these data, instead of having to load episode stats.
episodes = episodes.select_columns([key for key in episodes.features if not key.startswith("stats/")])
return episodes
def load_image_as_numpy(
fpath: str | Path, dtype: np.dtype = np.float32, channel_first: bool = True
) -> np.ndarray:
"""Load an image from a file into a numpy array.
Args:
fpath (str | Path): Path to the image file.
dtype (np.dtype): The desired data type of the output array. If floating,
pixels are scaled to [0, 1].
channel_first (bool): If True, converts the image to (C, H, W) format.
Otherwise, it remains in (H, W, C) format.
Returns:
np.ndarray: The image as a numpy array.
"""
img = PILImage.open(fpath).convert("RGB")
img_array = np.array(img, dtype=dtype)
if channel_first: # (H, W, C) -> (C, H, W)
img_array = np.transpose(img_array, (2, 0, 1))
if np.issubdtype(dtype, np.floating):
img_array /= 255.0
return img_array
def load_audio_from_path(fpath: str | Path) -> np.ndarray:
"""Load an audio file from a path into a numpy array.
Args:
fpath (str | Path): Path to the audio file.
Returns:
np.ndarray: The audio as a numpy array.
"""
audio_data, _ = sf.read(fpath, dtype="float32")
# Fill missing channel dimension when loading mono audio data
if audio_data.ndim == 1:
audio_data = np.expand_dims(audio_data, axis=1)
return audio_data
def hf_transform_to_torch(items_dict: dict[str, list[Any]]) -> dict[str, list[torch.Tensor | str]]:
"""Convert a batch from a Hugging Face dataset to torch tensors.
This transform function converts items from Hugging Face dataset format (pyarrow)
to torch tensors. Importantly, images are converted from PIL objects (H, W, C, uint8)
to a torch image representation (C, H, W, float32) in the range [0, 1]. Other
types are converted to torch.tensor.
Args:
items_dict (dict): A dictionary representing a batch of data from a
Hugging Face dataset.
Returns:
dict: The batch with items converted to torch tensors.
"""
for key in items_dict:
first_item = items_dict[key][0]
if isinstance(first_item, PILImage.Image):
to_tensor = transforms.ToTensor()
items_dict[key] = [to_tensor(img) for img in items_dict[key]]
elif first_item is None:
pass
else:
items_dict[key] = [x if isinstance(x, str) else torch.tensor(x) for x in items_dict[key]]
return items_dict
def to_parquet_with_hf_images(
df: pandas.DataFrame, path: Path, features: datasets.Features | None = None
) -> None:
"""This function correctly writes to parquet a panda DataFrame that contains images encoded by HF dataset.
This way, it can be loaded by HF dataset and correctly formatted images are returned.
Args:
df: DataFrame to write to parquet.
path: Path to write the parquet file.
features: Optional HuggingFace Features schema. If provided, ensures image columns
are properly typed as Image() in the parquet schema.
"""
# TODO(qlhoest): replace this weird synthax by `df.to_parquet(path)` only
ds = datasets.Dataset.from_dict(df.to_dict(orient="list"), features=features)
ds.to_parquet(path)
def item_to_torch(item: dict) -> dict:
"""Convert all items in a dictionary to PyTorch tensors where appropriate.
This function is used to convert an item from a streaming dataset to PyTorch tensors.
Args:
item (dict): Dictionary of items from a dataset.
Returns:
dict: Dictionary with all tensor-like items converted to torch.Tensor.
"""
for key, val in item.items():
if isinstance(val, (np.ndarray | list)) and key not in ["task"]:
# Convert numpy arrays and lists to torch tensors
item[key] = torch.tensor(val)
return item

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,217 @@
#!/usr/bin/env python
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
from collections.abc import Callable
from pathlib import Path
import datasets
import torch
import torch.utils
from lerobot.datasets.compute_stats import aggregate_stats
from lerobot.datasets.feature_utils import get_hf_features_from_features
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.datasets.video_utils import VideoFrame
from lerobot.utils.constants import HF_LEROBOT_HOME
logger = logging.getLogger(__name__)
class MultiLeRobotDataset(torch.utils.data.Dataset):
"""A dataset consisting of multiple underlying `LeRobotDataset`s.
The underlying `LeRobotDataset`s are effectively concatenated, and this class adopts much of the API
structure of `LeRobotDataset`.
"""
def __init__(
self,
repo_ids: list[str],
root: str | Path | None = None,
episodes: dict | None = None,
image_transforms: Callable | None = None,
delta_timestamps: dict[str, list[float]] | None = None,
tolerances_s: dict | None = None,
download_videos: bool = True,
video_backend: str | None = None,
):
super().__init__()
self.repo_ids = repo_ids
self.root = Path(root) if root else HF_LEROBOT_HOME
self.tolerances_s = tolerances_s if tolerances_s else dict.fromkeys(repo_ids, 0.0001)
# Construct the underlying datasets passing everything but `transform` and `delta_timestamps` which
# are handled by this class.
self._datasets = [
LeRobotDataset(
repo_id,
root=self.root / repo_id,
episodes=episodes[repo_id] if episodes else None,
image_transforms=image_transforms,
delta_timestamps=delta_timestamps,
tolerance_s=self.tolerances_s[repo_id],
download_videos=download_videos,
video_backend=video_backend,
)
for repo_id in repo_ids
]
# Disable any data keys that are not common across all of the datasets. Note: we may relax this
# restriction in future iterations of this class. For now, this is necessary at least for being able
# to use PyTorch's default DataLoader collate function.
self.disabled_features = set()
intersection_features = set(self._datasets[0].features)
for ds in self._datasets:
intersection_features.intersection_update(ds.features)
if len(intersection_features) == 0:
raise RuntimeError(
"Multiple datasets were provided but they had no keys common to all of them. "
"The multi-dataset functionality currently only keeps common keys."
)
for repo_id, ds in zip(self.repo_ids, self._datasets, strict=True):
extra_keys = set(ds.features).difference(intersection_features)
if extra_keys:
logger.warning(
f"keys {extra_keys} of {repo_id} were disabled as they are not contained in all the "
"other datasets."
)
self.disabled_features.update(extra_keys)
self.image_transforms = image_transforms
self.delta_timestamps = delta_timestamps
# TODO(rcadene, aliberts): We should not perform this aggregation for datasets
# with multiple robots of different ranges. Instead we should have one normalization
# per robot.
self.stats = aggregate_stats([dataset.meta.stats for dataset in self._datasets])
@property
def repo_id_to_index(self):
"""Return a mapping from dataset repo_id to a dataset index automatically created by this class.
This index is incorporated as a data key in the dictionary returned by `__getitem__`.
"""
return {repo_id: i for i, repo_id in enumerate(self.repo_ids)}
@property
def fps(self) -> int:
"""Frames per second used during data collection.
NOTE: Fow now, this relies on a check in __init__ to make sure all sub-datasets have the same info.
"""
return self._datasets[0].meta.info["fps"]
@property
def video(self) -> bool:
"""Returns True if this dataset loads video frames from mp4 files.
Returns False if it only loads images from png files.
NOTE: Fow now, this relies on a check in __init__ to make sure all sub-datasets have the same info.
"""
return self._datasets[0].meta.info.get("video", False)
@property
def features(self) -> datasets.Features:
features = {}
for dataset in self._datasets:
features.update(
{
k: v
for k, v in get_hf_features_from_features(dataset.features).items()
if k not in self.disabled_features
}
)
return features
@property
def camera_keys(self) -> list[str]:
"""Keys to access image and video stream from cameras."""
keys = []
for key, feats in self.features.items():
if isinstance(feats, (datasets.Image | VideoFrame)):
keys.append(key)
return keys
@property
def video_frame_keys(self) -> list[str]:
"""Keys to access video frames that requires to be decoded into images.
Note: It is empty if the dataset contains images only,
or equal to `self.cameras` if the dataset contains videos only,
or can even be a subset of `self.cameras` in a case of a mixed image/video dataset.
"""
video_frame_keys = []
for key, feats in self.features.items():
if isinstance(feats, VideoFrame):
video_frame_keys.append(key)
return video_frame_keys
@property
def num_frames(self) -> int:
"""Number of samples/frames."""
return sum(d.num_frames for d in self._datasets)
@property
def num_episodes(self) -> int:
"""Number of episodes."""
return sum(d.num_episodes for d in self._datasets)
@property
def tolerance_s(self) -> float:
"""Tolerance in seconds used to discard loaded frames when their timestamps
are not close enough from the requested frames. It is only used when `delta_timestamps`
is provided or when loading video frames from mp4 files.
"""
# 1e-4 to account for possible numerical error
return 1 / self.fps - 1e-4
def __len__(self):
return self.num_frames
def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
if idx >= len(self):
raise IndexError(f"Index {idx} out of bounds.")
# Determine which dataset to get an item from based on the index.
start_idx = 0
dataset_idx = 0
for dataset in self._datasets:
if idx >= start_idx + dataset.num_frames:
start_idx += dataset.num_frames
dataset_idx += 1
continue
break
else:
raise AssertionError("We expect the loop to break out as long as the index is within bounds.")
item = self._datasets[dataset_idx][idx - start_idx]
item["dataset_index"] = torch.tensor(dataset_idx)
for data_key in self.disabled_features:
if data_key in item:
del item[data_key]
return item
def __repr__(self):
return (
f"{self.__class__.__name__}(\n"
f" Repository IDs: '{self.repo_ids}',\n"
f" Number of Samples: {self.num_frames},\n"
f" Number of Episodes: {self.num_episodes},\n"
f" Type: {'video (.mp4)' if self.video else 'image (.png)'},\n"
f" Recorded Frames per Second: {self.fps},\n"
f" Camera Keys: {self.camera_keys},\n"
f" Video Frame Keys: {self.video_frame_keys if self.video else 'N/A'},\n"
f" Transformations: {self.image_transforms},\n"
f")"
)

View File

@@ -17,7 +17,7 @@ from collections.abc import Sequence
from typing import Any
from lerobot.configs.types import PipelineFeatureType
from lerobot.datasets.utils import hw_to_dataset_features
from lerobot.datasets.feature_utils import hw_to_dataset_features
from lerobot.processor import DataProcessorPipeline
from lerobot.types import RobotAction, RobotObservation
from lerobot.utils.constants import ACTION, OBS_IMAGES, OBS_STATE, OBS_STR
@@ -44,11 +44,11 @@ def create_initial_features(
return features
# Helper to filter state/action keys based on regex patterns.
def should_keep(key: str, patterns: tuple[str]) -> bool:
# Helper to filter state/action keys based on compiled regex patterns.
def should_keep(key: str, patterns: tuple[re.Pattern] | None) -> bool:
if patterns is None:
return True
return any(re.search(pat, key) for pat in patterns)
return any(pat.search(key) for pat in patterns)
def strip_prefix(key: str, prefixes_to_strip: tuple[str]) -> str:
@@ -89,6 +89,8 @@ def aggregate_pipeline_dataset_features(
Returns:
A dictionary of features formatted for a Hugging Face LeRobot Dataset.
"""
compiled_patterns = tuple(re.compile(p) for p in patterns) if patterns is not None else None
all_features = pipeline.transform_features(initial_features)
# Intermediate storage for categorized and filtered features.
@@ -120,7 +122,7 @@ def aggregate_pipeline_dataset_features(
# 2. Apply filtering rules.
if is_image and not use_videos:
continue
if not is_image and not should_keep(key, patterns):
if not is_image and not should_keep(key, compiled_patterns):
continue
# 3. Add the feature to the appropriate group with a clean name.

View File

@@ -13,10 +13,13 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
from collections.abc import Iterator
import torch
logger = logging.getLogger(__name__)
class EpisodeAwareSampler:
def __init__(
@@ -39,13 +42,35 @@ class EpisodeAwareSampler:
drop_n_last_frames: Number of frames to drop from the end of each episode.
shuffle: Whether to shuffle the indices.
"""
if drop_n_first_frames < 0:
raise ValueError(f"drop_n_first_frames must be >= 0, got {drop_n_first_frames}")
if drop_n_last_frames < 0:
raise ValueError(f"drop_n_last_frames must be >= 0, got {drop_n_last_frames}")
indices = []
for episode_idx, (start_index, end_index) in enumerate(
zip(dataset_from_indices, dataset_to_indices, strict=True)
):
if episode_indices_to_use is None or episode_idx in episode_indices_to_use:
ep_length = end_index - start_index
if drop_n_first_frames + drop_n_last_frames >= ep_length:
logger.warning(
"Episode %d has %d frames but drop_n_first_frames=%d and "
"drop_n_last_frames=%d removes all frames. Skipping.",
episode_idx,
ep_length,
drop_n_first_frames,
drop_n_last_frames,
)
continue
indices.extend(range(start_index + drop_n_first_frames, end_index - drop_n_last_frames))
if not indices:
raise ValueError(
"No valid frames remain after applying drop_n_first_frames and drop_n_last_frames. "
"All episodes were either filtered out or had too few frames."
)
self.indices = indices
self.shuffle = shuffle

View File

@@ -13,7 +13,8 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from collections.abc import Callable, Generator, Iterator
from collections import deque
from collections.abc import Callable, Generator, Iterable, Iterator
from pathlib import Path
import datasets
@@ -21,16 +22,13 @@ import numpy as np
import torch
from datasets import load_dataset
from lerobot.datasets.lerobot_dataset import CODEBASE_VERSION, LeRobotDatasetMetadata
from lerobot.datasets.dataset_metadata import CODEBASE_VERSION, LeRobotDatasetMetadata
from lerobot.datasets.feature_utils import get_delta_indices
from lerobot.datasets.io_utils import item_to_torch
from lerobot.datasets.utils import (
Backtrackable,
LookAheadError,
LookBackError,
check_version_compatibility,
find_float_index,
get_delta_indices,
is_float_in_list,
item_to_torch,
safe_shard,
)
from lerobot.datasets.video_utils import (
@@ -40,6 +38,164 @@ from lerobot.datasets.video_utils import (
from lerobot.utils.constants import HF_LEROBOT_HOME, LOOKAHEAD_BACKTRACKTABLE, LOOKBACK_BACKTRACKTABLE
class LookBackError(Exception):
"""
Exception raised when trying to look back in the history of a Backtrackable object.
"""
pass
class LookAheadError(Exception):
"""
Exception raised when trying to look ahead in the future of a Backtrackable object.
"""
pass
class Backtrackable[T]:
"""
Wrap any iterator/iterable so you can step back up to `history` items
and look ahead up to `lookahead` items.
This is useful for streaming datasets where you need to access previous and future items
but can't load the entire dataset into memory.
Example:
-------
```python
ds = load_dataset("c4", "en", streaming=True, split="train")
rev = Backtrackable(ds, history=3, lookahead=2)
x0 = next(rev) # forward
x1 = next(rev)
x2 = next(rev)
# Look ahead
x3_peek = rev.peek_ahead(1) # next item without moving cursor
x4_peek = rev.peek_ahead(2) # two items ahead
# Look back
x1_again = rev.peek_back(1) # previous item without moving cursor
x0_again = rev.peek_back(2) # two items back
# Move backward
x1_back = rev.prev() # back one step
next(rev) # returns x2, continues forward from where we were
```
"""
__slots__ = ("_source", "_back_buf", "_ahead_buf", "_cursor", "_history", "_lookahead")
def __init__(self, iterable: Iterable[T], *, history: int = 1, lookahead: int = 0):
if history < 1:
raise ValueError("history must be >= 1")
if lookahead <= 0:
raise ValueError("lookahead must be > 0")
self._source: Iterator[T] = iter(iterable)
self._back_buf: deque[T] = deque(maxlen=history)
self._ahead_buf: deque[T] = deque(maxlen=lookahead) if lookahead > 0 else deque()
self._cursor: int = 0
self._history = history
self._lookahead = lookahead
def __iter__(self) -> "Backtrackable[T]":
return self
def __next__(self) -> T:
# If we've stepped back, consume from back buffer first
if self._cursor < 0: # -1 means "last item", etc.
self._cursor += 1
return self._back_buf[self._cursor]
# If we have items in the ahead buffer, use them first
item = self._ahead_buf.popleft() if self._ahead_buf else next(self._source)
# Add current item to back buffer and reset cursor
self._back_buf.append(item)
self._cursor = 0
return item
def prev(self) -> T:
"""
Step one item back in history and return it.
Raises IndexError if already at the oldest buffered item.
"""
if len(self._back_buf) + self._cursor <= 1:
raise LookBackError("At start of history")
self._cursor -= 1
return self._back_buf[self._cursor]
def peek_back(self, n: int = 1) -> T:
"""
Look `n` items back (n=1 == previous item) without moving the cursor.
"""
if n < 0 or n + 1 > len(self._back_buf) + self._cursor:
raise LookBackError("peek_back distance out of range")
return self._back_buf[self._cursor - (n + 1)]
def peek_ahead(self, n: int = 1) -> T:
"""
Look `n` items ahead (n=1 == next item) without moving the cursor.
Fills the ahead buffer if necessary.
"""
if n < 1:
raise LookAheadError("peek_ahead distance must be 1 or more")
elif n > self._lookahead:
raise LookAheadError("peek_ahead distance exceeds lookahead limit")
# Fill ahead buffer if we don't have enough items
while len(self._ahead_buf) < n:
try:
item = next(self._source)
self._ahead_buf.append(item)
except StopIteration as err:
raise LookAheadError("peek_ahead: not enough items in source") from err
return self._ahead_buf[n - 1]
def history(self) -> list[T]:
"""
Return a copy of the buffered history (most recent last).
The list length ≤ `history` argument passed at construction.
"""
if self._cursor == 0:
return list(self._back_buf)
# When cursor<0, slice so the order remains chronological
return list(self._back_buf)[: self._cursor or None]
def can_peek_back(self, steps: int = 1) -> bool:
"""
Check if we can go back `steps` items without raising an IndexError.
"""
return steps <= len(self._back_buf) + self._cursor
def can_peek_ahead(self, steps: int = 1) -> bool:
"""
Check if we can peek ahead `steps` items.
This may involve trying to fill the ahead buffer.
"""
if self._lookahead > 0 and steps > self._lookahead:
return False
# Try to fill ahead buffer to check if we can peek that far
try:
while len(self._ahead_buf) < steps:
if self._lookahead > 0 and len(self._ahead_buf) >= self._lookahead:
return False
item = next(self._source)
self._ahead_buf.append(item)
return True
except StopIteration:
return False
class StreamingLeRobotDataset(torch.utils.data.IterableDataset):
"""LeRobotDataset with streaming capabilities.
@@ -99,7 +255,9 @@ class StreamingLeRobotDataset(torch.utils.data.IterableDataset):
Args:
repo_id (str): This is the repo id that will be used to fetch the dataset.
root (Path | None, optional): Local directory to use for downloading/writing files.
root (Path | None, optional): Local directory to use for local datasets. When omitted, Hub
metadata is resolved through a revision-safe snapshot cache under
``$HF_LEROBOT_HOME/hub``.
episodes (list[int] | None, optional): If specified, this will only load episodes specified by
their episode_index in this list.
image_transforms (Callable | None, optional): Transform to apply to image data.
@@ -115,7 +273,8 @@ class StreamingLeRobotDataset(torch.utils.data.IterableDataset):
"""
super().__init__()
self.repo_id = repo_id
self.root = Path(root) if root else HF_LEROBOT_HOME / repo_id
self._requested_root = Path(root) if root else None
self.root = self._requested_root if self._requested_root is not None else HF_LEROBOT_HOME / repo_id
self.streaming_from_local = root is not None
self.image_transforms = image_transforms
@@ -132,12 +291,15 @@ class StreamingLeRobotDataset(torch.utils.data.IterableDataset):
# We cache the video decoders to avoid re-initializing them at each frame (avoiding a ~10x slowdown)
self.video_decoder_cache = None
self.root.mkdir(exist_ok=True, parents=True)
if self._requested_root is not None:
self.root.mkdir(exist_ok=True, parents=True)
# Load metadata
self.meta = LeRobotDatasetMetadata(
self.repo_id, self.root, self.revision, force_cache_sync=force_cache_sync
self.repo_id, self._requested_root, self.revision, force_cache_sync=force_cache_sync
)
self.root = self.meta.root
self.revision = self.meta.revision
# Check version
check_version_compatibility(self.repo_id, self.meta._version, CODEBASE_VERSION)

File diff suppressed because it is too large Load Diff

View File

@@ -37,6 +37,8 @@ import torchvision
from datasets.features.features import register_feature
from PIL import Image
logger = logging.getLogger(__name__)
# List of hardware encoders to probe for auto-selection. Availability depends on the platform and FFmpeg build.
# Determines the order of preference for auto-selection when vcodec="auto" is used.
HW_ENCODERS = [
@@ -94,7 +96,7 @@ def detect_available_hw_encoders() -> list[str]:
av.codec.Codec(codec_name, "w")
available.append(codec_name)
except Exception: # nosec B110
pass # nosec B110
logger.debug("HW encoder '%s' not available", codec_name) # nosec B110
return available
@@ -103,14 +105,14 @@ def resolve_vcodec(vcodec: str) -> str:
if vcodec not in VALID_VIDEO_CODECS:
raise ValueError(f"Invalid vcodec '{vcodec}'. Must be one of: {sorted(VALID_VIDEO_CODECS)}")
if vcodec != "auto":
logging.info(f"Using video codec: {vcodec}")
logger.info(f"Using video codec: {vcodec}")
return vcodec
available = detect_available_hw_encoders()
for encoder in HW_ENCODERS:
if encoder in available:
logging.info(f"Auto-selected video codec: {encoder}")
logger.info(f"Auto-selected video codec: {encoder}")
return encoder
logging.info("No hardware encoder available, falling back to software encoder 'libsvtav1'")
logger.info("No hardware encoder available, falling back to software encoder 'libsvtav1'")
return "libsvtav1"
@@ -118,7 +120,7 @@ def get_safe_default_codec():
if importlib.util.find_spec("torchcodec"):
return "torchcodec"
else:
logging.warning(
logger.warning(
"'torchcodec' is not available in your platform, falling back to 'pyav' as a default decoder"
)
return "pyav"
@@ -208,7 +210,7 @@ def decode_video_frames_torchvision(
for frame in reader:
current_ts = frame["pts"]
if log_loaded_timestamps:
logging.info(f"frame loaded at timestamp={current_ts:.4f}")
logger.info(f"frame loaded at timestamp={current_ts:.4f}")
loaded_frames.append(frame["data"])
loaded_ts.append(current_ts)
if current_ts >= last_ts:
@@ -244,7 +246,7 @@ def decode_video_frames_torchvision(
closest_ts = loaded_ts[argmin_]
if log_loaded_timestamps:
logging.info(f"{closest_ts=}")
logger.info(f"{closest_ts=}")
# convert to the pytorch format which is float32 in [0,1] range (and channel first)
closest_frames = closest_frames.type(torch.float32) / 255
@@ -348,7 +350,7 @@ def decode_video_frames_torchcodec(
loaded_frames.append(frame)
loaded_ts.append(pts.item())
if log_loaded_timestamps:
logging.info(f"Frame loaded at timestamp={pts:.4f}")
logger.info(f"Frame loaded at timestamp={pts:.4f}")
query_ts = torch.tensor(timestamps)
loaded_ts = torch.tensor(loaded_ts)
@@ -374,7 +376,7 @@ def decode_video_frames_torchcodec(
closest_ts = loaded_ts[argmin_]
if log_loaded_timestamps:
logging.info(f"{closest_ts=}")
logger.info(f"{closest_ts=}")
# convert to float32 in [0,1] range
closest_frames = (closest_frames / 255.0).type(torch.float32)
@@ -408,14 +410,14 @@ def encode_video_frames(
imgs_dir = Path(imgs_dir)
if video_path.exists() and not overwrite:
logging.warning(f"Video file already exists: {video_path}. Skipping encoding.")
logger.warning(f"Video file already exists: {video_path}. Skipping encoding.")
return
video_path.parent.mkdir(parents=True, exist_ok=True)
# Encoders/pixel formats incompatibility check
if (vcodec == "libsvtav1" or vcodec == "hevc") and pix_fmt == "yuv444p":
logging.warning(
logger.warning(
f"Incompatible pixel format 'yuv444p' for codec {vcodec}, auto-selecting format 'yuv420p'"
)
pix_fmt = "yuv420p"
@@ -484,42 +486,42 @@ def encode_video_frames(
raise OSError(f"Video encoding did not work. File not found: {video_path}.")
def concatenate_video_files(
input_video_paths: list[Path | str], output_video_path: Path, overwrite: bool = True
def concatenate_media_files(
input_media_paths: list[Path | str], output_media_path: Path, overwrite: bool = True
):
"""
Concatenate multiple video files into a single video file using pyav.
Concatenate multiple media files (video & audio) into a single media file using pyav.
This function takes a list of video input file paths and concatenates them into a single
output video file. It uses ffmpeg's concat demuxer with stream copy mode for fast
This function takes a list of input media file paths and concatenates them into a single
output media file. It uses ffmpeg's concat demuxer with stream copy mode for fast
concatenation without re-encoding.
Args:
input_video_paths: Ordered list of input video file paths to concatenate.
output_video_path: Path to the output video file.
overwrite: Whether to overwrite the output video file if it already exists. Default is True.
input_media_paths: Ordered list of input media file paths to concatenate.
output_media_path: Path to the output media file.
overwrite: Whether to overwrite the output media file if it already exists. Default is True.
Note:
- Creates a temporary directory for intermediate files that is cleaned up after use.
- Uses ffmpeg's concat demuxer which requires all input videos to have the same
- Creates a temporary .ffconcat file and container audio/video file that are cleaned up after use.
- Uses ffmpeg's concat demuxer which requires all input media files to have the same
codec, resolution, and frame rate for proper concatenation.
"""
output_video_path = Path(output_video_path)
output_media_path = Path(output_media_path)
if output_video_path.exists() and not overwrite:
logging.warning(f"Video file already exists: {output_video_path}. Skipping concatenation.")
if output_media_path.exists() and not overwrite:
logging.warning(f"Media file already exists: {output_media_path}. Skipping concatenation.")
return
output_video_path.parent.mkdir(parents=True, exist_ok=True)
output_media_path.parent.mkdir(parents=True, exist_ok=True)
if len(input_video_paths) == 0:
raise FileNotFoundError("No input video paths provided.")
if len(input_media_paths) == 0:
raise FileNotFoundError("No input media paths provided.")
# Create a temporary .ffconcat file to list the input video paths
# Create a temporary .ffconcat file to list the input media paths
with tempfile.NamedTemporaryFile(mode="w", suffix=".ffconcat", delete=False) as tmp_concatenate_file:
tmp_concatenate_file.write("ffconcat version 1.0\n")
for input_path in input_video_paths:
for input_path in input_media_paths:
tmp_concatenate_file.write(f"file '{str(input_path.resolve())}'\n")
tmp_concatenate_file.flush()
tmp_concatenate_path = tmp_concatenate_file.name
@@ -529,11 +531,12 @@ def concatenate_video_files(
tmp_concatenate_path, mode="r", format="concat", options={"safe": "0"}
) # safe = 0 allows absolute paths as well as relative paths
with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as tmp_named_file:
tmp_output_video_path = tmp_named_file.name
# Using an intermediate container to store the concatenated media file is necessary to avoid inplace concatenation read-write race conditions.
with tempfile.NamedTemporaryFile(suffix=output_media_path.suffix, delete=False) as tmp_named_file:
tmp_output_media_path = tmp_named_file.name
output_container = av.open(
tmp_output_video_path, mode="w", options={"movflags": "faststart"}
tmp_output_media_path, mode="w", options={"movflags": "faststart"}
) # faststart is to move the metadata to the beginning of the file to speed up loading
# Replicate input streams in output container
@@ -548,6 +551,7 @@ def concatenate_video_files(
stream_map[input_stream.index].time_base = input_stream.time_base
# Demux + remux packets (no re-encode)
last_dts = None
for packet in input_container.demux():
# Skip packets from un-mapped streams
if packet.stream.index not in stream_map:
@@ -556,6 +560,16 @@ def concatenate_video_files(
# Skip demux flushing packets
if packet.dts is None:
continue
else:
# Enforce strictly increasing decoding timestamps (DTS)
if last_dts is not None and packet.dts <= last_dts:
shift = last_dts - packet.dts + 1
packet.dts += shift
packet.pts += shift # Presenting timestamps (PTS) are the same as DTS here
logging.warning(
f"Non-monotonic DTS; previous: {last_dts}, current: {packet.dts - shift}; changing to {packet.dts}. This may result in incorrect timestamps in the output file."
)
last_dts = packet.dts
output_stream = stream_map[packet.stream.index]
packet.stream = output_stream
@@ -563,7 +577,7 @@ def concatenate_video_files(
input_container.close()
output_container.close()
shutil.move(tmp_output_video_path, output_video_path)
shutil.move(tmp_output_media_path, output_media_path)
Path(tmp_concatenate_path).unlink()
@@ -693,7 +707,7 @@ class _CameraEncoderThread(threading.Thread):
self.result_queue.put(("ok", None))
except Exception as e:
logging.error(f"Encoder thread error: {e}")
logger.error(f"Encoder thread error: {e}")
if container is not None:
with contextlib.suppress(Exception):
container.close()
@@ -739,6 +753,7 @@ class StreamingVideoEncoder:
self._video_paths: dict[str, Path] = {}
self._dropped_frames: dict[str, int] = {}
self._episode_active = False
self._closed = False
def start_episode(self, video_keys: list[str], temp_dir: Path) -> None:
"""Start encoder threads for a new episode.
@@ -819,7 +834,7 @@ class StreamingVideoEncoder:
count = self._dropped_frames[video_key]
# Log periodically to avoid spam (1st, then every 10th)
if count == 1 or count % 10 == 0:
logging.warning(
logger.warning(
f"Encoder queue full for {video_key}, dropped {count} frame(s). "
f"Consider using vcodec='auto' for hardware encoding or increasing encoder_queue_maxsize."
)
@@ -841,7 +856,7 @@ class StreamingVideoEncoder:
# Report dropped frames
for video_key, count in self._dropped_frames.items():
if count > 0:
logging.warning(f"Episode finished with {count} dropped frame(s) for {video_key}.")
logger.warning(f"Episode finished with {count} dropped frame(s) for {video_key}.")
# Send sentinel to all queues
for video_key in self._frame_queues:
@@ -851,7 +866,7 @@ class StreamingVideoEncoder:
for video_key in self._threads:
self._threads[video_key].join(timeout=120)
if self._threads[video_key].is_alive():
logging.error(f"Encoder thread for {video_key} did not finish in time")
logger.error(f"Encoder thread for {video_key} did not finish in time")
self._stop_events[video_key].set()
self._threads[video_key].join(timeout=5)
results[video_key] = (self._video_paths[video_key], None)
@@ -863,7 +878,7 @@ class StreamingVideoEncoder:
raise RuntimeError(f"Encoder thread for {video_key} failed: {data}")
results[video_key] = (self._video_paths[video_key], data)
except queue.Empty:
logging.error(f"No result from encoder thread for {video_key}")
logger.error(f"No result from encoder thread for {video_key}")
results[video_key] = (self._video_paths[video_key], None)
self._cleanup()
@@ -893,8 +908,11 @@ class StreamingVideoEncoder:
def close(self) -> None:
"""Close the encoder, canceling any in-progress episode."""
if self._closed:
return
if self._episode_active:
self.cancel_episode()
self._closed = True
def _cleanup(self) -> None:
"""Clean up queues and thread tracking dicts."""
@@ -941,38 +959,6 @@ with warnings.catch_warnings():
register_feature(VideoFrame, "VideoFrame")
def get_audio_info(video_path: Path | str) -> dict:
# Set logging level
logging.getLogger("libav").setLevel(av.logging.WARNING)
# Getting audio stream information
audio_info = {}
with av.open(str(video_path), "r") as audio_file:
try:
audio_stream = audio_file.streams.audio[0]
except IndexError:
# Reset logging level
av.logging.restore_default_callback()
return {"has_audio": False}
audio_info["audio.channels"] = audio_stream.channels
audio_info["audio.codec"] = audio_stream.codec.canonical_name
# In an ideal loseless case : bit depth x sample rate x channels = bit rate.
# In an actual compressed case, the bit rate is set according to the compression level : the lower the bit rate, the more compression is applied.
audio_info["audio.bit_rate"] = audio_stream.bit_rate
audio_info["audio.sample_rate"] = audio_stream.sample_rate # Number of samples per second
# In an ideal loseless case : fixed number of bits per sample.
# In an actual compressed case : variable number of bits per sample (often reduced to match a given depth rate).
audio_info["audio.bit_depth"] = audio_stream.format.bits
audio_info["audio.channel_layout"] = audio_stream.layout.name
audio_info["has_audio"] = True
# Reset logging level
av.logging.restore_default_callback()
return audio_info
def get_video_info(video_path: Path | str) -> dict:
# Set logging level
logging.getLogger("libav").setLevel(av.logging.WARNING)
@@ -1002,9 +988,6 @@ def get_video_info(video_path: Path | str) -> dict:
# Reset logging level
av.logging.restore_default_callback()
# Adding audio stream information
video_info.update(**get_audio_info(video_path))
return video_info
@@ -1019,22 +1002,22 @@ def get_video_pixel_channels(pix_fmt: str) -> int:
raise ValueError("Unknown format")
def get_video_duration_in_s(video_path: Path | str) -> float:
def get_media_duration_in_s(media_path: Path | str, media_type: str = "video") -> float:
"""
Get the duration of a video file in seconds using PyAV.
Get the duration of a media file (video & audio) in seconds using PyAV.
Args:
video_path: Path to the video file.
media_path: Path to the media file.
Returns:
Duration of the video in seconds.
Duration of the media file in seconds.
"""
with av.open(str(video_path)) as container:
# Get the first video stream
video_stream = container.streams.video[0]
with av.open(str(media_path)) as container:
# Get the first stream
stream = container.streams.video[0] if media_type == "video" else container.streams.audio[0]
# Calculate duration: stream.duration * stream.time_base gives duration in seconds
if video_stream.duration is not None:
duration = float(video_stream.duration * video_stream.time_base)
if stream.duration is not None:
duration = float(stream.duration * stream.time_base)
else:
# Fallback to container duration if stream duration is not available
duration = float(container.duration / av.time_base)
@@ -1043,12 +1026,12 @@ def get_video_duration_in_s(video_path: Path | str) -> float:
class VideoEncodingManager:
"""
Context manager that ensures proper video encoding and data cleanup even if exceptions occur.
Context manager that ensures proper video and audio encoding and data cleanup even if exceptions occur.
This manager handles:
- Batch encoding for any remaining episodes when recording interrupted
- Cleaning up temporary image files from interrupted episodes
- Removing empty image directories
- Cleaning up temporary image and audio files from interrupted episodes
- Removing empty image and audio directories
Args:
dataset: The LeRobotDataset instance
@@ -1061,43 +1044,19 @@ class VideoEncodingManager:
return self
def __exit__(self, exc_type, exc_val, exc_tb):
streaming_encoder = getattr(self.dataset, "_streaming_encoder", None)
writer = self.dataset.writer
if writer is not None:
if exc_type is not None and writer._streaming_encoder is not None:
writer.cancel_pending_videos()
if streaming_encoder is not None:
# Handle streaming encoder cleanup
if exc_type is not None:
streaming_encoder.cancel_episode()
streaming_encoder.close()
elif self.dataset.episodes_since_last_encoding > 0:
# Handle any remaining episodes that haven't been batch encoded
if exc_type is not None:
logging.info("Exception occurred. Encoding remaining episodes before exit...")
else:
logging.info("Recording stopped. Encoding remaining episodes...")
# finalize() handles flush_pending_videos + parquet + metadata
self.dataset.finalize()
start_ep = self.dataset.num_episodes - self.dataset.episodes_since_last_encoding
end_ep = self.dataset.num_episodes
logging.info(
f"Encoding remaining {self.dataset.episodes_since_last_encoding} episodes, "
f"from episode {start_ep} to {end_ep - 1}"
)
self.dataset._batch_save_episode_video(start_ep, end_ep)
# Finalize the dataset to properly close all writers
self.dataset.finalize()
# Clean up episode images if recording was interrupted (only for non-streaming mode)
if exc_type is not None and streaming_encoder is None:
interrupted_episode_index = self.dataset.num_episodes
for key in self.dataset.meta.video_keys:
img_dir = self.dataset._get_image_file_path(
episode_index=interrupted_episode_index, image_key=key, frame_index=0
).parent
if img_dir.exists():
logging.debug(
f"Cleaning up interrupted episode images for episode {interrupted_episode_index}, camera {key}"
)
shutil.rmtree(img_dir)
# Clean up episode images if recording was interrupted (only for non-streaming mode)
if exc_type is not None and writer._streaming_encoder is None:
writer.cleanup_interrupted_episode(self.dataset.num_episodes)
else:
self.dataset.finalize()
# Clean up any remaining images directory if it's empty
img_dir = self.dataset.root / "images"
@@ -1105,8 +1064,20 @@ class VideoEncodingManager:
png_files = list(img_dir.rglob("*.png"))
if len(png_files) == 0:
shutil.rmtree(img_dir)
logging.debug("Cleaned up empty images directory")
logger.debug("Cleaned up empty images directory")
else:
logging.debug(f"Images directory is not empty, containing {len(png_files)} PNG files")
logger.debug(f"Images directory is not empty, containing {len(png_files)} PNG files")
# Clean up any remaining audio directory if it's empty
audio_dir = self.dataset.root / "raw_audio"
# Check for any remaining WAV files
wav_files = list(audio_dir.rglob("*.wav"))
if len(wav_files) == 0:
# Only remove the raw_audio directory if no WAV files remain
if audio_dir.exists():
shutil.rmtree(audio_dir)
logging.debug("Cleaned up empty audio directory")
else:
logging.debug(f"Audio directory is not empty, containing {len(wav_files)} WAV files")
return False # Don't suppress the original exception

View File

@@ -0,0 +1,17 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .configs import MicrophoneConfig
from .microphone import Microphone
from .utils import make_microphones_from_configs

View File

@@ -0,0 +1,28 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import abc
from dataclasses import dataclass
import draccus
@dataclass(kw_only=True)
class MicrophoneConfig(draccus.ChoiceRegistry, abc.ABC):
sample_rate: int | None = None
channels: list[int] | None = None
@property
def type(self) -> str:
return self.get_choice_name(self.__class__)

View File

@@ -0,0 +1,140 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import abc
from pathlib import Path
from threading import Barrier
from typing import Any
import numpy as np
from .configs import MicrophoneConfig
class Microphone(abc.ABC):
"""Base class for microphone implementations.
Defines a standard interface for microphone operations across different backends.
Subclasses must implement all abstract methods.
Manages basic microphone properties (sample rate, channels) and core operations:
- Connection/disconnection
- Start/stop recording
- Audio chunk reading
Attributes:
sample_rate (int | None): Configured sample rate in Hz
channels (list[int] | None): List of channel numbers to record
Example:
class MyMicrophone(Microphone):
def __init__(self, config): ...
@property
def is_connected(self) -> bool: ...
def connect(self): ...
# Plus other required methods
"""
def __init__(self, config: MicrophoneConfig):
"""Initialize the microphone with the given configuration.
Args:
config: Microphone configuration containing sample rate and channels.
"""
self.sample_rate: int | None = config.sample_rate
self.channels: list[int] | None = config.channels
@property
@abc.abstractmethod
def is_connected(self) -> bool:
"""Check if the microphone is currently connected.
Returns:
bool: True if the microphone is connected and ready to start recording,
False otherwise.
"""
pass
@property
@abc.abstractmethod
def is_recording(self) -> bool:
"""Check if the microphone is currently recording.
Returns:
bool: True if the microphone is recording, False otherwise.
"""
pass
@property
@abc.abstractmethod
def is_writing(self) -> bool:
"""Check if the microphone is currently writing to a file.
Returns:
bool: True if the microphone is writing to a file, False otherwise.
"""
pass
@staticmethod
@abc.abstractmethod
def find_microphones() -> list[dict[str, Any]]:
"""Detects available microphones connected to the system.
Returns:
List[Dict[str, Any]]: A list of dictionaries,
where each dictionary contains information about a detected microphone.
"""
pass
@abc.abstractmethod
def connect(self) -> None:
"""Establish connection to the microphone."""
pass
@abc.abstractmethod
def start_recording(
self,
output_file: str | Path | None = None,
multiprocessing: bool | None = False,
overwrite: bool | None = True,
barrier: Barrier | None = None,
) -> None:
"""Start recording audio from the microphone.
Args:
output_file: Optional path to save the recorded audio.
multiprocessing: If True, enables multiprocessing for recording. Defaults to multithreading otherwise.
overwrite: If True, overwrites existing files at output_file path.
barrier: If not None, ensures that multiple microphones start recording at the same time.
"""
pass
@abc.abstractmethod
def read(self) -> np.ndarray:
"""Capture and return a single audio chunk from the microphone.
Returns:
np.ndarray: Captured audio chunk as a numpy array.
"""
pass
@abc.abstractmethod
def stop_recording(self) -> None:
"""Stop recording audio from the microphone."""
pass
@abc.abstractmethod
def disconnect(self) -> None:
"""Disconnect the microphone and release any resources."""
pass

View File

@@ -0,0 +1,16 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .configuration_portaudio import PortAudioMicrophoneConfig
from .microphone_portaudio import PortAudioMicrophone

View File

@@ -0,0 +1,41 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from dataclasses import dataclass
from ..configs import MicrophoneConfig
@MicrophoneConfig.register_subclass("portaudio")
@dataclass
class PortAudioMicrophoneConfig(MicrophoneConfig):
"""Configuration class for PortAudio-based microphone devices.
This class provides configuration options for microphones accessed through PortAudio with the sounddevice Python package.
including device index, sample rate and channels.
Example configurations:
```python
# Basic configurations
PortAudioMicrophoneConfig(0, 16000, [1]) # Device index 0, 16000Hz, mono
PortAudioMicrophoneConfig(1, 44100, [1, 2]) # Device index 1, 44100Hz, stereo
```
Attributes:
microphone_index: Device index for the microphone.
sample_rate: Sample rate in Hz for the microphone.
channels: List of channel numbers to use for the microphone.
"""
microphone_index: int

View File

@@ -0,0 +1,394 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import abc
import time
from collections.abc import Callable
from threading import Event, Thread
from typing import Any
import numpy as np
from sounddevice import PortAudioError
from lerobot.utils.robot_utils import precise_sleep
# --- Interface definitions for InputStream ---
class IInputStream(abc.ABC):
@abc.abstractmethod
def __init__(
self,
samplerate: float | None = None,
blocksize: int | None = None,
device: int | str | None = None,
channels: int | None = None,
dtype: str | np.dtype | None = None,
latency: float | str | None = None,
callback: Callable[[Any, int, Any, Any], None] | None = None,
):
pass
@abc.abstractmethod
def start(self) -> None:
pass
@abc.abstractmethod
def stop(self) -> None:
pass
@abc.abstractmethod
def close(self) -> None:
pass
class ISounddeviceSDK(abc.ABC):
"""Interface defining the contract for the Sounddevice SDK."""
InputStream: type[IInputStream]
@abc.abstractmethod
def query_devices(self, device: int | str | None = None, kind: str | None = None) -> list[dict[str, Any]]:
pass
# --- Real SDK Adapter ---
class SounddeviceSDKAdapter(ISounddeviceSDK):
"""Adapts the real sounddevice library to the ISounddeviceSDK interface."""
_sounddevice = None
def __init__(self):
try:
import sounddevice
SounddeviceSDKAdapter._sounddevice = sounddevice
except ImportError as e:
raise ImportError("sounddevice library not found") from e
# --- Inner Class Implementation ---
class RealInputStream(IInputStream):
def __init__(
self,
samplerate: int | None = None,
blocksize: int | None = None,
device: int | None = None,
channels: int | None = None,
dtype: str | np.dtype | None = None,
latency: float | str | None = None,
callback: Callable[[Any, int, Any, Any], None] | None = None,
):
import sounddevice
self._input_stream = sounddevice.InputStream(
samplerate=samplerate,
blocksize=blocksize,
device=device,
channels=channels,
dtype=dtype,
latency=latency,
callback=callback,
)
def start(self) -> None:
self._input_stream.start()
def stop(self) -> None:
self._input_stream.stop()
def close(self) -> None:
self._input_stream.close()
def __del__(self):
self._input_stream.stop()
self._input_stream.close()
@property
def active(self) -> bool:
return self._input_stream.active
@property
def stopped(self) -> bool:
return self._input_stream.stopped
@property
def closed(self) -> bool:
return self._input_stream.closed
InputStream = RealInputStream
def query_devices(self, device: int | str | None = None, kind: str | None = None) -> list[dict[str, Any]]:
return SounddeviceSDKAdapter._sounddevice.query_devices(device, kind)
# Emulates a 48kHz stereo microphone
VALID_DTYPE = {
"float32",
"int32",
"int16",
"int8",
"uint8",
np.float32,
np.int32,
np.int16,
np.int8,
np.uint8,
}
VALID_LATENCY = {"low", "high"}
VALID_DEVICES = [
{
"index": 0,
"name": "Built-in Microphone",
"hostapi": 0,
"max_input_channels": 2,
"max_output_channels": 0,
"default_low_input_latency": 0.01,
"default_low_output_latency": 0.001,
"default_high_input_latency": 0.1,
"default_high_output_latency": 0.01,
"default_samplerate": 48000.0,
},
{
"index": 1,
"name": "Built-in Output",
"hostapi": 0,
"max_input_channels": 0,
"max_output_channels": 2,
"default_low_input_latency": 0.04,
"default_low_output_latency": 0.04,
"default_high_input_latency": 0.12,
"default_high_output_latency": 0.12,
"default_samplerate": 48000.0,
},
{
"index": 2,
"name": "USB Audio Device",
"hostapi": 0,
"max_input_channels": 1,
"max_output_channels": 0,
"default_low_input_latency": 0.03,
"default_low_output_latency": 0.01,
"default_high_input_latency": 0.04,
"default_high_output_latency": 0.03,
"default_samplerate": 16000.0,
},
]
# -- Fake SDK Adapter ---
class FakeSounddeviceSDKAdapter(ISounddeviceSDK):
"""Implements the ISounddeviceSDK interface with fake behaviour for testing."""
# --- Inner Class Implementation ---
class FakeInputStream(IInputStream):
def __init__(
self,
samplerate: float | None = None,
blocksize: int | None = None,
device: int | str | None = None,
channels: int | None = None,
dtype: str | None = None,
latency: str | None = None,
callback: Callable[[Any, int, Any, Any], None] | None = None,
):
self.samplerate = samplerate
self.blocksize = blocksize
self.device = device
self.channels = channels
self.dtype = dtype
self.latency = latency
self.callback = callback
self._validate_settings()
self._active = False
self._closed = False
if self.callback is not None:
self._streaming_thread = Thread(target=self._streaming_loop, daemon=True)
self._streaming_thread_stop_event = Event()
@property
def active(self) -> bool:
"""True when the stream is active, False otherwise."""
return self._active
@property
def stopped(self) -> bool:
"""True when the stream is stopped, False otherwise."""
return not self._active
@property
def closed(self) -> bool:
"""True after a call to close(), False otherwise."""
return self._closed
def _get_device_info(self):
"""Returns the device info for the device."""
for device in VALID_DEVICES:
if (isinstance(self.device, int) and device["index"] == self.device) or (
isinstance(self.device, str) and device["name"] == self.device
):
return device
raise PortAudioError(f"No input device matching {self.device}")
def _validate_device(self):
"""Validates the device against the valid devices."""
valid_device_indices = [device["index"] for device in VALID_DEVICES]
valid_device_names = [device["name"] for device in VALID_DEVICES]
if self.device is not None:
if isinstance(self.device, (int, str)):
# Check if device index is valid
if isinstance(self.device, int) and self.device not in valid_device_indices:
raise PortAudioError(f"Error querying device {self.device}")
# Check if device name is valid
if isinstance(self.device, str) and self.device not in valid_device_names:
raise PortAudioError(f"No input device matching {self.device}")
else:
raise PortAudioError(f"Device must be int or str, got {type(self.device)}")
else:
# Default to first input device
input_devices = [d for d in VALID_DEVICES if d["max_input_channels"] > 0]
if input_devices:
self.device = input_devices[0]["index"]
def _validate_samplerate(self):
"""Validates the samplerate against the device's maximum samplerate."""
device_info = self._get_device_info()
if self.samplerate is None:
self.samplerate = device_info["default_samplerate"]
elif self.samplerate > device_info["default_samplerate"] or self.samplerate < 1000:
raise PortAudioError("Error opening InputStream: Invalid sample rate")
def _validate_channels(self):
"""Validates the channels against the device's maximum channels."""
device_info = self._get_device_info()
if self.channels is None:
self.channels = device_info["max_input_channels"]
elif self.channels > device_info["max_input_channels"] or self.channels < 1:
raise PortAudioError("Error opening InputStream: Invalid number of channels")
def _validate_dtype(self):
"""Validates the dtype against the valid dtypes."""
if self.dtype is not None:
if self.dtype not in VALID_DTYPE:
raise PortAudioError("Invalid input sample format")
else:
self.dtype = "float32" # Default dtype
def _validate_latency(self):
"""Validates the latency against the valid latencies."""
if self.latency is not None:
if self.latency not in VALID_LATENCY:
raise PortAudioError("Invalid latency")
else:
self.latency = "low" # Default latency
if isinstance(self.latency, str):
device_info = self._get_device_info()
if self.latency == "low":
self.latency = device_info["default_low_input_latency"]
elif self.latency == "high":
self.latency = device_info["default_high_input_latency"]
def _validate_settings(self):
"""Validates the input parameters against available devices and valid options."""
self._validate_device()
self._validate_samplerate()
self._validate_channels()
self._validate_dtype()
self._validate_latency()
def _simulated_audio_data(self) -> np.ndarray:
"""Generates a simulated audio signal for testing purposes with proper value ranges."""
duration_samples = int(self.samplerate * self.latency)
# Generate output according to dtype
if self.dtype in {"float32", np.float32}:
# Generate values between -1 and 1 for float32
data = np.random.uniform(-1.0, 1.0, (duration_samples, self.channels)).astype(self.dtype)
else:
# Use np.iinfo to get proper range for integer types
info = np.iinfo(self.dtype)
data = np.random.randint(
info.min, info.max + 1, (duration_samples, self.channels), dtype=self.dtype
)
return data
def _streaming_loop(self):
if self.callback is not None:
while not self._streaming_thread_stop_event.is_set():
precise_sleep(self.latency)
tmp_data = self._simulated_audio_data()
self.callback(
tmp_data,
len(tmp_data),
time.perf_counter(),
None,
)
def start(self) -> None:
"""Start the fake input stream."""
if not self.active and self.callback is not None:
self._streaming_thread.start()
self._active = True
def stop(self) -> None:
"""Stop the fake input stream."""
if self.callback is not None:
self._streaming_thread_stop_event.set()
self._streaming_thread.join()
self._active = False
def close(self) -> None:
"""Close the fake input stream."""
if self.active and self.callback is not None:
self.stop()
self._active = False
self._closed = True
def __del__(self):
self.close()
InputStream = FakeInputStream
def query_devices(self, device: int | str | None = None, kind: str | None = None) -> list[dict[str, Any]]:
"""Returns a realistic list of audio devices including speakers and microphones."""
if device is not None:
# Return specific device
for valid_device in VALID_DEVICES:
if (isinstance(device, int) and valid_device["index"] == device) or (
isinstance(device, str) and valid_device["name"] == device
):
return valid_device
raise PortAudioError(f"Error querying device {device}")
elif kind is not None:
for valid_device in VALID_DEVICES:
if (
valid_device["max_input_channels"] > 0
and kind == "input"
or valid_device["max_output_channels"] > 0
and kind == "output"
):
return valid_device
raise PortAudioError(f"No {kind} device found")
return VALID_DEVICES

View File

@@ -0,0 +1,566 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Provides the PortAudioMicrophone class for capturing audio from microphones using the PortAudio library through the sounddevice Python package.
"""
import logging
import time
from multiprocessing import (
Event as process_Event,
JoinableQueue as process_Queue,
Process,
)
from pathlib import Path
from queue import Empty
from threading import Barrier, Event, Event as thread_Event, Thread
from typing import Any
import numpy as np
from soundfile import SoundFile
from lerobot.microphones.portaudio.interface_sounddevice_sdk import ISounddeviceSDK, SounddeviceSDKAdapter
from lerobot.utils.errors import (
DeviceAlreadyConnectedError,
DeviceAlreadyRecordingError,
DeviceNotConnectedError,
DeviceNotRecordingError,
)
from lerobot.utils.shared_array import SharedArray
from ..microphone import Microphone
from .configuration_portaudio import PortAudioMicrophoneConfig
logger = logging.getLogger(__name__)
class PortAudioMicrophone(Microphone):
"""
The PortAudioMicrophone class handles all microphones compatible with sounddevice (and the underlying PortAudio library). Most microphones and sound cards are compatible, across all OS (Linux, Mac, Windows).
A PortAudioMicrophone instance requires the sounddevice index of the microphone, which may be obtained using `python -m sounddevice`. It also requires the recording sample rate as well as the list of recorded channels.
Example of usage:
```python
from lerobot.common.robot_devices.microphones.configs import PortAudioMicrophoneConfig
config = PortAudioMicrophoneConfig(microphone_index=0, sample_rate=16000, channels=[1])
microphone = PortAudioMicrophone(config)
microphone.connect()
microphone.start_recording("some/output/file.wav")
...
audio_readings = microphone.read() # Gets all recorded audio data since the last read or since the beginning of the recording. The longer the period the longer the reading time !
...
microphone.stop_recording()
microphone.disconnect()
```
"""
def __init__(self, config: PortAudioMicrophoneConfig, sounddevice_sdk: ISounddeviceSDK = None):
"""
Initializes the PortAudioMicrophone instance.
Args:
config: The configuration settings for the microphone.
"""
super().__init__(config)
if sounddevice_sdk is None:
self.sounddevice_sdk = SounddeviceSDKAdapter()
else:
self.sounddevice_sdk = sounddevice_sdk
# Microphone index
self.microphone_index = config.microphone_index
# Input audio recording process and events
self.record_process = None
self.record_stop_event = process_Event()
self.record_start_event = process_Event()
self.record_close_event = process_Event()
self.record_is_started_event = process_Event()
self.audio_callback_start_event = process_Event()
# Process-safe concurrent queue to send audio from the recording process to the writing process/thread
self.write_queue = process_Queue()
# SharedArray to store audio from the recording process.
self.read_shared_array = None
self.local_read_shared_array = None
# Thread/Process to handle data writing in a separate thread/process (safely)
self.write_thread = None
self.write_stop_event = None
self.write_is_started_event = None
self.logs = {}
def __str__(self) -> str:
return f"{self.__class__.__name__}({self.microphone_index})"
@property
def is_connected(self) -> bool:
return self.record_process is not None and self.record_process.is_alive()
@property
def is_recording(self) -> bool:
return self.record_is_started_event.is_set()
@property
def is_writing(self) -> bool:
return self.write_thread is not None and self.write_is_started_event.is_set()
@staticmethod
def find_microphones(
device: int | str | None = None, sounddevice_sdk: ISounddeviceSDK = None
) -> list[dict[str, Any]] | dict[str, Any]:
"""
Detects available microphones connected to the system.
Args:
device: The device to find microphones for. If None, all microphones are found.
Returns:
List[Dict[str, Any]]: A list of dictionaries,
where each dictionary contains information about a detected microphone : index, name, sample rate, channels.
"""
if sounddevice_sdk is None:
sounddevice_sdk = SounddeviceSDKAdapter()
found_microphones_info = []
devices = sounddevice_sdk.query_devices()
for d in devices:
if d["max_input_channels"] > 0:
microphone_info = {
"index": d["index"],
"name": d["name"],
"sample_rate": int(d["default_samplerate"]),
"channels": np.arange(1, d["max_input_channels"] + 1),
}
if device is None or (
(isinstance(device, int) and d["index"] == device)
or (isinstance(device, str) and d["name"] == device)
):
found_microphones_info.append(microphone_info)
if device is not None:
if len(found_microphones_info) == 0:
raise RuntimeError(f"No microphone found for device {device}")
else:
return found_microphones_info[0]
if len(found_microphones_info) == 0:
logger.warning("No microphone found !")
return found_microphones_info
def _configure_capture_settings(self) -> None:
"""
Validates the microphone index, sample rate and channels settings specified in the constructor's config to the un-connected microphone.
This method actually checks the specified settings and fills the sample rate and channels settings if not specified before attempting to start a PortAudio stream.
Raises:
RuntimeError: If one of the specified settings is not compatible with the microphone.
DeviceAlreadyConnectedError: If the microphone is connected when attempting to configure settings.
"""
if self.is_connected:
raise DeviceAlreadyConnectedError(
f"Cannot configure settings for {self} as it is already connected."
)
self._validate_microphone_index()
self._validate_sample_rate()
self._validate_channels()
def _validate_microphone_index(self) -> None:
""" "Validates the microphone index against available devices by checking if it has at least one input channel."""
try:
PortAudioMicrophone.find_microphones(self.microphone_index, self.sounddevice_sdk)
except RuntimeError as e:
raise RuntimeError(
f"{e}. Available microphones: {PortAudioMicrophone.find_microphones(sounddevice_sdk=self.sounddevice_sdk)}"
) from e
def _validate_sample_rate(self) -> None:
"""Validates the sample rate against the actual microphone's default sample rate."""
actual_sample_rate = PortAudioMicrophone.find_microphones(
self.microphone_index, self.sounddevice_sdk
)["sample_rate"]
if self.sample_rate is not None:
try:
self.sample_rate = int(self.sample_rate)
except ValueError as e:
raise RuntimeError(
f"Cannot convert the provided sample rate ({self.sample_rate} Hz) to an integer."
) from e
if self.sample_rate > actual_sample_rate or self.sample_rate < 1000:
raise RuntimeError(
f"Provided sample rate {self.sample_rate} is either too low or too high compared to the sample rate of the microphone {actual_sample_rate}."
)
else:
if self.sample_rate < actual_sample_rate:
logger.warning(
"Provided sample rate is lower than the sample rate of the microphone. Performance may be impacted."
)
else:
self.sample_rate = actual_sample_rate
def _validate_channels(self) -> None:
"""Validates the channels against the actual microphone's maximum input channels."""
actual_channels = PortAudioMicrophone.find_microphones(self.microphone_index, self.sounddevice_sdk)[
"channels"
]
if self.channels is not None and len(self.channels) > 0:
if not all(channel in actual_channels for channel in self.channels):
raise RuntimeError(
f"Some of the provided channels {self.channels} are outside the possible channel range of the microphone {actual_channels}."
)
else:
self.channels = actual_channels
# Get channels index instead of number for slicing
self.channels_index = np.array(self.channels) - 1
def connect(self) -> None:
"""
Connects the microphone and checks if the requested acquisition parameters are compatible with the microphone.
"""
if self.is_connected:
raise DeviceAlreadyConnectedError(f"Microphone {self.microphone_index} is already connected.")
self._configure_capture_settings()
# Create or reset queue and shared array
self.read_shared_array = SharedArray(
shape=(self.sample_rate * 10, len(self.channels)),
dtype=np.dtype("float32"),
)
self.local_read_shared_array = self.read_shared_array.get_local_array()
self.write_queue = process_Queue()
# Reset events
self.record_start_event.clear()
self.record_stop_event.clear()
self.record_close_event.clear()
self.record_is_started_event.clear()
self.audio_callback_start_event.clear()
# Create and start an audio input stream with a recording callback
# Remark: this is done in a separate process so that audio recording is not impacted by the main thread CPU usage, especially the precise_sleep function.
process_init_event = process_Event()
self.record_process = Process(
target=self._record_process,
args=(
self.microphone_index,
self.sample_rate,
self.channels,
process_init_event,
self.record_start_event,
self.record_stop_event,
self.record_close_event,
self.record_is_started_event,
self.audio_callback_start_event,
self.write_queue,
self.read_shared_array,
self.sounddevice_sdk,
),
)
self.record_process.daemon = True
self.record_process.start()
is_init = process_init_event.wait(
timeout=5.0
) # Wait for the recording process to be started, and to potentially raise an error on failure.
if not self.is_connected or not is_init:
raise RuntimeError(f"Error connecting microphone {self.microphone_index}.")
logger.info(f"{self} connected.")
def disconnect(self) -> None:
"""
Disconnects the microphone and stops the recording.
"""
if not self.is_connected:
raise DeviceNotConnectedError(f"Microphone {self.microphone_index} is not connected.")
if self.is_recording:
self.stop_recording()
self.record_close_event.set()
self.read_shared_array.delete()
self.write_queue.close()
self.record_process.join()
if self.is_connected:
raise RuntimeError(f"Error disconnecting microphone {self.microphone_index}.")
logger.info(f"{self} disconnected.")
def _read(self) -> np.ndarray:
"""
Thread/Process-safe callback to read available audio data
"""
return self.read_shared_array.read(self.local_read_shared_array, flush=True)
def read(self) -> np.ndarray:
"""
Reads the last audio chunk recorded by the microphone, e.g. all samples recorded since the last read or since the beginning of the recording.
"""
if not self.is_connected:
raise DeviceNotConnectedError(f"Microphone {self.microphone_index} is not connected.")
if not self.is_recording:
raise RuntimeError(f"Microphone {self.microphone_index} is not recording.")
start_time = time.perf_counter()
audio_readings = self._read()
# log the number of seconds it took to read the audio chunk
self.logs["delta_timestamp_s"] = time.perf_counter() - start_time
# log the utc time at which the audio chunk was received
self.logs["timestamp_utc"] = time.perf_counter()
return audio_readings
@staticmethod
def _record_process(
microphone_index,
sample_rate,
channels,
process_init_event,
record_start_event,
record_stop_event,
record_close_event,
record_is_started_event,
audio_callback_start_event,
write_queue,
read_shared_array,
sounddevice_sdk,
) -> None:
"""
Process callback used to create an unpickable sounddevice audio input stream with a recording callback and start, stop and close it based on multiprocessing events.
"""
channels_index = np.array(channels) - 1
local_read_shared_array = read_shared_array.get_local_array()
def audio_callback(indata, frames, timestamp, status) -> None:
"""
Low-level sounddevice callback.
"""
if status:
logger.warning(status)
if audio_callback_start_event.is_set():
write_queue.put_nowait(indata[:, channels_index])
read_shared_array.write(local_read_shared_array, indata[:, channels_index])
# Create the audio stream
# InputStream must be instantiated in the process as it is not pickable.
stream = sounddevice_sdk.InputStream(
device=microphone_index,
samplerate=sample_rate,
channels=max(channels),
dtype="float32",
blocksize=0, # Varying input buffer length, but no additional latency
latency="low", # Low latency mode (not enabled by default !)
# never_drop_input=True, # Disabled as it generates an error for some devices
callback=audio_callback,
)
process_init_event.set()
while True:
start_flag = record_start_event.wait(timeout=0.1)
if record_close_event.is_set():
break
elif not start_flag:
continue
stream.start()
record_is_started_event.set()
record_stop_event.wait()
stream.stop() # stream.stop() waits for all buffers to be processed, stream.abort() flushes the buffers !
record_is_started_event.clear()
stream.close()
def start_recording(
self,
output_file: str | None = None,
multiprocessing: bool | None = False,
overwrite: bool | None = True,
barrier: Barrier | None = None,
) -> None:
"""
Starts the recording of the microphone. If output_file is provided, the audio will be written to this file.
"""
if not self.is_connected:
raise DeviceNotConnectedError(f"Microphone {self.microphone_index} is not connected.")
if self.is_recording:
raise DeviceAlreadyRecordingError(f"Microphone {self.microphone_index} is already recording.")
# Reset queue and shared memory
self.read_shared_array.reset()
self._clear_queue(self.write_queue)
# Reset stop event
self.record_stop_event.clear()
# Write recordings into a file if output_file is provided
if output_file is not None:
output_file = Path(output_file)
output_file.parent.mkdir(parents=True, exist_ok=True)
if output_file.exists():
if overwrite:
output_file.unlink()
else:
raise FileExistsError(
f"Output file {output_file} already exists. Set overwrite to True to overwrite it."
)
if multiprocessing:
self.write_stop_event = process_Event()
self.write_is_started_event = process_Event()
self.write_thread = Process(
target=PortAudioMicrophone._write_loop,
args=(
self.write_queue,
self.write_stop_event,
self.write_is_started_event,
self.sample_rate,
self.channels,
output_file,
),
)
else:
self.write_stop_event = thread_Event()
self.write_is_started_event = thread_Event()
self.write_thread = Thread(
target=PortAudioMicrophone._write_loop,
args=(
self.write_queue,
self.write_stop_event,
self.write_is_started_event,
self.sample_rate,
self.channels,
output_file,
),
)
self.write_thread.daemon = True
self.write_thread.start()
self.write_is_started_event.wait() # Wait for the writing thread/process to be started.
self.record_start_event.set() # Start the input audio stream process
self.record_is_started_event.wait() # Wait for the input audio stream process to be actually started
if barrier is not None:
barrier.wait() # Wait for multiple input audio streams to be started at the same time
self.audio_callback_start_event.set()
if not self.is_recording:
raise RuntimeError(f"Error starting recording for microphone {self.microphone_index}.")
if output_file is not None and not self.is_writing:
raise RuntimeError(f"Error starting writing for microphone {self.microphone_index}.")
def stop_recording(self) -> None:
"""
Stops the recording of the microphones.
"""
if not self.is_connected:
raise DeviceNotConnectedError(f"Microphone {self.microphone_index} is not connected.")
if not self.is_recording:
raise DeviceNotRecordingError(f"Microphone {self.microphone_index} is not recording.")
self.audio_callback_start_event.clear()
self.record_start_event.clear() # Ensures the audio stream is not started again !
self.record_stop_event.set()
# Wait for the stream to be stopped (might lead to race condition if the stream is not properly stopped on array reset and queue clearing)
timeout = 1.0
while self.is_recording and timeout > 0:
time.sleep(0.01)
timeout -= 0.01
self.read_shared_array.reset()
self._clear_queue(self.write_queue, join_queue=True)
if self.is_writing:
self.write_stop_event.set()
self.write_thread.join()
if self.is_recording:
raise RuntimeError(f"Error stopping recording for microphone {self.microphone_index}.")
if self.is_writing:
raise RuntimeError(f"Error stopping writing for microphone {self.microphone_index}.")
@staticmethod
def _write_loop(
queue,
write_stop_event: Event,
write_is_started_event: Event,
sample_rate: int,
channels: list[int],
output_file: Path,
) -> None:
"""
Thread/Process-safe loop to write audio data into a file.
"""
# Can only be run on a single process/thread for file writing safety
with SoundFile(
output_file,
mode="w",
samplerate=sample_rate,
channels=len(channels),
format="WAV",
subtype="FLOAT", # By default, a much lower quality WAV file is created !
) as file:
write_is_started_event.set()
while not write_stop_event.is_set():
try:
file.write(
queue.get(timeout=0.005)
) # Timeout set as the usual sounddevice buffer size. get_nowait is not possible here as it saturates the thread.
queue.task_done()
except Empty:
continue
write_is_started_event.clear()
def __del__(self) -> None:
if self.is_connected:
self.disconnect()
@staticmethod
def _clear_queue(queue, join_queue: bool = False):
"""
Clears the queue by getting all items until it is empty. The longer the queue, the longer it takes to clear it.
"""
try:
while True:
queue.get_nowait()
queue.task_done()
except Empty:
if join_queue:
queue.join()
return

View File

@@ -0,0 +1,16 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .configuration_touchlab import TouchLabSensorConfig
from .sensor_touchlab import TouchLabSensor

View File

@@ -0,0 +1,42 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from dataclasses import dataclass
from ..configs import MicrophoneConfig
@MicrophoneConfig.register_subclass("touchlab")
@dataclass
class TouchLabSensorConfig(MicrophoneConfig):
"""Configuration class for TouchLab tactile sensors (technically not a microphone, but behaves like one acquisition-wise).
This class provides configuration options for TouchLab tactile sensors, including serial port, sample rate and channels.
Example configurations:
```python
# Basic configurations
TouchLabSensorConfig("/dev/ttyACM0", 16000) # Serial port /dev/ttyACM0, 16000Hz
TouchLabSensorConfig("/dev/ttyACM1", 44100) # Serial port /dev/ttyACM1, 44100Hz
```
Attributes:
sensor_port: Serial port of the tactile sensor.
baud_rate: Baud rate of the tactile sensor.
sample_rate: Sample rate in Hz for the tactile sensor.
channels: List of channel numbers to use for the tactile sensor.
"""
sensor_port: str
baud_rate: int = 115_200

View File

@@ -0,0 +1,469 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Provides the TouchLabSensor class for capturing tactile data from TouchLab tactile sensors.
"""
import logging
import time
from multiprocessing import (
Event as process_Event,
JoinableQueue as process_Queue,
Process,
)
from pathlib import Path
from queue import Empty
from threading import Barrier, Event, Event as thread_Event, Thread
from typing import Any
import numpy as np
from serial import Serial
from soundfile import SoundFile
from lerobot.utils.errors import (
DeviceAlreadyConnectedError,
DeviceAlreadyRecordingError,
DeviceNotConnectedError,
DeviceNotRecordingError,
)
from lerobot.utils.shared_array import SharedArray
from ..microphone import Microphone
from .configuration_touchlab import TouchLabSensorConfig
logger = logging.getLogger(__name__)
MAX_SERIAL_READ_SIZE = 512
class TouchLabSensor(Microphone):
"""
The TouchLabSensor class handles all TouchLab tactile sensors.
A TouchLabSensor instance requires the serial port of the tactile sensor, which may be obtained using `python -m lerobot.find_port`. It also requires the recording sample rate as well as the list of recorded channels.
Example of usage:
```python
from lerobot.common.robot_devices.microphones.configs import TouchLabSensorConfig
config = TouchLabSensorConfig(sensor_port="/dev/ttyACM0", baud_rate=115200, sample_rate=115, channels=[1])
microphone = TouchLabSensor(config)
microphone.connect()
microphone.start_recording("some/output/file.wav")
...
audio_readings = microphone.read() # Gets all recorded audio data since the last read or since the beginning of the recording. The longer the period the longer the reading time !
...
microphone.stop_recording()
microphone.disconnect()
```
"""
def __init__(self, config: TouchLabSensorConfig):
""" "
Initializes the TouchLabSensor instance.
Args:
config: The configuration settings for the sensor.
"""
super().__init__(config)
# Sensor port
self.sensor_port = config.sensor_port
# Baud rate
self.baud_rate = config.baud_rate
# Input audio recording process and events
self.record_process = None
self.record_stop_event = process_Event()
self.record_start_event = process_Event()
self.record_close_event = process_Event()
self.record_is_started_event = process_Event()
self.audio_callback_start_event = process_Event()
# Process-safe concurrent queue to send audio from the recording process to the writing process/thread
self.write_queue = process_Queue()
# SharedArray to store audio from the recording process.
self.read_shared_array = None
self.local_read_shared_array = None
# Thread/Process to handle data writing in a separate thread/process (safely)
self.write_thread = None
self.write_stop_event = None
self.write_is_started_event = None
self.logs = {}
def __str__(self) -> str:
return f"{self.__class__.__name__}({self.sensor_port})"
@property
def is_connected(self) -> bool:
"""Check if the sensor is currently connected.
Returns:
bool: True if the sensor is connected and ready to start recording,
False otherwise.
"""
return self.record_process is not None and self.record_process.is_alive()
@property
def is_recording(self) -> bool:
"""Check if the sensor is currently recording.
Returns:
bool: True if the sensor is recording, False otherwise.
"""
return self.record_is_started_event.is_set()
@property
def is_writing(self) -> bool:
"""Check if the sensor is currently writing to a file.
Returns:
bool: True if the sensor is writing to a file, False otherwise.
"""
return self.write_thread is not None and self.write_is_started_event.is_set()
@staticmethod
def find_microphones() -> list[dict[str, Any]]:
"""Detects available sensors connected to the system.
Returns:
List[Dict[str, Any]]: A list of dictionaries,
where each dictionary contains information about a detected sensor.
"""
pass
def connect(self) -> None:
"""
Establish connection to the sensor.
"""
if self.is_connected:
raise DeviceAlreadyConnectedError(f"Sensor connected to {self.sensor_port} is already connected.")
# Create or reset queue and shared array
self.read_shared_array = SharedArray(
shape=(self.sample_rate * 10, len(self.channels)),
dtype=np.dtype("int16"),
)
self.local_read_shared_array = self.read_shared_array.get_local_array()
self.write_queue = process_Queue()
# Reset events
self.record_start_event.clear()
self.record_stop_event.clear()
self.record_close_event.clear()
self.record_is_started_event.clear()
self.audio_callback_start_event.clear()
# Create and start an audio input stream with a recording callback
# Remark: this is done in a separate process so that audio recording is not impacted by the main thread CPU usage, especially the precise_sleep function.
process_init_event = process_Event()
self.record_process = Process(
target=self._record_process,
args=(
self.sensor_port,
self.baud_rate,
self.channels,
process_init_event,
self.record_start_event,
self.record_stop_event,
self.record_close_event,
self.record_is_started_event,
self.audio_callback_start_event,
self.write_queue,
self.read_shared_array,
),
)
self.record_process.daemon = True
self.record_process.start()
is_init = process_init_event.wait(
timeout=5.0
) # Wait for the recording process to be started, and to potentially raise an error on failure.
if not self.is_connected or not is_init:
raise RuntimeError(f"Error connecting sensor connected to {self.sensor_port}.")
logger.info(f"{self} connected.")
@staticmethod
def _record_process(
sensor_port,
baud_rate,
channels,
process_init_event,
record_start_event,
record_stop_event,
record_close_event,
record_is_started_event,
audio_callback_start_event,
write_queue,
read_shared_array,
) -> None:
channels_index = np.array(channels) - 1
local_read_shared_array = read_shared_array.get_local_array()
def tactile_callback(serial_connection):
"""
Parse the tactile data from the raw input data.
"""
buffer = serial_connection.readline()
if audio_callback_start_event.is_set():
strings = buffer.decode("utf8").split(",")
num_taxels = len(strings)
if num_taxels > 0 and num_taxels < MAX_SERIAL_READ_SIZE: # Make sure we didn't read rubbish
indata = np.empty((1, num_taxels))
for i in range(num_taxels):
indata[0, i] = int(strings[i])
write_queue.put_nowait(indata[:, channels_index])
read_shared_array.write(local_read_shared_array, indata[:, channels_index])
process_init_event.set()
while True:
start_flag = record_start_event.wait(timeout=0.1)
if record_close_event.is_set():
break
elif not start_flag:
continue
with Serial(sensor_port, baud_rate, timeout=0.5) as serial_connection:
serial_connection.flush()
record_is_started_event.set()
while not record_stop_event.is_set():
tactile_callback(serial_connection)
record_is_started_event.clear()
serial_connection.close()
def disconnect(self) -> None:
"""
Disconnect the sensor and release any resources.
"""
if not self.is_connected:
raise DeviceNotConnectedError(f"Sensor connected to {self.sensor_port} is not connected.")
if self.is_recording:
self.stop_recording()
self.record_close_event.set()
self.read_shared_array.delete()
self.write_queue.close()
self.record_process.join()
if self.is_connected:
raise RuntimeError(f"Error disconnecting sensor connected to {self.sensor_port}.")
logger.info(f"{self} disconnected.")
def start_recording(
self,
output_file: str | Path | None = None,
multiprocessing: bool | None = False,
overwrite: bool | None = True,
barrier: Barrier | None = None,
) -> None:
"""
Start recording tactile data from the sensor.
Args:
output_file: Optional path to save the recorded tactile data.
multiprocessing: If True, enables multiprocessing for recording. Defaults to multithreading otherwise.
overwrite: If True, overwrites existing files at output_file path.
barrier: If not None, ensures that multiple sensors start recording at the same time.
"""
if not self.is_connected:
raise DeviceNotConnectedError(f"Sensor connected to {self.sensor_port} is not connected.")
if self.is_recording:
raise DeviceAlreadyRecordingError(f"Sensor connected to {self.sensor_port} is already recording.")
# Reset queue and shared memory
self.read_shared_array.reset()
self._clear_queue(self.write_queue)
# Reset stop event
self.record_stop_event.clear()
# Write recordings into a file if output_file is provided
if output_file is not None:
output_file = Path(output_file)
output_file.parent.mkdir(parents=True, exist_ok=True)
if output_file.exists():
if overwrite:
output_file.unlink()
else:
raise FileExistsError(
f"Output file {output_file} already exists. Set overwrite to True to overwrite it."
)
if multiprocessing:
self.write_stop_event = process_Event()
self.write_is_started_event = process_Event()
self.write_thread = Process(
target=TouchLabSensor._write_loop,
args=(
self.write_queue,
self.write_stop_event,
self.write_is_started_event,
self.sample_rate,
self.channels,
output_file,
),
)
else:
self.write_stop_event = thread_Event()
self.write_is_started_event = thread_Event()
self.write_thread = Thread(
target=TouchLabSensor._write_loop,
args=(
self.write_queue,
self.write_stop_event,
self.write_is_started_event,
self.sample_rate,
self.channels,
output_file,
),
)
self.write_thread.daemon = True
self.write_thread.start()
self.write_is_started_event.wait() # Wait for the writing thread/process to be started.
self.record_start_event.set() # Start the input audio stream process
self.record_is_started_event.wait() # Wait for the input audio stream process to be actually started
if barrier is not None:
barrier.wait() # Wait for multiple input audio streams to be started at the same time
self.audio_callback_start_event.set()
if not self.is_recording:
raise RuntimeError(f"Error starting recording for sensor connected to {self.sensor_port}.")
if output_file is not None and not self.is_writing:
raise RuntimeError(f"Error starting writing for sensor connected to {self.sensor_port}.")
def _read(self) -> np.ndarray:
"""
Thread/Process-safe callback to read available audio data
"""
return self.read_shared_array.read(self.local_read_shared_array, flush=True)
def read(self) -> np.ndarray:
"""Capture and return a single audio chunk from the sensor.
Returns:
np.ndarray: Captured audio chunk as a numpy array.
"""
if not self.is_connected:
raise DeviceNotConnectedError(f"Sensor connected to {self.sensor_port} is not connected.")
if not self.is_recording:
raise RuntimeError(f"Sensor connected to {self.sensor_port} is not recording.")
start_time = time.perf_counter()
tactile_readings = self._read()
# log the number of seconds it took to read the audio chunk
self.logs["delta_timestamp_s"] = time.perf_counter() - start_time
# log the utc time at which the audio chunk was received
self.logs["timestamp_utc"] = time.perf_counter()
return tactile_readings
def _read_loop(self) -> None:
"""Internal loop run by the background thread for asynchronous reading."""
def stop_recording(self) -> None:
"""Stop recording audio from the sensor."""
if not self.is_connected:
raise DeviceNotConnectedError(f"Sensor connected to {self.sensor_port} is not connected.")
if not self.is_recording:
raise DeviceNotRecordingError(f"Sensor connected to {self.sensor_port} is not recording.")
self.audio_callback_start_event.clear()
self.record_start_event.clear() # Ensures the audio stream is not started again !
self.record_stop_event.set()
self.read_shared_array.reset()
self._clear_queue(self.write_queue, join_queue=True)
if self.is_writing:
self.write_stop_event.set()
self.write_thread.join()
timeout = 1.0
while self.is_recording and timeout > 0:
time.sleep(0.01)
timeout -= 0.01
if self.is_recording:
raise RuntimeError(f"Error stopping recording for sensor connected to {self.sensor_port}.")
if self.is_writing:
raise RuntimeError(f"Error stopping writing for sensor connected to {self.sensor_port}.")
def __del__(self) -> None:
if self.is_connected:
self.disconnect()
@staticmethod
def _clear_queue(queue, join_queue: bool = False):
"""
Clears the queue by getting all items until it is empty. The longer the queue, the longer it takes to clear it.
"""
try:
while True:
queue.get_nowait()
queue.task_done()
except Empty:
if join_queue:
queue.join()
return
@staticmethod
def _write_loop(
queue,
write_stop_event: Event,
write_is_started_event: Event,
sample_rate: int,
channels: list[int],
output_file: Path,
) -> None:
"""
Thread/Process-safe loop to write audio data into a file.
"""
# Can only be run on a single process/thread for file writing safety
with SoundFile(
output_file,
mode="w",
samplerate=sample_rate,
channels=len(channels),
format="WAV",
subtype="PCM_16", # Subtype for int16 values
) as file:
write_is_started_event.set()
while not write_stop_event.is_set():
try:
file.write(
queue.get(timeout=0.005)
) # Timeout set as the usual sounddevice buffer size. get_nowait is not possible here as it saturates the thread.
queue.task_done()
except Empty:
continue
write_is_started_event.clear()

View File

@@ -0,0 +1,89 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from multiprocessing import Barrier
from threading import Thread
from .configs import MicrophoneConfig
from .microphone import Microphone
def make_microphones_from_configs(microphone_configs: dict[str, MicrophoneConfig]) -> dict[str, Microphone]:
microphones = {}
for key, cfg in microphone_configs.items():
if cfg.type == "portaudio":
from .portaudio import PortAudioMicrophone
microphones[key] = PortAudioMicrophone(cfg)
elif cfg.type == "touchlab":
from .touchlab import TouchLabSensor
microphones[key] = TouchLabSensor(cfg)
else:
raise ValueError(f"The microphone type '{cfg.type}' is not valid.")
return microphones
def async_microphones_start_recording(
microphones: dict[str, Microphone],
output_files: list[str | None] | None = None,
multiprocessing: bool = False,
overwrite: bool = True,
) -> None:
"""
Starts recording on multiple microphones asynchronously to avoid delays.
Args:
microphones: A dictionary of microphones.
output_files: A list of output files.
multiprocessing: If True, enables multiprocessing for recording.
overwrite: If True, overwrites existing files at output_file path.
"""
start_recording_threads = []
if output_files is None:
output_files = [None] * len(microphones)
barrier = Barrier(len(microphones))
for microphone, output_file in zip(microphones.values(), output_files, strict=False):
start_recording_threads.append(
Thread(target=microphone.start_recording, args=(output_file, multiprocessing, overwrite, barrier))
)
for thread in start_recording_threads:
thread.start()
for thread in start_recording_threads:
thread.join()
def async_microphones_stop_recording(microphones: dict[str, Microphone]) -> None:
"""
Stops recording on multiple microphones asynchronously to avoid delays.
Args:
microphones: A dictionary of microphones.
"""
stop_recording_threads = []
for microphone in microphones.values():
stop_recording_threads.append(Thread(target=microphone.stop_recording))
for thread in stop_recording_threads:
thread.start()
for thread in stop_recording_threads:
thread.join()

View File

@@ -23,7 +23,8 @@ import draccus
import torch
from safetensors.torch import load_file, save_file
from lerobot.datasets.utils import flatten_dict, unflatten_dict, write_json
from lerobot.datasets.io_utils import write_json
from lerobot.datasets.utils import flatten_dict, unflatten_dict
from lerobot.utils.constants import (
OPTIMIZER_PARAM_GROUPS,
OPTIMIZER_STATE,

View File

@@ -23,7 +23,7 @@ import draccus
from torch.optim import Optimizer
from torch.optim.lr_scheduler import LambdaLR, LRScheduler
from lerobot.datasets.utils import write_json
from lerobot.datasets.io_utils import write_json
from lerobot.utils.constants import SCHEDULER_STATE
from lerobot.utils.io_utils import deserialize_json_into_object

View File

@@ -15,6 +15,7 @@
from .act.configuration_act import ACTConfig as ACTConfig
from .diffusion.configuration_diffusion import DiffusionConfig as DiffusionConfig
from .groot.configuration_groot import GrootConfig as GrootConfig
from .multi_task_dit.configuration_multi_task_dit import MultiTaskDiTConfig as MultiTaskDiTConfig
from .pi0.configuration_pi0 import PI0Config as PI0Config
from .pi0_fast.configuration_pi0_fast import PI0FastConfig as PI0FastConfig
from .pi05.configuration_pi05 import PI05Config as PI05Config
@@ -28,6 +29,7 @@ from .xvla.configuration_xvla import XVLAConfig as XVLAConfig
__all__ = [
"ACTConfig",
"DiffusionConfig",
"MultiTaskDiTConfig",
"PI0Config",
"PI05Config",
"PI0FastConfig",

View File

@@ -89,6 +89,7 @@ class ACTConfig(PreTrainedConfig):
normalization_mapping: dict[str, NormalizationMode] = field(
default_factory=lambda: {
"VISUAL": NormalizationMode.MEAN_STD,
"AUDIO": NormalizationMode.IDENTITY,
"STATE": NormalizationMode.MEAN_STD,
"ACTION": NormalizationMode.MEAN_STD,
}
@@ -99,6 +100,10 @@ class ACTConfig(PreTrainedConfig):
vision_backbone: str = "resnet18"
pretrained_backbone_weights: str | None = "ResNet18_Weights.IMAGENET1K_V1"
replace_final_stride_with_dilation: int = False
# Audio backbone.
audio_backbone: str = vision_backbone
pretrained_backbone_weights_audio: str | None = None
replace_final_stride_with_dilation_audio: int = False
# Transformer layers.
pre_norm: bool = False
dim_model: int = 512
@@ -161,8 +166,10 @@ class ACTConfig(PreTrainedConfig):
return None
def validate_features(self) -> None:
if not self.image_features and not self.env_state_feature:
raise ValueError("You must provide at least one image or the environment state among the inputs.")
if not (self.image_features or self.audio_features) and not self.env_state_feature:
raise ValueError(
"You must provide at least one image/audio or the environment state among the inputs."
)
@property
def observation_delta_indices(self) -> None:

View File

@@ -35,7 +35,7 @@ from torchvision.ops.misc import FrozenBatchNorm2d
from lerobot.policies.act.configuration_act import ACTConfig
from lerobot.policies.pretrained import PreTrainedPolicy
from lerobot.utils.constants import ACTION, OBS_ENV_STATE, OBS_IMAGES, OBS_STATE
from lerobot.utils.constants import ACTION, OBS_AUDIO, OBS_ENV_STATE, OBS_IMAGES, OBS_STATE
class ACTPolicy(PreTrainedPolicy):
@@ -106,6 +106,8 @@ class ACTPolicy(PreTrainedPolicy):
"""
self.eval() # keeping the policy in eval mode as it could be set to train mode while queue is consumed
# If we are doing temporal ensembling, do online updates where we keep track of the number of actions
# we are ensembling over.
if self.config.temporal_ensemble_coeff is not None:
actions = self.predict_action_chunk(batch)
action = self.temporal_ensembler.update(actions)
@@ -331,12 +333,26 @@ class ACT(nn.Module):
# Note: The forward method of this returns a dict: {"feature_map": output}.
self.backbone = IntermediateLayerGetter(backbone_model, return_layers={"layer4": "feature_map"})
# Backbone for audio feature extraction.
if self.config.audio_features:
audio_backbone_model = getattr(torchvision.models, config.audio_backbone)(
replace_stride_with_dilation=[False, False, config.replace_final_stride_with_dilation_audio],
weights=config.pretrained_backbone_weights_audio,
norm_layer=FrozenBatchNorm2d,
)
# Note: The assumption here is that we are using a ResNet model (and hence layer4 is the final
# feature map).
# Note: The forward method of this returns a dict: {"feature_map": output}.
self.audio_backbone = IntermediateLayerGetter(
audio_backbone_model, return_layers={"layer4": "feature_map"}
)
# Transformer (acts as VAE decoder when training with the variational objective).
self.encoder = ACTEncoder(config)
self.decoder = ACTDecoder(config)
# Transformer encoder input projections. The tokens will be structured like
# [latent, (robot_state), (env_state), (image_feature_map_pixels)].
# [latent, (robot_state), (env_state), (image_feature_map_pixels), (audio_feature)].
if self.config.robot_state_feature:
self.encoder_robot_state_input_proj = nn.Linear(
self.config.robot_state_feature.shape[0], config.dim_model
@@ -350,6 +366,10 @@ class ACT(nn.Module):
self.encoder_img_feat_input_proj = nn.Conv2d(
backbone_model.fc.in_features, config.dim_model, kernel_size=1
)
if self.config.audio_features:
self.encoder_audio_feat_input_proj = nn.Conv2d(
audio_backbone_model.fc.in_features, config.dim_model, kernel_size=1
)
# Transformer encoder positional embeddings.
n_1d_tokens = 1 # for the latent
if self.config.robot_state_feature:
@@ -359,6 +379,8 @@ class ACT(nn.Module):
self.encoder_1d_feature_pos_embed = nn.Embedding(n_1d_tokens, config.dim_model)
if self.config.image_features:
self.encoder_cam_feat_pos_embed = ACTSinusoidalPositionEmbedding2d(config.dim_model // 2)
if self.config.audio_features:
self.encoder_audio_feat_pos_embed = ACTSinusoidalPositionEmbedding2d(config.dim_model // 2)
# Transformer decoder.
# Learnable positional embedding for the transformer's decoder (in the style of DETR object queries).
@@ -483,6 +505,21 @@ class ACT(nn.Module):
encoder_in_tokens.extend(list(cam_features))
encoder_in_pos_embed.extend(list(cam_pos_embed))
if self.config.audio_features:
for audio in batch[OBS_AUDIO]:
audio_features = self.audio_backbone(audio)["feature_map"]
audio_pos_embed = self.encoder_audio_feat_pos_embed(audio_features).to(
dtype=audio_features.dtype
)
audio_features = self.encoder_audio_feat_input_proj(audio_features)
# Rearrange features to (sequence, batch, dim).
audio_features = einops.rearrange(audio_features, "b c h w -> (h w) b c")
audio_pos_embed = einops.rearrange(audio_pos_embed, "b c h w -> (h w) b c")
encoder_in_tokens.extend(list(audio_features))
encoder_in_pos_embed.extend(list(audio_pos_embed))
# Stack all tokens along the sequence dimension.
encoder_in_tokens = torch.stack(encoder_in_tokens, axis=0)
encoder_in_pos_embed = torch.stack(encoder_in_pos_embed, axis=0)

View File

@@ -17,9 +17,11 @@ from typing import Any
import torch
from lerobot.datasets.utils import DEFAULT_AUDIO_CHUNK_DURATION
from lerobot.policies.act.configuration_act import ACTConfig
from lerobot.processor import (
AddBatchDimensionProcessorStep,
AudioProcessorStep,
DeviceProcessorStep,
NormalizerProcessorStep,
PolicyAction,
@@ -63,6 +65,15 @@ def make_act_pre_post_processors(
stats=dataset_stats,
device=config.device,
),
AudioProcessorStep(
output_height=224,
output_width=224,
output_channels=3,
input_audio_chunk_duration=DEFAULT_AUDIO_CHUNK_DURATION,
input_sample_rate=48000,
intermediate_sample_rate=16000,
n_fft=1024,
),
]
output_steps = [
UnnormalizerProcessorStep(

View File

@@ -24,13 +24,14 @@ import torch
from lerobot.configs.policies import PreTrainedConfig
from lerobot.configs.types import FeatureType
from lerobot.datasets.lerobot_dataset import LeRobotDatasetMetadata
from lerobot.datasets.utils import dataset_to_policy_features
from lerobot.datasets.dataset_metadata import LeRobotDatasetMetadata
from lerobot.datasets.feature_utils import dataset_to_policy_features
from lerobot.envs.configs import EnvConfig
from lerobot.envs.utils import env_to_policy_features
from lerobot.policies.act.configuration_act import ACTConfig
from lerobot.policies.diffusion.configuration_diffusion import DiffusionConfig
from lerobot.policies.groot.configuration_groot import GrootConfig
from lerobot.policies.multi_task_dit.configuration_multi_task_dit import MultiTaskDiTConfig
from lerobot.policies.pi0.configuration_pi0 import PI0Config
from lerobot.policies.pi05.configuration_pi05 import PI05Config
from lerobot.policies.pretrained import PreTrainedPolicy
@@ -58,6 +59,29 @@ from lerobot.utils.constants import (
)
def _reconnect_relative_absolute_steps(
preprocessor: PolicyProcessorPipeline, postprocessor: PolicyProcessorPipeline
) -> None:
"""Wire AbsoluteActionsProcessorStep.relative_step to the RelativeActionsProcessorStep after deserialization.
After a policy is loaded from disk, the preprocessor and postprocessor are reconstructed
independently from their configs. AbsoluteActionsProcessorStep needs a live reference to
the RelativeActionsProcessorStep so it can read the cached state at inference time.
That reference is not serializable, so we re-establish it here after loading.
"""
from lerobot.processor.relative_action_processor import (
AbsoluteActionsProcessorStep,
RelativeActionsProcessorStep,
)
relative_step = next((s for s in preprocessor.steps if isinstance(s, RelativeActionsProcessorStep)), None)
if relative_step is None:
return
for step in postprocessor.steps:
if isinstance(step, AbsoluteActionsProcessorStep) and step.relative_step is None:
step.relative_step = relative_step
def get_policy_class(name: str) -> type[PreTrainedPolicy]:
"""
Retrieves a policy class by its registered name.
@@ -67,8 +91,7 @@ def get_policy_class(name: str) -> type[PreTrainedPolicy]:
Args:
name: The name of the policy. Supported names are "tdmpc", "diffusion", "act",
"vqbet", "pi0", "pi05", "sac", "reward_classifier", "smolvla", "wall_x".
"multi_task_dit", "vqbet", "pi0", "pi05", "sac", "reward_classifier", "smolvla", "wall_x".
Returns:
The policy class corresponding to the given name.
@@ -87,6 +110,10 @@ def get_policy_class(name: str) -> type[PreTrainedPolicy]:
from lerobot.policies.act.modeling_act import ACTPolicy
return ACTPolicy
elif name == "multi_task_dit":
from lerobot.policies.multi_task_dit.modeling_multi_task_dit import MultiTaskDiTPolicy
return MultiTaskDiTPolicy
elif name == "vqbet":
from lerobot.policies.vqbet.modeling_vqbet import VQBeTPolicy
@@ -147,8 +174,8 @@ def make_policy_config(policy_type: str, **kwargs) -> PreTrainedConfig:
Args:
policy_type: The type of the policy. Supported types include "tdmpc",
"diffusion", "act", "vqbet", "pi0", "pi05", "sac", "smolvla",
"reward_classifier", "wall_x".
"multi_task_dit", "diffusion", "act", "vqbet", "pi0", "pi05", "sac",
"smolvla", "reward_classifier", "wall_x".
**kwargs: Keyword arguments to be passed to the configuration class constructor.
Returns:
@@ -163,6 +190,8 @@ def make_policy_config(policy_type: str, **kwargs) -> PreTrainedConfig:
return DiffusionConfig(**kwargs)
elif policy_type == "act":
return ACTConfig(**kwargs)
elif policy_type == "multi_task_dit":
return MultiTaskDiTConfig(**kwargs)
elif policy_type == "vqbet":
return VQBeTConfig(**kwargs)
elif policy_type == "pi0":
@@ -263,26 +292,26 @@ def make_pre_post_processors(
kwargs["preprocessor_overrides"] = preprocessor_overrides
kwargs["postprocessor_overrides"] = postprocessor_overrides
return (
PolicyProcessorPipeline.from_pretrained(
pretrained_model_name_or_path=pretrained_path,
config_filename=kwargs.get(
"preprocessor_config_filename", f"{POLICY_PREPROCESSOR_DEFAULT_NAME}.json"
),
overrides=kwargs.get("preprocessor_overrides", {}),
to_transition=batch_to_transition,
to_output=transition_to_batch,
),
PolicyProcessorPipeline.from_pretrained(
pretrained_model_name_or_path=pretrained_path,
config_filename=kwargs.get(
"postprocessor_config_filename", f"{POLICY_POSTPROCESSOR_DEFAULT_NAME}.json"
),
overrides=kwargs.get("postprocessor_overrides", {}),
to_transition=policy_action_to_transition,
to_output=transition_to_policy_action,
preprocessor = PolicyProcessorPipeline.from_pretrained(
pretrained_model_name_or_path=pretrained_path,
config_filename=kwargs.get(
"preprocessor_config_filename", f"{POLICY_PREPROCESSOR_DEFAULT_NAME}.json"
),
overrides=kwargs.get("preprocessor_overrides", {}),
to_transition=batch_to_transition,
to_output=transition_to_batch,
)
postprocessor = PolicyProcessorPipeline.from_pretrained(
pretrained_model_name_or_path=pretrained_path,
config_filename=kwargs.get(
"postprocessor_config_filename", f"{POLICY_POSTPROCESSOR_DEFAULT_NAME}.json"
),
overrides=kwargs.get("postprocessor_overrides", {}),
to_transition=policy_action_to_transition,
to_output=transition_to_policy_action,
)
_reconnect_relative_absolute_steps(preprocessor, postprocessor)
return preprocessor, postprocessor
# Create a new processor based on policy type
if isinstance(policy_cfg, TDMPCConfig):
@@ -309,6 +338,16 @@ def make_pre_post_processors(
dataset_stats=kwargs.get("dataset_stats"),
)
elif isinstance(policy_cfg, MultiTaskDiTConfig):
from lerobot.policies.multi_task_dit.processor_multi_task_dit import (
make_multi_task_dit_pre_post_processors,
)
processors = make_multi_task_dit_pre_post_processors(
config=policy_cfg,
dataset_stats=kwargs.get("dataset_stats"),
)
elif isinstance(policy_cfg, VQBeTConfig):
from lerobot.policies.vqbet.processor_vqbet import make_vqbet_pre_post_processors
@@ -470,6 +509,13 @@ def make_policy(
cfg.output_features = {key: ft for key, ft in features.items() if ft.type is FeatureType.ACTION}
if not cfg.input_features:
cfg.input_features = {key: ft for key, ft in features.items() if key not in cfg.output_features}
# Store action feature names for relative_exclude_joints support
if ds_meta is not None and hasattr(cfg, "action_feature_names"):
action_names = ds_meta.features.get(ACTION, {}).get("names")
if action_names is not None:
cfg.action_feature_names = list(action_names)
kwargs["config"] = cfg
# Pass dataset_stats to the policy if available (needed for some policies like SARM)

View File

@@ -0,0 +1,37 @@
# Multitask DiT Policy
## Citation
If you use this work, please cite the following works:
```bibtex
@misc{jones2025multitaskditpolicy,
author = {Bryson Jones},
title = {Dissecting and Open-Sourcing Multitask Diffusion Transformer Policy},
year = {2025},
url = {https://brysonkjones.substack.com/p/dissecting-and-open-sourcing-multitask-diffusion-transformer-policy},
note = {Blog post}
}
```
```bibtex
@misc{trilbmteam2025carefulexaminationlargebehaviormodels,
author = {TRI LBM Team},
title = {A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation},
year = {2025},
eprint = {arXiv:2507.05331},
archivePrefix = {arXiv},
primaryClass = {cs.RO},
url = {https://arxiv.org/abs/2507.05331}
}
```
```bibtex
@misc{bostondynamics2025largebehaviormodelsatlas,
author = {Boston Dynamics and TRI Research Team},
title = {Large Behavior Models and Atlas Find New Footing},
year = {2025},
url = {https://bostondynamics.com/blog/large-behavior-models-atlas-find-new-footing/},
note = {Blog post}
}
```

View File

@@ -0,0 +1,21 @@
#!/usr/bin/env python
# Copyright 2025 Bryson Jones and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .configuration_multi_task_dit import MultiTaskDiTConfig
from .modeling_multi_task_dit import MultiTaskDiTPolicy
from .processor_multi_task_dit import make_multi_task_dit_pre_post_processors
__all__ = ["MultiTaskDiTConfig", "MultiTaskDiTPolicy", "make_multi_task_dit_pre_post_processors"]

View File

@@ -0,0 +1,256 @@
#!/usr/bin/env python
# Copyright 2025 Bryson Jones and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
from dataclasses import dataclass, field
from lerobot.configs.policies import PreTrainedConfig
from lerobot.configs.types import NormalizationMode
from lerobot.optim.optimizers import AdamConfig
from lerobot.optim.schedulers import DiffuserSchedulerConfig
@PreTrainedConfig.register_subclass("multi_task_dit")
@dataclass
class MultiTaskDiTConfig(PreTrainedConfig):
"""Configuration for the Multi-Task Diffusion Transformer (DiT) policy.
A transformer-based policy that supports both diffusion and flow matching objectives
for multi-task robot learning with text and vision conditioning.
"""
n_obs_steps: int = 2 # Number of observation steps for temporal context
horizon: int = 32 # Number of action steps to predict
n_action_steps: int = 24 # Actions executed per policy call (~0.8s at 30Hz)
# Objective Selection
objective: str = "diffusion" # "diffusion" or "flow_matching"
# --- Diffusion-specific (used when objective="diffusion") ---
noise_scheduler_type: str = "DDPM" # "DDPM" or "DDIM"
num_train_timesteps: int = 100 # Number of diffusion timesteps
beta_schedule: str = "squaredcos_cap_v2" # Noise schedule type
beta_start: float = 0.0001 # Starting noise level
beta_end: float = 0.02 # Ending noise level
prediction_type: str = "epsilon" # "epsilon" (predict noise) or "sample" (predict clean)
clip_sample: bool = True # Clip samples during denoising
clip_sample_range: float = 1.0 # Clipping range [-x, x]
num_inference_steps: int | None = None # Denoising steps at inference (defaults to num_train_timesteps)
# --- Flow Matching-specific (used when objective="flow_matching") ---
sigma_min: float = 0.0 # Minimum noise in flow interpolation path
num_integration_steps: int = 100 # ODE integration steps at inference
integration_method: str = "euler" # ODE solver: "euler" or "rk4"
timestep_sampling_strategy: str = "beta" # "uniform" or "beta"
timestep_sampling_s: float = 0.999 # (beta only) Max timestep threshold
timestep_sampling_alpha: float = 1.5 # (beta only) Beta distribution alpha
timestep_sampling_beta: float = 1.0 # (beta only) Beta distribution beta
# Transformer Architecture
hidden_dim: int = 512 # Transformer hidden dimension
num_layers: int = 6 # Number of transformer layers
num_heads: int = 8 # Number of attention heads
dropout: float = 0.1 # Dropout rate
use_positional_encoding: bool = False # Use absolute positional encoding
timestep_embed_dim: int = 256 # Timestep embedding dimension
use_rope: bool = True # Use Rotary Position Embedding
rope_base: float = 10000.0 # RoPE base frequency
# Vision Encoder (CLIP)
vision_encoder_name: str = "openai/clip-vit-base-patch16" # HuggingFace CLIP model
use_separate_rgb_encoder_per_camera: bool = False # Separate encoder per camera view
vision_encoder_lr_multiplier: float = 0.1 # LR multiplier for vision encoder
image_resize_shape: tuple[int, int] | None = None # Resize images before crop
image_crop_shape: tuple[int, int] | None = (224, 224) # Crop shape (CLIP default)
image_crop_is_random: bool = True # Random crop during training, center at inference
# Text Encoder (CLIP)
text_encoder_name: str = "openai/clip-vit-base-patch16" # HuggingFace CLIP model
tokenizer_max_length: int = 77 # Max length for tokenized text (CLIP default is 77)
tokenizer_padding: str = "max_length" # Padding strategy: "max_length" or "longest"
tokenizer_padding_side: str = "right" # Padding side: "left" or "right"
tokenizer_truncation: bool = True # Whether to truncate sequences longer than max_length
# Normalization
normalization_mapping: dict[str, NormalizationMode] = field(
default_factory=lambda: {
"VISUAL": NormalizationMode.MEAN_STD,
"STATE": NormalizationMode.MIN_MAX,
"ACTION": NormalizationMode.MIN_MAX,
}
)
# Training/Optimizer
optimizer_lr: float = 2e-5
optimizer_betas: tuple = (0.95, 0.999)
optimizer_eps: float = 1e-8
optimizer_weight_decay: float = 0.0
scheduler_name: str = "cosine"
scheduler_warmup_steps: int = 0
do_mask_loss_for_padding: bool = False
# Auto-calculated
drop_n_last_frames: int | None = None
def __post_init__(self):
super().__post_init__()
if self.drop_n_last_frames is None:
self.drop_n_last_frames = self.horizon - self.n_action_steps - self.n_obs_steps + 1
self._validate()
def _validate(self):
"""Validate configuration parameters."""
# Objective validation
if self.objective not in ["diffusion", "flow_matching"]:
raise ValueError(f"objective must be 'diffusion' or 'flow_matching', got '{self.objective}'")
# Transformer validation
if self.hidden_dim <= 0:
raise ValueError("hidden_dim must be positive")
if self.num_layers <= 0:
raise ValueError("num_layers must be positive")
if self.num_heads <= 0:
raise ValueError("num_heads must be positive")
if self.hidden_dim % self.num_heads != 0:
raise ValueError("hidden_dim must be divisible by num_heads")
if not (0.0 <= self.dropout <= 1.0):
raise ValueError("dropout must be between 0.0 and 1.0")
# Vision encoder validation
if "clip" not in self.vision_encoder_name.lower():
raise ValueError(
f"vision_encoder_name must be a CLIP model (contain 'clip'), got '{self.vision_encoder_name}'"
)
if (
self.image_resize_shape
and self.image_crop_shape
and (
self.image_crop_shape[0] > self.image_resize_shape[0]
or self.image_crop_shape[1] > self.image_resize_shape[1]
)
):
logging.warning(
"image_crop_shape %s must be <= image_resize_shape %s; disabling cropping.",
self.image_crop_shape,
self.image_resize_shape,
)
self.image_crop_shape = None
# Text encoder validation
if "clip" not in self.text_encoder_name.lower():
raise ValueError(
f"text_encoder_name must be a CLIP model (contain 'clip'), got '{self.text_encoder_name}'"
)
# Objective-specific validation
if self.objective == "diffusion":
if self.noise_scheduler_type not in ["DDPM", "DDIM"]:
raise ValueError(
f"noise_scheduler_type must be 'DDPM' or 'DDIM', got {self.noise_scheduler_type}"
)
if self.prediction_type not in ["epsilon", "sample"]:
raise ValueError(f"prediction_type must be 'epsilon' or 'sample', got {self.prediction_type}")
if self.num_train_timesteps <= 0:
raise ValueError(f"num_train_timesteps must be positive, got {self.num_train_timesteps}")
if not (0.0 <= self.beta_start <= self.beta_end <= 1.0):
raise ValueError(f"Invalid beta values: {self.beta_start}, {self.beta_end}")
elif self.objective == "flow_matching":
if not (0.0 <= self.sigma_min <= 1.0):
raise ValueError(f"sigma_min must be in [0, 1], got {self.sigma_min}")
if self.num_integration_steps <= 0:
raise ValueError(f"num_integration_steps must be positive, got {self.num_integration_steps}")
if self.integration_method not in ["euler", "rk4"]:
raise ValueError(
f"integration_method must be 'euler' or 'rk4', got {self.integration_method}"
)
if self.timestep_sampling_strategy not in ["uniform", "beta"]:
raise ValueError("timestep_sampling_strategy must be 'uniform' or 'beta'")
if self.timestep_sampling_strategy == "beta":
if not (0.0 < self.timestep_sampling_s <= 1.0):
raise ValueError(f"timestep_sampling_s must be in (0, 1], got {self.timestep_sampling_s}")
if self.timestep_sampling_alpha <= 0:
raise ValueError("timestep_sampling_alpha must be positive")
if self.timestep_sampling_beta <= 0:
raise ValueError("timestep_sampling_beta must be positive")
def get_optimizer_preset(self) -> AdamConfig:
return AdamConfig(
lr=self.optimizer_lr,
betas=self.optimizer_betas,
eps=self.optimizer_eps,
weight_decay=self.optimizer_weight_decay,
)
def get_scheduler_preset(self) -> DiffuserSchedulerConfig:
return DiffuserSchedulerConfig(
name=self.scheduler_name,
num_warmup_steps=self.scheduler_warmup_steps,
)
def validate_features(self) -> None:
"""Validate that required input features are present and properly configured."""
# If the configured crop doesn't fit, disable cropping instead of erroring.
# Note: if image_resize_shape is set, cropping is applied *after* resizing.
if self.image_crop_shape is not None:
for key, image_ft in self.image_features.items():
# image_ft.shape is (C, H, W)
effective_h, effective_w = (
self.image_resize_shape
if self.image_resize_shape is not None
else (image_ft.shape[1], image_ft.shape[2])
)
if self.image_crop_shape[0] > effective_h or self.image_crop_shape[1] > effective_w:
logging.warning(
"image_crop_shape %s doesn't fit within effective image shape (%s, %s) for '%s'; disabling cropping.",
self.image_crop_shape,
effective_h,
effective_w,
key,
)
self.image_crop_shape = None
break
if len(self.image_features) > 0:
first_key, first_ft = next(iter(self.image_features.items()))
for key, image_ft in self.image_features.items():
if image_ft.shape != first_ft.shape:
raise ValueError(
f"Image '{key}' shape {image_ft.shape} != '{first_key}' shape {first_ft.shape}"
)
@property
def is_diffusion(self) -> bool:
return self.objective == "diffusion"
@property
def is_flow_matching(self) -> bool:
return self.objective == "flow_matching"
@property
def observation_delta_indices(self) -> list:
return list(range(1 - self.n_obs_steps, 1))
@property
def action_delta_indices(self) -> list:
return list(range(1 - self.n_obs_steps, 1 - self.n_obs_steps + self.horizon))
@property
def reward_delta_indices(self) -> None:
return None

View File

@@ -0,0 +1,803 @@
#!/usr/bin/env python
# Copyright 2025 Bryson Jones and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Multi-Task Diffusion Transformer (DiT) Policy
Transformer-based diffusion policy for multi-task robot learning with text and vision conditioning.
Supports both diffusion and flow matching objectives for action generation.
References:
- https://arxiv.org/abs/2507.05331
- https://bostondynamics.com/blog/large-behavior-models-atlas-find-new-footing/
- https://brysonkjones.substack.com/p/dissecting-and-open-sourcing-multitask-diffusion-transformer-policy
"""
import math
from collections import deque
from typing import TYPE_CHECKING
import einops
import torch
import torch.nn as nn
import torch.nn.functional as F # noqa: N812
import torchvision
from diffusers.schedulers.scheduling_ddim import DDIMScheduler
from diffusers.schedulers.scheduling_ddpm import DDPMScheduler
from torch import Tensor
from lerobot.policies.multi_task_dit.configuration_multi_task_dit import MultiTaskDiTConfig
from lerobot.utils.import_utils import _transformers_available
# Conditional import for type checking and lazy loading
if TYPE_CHECKING or _transformers_available:
from transformers import CLIPTextModel, CLIPVisionModel
else:
CLIPTextModel = None
CLIPVisionModel = None
from lerobot.policies.pretrained import PreTrainedPolicy
from lerobot.policies.utils import populate_queues
from lerobot.utils.constants import (
ACTION,
OBS_IMAGES,
OBS_LANGUAGE_ATTENTION_MASK,
OBS_LANGUAGE_TOKENS,
OBS_STATE,
)
# -- Policy --
class MultiTaskDiTPolicy(PreTrainedPolicy):
config_class = MultiTaskDiTConfig
name = "multi_task_dit"
def __init__(self, config: MultiTaskDiTConfig, **kwargs):
super().__init__(config)
config.validate_features()
self.config = config
self._queues = None
self.observation_encoder = ObservationEncoder(config)
conditioning_dim = self.observation_encoder.conditioning_dim
self.noise_predictor = DiffusionTransformer(config, conditioning_dim=conditioning_dim)
action_dim = config.action_feature.shape[0]
horizon = config.horizon
if config.is_diffusion:
self.objective = DiffusionObjective(
config,
action_dim=action_dim,
horizon=horizon,
do_mask_loss_for_padding=config.do_mask_loss_for_padding,
)
elif config.is_flow_matching:
self.objective = FlowMatchingObjective(
config,
action_dim=action_dim,
horizon=horizon,
do_mask_loss_for_padding=config.do_mask_loss_for_padding,
)
else:
raise ValueError(f"Unsupported objective: {config.objective}")
self.reset()
def get_optim_params(self) -> list:
"""Returns parameter groups with different learning rates for vision vs non-vision parameters"""
non_vision_params = []
vision_encoder_params = []
for name, param in self.named_parameters():
if not param.requires_grad:
continue
if "observation_encoder.vision_encoder" in name:
vision_encoder_params.append(param)
else:
non_vision_params.append(param)
return [
{"params": non_vision_params},
{
"params": vision_encoder_params,
"lr": self.config.optimizer_lr * self.config.vision_encoder_lr_multiplier,
},
]
def _generate_actions(self, batch: dict[str, Tensor]) -> Tensor:
batch_size, n_obs_steps = batch[OBS_STATE].shape[:2]
assert n_obs_steps == self.config.n_obs_steps
conditioning_vec = self.observation_encoder.encode(batch)
actions = self.objective.conditional_sample(self.noise_predictor, batch_size, conditioning_vec)
start = n_obs_steps - 1
end = start + self.config.n_action_steps
actions = actions[:, start:end]
return actions
def reset(self):
"""Clear observation and action queues. Should be called on `env.reset()`"""
self._queues = {
OBS_STATE: deque(maxlen=self.config.n_obs_steps),
ACTION: deque(maxlen=self.config.n_action_steps),
}
if self.config.image_features:
self._queues[OBS_IMAGES] = deque(maxlen=self.config.n_obs_steps)
@torch.no_grad()
def predict_action_chunk(self, batch: dict[str, Tensor]) -> Tensor:
"""Predict a chunk of actions given environment observations"""
self.eval()
for k in batch:
if k in self._queues:
batch[k] = torch.stack(list(self._queues[k]), dim=1)
actions = self._generate_actions(batch)
return actions
def _prepare_batch(self, batch: dict[str, Tensor]) -> dict[str, Tensor]:
"""Prepare batch by stacking image features if needed."""
if self.config.image_features:
batch = dict(batch) # shallow copy to avoid modifying original
batch[OBS_IMAGES] = torch.stack([batch[key] for key in self.config.image_features], dim=-4)
return batch
@torch.no_grad()
def select_action(self, batch: dict[str, Tensor]) -> Tensor:
"""Select a single action given environment observations"""
if ACTION in batch:
batch = dict(batch) # shallow copy to avoid modifying original
batch.pop(ACTION)
batch = self._prepare_batch(batch)
self._queues = populate_queues(self._queues, batch)
if len(self._queues[ACTION]) == 0:
actions = self.predict_action_chunk(batch)
self._queues[ACTION].extend(actions.transpose(0, 1))
action = self._queues[ACTION].popleft()
return action
def forward(self, batch: dict[str, Tensor]) -> tuple[Tensor, dict | None]:
"""Run the batch through the model and compute the loss for training"""
batch = self._prepare_batch(batch)
conditioning_vec = self.observation_encoder.encode(batch)
loss = self.objective.compute_loss(self.noise_predictor, batch, conditioning_vec)
return loss, None
# -- Observation Encoders --
class CLIPVisionEncoder(nn.Module):
"""CLIP vision encoder using the CLS token for global image representation."""
def __init__(self, model_name: str):
super().__init__()
self.model_name = model_name
self.model = CLIPVisionModel.from_pretrained(self.model_name)
self.num_non_spatial_tokens = 1
self.embed_dim = self.model.config.hidden_size
def forward(self, x: Tensor) -> Tensor:
"""Encode RGB image to CLS token."""
outputs = self.model(pixel_values=x, output_hidden_states=False)
cls_token = outputs.last_hidden_state[:, 0]
b, embed_dim = cls_token.shape
return cls_token.reshape(b, embed_dim, 1, 1)
def get_output_shape(self) -> tuple:
return (self.embed_dim, 1, 1)
class CLIPTextEncoder(nn.Module):
"""CLIP text encoder with frozen weights and a learnable projection layer.
Accepts pre-tokenized inputs (input_ids and attention_mask) from the processor pipeline. See the processor
pipeline to see how the tokenization is handled.
"""
def __init__(self, model_name: str = "openai/clip-vit-base-patch16", projection_dim: int = 512):
super().__init__()
self.model_name = model_name
self.projection_dim = projection_dim
self.text_encoder = CLIPTextModel.from_pretrained(model_name)
for param in self.text_encoder.parameters():
param.requires_grad = False
self.text_embed_dim = self.text_encoder.config.hidden_size
self.projection = nn.Linear(self.text_embed_dim, projection_dim)
def forward(self, input_ids: Tensor, attention_mask: Tensor) -> Tensor:
"""Encode pre-tokenized text to feature vectors."""
# Ensure inputs are on the same device as the model
device = next(self.parameters()).device
input_ids = input_ids.to(device)
attention_mask = attention_mask.to(device)
with torch.no_grad():
outputs = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
clip_features = outputs.pooler_output
return self.projection(clip_features)
class ObservationEncoder(nn.Module):
"""Handles all observation processing for the conditioning vector."""
def __init__(self, config):
super().__init__()
self.config = config
self._setup_preprocessing(config)
if config.image_features:
self.num_cameras = len(config.image_features)
self.camera_names = list(config.image_features.keys())
if config.use_separate_rgb_encoder_per_camera:
self.vision_encoders = nn.ModuleList(
[CLIPVisionEncoder(model_name=config.vision_encoder_name) for _ in self.camera_names]
)
self.vision_encoder = None
else:
self.vision_encoder = CLIPVisionEncoder(model_name=config.vision_encoder_name)
self.vision_encoders = None
else:
self.vision_encoder = None
self.vision_encoders = None
self.camera_names = []
self.num_cameras = 0
if hasattr(config, "robot_state_feature") and config.robot_state_feature:
self.robot_state_dim = config.robot_state_feature.shape[0]
else:
self.robot_state_dim = 0
self.text_dim = config.hidden_dim
self.text_encoder = CLIPTextEncoder(model_name=config.text_encoder_name, projection_dim=self.text_dim)
self._setup_vector_output()
def _apply_preprocessing(self, images: Tensor) -> Tensor:
if self.do_resize:
images = self.resize(images)
if self.do_crop:
images = self.maybe_random_crop(images) if self.training else self.center_crop(images)
return images
def _setup_preprocessing(self, config):
if config.image_resize_shape is not None:
self.do_resize = True
self.resize = torchvision.transforms.Resize(
size=config.image_resize_shape,
interpolation=torchvision.transforms.InterpolationMode.BILINEAR,
antialias=True,
)
else:
self.do_resize = False
if config.image_crop_shape is not None:
self.do_crop = True
self.center_crop = torchvision.transforms.CenterCrop(config.image_crop_shape)
if config.image_crop_is_random:
self.maybe_random_crop = torchvision.transforms.RandomCrop(config.image_crop_shape)
else:
self.maybe_random_crop = self.center_crop
else:
self.do_crop = False
def _setup_vector_output(self):
total_dim = 0
if self.vision_encoder is not None or self.vision_encoders is not None:
encoder_to_check = self.vision_encoder or next(iter(self.vision_encoders))
feature_map_shape = encoder_to_check.get_output_shape()
c, h, w = feature_map_shape
spatial_feature_dim = c * h * w
total_dim += spatial_feature_dim * self.num_cameras
total_dim += self.robot_state_dim
total_dim += self.text_dim
self.conditioning_dim = total_dim * self.config.n_obs_steps
def encode(self, batch: dict) -> Tensor:
"""Encode observations to vector format."""
batch_size, n_obs_steps = batch[OBS_STATE].shape[:2]
conditioning_feats = []
conditioning_feats.append(batch[OBS_STATE])
if self.vision_encoder is not None or self.vision_encoders is not None:
images = batch[OBS_IMAGES]
if len(images.shape) == 5:
images = images.unsqueeze(1)
if self.config.use_separate_rgb_encoder_per_camera:
camera_features = []
for cam_idx in range(self.num_cameras):
cam_images = images[:, :, cam_idx]
cam_images_flat = einops.rearrange(cam_images, "b s c h w -> (b s) c h w")
cam_images_flat = self._apply_preprocessing(cam_images_flat)
cam_features = self.vision_encoders[cam_idx](cam_images_flat)
cam_visual_features = cam_features.flatten(start_dim=1)
cam_features_reshaped = einops.rearrange(
cam_visual_features, "(b s) f -> b s f", b=batch_size, s=n_obs_steps
)
camera_features.append(cam_features_reshaped)
img_features = torch.cat(camera_features, dim=-1)
conditioning_feats.append(img_features)
else:
images_flat = einops.rearrange(images, "b s n c h w -> (b s n) c h w")
images_flat = self._apply_preprocessing(images_flat)
visual_features = self.vision_encoder(images_flat).flatten(start_dim=1)
img_features = einops.rearrange(
visual_features, "(b s n) f -> b s (n f)", b=batch_size, s=n_obs_steps, n=self.num_cameras
)
conditioning_feats.append(img_features)
if self.text_encoder is not None and OBS_LANGUAGE_TOKENS in batch:
input_ids = batch[OBS_LANGUAGE_TOKENS] # [batch_size, seq_length]
attention_mask = batch[OBS_LANGUAGE_ATTENTION_MASK] # [batch_size, seq_length]
text_features = self.text_encoder(input_ids, attention_mask)
text_features = text_features.unsqueeze(1).expand(-1, n_obs_steps, -1)
conditioning_feats.append(text_features)
combined_features = torch.cat(conditioning_feats, dim=-1)
return combined_features.flatten(start_dim=1)
# -- Transformer Components --
def modulate(x: Tensor, shift: Tensor, scale: Tensor) -> Tensor:
"""Modulate input with shift and scale for AdaLN-Zero."""
return x * (1 + scale) + shift
class SinusoidalPosEmb(nn.Module):
"""Sinusoidal positional embeddings for timesteps."""
def __init__(self, dim: int):
super().__init__()
self.dim = dim
def forward(self, x: Tensor) -> Tensor:
device = x.device
half_dim = self.dim // 2
emb = math.log(10000) / (half_dim - 1)
emb = torch.exp(torch.arange(half_dim, device=device) * -emb)
emb = x[:, None] * emb[None, :]
emb = torch.cat((emb.sin(), emb.cos()), dim=-1)
return emb
class RotaryPositionalEmbedding(nn.Module):
"""Rotary Position Embedding (RoPE) for transformers."""
def __init__(self, head_dim: int, max_seq_len: int = 512, base: float = 10000.0):
super().__init__()
assert head_dim % 2 == 0, "head_dim must be even for RoPE"
self.head_dim = head_dim
self.max_seq_len = max_seq_len
self.base = base
inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
self.register_buffer("inv_freq", inv_freq, persistent=False)
self._precompute_cache(max_seq_len)
def _precompute_cache(self, seq_len: int):
t = torch.arange(seq_len, dtype=self.inv_freq.dtype)
freqs = torch.outer(t, self.inv_freq)
emb = torch.cat((freqs, freqs), dim=-1)
self.register_buffer("_cos_cached", emb.cos()[None, None, :, :], persistent=False)
self.register_buffer("_sin_cached", emb.sin()[None, None, :, :], persistent=False)
def _rotate_half(self, x: Tensor) -> Tensor:
x1 = x[..., : x.shape[-1] // 2]
x2 = x[..., x.shape[-1] // 2 :]
return torch.cat((-x2, x1), dim=-1)
def forward(self, q: Tensor, k: Tensor) -> tuple[Tensor, Tensor]:
seq_len = q.shape[2]
if seq_len > self.max_seq_len:
raise ValueError(f"Sequence length {seq_len} exceeds max_seq_len {self.max_seq_len}.")
cos = self._cos_cached[:, :, :seq_len, :].to(q.dtype)
sin = self._sin_cached[:, :, :seq_len, :].to(q.dtype)
q_rotated = (q * cos) + (self._rotate_half(q) * sin)
k_rotated = (k * cos) + (self._rotate_half(k) * sin)
return q_rotated, k_rotated
class RoPEAttention(nn.Module):
"""Multi-head self-attention with Rotary Position Embedding (RoPE)."""
def __init__(
self,
hidden_size: int,
num_heads: int,
dropout: float = 0.0,
max_seq_len: int = 512,
rope_base: float = 10000.0,
):
super().__init__()
assert hidden_size % num_heads == 0, "hidden_size must be divisible by num_heads"
self.hidden_size = hidden_size
self.num_heads = num_heads
self.head_dim = hidden_size // num_heads
self.scale = self.head_dim**-0.5
self.qkv_proj = nn.Linear(hidden_size, 3 * hidden_size, bias=True)
self.out_proj = nn.Linear(hidden_size, hidden_size, bias=True)
self.dropout = nn.Dropout(dropout) if dropout > 0 else nn.Identity()
self.rope = RotaryPositionalEmbedding(head_dim=self.head_dim, max_seq_len=max_seq_len, base=rope_base)
def forward(self, x: Tensor) -> Tensor:
B, T, _ = x.shape # noqa: N806
qkv = self.qkv_proj(x)
qkv = qkv.reshape(B, T, 3, self.num_heads, self.head_dim)
qkv = qkv.permute(2, 0, 3, 1, 4)
q, k, v = qkv[0], qkv[1], qkv[2]
q, k = self.rope(q, k)
attn_out = torch.nn.functional.scaled_dot_product_attention(
q,
k,
v,
dropout_p=self.dropout.p if isinstance(self.dropout, nn.Dropout) and self.training else 0.0,
)
attn_out = attn_out.transpose(1, 2).reshape(B, T, self.hidden_size)
return self.out_proj(attn_out)
class TransformerBlock(nn.Module):
"""DiT-style transformer block with AdaLN-Zero."""
def __init__(
self,
hidden_size: int = 128,
num_heads: int = 4,
num_features: int = 128,
dropout: float = 0.0,
use_rope: bool = False,
max_seq_len: int = 512,
rope_base: float = 10000.0,
):
super().__init__()
self.use_rope = use_rope
if use_rope:
self.attn = RoPEAttention(
hidden_size=hidden_size,
num_heads=num_heads,
dropout=dropout,
max_seq_len=max_seq_len,
rope_base=rope_base,
)
else:
self.multihead_attn = nn.MultiheadAttention(
hidden_size, num_heads=num_heads, batch_first=True, dropout=dropout
)
self.norm1 = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
self.norm2 = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
self.mlp = nn.Sequential(
nn.Linear(hidden_size, hidden_size * 4),
nn.GELU(approximate="tanh"),
nn.Linear(hidden_size * 4, hidden_size),
)
self.adaLN_modulation = nn.Sequential(nn.SiLU(), nn.Linear(num_features, 6 * hidden_size, bias=True))
def forward(self, x: Tensor, features: Tensor) -> Tensor:
shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = self.adaLN_modulation(
features
).chunk(6, dim=1)
attn_input = modulate(self.norm1(x), shift_msa.unsqueeze(1), scale_msa.unsqueeze(1))
if self.use_rope:
attn_out = self.attn(attn_input)
else:
attn_out, _ = self.multihead_attn(attn_input, attn_input, attn_input)
x = x + gate_msa.unsqueeze(1) * attn_out
mlp_input = modulate(self.norm2(x), shift_mlp.unsqueeze(1), scale_mlp.unsqueeze(1))
mlp_out = self.mlp(mlp_input)
x = x + gate_mlp.unsqueeze(1) * mlp_out
return x
class DiffusionTransformer(nn.Module):
"""Transformer-based diffusion noise prediction model."""
def __init__(self, config, conditioning_dim: int):
super().__init__()
self.config = config
self.conditioning_dim = conditioning_dim
self.action_dim = config.action_feature.shape[0]
self.horizon = config.horizon
self.hidden_size = config.hidden_dim
self.num_layers = config.num_layers
self.num_heads = config.num_heads
self.dropout = config.dropout
self.use_rope = config.use_rope
self.timestep_embed_dim = config.timestep_embed_dim
self.time_mlp = nn.Sequential(
SinusoidalPosEmb(self.timestep_embed_dim),
nn.Linear(self.timestep_embed_dim, 2 * self.timestep_embed_dim),
nn.GELU(),
nn.Linear(2 * self.timestep_embed_dim, self.timestep_embed_dim),
nn.GELU(),
)
self.cond_dim = self.timestep_embed_dim + conditioning_dim
self.input_proj = nn.Linear(self.action_dim, self.hidden_size)
if config.use_positional_encoding:
self.pos_embedding = nn.Parameter(
torch.empty(1, self.horizon, self.hidden_size).normal_(std=0.02)
)
else:
self.pos_embedding = None
self.transformer_blocks = nn.ModuleList(
[
TransformerBlock(
hidden_size=self.hidden_size,
num_heads=self.num_heads,
num_features=self.cond_dim,
dropout=self.dropout,
use_rope=self.use_rope,
max_seq_len=self.horizon,
rope_base=config.rope_base,
)
for _ in range(self.num_layers)
]
)
self.output_proj = nn.Linear(self.hidden_size, self.action_dim)
self._initialize_weights()
def _initialize_weights(self):
for block in self.transformer_blocks:
nn.init.constant_(block.adaLN_modulation[-1].weight, 0)
nn.init.constant_(block.adaLN_modulation[-1].bias, 0)
def forward(self, x: Tensor, timestep: Tensor, conditioning_vec: Tensor) -> Tensor:
_, seq_len, _ = x.shape
timestep_features = self.time_mlp(timestep)
cond_features = torch.cat([timestep_features, conditioning_vec], dim=-1)
hidden_seq = self.input_proj(x)
if self.pos_embedding is not None:
hidden_seq = hidden_seq + self.pos_embedding[:, :seq_len, :]
for block in self.transformer_blocks:
hidden_seq = block(hidden_seq, cond_features)
return self.output_proj(hidden_seq)
# -- Objectives --
class DiffusionObjective(nn.Module):
"""Standard diffusion (DDPM/DDIM) objective implementation."""
def __init__(self, config, action_dim: int, horizon: int, do_mask_loss_for_padding: bool = False):
super().__init__()
self.config = config
self.action_dim = action_dim
self.horizon = horizon
self.do_mask_loss_for_padding = do_mask_loss_for_padding
scheduler_kwargs = {
"num_train_timesteps": config.num_train_timesteps,
"beta_start": config.beta_start,
"beta_end": config.beta_end,
"beta_schedule": config.beta_schedule,
"clip_sample": config.clip_sample,
"clip_sample_range": config.clip_sample_range,
"prediction_type": config.prediction_type,
}
if config.noise_scheduler_type == "DDPM":
self.noise_scheduler: DDPMScheduler | DDIMScheduler = DDPMScheduler(**scheduler_kwargs)
elif config.noise_scheduler_type == "DDIM":
self.noise_scheduler = DDIMScheduler(**scheduler_kwargs)
else:
raise ValueError(f"Unsupported noise scheduler type {config.noise_scheduler_type}")
self.num_inference_steps = (
config.num_inference_steps
if config.num_inference_steps is not None
else self.noise_scheduler.config.num_train_timesteps
)
def compute_loss(self, model: nn.Module, batch: dict[str, Tensor], conditioning_vec: Tensor) -> Tensor:
clean_actions = batch[ACTION]
noise = torch.randn_like(clean_actions)
timesteps = torch.randint(
low=0,
high=self.noise_scheduler.config.num_train_timesteps,
size=(clean_actions.shape[0],),
device=clean_actions.device,
).long()
noisy_actions = self.noise_scheduler.add_noise(clean_actions, noise, timesteps)
prediction_type = self.noise_scheduler.config.prediction_type
if prediction_type == "epsilon":
target = noise
elif prediction_type == "sample":
target = clean_actions
else:
raise ValueError(f"Unsupported prediction type: {prediction_type}")
predicted = model(noisy_actions, timesteps, conditioning_vec=conditioning_vec)
loss = F.mse_loss(predicted, target, reduction="none")
if self.do_mask_loss_for_padding and "action_is_pad" in batch:
valid_actions = ~batch["action_is_pad"]
loss = loss * valid_actions.unsqueeze(-1)
return loss.mean()
def conditional_sample(self, model: nn.Module, batch_size: int, conditioning_vec: Tensor) -> Tensor:
device = next(model.parameters()).device
dtype = next(model.parameters()).dtype
sample = torch.randn(
size=(batch_size, self.horizon, self.action_dim),
dtype=dtype,
device=device,
)
self.noise_scheduler.set_timesteps(self.num_inference_steps)
for t in self.noise_scheduler.timesteps:
model_output = model(
sample,
torch.full(sample.shape[:1], t, dtype=torch.long, device=sample.device),
conditioning_vec=conditioning_vec,
)
sample = self.noise_scheduler.step(model_output, t, sample).prev_sample
return sample
class FlowMatchingObjective(nn.Module):
"""Flow matching objective: trains a model to predict velocity fields."""
def __init__(self, config, action_dim: int, horizon: int, do_mask_loss_for_padding: bool = False):
super().__init__()
self.config = config
self.action_dim = action_dim
self.horizon = horizon
self.do_mask_loss_for_padding = do_mask_loss_for_padding
def _sample_timesteps(self, batch_size: int, device: torch.device) -> Tensor:
if self.config.timestep_sampling_strategy == "uniform":
return torch.rand(batch_size, device=device)
elif self.config.timestep_sampling_strategy == "beta":
beta_dist = torch.distributions.Beta(
self.config.timestep_sampling_alpha, self.config.timestep_sampling_beta
)
u = beta_dist.sample((batch_size,)).to(device)
return self.config.timestep_sampling_s * (1.0 - u)
else:
raise ValueError(f"Unknown timestep strategy: {self.config.timestep_sampling_strategy}")
def compute_loss(self, model: nn.Module, batch: dict[str, Tensor], conditioning_vec: Tensor) -> Tensor:
data = batch[ACTION]
batch_size = data.shape[0]
device = data.device
noise = torch.randn_like(data)
t = self._sample_timesteps(batch_size, device)
t_expanded = t.view(-1, 1, 1)
x_t = t_expanded * data + (1 - (1 - self.config.sigma_min) * t_expanded) * noise
target_velocity = data - (1 - self.config.sigma_min) * noise
predicted_velocity = model(x_t, t, conditioning_vec=conditioning_vec)
loss = F.mse_loss(predicted_velocity, target_velocity, reduction="none")
if self.do_mask_loss_for_padding and "action_is_pad" in batch:
valid_mask = ~batch["action_is_pad"]
loss = loss * valid_mask.unsqueeze(-1)
return loss.mean()
def conditional_sample(self, model: nn.Module, batch_size: int, conditioning_vec: Tensor) -> Tensor:
device = next(model.parameters()).device
dtype = next(model.parameters()).dtype
x = torch.randn((batch_size, self.horizon, self.action_dim), dtype=dtype, device=device)
num_steps = self.config.num_integration_steps
time_grid = torch.linspace(0, 1, num_steps + 1, device=device)
if self.config.integration_method == "euler":
x = self._euler_integrate(model, x, time_grid, conditioning_vec)
elif self.config.integration_method == "rk4":
x = self._rk4_integrate(model, x, time_grid, conditioning_vec)
else:
raise ValueError(f"Unknown integration method: {self.config.integration_method}")
return x
def _euler_integrate(
self, model: nn.Module, x_init: Tensor, time_grid: Tensor, conditioning_vec: Tensor
) -> Tensor:
x = x_init
for i in range(len(time_grid) - 1):
t_scalar = time_grid[i].item()
dt = (time_grid[i + 1] - time_grid[i]).item()
t_batch = torch.full((x.shape[0],), t_scalar, dtype=x.dtype, device=x.device)
with torch.no_grad():
velocity = model(x, t_batch, conditioning_vec=conditioning_vec)
x = x + dt * velocity
return x
def _rk4_integrate(
self, model: nn.Module, x_init: Tensor, time_grid: Tensor, conditioning_vec: Tensor
) -> Tensor:
x = x_init
def dynamics(x_val: Tensor, t_scalar: float) -> Tensor:
t_batch = torch.full((x_val.shape[0],), t_scalar, dtype=x_val.dtype, device=x_val.device)
with torch.no_grad():
return model(x_val, t_batch, conditioning_vec=conditioning_vec)
for i in range(len(time_grid) - 1):
t = time_grid[i].item()
dt = (time_grid[i + 1] - time_grid[i]).item()
k1 = dynamics(x, t)
k2 = dynamics(x + dt * k1 / 2, t + dt / 2)
k3 = dynamics(x + dt * k2 / 2, t + dt / 2)
k4 = dynamics(x + dt * k3, t + dt)
x = x + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
return x

View File

@@ -0,0 +1,105 @@
#!/usr/bin/env python
# Copyright 2025 Bryson Jones and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import Any
import torch
from lerobot.policies.multi_task_dit.configuration_multi_task_dit import MultiTaskDiTConfig
from lerobot.processor import (
AddBatchDimensionProcessorStep,
DeviceProcessorStep,
NormalizerProcessorStep,
PolicyAction,
PolicyProcessorPipeline,
RenameObservationsProcessorStep,
TokenizerProcessorStep,
UnnormalizerProcessorStep,
)
from lerobot.processor.converters import policy_action_to_transition, transition_to_policy_action
from lerobot.utils.constants import POLICY_POSTPROCESSOR_DEFAULT_NAME, POLICY_PREPROCESSOR_DEFAULT_NAME
def make_multi_task_dit_pre_post_processors(
config: MultiTaskDiTConfig,
dataset_stats: dict[str, dict[str, torch.Tensor]] | None = None,
) -> tuple[
PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
PolicyProcessorPipeline[PolicyAction, PolicyAction],
]:
"""
Constructs pre-processor and post-processor pipelines for a Multi-Task DiT policy.
The pre-processing pipeline prepares the input data for the model by:
1. Renaming features.
2. Adding a batch dimension.
3. Tokenizing the language task description (if present).
4. Moving the data to the specified device.
5. Normalizing the input and output features based on dataset statistics.
The post-processing pipeline handles the model's output by:
1. Unnormalizing the output features to their original scale.
2. Moving the data to the CPU.
Args:
config: The configuration object for the Multi-Task DiT policy,
containing feature definitions, normalization mappings, and device information.
dataset_stats: A dictionary of statistics used for normalization.
Defaults to None.
Returns:
A tuple containing the configured pre-processor and post-processor pipelines.
"""
input_steps = [
RenameObservationsProcessorStep(rename_map={}),
AddBatchDimensionProcessorStep(),
TokenizerProcessorStep(
tokenizer_name=config.text_encoder_name,
padding=config.tokenizer_padding,
padding_side=config.tokenizer_padding_side,
max_length=config.tokenizer_max_length,
truncation=config.tokenizer_truncation,
),
DeviceProcessorStep(device=config.device),
NormalizerProcessorStep(
features={**config.input_features, **config.output_features},
norm_map=config.normalization_mapping,
stats=dataset_stats,
device=config.device,
),
]
output_steps = [
UnnormalizerProcessorStep(
features=config.output_features,
norm_map=config.normalization_mapping,
stats=dataset_stats,
),
DeviceProcessorStep(device="cpu"),
]
return (
PolicyProcessorPipeline[dict[str, Any], dict[str, Any]](
steps=input_steps,
name=POLICY_PREPROCESSOR_DEFAULT_NAME,
),
PolicyProcessorPipeline[PolicyAction, PolicyAction](
steps=output_steps,
name=POLICY_POSTPROCESSOR_DEFAULT_NAME,
to_transition=policy_action_to_transition,
to_output=transition_to_policy_action,
),
)

View File

@@ -17,6 +17,65 @@ It is designed as a **Vision-Language-Action model for general robot control**.
---
## Relative Actions
π₀ supports training with **relative actions**, where the model learns relative offsets
from the current robot state instead of absolute joint positions. This mirrors the
relative-action transform in OpenPI (`DeltaActions`) and can improve performance.
### How it works
1. **During preprocessing**, absolute actions are converted to relative offsets:
`relative = action - state` (for selected joints).
2. The relative actions are normalized using statistics computed from the relative distribution.
3. **During postprocessing**, predicted relative actions are converted back to absolute:
`absolute = relative + state`.
Joints listed in `relative_exclude_joints` (e.g., gripper) are kept absolute.
### Configuration
| Parameter | Type | Default | Description |
| ------------------------- | ----------- | ------------- | ---------------------------------------------------------------- |
| `use_relative_actions` | `bool` | `False` | Enable relative-action training |
| `relative_exclude_joints` | `list[str]` | `["gripper"]` | Joint names to keep absolute (matched by substring) |
| `action_feature_names` | `list[str]` | `None` | Auto-populated from dataset metadata at runtime by `make_policy` |
### Training example
```bash
python -m lerobot.scripts.lerobot_train \
--policy.type=pi0 \
--dataset.repo_id=your_org/your_dataset \
--policy.use_relative_actions=true \
--policy.relative_exclude_joints='["gripper"]'
```
When `use_relative_actions=true`, the training script automatically:
- Computes relative action statistics from the dataset (sampled chunk-level relative actions)
- Replaces the standard action stats with relative stats for normalization
- Broadcasts these stats across all ranks in distributed training
### Recomputing stats for an existing dataset
If you want to precompute relative action stats offline, use `recompute_stats` from
`lerobot.datasets.dataset_tools`:
```python
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.datasets.dataset_tools import recompute_stats
dataset = LeRobotDataset("your_org/your_dataset")
dataset = recompute_stats(
dataset,
relative_action=True,
relative_exclude_joints=["gripper"],
)
```
---
## Citation
If you use this work, please cite both **OpenPI** and the π₀ paper:

View File

@@ -50,6 +50,13 @@ class PI0Config(PreTrainedConfig):
min_period: float = 4e-3
max_period: float = 4.0
# Relative actions: converts absolute actions to relative (relative to state).
use_relative_actions: bool = False
# Joint names to exclude from relative (kept absolute). Empty list = all dims relative.
relative_exclude_joints: list[str] = field(default_factory=lambda: ["gripper"])
# Populated at runtime from dataset metadata by make_policy.
action_feature_names: list[str] | None = None
# Real-Time Chunking (RTC) configuration
rtc_config: RTCConfig | None = None

View File

@@ -21,6 +21,7 @@ import torch
from lerobot.configs.types import PipelineFeatureType, PolicyFeature
from lerobot.policies.pi0.configuration_pi0 import PI0Config
from lerobot.processor import (
AbsoluteActionsProcessorStep,
AddBatchDimensionProcessorStep,
ComplementaryDataProcessorStep,
DeviceProcessorStep,
@@ -29,6 +30,7 @@ from lerobot.processor import (
PolicyProcessorPipeline,
ProcessorStep,
ProcessorStepRegistry,
RelativeActionsProcessorStep,
RenameObservationsProcessorStep,
TokenizerProcessorStep,
UnnormalizerProcessorStep,
@@ -126,7 +128,13 @@ def make_pi0_pre_post_processors(
A tuple containing the configured pre-processor and post-processor pipelines.
"""
# Add remaining processors
relative_step = RelativeActionsProcessorStep(
enabled=config.use_relative_actions,
exclude_joints=getattr(config, "relative_exclude_joints", []),
action_names=getattr(config, "action_feature_names", None),
)
# OpenPI order: raw → relative → normalize → model → unnormalize → absolute
input_steps: list[ProcessorStep] = [
RenameObservationsProcessorStep(rename_map={}), # To mimic the same processor as pretrained one
AddBatchDimensionProcessorStep(),
@@ -138,6 +146,7 @@ def make_pi0_pre_post_processors(
padding="max_length",
),
DeviceProcessorStep(device=config.device),
relative_step,
NormalizerProcessorStep(
features={**config.input_features, **config.output_features},
norm_map=config.normalization_mapping,
@@ -149,6 +158,7 @@ def make_pi0_pre_post_processors(
UnnormalizerProcessorStep(
features=config.output_features, norm_map=config.normalization_mapping, stats=dataset_stats
),
AbsoluteActionsProcessorStep(enabled=config.use_relative_actions, relative_step=relative_step),
DeviceProcessorStep(device="cpu"),
]

View File

@@ -17,6 +17,48 @@ It is designed as a **Vision-Language-Action model with open-world generalizatio
---
## Relative Actions
π₀.₅ supports training with **relative actions**, where the model learns relative offsets
from the current robot state instead of absolute joint positions. This mirrors the
relative-action transform in OpenPI (`DeltaActions`) and can improve performance.
### How it works
1. **During preprocessing**, absolute actions are converted to relative offsets:
`relative = action - state` (for selected joints).
2. The relative actions are normalized using statistics computed from the relative distribution.
3. **During postprocessing**, predicted relative actions are converted back to absolute:
`absolute = relative + state`.
Joints listed in `relative_exclude_joints` (e.g., gripper) are kept absolute.
### Configuration
| Parameter | Type | Default | Description |
| ------------------------- | ----------- | ------------- | ---------------------------------------------------------------- |
| `use_relative_actions` | `bool` | `False` | Enable relative-action training |
| `relative_exclude_joints` | `list[str]` | `["gripper"]` | Joint names to keep absolute (matched by substring) |
| `action_feature_names` | `list[str]` | `None` | Auto-populated from dataset metadata at runtime by `make_policy` |
### Training example
```bash
python -m lerobot.scripts.lerobot_train \
--policy.type=pi05 \
--dataset.repo_id=your_org/your_dataset \
--policy.use_relative_actions=true \
--policy.relative_exclude_joints='["gripper"]'
```
When `use_relative_actions=true`, the training script automatically:
- Computes relative action statistics from the dataset (sampled chunk-level relative actions)
- Replaces the standard action stats with relative stats for normalization
- Broadcasts these stats across all ranks in distributed training
---
## Citation
If you use this work, please cite both **OpenPI** and the π₀.₅ paper:

View File

@@ -50,6 +50,13 @@ class PI05Config(PreTrainedConfig):
min_period: float = 4e-3
max_period: float = 4.0
# Relative actions: converts absolute actions to relative (relative to state).
use_relative_actions: bool = False
# Joint names to exclude from relative (kept absolute). Empty list = all dims relative.
relative_exclude_joints: list[str] = field(default_factory=lambda: ["gripper"])
# Populated at runtime from dataset metadata by make_policy.
action_feature_names: list[str] | None = None
# Real-Time Chunking (RTC) configuration
rtc_config: RTCConfig | None = None

View File

@@ -24,6 +24,7 @@ import torch
from lerobot.configs.types import PipelineFeatureType, PolicyFeature
from lerobot.policies.pi05.configuration_pi05 import PI05Config
from lerobot.processor import (
AbsoluteActionsProcessorStep,
AddBatchDimensionProcessorStep,
DeviceProcessorStep,
NormalizerProcessorStep,
@@ -31,6 +32,7 @@ from lerobot.processor import (
PolicyProcessorPipeline,
ProcessorStep,
ProcessorStepRegistry,
RelativeActionsProcessorStep,
RenameObservationsProcessorStep,
TokenizerProcessorStep,
UnnormalizerProcessorStep,
@@ -125,10 +127,17 @@ def make_pi05_pre_post_processors(
A tuple containing the configured pre-processor and post-processor pipelines.
"""
# Add remaining processors
relative_step = RelativeActionsProcessorStep(
enabled=config.use_relative_actions,
exclude_joints=getattr(config, "relative_exclude_joints", []),
action_names=getattr(config, "action_feature_names", None),
)
# OpenPI order: raw → relative → normalize → model → unnormalize → absolute
input_steps: list[ProcessorStep] = [
RenameObservationsProcessorStep(rename_map={}), # To mimic the same processor as pretrained one
AddBatchDimensionProcessorStep(),
relative_step,
# NOTE: NormalizerProcessorStep MUST come before Pi05PrepareStateTokenizerProcessorStep
# because the tokenizer step expects normalized state in [-1, 1] range for discretization
NormalizerProcessorStep(
@@ -150,6 +159,7 @@ def make_pi05_pre_post_processors(
UnnormalizerProcessorStep(
features=config.output_features, norm_map=config.normalization_mapping, stats=dataset_stats
),
AbsoluteActionsProcessorStep(enabled=config.use_relative_actions, relative_step=relative_step),
DeviceProcessorStep(device="cpu"),
]

View File

@@ -41,6 +41,13 @@ class PI0FastConfig(PreTrainedConfig):
max_action_dim: int = 32
max_action_tokens: int = 256
# Relative actions: converts absolute actions to relative (relative to state).
use_relative_actions: bool = False
# Joint names to exclude from relative (kept absolute). Empty list = all dims relative.
relative_exclude_joints: list[str] = field(default_factory=lambda: ["gripper"])
# Populated at runtime from dataset metadata by make_policy.
action_feature_names: list[str] | None = None
# Real-Time Chunking (RTC) configuration
rtc_config: RTCConfig | None = None

Some files were not shown because too many files have changed in this diff Show More