diff --git a/README.md b/README.md
index 35b28da87..57fec2e5f 100644
--- a/README.md
+++ b/README.md
@@ -100,11 +100,11 @@ lerobot-train \
   --dataset.repo_id=lerobot/aloha_mobile_cabinet
 ```
 
-| Category | Models |
-| -------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| **Imitation Learning** | [ACT](./docs/source/policy_act_README.md), [Diffusion](./docs/source/policy_diffusion_README.md), [VQ-BeT](./docs/source/policy_vqbet_README.md) |
-| **Reinforcement Learning** | [HIL-SERL](./docs/source/hilserl.mdx), [TDMPC](./docs/source/policy_tdmpc_README.md) & QC-FQL (coming soon) |
-| **VLAs Models** | [Pi0.5](./docs/source/pi05.mdx), [GR00T N1.5](./docs/source/policy_groot_README.md), [SmolVLA](./docs/source/policy_smolvla_README.md), [XVLA](./docs/source/xvla.mdx) |
+| Category | Models |
+| -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| **Imitation Learning** | [ACT](./docs/source/policy_act_README.md), [Diffusion](./docs/source/policy_diffusion_README.md), [VQ-BeT](./docs/source/policy_vqbet_README.md) |
+| **Reinforcement Learning** | [HIL-SERL](./docs/source/hilserl.mdx), [TDMPC](./docs/source/policy_tdmpc_README.md) & QC-FQL (coming soon) |
+| **VLAs Models** | [Pi0Fast](./docs/source/pi0fast.mdx), [Pi0.5](./docs/source/pi05.mdx), [GR00T N1.5](./docs/source/policy_groot_README.md), [SmolVLA](./docs/source/policy_smolvla_README.md), [XVLA](./docs/source/xvla.mdx) |
 
 Similarly to the hardware, you can easily implement your own policy & leverage LeRobot's data collection, training, and visualization tools, and share your model to the HF Hub
 
diff --git a/docs/source/groot.mdx b/docs/source/groot.mdx
index 729a64656..6c036c9d5 100644
--- a/docs/source/groot.mdx
+++ b/docs/source/groot.mdx
@@ -12,6 +12,12 @@ Developers and researchers can post-train GR00T N1.5 with their own real or synt
 
 GR00T N1.5 (specifically the GR00T-N1.5-3B model) is built using pre-trained vision and language encoders. It utilizes a flow matching action transformer to model a chunk of actions, conditioned on vision, language, and proprioception.
 
+An overview of GR00T
+
 Its strong performance comes from being trained on an expansive and diverse humanoid dataset, which includes:
 
 - Real captured data from robots.
diff --git a/docs/source/pi0.mdx b/docs/source/pi0.mdx
index 89604b6aa..93e0b4c88 100644
--- a/docs/source/pi0.mdx
+++ b/docs/source/pi0.mdx
@@ -6,6 +6,12 @@
 
 π₀ represents a breakthrough in robotics as the first general-purpose robot foundation model developed by [Physical Intelligence](https://www.physicalintelligence.company/blog/pi0). Unlike traditional robot programs that are narrow specialists programmed for repetitive motions, π₀ is designed to be a generalist policy that can understand visual inputs, interpret natural language instructions, and control a variety of different robots across diverse tasks.
 
+An overview of Pi0
+
 ### The Vision for Physical Intelligence
 
 As described by Physical Intelligence, while AI has achieved remarkable success in digital domains, from chess-playing to drug discovery, human intelligence still dramatically outpaces AI in the physical world. To paraphrase Moravec's paradox, winning a game of chess represents an "easy" problem for AI, but folding a shirt or cleaning up a table requires solving some of the most difficult engineering problems ever conceived. π₀ represents a first step toward developing artificial physical intelligence that enables users to simply ask robots to perform any task they want, just like they can with large language models.
diff --git a/docs/source/pi0fast.mdx b/docs/source/pi0fast.mdx
index e64355765..c4230fa79 100644
--- a/docs/source/pi0fast.mdx
+++ b/docs/source/pi0fast.mdx
@@ -6,6 +6,12 @@
 
 π₀-FAST combines the power of Vision-Language Models with a novel action tokenization approach called **FAST (Frequency-space Action Sequence Tokenization)**. This enables training autoregressive VLAs on highly dexterous tasks that are impossible with standard binning-based discretization, while training **up to 5x faster** than diffusion-based approaches like π₀.
 
+An overview of Pi0-FAST
+
 ### Why FAST?
 
 Standard approaches for robot action tokenization use simple per-dimension, per-timestep binning schemes. While passable for simple behaviors, this rapidly breaks down for complex and dexterous skills that require precision and high-frequency control.
@@ -53,7 +59,7 @@ You have two options for the FAST tokenizer:
 ### Training Your Own Tokenizer
 
 ```bash
-python src/lerobot/policies/pi0_fast/train_fast_tokenizer.py \
+lerobot-train-tokenizer \
   --repo_id "user/my-lerobot-dataset" \
   --action_horizon 10 \
   --encoded_dims "0:6" \
@@ -90,7 +96,7 @@ policy.type=pi0_fast
 For training π₀-FAST, you can use the LeRobot training script:
 
 ```bash
-python src/lerobot/scripts/lerobot_train.py \
+lerobot-train \
   --dataset.repo_id=your_dataset \
   --policy.type=pi0_fast \
   --output_dir=./outputs/pi0fast_training \
@@ -171,6 +177,64 @@ The model takes images, text instructions, and robot state as input, and outputs
 | Inference Method | Iterative Denoising | Autoregressive Decoding |
 | KV-Caching | N/A | Supported |
 
+## Reproducing π₀Fast results
+
+We reproduce the results of π₀Fast on the LIBERO benchmark using the LeRobot implementation. We take the LeRobot π₀Fast base model [lerobot/pi0fast-base](https://huggingface.co/lerobot/pi0fast-base) and finetune for an additional 40k steps in bfloat16, with a batch size of 256 on 8 H100 GPUs using the [HuggingFace LIBERO dataset](https://huggingface.co/datasets/HuggingFaceVLA/libero).
+
+The finetuned model can be found here:
+
+- **π₀Fast LIBERO**: [lerobot/pi0fast-libero](https://huggingface.co/lerobot/pi0fast-libero)
+
+With the following training command:
+
+```bash
+lerobot-train \
+  --dataset.repo_id=lerobot/libero \
+  --output_dir=outputs/libero_pi0fast \
+  --job_name=libero_pi0fast \
+  --policy.path=lerobot/pi0fast_base \
+  --policy.dtype=bfloat16 \
+  --steps=100000 \
+  --save_freq=20000 \
+  --batch_size=4 \
+  --policy.device=cuda \
+  --policy.scheduler_warmup_steps=4000 \
+  --policy.scheduler_decay_steps=100000 \
+  --policy.scheduler_decay_lr=1e-5 \
+  --policy.gradient_checkpointing=true \
+  --policy.chunk_size=10 \
+  --policy.n_action_steps=10 \
+  --policy.max_action_tokens=256 \
+  --policy.empty_cameras=1
+```
+
+We then evaluate the finetuned model using the LeRobot LIBERO implementation, by running the following command:
+
+```bash
+tasks="libero_object,libero_spatial,libero_goal,libero_10"
+lerobot-eval \
+  --policy.path=lerobot/pi0fast-libero \
+  --policy.max_action_tokens=256 \
+  --env.type=libero \
+  --policy.gradient_checkpointing=false \
+  --env.task=${tasks} \
+  --eval.batch_size=1 \
+  --eval.n_episodes=1 \
+  --rename_map='{"observation.images.image":"observation.images.base_0_rgb","observation.images.image2":"observation.images.left_wrist_0_rgb"}'
+```
+
+**Note:** We set `n_action_steps=10`, similar to the original OpenPI implementation.
+
+### Results
+
+We obtain the following results on the LIBERO benchmark:
+
+| Model | LIBERO Spatial | LIBERO Object | LIBERO Goal | LIBERO 10 | Average |
+| ----------- | -------------- | ------------- | ----------- | --------- | -------- |
+| **π₀-fast** | 70.0 | 100.0 | 100.0 | 60.0 | **82.5** |
+
+The full evaluation output folder, including videos, is available [here](https://drive.google.com/drive/folders/1HXpwPTRm4hx6g1sF2P7OOqGG0TwPU7LQ?usp=sharing)
+
 ## License
 
 This model follows the **Apache 2.0 License**, consistent with the original [OpenPI repository](https://github.com/Physical-Intelligence/openpi).
diff --git a/docs/source/sarm.mdx b/docs/source/sarm.mdx
index 321097692..65e49792b 100644
--- a/docs/source/sarm.mdx
+++ b/docs/source/sarm.mdx
@@ -4,6 +4,12 @@ SARM (Stage-Aware Reward Modeling) is a video-based reward modeling framework fo
 
 **Paper**: [SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation](https://arxiv.org/abs/2509.25358)
 
+An overview of SARM
+
 ## Why Reward Models?
 
 Standard behavior cloning treats all demonstration frames equally, but real-world robot datasets are messy. They contain hesitations, corrections, and variable-quality trajectories. Reward models solve this by learning a generalizable notion of **task progress** from demonstrations: given video frames and a task description, they predict how close the robot is to completing the task (0→1). This learned "progress signal" can be used in multiple ways, two promising applications are: (1) **weighted imitation learning** (RA-BC), where high-progress frames receive more weight during policy training, and (2) **reinforcement learning**, where the reward model provides dense rewards for online or offline policy improvement.
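Review note on the π₀-FAST LIBERO results table added in `docs/source/pi0fast.mdx` earlier in this diff: the Average column appears to be the unweighted mean of the four suite scores. A minimal sketch that recomputes it is below; the dictionary name and layout are illustrative assumptions and not part of LeRobot.

```python
# Per-suite success rates (%) copied from the results table in docs/source/pi0fast.mdx.
# `libero_success` is a hypothetical name used only for this check.
libero_success = {
    "libero_spatial": 70.0,
    "libero_object": 100.0,
    "libero_goal": 100.0,
    "libero_10": 60.0,
}

# Unweighted mean across the four suites.
average = sum(libero_success.values()) / len(libero_success)
print(f"{average:.1f}")  # 82.5
```

The recomputed value matches the 82.5 reported in the table.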
diff --git a/docs/source/walloss.mdx b/docs/source/walloss.mdx
index 12e9b1fc7..c0756c087 100644
--- a/docs/source/walloss.mdx
+++ b/docs/source/walloss.mdx
@@ -8,6 +8,12 @@ X Square Robot’s WALL-OSS is now integrated into Hugging Face’s LeRobot ecos
 
 The WALL-OSS team is building the embodied foundation model to capture and compress the world's most valuable data: the continuous, high-fidelity stream of physical interaction. By creating a direct feedback loop between the model's decisions and the body's lived experience, the emergence of a truly generalizable intelligence is enabled—one that understands not just how the world works, but how to act effectively within it.
 
+An overview of WALL-OSS
+
 Technically, WALL-OSS introduces a tightly coupled multimodal architecture (tightly-coupled MoE structure) that integrates both discrete and continuous action modeling strategies. Through a two-stage training pipeline (Inspiration → Integration), the model gradually unifies semantic reasoning and high-frequency action generation. Its core innovations include:
 
 - **Embodied perception–enhanced multimodal pretraining**: Large-scale training on unified vision–language–action data to strengthen spatial, causal, and manipulation understanding.
diff --git a/pyproject.toml b/pyproject.toml
index 75738d2de..e8f334c77 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -197,6 +197,7 @@ lerobot-setup-motors="lerobot.scripts.lerobot_setup_motors:main"
 lerobot-teleoperate="lerobot.scripts.lerobot_teleoperate:main"
 lerobot-eval="lerobot.scripts.lerobot_eval:main"
 lerobot-train="lerobot.scripts.lerobot_train:main"
+lerobot-train-tokenizer="lerobot.scripts.lerobot_train_tokenizer:main"
 lerobot-dataset-viz="lerobot.scripts.lerobot_dataset_viz:main"
 lerobot-info="lerobot.scripts.lerobot_info:main"
 lerobot-find-joint-limits="lerobot.scripts.lerobot_find_joint_limits:main"
diff --git a/src/lerobot/policies/pi0_fast/train_fast_tokenizer.py b/src/lerobot/scripts/lerobot_train_tokenizer.py
similarity index 94%
rename from src/lerobot/policies/pi0_fast/train_fast_tokenizer.py
rename to src/lerobot/scripts/lerobot_train_tokenizer.py
index 6a3a1fe69..03bfcaaf8 100644
--- a/src/lerobot/policies/pi0_fast/train_fast_tokenizer.py
+++ b/src/lerobot/scripts/lerobot_train_tokenizer.py
@@ -1,3 +1,16 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 """Train FAST tokenizer for action encoding.
 
 This script:
@@ -6,6 +19,26 @@ This script:
 3. Trains FAST tokenizer on specified action dimensions
 4. Saves tokenizer to assets directory
 5. Reports compression statistics
+
+Example:
+
+```shell
+lerobot-train-tokenizer \
+  --repo_id=user/dataset_name \
+  --action_horizon=10 \
+  --max_episodes=100 \
+  --sample_fraction=0.1 \
+  --encoded_dims="0:6" \
+  --delta_dims="0,1,2,3,4,5" \
+  --use_delta_transform=true \
+  --state_key="observation.state" \
+  --normalization_mode="QUANTILES" \
+  --vocab_size=1024 \
+  --scale=10.0 \
+  --output_dir="./fast_tokenizer_dataset_name" \
+  --push_to_hub=true \
+  --hub_repo_id="user/fast_tokenizer_dataset_name" \
+  --hub_private=false
 """
 
 import json