add more base models to generate model card

add port and fix formatting
update policy deployment instruction with rollout
2026-06-03 04:11:24 +00:00 · 2026-05-20 12:24:32 +02:00 · 2026-05-20 10:56:59 +02:00 · 2026-05-20 10:55:18 +02:00
81 changed files with 983 additions and 25093 deletions
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -59,12 +59,10 @@
    title: π₀-FAST (Pi0Fast)
  - local: pi05
    title: π₀.₅ (Pi05)
-  - local: molmoact2
-    title: MolmoAct2
  - local: eo1
    title: EO-1
  - local: groot
-    title: NVIDIA GR00T
+    title: NVIDIA GR00T N1.5
  - local: xvla
    title: X-VLA
  - local: multi_task_dit
@@ -75,10 +73,6 @@
 - sections:
  - local: sarm
    title: SARM
-  - local: robometer
-    title: ROBOMETER
-  - local: topreward
-    title: TOPReward
  title: "Reward Models"
 - sections:
  - local: inference
--- a/docs/source/act.mdx
+++ b/docs/source/act.mdx
@@ -79,13 +79,17 @@ If your local computer doesn't have a powerful GPU, you can utilize Google Colab
 Once training is complete, you can evaluate your ACT policy using the `lerobot-record` command with your trained policy. This will run inference and record evaluation episodes:

 ```bash
-lerobot-rollout \
-  --strategy.type=base \
-  --policy.path=${HF_USER}/act_policy \
-  --robot.type=so101_follower \
+lerobot-record \
+  --robot.type=so100_follower \
  --robot.port=/dev/ttyACM0 \
+  --robot.id=my_robot \
  --robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
  --display_data=true \
-  --task="Your task description" \ # can be skipped for ACT
-  --duration=60
+  --dataset.repo_id=${HF_USER}/eval_act_your_dataset \
+  --dataset.num_episodes=10 \
+  --dataset.single_task="Your task description" \
+  --dataset.streaming_encoding=true \
+  --dataset.encoder_threads=2 \
+  --policy.path=${HF_USER}/act_policy
+  # Optional: add --dataset.camera_encoder.vcodec=auto to the command above
 ```
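
Review note: long backslash-continued commands like the one in this hunk are easy to break when optional flags are toggled by hand. A minimal sketch, assuming you launch the evaluation from Python, that assembles the same flags as an argv list. The helper name `build_record_command` and its parameters are hypothetical; the flag names are copied from the hunk above and are not checked against the CLI here.

```python
import shlex


def build_record_command(hf_user, port="/dev/ttyACM0", num_episodes=10, vcodec=None):
    """Assemble the lerobot-record evaluation command as an argv list.

    Hypothetical helper: flag names are taken verbatim from the docs hunk above.
    """
    cameras = "{ front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}"
    argv = [
        "lerobot-record",
        "--robot.type=so100_follower",
        f"--robot.port={port}",
        "--robot.id=my_robot",
        f"--robot.cameras={cameras}",
        "--display_data=true",
        f"--dataset.repo_id={hf_user}/eval_act_your_dataset",
        f"--dataset.num_episodes={num_episodes}",
        "--dataset.single_task=Your task description",
        "--dataset.streaming_encoding=true",
        "--dataset.encoder_threads=2",
        f"--policy.path={hf_user}/act_policy",
    ]
    # The optional codec flag is appended only when requested, so no
    # commented-out line ever sits inside the command.
    if vcodec is not None:
        argv.append(f"--dataset.camera_encoder.vcodec={vcodec}")
    return argv


if __name__ == "__main__":
    # Print a shell-quoted version that can be pasted into a terminal.
    print(shlex.join(build_record_command("my_hf_user")))
```

The list can be passed to `subprocess.run` directly, which avoids shell quoting issues with the camera dictionary.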
--- a/docs/source/groot.mdx
+++ b/docs/source/groot.mdx
@@ -1,16 +1,16 @@
-# GR00T Policy
+# GR00T N1.5 Policy

-GR00T is an NVIDIA foundation model family for generalized humanoid robot reasoning and skills. It is a cross-embodiment policy that accepts multimodal input, including language, images, and proprioception, to perform manipulation tasks in diverse environments.
+GR00T N1.5 is an open foundation model from NVIDIA designed for generalized humanoid robot reasoning and skills. It is a cross-embodiment model that accepts multimodal input, including language and images, to perform manipulation tasks in diverse environments.

-LeRobot integrates GR00T through the `groot` policy type. The default model family is GR00T N1.5, and GR00T N1.7 can be selected with `policy.model_version=n1.7`.
+This document outlines the specifics of its integration and usage within the LeRobot framework.

 ## Model Overview

-NVIDIA Isaac GR00T N1.5 is an upgraded version of the GR00T N1 foundation model. GR00T N1.7 extends the family with a Cosmos-Reason2/Qwen3-VL backbone and N1.7 checkpoints for SimplerEnv, DROID, and LIBERO.
+NVIDIA Isaac GR00T N1.5 is an upgraded version of the GR00T N1 foundation model. It is built to improve generalization and language-following abilities for humanoid robots.

-Developers and researchers can post-train GR00T with their own real or synthetic data to adapt it for specific humanoid robots or tasks.
+Developers and researchers can post-train GR00T N1.5 with their own real or synthetic data to adapt it for specific humanoid robots or tasks.

-GR00T uses pre-trained vision and language encoders with a flow matching action transformer to model a chunk of actions conditioned on vision, language, and proprioception.
+GR00T N1.5 (specifically the GR00T-N1.5-3B model) is built using pre-trained vision and language encoders. It utilizes a flow matching action transformer to model a chunk of actions, conditioned on vision, language, and proprioception.

 <img
  src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/lerobot-groot-paper1%20(1).png"
@@ -28,35 +28,33 @@ This approach allows the model to be highly adaptable through post-training for

 ## Installation Requirements

-Install LeRobot with the GR00T extra:
+As of today, GR00T N1.5 requires Flash Attention for its internal workings.
+
+We are working on making this optional. In the meantime, GR00T N1.5 requires an extra installation step and can only be used on CUDA-enabled devices.
+
+1. Follow the Environment Setup section of our [Installation Guide](./installation). **Attention:** don't install `lerobot` in this step.
+2. Install [Flash Attention](https://github.com/Dao-AILab/flash-attention) by running:

 ```bash
-pip install "lerobot[groot]"
+# Check https://pytorch.org/get-started/locally/ for your system
+pip install "torch>=2.2.1,<2.8.0" "torchvision>=0.21.0,<0.23.0" # --index-url https://download.pytorch.org/whl/cu1XX
+pip install ninja "packaging>=24.2,<26.0" # flash attention dependencies
+pip install "flash-attn>=2.5.9,<3.0.0" --no-build-isolation
+python -c "import flash_attn; print(f'Flash Attention {flash_attn.__version__} imported successfully')"
 ```

-GR00T is intended for NVIDIA GPU-accelerated systems. The `groot` extra installs the policy dependencies, including `transformers`, `diffusers`, `peft`, `dm-tree`, and Flash Attention where available. If Flash Attention is unavailable or incompatible, LeRobot falls back to SDPA attention in supported GR00T paths, with lower expected throughput.
-
-For a source checkout, follow the Environment Setup in the [Installation Guide](./installation), then install the extra:
+3. Install LeRobot by running:

 ```bash
-uv sync --locked --extra groot
+pip install "lerobot[groot]"
 ```

-If you need to install Flash Attention manually for your CUDA/PyTorch build, use the wheel or source build recommended by the [Flash Attention project](https://github.com/Dao-AILab/flash-attention).
-
 ## Usage

-To use GR00T N1.5 in your LeRobot configuration, specify the policy type:
+To use GR00T in your LeRobot configuration, specify the policy type as:

-```bash
--policy.type=groot
-```
-
-To use GR00T N1.7:
-
-```bash
--policy.type=groot \
--policy.model_version=n1.7
+```python
+policy.type=groot
 ```

 ## Training
@@ -87,20 +85,14 @@ accelerate launch \
  --job_name=$JOB_NAME
 ```

-For N1.7, add:
-
-```bash
--policy.model_version=n1.7
-```
-
 ## Performance Results

-### LIBERO Benchmark Results
+### Libero Benchmark Results

 > [!NOTE]
-> Follow the [LIBERO](./libero) setup instructions before running `lerobot-eval`.
+> Follow our instructions for Libero usage: [Libero](./libero)

-GR00T has demonstrated strong performance on the LIBERO benchmark suite. To compare and test its LeRobot implementation, we finetuned the GR00T N1.5 model for 30k steps on the LIBERO dataset and compared the results to the GR00T reference results.
+GR00T has demonstrated strong performance on the Libero benchmark suite. To compare and test its LeRobot implementation, we finetuned the GR00T N1.5 model for 30k steps on the Libero dataset and compared the results to the GR00T reference results.

 | Benchmark          | LeRobot Implementation | GR00T Reference |
 | ------------------ | ---------------------- | --------------- |
@@ -109,58 +101,14 @@ GR00T has demonstrated strong performance on the LIBERO benchmark suite. To comp
 | **Libero Long**    | 82.0%                  | 76.0%           |
 | **Average**        | 87.0%                  | 87.0%           |

-These results demonstrate GR00T's strong generalization capabilities across diverse robotic manipulation tasks. To reproduce these results, follow the instructions in the [LIBERO](./libero) section.
-
-### GR00T N1.7 LIBERO Checkpoints
-
-NVIDIA publishes GR00T N1.7 LIBERO checkpoints at [`nvidia/GR00T-N1.7-LIBERO`](https://huggingface.co/nvidia/GR00T-N1.7-LIBERO), with one subdirectory per LIBERO suite:
-
-| Suite          | Checkpoint subdirectory |
-| -------------- | ----------------------- |
-| LIBERO Spatial | `libero_spatial`        |
-| LIBERO Object  | `libero_object`         |
-| LIBERO Goal    | `libero_goal`           |
-| LIBERO 10      | `libero_10`             |
-
-Preliminary LeRobot integration results:
-
-| Suite          | Status | Success rate | n_episodes |
-| -------------- | ------ | -----------: | ---------: |
-| LIBERO Spatial | ✓      |         ~95% |         XX |
-| LIBERO Object  | ✓      |          XX% |         XX |
-| LIBERO Goal    | ✓      |          XX% |         XX |
-| LIBERO 10      | ✓      |          XX% |         XX |
-| **Average**    | ✓      |      **XX%** |     **XX** |
-
-Replace the `XX` placeholders with final eval artifacts before merge.
-
-Download the suite checkpoint locally, then point `--policy.base_model_path` at the downloaded subdirectory. `--policy.path` is reserved for LeRobot checkpoints that contain a LeRobot `config.json` with a `type` field.
-
-```bash
-huggingface-cli download nvidia/GR00T-N1.7-LIBERO \
-  --include "libero_spatial/*" \
-  --local-dir ./GR00T-N1.7-LIBERO
-
-lerobot-eval \
-  --policy.type=groot \
-  --policy.model_version=n1.7 \
-  --policy.base_model_path=./GR00T-N1.7-LIBERO/libero_spatial \
-  --policy.embodiment_tag=libero_sim \
-  --env.type=libero \
-  --env.task=libero_spatial \
-  --eval.n_episodes=50
-```
-
-Use `eval.n_episodes >= 50` per suite when reporting success rates.
+These results demonstrate GR00T's strong generalization capabilities across diverse robotic manipulation tasks. To reproduce these results, you can follow the instructions in the [Libero](https://huggingface.co/docs/lerobot/libero) section.

 ### Evaluate in your hardware setup

-Once you have trained your model using your parameters you can run inference in your downstream task. Follow the instructions in [Policy Deployment (lerobot-rollout)](./inference). For example:
+Once you have trained your model using your parameters you can run inference in your downstream task. Follow the instructions in [Imitation Learning for Robots](./il_robots). For example:

 ```bash
-lerobot-rollout\
-  --strategy.type=sentry \
-  --strategy.upload_every_n_episodes=5 \
+lerobot-record \
  --robot.type=bi_so_follower \
  --robot.left_arm_port=/dev/ttyACM1 \
  --robot.right_arm_port=/dev/ttyACM0 \
@@ -171,14 +119,16 @@ lerobot-rollout\
  }' \
  --display_data=true \
  --dataset.repo_id=<user>/eval_groot-bimanual  \
+  --dataset.num_episodes=10 \
  --dataset.single_task="Grab and handover the red cube to the other arm" \
  --dataset.streaming_encoding=true \
  --dataset.encoder_threads=2 \
  # --dataset.camera_encoder.vcodec=auto \
  --policy.path=<user>/groot-bimanual \ # your trained model
-  --duration=600
+  --dataset.episode_time_s=30 \
+  --dataset.reset_time_s=10
 ```

 ## License

-GR00T N1.5 follows NVIDIA's license terms, consistent with the original [GR00T repository](https://github.com/NVIDIA/Isaac-GR00T). GR00T N1.7 is released under the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).
+This model follows NVIDIA's proprietary license, consistent with the original [GR00T repository](https://github.com/NVIDIA/Isaac-GR00T). Future versions (starting from N1.7) will follow the **Apache 2.0 License**.
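
Review note: the installation steps added in this file depend on PyTorch, Flash Attention, and a CUDA device. A minimal sketch, assuming a plain Python environment, of a pre-flight check that reports which of these are present before training starts. The function name `groot_prerequisites` is hypothetical and is not part of LeRobot.

```python
import importlib.util


def groot_prerequisites():
    """Report whether the packages named in the install steps are importable.

    Hypothetical helper: it only inspects the local environment and does not
    validate version ranges.
    """
    status = {}
    for module in ("torch", "flash_attn"):
        # find_spec returns None when a top-level module is not installed.
        status[module] = importlib.util.find_spec(module) is not None

    cuda_available = False
    if status["torch"]:
        try:
            import torch

            cuda_available = bool(torch.cuda.is_available())
        except ImportError:
            # torch is installed but cannot be imported in this environment.
            status["torch"] = False
    status["cuda"] = cuda_available
    return status


if __name__ == "__main__":
    for name, ok in groot_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```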
--- a/docs/source/il_robots.mdx
+++ b/docs/source/il_robots.mdx
@@ -68,13 +68,13 @@ from lerobot.teleoperators.so_leader import SO101Leader, SO101LeaderConfig
 from lerobot.robots.so_follower import SO101Follower, SO101FollowerConfig

 robot_config = SO101FollowerConfig(
-    port="/dev/tty.usbmodem5AB90687491",
-    id="my_follower_arm",
+    port="/dev/tty.usbmodem58760431541",
+    id="my_red_robot_arm",
 )

 teleop_config = SO101LeaderConfig(
-    port="/dev/tty.usbmodem5AB90689011",
-    id="my_leader_arm",
+    port="/dev/tty.usbmodem58760431551",
+    id="my_blue_leader_arm",
 )

 robot = SO101Follower(robot_config)
@@ -108,13 +108,13 @@ With `rerun`, you can teleoperate again while simultaneously visualizing the cam
 <hfoption id="Command">
 ```bash
 lerobot-teleoperate \
-    --robot.type=so101_follower \
-    --robot.port=/dev/tty.usbmodem5AB90687491 \
-    --robot.id=my_follower_arm \
-    --robot.cameras="{front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
-    --teleop.type=so101_leader \
-    --teleop.port=/dev/tty.usbmodem5AB90689011 \
-    --teleop.id=my_leader_arm \
+    --robot.type=koch_follower \
+    --robot.port=/dev/tty.usbmodem58760431541 \
+    --robot.id=my_awesome_follower_arm \
+    --robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 1920, height: 1080, fps: 30}}" \
+    --teleop.type=koch_leader \
+    --teleop.port=/dev/tty.usbmodem58760431551 \
+    --teleop.id=my_awesome_leader_arm \
    --display_data=true
 ```
 </hfoption>
@@ -122,48 +122,34 @@ lerobot-teleoperate \

 <!-- prettier-ignore-start -->
 ```python
-import time
-from lerobot.teleoperators.so_leader import SO101Leader, SO101LeaderConfig
-from lerobot.robots.so_follower import SO101Follower, SO101FollowerConfig
 from lerobot.cameras.opencv import OpenCVCameraConfig
-from lerobot.utils.visualization_utils import init_rerun, log_rerun_data, shutdown_rerun
+from lerobot.teleoperators.koch_leader import KochLeader, KochLeaderConfig
+from lerobot.robots.koch_follower import KochFollower, KochFollowerConfig

-robot_config = SO101FollowerConfig(
-    port="/dev/tty.usbmodem5AB90687491",
-    id="my_follower_arm",
-    cameras={
-        "wrist": OpenCVCameraConfig(index_or_path=0, width=640, height=480, fps=30),
-        "top": OpenCVCameraConfig(index_or_path=1, width=640, height=480, fps=30)
-    }
+camera_config = {
+    "front": OpenCVCameraConfig(index_or_path=0, width=1920, height=1080, fps=30)
+}
+
+robot_config = KochFollowerConfig(
+    port="/dev/tty.usbmodem585A0076841",
+    id="my_red_robot_arm",
+    cameras=camera_config
 )

-teleop_config = SO101LeaderConfig(
-    port="/dev/tty.usbmodem5AB90689011",
-    id="my_leader_arm",
+teleop_config = KochLeaderConfig(
+    port="/dev/tty.usbmodem58760431551",
+    id="my_blue_leader_arm",
 )

-init_rerun(session_name="teleoperation")
-
-robot = SO101Follower(robot_config)
-teleop_device = SO101Leader(teleop_config)
+robot = KochFollower(robot_config)
+teleop_device = KochLeader(teleop_config)
 robot.connect()
 teleop_device.connect()

-TARGET_HZ = 30
-TIME_PER_FRAME = 1.0 / TARGET_HZ
-
 while True:
-    start_time = time.perf_counter()
-
    observation = robot.get_observation()
    action = teleop_device.get_action()
    robot.send_action(action)
-    log_rerun_data(observation=observation, action=action)
-
-    elapsed_time = time.perf_counter() - start_time
-    sleep_time = TIME_PER_FRAME - elapsed_time
-    if sleep_time > 0:
-        time.sleep(sleep_time)
 ```
 <!-- prettier-ignore-end -->

@@ -216,11 +202,10 @@ lerobot-record \
 <!-- prettier-ignore-start -->
 ```python
 from lerobot.cameras.opencv import OpenCVCameraConfig
-from lerobot.datasets.lerobot_dataset import LeRobotDataset
+from lerobot.datasets import LeRobotDataset
 from lerobot.utils.feature_utils import hw_to_dataset_features
-from lerobot.robots.so_follower import SO101Follower, SO101FollowerConfig
-from lerobot.teleoperators.so_leader.config_so_leader import SO101LeaderConfig
-from lerobot.teleoperators.so_leader.so_leader import SO101Leader
+from lerobot.robots.so_follower import SO100Follower, SO100FollowerConfig
+from lerobot.teleoperators.so_leader import SO100Leader, SO100LeaderConfig
 from lerobot.common.control_utils import init_keyboard_listener
 from lerobot.utils.utils import log_say
 from lerobot.utils.visualization_utils import init_rerun
@@ -233,56 +218,71 @@ EPISODE_TIME_SEC = 60
 RESET_TIME_SEC = 10
 TASK_DESCRIPTION = "My task description"

-def main():
-    # Create robot configuration
-    robot_config = SO101FollowerConfig(
-        port="/dev/tty.usbmodem5AB90687491",
-        id="my_follower_arm",
-        cameras={
-            "wrist": OpenCVCameraConfig(index_or_path=0, width=640, height=480, fps=30),
-            "top": OpenCVCameraConfig(index_or_path=1, width=640, height=480, fps=30)
-        }
-    )
+# Create robot configuration
+robot_config = SO100FollowerConfig(
+    id="my_awesome_follower_arm",
+    cameras={
+        "front": OpenCVCameraConfig(index_or_path=0, width=640, height=480, fps=FPS) # Optional: fourcc="MJPG" for troubleshooting OpenCV async error.
+    },
+    port="/dev/tty.usbmodem58760434471",
+)

-    teleop_config = SO101LeaderConfig(
-        port="/dev/tty.usbmodem5AB90689011",
-        id="my_leader_arm",
-    )
+teleop_config = SO100LeaderConfig(
+    id="my_awesome_leader_arm",
+    port="/dev/tty.usbmodem585A0077581",
+)

-    # Initialize the robot and teleoperator
-    robot = SO101Follower(robot_config)
-    teleop = SO101Leader(teleop_config)
+# Initialize the robot and teleoperator
+robot = SO100Follower(robot_config)
+teleop = SO100Leader(teleop_config)

-    # Configure the dataset features
-    action_features = hw_to_dataset_features(robot.action_features, "action")
-    obs_features = hw_to_dataset_features(robot.observation_features, "observation")
-    dataset_features = {**action_features, **obs_features}
+# Configure the dataset features
+action_features = hw_to_dataset_features(robot.action_features, "action")
+obs_features = hw_to_dataset_features(robot.observation_features, "observation")
+dataset_features = {**action_features, **obs_features}

-    # Create the dataset
-    dataset = LeRobotDataset.create(
-        repo_id="<hf_username>/<dataset_repo_id>",
+# Create the dataset
+dataset = LeRobotDataset.create(
+    repo_id="<hf_username>/<dataset_repo_id>",
+    fps=FPS,
+    features=dataset_features,
+    robot_type=robot.name,
+    use_videos=True,
+    image_writer_threads=4,
+)
+
+# Initialize the keyboard listener and rerun visualization
+_, events = init_keyboard_listener()
+init_rerun(session_name="recording")
+
+# Connect the robot and teleoperator
+robot.connect()
+teleop.connect()
+
+# Create the required processors
+teleop_action_processor, robot_action_processor, robot_observation_processor = make_default_processors()
+
+episode_idx = 0
+while episode_idx < NUM_EPISODES and not events["stop_recording"]:
+    log_say(f"Recording episode {episode_idx + 1} of {NUM_EPISODES}")
+
+    record_loop(
+        robot=robot,
+        events=events,
        fps=FPS,
-        features=dataset_features,
-        robot_type=robot.name,
-        use_videos=True,
-        image_writer_threads=4,
+        teleop_action_processor=teleop_action_processor,
+        robot_action_processor=robot_action_processor,
+        robot_observation_processor=robot_observation_processor,
+        teleop=teleop,
+        dataset=dataset,
+        control_time_s=EPISODE_TIME_SEC,
+        single_task=TASK_DESCRIPTION,
+        display_data=True,
    )

-    # Initialize the keyboard listener and rerun visualization
-    _, events = init_keyboard_listener()
-    init_rerun(session_name="recording")
-
-    # Connect the robot and teleoperator
-    robot.connect()
-    teleop.connect()
-
-    # Create the required processors
-    teleop_action_processor, robot_action_processor, robot_observation_processor = make_default_processors()
-
-    episode_idx = 0
-    while episode_idx < NUM_EPISODES and not events["stop_recording"]:
-        log_say(f"Recording episode {episode_idx + 1} of {NUM_EPISODES}")
-
+    # Reset the environment if not stopping or re-recording
+    if not events["stop_recording"] and (episode_idx < NUM_EPISODES - 1 or events["rerecord_episode"]):
+        log_say("Reset the environment")
        record_loop(
            robot=robot,
            events=events,
@@ -291,50 +291,26 @@ def main():
            robot_action_processor=robot_action_processor,
            robot_observation_processor=robot_observation_processor,
            teleop=teleop,
-            dataset=dataset,
-            control_time_s=EPISODE_TIME_SEC,
+            control_time_s=RESET_TIME_SEC,
            single_task=TASK_DESCRIPTION,
            display_data=True,
        )

-        # Reset the environment if not stopping or re-recording
-        if not events["stop_recording"] and (episode_idx < NUM_EPISODES - 1 or events["rerecord_episode"]):
-            log_say("Reset the environment")
-            record_loop(
-                robot=robot,
-                events=events,
-                fps=FPS,
-                teleop_action_processor=teleop_action_processor,
-                robot_action_processor=robot_action_processor,
-                robot_observation_processor=robot_observation_processor,
-                teleop=teleop,
-                control_time_s=RESET_TIME_SEC,
-                single_task=TASK_DESCRIPTION,
-                display_data=True,
-            )
+    if events["rerecord_episode"]:
+        log_say("Re-recording episode")
+        events["rerecord_episode"] = False
+        events["exit_early"] = False
+        dataset.clear_episode_buffer()
+        continue

-        if events["rerecord_episode"]:
-            log_say("Re-recording episode")
-            events["rerecord_episode"] = False
-            events["exit_early"] = False
-            dataset.clear_episode_buffer()
-            continue
+    dataset.save_episode()
+    episode_idx += 1

-        dataset.save_episode()
-        episode_idx += 1
-
-    # finalize dataset
-    log_say("Finalizing dataset...")
-    dataset.finalize()
-    # Clean up
-    log_say("Stop recording")
-    robot.disconnect()
-    teleop.disconnect()
-    dataset.push_to_hub()
-
-
-if __name__ == "__main__":
-    main()
+# Clean up
+log_say("Stop recording")
+robot.disconnect()
+teleop.disconnect()
+dataset.push_to_hub()
 ```
 <!-- prettier-ignore-end -->

@@ -372,7 +348,7 @@ The `record` function provides a suite of tools for capturing and managing data
 ##### 2. Checkpointing and Resuming

 - Checkpoints are automatically created during recording.
- If an issue occurs or you want to record additional episodes in the same dataset, you can resume by re-running the same command with `--resume=true`. When resuming a recording, `--dataset.num_episodes` must be set to the **number of additional episodes to be recorded**, and not to the targeted total number of episodes in the dataset! Make sure that you also set `--dataset.root="local_path"`, it's a local path to save the new part of the dataset and is required to resume.
+- If an issue occurs, you can resume by re-running the same command with `--resume=true`. When resuming a recording, `--dataset.num_episodes` must be set to the **number of additional episodes to be recorded**, not to the targeted total number of episodes in the dataset!
 - To start recording from scratch, **manually delete** the dataset directory.

 ##### 3. Recording Parameters
@@ -446,7 +422,7 @@ from lerobot.utils.utils import log_say

 episode_idx = 0

-robot_config = SO100FollowerConfig(port="/dev/tty.usbmodem5AB90687491", id="my_follower_arm")
+robot_config = SO100FollowerConfig(port="/dev/tty.usbmodem58760434471", id="my_awesome_follower_arm")

 robot = SO100Follower(robot_config)
 robot.connect()
@@ -514,83 +490,6 @@ Additionally you can provide extra `tags` or specify a `license` for your model

 If your local computer doesn't have a powerful GPU you could utilize Google Colab to train your model by following the [ACT training notebook](./notebooks#training-act).

-#### Train using Hugging Face Jobs
-
-Hugging Face jobs let's you easily select hardware and run the training in the cloud. So if you don't have a powerful GPU or you need more VRAM or just want to train a model much faster use HF Jobs! It's pay as you go and you simply pay for each second of use, you can see the pricing and additional information [here](https://huggingface.co/docs/hub/jobs).
-
-To run the training use this command:
-
-<hfoptions id="train_with_hf_jobs">
-<hfoption id="Command">
-```bash
-hf jobs run \
-  --flavor a10g-small \
-  --timeout 4h \
-  --secrets HF_TOKEN \
-  huggingface/lerobot-gpu:latest \
-  -- \
-  python -m lerobot.scripts.lerobot_train \
-    --dataset.repo_id=username/dataset \
-    --policy.type=act \
-    --steps=5000 \
-    --batch_size=16 \
-    --policy.device=cuda \
-    --policy.repo_id=username/your_policy \
-    --log_freq=100
-```
-</hfoption>
-<hfoption id="API example">
-
-<!-- prettier-ignore-start -->
-```python
-from huggingface_hub import run_job, get_token
-
-run_name = "act_so101_hf_jobs"
-dataset_id = "username/dataset"
-user_hub_id = "username"
-
-command_args = [
-    "python", "-m", "lerobot.scripts.lerobot_train",
-    "--dataset.repo_id", dataset_id,
-    "--policy.type", "act",
-    "--steps", "5000",
-    "--batch_size", "16",
-    "--num_workers", "4",
-    "--policy.device", "cuda",
-    "--log_freq", "100",
-    "--save_freq", "1000",
-    "--save_checkpoint", "true",
-    "--wandb.enable", "false",
-    "--policy.repo_id", f"{user_hub_id}/{run_name}"
-]
-
-print(f"Submitting job '{run_name}' to Hugging Face Infrastructure...")
-
-job_info = run_job(
-    image="huggingface/lerobot-gpu:latest",
-    command=command_args,
-    flavor="a10g-small",
-    timeout="4h",
-    secrets={"HF_TOKEN": get_token()}
-)
-
-print("\n🚀 Job successfully launched!")
-print(f"🔹 Job ID: {job_info.id}")
-print(f"🔗 Live UI Dashboard & Logs: {job_info.url}")
-```
-<!-- prettier-ignore-end -->
-
-</hfoption>
-</hfoptions>
-
-You can modify the `--flavor` to use different hardware, for example: `t4-small`, `a100-large`, `h200`. Use `hf jobs hardware` to see the full list with pricing.
-Depending on the model you want to train and the hardware you selected you can also modify the `--batch_size` and `--number_of_workers`.
-For longer training sessions increase the timeout.
-
-Once the training is started you can go to [Jobs](https://huggingface.co/settings/jobs) and see if your jobs is running as well as all the outputs. Sometimes it takes a few minutes to schedule your job so be patient.
-
-After training the model will be pushed to hub and you can use it as any other model with LeRobot.
-
 #### Upload policy checkpoints

 Once training is done, upload the latest checkpoint with:
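
Review note: the "Checkpointing and Resuming" hunk in this file states that `--dataset.num_episodes` counts additional episodes when resuming, not the dataset total. A minimal sketch of that arithmetic, assuming you know how many episodes are already saved. The helper name `additional_episodes` is hypothetical.

```python
def additional_episodes(target_total, already_recorded):
    """Return the value to pass as --dataset.num_episodes when resuming.

    Hypothetical helper: the flag expects the number of episodes still to
    record, so the already-saved episodes are subtracted from the target.
    """
    if target_total < 0 or already_recorded < 0:
        raise ValueError("episode counts must be non-negative")
    # Never return a negative count if the target was already reached.
    return max(target_total - already_recorded, 0)


if __name__ == "__main__":
    remaining = additional_episodes(target_total=50, already_recorded=30)
    print(f"--resume=true --dataset.num_episodes={remaining}")
```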
--- a/docs/source/molmoact2.mdx
+++ b/docs/source/molmoact2.mdx
@@ -1,433 +0,0 @@
-# MolmoAct2 Policy
-
-MolmoAct2 is the LeRobot policy implementation of
-[MolmoAct2](https://allenai.org/blog/molmoact2), ported into the LeRobot
-training, evaluation, checkpointing, and dataset interfaces for easier use with
-LeRobot datasets.
-
-This implementation currently supports training and evaluation for the regular
-MolmoAct2 model. MolmoAct2-Think, which supports adaptive depth reasoning, is
-not included in this LeRobot policy yet and is coming soon.
-
-For the original MolmoAct2 training code used for the experiments reported in
-the paper, see [allenai/molmoact2](https://github.com/allenai/molmoact2).
-
-## Installation Requirements
-
-Install LeRobot with the MolmoAct2 optional dependencies:
-
-```bash
-pip install -e ".[molmoact2]"
-```
-
-To run the models in this repository, you need an NVIDIA GPU. The measurements
-below were taken on a single NVIDIA H100 80GB with bf16 model loading, LIBERO with two RGB cameras. MolmoAct2 rows use `chunk_size=10`, action dim 7
-padded to `expected_max_action_dim=32`, and `num_flow_timesteps=8`. Training measurements use
-`gradient_checkpointing=true` and include the forward pass, backward pass,
-gradient clipping, optimizer step, and optimizer state allocation. Values are
-peak GPU memory sampled with `nvidia-smi`. Leave a few GiB of headroom for
-dataloader workers, CUDA context, and fragmentation.
-
-Multi-GPU training through `accelerate` increases throughput and global batch
-size, but this LeRobot port does not currently expose the original MolmoAct2
-`fsdp_devices` model-parallel training path. The current training script has
-not been tested for multi-node training.
-
-| Mode                                             | Peak Memory, bs=8 | Peak Memory, bs=16 | Peak Memory, bs=32 |
-| ------------------------------------------------ | ----------------: | -----------------: | -----------------: |
-| Inference, continuous, CUDA graph enabled (bs=1) |          12.1 GiB |                  - |                  - |
-| Fine-tuning, action expert only, continuous      |          16.5 GiB |           18.3 GiB |           21.4 GiB |
-| Fine-tuning, LoRA VLM, both action modes         |          20.2 GiB |           26.8 GiB |           41.3 GiB |
-| Fine-tuning, full model, both action modes       |          48.3 GiB |           49.8 GiB |           60.1 GiB |
-
-The repo has been tested with Ubuntu 22.04.
-
-## Usage
-
-To use MolmoAct2 in a LeRobot training config, set:
-
-```python
-policy.type=molmoact2
-```
-
-## Training
-
-MolmoAct2 can be fine-tuned from either the released MolmoAct2 Hugging Face
-checkpoint format or from a checkpoint already saved by LeRobot. Both routes use
-the same LeRobot training loop, dataset transforms, checkpoint saving, and
-logging. The difference is only how the initial policy weights and processor
-state are loaded.
-
-### Training With Original MolmoAct2 Weight
-
-Use `policy.checkpoint_path` when starting from a released MolmoAct2 checkpoint,
-for example `allenai/MolmoAct2` or `allenai/MolmoAct2-LIBERO`. LeRobot will load
-the original HF model files, then build its own policy processor from the
-dataset metadata and the policy options below.
-
-The command below shows full fine-tuning on the merged LIBERO dataset. It uses
-bf16 model loading, 8 flow timesteps, LeRobot dataset statistics, image
-augmentation, and LeRobot's checkpointing/logging path.
-
-```bash
-accelerate launch \
-  --num_processes=8 \
-  --mixed_precision=bf16 \
-  -m lerobot.scripts.lerobot_train \
-  --dataset.repo_id=allenai/MolmoAct2-LIBERO-Dataset \
-  --dataset.root=/path/to/lerobot/data/allenai/MolmoAct2-LIBERO-Dataset \
-  --dataset.video_backend=pyav \
-  --dataset.image_transforms.enable=true \
-  --policy.type=molmoact2 \
-  --policy.checkpoint_path=allenai/MolmoAct2-LIBERO \
-  --policy.device=cuda \
-  --policy.action_mode=both \
-  --policy.chunk_size=10 \
-  --policy.n_action_steps=10 \
-  --policy.setup_type="single franka robotic arm in libero" \
-  --policy.control_mode="delta end-effector pose" \
-  --policy.image_keys='["observation.images.image","observation.images.wrist_image"]' \
-  --policy.model_dtype=bfloat16 \
-  --policy.num_flow_timesteps=8 \
-  --policy.gradient_checkpointing=true \
-  --policy.freeze_embedding=true \
-  --policy.normalize_gripper=false \
-  --policy.enable_knowledge_insulation=false \
-  --policy.push_to_hub=false \
-  --wandb.enable=true \
-  --wandb.entity=<wandb_entity> \
-  --wandb.project=<wandb_project> \
-  --job_name=<job_name> \
-  --output_dir=outputs/<job_name> \
-  --steps=10000 \
-  --batch_size=32 \
-  --num_workers=4 \
-  --log_freq=20 \
-  --eval_freq=-1 \
-  --save_checkpoint=true \
-  --save_freq=2000
-```
-
-### Training With LeRobot MolmoAct2 Weight
-
-Use `policy.path` when starting from a MolmoAct2 checkpoint that was saved by
-LeRobot, either from a local `pretrained_model` directory or from the Hub. This
-restores the saved LeRobot policy config, model weights, processor, and
-normalization statistics. You can still override training-time options such as
-`batch_size`, `steps`, LoRA flags, or `policy.action_mode`.
-
-```bash
-accelerate launch \
-  --num_processes=8 \
-  --mixed_precision=bf16 \
-  -m lerobot.scripts.lerobot_train \
-  --dataset.repo_id=allenai/MolmoAct2-LIBERO-Dataset \
-  --dataset.root=/path/to/lerobot/data/allenai/MolmoAct2-LIBERO-Dataset \
-  --dataset.video_backend=pyav \
-  --dataset.image_transforms.enable=true \
-  --policy.path=/path/to/pretrained_model \
-  --policy.device=cuda \
-  --policy.action_mode=both \
-  --policy.chunk_size=10 \
-  --policy.n_action_steps=10 \
-  --policy.model_dtype=bfloat16 \
-  --policy.num_flow_timesteps=8 \
-  --policy.gradient_checkpointing=true \
-  --wandb.enable=true \
-  --wandb.entity=<wandb_entity> \
-  --wandb.project=<wandb_project> \
-  --job_name=<job_name> \
-  --output_dir=outputs/<job_name> \
-  --steps=10000 \
-  --batch_size=32 \
-  --num_workers=4 \
-  --log_freq=20 \
-  --eval_freq=-1 \
-  --save_checkpoint=true \
-  --save_freq=2000
-```
-
-### Common Practices
-
-For fine-tuning on a comparatively small dataset, such as a single LIBERO suite
-or a real-world dataset with less than 200 demonstrations, a global batch size of
-16 to 32 is a good starting point. In these settings, `policy.enable_lora_vlm=true` or `policy.train_action_expert_only=true` is also a practical choice. In both
-cases, we intentionally keep the action expert fully trainable, which we found
-to be crucial for model performance. For larger fine-tuning datasets, larger
-global batch sizes and full fine-tuning are usually preferred.
-
-### Common Policy Options
-
-- `policy.checkpoint_path`: original MolmoAct2 HF checkpoint to initialize from.
-  Use this for released MolmoAct2 weights.
-- `policy.path`: LeRobot checkpoint to initialize from. Use this for checkpoints
-  created by LeRobot training.
-- `policy.action_mode`: training target, one of `continuous`, `discrete`, or
-  `both`. `both` trains the flow-matching action expert and the discrete
-  action-token loss.
-- `policy.train_action_expert_only`: trains only parameters whose names contain
-  `action_expert`. It requires `policy.action_mode=continuous`.
-- `policy.enable_lora_vlm`: enables LoRA on VLM linear layers. Use
-  `policy.enable_lora_action_expert=true` only if LoRA should also cover action
-  expert linear layers. When `policy.enable_lora_action_expert=false`, the
-  action expert base weights remain fully trainable while the VLM is trained
-  through LoRA adapters. When `policy.enable_lora_action_expert=true`, the
-  action expert is also adapter-tuned instead of fully fine-tuned.
-- `policy.enable_knowledge_insulation`: when `true`, detaches action-expert
-  context K/V states before the action loss. The default is `false`.
-- `policy.chunk_size`: action horizon used by the policy. For LIBERO we use
-  `10`. This LeRobot port overrides the loaded checkpoint's
-  `max_action_horizon` with this value.
-- `policy.n_action_steps`: number of actions consumed from each predicted
-  chunk before querying the policy again. For LIBERO, set it to `chunk_size`.
-- `policy.setup_type`: text inserted into the prompt to describe the robot and
-  scene, e.g. `single franka robotic arm in libero`. More examples are listed
-  in the `metadata_by_tag` entries of
-  [`norm_stats.json`](https://huggingface.co/allenai/MolmoAct2/blob/main/norm_stats.json).
-- `policy.control_mode`: text inserted into the prompt to describe the action
-  space, e.g. `delta end-effector pose` or `absolute joint pose`.
-- `policy.image_keys`: ordered LeRobot image observation keys passed to the
-  processor.
-- `policy.model_dtype`: checkpoint/forward dtype, one of `float32`,
-  `bfloat16`, or `float16`. Use `bfloat16` for normal training.
-- `policy.num_flow_timesteps`: number of flow-matching timesteps sampled per
-  example during training. We use `8` for fine-tuning.
-- `policy.num_inference_steps`: optional override for continuous action
-  generation steps at inference time.
-- `policy.gradient_checkpointing`: enables checkpointing in the VLM/action path
-  to reduce activation memory.
-- `policy.freeze_embedding`: freezes input embeddings. The default is `true`.
-- `policy.normalize_gripper`: controls whether gripper dimensions are included
-  in state/action quantile normalization. The default is `false`.
-- `policy.normalize_language`: normalizes task strings before prompt
-  construction. The default is `true`.
-- `policy.mask_action_dim_padding`: masks padded dimensions in the flow loss.
-  Released checkpoints use `policy.expected_max_action_dim=32`.
-- `policy.max_sequence_length`: optional manual sequence cap. Leave unset to
-  infer it from images, state dimension, action dimension, action horizon, and
-  discrete-action mode.
-
-### Learning Rates
-
-MolmoAct2 uses parameter-group learning rates to match the original MolmoAct2
-fine-tuning experiments.
-
-- Full fine-tuning uses `policy.optimizer_lr=1e-5` for the VLM,
-  `policy.optimizer_vit_lr=5e-6` for the vision tower,
-  `policy.optimizer_connector_lr=5e-6` for image connector layers, and
-  `policy.optimizer_action_expert_lr=5e-5` for the action expert.
-- LoRA VLM fine-tuning sets the VLM, vision, and connector LoRA parameter
-  groups to `5e-5` when `policy.enable_lora_vlm=true`. By default,
-  `policy.enable_lora_action_expert=false`, so the action expert is still fully
-  fine-tuned with `policy.optimizer_action_expert_lr`. If
-  `policy.enable_lora_action_expert=true`, the action expert is trained through
-  LoRA adapters instead.
-- Action-expert-only fine-tuning trains only the action expert and uses
-  `policy.optimizer_action_expert_lr=5e-5`.
-
-You can override the full fine-tuning and action-expert learning rates with
-`policy.optimizer_lr`, `policy.optimizer_vit_lr`,
-`policy.optimizer_connector_lr`, and `policy.optimizer_action_expert_lr`.
-Scheduler settings can be changed with `policy.scheduler_warmup_steps`,
-`policy.scheduler_decay_steps`, and `policy.scheduler_decay_lr`.
-
-### Dataset Quantile Statistics
-
-MolmoAct2 defaults to quantile normalization for state and action features. If
-your dataset has not been converted with quantile statistics, you can add them
-with:
-
-```bash
-python src/lerobot/datasets/v30/augment_dataset_quantile_stats.py \
-  --repo-id=your_dataset
-```
-
-Alternatively, train MolmoAct2 with mean/std normalization:
-
-```bash
---policy.normalization_mapping='{"ACTION": "MEAN_STD", "STATE": "MEAN_STD", "VISUAL": "IDENTITY"}'
-```
-
-## Evaluation
-
-Evaluation also supports both LeRobot-saved checkpoints and original MolmoAct2
-HF checkpoints. For LIBERO replication, keep the EGL rendering environment
-fixed and use `policy.per_episode_seed=true`.
-
-**Important:** We found that `num_steps_wait=10` does not reliably let the
-LIBERO scene stabilize and can degrade measured success. All LIBERO evaluation
-results reported here use `num_steps_wait=50`.
-
-### Evaluation With LeRobot MolmoAct2 Weight
-
-Use `policy.path` for a checkpoint saved by LeRobot. The saved processor and
-normalization statistics are restored together with the model.
-
-```bash
-export MUJOCO_GL=egl
-export PYOPENGL_PLATFORM=egl
-export OMP_NUM_THREADS=1
-export MKL_NUM_THREADS=1
-
-lerobot-eval \
-  --policy.path=allenai/MolmoAct2-LIBERO-LeRobot \
-  --policy.inference_action_mode=continuous \
-  --policy.model_dtype=bfloat16 \
-  --policy.use_amp=true \
-  --policy.enable_inference_cuda_graph=true \
-  --policy.device=cuda \
-  --policy.per_episode_seed=true \
-  --policy.eval_seed=1000 \
-  --env.type=libero \
-  --env.task=libero_10,libero_goal,libero_object,libero_spatial \
-  --env.camera_name_mapping='{"agentview_image":"image","robot0_eye_in_hand_image":"wrist_image"}' \
-  --eval.batch_size=1 \
-  --eval.n_episodes=50 \
-  --seed=1000
-```
-
-### Evaluation With Original MolmoAct2 Weight
-
-You can evaluate a released Hugging Face checkpoint directly without first
-converting it to a LeRobot checkpoint. In this case, set
-`policy.checkpoint_path` to the HF model repo and provide `policy.norm_tag`.
-For LIBERO, `policy.norm_tag=libero` loads the LIBERO action/state
-normalization statistics, action horizon, prompt metadata, and image-key order
-from the checkpoint's `norm_stats.json`.
-
-To fully replicate the MolmoAct2 paper results with released Hugging Face
-checkpoints, we recommend using the v0.5.1-pinned
-[`allenai/lerobot` `molmoact2-hf-inference`](https://github.com/allenai/lerobot/tree/molmoact2-hf-inference)
-branch. That branch matches the original evaluation settings used for the
-reported numbers.
-
-```bash
-export MUJOCO_GL=egl
-export PYOPENGL_PLATFORM=egl
-export OMP_NUM_THREADS=1
-export MKL_NUM_THREADS=1
-
-lerobot-eval \
-  --policy.type=molmoact2 \
-  --policy.checkpoint_path=allenai/MolmoAct2-LIBERO \
-  --policy.norm_tag=libero \
-  --policy.inference_action_mode=continuous \
-  --policy.model_dtype=float32 \
-  --policy.use_amp=false \
-  --policy.enable_inference_cuda_graph=true \
-  --policy.device=cuda \
-  --policy.per_episode_seed=true \
-  --policy.eval_seed=1000 \
-  --env.type=libero \
-  --env.task=libero_goal \
-  --env.camera_name_mapping='{"agentview_image":"image","robot0_eye_in_hand_image":"wrist_image"}' \
-  --eval.batch_size=1 \
-  --eval.n_episodes=50 \
-  --seed=1000
-```
-
-Use `--env.task=libero_10,libero_goal,libero_object,libero_spatial` to run the
-full LIBERO suite. The same command works for other released MolmoAct2
-checkpoints as long as the requested `policy.norm_tag` exists in that
-checkpoint's `norm_stats.json`.
-
-### Common Evaluation Options
-
-- `policy.inference_action_mode`: required for rollout. Use `continuous` for
-  flow-matching inference or `discrete` for action-token inference. It must be
-  compatible with the training-time `policy.action_mode` saved in the
-  checkpoint.
-- `policy.path`: LeRobot checkpoint path or Hub repo. Use this for checkpoints
-  saved by LeRobot.
-- `policy.checkpoint_path`: original MolmoAct2 HF checkpoint path or Hub repo.
-  Use this with `policy.type=molmoact2` and `policy.norm_tag`.
-- `policy.norm_tag`: selects normalization statistics, prompt metadata,
-  image-key order, and action horizon from the original checkpoint's
-  `norm_stats.json`. It is required for direct original-HF checkpoint
-  evaluation.
-- `policy.model_dtype`: model load/forward dtype. Use `bfloat16` for normal
-  GPU evaluation. Use `float32` only when you explicitly want fp32 inference.
-- `policy.use_amp`: runs the policy forward under autocast during eval. For
-  `model_dtype=bfloat16`, keep this enabled.
-- `policy.enable_inference_cuda_graph`: enables the MolmoAct2 inference CUDA
-  graph path for faster repeated continuous-action rollout.
-- `policy.per_episode_seed` and `policy.eval_seed`: make stochastic continuous
-  action generation deterministic per episode for replication.
-- `env.task`: comma-separated LIBERO suites or a single suite. Use
-  `libero_10,libero_goal,libero_object,libero_spatial` for the full benchmark.
-- `env.camera_name_mapping`: maps LIBERO camera names to the image keys expected
-  by the policy processor.
-
-## Performance Results
-
-### LIBERO Benchmark Results
-
-MolmoAct2 has demonstrated strong performance on the LIBERO benchmark suite. To
-compare and test its LeRobot implementation, we fine-tuned
-[`allenai/MolmoAct2-LIBERO`](https://huggingface.co/allenai/MolmoAct2-LIBERO)
-for an additional 10k steps on the LIBERO dataset with per-GPU batch size 32 on
-8 H100 GPUs, then compared the results to the original MolmoAct2 reference
-results.
-
-The LeRobot fine-tuned checkpoint reported here is available at
-[`allenai/MolmoAct2-LIBERO-LeRobot`](https://huggingface.co/allenai/MolmoAct2-LIBERO-LeRobot)
-and was trained on
-[`allenai/MolmoAct2-LIBERO-Dataset`](https://huggingface.co/datasets/allenai/MolmoAct2-LIBERO-Dataset).
-
-| Benchmark      | LeRobot Implementation | MolmoAct2 Original |
-| -------------- | ---------------------: | -----------------: |
-| LIBERO Spatial |                  98.4% |              97.8% |
-| LIBERO Object  |                 100.0% |             100.0% |
-| LIBERO Goal    |                  98.0% |              97.8% |
-| LIBERO 10      |                  96.6% |              93.2% |
-| Average        |                 98.25% |             97.20% |
-
-These results demonstrate MolmoAct2's strong performance across diverse robotic
-manipulation tasks. To reproduce them, follow the instructions in the LIBERO
-evaluation section.
-
-## Differences From the Original Implementation
-
-This LeRobot port is intended to match MolmoAct2 behavior while using LeRobot's
-dataset, training, evaluation, checkpoint, and logging infrastructure. The main
-differences from the original training repository are:
-
-- The original paper training stack loads the model in fp32 and trains under
-  mixed precision. This LeRobot port usually loads the checkpoint directly in
-  `policy.model_dtype=bfloat16` for lower memory use.
-- The original repository uses its own FSDP/model-parallel training path. The
-  LeRobot port uses the standard LeRobot/Accelerate training path and has not
-  been tested for multi-node training.
-- The original repository supports sequence packing. The LeRobot port trains on
-  one LeRobot sample per item and pads to an inferred fixed sequence budget.
-- The LeRobot port follows LeRobot's optimizer, scheduler, checkpoint saving,
-  dataset transforms, image augmentation, and Weights & Biases logging
-  conventions.
-- The original training path supports mixed action horizons by padding to
-  `max_action_horizon` and masking padded horizon slots in the action expert
-  self-attention. This is useful when training across datasets with different
-  control frequencies. The LeRobot port currently targets single-dataset
-  fine-tuning, so `policy.chunk_size` overrides the checkpoint
-  `max_action_horizon` and horizon masking is not implemented yet. Support for
-  this mixed-horizon path is planned.
-
-## Citation
-
-```bibtex
-@misc{fang2026molmoact2actionreasoningmodels,
-      title={MolmoAct2: Action Reasoning Models for Real-world Deployment},
-      author={Haoquan Fang and Jiafei Duan and Donovan Clay and Sam Wang and Shuo Liu and Weikai Huang and Xiang Fan and Wei-Chuan Tsai and Shirui Chen and Yi Ru Wang and Shanli Xing and Jaemin Cho and Jae Sung Park and Ainaz Eftekhar and Peter Sushko and Karen Farley and Angad Wadhwa and Cole Harrison and Winson Han and Ying-Chun Lee and Eli VanderBilt and Rose Hendrix and Suveen Ellawela and Lucas Ngoo and Joyce Chai and Zhongzheng Ren and Ali Farhadi and Dieter Fox and Ranjay Krishna},
-      year={2026},
-      eprint={2605.02881},
-      archivePrefix={arXiv},
-      primaryClass={cs.RO},
-      url={https://arxiv.org/abs/2605.02881},
-}
-```
-
-## License
-
-This model is licensed under Apache 2.0. It is intended for research and
-educational use in accordance with
-[Ai2's Responsible Use Guidelines](https://allenai.org/responsible-use),
-consistent with [allenai/molmoact2](https://github.com/allenai/molmoact2).
--- a/docs/source/policy_groot_README.md
+++ b/docs/source/policy_groot_README.md
@@ -24,8 +24,4 @@ Code: https://github.com/NVIDIA/Isaac-GR00T

 Blog: https://developer.nvidia.com/isaac/gr00t

-Hugging Face Models:
-
-- GR00T N1.5: https://huggingface.co/nvidia/GR00T-N1.5-3B
-- GR00T N1.7: https://huggingface.co/nvidia/GR00T-N1.7-3B
-- GR00T N1.7 LIBERO checkpoints: https://huggingface.co/nvidia/GR00T-N1.7-LIBERO
+Hugging Face Model: https://huggingface.co/nvidia/GR00T-N1.5-3B
--- a/docs/source/policy_molmoact2_README.md
+++ b/docs/source/policy_molmoact2_README.md
@@ -1,39 +0,0 @@
-# MolmoAct2
-
-This repository contains the LeRobot policy implementation of
-[MolmoAct2](https://allenai.org/blog/molmoact2), ported into LeRobot for
-training, evaluation, checkpointing, and dataset compatibility.
-
-This implementation currently supports training and evaluation for the regular
-MolmoAct2 model. MolmoAct2-Think, which supports adaptive depth reasoning, is
-not included in this LeRobot policy yet and is coming soon.
-
-For the original MolmoAct2 training code used for the experiments reported in
-the paper, see [allenai/molmoact2](https://github.com/allenai/molmoact2).
-
-## LIBERO Evaluation
-
-Important: we found that `num_steps_wait=10` does not reliably let the LIBERO
-scene stabilize and can degrade measured success. All LIBERO evaluation results
-reported for this LeRobot implementation use `num_steps_wait=50`.
-
-## Citation
-
-```bibtex
-@misc{fang2026molmoact2actionreasoningmodels,
-      title={MolmoAct2: Action Reasoning Models for Real-world Deployment},
-      author={Haoquan Fang and Jiafei Duan and Donovan Clay and Sam Wang and Shuo Liu and Weikai Huang and Xiang Fan and Wei-Chuan Tsai and Shirui Chen and Yi Ru Wang and Shanli Xing and Jaemin Cho and Jae Sung Park and Ainaz Eftekhar and Peter Sushko and Karen Farley and Angad Wadhwa and Cole Harrison and Winson Han and Ying-Chun Lee and Eli VanderBilt and Rose Hendrix and Suveen Ellawela and Lucas Ngoo and Joyce Chai and Zhongzheng Ren and Ali Farhadi and Dieter Fox and Ranjay Krishna},
-      year={2026},
-      eprint={2605.02881},
-      archivePrefix={arXiv},
-      primaryClass={cs.RO},
-      url={https://arxiv.org/abs/2605.02881},
-}
-```
-
-## License
-
-This model is licensed under Apache 2.0. It is intended for research and
-educational use in accordance with
-[Ai2's Responsible Use Guidelines](https://allenai.org/responsible-use),
-consistent with [allenai/molmoact2](https://github.com/allenai/molmoact2).
--- a/docs/source/robometer.mdx
+++ b/docs/source/robometer.mdx
@@ -1,185 +0,0 @@
-# ROBOMETER
-
-ROBOMETER is a **general-purpose video-language robotic reward model**. It predicts dense, frame-level task progress and frame-level success from a trajectory video and a task description.
-
-**Paper**: [ROBOMETER: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons](https://arxiv.org/abs/2603.02115)
-**Project**: [robometer.github.io](https://robometer.github.io/)
-**Original code**: [github.com/robometer/robometer](https://github.com/robometer/robometer)
-**Checkpoint**: [lerobot/Robometer-4B](https://huggingface.co/lerobot/Robometer-4B)
-
-## Overview
-
-ROBOMETER builds on `Qwen/Qwen3-VL-4B-Instruct` and adds three lightweight prediction heads:
-
-- **Progress head**: predicts per-frame task progress in `[0, 1]`.
-- **Success head**: predicts per-frame task success probability.
-- **Preference head**: predicts which of two trajectories better completes the task during training.
-
-The paper trains ROBOMETER with a composite objective:
-
-```text
-L = L_pref + L_prog + L_succ
-```
-
-The LeRobot integration is currently **inference-only**. It preserves the preference head so that the published `Robometer-4B` checkpoint loads without remapping, but `compute_reward()` queries the progress or success head only.
-
-## What the LeRobot Integration Covers
-
-- Standard `reward_model.type=robometer` configuration through LeRobot.
-- Qwen3-VL image and text preprocessing through `RobometerEncoderProcessorStep`.
-- LeRobot reward-model save/load APIs through `PreTrainedRewardModel`.
-- Dense, frame-level progress and success predictions internally.
-- A scalar reward through `compute_reward()` for downstream LeRobot reward-model usage.
-
-This page focuses on using the published ROBOMETER checkpoint as a zero-shot reward model. Training ROBOMETER from scratch is outside the current LeRobot integration.
-
-## Installation Requirements
-
-1. Install LeRobot by following the [Installation Guide](./installation).
-2. Install the ROBOMETER dependencies:
-
-```bash
-pip install -e ".[robometer]"
-```
-
-If you use `uv` directly from a source checkout:
-
-```bash
-uv sync --extra robometer
-```
-
-ROBOMETER uses a Qwen3-VL-4B backbone, so GPU inference is strongly recommended.
-
-## Model Inputs and Outputs
-
-ROBOMETER expects:
-
-- A trajectory video or sequence of frames.
-- A natural-language task description.
-
-In LeRobot datasets, the preprocessor reads:
-
-| Config field              | Default                  | Meaning                                               |
-| ------------------------- | ------------------------ | ----------------------------------------------------- |
-| `reward_model.image_key`  | `observation.images.top` | Camera/video observation used by ROBOMETER            |
-| `reward_model.task_key`   | `task`                   | Key in complementary data that stores the task string |
-| `reward_model.max_frames` | `8`                      | Maximum number of frames passed to ROBOMETER          |
-
-The model predicts per-frame progress and success internally. The LeRobot reward API returns a scalar per sample:
-
-- `reward_output="progress"` (default): return the last-frame progress, clamped to `[0, 1]`.
-- `reward_output="success"`: return `1.0` if the last-frame success probability is above `success_threshold`, otherwise `0.0`.
-
-## Usage
-
-### Load the Reward Model Directly
-
-```python
-from lerobot.rewards.robometer import RobometerConfig, RobometerRewardModel
-
-cfg = RobometerConfig(
-    pretrained_path="lerobot/Robometer-4B",
-    device="cuda",
-    reward_output="progress",
-)
-reward_model = RobometerRewardModel.from_pretrained(cfg.pretrained_path, config=cfg)
-```
-
-### Encode Frames and Compute a Reward
-
-For a direct Python call, provide frames as `uint8` arrays with shape `(T, H, W, C)` and a task string:
-
-```python
-from lerobot.rewards.robometer.modeling_robometer import ROBOMETER_FEATURE_PREFIX
-from lerobot.rewards.robometer.processor_robometer import RobometerEncoderProcessorStep
-
-# frames: np.ndarray, shape (T, H, W, C), dtype uint8
-# task: str
-encoder = RobometerEncoderProcessorStep(
-    base_model_id=cfg.base_model_id,
-    use_multi_image=cfg.use_multi_image,
-    use_per_frame_progress_token=cfg.use_per_frame_progress_token,
-    max_frames=cfg.max_frames,
-)
-
-encoded = encoder.encode_samples([(frames, task)])
-batch = {f"{ROBOMETER_FEATURE_PREFIX}{key}": value for key, value in encoded.items()}
-
-reward = reward_model.compute_reward(batch)
-```
-
-`reward` is a tensor of shape `(batch_size,)`.
-
-### Use the Reward Factory
-
-You can also instantiate ROBOMETER through the reward factory:
-
-```python
-from lerobot.rewards import make_reward_model, make_reward_model_config, make_reward_pre_post_processors
-
-cfg = make_reward_model_config(
-    "robometer",
-    pretrained_path="lerobot/Robometer-4B",
-    device="cuda",
-    image_key="observation.images.top",
-)
-reward_model = make_reward_model(cfg)
-preprocessor, postprocessor = make_reward_pre_post_processors(cfg)
-```
-
-The preprocessor writes Qwen-VL tensors under the `observation.robometer.*` namespace, and `compute_reward()` reads those encoded tensors.
-
-## Configuration Notes
-
-### Backbone and Vocabulary
-
-The published checkpoint uses a Qwen3-VL-4B backbone. ROBOMETER adds five special tokens to the tokenizer in a fixed order:
-
-```text
-<|split_token|>
-<|reward_token|>
-<|pref_token|>
-<|sim_token|>
-<|prog_token|>
-```
-
-`<|prog_token|>` is inserted after each frame and is the hidden-state position used for per-frame progress and success prediction. `<|split_token|>` and `<|pref_token|>` are used by the paper's pairwise trajectory preference objective. `<|reward_token|>` and `<|sim_token|>` are preserved for checkpoint compatibility.
-
-The LeRobot config stores a serialized `vlm_config` with the post-resize vocabulary so the model can reload from `config.json` without downloading the base Qwen weights first. For `Qwen/Qwen3-VL-4B-Instruct`, the tokenizer length is `151669`, and the five ROBOMETER tokens produce the checkpoint vocabulary size `151674`.
-
-### Progress Prediction
-
-In the published checkpoint, progress is discrete. The progress head outputs logits over `progress_discrete_bins=10` uniformly spaced bin centers in `[0, 1]`. LeRobot converts these logits into a continuous value by applying a softmax and taking the expectation over bin centers, matching the upstream ROBOMETER implementation.
-
-### Success Prediction
-
-The success head outputs raw logits per frame. LeRobot converts them to probabilities with `sigmoid`. When `reward_output="success"`, `compute_reward()` thresholds the last-frame success probability using `success_threshold`.
-
-## Limitations
-
-- The current LeRobot integration is inference-only; it does not implement ROBOMETER training or preference-pair training.
-- `compute_reward()` returns a scalar per sample for the LeRobot reward-model API, even though ROBOMETER predicts per-frame progress and success internally.
-- ROBOMETER is video-language based; it does not use privileged robot state such as contact forces or object poses.
-
-## References
-
-- [ROBOMETER project](https://robometer.github.io/)
-- [ROBOMETER paper](https://arxiv.org/abs/2603.02115)
-- [Original ROBOMETER code](https://github.com/robometer/robometer)
-- [Published ROBOMETER-4B checkpoint](https://huggingface.co/lerobot/Robometer-4B)
-- [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct)
-
-## Citation
-
-```bibtex
-@inproceedings{liang2026robometer,
-title = {Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons},
-author={Anthony Liang and Yigit Korkmaz and Jiahui Zhang and Minyoung Hwang and Abrar Anwar and Sidhant Kaushik and Aditya Shah and Alex S. Huang and Luke Zettlemoyer and Dieter Fox and Yu Xiang and Anqi Li and Andreea Bobu and Abhishek Gupta and Stephen Tu and Erdem Biyik and Jesse Zhang},
-year={2026},
-booktitle={Robotics: Science and Systems 2026},
-}
-```
-
-## License
-
-This LeRobot integration follows the **Apache 2.0 License** used by LeRobot. Check the upstream ROBOMETER code and model pages for the licenses of the original implementation and released checkpoints.
--- a/docs/source/smolvla.mdx
+++ b/docs/source/smolvla.mdx
@@ -97,22 +97,22 @@ Similarly for when recording an episode, it is recommended that you are logged i
 Once you are logged in, you can run inference in your setup by doing:

 ```bash
-lerobot-rollout \
-  --strategy.type=base \
+lerobot-record \
  --robot.type=so101_follower \
  --robot.port=/dev/ttyACM0 \ # <- Use your port
  --robot.id=my_blue_follower_arm \ # <- Use your robot id
  --robot.cameras="{ front: {type: opencv, index_or_path: 8, width: 640, height: 480, fps: 30}}" \ # <- Use your cameras
-  --task="Grasp a lego block and put it in the bin." \ # <- Use the same task description you used in your dataset recording
-  # <- RTC optional, use when running on low power hardware \
-  # --inference.type=rtc \
-  # --inference.rtc.execution_horizon=10 \
-  # --inference.rtc.max_guidance_weight=10.0 \
+  --dataset.single_task="Grasp a lego block and put it in the bin." \ # <- Use the same task description you used in your dataset recording
+  --dataset.repo_id=${HF_USER}/eval_DATASET_NAME_test \  # <- This will be the dataset name on HF Hub
+  --dataset.episode_time_s=50 \
+  --dataset.num_episodes=10 \
+  --dataset.streaming_encoding=true \
+  --dataset.encoder_threads=2 \
+  # --dataset.camera_encoder.vcodec=auto \
  # <- Teleop optional if you want to teleoperate in between episodes \
  # --teleop.type=so100_leader \
  # --teleop.port=/dev/ttyACM0 \
  # --teleop.id=my_red_leader_arm \
-  # --display_data=true #optional use if you want to see the camera stream \
  --policy.path=HF_USER/FINETUNE_MODEL_NAME # <- Use your fine-tuned model
 ```

--- a/docs/source/topreward.mdx
+++ b/docs/source/topreward.mdx
@@ -1,177 +0,0 @@
-# TOPReward
-
-TOPReward is a **zero-shot reward model** that extracts token log-probabilities from an off-the-shelf vision-language model (VLM) as a robotic reward signal. Given a video trajectory and a task instruction, it returns the VLM's log-likelihood that the instruction is true — no fine-tuning required.
-
-**Paper**: [TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics](https://arxiv.org/abs/2602.19313)
-**Project**: [topreward.github.io](https://topreward.github.io/webpage/)
-**Original code**: [github.com/TOPReward/TOPReward](https://github.com/TOPReward/TOPReward)
-**Default backbone**: [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct)
-
-## Overview
-
-TOPReward asks a generic VLM how likely a task instruction is, **conditioned on the video** of a robot trying to complete that task. Concretely, given:
-
-- A trajectory video (a sequence of frames).
-- A task instruction (e.g. _"open the drawer"_).
-
-it builds a chat prompt of the form
-
-```text
-<video>
-"The above video shows a robot manipulation trajectory that completes the
- following task: <instruction> Decide whether the above statement is True
- or not. The answer is: True"
-```
-
-forwards it through the VLM, label-masks everything except the very last token, and reads back the log-probability of that token — by default the literal `"True"` that closes the suffix template. The resulting `log P("True" | video + prompt + instruction)` is the reward.
-
-Because the method only depends on a frozen VLM, TOPReward is **zero-shot**: there are no fine-tuned weights to host. The "model" in LeRobot is a small wrapper around `transformers`' `Qwen3VLForConditionalGeneration` plus the label-masking logic. The processor owns the tokeniser and builds the full chat prompt (EO-1/Robometer pattern).
-
-## What the LeRobot integration covers
-
-- Standard `reward_model.type=topreward` configuration through LeRobot.
-- VLM loading via the `transformers` `Qwen3VLForConditionalGeneration` API.
-- Prompt assembly + tokenisation in the processor (matching upstream `QwenClient.compute_instruction_reward`).
-- `compute_reward()` returns one scalar log-prob per sample.
-- LeRobot reward-model save/load — `save_pretrained` writes only `config.json` (the VLM is identified by `vlm_name`).
-- An offline labeling script that writes a `topreward_progress.parquet` (SARM-compatible schema) for RA-BC and overlay.
-
-The current LeRobot port supports the **Qwen3-VL client only**. Other upstream clients (Gemini, OpenAI, Gemma, Molmo) can be added as follow-up extras.
-
-## Installation Requirements
-
-1. Install LeRobot following the [Installation Guide](./installation).
-2. Install the TOPReward optional extra:
-
-```bash
-pip install -e ".[topreward]"
-```
-
-or, with `uv` from a source checkout:
-
-```bash
-uv sync --extra topreward
-```
-
-This pulls in `transformers`. The first time you run TOPReward, Hugging Face will also download the VLM weights from the Hub (~16 GB for Qwen3-VL-8B-Instruct). A GPU is strongly recommended.
-
-## Model Inputs and Outputs
-
-TOPReward expects:
-
-- A trajectory video or sequence of frames.
-- A natural-language task description.
-
-In LeRobot datasets the preprocessor reads:
-
-| Config field              | Default                     | Meaning                                       |
-| ------------------------- | --------------------------- | --------------------------------------------- |
-| `reward_model.image_key`  | `observation.images.top`    | Camera observation used by TOPReward          |
-| `reward_model.task_key`   | `task`                      | Key in complementary data for the task string |
-| `reward_model.max_frames` | `16`                        | Cap on frames per sample                      |
-| `reward_model.fps`        | `2.0`                       | Metadata passed to the Qwen video processor   |
-| `reward_model.vlm_name`   | `Qwen/Qwen3-VL-8B-Instruct` | Hugging Face Hub id of the underlying VLM     |
-
-The model returns:
-
-- `compute_reward(batch)`: one log-probability per sample. Higher = better task-video alignment. When `success_threshold` is finite, returns the binary thresholded value instead.
-
-## Usage
-
-### Load the reward model directly
-
-```python
-from lerobot.rewards.topreward import TOPRewardConfig, TOPRewardModel
-
-cfg = TOPRewardConfig(
-    vlm_name="Qwen/Qwen3-VL-8B-Instruct",
-    device="cuda",
-)
-reward_model = TOPRewardModel(cfg)
-```
-
-### Use the reward factory
-
-```python
-from lerobot.rewards import make_reward_model, make_reward_model_config, make_reward_pre_post_processors
-
-cfg = make_reward_model_config(
-    "topreward",
-    vlm_name="Qwen/Qwen3-VL-8B-Instruct",
-    device="cuda",
-    image_key="observation.images.top",
-)
-reward_model = make_reward_model(cfg)
-preprocessor, postprocessor = make_reward_pre_post_processors(cfg)
-```
-
-The preprocessor tokenises the full prompt (video + prefix + instruction suffix), writes Qwen-VL tensors + `prompt_length` under `observation.topreward.*`. The model reads those tensors, label-masks based on `prompt_length`, and extracts the log-prob reward.
-
-### Offline dataset labeling
-
-Write a `topreward_progress.parquet` for RA-BC training and overlay videos:
-
-```bash
-# Sparse-dense (15 anchors per episode, matches upstream)
-uv run python -m lerobot.rewards.topreward.compute_rabc_weights \
-    --dataset-repo-id lerobot/libero_10_image \
-    --num-samples 15 \
-    --device cuda
-```
-
-Then render the progress overlay for any episode:
-
-```bash
-uv run examples/dataset/create_progress_videos.py \
-    --repo-id lerobot/libero_10_image \
-    --episode 0 \
-    --progress-file topreward_progress.parquet \
-    --gif
-```
-
-## Configuration Notes
-
-### Prompt knobs
-
-The default prompt mirrors the upstream paper:
-
-```text
-prompt_prefix = "The above video shows a robot manipulation trajectory that completes the following task: "
-prompt_suffix_template = "{instruction} Decide whether the above statement is True or not. The answer is: True"
-```
-
-Both are exposed on `TOPRewardConfig` for ablation. The suffix template **must** contain `{instruction}`.
-
-### Chat template
-
-`add_chat_template=True` wraps the full prompt (including instruction) with the tokenizer's chat template before tokenisation. Default is `False`, matching the upstream paper's main experiments.
-
-## Limitations
-
-- The current LeRobot port is **inference-only and zero-shot**; `forward()` is not overridden and `is_trainable` returns `False`.
-- Only the **Qwen3-VL family** is supported; other upstream clients are out of scope.
-- TOPReward inherits the underlying VLM's biases.
-
-## References
-
-- [TOPReward project page](https://topreward.github.io/webpage/)
-- [TOPReward paper](https://arxiv.org/abs/2602.19313)
-- [Original TOPReward code](https://github.com/TOPReward/TOPReward)
-- [Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct)
-
-## Citation
-
-```bibtex
-@article{chen2026topreward,
-  title={TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics},
-  author={Chen, Shirui and Harrison, Cole and Lee, Ying-Chun and Yang, Angela Jin and
-          Ren, Zhongzheng and Ratliff, Lillian J and Duan, Jiafei and Fox, Dieter and
-          Krishna, Ranjay},
-  journal={arXiv preprint arXiv:2602.19313},
-  year={2026}
-}
-```
-
-## License
-
-The original TOPReward codebase is MIT-licensed. The LeRobot port follows the LeRobot Apache 2.0 license; the wrapped Qwen3-VL weights are subject to the original Qwen license.
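Note on the removed TOPReward doc: it describes the model as label-masking on `prompt_length` and returning one log-probability per sample, optionally thresholded by `success_threshold`. The following is a minimal pure-Python sketch of that idea, not the removed implementation. It assumes the reward is the summed log-probability of every token after `prompt_length`, and that thresholding uses `>=`; the function names `suffix_log_prob` and `to_reward` are hypothetical.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def suffix_log_prob(logits, token_ids, prompt_length):
    """Sum log p(token_t | tokens_<t) for positions t >= prompt_length.

    logits[t - 1] is the next-token distribution that predicts token_ids[t],
    so positions before prompt_length are simply skipped (label masking).
    """
    total = 0.0
    for t in range(max(prompt_length, 1), len(token_ids)):
        total += log_softmax(logits[t - 1])[token_ids[t]]
    return total


def to_reward(log_prob, success_threshold=math.inf):
    """Return the raw log-prob, or a 0/1 value when the threshold is finite.

    Assumption: a non-finite threshold means "thresholding disabled".
    """
    if math.isfinite(success_threshold):
        return 1.0 if log_prob >= success_threshold else 0.0
    return log_prob


# Uniform logits over a 4-token vocabulary, one scored suffix token.
logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
score = suffix_log_prob(logits, token_ids=[0, 1, 2], prompt_length=2)
```

With uniform logits the single scored token has probability 1/4, so `score` equals `math.log(0.25)`.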
--- a/examples/dataset/create_progress_videos.py
+++ b/examples/dataset/create_progress_videos.py
@@ -15,12 +15,10 @@
 # limitations under the License.

 """
-Create MP4 (or GIF) videos with per-frame progress overlay for specified episodes.
+Create MP4 (or GIF) videos with sarm_progress overlay for specified episodes.

 Downloads datasets from HuggingFace, seeks directly into the episode segment
 of the source video, draws a progress line on each frame, and writes the result.
-The progress data is read from a parquet file that lives alongside the dataset
-(configurable via ``--progress-file``).

 Usage:
    python examples/dataset/create_progress_videos.py \
@@ -58,26 +56,22 @@ SCORE_FONT_SCALE = 0.8
 TASK_FONT_SCALE = 0.55


-def download_episode_metadata(
-    repo_id: str, episode: int, progress_file: str = "sarm_progress.parquet"
-) -> Path:
-    """Download only the metadata and per-frame progress file for a dataset.
+def download_episode_metadata(repo_id: str, episode: int) -> Path:
+    """Download only the metadata and sarm_progress files for a dataset.

    Args:
        repo_id: HuggingFace dataset repository ID.
        episode: Episode index (used for logging only; all meta is fetched).
-        progress_file: Filename of the per-frame progress parquet inside the
-            dataset repo.

    Returns:
        Local cache path for the downloaded snapshot.
    """
-    logging.info("[1/4] Downloading metadata + %s for %s (episode %d) ...", progress_file, repo_id, episode)
+    logging.info("[1/4] Downloading metadata for %s (episode %d) ...", repo_id, episode)
    local_path = Path(
        snapshot_download(
            repo_id=repo_id,
            repo_type="dataset",
-            allow_patterns=["meta/**", progress_file],
+            allow_patterns=["meta/**", "sarm_progress.parquet"],
            ignore_patterns=["*.mp4"],
        )
    )
@@ -221,28 +215,25 @@ def download_video_file(repo_id: str, local_path: Path, video_rel: str) -> Path:
    return video_path


-def load_progress_data(
-    local_path: Path, episode: int, progress_file: str = "sarm_progress.parquet"
-) -> np.ndarray | None:
-    """Load per-frame progress values for an episode.
+def load_progress_data(local_path: Path, episode: int) -> np.ndarray | None:
+    """Load sarm_progress values for an episode.

    Args:
        local_path: Dataset cache root.
        episode: Episode index.
-        progress_file: Filename of the per-frame progress parquet.

    Returns:
        Sorted (N, 2) array of (frame_index, progress), or None if unavailable.
    """
-    parquet_path = local_path / progress_file
+    parquet_path = local_path / "sarm_progress.parquet"
    if not parquet_path.exists():
-        logging.warning("%s not found", progress_file)
+        logging.warning("sarm_progress.parquet not found")
        return None
    df = pd.read_parquet(parquet_path)
-    logging.info("   %s columns: %s", progress_file, list(df.columns))
+    logging.info("   sarm_progress.parquet columns: %s", list(df.columns))
    episode_df = df[df["episode_index"] == episode].copy()
    if episode_df.empty:
-        logging.warning("No progress rows for episode %d in %s", episode, progress_file)
+        logging.warning("No sarm_progress rows for episode %d", episode)
        return None
    episode_df = episode_df.sort_values("frame_index")

@@ -585,7 +576,6 @@ def process_dataset(
    camera_key: str | None,
    output_dir: Path,
    create_gif: bool = False,
-    progress_file: str = "sarm_progress.parquet",
 ) -> Path | None:
    """Full pipeline: download, extract metadata, composite progress, write output.

@@ -595,8 +585,6 @@ def process_dataset(
        camera_key: Camera key to use, or None for auto-selection.
        output_dir: Directory to write output files.
        create_gif: If True, also generate a GIF from the MP4.
-        progress_file: Filename of the per-frame progress parquet inside the
-            dataset repo.

    Returns:
        Path to the final output file, or None on failure.
@@ -604,7 +592,7 @@ def process_dataset(
    safe_name = repo_id.replace("/", "_")
    logging.info("Processing: %s  |  episode %d", repo_id, episode)

-    local_path = download_episode_metadata(repo_id, episode, progress_file)
+    local_path = download_episode_metadata(repo_id, episode)
    logging.info("   Local cache: %s", local_path)

    episode_meta = load_episode_meta(local_path, episode, camera_key)
@@ -612,9 +600,9 @@ def process_dataset(

    video_path = download_video_file(repo_id, local_path, episode_meta["video_rel"])

-    progress_data = load_progress_data(local_path, episode, progress_file)
+    progress_data = load_progress_data(local_path, episode)
    if progress_data is None:
-        logging.error("Could not load progress data from %s. Skipping overlay.", progress_file)
+        logging.error("Could not load sarm_progress data. Skipping overlay.")
        return None

    logging.info("   Progress frames: %d", len(progress_data))
@@ -639,7 +627,7 @@ def process_dataset(

 def main() -> None:
    parser = argparse.ArgumentParser(
-        description="Create MP4/GIF videos with per-frame progress overlay for dataset episodes."
+        description="Create MP4/GIF videos with sarm_progress overlay for dataset episodes."
    )
    parser.add_argument(
        "--repo-id",
@@ -670,15 +658,6 @@ def main() -> None:
        action="store_true",
        help="Also generate a GIF from the MP4 output.",
    )
-    parser.add_argument(
-        "--progress-file",
-        type=str,
-        default="sarm_progress.parquet",
-        help=(
-            "Filename of the per-frame progress parquet inside the dataset repo "
-            "(default: 'sarm_progress.parquet')."
-        ),
-    )
    args = parser.parse_args()

    logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
@@ -691,7 +670,6 @@ def main() -> None:
        camera_key=args.camera_key,
        output_dir=args.output_dir,
        create_gif=args.gif,
-        progress_file=args.progress_file,
    )

    if result:
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -138,9 +138,7 @@ dataset_viz = ["lerobot[dataset]", "lerobot[viz]"]
 # Common
 av-dep = ["av>=15.0.0,<16.0.0"]
 pygame-dep = ["pygame>=2.5.1,<2.7.0"]
-# NOTE: 0.9.16 links against liburdfdom_sensor.so.4, which is unavailable on Ubuntu 24.04
-# (noble ships urdfdom 3.x). Cap below 0.9.16 until system urdfdom 4.x is broadly available.
-placo-dep = ["placo>=0.9.6,<0.9.16"]
+placo-dep = ["placo>=0.9.6,<0.9.17"]
 transformers-dep = ["transformers>=5.4.0,<5.6.0"]
 grpcio-dep = ["grpcio==1.73.1", "protobuf>=6.31.1,<6.32.0"]
 can-dep = ["python-can>=4.2.0,<5.0.0"]
@@ -198,7 +196,6 @@ wallx = [
    "lerobot[qwen-vl-utils-dep]",
 ]
 pi = ["lerobot[transformers-dep]", "lerobot[scipy-dep]"]
-molmoact2 = ["lerobot[transformers-dep]", "lerobot[peft-dep]", "lerobot[scipy-dep]"]
 smolvla = ["lerobot[transformers-dep]", "num2words>=0.5.14,<0.6.0", "accelerate>=1.7.0,<2.0.0"]
 multi_task_dit = ["lerobot[transformers-dep]", "lerobot[diffusers-dep]"]
 groot = [
@@ -212,8 +209,6 @@ groot = [
    "flash-attn>=2.5.9,<3.0.0 ; sys_platform != 'darwin'"
 ]
 sarm = ["lerobot[transformers-dep]", "pydantic>=2.0.0,<3.0.0", "faker>=33.0.0,<35.0.0", "lerobot[matplotlib-dep]", "lerobot[qwen-vl-utils-dep]"]
-robometer = ["lerobot[transformers-dep]", "lerobot[qwen-vl-utils-dep]", "lerobot[peft-dep]"]
-topreward = ["lerobot[transformers-dep]"]
 xvla = ["lerobot[transformers-dep]"]
 eo1 = ["lerobot[transformers-dep]", "lerobot[qwen-vl-utils-dep]"]
 hilserl = ["lerobot[transformers-dep]", "lerobot[dataset]", "gym-hil>=0.1.13,<0.2.0", "lerobot[grpcio-dep]", "lerobot[placo-dep]"]
@@ -277,7 +272,6 @@ all = [
    "lerobot[multi_task_dit]",
    "lerobot[wallx]",
    "lerobot[pi]",
-    "lerobot[molmoact2]",
    "lerobot[smolvla]",
    # "lerobot[groot]", TODO(Steven): Gr00t requires specific installation instructions for flash-attn
    "lerobot[xvla]",
@@ -292,8 +286,6 @@ all = [
    "lerobot[libero]; sys_platform == 'linux'",
    "lerobot[metaworld]",
    "lerobot[sarm]",
-    "lerobot[robometer]",
-    "lerobot[topreward]",
    "lerobot[peft]",
    # "lerobot[unitree_g1]", TODO: Unitree requires specific installation instructions for unitree_sdk2
 ]
@@ -409,11 +401,8 @@ default.extend-ignore-identifiers-re = [
    "ein",
    "thw",
    "inpt",
-    "arange",
-    "is_compileable",
    "ROBOTIS",
-    "OT_VALUE",
-    "VanderBilt"
+    "OT_VALUE"
 ]

 # TODO: Uncomment when ready to use
--- a/src/lerobot/configs/parser.py
+++ b/src/lerobot/configs/parser.py
@@ -255,7 +255,8 @@ def extract_path_fields_from_config(config_path: str, path_fields: list[str]) ->
            remaining = config_data[field]
            if remaining:
                _config_yaml_overrides[field] = _flatten_to_cli_args(remaining)
-            del config_data[field]
+            else:
+                del config_data[field]
            modified = True

    if not modified:
@@ -310,13 +311,7 @@ def wrap(config_path: Path | None = None) -> Callable[[F], F]:
                    cli_args = filter_arg("config_path", cli_args)
                    cfg = argtype.from_pretrained(config_path_cli, cli_args=cli_args)
                else:
-                    if config_path_cli:
-                        cli_args = filter_arg("config_path", cli_args)
-                    cfg = draccus.parse(
-                        config_class=argtype,
-                        config_path=config_path_cli or config_path,
-                        args=cli_args,
-                    )
+                    cfg = draccus.parse(config_class=argtype, config_path=config_path, args=cli_args)
            response = fn(cfg, *args, **kwargs)
            return response

--- a/src/lerobot/model/kinematics.py
+++ b/src/lerobot/model/kinematics.py
@@ -18,25 +18,12 @@ from typing import TYPE_CHECKING

 import numpy as np

-from lerobot.utils.import_utils import require_package
+from lerobot.utils.import_utils import _placo_available, require_package

-_placo_runtime_error: ImportError | None = None
-
-if TYPE_CHECKING:
+if TYPE_CHECKING or _placo_available:
    import placo  # type: ignore[import-not-found]
 else:
-    try:
-        import placo  # type: ignore[import-not-found]
-    except ImportError as _placo_import_err:
-        placo = None
-        _placo_runtime_error = _placo_import_err
-
-
-def _raise_if_placo_unusable() -> None:
-    if placo is None and _placo_runtime_error is not None:
-        raise ImportError(
-            f"placo is installed but failed to import: {_placo_runtime_error!s}"
-        ) from _placo_runtime_error
+    placo = None


 class RobotKinematics:
@@ -57,7 +44,6 @@ class RobotKinematics:
            joint_names (list[str] | None): List of joint names to use for the kinematics solver
        """
        require_package("placo", extra="placo-dep")
-        _raise_if_placo_unusable()

        self.robot = placo.RobotWrapper(urdf_path)
        self.solver = placo.KinematicsSolver(self.robot)
--- a/src/lerobot/motors/robstride/robstride.py
+++ b/src/lerobot/motors/robstride/robstride.py
@@ -43,7 +43,6 @@ from .tables import (
    CAN_CMD_SET_ZERO,
    DEFAULT_BAUDRATE,
    DEFAULT_TIMEOUT_MS,
-    HANDSHAKE_TIMEOUT_S,
    MODEL_RESOLUTION,
    MOTOR_LIMIT_PARAMS,
    NORMALIZED_DATA,
@@ -216,16 +215,14 @@ class RobstrideMotorsBus(MotorsBusBase):
            self._is_connected = False
            raise ConnectionError(f"Failed to connect to CAN bus: {e}") from e

-    def _query_status_via_clear_fault(
-        self, motor: NameOrID, timeout: float = RUNNING_TIMEOUT
-    ) -> tuple[bool, can.Message | None]:
+    def _query_status_via_clear_fault(self, motor: NameOrID) -> tuple[bool, can.Message | None]:
        motor_name = self._get_motor_name(motor)
        motor_id = self._get_motor_id(motor_name)
        recv_id = self._get_motor_recv_id(motor_name)
        data = [0xFF] * 7 + [CAN_CMD_CLEAR_FAULT]
        msg = can.Message(arbitration_id=motor_id, data=data, is_extended_id=False)
        self._bus().send(msg)
-        return self._recv_status_via_clear_fault(expected_recv_id=recv_id, timeout=timeout)
+        return self._recv_status_via_clear_fault(expected_recv_id=recv_id)

    def _recv_status_via_clear_fault(
        self, expected_recv_id: int | None = None, timeout: float = RUNNING_TIMEOUT
@@ -283,7 +280,7 @@ class RobstrideMotorsBus(MotorsBusBase):
        faulted_motors = []

        for motor_name in self.motors:
-            has_fault, msg = self._query_status_via_clear_fault(motor_name, timeout=HANDSHAKE_TIMEOUT_S)
+            has_fault, msg = self._query_status_via_clear_fault(motor_name)
            if msg is None:
                missing_motors.append(motor_name)
            elif has_fault:
@@ -508,87 +505,6 @@ class RobstrideMotorsBus(MotorsBusBase):

        return responses

-    def _recv_all_messages_until_quiet(
-        self,
-        *,
-        timeout: float = RUNNING_TIMEOUT,
-        max_messages: int = 4096,
-    ) -> list[can.Message]:
-        """
-        Receive frames until the bus goes quiet.
-
-        Args:
-            timeout: Poll timeout used for each recv() call. Collection stops
-                when one recv() times out (quiet gap).
-            max_messages: Safety cap to prevent unbounded loops.
-        """
-        out: list[can.Message] = []
-        max_messages = max(1, max_messages)
-        timeout = max(0.0, timeout)
-
-        try:
-            while len(out) < max_messages:
-                msg = self._bus().recv(timeout=timeout)
-                if msg is None:
-                    break
-                out.append(msg)
-        except (can.CanError, OSError) as e:
-            logger.debug(f"Error draining CAN RX queue on {self.port}: {e}")
-
-        return out
-
-    def _process_feedback_messages(self, messages: list[can.Message]) -> set[int]:
-        """
-        Decode all received feedback frames and update cached motor states.
-
-        Returns:
-            Set of payload recv_ids that were successfully mapped to motors.
-        """
-        processed_recv_ids: set[int] = set()
-        for msg in messages:
-            if len(msg.data) < 1:
-                logger.debug(
-                    f"Dropping short CAN frame on {self.port} "
-                    f"(arb=0x{int(msg.arbitration_id):02X}, data={bytes(msg.data).hex()})"
-                )
-                continue
-
-            recv_id = int(msg.data[0])
-            motor_name = self._recv_id_to_motor.get(recv_id)
-            if motor_name is None:
-                logger.debug(
-                    f"Unmapped CAN frame on {self.port} "
-                    f"(arb=0x{int(msg.arbitration_id):02X}, recv_id=0x{recv_id:02X}, data={bytes(msg.data).hex()})"
-                )
-                continue
-
-            self._process_response(motor_name, msg)
-            processed_recv_ids.add(recv_id)
-
-        return processed_recv_ids
-
-    def flush_rx_queue(self, poll_timeout_s: float = 0.0005, max_messages: int = 4096) -> int:
-        """
-        Drain pending RX frames from the CAN interface.
-
-        This is used by higher-level controllers to drop stale feedback before issuing
-        a fresh read cycle, so subsequent state reads are based on most recent replies.
-        It should also be called once when a controller instance is created/connected,
-        to clear residual frames left on the interface from previous sessions.
-        """
-        drained = 0
-        poll_timeout_s = max(0.0, poll_timeout_s)
-        max_messages = max(1, max_messages)
-        try:
-            while drained < max_messages:
-                msg = self._bus().recv(timeout=poll_timeout_s)
-                if msg is None:
-                    break
-                drained += 1
-        except (can.CanError, OSError) as e:
-            logger.debug(f"Failed to flush CAN RX queue on {self.port}: {e}")
-        return drained
-
    def _speed_control(
        self,
        motor: NameOrID,
@@ -728,14 +644,11 @@ class RobstrideMotorsBus(MotorsBusBase):
            msg = can.Message(arbitration_id=motor_id, data=data, is_extended_id=False)
            self._bus().send(msg)
            recv_id_to_motor[self._get_motor_recv_id(motor)] = motor_name
-        # Read every feedback frame until RX goes quiet, then decode all of them.
-        # This avoids dropping useful frames when responses from different motors interleave.
-        messages = self._recv_all_messages_until_quiet()
-        processed_recv_ids = self._process_feedback_messages(messages)

+        responses = self._recv_all_responses(list(recv_id_to_motor.keys()), timeout=RUNNING_TIMEOUT)
        for recv_id, motor_name in recv_id_to_motor.items():
-            if recv_id not in processed_recv_ids:
-                logger.warning(f"Packet drop: {motor_name} (ID: 0x{recv_id:02X}). Using last known state.")
+            if msg := responses.get(recv_id):
+                self._process_response(motor_name, msg)

    def _float_to_uint(self, x: float, x_min: float, x_max: float, bits: int) -> int:
        """Convert float to unsigned integer for CAN transmission."""
@@ -798,10 +711,7 @@ class RobstrideMotorsBus(MotorsBusBase):
        try:
            self._decode_motor_state(msg.data)
        except Exception as e:
-            logger.warning(
-                f"Failed to decode response from {motor} "
-                f"(arb=0x{int(msg.arbitration_id):02X}, data={bytes(msg.data).hex()}): {e}"
-            )
+            logger.warning(f"Failed to decode response from {motor}: {e}")

    def _get_cached_value(self, motor: str, data_name: str) -> Value:
        """Retrieve a specific value from the state cache."""
@@ -938,12 +848,20 @@ class RobstrideMotorsBus(MotorsBusBase):
            self._bus().send(msg)
            updated_motors.append(motor)

-        messages = self._recv_all_messages_until_quiet()
-        processed_recv_ids = self._process_feedback_messages(messages)
+        expected_recv_ids = [self._get_motor_recv_id(motor) for motor in updated_motors]
+        responses = self._recv_all_responses(expected_recv_ids, timeout=RUNNING_TIMEOUT)
+
+        for response in responses.values():
+            payload_motor_name = self._recv_id_to_motor.get(response.data[0])
+            if payload_motor_name is not None:
+                self._process_response(payload_motor_name, response)
+            else:
+                # Fallback: still attempt to decode based on payload byte0 mapping.
+                self._decode_motor_state(response.data)

        for motor in updated_motors:
            recv_id = self._get_motor_recv_id(motor)
-            if recv_id not in processed_recv_ids:
+            if recv_id not in responses:
                logger.warning(f"Packet drop: {motor} (ID: 0x{recv_id:02X}). Using last known state.")

    def read_calibration(self) -> dict[str, MotorCalibration]:
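Note on the `robstride.py` change above: the read path goes back to waiting for a known set of `recv_id`s, instead of draining the bus until it goes quiet. The body of `_recv_all_responses` is not shown in this diff, so the following is only a sketch of that pattern under stated assumptions: `Frame`, `FakeBus` and `recv_expected` are hypothetical stand-ins, and byte 0 of the payload is taken as the sender's `recv_id`, as the `response.data[0]` lookup in the hunk suggests.

```python
import time
from collections import deque
from dataclasses import dataclass


@dataclass
class Frame:
    data: bytes  # byte 0 carries the sender's recv_id in this sketch


class FakeBus:
    """In-memory stand-in for a CAN bus; recv() returns None when empty."""

    def __init__(self, frames):
        self._queue = deque(frames)

    def recv(self, timeout=0.0):
        return self._queue.popleft() if self._queue else None


def recv_expected(bus, expected_ids, timeout):
    """Collect one frame per expected recv_id until all arrive or time runs out."""
    pending = set(expected_ids)
    responses = {}
    deadline = time.monotonic() + timeout
    while pending:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        frame = bus.recv(timeout=remaining)
        if frame is None:
            break  # the poll timed out: nothing more is coming
        if len(frame.data) < 1:
            continue  # too short to carry an id
        recv_id = frame.data[0]
        if recv_id in pending:
            responses[recv_id] = frame
            pending.discard(recv_id)
        # frames from unexpected ids are ignored
    return responses


bus = FakeBus([Frame(bytes([0x02, 1])), Frame(bytes([0x09, 0])), Frame(bytes([0x01, 5]))])
responses = recv_expected(bus, [0x01, 0x02, 0x03], timeout=0.05)
missing = [i for i in (0x01, 0x02, 0x03) if i not in responses]
```

Any id left in `missing` corresponds to the "Packet drop ... Using last known state." warning in the hunk. The trade-off against the removed drain-until-quiet approach is that frames from ids outside the expected set are discarded rather than decoded.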
--- a/src/lerobot/motors/robstride/tables.py
+++ b/src/lerobot/motors/robstride/tables.py
@@ -114,8 +114,7 @@ CAN_CMD_SAVE_PARAM = 0xAA
 CAN_PARAM_ID = 0x7FF


-RUNNING_TIMEOUT = 0.003
-HANDSHAKE_TIMEOUT_S = 0.05
+RUNNING_TIMEOUT = 0.001
 PARAM_TIMEOUT = 0.01

 STATE_CACHE_TTL_S = 0.02
--- a/src/lerobot/policies/__init__.py
+++ b/src/lerobot/policies/__init__.py
@@ -20,7 +20,6 @@ from .eo1.configuration_eo1 import EO1Config as EO1Config
 from .factory import get_policy_class, make_policy, make_policy_config, make_pre_post_processors
 from .gaussian_actor.configuration_gaussian_actor import GaussianActorConfig as GaussianActorConfig
 from .groot.configuration_groot import GrootConfig as GrootConfig
-from .molmoact2.configuration_molmoact2 import MolmoAct2Config as MolmoAct2Config
 from .multi_task_dit.configuration_multi_task_dit import MultiTaskDiTConfig as MultiTaskDiTConfig
 from .pi0.configuration_pi0 import PI0Config as PI0Config
 from .pi0_fast.configuration_pi0_fast import PI0FastConfig as PI0FastConfig
@@ -44,7 +43,6 @@ __all__ = [
    "EO1Config",
    "GaussianActorConfig",
    "GrootConfig",
-    "MolmoAct2Config",
    "MultiTaskDiTConfig",
    "PI0Config",
    "PI0FastConfig",
--- a/src/lerobot/policies/factory.py
+++ b/src/lerobot/policies/factory.py
@@ -18,7 +18,6 @@ from __future__ import annotations

 import importlib
 import logging
-from copy import copy
 from typing import TYPE_CHECKING, Any, TypedDict, Unpack

 import torch
@@ -49,8 +48,7 @@ from .act.configuration_act import ACTConfig
 from .diffusion.configuration_diffusion import DiffusionConfig
 from .eo1.configuration_eo1 import EO1Config
 from .gaussian_actor.configuration_gaussian_actor import GaussianActorConfig
-from .groot.configuration_groot import GROOT_N1_7, GrootConfig
-from .molmoact2.configuration_molmoact2 import MolmoAct2Config
+from .groot.configuration_groot import GrootConfig
 from .multi_task_dit.configuration_multi_task_dit import MultiTaskDiTConfig
 from .pi0.configuration_pi0 import PI0Config
 from .pi05.configuration_pi05 import PI05Config
@@ -90,8 +88,7 @@ def get_policy_class(name: str) -> type[PreTrainedPolicy]:

    Args:
        name: The name of the policy. Supported names are "tdmpc", "diffusion", "act",
-            "multi_task_dit", "vqbet", "pi0", "pi05", "gaussian_actor", "smolvla", "wall_x",
-            "molmoact2".
+            "multi_task_dit", "vqbet", "pi0", "pi05", "gaussian_actor", "smolvla", "wall_x".
    Returns:
        The policy class corresponding to the given name.

@@ -154,10 +151,6 @@ def get_policy_class(name: str) -> type[PreTrainedPolicy]:
        from .eo1.modeling_eo1 import EO1Policy

        return EO1Policy
-    elif name == "molmoact2":
-        from .molmoact2.modeling_molmoact2 import MolmoAct2Policy
-
-        return MolmoAct2Policy
    else:
        try:
            return _get_policy_cls_from_policy_name(name=name)
@@ -175,7 +168,7 @@ def make_policy_config(policy_type: str, **kwargs) -> PreTrainedConfig:
    Args:
        policy_type: The type of the policy. Supported types include "tdmpc",
                     "multi_task_dit", "diffusion", "act", "vqbet", "pi0", "pi05", "gaussian_actor",
-                     "smolvla", "wall_x", "molmoact2".
+                     "smolvla", "wall_x".
        **kwargs: Keyword arguments to be passed to the configuration class constructor.

    Returns:
@@ -210,8 +203,6 @@ def make_policy_config(policy_type: str, **kwargs) -> PreTrainedConfig:
        return WallXConfig(**kwargs)
    elif policy_type == "eo1":
        return EO1Config(**kwargs)
-    elif policy_type == "molmoact2":
-        return MolmoAct2Config(**kwargs)
    else:
        try:
            config_cls = PreTrainedConfig.get_choice_class(policy_type)
@@ -240,7 +231,6 @@ class ProcessorConfigKwargs(TypedDict, total=False):
    preprocessor_overrides: dict[str, Any] | None
    postprocessor_overrides: dict[str, Any] | None
    dataset_stats: dict[str, dict[str, torch.Tensor]] | None
-    dataset_meta: Any | None


 def make_pre_post_processors(
@@ -274,47 +264,24 @@ def make_pre_post_processors(
            policy configuration type.
    """
    if pretrained_path:
-        if isinstance(policy_cfg, GrootConfig):
-            from .groot.configuration_groot import is_raw_groot_n1_7_checkpoint
-
-            if is_raw_groot_n1_7_checkpoint(pretrained_path):
-                from .groot.processor_groot import make_groot_pre_post_processors
-
-                processor_cfg = copy(policy_cfg)
-                processor_cfg.base_model_path = str(pretrained_path)
-                return make_groot_pre_post_processors(
-                    config=processor_cfg,
-                    dataset_stats=kwargs.get("dataset_stats"),
-                )
-
        # TODO(Steven): Temporary patch, implement correctly the processors for Gr00t
        if isinstance(policy_cfg, GrootConfig):
-            # GROOT handles normalization in its pack-inputs step
+            # GROOT handles normalization in groot_pack_inputs_v3 step
            # Need to override both stats AND normalize_min_max since saved config might be empty
-            dataset_stats = kwargs.get("dataset_stats")
-            preprocessor_overrides = dict(kwargs.get("preprocessor_overrides", {}))
-            postprocessor_overrides = dict(kwargs.get("postprocessor_overrides", {}))
-            pack_inputs_key = (
-                "groot_n1_7_pack_inputs_v1"
-                if policy_cfg.model_version == GROOT_N1_7
-                else "groot_pack_inputs_v3"
-            )
-            pack_input_overrides = dict(preprocessor_overrides.get(pack_inputs_key, {}))
-            pack_input_overrides["normalize_min_max"] = True
-            if dataset_stats is not None and policy_cfg.model_version != GROOT_N1_7:
-                pack_input_overrides["stats"] = dataset_stats
-            preprocessor_overrides[pack_inputs_key] = pack_input_overrides
+            preprocessor_overrides = {}
+            postprocessor_overrides = {}
+            preprocessor_overrides["groot_pack_inputs_v3"] = {
+                "stats": kwargs.get("dataset_stats"),
+                "normalize_min_max": True,
+            }

            # Also ensure postprocessing slices to env action dim and unnormalizes with dataset stats
            env_action_dim = policy_cfg.output_features[ACTION].shape[0]
-            action_unpack_overrides = dict(
-                postprocessor_overrides.get("groot_action_unpack_unnormalize_v1", {})
-            )
-            action_unpack_overrides["normalize_min_max"] = True
-            action_unpack_overrides["env_action_dim"] = env_action_dim
-            if dataset_stats is not None and policy_cfg.model_version != GROOT_N1_7:
-                action_unpack_overrides["stats"] = dataset_stats
-            postprocessor_overrides["groot_action_unpack_unnormalize_v1"] = action_unpack_overrides
+            postprocessor_overrides["groot_action_unpack_unnormalize_v1"] = {
+                "stats": kwargs.get("dataset_stats"),
+                "normalize_min_max": True,
+                "env_action_dim": env_action_dim,
+            }
            kwargs["preprocessor_overrides"] = preprocessor_overrides
            kwargs["postprocessor_overrides"] = postprocessor_overrides

@@ -447,15 +414,6 @@ def make_pre_post_processors(
            dataset_stats=kwargs.get("dataset_stats"),
        )

-    elif isinstance(policy_cfg, MolmoAct2Config):
-        from .molmoact2.processor_molmoact2 import make_molmoact2_pre_post_processors
-
-        processors = make_molmoact2_pre_post_processors(
-            config=policy_cfg,
-            dataset_stats=kwargs.get("dataset_stats"),
-            dataset_meta=kwargs.get("dataset_meta"),
-        )
-
    else:
        try:
            processors = _make_processors_from_policy_config(
@@ -541,10 +499,6 @@ def make_policy(
        action_names = ds_meta.features.get(ACTION, {}).get("names")
        if action_names is not None:
            cfg.action_feature_names = list(action_names)
-    if ds_meta is not None:
-        set_dataset_feature_metadata = getattr(cfg, "set_dataset_feature_metadata", None)
-        if callable(set_dataset_feature_metadata):
-            set_dataset_feature_metadata(ds_meta.features)

    kwargs["config"] = cfg

--- a/src/lerobot/policies/groot/__init__.py
+++ b/src/lerobot/policies/groot/__init__.py
@@ -18,12 +18,4 @@ from .configuration_groot import GrootConfig
 from .modeling_groot import GrootPolicy
 from .processor_groot import make_groot_pre_post_processors

-__all__ = ["GR00TN17", "GR00TN17Config", "GrootConfig", "GrootPolicy", "make_groot_pre_post_processors"]
-
-
-def __getattr__(name: str):
-    if name in {"GR00TN17", "GR00TN17Config"}:
-        from .groot_n1_7 import GR00TN17, GR00TN17Config
-
-        return {"GR00TN17": GR00TN17, "GR00TN17Config": GR00TN17Config}[name]
-    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
+__all__ = ["GrootConfig", "GrootPolicy", "make_groot_pre_post_processors"]
--- a/src/lerobot/policies/groot/action_head/cross_attention_dit.py
+++ b/src/lerobot/policies/groot/action_head/cross_attention_dit.py
@@ -181,7 +181,8 @@ class BasicTransformerBlock(nn.Module):
        attn_output = self.attn1(
            norm_hidden_states,
            encoder_hidden_states=encoder_hidden_states,
-            attention_mask=encoder_attention_mask if encoder_hidden_states is not None else attention_mask,
+            attention_mask=attention_mask,
+            # encoder_attention_mask=encoder_attention_mask,
        )
        if self.final_dropout:
            attn_output = self.final_dropout(attn_output)
@@ -317,71 +318,6 @@ class DiT(ModelMixin, ConfigMixin):
            return self.proj_out_2(hidden_states)


-class AlternateVLDiT(DiT):
-    """N1.7 DiT variant that alternates cross-attention over image and text tokens."""
-
-    def __init__(self, *args, attend_text_every_n_blocks: int = 2, **kwargs):
-        super().__init__(*args, **kwargs)
-        self.attend_text_every_n_blocks = attend_text_every_n_blocks
-
-    def forward(
-        self,
-        hidden_states: torch.Tensor,
-        encoder_hidden_states: torch.Tensor,
-        timestep: torch.LongTensor | None = None,
-        encoder_attention_mask: torch.Tensor | None = None,
-        return_all_hidden_states: bool = False,
-        image_mask: torch.Tensor | None = None,
-        backbone_attention_mask: torch.Tensor | None = None,
-    ):
-        if image_mask is None:
-            raise ValueError("image_mask is required for AlternateVLDiT.")
-        if backbone_attention_mask is None:
-            raise ValueError("backbone_attention_mask is required for AlternateVLDiT.")
-
-        temb = self.timestep_encoder(timestep)
-        hidden_states = hidden_states.contiguous()
-        encoder_hidden_states = encoder_hidden_states.contiguous()
-
-        image_attention_mask = image_mask & backbone_attention_mask
-        non_image_attention_mask = (~image_mask) & backbone_attention_mask
-
-        all_hidden_states = [hidden_states]
-        if not self.config.interleave_self_attention:
-            raise ValueError("AlternateVLDiT requires interleave_self_attention=True.")
-
-        for idx, block in enumerate(self.transformer_blocks):
-            if idx % 2 == 1:
-                hidden_states = block(
-                    hidden_states,
-                    attention_mask=None,
-                    encoder_hidden_states=None,
-                    encoder_attention_mask=None,
-                    temb=temb,
-                )
-            else:
-                curr_encoder_attention_mask = (
-                    non_image_attention_mask
-                    if idx % (2 * self.attend_text_every_n_blocks) == 0
-                    else image_attention_mask
-                )
-                hidden_states = block(
-                    hidden_states,
-                    attention_mask=None,
-                    encoder_hidden_states=encoder_hidden_states,
-                    encoder_attention_mask=curr_encoder_attention_mask,
-                    temb=temb,
-                )
-            all_hidden_states.append(hidden_states)
-
-        conditioning = temb
-        shift, scale = self.proj_out_1(F.silu(conditioning)).chunk(2, dim=1)
-        hidden_states = self.norm_out(hidden_states) * (1 + scale[:, None]) + shift[:, None]
-        if return_all_hidden_states:
-            return self.proj_out_2(hidden_states), all_hidden_states
-        return self.proj_out_2(hidden_states)
-
-
 class SelfAttentionTransformer(ModelMixin, ConfigMixin):
    _supports_gradient_checkpointing = True

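Note on the `AlternateVLDiT` removal above: the deleted `forward` loop runs self-attention in odd-indexed blocks and cross-attention in even-indexed blocks. The cross-attention blocks switch between the non-image mask and the image mask. The sketch below reproduces only that index arithmetic as a pure function so the schedule is easy to inspect. The function name and the string labels are assumptions made for illustration, and "text" here stands for the non-image tokens selected by `(~image_mask) & backbone_attention_mask`.

```python
def alternate_vl_schedule(num_blocks: int, attend_text_every_n_blocks: int = 2) -> list[str]:
    """Label what each block attends to, following the removed forward loop."""
    schedule = []
    for idx in range(num_blocks):
        if idx % 2 == 1:
            # Odd blocks: self-attention only (no encoder_hidden_states).
            schedule.append("self")
        elif idx % (2 * attend_text_every_n_blocks) == 0:
            # Even blocks on the text cadence: cross-attend to non-image tokens.
            schedule.append("text")
        else:
            # Remaining even blocks: cross-attend to image tokens.
            schedule.append("image")
    return schedule


print(alternate_vl_schedule(8))
# ['text', 'self', 'image', 'self', 'text', 'self', 'image', 'self']
```

With the default `attend_text_every_n_blocks=2`, cross-attention blocks alternate strictly between text and image. Larger values make text blocks rarer relative to image blocks.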
--- a/src/lerobot/policies/groot/configuration_groot.py
+++ b/src/lerobot/policies/groot/configuration_groot.py
@@ -14,295 +14,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-import json
-import os
 from dataclasses import dataclass, field
-from pathlib import Path

 from lerobot.configs import FeatureType, NormalizationMode, PolicyFeature, PreTrainedConfig
 from lerobot.optim import AdamWConfig, CosineDecayWithWarmupSchedulerConfig
 from lerobot.utils.constants import ACTION, OBS_STATE

-GROOT_N1_5 = "n1.5"
-GROOT_N1_7 = "n1.7"
-GROOT_N1_5_BASE_MODEL = "nvidia/GR00T-N1.5-3B"
-GROOT_N1_7_BASE_MODEL = "nvidia/GR00T-N1.7-3B"
-GROOT_N1_7_BACKBONE_MODEL = "nvidia/Cosmos-Reason2-2B"
-GROOT_ACTION_DECODE_TRANSFORM_LIBERO = "libero"
-
-_GROOT_MODEL_VERSION_ALIASES = {
-    "n1.5": GROOT_N1_5,
-    "n1_5": GROOT_N1_5,
-    "n15": GROOT_N1_5,
-    "1.5": GROOT_N1_5,
-    "n1.7": GROOT_N1_7,
-    "n1_7": GROOT_N1_7,
-    "n1d7": GROOT_N1_7,
-    "n17": GROOT_N1_7,
-    "1.7": GROOT_N1_7,
-}
-
-_GROOT_ACTION_DECODE_TRANSFORM_ALIASES = {
-    "none": None,
-    "": None,
-    GROOT_ACTION_DECODE_TRANSFORM_LIBERO: GROOT_ACTION_DECODE_TRANSFORM_LIBERO,
-}
-
-
-def normalize_groot_model_version(model_version: str) -> str:
-    normalized = _GROOT_MODEL_VERSION_ALIASES.get(model_version.lower())
-    if normalized is None:
-        supported = ", ".join(sorted({GROOT_N1_5, GROOT_N1_7}))
-        raise ValueError(
-            f"Unsupported GR00T model_version '{model_version}'. Supported versions: {supported}."
-        )
-    return normalized
-
-
-def normalize_groot_action_decode_transform(transform: str | None) -> str | None:
-    if transform is None:
-        return None
-    normalized = _GROOT_ACTION_DECODE_TRANSFORM_ALIASES.get(transform.lower())
-    if normalized is None and transform.lower() not in _GROOT_ACTION_DECODE_TRANSFORM_ALIASES:
-        supported = ", ".join(
-            sorted(key for key, value in _GROOT_ACTION_DECODE_TRANSFORM_ALIASES.items() if value is not None)
-        )
-        raise ValueError(
-            f"Unsupported GR00T N1.7 action decode transform '{transform}'. "
-            f"Supported transforms: none, {supported}."
-        )
-    return normalized
-
-
-def infer_groot_model_version(model_path: str | None) -> str | None:
-    if not model_path:
-        return None
-    model_path_lower = model_path.lower()
-    if "gr00t-n1.7" in model_path_lower or "gr00t_n1.7" in model_path_lower:
-        return GROOT_N1_7
-    if "gr00t-n1.5" in model_path_lower or "gr00t_n1.5" in model_path_lower:
-        return GROOT_N1_5
-    config_version = _infer_groot_model_version_from_local_config(model_path)
-    if config_version is not None:
-        return config_version
-    return None
-
-
-def is_raw_groot_n1_7_checkpoint(model_path: str | Path | None) -> bool:
-    if model_path is None:
-        return False
-
-    path = Path(model_path).expanduser()
-    if path.is_dir():
-        config_path = path / "config.json"
-    elif path.name == "config.json":
-        config_path = path
-    else:
-        return False
-
-    try:
-        with config_path.open() as f:
-            config = json.load(f)
-    except (OSError, json.JSONDecodeError):
-        return False
-
-    return "type" not in config and _infer_groot_model_version_from_config(config) == GROOT_N1_7
-
-
-def infer_groot_n1_7_embodiment_tag(model_path: str | Path | None) -> str | None:
-    if model_path is None:
-        return None
-
-    processor_config_path = Path(model_path).expanduser() / "processor_config.json"
-    try:
-        with processor_config_path.open() as f:
-            processor_config = json.load(f)
-    except (OSError, json.JSONDecodeError):
-        return None
-
-    modality_configs = processor_config.get("processor_kwargs", {}).get("modality_configs", {})
-    if not isinstance(modality_configs, dict):
-        return None
-    if "libero_sim" in modality_configs:
-        return "libero_sim"
-    if len(modality_configs) == 1:
-        return next(iter(modality_configs))
-    return None
-
-
-def infer_groot_n1_7_action_horizon(
-    model_path: str | Path | None, embodiment_tag: str | None = None
-) -> int | None:
-    if model_path is None:
-        return None
-
-    processor_config_path = Path(model_path).expanduser() / "processor_config.json"
-    try:
-        with processor_config_path.open() as f:
-            processor_config = json.load(f)
-    except (OSError, json.JSONDecodeError):
-        return None
-
-    processor_kwargs = processor_config.get("processor_kwargs", {})
-    if not isinstance(processor_kwargs, dict):
-        return None
-    modality_configs = processor_kwargs.get("modality_configs", {})
-    if not isinstance(modality_configs, dict):
-        return None
-
-    if embodiment_tag is None:
-        embodiment_tag = infer_groot_n1_7_embodiment_tag(model_path)
-    if embodiment_tag is None:
-        return None
-
-    embodiment_config = modality_configs.get(embodiment_tag, {})
-    if not isinstance(embodiment_config, dict):
-        return None
-    action_config = embodiment_config.get("action", {})
-    if not isinstance(action_config, dict):
-        return None
-    delta_indices = action_config.get("delta_indices", [])
-    if not isinstance(delta_indices, list):
-        return None
-    return len(delta_indices) or None
-
-
-def infer_groot_n1_7_action_execution_horizon(
-    model_path: str | Path | None, embodiment_tag: str | None = None
-) -> int | None:
-    action_horizon = infer_groot_n1_7_action_horizon(model_path, embodiment_tag)
-    if action_horizon is None:
-        return None
-
-    if embodiment_tag is None:
-        embodiment_tag = infer_groot_n1_7_embodiment_tag(model_path)
-    if embodiment_tag == "libero_sim":
-        # NVIDIA's N1.7 LIBERO rollout wrapper replans after 8 of the 16 decoded
-        # actions. Keeping that execution cadence avoids stale open-loop chunks.
-        return min(action_horizon, 8)
-    return action_horizon
-
-
-def resolve_groot_n1_7_backbone_model(model_name: str, cache_dir: str | Path | None = None) -> str:
-    model_path = Path(model_name).expanduser()
-    if model_path.exists():
-        return str(model_path)
-
-    cached_snapshot = _find_cached_hf_snapshot(model_name, cache_dir=cache_dir)
-    return str(cached_snapshot) if cached_snapshot is not None else model_name
-
-
-def _find_cached_hf_snapshot(repo_id: str, cache_dir: str | Path | None = None) -> Path | None:
-    repo_cache_name = f"models--{repo_id.replace('/', '--')}"
-    required_files = (
-        "config.json",
-        "tokenizer_config.json",
-        "preprocessor_config.json",
-        "video_preprocessor_config.json",
-    )
-
-    for hub_cache in _candidate_hf_hub_caches(cache_dir):
-        repo_cache = hub_cache / repo_cache_name
-        snapshots_dir = repo_cache / "snapshots"
-        if not snapshots_dir.is_dir():
-            continue
-
-        candidates: list[Path] = []
-        ref_path = repo_cache / "refs" / "main"
-        try:
-            ref = ref_path.read_text().strip()
-        except OSError:
-            ref = ""
-        if ref:
-            candidates.append(snapshots_dir / ref)
-        candidates.extend(
-            sorted(
-                (path for path in snapshots_dir.iterdir() if path.is_dir()),
-                key=lambda path: path.stat().st_mtime,
-                reverse=True,
-            )
-        )
-
-        seen: set[Path] = set()
-        for snapshot in candidates:
-            if snapshot in seen:
-                continue
-            seen.add(snapshot)
-            if all((snapshot / filename).exists() for filename in required_files):
-                return snapshot
-    return None
-
-
-def _candidate_hf_hub_caches(cache_dir: str | Path | None) -> list[Path]:
-    candidates: list[Path] = []
-    if cache_dir is not None:
-        cache_path = Path(cache_dir).expanduser()
-        candidates.append(cache_path)
-        candidates.append(cache_path / "hub")
-
-    hub_cache = os.environ.get("HUGGINGFACE_HUB_CACHE")
-    if hub_cache:
-        candidates.append(Path(hub_cache).expanduser())
-
-    hf_home = os.environ.get("HF_HOME")
-    if hf_home:
-        candidates.append(Path(hf_home).expanduser() / "hub")
-
-    candidates.append(Path.home() / ".cache" / "huggingface" / "hub")
-
-    deduped: list[Path] = []
-    seen: set[Path] = set()
-    for candidate in candidates:
-        resolved = candidate.resolve() if candidate.exists() else candidate
-        if resolved not in seen:
-            seen.add(resolved)
-            deduped.append(candidate)
-    return deduped
-
-
-def _infer_groot_model_version_from_local_config(model_path: str) -> str | None:
-    path = Path(model_path).expanduser()
-    if path.is_dir():
-        config_path = path / "config.json"
-    elif path.name == "config.json":
-        config_path = path
-    else:
-        return None
-
-    if not config_path.exists():
-        return None
-
-    try:
-        with config_path.open() as f:
-            config = json.load(f)
-    except (OSError, json.JSONDecodeError):
-        return None
-
-    return _infer_groot_model_version_from_config(config)
-
-
-def _infer_groot_model_version_from_config(config: dict) -> str | None:
-    model_version = config.get("model_version")
-    if isinstance(model_version, str):
-        try:
-            return normalize_groot_model_version(model_version)
-        except ValueError:
-            return None
-
-    candidates = [config.get("model_type"), *(config.get("architectures") or [])]
-    for candidate in candidates:
-        if not isinstance(candidate, str):
-            continue
-        normalized = candidate.lower().replace("-", "_")
-        if normalized in {"gr00tn1d7", "gr00t_n1d7", "gr00t_n1_7"}:
-            return GROOT_N1_7
-        if normalized in {"gr00t_n1_5", "gr00tn15", "gr00t_n1d5"}:
-            return GROOT_N1_5
-
-    if config.get("model_name") == GROOT_N1_7_BACKBONE_MODEL:
-        return GROOT_N1_7
-    return None
-

@PreTrainedConfig.register_subclass("groot")
@dataclass
@@ -335,21 +52,12 @@ class GrootConfig(PreTrainedConfig):

    # Groot-specific model parameters (from groot_finetune_script.py)

-    # Explicit GR00T model family selection. Defaults to N1.5 to preserve existing behavior.
-    model_version: str = GROOT_N1_5
-
    # Path or HuggingFace model ID for the base Groot model
-    base_model_path: str | None = None
+    base_model_path: str = "nvidia/GR00T-N1.5-3B"

    # HF repo ID (or local path) that hosts vocab.json and merges.txt for Eagle tokenizer.
    tokenizer_assets_repo: str = "lerobot/eagle2hg-processor-groot-n1p5"

-    # HF repo ID (or local path) for the GR00T N1.7 Cosmos/Qwen3-VL backbone processor.
-    n1_7_backbone_model: str = GROOT_N1_7_BACKBONE_MODEL
-
-    # Optional named action transform applied after raw N1.7 checkpoint decoding and before env.step().
-    action_decode_transform: str | None = None
-
    # Embodiment tag to use for training (e.g. 'new_embodiment', 'gr1')
    embodiment_tag: str = "new_embodiment"

@@ -409,35 +117,6 @@ class GrootConfig(PreTrainedConfig):
    resume: bool = False

    def __post_init__(self):
-        self.model_version = normalize_groot_model_version(self.model_version)
-        self.action_decode_transform = normalize_groot_action_decode_transform(self.action_decode_transform)
-        if self.base_model_path is None:
-            self.base_model_path = (
-                GROOT_N1_7_BASE_MODEL if self.model_version == GROOT_N1_7 else GROOT_N1_5_BASE_MODEL
-            )
-
-        if self.action_decode_transform is not None and self.model_version != GROOT_N1_7:
-            raise ValueError("action_decode_transform can only be used with model_version='n1.7'.")
-
-        if self.model_version == GROOT_N1_7:
-            if self.max_state_dim == 64:
-                self.max_state_dim = 132
-            if self.max_action_dim == 32:
-                self.max_action_dim = 132
-            if self.chunk_size == 50:
-                self.chunk_size = 40
-            if self.n_action_steps == 50:
-                self.n_action_steps = 40
-            if tuple(self.image_size) == (224, 224):
-                self.image_size = (256, 256)
-
-        inferred_version = infer_groot_model_version(self.base_model_path)
-        if inferred_version is not None and inferred_version != self.model_version:
-            raise ValueError(
-                f"GR00T model_version '{self.model_version}' does not match base_model_path "
-                f"'{self.base_model_path}', which looks like '{inferred_version}'."
-            )
-
        super().__post_init__()

        if self.n_action_steps > self.chunk_size:
@@ -513,12 +192,7 @@ class GrootConfig(PreTrainedConfig):
    @property
    def action_delta_indices(self) -> list[int]:
        """Return indices for delta actions."""
-        model_action_horizon = 16
-        if self.model_version == GROOT_N1_7:
-            model_action_horizon = (
-                infer_groot_n1_7_action_horizon(self.base_model_path, self.embodiment_tag) or 40
-            )
-        return list(range(min(self.chunk_size, model_action_horizon)))
+        return list(range(min(self.chunk_size, 16)))  # 16 = N1.5 model action horizon

    @property
    def reward_delta_indices(self) -> None:
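Note on the `action_delta_indices` hunk above: after this change the property no longer consults `model_version` or a checkpoint's `processor_config.json`. It always caps the chunk at the N1.5 horizon of 16. The sketch below is a standalone illustration of the resulting behaviour. The free function and the constant name `N1_5_ACTION_HORIZON` are hypothetical and do not exist in LeRobot.

```python
N1_5_ACTION_HORIZON = 16  # value hard-coded in the diff above


def action_delta_indices(chunk_size: int, model_action_horizon: int = N1_5_ACTION_HORIZON) -> list[int]:
    """Return the action offsets requested from the dataset for one training sample."""
    return list(range(min(chunk_size, model_action_horizon)))


# The default GrootConfig chunk size referenced in the removed code is 50,
# so the horizon, not the chunk size, is the binding limit.
print(len(action_delta_indices(50)))
print(action_delta_indices(4))
```

A `chunk_size` below 16 shortens the list, while any larger value is clipped to 16 offsets.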
--- a/src/lerobot/policies/groot/eagle2_hg_model/modeling_eagle2_5_vl.py
+++ b/src/lerobot/policies/groot/eagle2_hg_model/modeling_eagle2_5_vl.py
@@ -60,7 +60,6 @@ class Eagle25VLPreTrainedModel(PreTrainedModel):
        "SiglipEncoderLayer",
    ]
    _skip_keys_device_placement = "past_key_values"
-    _supports_flash_attn = True
    _supports_flash_attn_2 = True
    _supports_cache_class = True
    _supports_static_cache = True
--- a/src/lerobot/policies/groot/eagle2_hg_model/processing_eagle2_5_vl.py
+++ b/src/lerobot/policies/groot/eagle2_hg_model/processing_eagle2_5_vl.py
@@ -124,6 +124,7 @@ class Eagle25VLProcessor(ProcessorMixin):
        "videos_kwargs",
        "text_kwargs",
    ]
+    image_processor_class = "AutoImageProcessor"
    tokenizer_class = "AutoTokenizer"

    def __init__(
--- a/src/lerobot/policies/groot/groot_n1.py
+++ b/src/lerobot/policies/groot/groot_n1.py
@@ -14,7 +14,7 @@
 # limitations under the License.

 from pathlib import Path
-from typing import TYPE_CHECKING, Any
+from typing import TYPE_CHECKING

 import numpy as np
 import torch
@@ -26,14 +26,9 @@ from lerobot.utils.import_utils import _transformers_available

 # Conditional import for type checking and lazy loading
 if TYPE_CHECKING or _transformers_available:
-    from huggingface_hub.dataclasses import strict
    from transformers import AutoConfig, AutoModel, PretrainedConfig, PreTrainedModel
    from transformers.feature_extraction_utils import BatchFeature
 else:
-
-    def strict(cls):
-        return cls
-
    AutoConfig = None
    AutoModel = None
    PretrainedConfig = object
@@ -178,20 +173,19 @@ N_COLOR_CHANNELS = 3


 # config
-@strict
 class GR00TN15Config(PretrainedConfig):
    model_type = "gr00t_n1_5"

-    backbone_cfg: dict[str, Any] | None = None
-    action_head_cfg: dict[str, Any] | None = None
-    action_horizon: int = 0
-    action_dim: int = 0
+    backbone_cfg: dict
+    action_head_cfg: dict
+    action_horizon: int
+    action_dim: int
    compute_dtype: str = "float32"

-    def __post_init__(self, **kwargs):
-        self.backbone_cfg = {} if self.backbone_cfg is None else self.backbone_cfg
-        self.action_head_cfg = {} if self.action_head_cfg is None else self.action_head_cfg
-        super().__post_init__(**kwargs)
+    def __init__(self, **kwargs):
+        super().__init__(**kwargs)
+        for key, value in kwargs.items():
+            setattr(self, key, value)


 # real model
--- a/src/lerobot/policies/groot/groot_n1_7.py
+++ b/src/lerobot/policies/groot/groot_n1_7.py
@@ -1,962 +0,0 @@
-# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from __future__ import annotations
-
-import importlib
-import json
-import logging
-from contextlib import suppress
-from copy import deepcopy
-from typing import TYPE_CHECKING, Any
-
-import torch
-import torch.nn.functional as F  # noqa: N812
-from huggingface_hub import snapshot_download
-from huggingface_hub.errors import HFValidationError, RepositoryNotFoundError
-from torch import nn
-from torch.distributions import Beta
-
-from lerobot.utils.import_utils import _transformers_available, require_package
-
-from .action_head.cross_attention_dit import AlternateVLDiT, DiT, SelfAttentionTransformer
-
-if TYPE_CHECKING or _transformers_available:
-    from transformers import AutoConfig, AutoModel, PretrainedConfig, PreTrainedModel
-    from transformers.feature_extraction_utils import BatchFeature
-else:
-    AutoConfig = None
-    AutoModel = None
-    PretrainedConfig = object
-    PreTrainedModel = object
-    BatchFeature = None
-
-try:
-    import tree
-except ImportError:
-    tree = None
-
-try:
-    from transformers import Qwen3VLConfig, Qwen3VLForConditionalGeneration
-except ImportError:
-    Qwen3VLConfig = None
-    Qwen3VLForConditionalGeneration = None
-
-logger = logging.getLogger(__name__)
-
-
-def _copy_default(value: Any) -> Any:
-    return deepcopy(value)
-
-
-GR00T_N1_7_DEFAULTS: dict[str, Any] = {
-    "model_dtype": "bfloat16",
-    "dtype": "bfloat16",
-    "model_name": "nvidia/Cosmos-Reason2-2B",
-    "backbone_model_type": "qwen",
-    "model_revision": None,
-    "tune_top_llm_layers": 0,
-    "backbone_embedding_dim": 2048,
-    "tune_llm": False,
-    "tune_visual": False,
-    "select_layer": 12,
-    "reproject_vision": False,
-    "use_flash_attention": True,
-    "load_bf16": False,
-    "backbone_trainable_params_fp32": True,
-    "image_crop_size": (230, 230),
-    "image_target_size": (256, 256),
-    "shortest_image_edge": None,
-    "crop_fraction": None,
-    "random_rotation_angle": None,
-    "color_jitter_params": None,
-    "use_albumentations_transforms": True,
-    "extra_augmentation_config": None,
-    "formalize_language": True,
-    "apply_sincos_state_encoding": False,
-    "use_percentiles": True,
-    "use_relative_action": False,
-    "max_state_dim": 132,
-    "max_action_dim": 132,
-    "action_horizon": 40,
-    "hidden_size": 1024,
-    "input_embedding_dim": 1536,
-    "state_history_length": 1,
-    "add_pos_embed": True,
-    "attn_dropout": 0.2,
-    "use_vlln": True,
-    "max_seq_len": 1024,
-    "use_alternate_vl_dit": True,
-    "attend_text_every_n_blocks": 2,
-    "diffusion_model_cfg": {
-        "positional_embeddings": None,
-        "num_layers": 32,
-        "num_attention_heads": 32,
-        "attention_head_dim": 48,
-        "norm_type": "ada_norm",
-        "dropout": 0.2,
-        "final_dropout": True,
-        "output_dim": 1024,
-        "interleave_self_attention": True,
-    },
-    "vl_self_attention_cfg": {
-        "positional_embeddings": None,
-        "num_layers": 4,
-        "num_attention_heads": 32,
-        "attention_head_dim": 64,
-        "dropout": 0.2,
-        "final_dropout": True,
-    },
-    "num_inference_timesteps": 4,
-    "noise_beta_alpha": 1.5,
-    "noise_beta_beta": 1.0,
-    "noise_s": 0.999,
-    "num_timestep_buckets": 1000,
-    "tune_projector": True,
-    "tune_diffusion_model": True,
-    "tune_vlln": True,
-    "state_dropout_prob": 0.2,
-    "exclude_state": False,
-    "use_mean_std": False,
-    "max_num_embodiments": 32,
-    "rtc_ramp_rate": 6.0,
-}
-
-
-class GR00TN17Config(PretrainedConfig):
-    """Configuration for NVIDIA GR00T N1.7.
-
-    N1.7 uses the Cosmos-Reason2-2B / Qwen3-VL backbone and a multi-embodiment
-    flow-matching action head. This mirrors the public N1.7 checkpoint config
-    while keeping it local to LeRobot and independent from the external
-    Isaac-GR00T ``gr00t`` Python package.
-    """
-
-    model_type = "Gr00tN1d7"
-
-    _defaults = GR00T_N1_7_DEFAULTS
-
-    def __init__(self, **kwargs):
-        super().__init__(**kwargs)
-        for key, value in GR00T_N1_7_DEFAULTS.items():
-            setattr(self, key, _copy_default(kwargs.pop(key, value)))
-        for key, value in kwargs.items():
-            setattr(self, key, value)
-
-    def to_filtered_dict(self, exclude_augment: bool = True) -> dict[str, Any]:
-        cfg = self.to_dict()
-        if not exclude_augment:
-            return cfg
-        exclude_keys = {
-            "random_rotation_angle",
-            "color_jitter_params",
-            "use_albumentations_transforms",
-            "formalize_language",
-            "image_crop_size",
-            "image_target_size",
-            "shortest_image_edge",
-            "crop_fraction",
-        }
-        return {k: v for k, v in cfg.items() if k not in exclude_keys}
-
-    def to_filtered_json(self, exclude_augment: bool = True, **kwargs) -> str:
-        return json.dumps(self.to_filtered_dict(exclude_augment), indent=2, default=str, **kwargs)
-
-
-class CategorySpecificLinear(nn.Module):
-    """Linear layer with category-specific weights for multi-embodiment support."""
-
-    def __init__(self, num_categories: int, input_dim: int, hidden_dim: int):
-        super().__init__()
-        self.num_categories = num_categories
-        self.W = nn.Parameter(0.02 * torch.randn(num_categories, input_dim, hidden_dim))
-        self.b = nn.Parameter(torch.zeros(num_categories, hidden_dim))
-
-    def forward(self, x: torch.Tensor, cat_ids: torch.Tensor) -> torch.Tensor:
-        selected_w = self.W[cat_ids]
-        selected_b = self.b[cat_ids]
-        return torch.bmm(x, selected_w) + selected_b.unsqueeze(1)
-
-
-class CategorySpecificMLP(nn.Module):
-    """Two-layer MLP with category-specific weights."""
-
-    def __init__(self, num_categories: int, input_dim: int, hidden_dim: int, output_dim: int):
-        super().__init__()
-        self.layer1 = CategorySpecificLinear(num_categories, input_dim, hidden_dim)
-        self.layer2 = CategorySpecificLinear(num_categories, hidden_dim, output_dim)
-
-    def forward(self, x: torch.Tensor, cat_ids: torch.Tensor) -> torch.Tensor:
-        hidden = F.relu(self.layer1(x, cat_ids))
-        return self.layer2(hidden, cat_ids)
-
-
-class SinusoidalPositionalEncoding(nn.Module):
-    """Sinusoidal encoding of shape ``(B, T, D)`` for timestep tensors ``(B, T)``.
-
-    The frequency scalar is intentionally created on CPU and then broadcast with
-    the device-local arange result. That mirrors Isaac-GR00T's N1.7 timestep
-    embedding and avoids tiny dtype/device construction differences in parity
-    tests.
-    """
-
-    def __init__(self, embedding_dim: int):
-        super().__init__()
-        self.embedding_dim = embedding_dim
-
-    def forward(self, timesteps: torch.Tensor) -> torch.Tensor:
-        timesteps = timesteps.float()
-        half_dim = self.embedding_dim // 2
-        exponent = -torch.arange(half_dim, dtype=torch.float, device=timesteps.device) * (
-            torch.log(torch.tensor(10000.0)) / half_dim
-        )
-        freqs = timesteps.unsqueeze(-1) * exponent.exp()
-        return torch.cat([torch.sin(freqs), torch.cos(freqs)], dim=-1)
-
-
-def swish(x: torch.Tensor) -> torch.Tensor:
-    return x * torch.sigmoid(x)
-
-
-class MultiEmbodimentActionEncoder(nn.Module):
-    """Action encoder with category-specific projections and sinusoidal time encoding."""
-
-    def __init__(self, action_dim: int, hidden_size: int, num_embodiments: int):
-        super().__init__()
-        self.W1 = CategorySpecificLinear(num_embodiments, action_dim, hidden_size)
-        self.W2 = CategorySpecificLinear(num_embodiments, 2 * hidden_size, hidden_size)
-        self.W3 = CategorySpecificLinear(num_embodiments, hidden_size, hidden_size)
-        self.pos_encoding = SinusoidalPositionalEncoding(hidden_size)
-
-    def forward(self, actions: torch.Tensor, timesteps: torch.Tensor, cat_ids: torch.Tensor) -> torch.Tensor:
-        batch_size, horizon, _ = actions.shape
-        if timesteps.dim() != 1 or timesteps.shape[0] != batch_size:
-            raise ValueError("Expected `timesteps` to have shape (B,).")
-        timesteps = timesteps.unsqueeze(1).expand(-1, horizon)
-        action_emb = self.W1(actions, cat_ids)
-        time_emb = self.pos_encoding(timesteps).to(dtype=action_emb.dtype)
-        x = swish(self.W2(torch.cat([action_emb, time_emb], dim=-1), cat_ids))
-        return self.W3(x, cat_ids)
-
-
-class Qwen3Backbone(nn.Module):
-    """Cosmos-Reason2/Qwen3-VL backbone used by GR00T N1.7.
-
-    The public checkpoint stores the action head in the GR00T checkpoint but
-    uses a Hugging Face Qwen3-VL-compatible backbone interface. This wrapper
-    keeps the nested HF module layout compatible across transformer versions
-    and exposes the hidden states consumed by the action head.
-    """
-
-    def __init__(
-        self,
-        model_name: str = "nvidia/Cosmos-Reason2-2B",
-        tune_llm: bool = False,
-        tune_visual: bool = False,
-        select_layer: int = -1,
-        reproject_vision: bool = False,
-        use_flash_attention: bool = False,
-        load_bf16: bool = False,
-        tune_top_llm_layers: int = 0,
-        trainable_params_fp32: bool = False,
-        transformers_loading_kwargs: dict[str, Any] | None = None,
-        load_pretrained_weights: bool = True,
-    ):
-        if Qwen3VLForConditionalGeneration is None:
-            raise ImportError(
-                "Qwen3VLForConditionalGeneration is required for GR00T N1.7. "
-                "Install the GR00T optional dependencies with `pip install 'lerobot[groot]'` "
-                "or use a transformers version that provides Qwen3-VL support."
-            )
-
-        super().__init__()
-        transformers_loading_kwargs = transformers_loading_kwargs or {"trust_remote_code": True}
-
-        extra_kwargs: dict[str, Any] = {}
-        if use_flash_attention:
-            try:
-                import flash_attn  # noqa: F401
-
-                extra_kwargs["attn_implementation"] = "flash_attention_2"
-            except ImportError:
-                logger.warning("flash_attn is not installed. Falling back to SDPA attention.")
-                extra_kwargs["attn_implementation"] = "sdpa"
-        if load_bf16:
-            extra_kwargs["torch_dtype"] = torch.bfloat16
-
-        if load_pretrained_weights:
-            self.model = Qwen3VLForConditionalGeneration.from_pretrained(
-                model_name,
-                **extra_kwargs,
-                **transformers_loading_kwargs,
-            ).eval()
-        else:
-            self.model = self._from_backbone_config(
-                model_name=model_name,
-                model_kwargs=extra_kwargs,
-                config_kwargs=transformers_loading_kwargs,
-            ).eval()
-
-        while len(self.language_model.layers) > select_layer:
-            self.language_model.layers.pop(-1)
-
-        self.select_layer = select_layer
-        self.set_trainable_parameters(tune_llm, tune_visual, tune_top_llm_layers)
-        if load_bf16 and trainable_params_fp32:
-            for parameter in self.parameters():
-                if parameter.requires_grad:
-                    parameter.data = parameter.data.to(torch.float32)
-
-    def set_trainable_parameters(
-        self, tune_llm: bool, tune_visual: bool, tune_top_llm_layers: int = 0
-    ) -> None:
-        self.tune_llm = tune_llm
-        self.tune_visual = tune_visual
-        for parameter in self.parameters():
-            parameter.requires_grad = True
-        if not tune_llm:
-            self.language_model.requires_grad_(False)
-        if not tune_visual:
-            self.visual.requires_grad_(False)
-        if tune_top_llm_layers > 0:
-            for layer in self.language_model.layers[-tune_top_llm_layers:]:
-                for parameter in layer.parameters():
-                    parameter.requires_grad = True
-
-    def set_frozen_modules_to_eval_mode(self) -> None:
-        if self.training:
-            if self.language_model and not self.tune_llm:
-                self.language_model.eval()
-            if self.visual and not self.tune_visual:
-                self.visual.eval()
-
-    @property
-    def language_model(self) -> nn.Module:
-        return getattr(self.model, "model", self.model).language_model
-
-    @property
-    def visual(self) -> nn.Module:
-        return getattr(self.model, "model", self.model).visual
-
-    def _from_backbone_config(
-        self,
-        *,
-        model_name: str,
-        model_kwargs: dict[str, Any],
-        config_kwargs: dict[str, Any],
-    ) -> nn.Module:
-        if _is_cosmos_reason2_backbone(model_name):
-            backbone_config = _cosmos_reason2_qwen3_vl_config()
-        else:
-            if AutoConfig is None:
-                raise ImportError(
-                    "AutoConfig is required to initialize a GR00T N1.7 backbone from config. "
-                    "Install the GR00T optional dependencies with `pip install 'lerobot[groot]'`."
-                )
-            backbone_config = AutoConfig.from_pretrained(model_name, **config_kwargs)
-        return Qwen3VLForConditionalGeneration._from_config(backbone_config, **model_kwargs)
-
-    def prepare_input(self, batch: dict[str, Any]) -> BatchFeature:
-        return BatchFeature(data=batch)
-
-    def _ensure_mm_token_type_ids(self, model_input: dict[str, torch.Tensor]) -> None:
-        if "mm_token_type_ids" in model_input:
-            return
-        if "image_grid_thw" not in model_input and "video_grid_thw" not in model_input:
-            return
-
-        input_ids = model_input.get("input_ids")
-        if input_ids is None:
-            return
-
-        mm_token_type_ids = torch.zeros(input_ids.shape, dtype=torch.int32, device=input_ids.device)
-        image_token_id = getattr(self.model.config, "image_token_id", None)
-        video_token_id = getattr(self.model.config, "video_token_id", None)
-        if image_token_id is not None:
-            mm_token_type_ids[input_ids == image_token_id] = 1
-        if video_token_id is not None:
-            mm_token_type_ids[input_ids == video_token_id] = 2
-
-        model_input["mm_token_type_ids"] = mm_token_type_ids
-
-    def _ensure_legacy_qwen3_position_ids(self, model_input: dict[str, torch.Tensor]) -> None:
-        """Restore the Qwen3-VL text position ids used by older Transformers releases.
-
-        Transformers 5.x computes 3-row multimodal RoPE ids for Qwen3-VL and then
-        drops text position ids before calling text-layer flash attention. GR00T
-        N1.7 was aligned against the older Transformers path, where a fourth text
-        position row is forwarded alongside the temporal/height/width rows. Adding
-        the row here preserves the newer multimodal position computation while
-        keeping flash attention on the legacy code path.
-        """
-
-        if "position_ids" in model_input:
-            return
-
-        qwen3_model = getattr(self.model, "model", self.model)
-        compute_3d_position_ids = getattr(qwen3_model, "compute_3d_position_ids", None)
-        if compute_3d_position_ids is None:
-            return
-
-        position_ids = compute_3d_position_ids(
-            input_ids=model_input.get("input_ids"),
-            image_grid_thw=model_input.get("image_grid_thw"),
-            video_grid_thw=model_input.get("video_grid_thw"),
-            inputs_embeds=None,
-            attention_mask=model_input.get("attention_mask"),
-            past_key_values=None,
-            mm_token_type_ids=model_input.get("mm_token_type_ids"),
-        )
-        if position_ids.ndim == 3 and position_ids.shape[0] == 3:
-            position_ids = torch.cat([position_ids[:1], position_ids], dim=0)
-
-        model_input["position_ids"] = position_ids
-
-    def _last_decoder_layer_output(self, model_input: dict[str, torch.Tensor]) -> torch.Tensor:
-        """Return the pre-final-norm decoder output consumed by the N1.7 action head.
-
-        Older Transformers releases exposed this tensor as ``hidden_states[-1]``.
-        Newer releases expose the post-final-norm tensor there instead. Capturing
-        the last decoder layer output directly keeps the N1.7 action head input
-        stable across Transformers versions.
-        """
-
-        captured: dict[str, torch.Tensor] = {}
-
-        def capture_output(_module: nn.Module, _inputs: tuple[Any, ...], output: Any) -> None:
-            if isinstance(output, torch.Tensor):
-                captured["features"] = output
-            elif isinstance(output, (tuple, list)) and output:
-                captured["features"] = output[0]
-            elif hasattr(output, "last_hidden_state"):
-                captured["features"] = output.last_hidden_state
-
-        hook = self.language_model.layers[-1].register_forward_hook(capture_output)
-        try:
-            outputs = self.model(**model_input, output_hidden_states=True)
-        finally:
-            hook.remove()
-
-        return captured.get("features", outputs.hidden_states[-1])
-
-    def forward(self, vl_input: BatchFeature) -> BatchFeature:
-        self.set_frozen_modules_to_eval_mode()
-        keys_to_use = ["input_ids", "attention_mask", "pixel_values", "image_grid_thw"]
-        optional_keys = ["mm_token_type_ids", "pixel_values_videos", "video_grid_thw"]
-        model_input = {key: vl_input[key] for key in keys_to_use}
-        model_input.update({key: vl_input[key] for key in optional_keys if key in vl_input})
-        self._ensure_mm_token_type_ids(model_input)
-        self._ensure_legacy_qwen3_position_ids(model_input)
-        features = self._last_decoder_layer_output(model_input)
-        image_mask = model_input["input_ids"] == self.model.config.image_token_id
-        attention_mask = model_input["attention_mask"] == 1
-        return BatchFeature(
-            data={
-                "backbone_features": features,
-                "backbone_attention_mask": attention_mask,
-                "image_mask": image_mask,
-            }
-        )
-
-
-class GR00TN17ActionHead(nn.Module):
-    supports_gradient_checkpointing = True
-
-    def __init__(self, config: GR00TN17Config):
-        require_package("diffusers", extra="groot")
-        super().__init__()
-        self.config = config
-        self.hidden_size = config.hidden_size
-        self.input_embedding_dim = config.input_embedding_dim
-
-        if config.use_alternate_vl_dit:
-            self.model = AlternateVLDiT(
-                **config.diffusion_model_cfg,
-                cross_attention_dim=config.backbone_embedding_dim,
-                attend_text_every_n_blocks=config.attend_text_every_n_blocks,
-            )
-        else:
-            self.model = DiT(
-                **config.diffusion_model_cfg,
-                cross_attention_dim=config.backbone_embedding_dim,
-            )
-
-        self.action_dim = config.max_action_dim
-        self.action_horizon = config.action_horizon
-        self.num_inference_timesteps = config.num_inference_timesteps
-        self.state_encoder = CategorySpecificMLP(
-            num_categories=config.max_num_embodiments,
-            input_dim=config.max_state_dim * config.state_history_length,
-            hidden_dim=self.hidden_size,
-            output_dim=self.input_embedding_dim,
-        )
-        self.action_encoder = MultiEmbodimentActionEncoder(
-            action_dim=self.action_dim,
-            hidden_size=self.input_embedding_dim,
-            num_embodiments=config.max_num_embodiments,
-        )
-        self.action_decoder = CategorySpecificMLP(
-            num_categories=config.max_num_embodiments,
-            input_dim=self.hidden_size,
-            hidden_dim=self.hidden_size,
-            output_dim=self.action_dim,
-        )
-        self.vlln = nn.LayerNorm(config.backbone_embedding_dim) if config.use_vlln else nn.Identity()
-        vl_self_attention_cfg = getattr(config, "vl_self_attention_cfg", None)
-        if vl_self_attention_cfg and vl_self_attention_cfg.get("num_layers", 0) > 0:
-            self.vl_self_attention = SelfAttentionTransformer(**vl_self_attention_cfg)
-        else:
-            self.vl_self_attention = nn.Identity()
-        if config.add_pos_embed:
-            self.position_embedding = nn.Embedding(config.max_seq_len, self.input_embedding_dim)
-            nn.init.normal_(self.position_embedding.weight, mean=0.0, std=0.02)
-        self.state_dropout_prob = config.state_dropout_prob
-        self._noise_beta_alpha = config.noise_beta_alpha
-        self._noise_beta_beta = config.noise_beta_beta
-        self._beta_dist = None
-        self.num_timestep_buckets = config.num_timestep_buckets
-        self.set_trainable_parameters(config.tune_projector, config.tune_diffusion_model, config.tune_vlln)
-
-    def set_trainable_parameters(
-        self, tune_projector: bool, tune_diffusion_model: bool, tune_vlln: bool
-    ) -> None:
-        self.tune_projector = tune_projector
-        self.tune_diffusion_model = tune_diffusion_model
-        self.tune_vlln = tune_vlln
-        for parameter in self.parameters():
-            parameter.requires_grad = True
-        if not tune_projector:
-            self.state_encoder.requires_grad_(False)
-            self.action_encoder.requires_grad_(False)
-            self.action_decoder.requires_grad_(False)
-            if self.config.add_pos_embed:
-                self.position_embedding.requires_grad_(False)
-        if not tune_diffusion_model:
-            self.model.requires_grad_(False)
-        if not tune_vlln:
-            self.vlln.requires_grad_(False)
-            self.vl_self_attention.requires_grad_(False)
-
-    def set_frozen_modules_to_eval_mode(self) -> None:
-        if self.training:
-            if not self.tune_projector:
-                self.state_encoder.eval()
-                self.action_encoder.eval()
-                self.action_decoder.eval()
-                if self.config.add_pos_embed:
-                    self.position_embedding.eval()
-            if not self.tune_diffusion_model:
-                self.model.eval()
-            if not self.tune_vlln:
-                self.vlln.eval()
-                self.vl_self_attention.eval()
-
-    def sample_time(self, batch_size: int, device: torch.device, dtype: torch.dtype) -> torch.Tensor:
-        if self._beta_dist is None:
-            beta_alpha = torch.tensor(self._noise_beta_alpha, device="cpu", dtype=torch.float32)
-            beta_beta = torch.tensor(self._noise_beta_beta, device="cpu", dtype=torch.float32)
-            self._beta_dist = Beta(beta_alpha, beta_beta, validate_args=False)
-        sample = self._beta_dist.sample([batch_size]).to(device, dtype=dtype)
-        return (1 - sample) * self.config.noise_s
-
-    def process_backbone_output(self, backbone_output: BatchFeature) -> BatchFeature:
-        backbone_features = self.vlln(backbone_output["backbone_features"])
-        backbone_output["backbone_features"] = self.vl_self_attention(backbone_features)
-        return backbone_output
-
-    def forward(self, backbone_output: BatchFeature, action_input: BatchFeature) -> BatchFeature:
-        self.set_frozen_modules_to_eval_mode()
-        backbone_output = self.process_backbone_output(backbone_output)
-        vl_embeds = backbone_output.backbone_features
-        device = vl_embeds.device
-        embodiment_id = action_input.embodiment_id
-
-        if action_input.state.shape[1] != self.config.state_history_length:
-            raise ValueError("state history length does not match GR00T N1.7 config.")
-        state = action_input.state.view(action_input.state.shape[0], 1, -1)
-        state_features = self.state_encoder(state, embodiment_id)
-
-        if self.training and self.state_dropout_prob > 0:
-            do_dropout = (
-                torch.rand(state_features.shape[0], device=state_features.device) < self.state_dropout_prob
-            )
-            state_features = state_features * (1 - do_dropout[:, None, None].to(dtype=state_features.dtype))
-
-        actions = action_input.action
-        noise = torch.randn(actions.shape, device=actions.device, dtype=actions.dtype)
-        t = self.sample_time(actions.shape[0], device=actions.device, dtype=actions.dtype)
-        t = t[:, None, None]
-        noisy_trajectory = (1 - t) * noise + t * actions
-        velocity = actions - noise
-        t_discretized = (t[:, 0, 0] * self.num_timestep_buckets).long()
-        action_features = self.action_encoder(noisy_trajectory, t_discretized, embodiment_id)
-
-        if self.config.add_pos_embed:
-            pos_ids = torch.arange(action_features.shape[1], dtype=torch.long, device=device)
-            action_features = action_features + self.position_embedding(pos_ids).unsqueeze(0)
-
-        sa_embs = torch.cat((state_features, action_features), dim=1)
-        if self.config.use_alternate_vl_dit:
-            model_output, _ = self.model(
-                hidden_states=sa_embs,
-                encoder_hidden_states=vl_embeds,
-                encoder_attention_mask=backbone_output.backbone_attention_mask,
-                timestep=t_discretized,
-                return_all_hidden_states=True,
-                image_mask=backbone_output.image_mask,
-                backbone_attention_mask=backbone_output.backbone_attention_mask,
-            )
-        else:
-            model_output, _ = self.model(
-                hidden_states=sa_embs,
-                encoder_hidden_states=vl_embeds,
-                encoder_attention_mask=backbone_output.backbone_attention_mask,
-                timestep=t_discretized,
-                return_all_hidden_states=True,
-            )
-
-        pred = self.action_decoder(model_output, embodiment_id)
-        pred_actions = pred[:, -actions.shape[1] :]
-        action_mask = action_input.action_mask.to(dtype=pred_actions.dtype)
-        action_loss = F.mse_loss(pred_actions, velocity, reduction="none") * action_mask
-        loss = action_loss.sum() / (action_mask.sum() + 1e-6)
-        return BatchFeature(
-            data={
-                "loss": loss,
-                "action_loss": action_loss,
-                "action_mask": action_mask,
-                "backbone_features": vl_embeds,
-                "state_features": state_features,
-            }
-        )
-
-    def _encode_features(self, backbone_output: BatchFeature, action_input: BatchFeature) -> BatchFeature:
-        backbone_output = self.process_backbone_output(backbone_output)
-        state = action_input.state
-        if state.shape[1] != self.config.state_history_length:
-            raise ValueError("state history length does not match GR00T N1.7 config.")
-        state = state.view(state.shape[0], 1, -1)
-        state_features = self.state_encoder(state, action_input.embodiment_id)
-        return BatchFeature(
-            data={"backbone_features": backbone_output.backbone_features, "state_features": state_features}
-        )
-
-    @torch.no_grad()
-    def get_action_with_features(
-        self,
-        backbone_features: torch.Tensor,
-        state_features: torch.Tensor,
-        embodiment_id: torch.Tensor,
-        backbone_output: BatchFeature,
-        action_input: BatchFeature,
-        options: dict[str, Any] | None = None,
-    ) -> BatchFeature:
-        vl_embeds = backbone_features
-        batch_size = vl_embeds.shape[0]
-        device = vl_embeds.device
-        actions = torch.randn(
-            size=(batch_size, self.config.action_horizon, self.action_dim),
-            dtype=vl_embeds.dtype,
-            device=device,
-        )
-        dt = 1.0 / self.num_inference_timesteps
-        vel_strength = torch.ones_like(actions)
-
-        if "action" in action_input:
-            if options is None:
-                raise ValueError("RTC options are required when action is provided to get_action.")
-            action_horizon_before_padding = options["action_horizon"]
-            actions[:, : options["rtc_overlap_steps"], :] = action_input["action"][
-                :,
-                action_horizon_before_padding - options["rtc_overlap_steps"] : action_horizon_before_padding,
-                :,
-            ]
-            vel_strength[:, : options["rtc_frozen_steps"], :] = 0.0
-            intermediate_steps = options["rtc_overlap_steps"] - options["rtc_frozen_steps"]
-            t = torch.linspace(0.0, 1.0, intermediate_steps + 2, device=device)
-            ramp = 1 - torch.exp(-options["rtc_ramp_rate"] * t)
-            ramp = ramp / ramp[-1].clamp_min(1e-8)
-            vel_strength[:, options["rtc_frozen_steps"] : options["rtc_overlap_steps"], :] = ramp[1:-1][
-                None, :, None
-            ].to(device)
-
-        for t_step in range(self.num_inference_timesteps):
-            t_cont = t_step / float(self.num_inference_timesteps)
-            t_discretized = int(t_cont * self.num_timestep_buckets)
-            timesteps_tensor = torch.full(size=(batch_size,), fill_value=t_discretized, device=device)
-            action_features = self.action_encoder(actions, timesteps_tensor, embodiment_id)
-            if self.config.add_pos_embed:
-                pos_ids = torch.arange(action_features.shape[1], dtype=torch.long, device=device)
-                action_features = action_features + self.position_embedding(pos_ids).unsqueeze(0)
-            sa_embs = torch.cat((state_features, action_features), dim=1)
-
-            if self.config.use_alternate_vl_dit:
-                model_output = self.model(
-                    hidden_states=sa_embs,
-                    encoder_hidden_states=vl_embeds,
-                    timestep=timesteps_tensor,
-                    image_mask=backbone_output.image_mask,
-                    backbone_attention_mask=backbone_output.backbone_attention_mask,
-                )
-            else:
-                model_output = self.model(
-                    hidden_states=sa_embs,
-                    encoder_hidden_states=vl_embeds,
-                    timestep=timesteps_tensor,
-                )
-            pred = self.action_decoder(model_output, embodiment_id)
-            actions = actions + dt * pred[:, -self.action_horizon :] * vel_strength
-
-        return BatchFeature(
-            data={
-                "action_pred": actions,
-                "backbone_features": vl_embeds,
-                "state_features": state_features,
-            }
-        )
-
-    @torch.no_grad()
-    def get_action(
-        self,
-        backbone_output: BatchFeature,
-        action_input: BatchFeature,
-        options: dict[str, Any] | None = None,
-    ) -> BatchFeature:
-        features = self._encode_features(backbone_output, action_input)
-        return self.get_action_with_features(
-            backbone_features=features.backbone_features,
-            state_features=features.state_features,
-            embodiment_id=action_input.embodiment_id,
-            backbone_output=backbone_output,
-            action_input=action_input,
-            options=options,
-        )
-
-    @property
-    def device(self) -> torch.device:
-        return next(iter(self.parameters())).device
-
-    @property
-    def dtype(self) -> torch.dtype:
-        return next(iter(self.parameters())).dtype
-
-    def prepare_input(self, batch: dict[str, Any]) -> BatchFeature:
-        return BatchFeature(data=batch)
-
-
-def _is_cosmos_reason2_backbone(model_name: str) -> bool:
-    return str(model_name).rstrip("/") == "nvidia/Cosmos-Reason2-2B"
-
-
-def _cosmos_reason2_qwen3_vl_config() -> PretrainedConfig:
-    if Qwen3VLConfig is None:
-        raise ImportError(
-            "Qwen3VLConfig is required for GR00T N1.7. "
-            "Install the GR00T optional dependencies with `pip install 'lerobot[groot]'`."
-        )
-    return Qwen3VLConfig(
-        image_token_id=151655,
-        video_token_id=151656,
-        vision_start_token_id=151652,
-        vision_end_token_id=151653,
-        tie_word_embeddings=True,
-        text_config={
-            "attention_bias": False,
-            "attention_dropout": 0.0,
-            "bos_token_id": 151643,
-            "dtype": "bfloat16",
-            "eos_token_id": 151645,
-            "head_dim": 128,
-            "hidden_act": "silu",
-            "hidden_size": 2048,
-            "initializer_range": 0.02,
-            "intermediate_size": 6144,
-            "max_position_embeddings": 262144,
-            "model_type": "qwen3_vl_text",
-            "num_attention_heads": 16,
-            "num_hidden_layers": 28,
-            "num_key_value_heads": 8,
-            "rms_norm_eps": 1e-6,
-            "rope_scaling": {
-                "mrope_interleaved": True,
-                "mrope_section": [24, 20, 20],
-                "rope_type": "default",
-            },
-            "rope_theta": 5000000,
-            "tie_word_embeddings": True,
-            "use_cache": True,
-            "vocab_size": 151936,
-        },
-        vision_config={
-            "deepstack_visual_indexes": [5, 11, 17],
-            "depth": 24,
-            "hidden_act": "gelu_pytorch_tanh",
-            "hidden_size": 1024,
-            "in_channels": 3,
-            "initializer_range": 0.02,
-            "intermediate_size": 4096,
-            "model_type": "qwen3_vl",
-            "num_heads": 16,
-            "num_position_embeddings": 2304,
-            "out_hidden_size": 2048,
-            "patch_size": 16,
-            "spatial_merge_size": 2,
-            "temporal_patch_size": 2,
-        },
-    )
-
-
-def get_backbone_cls(config: GR00TN17Config):
-    if (
-        config.backbone_model_type == "qwen"
-        or "nvidia/Cosmos-Reason2" in config.model_name
-        or "Qwen/Qwen3-VL" in config.model_name
-    ):
-        return Qwen3Backbone
-    raise ValueError(f"Unsupported GR00T N1.7 backbone model: {config.model_name}")
-
-
-class GR00TN17(PreTrainedModel):
-    """GR00T N1.7 model with a Cosmos-Reason2/Qwen3-VL backbone."""
-
-    config_class = GR00TN17Config
-    supports_gradient_checkpointing = True
-
-    def __init__(
-        self,
-        config: GR00TN17Config,
-        transformers_loading_kwargs: dict[str, Any] | None = None,
-        load_backbone_weights: bool = True,
-    ):
-        super().__init__(config)
-        transformers_loading_kwargs = transformers_loading_kwargs or {"trust_remote_code": True}
-        self.config = config
-        backbone_cls = get_backbone_cls(config)
-        self.backbone = backbone_cls(
-            model_name=config.model_name,
-            tune_llm=config.tune_llm,
-            tune_visual=config.tune_visual,
-            select_layer=config.select_layer,
-            reproject_vision=config.reproject_vision,
-            use_flash_attention=config.use_flash_attention,
-            load_bf16=config.load_bf16,
-            tune_top_llm_layers=config.tune_top_llm_layers,
-            trainable_params_fp32=config.backbone_trainable_params_fp32,
-            transformers_loading_kwargs=transformers_loading_kwargs,
-            load_pretrained_weights=load_backbone_weights,
-        )
-        self.action_head = GR00TN17ActionHead(config)
-        self.post_init()
-
-    def prepare_input(self, inputs: dict[str, Any]) -> tuple[BatchFeature, BatchFeature]:
-        global tree
-        if tree is None:
-            require_package("dm-tree", extra="groot", import_name="tree")
-            tree = importlib.import_module("tree")
-        backbone_inputs = self.backbone.prepare_input(inputs)
-        action_inputs = self.action_head.prepare_input(inputs)
-
-        def to_device_with_dtype(x):
-            if not isinstance(x, torch.Tensor):
-                return x
-            if torch.is_floating_point(x):
-                return x.to(self.device, dtype=self.dtype)
-            return x.to(self.device)
-
-        return (
-            tree.map_structure(to_device_with_dtype, backbone_inputs),
-            tree.map_structure(to_device_with_dtype, action_inputs),
-        )
-
-    def forward(self, inputs: dict[str, Any]) -> BatchFeature:
-        backbone_inputs, action_inputs = self.prepare_input(inputs)
-        backbone_outputs = self.backbone(backbone_inputs)
-        return self.action_head(backbone_outputs, action_inputs)
-
-    def get_action(self, inputs: dict[str, Any], options: dict[str, Any] | None = None) -> BatchFeature:
-        backbone_inputs, action_inputs = self.prepare_input(inputs)
-        backbone_outputs = self.backbone(backbone_inputs)
-        return self.action_head.get_action(backbone_outputs, action_inputs, options)
-
-    @property
-    def device(self) -> torch.device:
-        return next(iter(self.parameters())).device
-
-    @property
-    def dtype(self) -> torch.dtype:
-        return next(iter(self.parameters())).dtype
-
-    @classmethod
-    def from_pretrained(cls, pretrained_model_name_or_path: str, **kwargs):
-        tune_visual = kwargs.pop("tune_visual", True)
-        tune_llm = kwargs.pop("tune_llm", False)
-        tune_projector = kwargs.pop("tune_projector", True)
-        tune_diffusion_model = kwargs.pop("tune_diffusion_model", True)
-        tune_vlln = kwargs.pop("tune_vlln", True)
-        transformers_loading_kwargs = kwargs.pop("transformers_loading_kwargs", None) or {
-            "trust_remote_code": True
-        }
-        load_backbone_weights = kwargs.pop("load_backbone_weights", False)
-        for key in ("revision", "cache_dir", "local_files_only", "token"):
-            if key in kwargs:
-                transformers_loading_kwargs.setdefault(key, kwargs[key])
-
-        try:
-            local_model_path = snapshot_download(
-                pretrained_model_name_or_path,
-                repo_type="model",
-                revision=kwargs.get("revision"),
-                cache_dir=kwargs.get("cache_dir"),
-                local_files_only=kwargs.get("local_files_only", False),
-                token=kwargs.get("token"),
-            )
-        except (HFValidationError, RepositoryNotFoundError):
-            local_model_path = pretrained_model_name_or_path
-
-        pretrained_model = super().from_pretrained(
-            local_model_path,
-            transformers_loading_kwargs=transformers_loading_kwargs,
-            load_backbone_weights=load_backbone_weights,
-            **kwargs,
-        )
-        pretrained_model.backbone.set_trainable_parameters(
-            tune_visual=tune_visual,
-            tune_llm=tune_llm,
-            tune_top_llm_layers=pretrained_model.config.tune_top_llm_layers,
-        )
-        pretrained_model.action_head.set_trainable_parameters(
-            tune_projector=tune_projector,
-            tune_diffusion_model=tune_diffusion_model,
-            tune_vlln=tune_vlln,
-        )
-        return pretrained_model
-
-
-def _register_with_transformers() -> None:
-    if AutoConfig is None or AutoModel is None:
-        return
-    try:
-        AutoConfig.register(GR00TN17Config.model_type, GR00TN17Config, exist_ok=True)
-    except TypeError:
-        with suppress(ValueError):
-            AutoConfig.register(GR00TN17Config.model_type, GR00TN17Config)
-    try:
-        AutoModel.register(GR00TN17Config, GR00TN17, exist_ok=True)
-    except TypeError:
-        with suppress(ValueError):
-            AutoModel.register(GR00TN17Config, GR00TN17)
-
-
-_register_with_transformers()
--- a/src/lerobot/policies/groot/modeling_groot.py
+++ b/src/lerobot/policies/groot/modeling_groot.py
@@ -46,15 +46,7 @@ from lerobot.utils.constants import ACTION, OBS_IMAGES
 from lerobot.utils.import_utils import require_package

 from ..pretrained import PreTrainedPolicy
-from .configuration_groot import (
-    GROOT_N1_5,
-    GROOT_N1_7,
-    GrootConfig,
-    infer_groot_model_version,
-    infer_groot_n1_7_action_execution_horizon,
-    infer_groot_n1_7_action_horizon,
-    normalize_groot_model_version,
-)
+from .configuration_groot import GrootConfig
 from .groot_n1 import GR00TN15

 T = TypeVar("T", bound="GrootPolicy")
@@ -75,7 +67,6 @@ class GrootPolicy(PreTrainedPolicy):

        # Initialize GR00T model using ported components
        self._groot_model = self._create_groot_model()
-        self._action_queue_steps = self._resolve_action_queue_steps()

        self.reset()

@@ -91,23 +82,13 @@ class GrootPolicy(PreTrainedPolicy):
        # Handle Flash Attention compatibility issues
        self._handle_flash_attention_compatibility()

-        model_kwargs = {
-            "pretrained_model_name_or_path": self.config.base_model_path,
-            "tune_llm": self.config.tune_llm,
-            "tune_visual": self.config.tune_visual,
-            "tune_projector": self.config.tune_projector,
-            "tune_diffusion_model": self.config.tune_diffusion_model,
-        }
-        if self.config.model_version == GROOT_N1_7:
-            from .groot_n1_7 import GR00TN17
-
-            model = GR00TN17.from_pretrained(
-                **model_kwargs,
-                tune_vlln=True,
-                transformers_loading_kwargs={"trust_remote_code": True},
-            )
-        else:
-            model = GR00TN15.from_pretrained(**model_kwargs)
+        model = GR00TN15.from_pretrained(
+            pretrained_model_name_or_path=self.config.base_model_path,
+            tune_llm=self.config.tune_llm,
+            tune_visual=self.config.tune_visual,
+            tune_projector=self.config.tune_projector,
+            tune_diffusion_model=self.config.tune_diffusion_model,
+        )

        model.compute_dtype = "bfloat16" if self.config.use_bf16 else model.compute_dtype
        model.config.compute_dtype = model.compute_dtype
@@ -116,7 +97,7 @@ class GrootPolicy(PreTrainedPolicy):

    def reset(self):
        """Reset policy state when environment resets."""
-        self._action_queue = deque([], maxlen=self._action_queue_steps)
+        self._action_queue = deque([], maxlen=self.config.n_action_steps)

    @classmethod
    def from_pretrained(
@@ -160,13 +141,8 @@ class GrootPolicy(PreTrainedPolicy):
        from huggingface_hub.constants import SAFETENSORS_SINGLE_FILE
        from huggingface_hub.errors import HfHubHTTPError

-        requested_version = (
-            normalize_groot_model_version(config.model_version)
-            if config is not None
-            else infer_groot_model_version(str(pretrained_name_or_path)) or GROOT_N1_5
-        )
             print(
-            f"The Groot policy is a wrapper around Nvidia's GR00T {requested_version} model.\n"
+            "The Groot policy is a wrapper around NVIDIA's GR00T N1.5 model.\n"
            f"Loading pretrained model from: {pretrained_name_or_path}"
        )

@@ -217,12 +193,8 @@ class GrootPolicy(PreTrainedPolicy):
        print("Detected base GR00T model, loading from HuggingFace...")

        if config is None:
-            model_version = infer_groot_model_version(str(pretrained_name_or_path)) or GROOT_N1_5
            # Create default config with the pretrained path
-            config = GrootConfig(
-                model_version=model_version,
-                base_model_path=str(pretrained_name_or_path),
-            )
+            config = GrootConfig(base_model_path=str(pretrained_name_or_path))

            # Add minimal visual feature required for validation
            # validate_features() will automatically add state and action features
@@ -243,25 +215,6 @@ class GrootPolicy(PreTrainedPolicy):
            if hasattr(config, key):
                setattr(config, key, value)

-        config.model_version = normalize_groot_model_version(config.model_version)
-        inferred_version = infer_groot_model_version(config.base_model_path)
-        if inferred_version is not None and inferred_version != config.model_version:
-            raise ValueError(
-                f"GR00T model_version '{config.model_version}' does not match base_model_path "
-                f"'{config.base_model_path}', which looks like '{inferred_version}'."
-            )
-        if config.model_version == GROOT_N1_7:
-            if config.max_state_dim == 64:
-                config.max_state_dim = 132
-            if config.max_action_dim == 32:
-                config.max_action_dim = 132
-            if config.chunk_size == 50:
-                config.chunk_size = 40
-            if config.n_action_steps == 50:
-                config.n_action_steps = 40
-            if tuple(config.image_size) == (224, 224):
-                config.image_size = (256, 256)
-
        # Create a fresh policy instance - this will automatically load the GR00T model
        # in __init__ via _create_groot_model()
        policy = cls(config)
@@ -272,59 +225,18 @@ class GrootPolicy(PreTrainedPolicy):
    def get_optim_params(self) -> dict:
        return self.parameters()

-    def _resolve_action_queue_steps(self) -> int:
-        n_action_steps = int(self.config.n_action_steps)
-        if self.config.model_version != GROOT_N1_7:
-            return n_action_steps
-
-        checkpoint_action_horizon = infer_groot_n1_7_action_horizon(
-            self.config.base_model_path,
-            self.config.embodiment_tag,
-        )
-        execution_horizon = infer_groot_n1_7_action_execution_horizon(
-            self.config.base_model_path,
-            self.config.embodiment_tag,
-        )
-        horizons = [n_action_steps]
-        if checkpoint_action_horizon is not None:
-            horizons.append(checkpoint_action_horizon)
-        if execution_horizon is not None:
-            horizons.append(execution_horizon)
-        return min(horizons)
-
-    def _filter_groot_inputs(self, batch: dict[str, Tensor], *, include_action: bool) -> dict[str, Tensor]:
-        allowed_base = {"state", "state_mask", "embodiment_id"}
-        if include_action:
-            allowed_base.update({"action", "action_mask"})
-
-        if self.config.model_version == GROOT_N1_7:
-            allowed_base.update(
-                {
-                    "input_ids",
-                    "attention_mask",
-                    "pixel_values",
-                    "image_grid_thw",
-                    "mm_token_type_ids",
-                    "pixel_values_videos",
-                    "video_grid_thw",
-                }
-            )
-            allowed_base.add("action_mask")
-        else:
-            allowed_base.update({"action_mask"} if include_action else set())
-
-        return {
-            k: v
-            for k, v in batch.items()
-            if (k in allowed_base or k.startswith("eagle_")) and not (k.startswith("next.") or k == "info")
-        }
-
    def forward(self, batch: dict[str, Tensor]) -> tuple[Tensor, dict]:
        """Training forward pass.

        Delegates to Isaac-GR00T model.forward when inputs are compatible.
        """
-        groot_inputs = self._filter_groot_inputs(batch, include_action=True)
+        # Build a clean input dict for GR00T: keep only tensors GR00T consumes
+        allowed_base = {"state", "state_mask", "action", "action_mask", "embodiment_id"}
+        groot_inputs = {
+            k: v
+            for k, v in batch.items()
+            if (k in allowed_base or k.startswith("eagle_")) and not (k.startswith("next.") or k == "info")
+        }

        # Get device from model parameters
        device = next(self.parameters()).device
@@ -349,10 +261,15 @@ class GrootPolicy(PreTrainedPolicy):
        """
        self.eval()

-        # Preprocessing is handled by the processor pipeline, so we just filter the batch.
-        # During inference, we do not pass action because it is predicted.
-        # N1.7 still carries a 2-D action horizon mask from its checkpoint processor.
-        groot_inputs = self._filter_groot_inputs(batch, include_action=False)
+        # Build a clean input dict for GR00T: keep only tensors GR00T consumes
+        # Preprocessing is handled by the processor pipeline, so we just filter the batch
+        # NOTE: During inference, we should NOT pass action/action_mask (that's what we're predicting)
+        allowed_base = {"state", "state_mask", "embodiment_id"}
+        groot_inputs = {
+            k: v
+            for k, v in batch.items()
+            if (k in allowed_base or k.startswith("eagle_")) and not (k.startswith("next.") or k == "info")
+        }

        # Get device from model parameters
        device = next(self.parameters()).device
@@ -375,7 +292,7 @@ class GrootPolicy(PreTrainedPolicy):

        if len(self._action_queue) == 0:
            actions = self.predict_action_chunk(batch)
-            self._action_queue.extend(actions[:, : self._action_queue_steps].transpose(0, 1))
+            self._action_queue.extend(actions.transpose(0, 1))
        return self._action_queue.popleft()

    # -------------------------
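
Review note: after this change, `forward` and `predict_action_chunk` each inline the same key filter and differ only in whether `action` and `action_mask` are allowed. The sketch below reproduces that predicate on a toy batch to show which keys survive in each mode. The function name `filter_groot_inputs` and the sample keys (`eagle_input_ids`, `task`) are assumptions made for illustration, not names taken from the policy.

```python
def filter_groot_inputs(batch: dict, *, include_action: bool) -> dict:
    """Keep only the keys GR00T consumes (hypothetical stand-in for the inlined filters)."""
    allowed_base = {"state", "state_mask", "embodiment_id"}
    if include_action:
        # Training passes the targets; inference must not, since they are predicted.
        allowed_base |= {"action", "action_mask"}
    return {
        k: v
        for k, v in batch.items()
        if (k in allowed_base or k.startswith("eagle_"))
        and not (k.startswith("next.") or k == "info")
    }


# Toy batch: the values are placeholders, only the keys matter here.
batch = {
    "state": 1,
    "action": 2,
    "action_mask": 3,
    "eagle_input_ids": 4,  # assumed example of an "eagle_"-prefixed key
    "next.state": 5,
    "info": 6,
    "task": 7,
}

train_inputs = filter_groot_inputs(batch, include_action=True)
infer_inputs = filter_groot_inputs(batch, include_action=False)
print(sorted(train_inputs))
print(sorted(infer_inputs))
```

Keys outside the allow-list, such as `task` here, are dropped in both modes, so anything the model needs must either be in `allowed_base` or carry the `eagle_` prefix.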
--- a/src/lerobot/policies/groot/processor_groot.py
+++ b/src/lerobot/policies/groot/processor_groot.py
--- a/src/lerobot/policies/molmoact2/README.md
+++ b/src/lerobot/policies/molmoact2/README.md
@@ -1 +0,0 @@
-../../../../docs/source/policy_molmoact2_README.md
--- a/src/lerobot/policies/molmoact2/__init__.py
+++ b/src/lerobot/policies/molmoact2/__init__.py
@@ -1,21 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from .configuration_molmoact2 import MolmoAct2Config
-from .modeling_molmoact2 import MolmoAct2Policy
-from .processor_molmoact2 import make_molmoact2_pre_post_processors
-
-__all__ = ["MolmoAct2Config", "MolmoAct2Policy", "make_molmoact2_pre_post_processors"]
--- a/src/lerobot/policies/molmoact2/configuration_molmoact2.py
+++ b/src/lerobot/policies/molmoact2/configuration_molmoact2.py
@@ -1,519 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from __future__ import annotations
-
-import json
-import math
-import os
-from contextlib import suppress
-from dataclasses import dataclass, field
-from pathlib import Path
-from typing import Any
-
-from huggingface_hub import snapshot_download
-
-from lerobot.configs import FeatureType, NormalizationMode, PolicyFeature, PreTrainedConfig
-from lerobot.optim import (
-    AdamWConfig,
-    CosineDecayWithWarmupSchedulerConfig,
-    LRSchedulerConfig,
-    OptimizerConfig,
-)
-from lerobot.utils.constants import ACTION, OBS_STATE
-
-from ..rtc.configuration_rtc import RTCConfig
-
-MOLMOACT2_DEFAULT_NUM_IMAGES = 2
-MOLMOACT2_IMAGE_TOKENS_PER_IMAGE = 196
-MOLMOACT2_FIXED_PROMPT_TOKEN_BUDGET = 80
-MOLMOACT2_TASK_TOKEN_BUDGET = 32
-MOLMOACT2_SEQUENCE_LENGTH_MARGIN = 32
-MOLMOACT2_SEQUENCE_LENGTH_MULTIPLE = 64
-MOLMOACT2_DISCRETE_ACTION_WRAPPER_TOKENS = 4
-MOLMOACT2_MIN_DISCRETE_ACTION_TOKENS_PER_STEP = 6
-MOLMOACT2_DISCRETE_ACTION_TOKENS_PER_DIM = 0.95
-
-
-def _hf_token() -> str | None:
-    return os.environ.get("HF_TOKEN") or os.environ.get("HF_ACCESS_TOKEN")
-
-
-def _resolve_checkpoint_location(
-    checkpoint_path: str,
-    *,
-    revision: str | None = None,
-    force_download: bool = False,
-) -> str:
-    checkpoint_path = str(checkpoint_path or "").strip()
-    if not checkpoint_path:
-        raise ValueError("MolmoAct2 policy requires `checkpoint_path`.")
-    local_path = Path(checkpoint_path).expanduser()
-    if local_path.exists():
-        return str(local_path)
-    return snapshot_download(
-        repo_id=checkpoint_path,
-        repo_type="model",
-        revision=revision,
-        force_download=force_download,
-        ignore_patterns=["*.py", "*.pyc", "__pycache__/*"],
-        token=_hf_token(),
-    )
-
-
-def _load_hf_norm_metadata_for_tag(
-    checkpoint_path: str,
-    *,
-    revision: str | None,
-    force_download: bool,
-    norm_tag: str | None,
-) -> dict[str, Any]:
-    norm_tag = str(norm_tag or "").strip()
-    if not norm_tag:
-        return {}
-    checkpoint_location = Path(
-        _resolve_checkpoint_location(
-            checkpoint_path,
-            revision=revision,
-            force_download=force_download,
-        )
-    )
-    norm_stats_filename = "norm_stats.json"
-    config_path = checkpoint_location / "config.json"
-    if config_path.exists():
-        with suppress(OSError, json.JSONDecodeError):
-            norm_stats_filename = str(
-                json.loads(config_path.read_text()).get("norm_stats_filename") or norm_stats_filename
-            )
-    stats_path = checkpoint_location / norm_stats_filename
-    if not stats_path.exists():
-        raise FileNotFoundError(
-            f"MolmoAct2 HF checkpoint is missing {norm_stats_filename!r}; cannot resolve norm_tag={norm_tag!r}."
-        )
-    payload = json.loads(stats_path.read_text())
-    metadata_by_tag = payload.get("metadata_by_tag")
-    if not isinstance(metadata_by_tag, dict):
-        raise ValueError(f"MolmoAct2 norm stats file {stats_path} has no metadata_by_tag mapping.")
-    metadata = metadata_by_tag.get(norm_tag)
-    if not isinstance(metadata, dict):
-        available = sorted(str(tag) for tag in metadata_by_tag)
-        raise ValueError(f"Unknown MolmoAct2 norm_tag={norm_tag!r}. Available tags: {available}.")
-    return metadata
-
-
-@LRSchedulerConfig.register_subclass("molmoact2_cosine_decay_with_warmup")
-@dataclass
-class MolmoAct2CosineDecayWithWarmupSchedulerConfig(CosineDecayWithWarmupSchedulerConfig):
-    """MolmoAct2-local cosine scheduler with optional decay-step auto-match.
-
-    LeRobot's generic cosine scheduler keeps an explicit integer decay length.
-    For MolmoAct2, leaving num_decay_steps unset means "decay across this run's
-    training steps"; build() is the first point where num_training_steps is known.
-    """
-
-    num_decay_steps: int | None
-
-    def build(self, optimizer, num_training_steps: int):
-        return CosineDecayWithWarmupSchedulerConfig(
-            peak_lr=self.peak_lr,
-            decay_lr=self.decay_lr,
-            num_warmup_steps=self.num_warmup_steps,
-            num_decay_steps=num_training_steps if self.num_decay_steps is None else self.num_decay_steps,
-        ).build(optimizer, num_training_steps=num_training_steps)
-
-
-def _round_up(value: int, multiple: int) -> int:
-    return int(math.ceil(value / multiple) * multiple)
-
-
-def infer_molmoact2_max_sequence_length(
-    *,
-    num_images: int,
-    state_dim: int,
-    action_dim: int,
-    action_horizon: int,
-    include_discrete_action: bool,
-) -> int:
-    """Infer the padded text/image sequence cap from MolmoAct2's fixed token layout."""
-    if num_images < 1:
-        num_images = MOLMOACT2_DEFAULT_NUM_IMAGES
-    if state_dim < 0:
-        state_dim = 0
-    if action_dim < 1:
-        action_dim = 1
-    if action_horizon < 1:
-        action_horizon = 1
-
-    image_tokens = num_images * MOLMOACT2_IMAGE_TOKENS_PER_IMAGE
-    prompt_tokens = (
-        MOLMOACT2_FIXED_PROMPT_TOKEN_BUDGET
-        + MOLMOACT2_TASK_TOKEN_BUDGET
-        + state_dim
-        + MOLMOACT2_SEQUENCE_LENGTH_MARGIN
-    )
-    action_tokens = 0
-    if include_discrete_action:
-        action_tokens_per_step = max(
-            MOLMOACT2_MIN_DISCRETE_ACTION_TOKENS_PER_STEP,
-            math.ceil(action_dim * MOLMOACT2_DISCRETE_ACTION_TOKENS_PER_DIM),
-        )
-        action_tokens = MOLMOACT2_DISCRETE_ACTION_WRAPPER_TOKENS + action_horizon * action_tokens_per_step
-
-    return _round_up(
-        image_tokens + prompt_tokens + action_tokens,
-        MOLMOACT2_SEQUENCE_LENGTH_MULTIPLE,
-    )
-
-
-@PreTrainedConfig.register_subclass("molmoact2")
-@dataclass
-class MolmoAct2Config(PreTrainedConfig):
-    """MolmoAct2 policy backed by the converted HF checkpoint implementation."""
-
-    checkpoint_path: str = "allenai/MolmoAct2"
-    checkpoint_revision: str | None = None
-    checkpoint_force_download: bool = False
-
-    n_obs_steps: int = 1
-    chunk_size: int = 30
-    n_action_steps: int = 30
-
-    action_mode: str = "both"
-    inference_action_mode: str | None = None
-    discrete_action_tokenizer: str = "allenai/MolmoAct2-FAST-Tokenizer"
-    discrete_generation_max_steps: int | None = None
-    norm_tag: str | None = None
-
-    setup_type: str = ""
-    control_mode: str = ""
-    image_keys: list[str] = field(default_factory=list)
-    normalize_language: bool = True
-    add_setup_tokens: bool = True
-    add_control_tokens: bool = True
-    normalize_gripper: bool = False
-    num_state_tokens: int = 256
-    # Leave unset for the default MolmoAct2 sequence budget inferred from the fixed
-    # image/prompt/state/action token layout. Override only for unusual long prompts.
-    max_sequence_length: int | None = None
-
-    # Fixed by released MolmoAct2 checkpoints. We validate this at model load.
-    expected_max_action_dim: int = 32
-
-    # Flow-matching training knobs copied from the original MolmoAct2 training path.
-    num_flow_timesteps: int = 8
-    flow_matching_cutoff: float = 1.0
-    flow_matching_time_offset: float = 0.001
-    flow_matching_time_scale: float = 0.999
-    flow_matching_beta_alpha: float = 1.0
-    flow_matching_beta_beta: float = 1.5
-    num_inference_steps: int | None = None
-    mask_action_dim_padding: bool = True
-    enable_inference_cuda_graph: bool = True
-    # MolmoAct2-local eval option. When enabled, stochastic continuous action
-    # generation uses a rollout-local generator derived from eval_seed.
-    per_episode_seed: bool = False
-    eval_seed: int | None = None
-    rtc_config: RTCConfig | None = None
-
-    # Default is full finetuning with gradients from the action expert flowing into the VLM.
-    enable_lora_vlm: bool = False
-    lora_rank: int = 64
-    lora_alpha: int = 16
-    lora_dropout: float = 0.05
-    lora_bias: str = "none"
-    enable_lora_action_expert: bool = False
-    enable_knowledge_insulation: bool = False
-    freeze_embedding: bool = True
-    train_action_expert_only: bool = False
-    gradient_checkpointing: bool = False
-
-    model_dtype: str = "bfloat16"
-    softmax_auxiliary_loss: bool = True
-    softmax_auxiliary_loss_scale: float = 1e-4
-    discrete_loss_token_weighting: str = "root_subsegments_root_tokens"
-
-    optimizer_lr: float = 1e-5
-    optimizer_vit_lr: float = 5e-6
-    optimizer_connector_lr: float = 5e-6
-    optimizer_action_expert_lr: float = 5e-5
-    optimizer_betas: tuple[float, float] = (0.9, 0.95)
-    optimizer_eps: float = 1e-6
-    optimizer_weight_decay: float = 0.0
-    optimizer_grad_clip_norm: float = 1.0
-
-    scheduler_warmup_steps: int = 200
-    scheduler_decay_steps: int | None = None
-    scheduler_decay_lr: float = 1e-6
-
-    normalization_mapping: dict[str, NormalizationMode] = field(
-        default_factory=lambda: {
-            "VISUAL": NormalizationMode.IDENTITY,
-            "STATE": NormalizationMode.QUANTILES,
-            "ACTION": NormalizationMode.QUANTILES,
-        }
-    )
-
-    input_features: dict[str, PolicyFeature] = field(default_factory=dict)
-    output_features: dict[str, PolicyFeature] = field(default_factory=dict)
-    dataset_feature_names: dict[str, Any] = field(default_factory=dict)
-
-    def __post_init__(self) -> None:
-        super().__post_init__()
-        if self.action_mode not in {"continuous", "discrete", "both"}:
-            raise ValueError(
-                f"Unsupported action_mode={self.action_mode!r}. "
-                "Expected one of {'continuous', 'discrete', 'both'}."
-            )
-        if self.inference_action_mode not in {None, "continuous", "discrete"}:
-            raise ValueError(
-                f"Unsupported inference_action_mode={self.inference_action_mode!r}. "
-                "Expected one of {None, 'continuous', 'discrete'}."
-            )
-        if self.inference_action_mode == "continuous" and self.action_mode == "discrete":
-            raise ValueError("MolmoAct2 action_mode='discrete' cannot run continuous inference.")
-        if self.inference_action_mode == "discrete" and self.action_mode == "continuous":
-            raise ValueError("MolmoAct2 action_mode='continuous' cannot run discrete inference.")
-        if self.train_action_expert_only and self.action_mode != "continuous":
-            raise ValueError("MolmoAct2 train_action_expert_only requires action_mode='continuous'.")
-        if self.train_action_expert_only and self.enable_lora_vlm:
-            raise ValueError("MolmoAct2 train_action_expert_only is incompatible with enable_lora_vlm.")
-        if self.enable_lora_action_expert and not self.enable_lora_vlm:
-            raise ValueError("MolmoAct2 enable_lora_action_expert requires enable_lora_vlm.")
-        if self.chunk_size < 1:
-            raise ValueError(f"chunk_size must be >= 1, got {self.chunk_size}.")
-        if self.n_action_steps < 1:
-            raise ValueError(f"n_action_steps must be >= 1, got {self.n_action_steps}.")
-        if self.n_action_steps > self.chunk_size:
-            raise ValueError(
-                f"n_action_steps ({self.n_action_steps}) cannot exceed chunk_size ({self.chunk_size})."
-            )
-        if self.expected_max_action_dim != 32:
-            raise ValueError("MolmoAct2 released checkpoints use expected_max_action_dim=32.")
-        if self.model_dtype not in {"float32", "bfloat16", "float16"}:
-            raise ValueError(
-                f"Unsupported model_dtype={self.model_dtype!r}. Expected 'float32', 'bfloat16', or 'float16'."
-            )
-        if self.lora_rank < 1:
-            raise ValueError(f"lora_rank must be >= 1, got {self.lora_rank}.")
-        if self.lora_alpha < 1:
-            raise ValueError(f"lora_alpha must be >= 1, got {self.lora_alpha}.")
-        if not 0 <= self.lora_dropout <= 1:
-            raise ValueError(f"lora_dropout must be in [0, 1], got {self.lora_dropout}.")
-        if self.lora_bias not in {"none", "all", "lora_only"}:
-            raise ValueError(
-                f"Unsupported lora_bias={self.lora_bias!r}. Expected one of 'none', 'all', or 'lora_only'."
-            )
-        if self.discrete_loss_token_weighting not in {
-            "none",
-            "token",
-            "root_tokens",
-            "root_subsegments",
-            "root_subsegments_root_tokens",
-        }:
-            raise ValueError(
-                f"Unsupported discrete_loss_token_weighting={self.discrete_loss_token_weighting!r}."
-            )
-        if self.discrete_generation_max_steps is not None and self.discrete_generation_max_steps < 1:
-            raise ValueError(
-                f"discrete_generation_max_steps must be >= 1 or None, got {self.discrete_generation_max_steps}."
-            )
-        if self.max_sequence_length is not None and self.max_sequence_length < 1:
-            raise ValueError(f"max_sequence_length must be >= 1 or None, got {self.max_sequence_length}.")
-
-    def inferred_max_sequence_length(
-        self,
-        *,
-        num_images: int | None = None,
-        state_dim: int | None = None,
-        action_dim: int | None = None,
-        action_horizon: int | None = None,
-        include_discrete_action: bool | None = None,
-    ) -> int:
-        if self.max_sequence_length is not None:
-            return int(self.max_sequence_length)
-
-        if num_images is None:
-            num_images = len(self.image_keys) or len(self.image_features) or MOLMOACT2_DEFAULT_NUM_IMAGES
-        if state_dim is None:
-            state_feature = self.robot_state_feature
-            state_dim = int(state_feature.shape[0]) if state_feature is not None else 0
-        if action_dim is None:
-            action_feature = self.action_feature
-            action_dim = (
-                int(action_feature.shape[0]) if action_feature is not None else self.expected_max_action_dim
-            )
-        if action_horizon is None:
-            action_horizon = self.chunk_size
-        if include_discrete_action is None:
-            include_discrete_action = self.action_mode in {"discrete", "both"}
-
-        return infer_molmoact2_max_sequence_length(
-            num_images=int(num_images),
-            state_dim=int(state_dim),
-            action_dim=int(action_dim),
-            action_horizon=int(action_horizon),
-            include_discrete_action=bool(include_discrete_action),
-        )
-
-    @property
-    def observation_delta_indices(self) -> None:
-        return None
-
-    @property
-    def action_delta_indices(self) -> list[int]:
-        return list(range(self.chunk_size))
-
-    @property
-    def reward_delta_indices(self) -> None:
-        return None
-
-    def get_optimizer_preset(self) -> OptimizerConfig:
-        return AdamWConfig(
-            lr=self.optimizer_lr,
-            betas=self.optimizer_betas,
-            eps=self.optimizer_eps,
-            weight_decay=self.optimizer_weight_decay,
-            grad_clip_norm=self.optimizer_grad_clip_norm,
-        )
-
-    def get_scheduler_preset(self) -> LRSchedulerConfig | None:
-        return MolmoAct2CosineDecayWithWarmupSchedulerConfig(
-            peak_lr=self.optimizer_lr,
-            decay_lr=self.scheduler_decay_lr,
-            num_warmup_steps=self.scheduler_warmup_steps,
-            num_decay_steps=self.scheduler_decay_steps,
-        )
-
-    def set_dataset_feature_metadata(self, features: dict[str, Any]) -> None:
-        self.dataset_feature_names = {}
-        for key in (ACTION, OBS_STATE):
-            feature = features.get(key) if isinstance(features, dict) else None
-            if isinstance(feature, dict) and feature.get("names") is not None:
-                self.dataset_feature_names[key] = feature["names"]
-
-    def validate_features(self) -> None:
-        """Validate and set up MolmoAct2 input and output features."""
-        image_features = [key for key, feat in self.input_features.items() if feat.type == FeatureType.VISUAL]
-        if not image_features:
-            raise ValueError(
-                "MolmoAct2 policy requires at least one visual input feature. "
-                "No features of type FeatureType.VISUAL found in input_features."
-            )
-
-        if OBS_STATE not in self.input_features:
-            state_feature = PolicyFeature(
-                type=FeatureType.STATE,
-                shape=(0,),
-            )
-            self.input_features[OBS_STATE] = state_feature
-
-        if ACTION not in self.output_features:
-            action_feature = PolicyFeature(
-                type=FeatureType.ACTION,
-                shape=(self.expected_max_action_dim,),
-            )
-            self.output_features[ACTION] = action_feature
-
-    def apply_norm_tag_metadata(self) -> None:
-        if not str(self.norm_tag or "").strip():
-            return
-        metadata = _load_hf_norm_metadata_for_tag(
-            self.checkpoint_path,
-            revision=self.checkpoint_revision,
-            force_download=bool(self.checkpoint_force_download),
-            norm_tag=self.norm_tag,
-        )
-        if metadata.get("action_horizon") is not None:
-            self.chunk_size = int(metadata["action_horizon"])
-        if metadata.get("n_action_steps") is not None:
-            self.n_action_steps = int(metadata["n_action_steps"])
-        if not self.setup_type and metadata.get("setup_type") is not None:
-            self.setup_type = str(metadata["setup_type"])
-        if not self.control_mode and metadata.get("control_mode") is not None:
-            self.control_mode = str(metadata["control_mode"])
-
-    def saved_policy_action_mode(self) -> str | None:
-        pretrained_path = getattr(self, "pretrained_path", None)
-        if pretrained_path is None:
-            return None
-        config_path = Path(pretrained_path) / "config.json"
-        if not config_path.exists():
-            return None
-        try:
-            mode = json.loads(config_path.read_text()).get("action_mode")
-        except (OSError, json.JSONDecodeError):
-            return None
-        if mode in {"continuous", "discrete", "both"}:
-            return str(mode)
-        return None
-
-    def training_action_mode(self, saved_policy_action_mode: str | None = None) -> str:
-        return saved_policy_action_mode or self.action_mode
-
-    def validate_inference_action_mode(self, saved_policy_action_mode: str | None = None) -> None:
-        requested_mode = self.inference_action_mode
-        if requested_mode is None:
-            return
-        training_mode = self.training_action_mode(saved_policy_action_mode)
-        if requested_mode == "continuous" and training_mode == "discrete":
-            raise ValueError(
-                "MolmoAct2 checkpoint was trained with action_mode='discrete' and cannot run "
-                "continuous inference."
-            )
-        if requested_mode == "discrete" and training_mode == "continuous":
-            raise ValueError(
-                "MolmoAct2 checkpoint was trained with action_mode='continuous' and cannot run "
-                "discrete inference. Train with action_mode='both' or action_mode='discrete' first."
-            )
-
-    def validate_checkpoint_action_mode(
-        self,
-        checkpoint_action_mode: str,
-        *,
-        has_action_expert: bool,
-    ) -> None:
-        if self.action_mode == "both" and checkpoint_action_mode != "both":
-            raise ValueError(
-                f"action_mode='both' requires checkpoint action_mode='both', got {checkpoint_action_mode!r}."
-            )
-        if self.action_mode == "discrete" and checkpoint_action_mode not in {"discrete", "both"}:
-            raise ValueError(
-                f"action_mode='discrete' requires checkpoint action_mode in {{'discrete', 'both'}}, "
-                f"got {checkpoint_action_mode!r}."
-            )
-        if self.action_mode in {"continuous", "both"} and not has_action_expert:
-            raise ValueError("Continuous MolmoAct2 training requires an action expert checkpoint.")
-
-    def resolve_inference_action_mode(
-        self,
-        requested_mode: str | None,
-        saved_policy_action_mode: str | None = None,
-    ) -> str:
-        training_mode = self.training_action_mode(saved_policy_action_mode)
-        if requested_mode is None:
-            requested_mode = self.inference_action_mode
-        if requested_mode is None:
-            raise ValueError(
-                "MolmoAct2 inference requires `inference_action_mode` to be set explicitly "
-                "to either 'continuous' or 'discrete'."
-            )
-        if requested_mode not in {"continuous", "discrete"}:
-            raise ValueError("MolmoAct2 inference_action_mode must be either 'continuous' or 'discrete'.")
-        if requested_mode == "continuous" and training_mode == "discrete":
-            raise ValueError("MolmoAct2 action_mode='discrete' checkpoint cannot run continuous inference.")
-        if requested_mode == "discrete" and training_mode == "continuous":
-            raise ValueError("MolmoAct2 action_mode='continuous' checkpoint cannot run discrete inference.")
-        return requested_mode
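
Review note: the removed `infer_molmoact2_max_sequence_length` helper sums a fixed token budget and rounds it up to a multiple of 64. The sketch below restates that arithmetic with the constants from the deleted file, as a worked example. The function name `max_sequence_length` and the input values (2 images, an 8-dimensional state, a 7-dimensional action, horizon 30) are assumptions chosen for illustration, and the input clamping done by the original is omitted.

```python
import math

# Constants copied from the removed configuration_molmoact2.py
IMAGE_TOKENS_PER_IMAGE = 196
FIXED_PROMPT_TOKEN_BUDGET = 80
TASK_TOKEN_BUDGET = 32
SEQUENCE_LENGTH_MARGIN = 32
SEQUENCE_LENGTH_MULTIPLE = 64
DISCRETE_ACTION_WRAPPER_TOKENS = 4
MIN_DISCRETE_ACTION_TOKENS_PER_STEP = 6
DISCRETE_ACTION_TOKENS_PER_DIM = 0.95


def max_sequence_length(num_images, state_dim, action_dim, action_horizon, include_discrete_action):
    """Hypothetical restatement of the removed helper, without input clamping."""
    image_tokens = num_images * IMAGE_TOKENS_PER_IMAGE
    prompt_tokens = FIXED_PROMPT_TOKEN_BUDGET + TASK_TOKEN_BUDGET + state_dim + SEQUENCE_LENGTH_MARGIN
    action_tokens = 0
    if include_discrete_action:
        per_step = max(
            MIN_DISCRETE_ACTION_TOKENS_PER_STEP,
            math.ceil(action_dim * DISCRETE_ACTION_TOKENS_PER_DIM),
        )
        action_tokens = DISCRETE_ACTION_WRAPPER_TOKENS + action_horizon * per_step
    total = image_tokens + prompt_tokens + action_tokens
    # Round up to the next multiple of SEQUENCE_LENGTH_MULTIPLE
    return math.ceil(total / SEQUENCE_LENGTH_MULTIPLE) * SEQUENCE_LENGTH_MULTIPLE


# 392 image + 152 prompt + (4 + 30 * 7) action = 758, rounded up to 768
with_discrete = max_sequence_length(2, 8, 7, 30, True)
# 392 image + 152 prompt = 544, rounded up to 576
continuous_only = max_sequence_length(2, 8, 7, 30, False)
print(with_discrete, continuous_only)
```

In this example the discrete action tokens account for 214 of the 758 unpadded tokens, so the cap is sensitive to both `chunk_size` and the action dimension whenever `action_mode` is `"discrete"` or `"both"`.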
--- a/src/lerobot/policies/molmoact2/hf_model/__init__.py
+++ b/src/lerobot/policies/molmoact2/hf_model/__init__.py
@@ -1,17 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-# ruff: noqa
--- a/src/lerobot/policies/molmoact2/hf_model/action_tokenizer.py
+++ b/src/lerobot/policies/molmoact2/hf_model/action_tokenizer.py
@@ -1,237 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-# ruff: noqa
-
-import logging
-import os
-from pathlib import Path
-from typing import ClassVar
-
-import numpy as np
-from tokenizers import ByteLevelBPETokenizer
-from tokenizers.trainers import BpeTrainer
-from huggingface_hub import snapshot_download
-from transformers import PreTrainedTokenizerFast
-from transformers.processing_utils import ProcessorMixin
-
-
-def _hf_token() -> str | None:
-    return os.environ.get("HF_TOKEN") or os.environ.get("HF_ACCESS_TOKEN")
-
-
-def _resolve_tokenizer_location(
-    tokenizer_path: str,
-    *,
-    revision: str | None = None,
-    force_download: bool = False,
-) -> str:
-    local_path = Path(str(tokenizer_path)).expanduser()
-    if local_path.exists():
-        return str(local_path)
-    return snapshot_download(
-        repo_id=str(tokenizer_path),
-        repo_type="model",
-        revision=revision,
-        force_download=force_download,
-        ignore_patterns=["*.py", "*.pyc", "__pycache__/*"],
-        token=_hf_token(),
-    )
-
-
-class UniversalActionProcessor(ProcessorMixin):
-    attributes: ClassVar[list[str]] = ["tokenizer"]
-    tokenizer_class: str = "AutoTokenizer"
-
-    def __init__(
-        self,
-        tokenizer: PreTrainedTokenizerFast,
-        scale: float = 10,
-        vocab_size: int = 1024,
-        min_token: int = 0,
-        *,
-        action_dim: int | None = None,
-        time_horizon: int | None = None,
-    ):
-        self.scale = scale
-        self.vocab_size = vocab_size
-        self.min_token = min_token
-
-        # Action horizon and dimension needed during decoding. These can be specified
-        # in three ways (in order of priority):
-        # 1. passed in as kwargs to decode()
-        # 2. in the constructor
-        # 3. cached from the last time decode() was called
-        self.time_horizon = time_horizon
-        self.action_dim = action_dim
-        self.called_time_horizon = time_horizon
-        self.called_action_dim = action_dim
-
-        super().__init__(tokenizer)
-        self.bpe_tokenizer = self.tokenizer
-
-    def __call__(self, action_chunk: np.array) -> np.array:
-        from scipy.fft import dct
-
-        assert action_chunk.ndim <= 3, "Only 3 dimensions supported: [batch, timesteps, action_dim]"
-        if action_chunk.ndim == 2:
-            action_chunk = action_chunk[None, ...]
-
-        # Cache the time horizon and action dimension for decoding
-        self.called_time_horizon = action_chunk.shape[-2]
-        self.called_action_dim = action_chunk.shape[-1]
-
-        dct_coeff = dct(action_chunk, axis=1, norm="ortho")
-        dct_coeff = np.around(dct_coeff * self.scale)
-        tokens = []
-        for elem in dct_coeff:
-            token_str = "".join(map(chr, np.maximum(elem.flatten() - self.min_token, 0).astype(int)))
-            tokens.append(self.bpe_tokenizer(token_str)["input_ids"])
-        return tokens
-
-    def decode(
-        self,
-        tokens: list[list[int]],
-        *,
-        time_horizon: int | None = None,
-        action_dim: int | None = None,
-    ) -> np.array:
-        from scipy.fft import idct
-
-        self.time_horizon = time_horizon or self.time_horizon or self.called_time_horizon
-        self.action_dim = action_dim or self.action_dim or self.called_action_dim
-
-        # Cache the time horizon and action dimension for the next call
-        self.called_time_horizon = self.time_horizon
-        self.called_action_dim = self.action_dim
-
-        assert self.time_horizon is not None and self.action_dim is not None, (
-            "Tokenizer not initialized, call the processor once or pass in time_horizon and action_dim."
-        )
-
-        decoded_actions = []
-        for token in tokens:
-            try:
-                decoded_tokens = self.bpe_tokenizer.decode(token)
-                decoded_dct_coeff = np.array(list(map(ord, decoded_tokens))) + self.min_token
-                decoded_dct_coeff = decoded_dct_coeff.reshape(-1, self.action_dim)
-                assert decoded_dct_coeff.shape == (
-                    self.time_horizon,
-                    self.action_dim,
-                ), (
-                    f"Decoded DCT coefficients have shape {decoded_dct_coeff.shape}, expected ({self.time_horizon}, {self.action_dim})"
-                )
-            except Exception as e:
-                print(f"Error decoding tokens: {e}")
-                print(f"Tokens: {token}")
-                decoded_dct_coeff = np.zeros((self.time_horizon, self.action_dim))
-            decoded_actions.append(idct(decoded_dct_coeff / self.scale, axis=0, norm="ortho"))
-        return np.stack(decoded_actions)
-
-    @classmethod
-    def fit(
-        cls,
-        action_data: list[np.array],
-        scale: float = 10,
-        vocab_size: int = 1024,
-        *,
-        time_horizon: int | None = None,
-        action_dim: int | None = None,
-    ) -> "UniversalActionProcessor":
-        from scipy.fft import dct
-
-        # Run DCT over all inputs
-        dct_tokens = [dct(a, axis=0, norm="ortho").flatten() for a in action_data]
-
-        # Quantize and find min token
-        max_token = int(np.around(np.concatenate(dct_tokens) * scale).max())
-        min_token = int(np.around(np.concatenate(dct_tokens) * scale).min())
-        min_vocab_size = max_token - min_token
-
-        assert min_vocab_size <= vocab_size, (
-            f"Vocab size {vocab_size} is too small for the range of tokens {min_vocab_size}"
-        )
-        if min_vocab_size + 100 > vocab_size:
-            logging.warning(
-                f"Initial alphabet size {min_vocab_size} is almost as large as the vocab "
-                f"size {vocab_size}, consider increasing vocab size"
-            )
-
-        # Make token iterator for BPE training
-        def _token_iter():
-            for tokens in dct_tokens:
-                rounded_tokens = np.around(tokens * scale) - min_token
-                rounded_tokens = rounded_tokens.astype(int)
-                string = "".join(map(chr, rounded_tokens))
-                yield string
-
-        # Train BPE tokenizer
-        bpe = ByteLevelBPETokenizer()
-
-        # Set up the entire range of possible tokens as the initial alphabet
-        alphabet = [chr(i) for i in range(max_token - min_token + 1)]
-        trainer = BpeTrainer(
-            vocab_size=vocab_size,
-            min_frequency=2,
-            show_progress=True,
-            special_tokens=[],
-            initial_alphabet=alphabet,
-            max_token_length=10000,
-        )
-
-        # Train the inner tokenizer (don't use ByteLevelBPETokenizer.train_from_iterator()
-        # because it doesn't support custom alphabets)
-        bpe._tokenizer.train_from_iterator(_token_iter(), trainer=trainer)
-
-        return cls(
-            PreTrainedTokenizerFast(tokenizer_object=bpe, clean_up_tokenization_spaces=False),
-            scale=scale,
-            vocab_size=vocab_size,
-            min_token=min_token,
-            time_horizon=time_horizon,
-            action_dim=action_dim,
-        )
-
-    @classmethod
-    def from_pretrained_local(
-        cls,
-        pretrained_model_name_or_path: str,
-        *,
-        revision: str | None = None,
-        force_download: bool = False,
-    ) -> "UniversalActionProcessor":
-        location = Path(
-            _resolve_tokenizer_location(
-                pretrained_model_name_or_path,
-                revision=revision,
-                force_download=force_download,
-            )
-        )
-        processor_config = {}
-        processor_config_path = location / "processor_config.json"
-        if processor_config_path.exists():
-            import json
-
-            processor_config = json.loads(processor_config_path.read_text())
-        tokenizer = PreTrainedTokenizerFast.from_pretrained(str(location))
-        return cls(
-            tokenizer,
-            scale=processor_config.get("scale", 10),
-            vocab_size=processor_config.get("vocab_size", 1024),
-            min_token=processor_config.get("min_token", 0),
-            action_dim=processor_config.get("action_dim"),
-            time_horizon=processor_config.get("time_horizon"),
-        )
--- a/src/lerobot/policies/molmoact2/hf_model/configuration_molmoact2.py
+++ b/src/lerobot/policies/molmoact2/hf_model/configuration_molmoact2.py
@@ -1,553 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-# ruff: noqa
-
-"""
-MolmoAct2 configuration
-"""
-
-from typing import Optional, Any
-
-from transformers import PretrainedConfig
-from transformers.modeling_rope_utils import rope_config_validation
-from transformers.utils import logging
-
-logger = logging.get_logger(__name__)
-
-
-class MolmoAct2VitConfig(PretrainedConfig):
-    r"""
-    This is the configuration class to store the configuration of a [`MolmoAct2VisionTransformer`].
-    It is used to instantiate a `MolmoAct2VisionTransformer` according to the specified arguments,
-    defining the model architecture.
-
-    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
-    documentation from [`PretrainedConfig`] for more information.
-
-    Example:
-    ```python
-    >>> from transformers import MolmoAct2VitConfig, MolmoAct2VisionTransformer
-
-    >>> # Initializing a MolmoAct2VitConfig
-    >>> configuration = MolmoAct2VitConfig()
-
-    >>> # Initializing a MolmoAct2VisionTransformer (with random weights)
-    >>> model = MolmoAct2VisionTransformer(configuration)
-
-    >>> # Accessing the model configuration
-    >>> configuration = model.config
-    ```"""
-
-    model_type = "molmoact2"
-    base_config_key = "vit_config"
-
-    def __init__(
-        self,
-        hidden_size: int = 1152,
-        intermediate_size: int = 4304,
-        num_hidden_layers: int = 27,
-        num_attention_heads: int = 16,
-        num_key_value_heads: int = 16,
-        head_dim: int = 72,
-        hidden_act: str = "gelu_pytorch_tanh",
-        layer_norm_eps: float = 1e-6,
-        image_default_input_size: tuple[int, int] = (378, 378),
-        image_patch_size: int = 14,
-        image_num_pos: int = 577,
-        attention_dropout: float = 0.0,
-        residual_dropout: float = 0.0,
-        initializer_range: float = 0.02,
-        float32_attention: bool = True,
-        attn_implementation: str = "eager",
-        **kwargs,
-    ):
-        self.attn_implementation = attn_implementation
-        super().__init__(attn_implementation=attn_implementation, **kwargs)
-        self.hidden_size = hidden_size
-        self.intermediate_size = intermediate_size
-        self.num_hidden_layers = num_hidden_layers
-        self.num_attention_heads = num_attention_heads
-        self.num_key_value_heads = num_key_value_heads
-        self.head_dim = head_dim
-        self.hidden_act = hidden_act
-        self.layer_norm_eps = layer_norm_eps
-        self.image_default_input_size = image_default_input_size
-        self.image_patch_size = image_patch_size
-        self.image_num_pos = image_num_pos
-        self.attention_dropout = attention_dropout
-        self.residual_dropout = residual_dropout
-        self.initializer_range = initializer_range
-        self.float32_attention = float32_attention
-
-    @property
-    def image_num_patch(self):
-        h, w = self.image_default_input_size
-        return h // self.image_patch_size, w // self.image_patch_size
-
-
-class MolmoAct2AdapterConfig(PretrainedConfig):
-    r"""
-    This is the configuration class to store the configuration of MolmoAct2Adapter. Together with
-    MolmoAct2VitConfig, it is used to instantiate a MolmoAct2VisionBackbone according to the specified
-    arguments, defining the model architecture.
-
-    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
-    documentation from [`PretrainedConfig`] for more information.
-
-    Example:
-
-    ```python
-    >>> from transformers import MolmoAct2VitConfig, MolmoAct2AdapterConfig, MolmoAct2VisionBackbone
-
-    >>> # Initializing a MolmoAct2VitConfig and a MolmoAct2AdapterConfig
-    >>> vit_config = MolmoAct2VitConfig()
-    >>> adapter_config = MolmoAct2AdapterConfig()
-
-    >>> # Initializing a MolmoAct2VisionBackbone (with random weights)
-    >>> model = MolmoAct2VisionBackbone(vit_config, adapter_config)
-
-    >>> # Accessing the model configuration
-    >>> vit_configuration = model.vit_config
-    >>> adapter_configuration = model.adapter_config
-    ```"""
-
-    model_type = "molmoact2"
-    base_config_key = "adapter_config"
-
-    def __init__(
-        self,
-        vit_layers: tuple = (-3, -9),
-        pooling_attention_mask: bool = False,
-        hidden_size: int = 1152,
-        num_attention_heads: int = 16,
-        num_key_value_heads: int = 16,
-        head_dim: int = 72,
-        float32_attention: bool = True,
-        attention_dropout: float = 0.0,
-        residual_dropout: float = 0.0,
-        hidden_act: str = "silu",
-        intermediate_size: int = 18944,
-        text_hidden_size: int = 3584,
-        image_feature_dropout: float = 0.0,
-        initializer_range: float = 0.02,
-        attn_implementation: str = "eager",
-        **kwargs,
-    ):
-        self.attn_implementation = attn_implementation
-        super().__init__(attn_implementation=attn_implementation, **kwargs)
-        self.vit_layers = vit_layers
-        self.pooling_attention_mask = pooling_attention_mask
-        self.hidden_size = hidden_size
-        self.num_attention_heads = num_attention_heads
-        self.num_key_value_heads = num_key_value_heads
-        self.head_dim = head_dim
-        self.float32_attention = float32_attention
-        self.attention_dropout = attention_dropout
-        self.residual_dropout = residual_dropout
-        self.hidden_act = hidden_act
-        self.intermediate_size = intermediate_size
-        self.text_hidden_size = text_hidden_size
-        self.image_feature_dropout = image_feature_dropout
-        self.initializer_range = initializer_range
-
-
-class MolmoAct2TextConfig(PretrainedConfig):
-    r"""
-    This is the configuration class to store the configuration of a [`MolmoAct2TextModel`]. It is used to instantiate a
-    `MolmoAct2TextModel` according to the specified arguments, defining the model architecture.
-
-    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
-    documentation from [`PretrainedConfig`] for more information.
-
-    Example:
-    ```python
-    >>> from transformers import MolmoAct2TextConfig, MolmoAct2TextModel
-
-    >>> # Initializing a MolmoAct2TextConfig
-    >>> configuration = MolmoAct2TextConfig()
-
-    >>> # Initializing a MolmoAct2TextModel (with random weights)
-    >>> model = MolmoAct2TextModel(configuration)
-
-    >>> # Accessing the model configuration
-    >>> configuration = model.config
-    ```"""
-
-    model_type = "molmoact2_text"
-    base_config_key = "text_config"
-    keys_to_ignore_at_inference = ["past_key_values"]
-    base_model_tp_plan = {
-        "blocks.*.self_attn.att_proj": "colwise",
-        "blocks.*.self_attn.attn_out": "rowwise",
-        "blocks.*.mlp.ff_proj": "colwise",
-        "blocks.*.mlp.ff_out": "rowwise",
-    }
-    base_model_pp_plan = {
-        "wte": (["input_ids"], ["inputs_embeds"]),
-        "blocks": (["hidden_states", "attention_mask"], ["hidden_states"]),
-        "ln_f": (["hidden_states"], ["hidden_states"]),
-    }
-
-    def __init__(
-        self,
-        hidden_size: int = 3584,
-        num_attention_heads: int = 28,
-        num_key_value_heads: int | None = 4,
-        head_dim: int = 128,
-        vocab_size: int = 152064,
-        additional_vocab_size: int = 128,
-        qkv_bias: bool = True,
-        num_hidden_layers: int = 48,
-        intermediate_size: int = 18944,
-        hidden_act: str = "silu",
-        embedding_dropout: float = 0.0,
-        attention_dropout: float = 0.0,
-        residual_dropout: float = 0.0,
-        max_position_embeddings: int = 4096,
-        rope_theta: float = 1000000.0,
-        rope_scaling: dict[str, Any] | None = None,
-        rope_scaling_layers: list[int] | None = None,
-        use_qk_norm: bool = False,
-        qk_norm_type: str = "olmo",
-        layer_norm_eps: float = 1e-6,
-        norm_after: bool = False,
-        initializer_range: float = 0.02,
-        use_cache=True,
-        tie_word_embeddings=False,
-        attn_implementation: str = "eager",
-        **kwargs,
-    ):
-        self.attn_implementation = attn_implementation
-        super().__init__(
-            tie_word_embeddings=tie_word_embeddings, attn_implementation=attn_implementation, **kwargs
-        )
-        self.hidden_size = hidden_size
-        self.num_attention_heads = num_attention_heads
-        if num_key_value_heads is None:
-            num_key_value_heads = num_attention_heads
-        self.num_key_value_heads = num_key_value_heads
-        self.head_dim = head_dim
-        self.vocab_size = vocab_size
-        self.additional_vocab_size = additional_vocab_size
-        self.qkv_bias = qkv_bias
-        self.num_hidden_layers = num_hidden_layers
-        self.intermediate_size = intermediate_size
-        self.hidden_act = hidden_act
-        self.embedding_dropout = embedding_dropout
-        self.attention_dropout = attention_dropout
-        self.residual_dropout = residual_dropout
-        self.max_position_embeddings = max_position_embeddings
-        self.rope_theta = rope_theta
-        self.rope_scaling = rope_scaling
-        self.rope_scaling_layers = rope_scaling_layers
-        self.use_qk_norm = use_qk_norm
-        self.qk_norm_type = qk_norm_type
-        self.layer_norm_eps = layer_norm_eps
-        self.norm_after = norm_after
-        self.initializer_range = initializer_range
-        self.use_cache = use_cache
-
-        # Validate the correctness of rotary position embeddings parameters
-        rope_config_validation(self)
-
-
-class MolmoAct2ActionExpertConfig(PretrainedConfig):
-    r"""Configuration for the MolmoAct2 modern action expert."""
-
-    model_type = "molmoact2_action_expert"
-    base_config_key = "action_expert_config"
-
-    def __init__(
-        self,
-        max_action_horizon: int = 32,
-        max_action_dim: int = 32,
-        hidden_size: int = 1024,
-        num_layers: int = 32,
-        num_heads: int = 16,
-        mlp_ratio: float = 8.0 / 3.0,
-        ffn_multiple_of: int = 256,
-        timestep_embed_dim: int = 256,
-        dropout: float = 0.0,
-        attn_dropout: float = 0.0,
-        context_layer_norm: bool = True,
-        qk_norm: bool = True,
-        qk_norm_eps: float = 1e-6,
-        rope: bool = True,
-        causal_attn: bool = False,
-        **kwargs,
-    ):
-        super().__init__(**kwargs)
-        self.max_action_horizon = max_action_horizon
-        self.max_action_dim = max_action_dim
-        self.hidden_size = hidden_size
-        self.num_layers = num_layers
-        self.num_heads = num_heads
-        self.mlp_ratio = mlp_ratio
-        self.ffn_multiple_of = ffn_multiple_of
-        self.timestep_embed_dim = timestep_embed_dim
-        self.dropout = dropout
-        self.attn_dropout = attn_dropout
-        self.context_layer_norm = context_layer_norm
-        self.qk_norm = qk_norm
-        self.qk_norm_eps = qk_norm_eps
-        self.rope = rope
-        self.causal_attn = causal_attn
-
-    def to_dict(self):
-        output = super().to_dict()
-        # These are derived from the parent MolmoAct2Config for HF exports. Keeping
-        # them out of the public nested config avoids duplicated sources of truth.
-        output.pop("max_action_horizon", None)
-        output.pop("max_action_dim", None)
-        return output
-
-
-class MolmoAct2Config(PretrainedConfig):
-    r"""
-    This is the configuration class to store the configuration of a [`MolmoAct2ForConditionalGeneration`].
-    It is used to instantiate a MolmoAct2 model according to the specified arguments, defining the model architecture.
-
-    Example:
-
-    ```python
-    >>> from transformers import MolmoAct2Config, MolmoAct2VitConfig, MolmoAct2AdapterConfig, MolmoAct2TextConfig
-
-    >>> # Initializing a MolmoAct2VitConfig
-    >>> vit_config = MolmoAct2VitConfig()
-
-    >>> # Initializing a MolmoAct2AdapterConfig
-    >>> adapter_config = MolmoAct2AdapterConfig()
-
-    >>> # Initializing a MolmoAct2TextConfig
-    >>> text_config = MolmoAct2TextConfig()
-
-    >>> # Initializing a MolmoAct2Config
-    >>> configuration = MolmoAct2Config(
-    ...     vit_config=vit_config,
-    ...     adapter_config=adapter_config,
-    ...     text_config=text_config,
-    ...     image_start_token_id=151936,
-    ...     image_end_token_id=151937,
-    ...     image_patch_id=151938,
-    ...     image_col_id=151939,
-    ...     low_res_image_start_token_id=151940,
-    ...     image_low_res_id=151942,
-    ...     frame_start_token_id=151943,
-    ...     frame_end_token_id=151944,
-    ... )
-
-    >>> # Initializing a model
-    >>> model = MolmoAct2ForConditionalGeneration(configuration)
-
-    >>> # Accessing the model configuration
-    >>> configuration = model.config
-    ```"""
-
-    model_type = "molmoact2"
-    sub_configs = {
-        "text_config": MolmoAct2TextConfig,
-        "vit_config": MolmoAct2VitConfig,
-        "adapter_config": MolmoAct2AdapterConfig,
-        "action_expert_config": MolmoAct2ActionExpertConfig,
-    }
-
-    def __init__(
-        self,
-        vit_config: MolmoAct2VitConfig = None,
-        adapter_config: MolmoAct2AdapterConfig = None,
-        text_config: MolmoAct2TextConfig = None,
-        action_expert_config: MolmoAct2ActionExpertConfig = None,
-        image_start_token_id: int = None,
-        low_res_image_start_token_id: int = None,
-        image_end_token_id: int = None,
-        image_low_res_id: int = None,
-        image_patch_id: int = None,
-        image_col_id: int = None,
-        frame_start_token_id: int = None,
-        frame_end_token_id: int = None,
-        use_frame_special_tokens: bool = True,
-        initializer_range: float = 0.02,
-        add_action_expert: bool = True,
-        max_action_dim: int = 32,
-        max_action_horizon: int = 30,
-        n_obs_steps: int = 30,
-        action_mode: str = "both",
-        state_format: str = "discrete",
-        flow_matching_num_steps: int = 10,
-        flow_matching_cutoff: float = 1.0,
-        flow_matching_time_offset: float = 0.001,
-        flow_matching_time_scale: float = 0.999,
-        flow_matching_beta_alpha: float = 1.0,
-        flow_matching_beta_beta: float = 1.5,
-        mask_action_dim_padding: bool = True,
-        enable_depth_reasoning: bool = False,
-        depth_mode: int = 2,
-        num_depth_codes: int = 100,
-        action_expert_depth_gate: bool = False,
-        action_expert_depth_gate_per_layer: bool = False,
-        action_expert_depth_gate_init_bias: float = -4.0,
-        action_output_token_id: int = None,
-        action_start_token_id: int = None,
-        action_end_token_id: int = None,
-        action_token_start_id: int = None,
-        num_action_tokens: int = 0,
-        depth_output_token_id: int = None,
-        depth_start_token_id: int = None,
-        depth_end_token_id: int = None,
-        depth_token_start_id: int = None,
-        num_depth_tokens: int = 0,
-        state_start_token_id: int = None,
-        state_end_token_id: int = None,
-        state_token_start_id: int = None,
-        num_state_tokens: int = 0,
-        add_setup_tokens: bool = True,
-        add_control_tokens: bool = True,
-        norm_stats_filename: str = "norm_stats.json",
-        **kwargs,
-    ):
-        super().__init__(**kwargs)
-        if vit_config is None:
-            self.vit_config = MolmoAct2VitConfig()
-        elif isinstance(vit_config, dict):
-            self.vit_config = MolmoAct2VitConfig(**vit_config)
-        else:
-            self.vit_config = vit_config
-        if adapter_config is None:
-            self.adapter_config = MolmoAct2AdapterConfig()
-        elif isinstance(adapter_config, dict):
-            self.adapter_config = MolmoAct2AdapterConfig(**adapter_config)
-        else:
-            self.adapter_config = adapter_config
-        if text_config is None:
-            self.text_config = MolmoAct2TextConfig()
-        elif isinstance(text_config, dict):
-            self.text_config = MolmoAct2TextConfig(**text_config)
-        else:
-            self.text_config = text_config
-        self.add_action_expert = bool(add_action_expert)
-        if not self.add_action_expert:
-            self.action_expert_config = None
-        elif action_expert_config is None:
-            self.action_expert_config = MolmoAct2ActionExpertConfig(
-                max_action_horizon=max_action_horizon,
-                max_action_dim=max_action_dim,
-                num_layers=self.text_config.num_hidden_layers,
-            )
-        elif isinstance(action_expert_config, dict):
-            self.action_expert_config = MolmoAct2ActionExpertConfig(**action_expert_config)
-        else:
-            self.action_expert_config = action_expert_config
-        if self.add_action_expert:
-            self.action_expert_config.max_action_dim = int(max_action_dim)
-            self.action_expert_config.max_action_horizon = int(max_action_horizon)
-            self._validate_release_action_config(
-                state_format=state_format,
-            )
-        self.image_start_token_id = image_start_token_id
-        self.low_res_image_start_token_id = low_res_image_start_token_id
-        self.image_end_token_id = image_end_token_id
-        self.image_low_res_id = image_low_res_id
-        self.image_high_res_id = image_patch_id
-        self.image_patch_id = image_patch_id
-        self.image_col_id = image_col_id
-        self.frame_start_token_id = frame_start_token_id
-        self.frame_end_token_id = frame_end_token_id
-        self.use_frame_special_tokens = use_frame_special_tokens
-        self.initializer_range = initializer_range
-        self.max_action_dim = max_action_dim
-        self.max_action_horizon = max_action_horizon
-        self.n_obs_steps = n_obs_steps
-        self.action_mode = action_mode
-        self.state_format = state_format
-        self.flow_matching_num_steps = flow_matching_num_steps
-        self.flow_matching_cutoff = flow_matching_cutoff
-        self.flow_matching_time_offset = flow_matching_time_offset
-        self.flow_matching_time_scale = flow_matching_time_scale
-        self.flow_matching_beta_alpha = flow_matching_beta_alpha
-        self.flow_matching_beta_beta = flow_matching_beta_beta
-        self.mask_action_dim_padding = mask_action_dim_padding
-        self.enable_depth_reasoning = enable_depth_reasoning
-        self.depth_mode = depth_mode
-        self.num_depth_codes = num_depth_codes
-        self.action_expert_depth_gate = action_expert_depth_gate
-        self.action_expert_depth_gate_per_layer = action_expert_depth_gate_per_layer
-        self.action_expert_depth_gate_init_bias = action_expert_depth_gate_init_bias
-        self.action_output_token_id = action_output_token_id
-        self.action_start_token_id = action_start_token_id
-        self.action_end_token_id = action_end_token_id
-        self.action_token_start_id = action_token_start_id
-        self.num_action_tokens = num_action_tokens
-        self.depth_output_token_id = depth_output_token_id
-        self.depth_start_token_id = depth_start_token_id
-        self.depth_end_token_id = depth_end_token_id
-        self.depth_token_start_id = depth_token_start_id
-        self.num_depth_tokens = num_depth_tokens
-        self.state_start_token_id = state_start_token_id
-        self.state_end_token_id = state_end_token_id
-        self.state_token_start_id = state_token_start_id
-        self.num_state_tokens = num_state_tokens
-        self.add_setup_tokens = add_setup_tokens
-        self.add_control_tokens = add_control_tokens
-        self.norm_stats_filename = norm_stats_filename
-
-    @staticmethod
-    def _validate_release_action_config(
-        *,
-        state_format: str,
-    ) -> None:
-        if state_format != "discrete":
-            raise ValueError("MolmoAct2 HF export supports only state_format='discrete'.")
-
-    @property
-    def image_num_patch(self):
-        assert self.vit_config is not None
-        return self.vit_config.image_num_patch
-
-    @property
-    def num_attention_heads(self):
-        return self.text_config.num_attention_heads
-
-    @property
-    def num_key_value_heads(self):
-        return self.text_config.num_key_value_heads
-
-    @property
-    def head_dim(self):
-        return self.text_config.head_dim
-
-    @property
-    def num_hidden_layers(self):
-        return self.text_config.num_hidden_layers
-
-    @property
-    def hidden_size(self):
-        return self.text_config.hidden_size
-
-    @property
-    def vocab_size(self):
-        return self.text_config.vocab_size
-
-    @property
-    def max_position_embeddings(self):
-        return self.text_config.max_position_embeddings
-
-
-MolmoAct2VitConfig.register_for_auto_class()
-MolmoAct2AdapterConfig.register_for_auto_class()
-MolmoAct2TextConfig.register_for_auto_class()
-MolmoAct2ActionExpertConfig.register_for_auto_class()
-MolmoAct2Config.register_for_auto_class()
--- a/src/lerobot/policies/molmoact2/hf_model/image_processing_molmoact2.py
+++ b/src/lerobot/policies/molmoact2/hf_model/image_processing_molmoact2.py
@@ -1,564 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-# ruff: noqa
-
-"""Image processor class for MolmoAct2"""
-
-from typing import Optional, Union
-import numpy as np
-import einops
-import torch
-import torchvision.transforms
-
-from transformers.image_utils import (
-    IMAGENET_STANDARD_MEAN,
-    IMAGENET_STANDARD_STD,
-    ImageInput,
-    PILImageResampling,
-    make_flat_list_of_images,
-    valid_images,
-    to_numpy_array,
-)
-from transformers.image_transforms import convert_to_rgb
-from transformers.processing_utils import ImagesKwargs
-from transformers.image_processing_utils import BaseImageProcessor, get_size_dict
-from transformers.utils import logging
-from transformers.feature_extraction_utils import BatchFeature
-from transformers.utils import TensorType, logging
-
-
-logger = logging.get_logger(__name__)
-
-
-def normalize_image(
-    image: np.ndarray,
-    image_mean: list[float],
-    image_std: list[float],
-) -> np.ndarray:
-    if np.allclose(image_mean, [0.5, 0.5, 0.5]) and np.allclose(image_std, [0.5, 0.5, 0.5]):
-        return image * np.asarray(2.0, dtype=np.float32) - np.asarray(1.0, dtype=np.float32)
-    image -= np.array(image_mean, dtype=np.float32)[None, None, :]
-    image /= np.array(image_std, dtype=np.float32)[None, None, :]
-    return image
-
-
-def resize_image(
-    image: np.ndarray,
-    desired_output_size: list[int],
-    resample: PILImageResampling,
-) -> np.ndarray:
-    image = torch.permute(torch.from_numpy(image), [2, 0, 1])
-    dtype = image.dtype
-    if torch.is_floating_point(image):
-        in_min = 0.0
-        in_max = 1.0
-        resized = torchvision.transforms.Resize(
-            desired_output_size,
-            resample,
-            antialias=False,
-        )(image)
-        resized = torch.clip(resized, 0.0, 1.0).to(dtype)
-    else:
-        assert image.dtype == torch.uint8, "SigLIP expects float images or uint8 images, but got {}".format(
-            image.dtype
-        )
-        in_min = 0.0
-        in_max = 255.0
-        resized = torchvision.transforms.Resize(
-            desired_output_size,
-            resample,
-            antialias=False,
-        )(image)
-        resized = torch.clip(resized, 0, 255).to(dtype)
-
-    resized = resized.to(torch.float32)
-    resized = (resized - in_min) / (in_max - in_min)
-
-    resized = torch.permute(resized, [1, 2, 0]).numpy()
-
-    return resized
-
-
-def select_tiling(h, w, patch_size, max_num_crops):
-    """Divide an image of size [h, w] into up to max_num_crops tiles of size patch_size"""
-    original_size = np.stack([h, w])  # [1, 2]
-    original_res = h * w
-    tilings = []
-    for i in range(1, max_num_crops + 1):
-        for j in range(1, max_num_crops + 1):
-            if i * j <= max_num_crops:
-                tilings.append((i, j))
-    # sort so argmin and argmax favour smaller tilings in the event of a tie
-    tilings.sort(key=lambda x: (x[0] * x[1], x[0]))
-    candidate_tilings = np.array(tilings, dtype=np.int32)  # [n_resolutions, 2]
-    candidate_resolutions = candidate_tilings * patch_size  # [n_resolutions, 2]
-
-    # How much we would need to scale the image to fit exactly in each tiling
-    original_size = np.stack([h, w], dtype=np.float32)  # [1, 2]
-
-    # The original size can be zero in rare cases if the image is smaller than the margin
-    # In those cases letting the scale become infinite means the tiling is based on the
-    # other side, or falls back to the smallest tiling
-    with np.errstate(divide="ignore"):
-        required_scale_d = (candidate_resolutions.astype(np.float32) / original_size,)
-    required_scale = np.min(required_scale_d, axis=-1, keepdims=True)  # [n_resolutions, 1]
-    if np.all(required_scale < 1):
-        # We are forced to downscale, so try to minimize the amount of downscaling
-        ix = np.argmax(required_scale)
-    else:
-        # Pick the resolution that required the least upscaling so that it most closely fits the image
-        required_scale = np.where(required_scale < 1.0, 10e9, required_scale)
-        ix = np.argmin(required_scale)
-    return candidate_tilings[ix]
-
-
-def build_resized_image(
-    image: np.ndarray,
-    base_image_input_size: list[int],
-    resample: PILImageResampling,
-    image_mean: list[float],
-    image_std: list[float],
-    image_patch_size: int,
-) -> tuple[np.ndarray, np.ndarray]:
-    resized = resize_image(
-        image,
-        base_image_input_size,
-        resample,
-    )
-    resized = normalize_image(resized, image_mean, image_std)
-    if len(resized.shape) == 3:
-        resized = np.expand_dims(resized, 0)
-    crop_patch_w = base_image_input_size[1] // image_patch_size
-    crop_patch_h = base_image_input_size[0] // image_patch_size
-    resize_idx = np.arange(crop_patch_w * crop_patch_h).reshape([crop_patch_h, crop_patch_w])
-    return resized, resize_idx
-
-
-def build_overlapping_crops(
-    image: np.ndarray,
-    max_crops: int,
-    overlap_margins: list[int],
-    base_image_input_size: list[int],
-    resample: PILImageResampling,
-    image_mean: list[float],
-    image_std: list[float],
-    image_patch_size: int,
-) -> tuple[np.ndarray, np.ndarray]:
-    """Decompose an image into a set of overlapping crops
-
-    :return crop_arr: [n_crops, h, w, 3] The crops
-    :return patch_idx: [overlap_patch_h, overlap_patch_w] For each patch in the resized image
-                        the crops were extracted from, what patch in `crop_arr` it corresponds to
-    """
-    original_image_h, original_image_w = image.shape[:2]
-    crop_size = base_image_input_size[0]
-    assert base_image_input_size[0] == base_image_input_size[1]
-
-    left_margin, right_margin = overlap_margins
-    total_margin_pixels = image_patch_size * (right_margin + left_margin)  # pixels removed per dim
-    crop_patches = base_image_input_size[0] // image_patch_size  # patches per crop dim
-    crop_window_patches = crop_patches - (right_margin + left_margin)  # usable patches
-    crop_window_size = crop_window_patches * image_patch_size
-    crop_patch_w = base_image_input_size[1] // image_patch_size
-    crop_patch_h = base_image_input_size[0] // image_patch_size
-    original_image_h, original_image_w = image.shape[:2]
-    crop_size = base_image_input_size[0]
-
-    # Decide how to tile the image. To account for the overlap margins, we compute the tiling
-    # as if we had an image without the margins and were using a crop size without the margins.
-    tiling = select_tiling(
-        original_image_h - total_margin_pixels,
-        original_image_w - total_margin_pixels,
-        crop_window_size,
-        max_crops,
-    )
-
-    src = resize_image(
-        image,
-        [
-            tiling[0] * crop_window_size + total_margin_pixels,
-            tiling[1] * crop_window_size + total_margin_pixels,
-        ],
-        resample,
-    )
-    src = normalize_image(src, image_mean, image_std)
-
-    # Now we have to split the image into crops, and track what patches came from
-    # where in `patch_idx_arr`
-    n_crops = tiling[0] * tiling[1]
-    crop_arr = np.zeros([n_crops, crop_size, crop_size, 3], dtype=src.dtype)
-    patch_idx_arr = np.zeros([n_crops, crop_patch_h, crop_patch_w], dtype=np.int32)
-    on_crop = 0
-    for i in range(tiling[0]):
-        # Slide over `src` in steps of `crop_window_size`, but extract crops of size `crop_size`,
-        # which results in overlapping crop windows
-        y0 = i * crop_window_size
-        for j in range(tiling[1]):
-            x0 = j * crop_window_size
-            crop_arr[on_crop] = src[y0 : y0 + crop_size, x0 : x0 + crop_size]
-            patch_idx = np.arange(crop_patch_w * crop_patch_h).reshape(crop_patch_h, crop_patch_w)
-            patch_idx += on_crop * crop_patch_h * crop_patch_w
-
-            # Mask out idx that are in the overlap region
-            if i != 0:
-                patch_idx[:left_margin, :] = -1
-            if j != 0:
-                patch_idx[:, :left_margin] = -1
-            if i != tiling[0] - 1:
-                patch_idx[-right_margin:, :] = -1
-            if j != tiling[1] - 1:
-                patch_idx[:, -right_margin:] = -1
-            patch_idx_arr[on_crop] = patch_idx
-            on_crop += 1
-
-    # `patch_idx_arr` is ordered crop-by-crop; here we transpose it
-    # so that it is in left-to-right order
-    patch_idx_arr = np.reshape(patch_idx_arr, [tiling[0], tiling[1], crop_patch_h, crop_patch_w])
-    patch_idx_arr = np.transpose(patch_idx_arr, [0, 2, 1, 3])
-    patch_idx_arr = np.reshape(patch_idx_arr, [-1])
-
-    # Now get the parts not in the overlap region, so it should map each patch in `src`
-    # to the correct patch it should come from in `crop_arr`
-    patch_idx_arr = patch_idx_arr[patch_idx_arr >= 0].reshape(
-        src.shape[0] // image_patch_size,
-        src.shape[1] // image_patch_size,
-    )
-    return crop_arr, patch_idx_arr
-
-
-def batch_pixels_to_patches(array: np.ndarray, patch_size: int) -> np.ndarray:
-    """Reshape images of [n_images, h, w, 3] -> [n_images, n_patches, pixels_per_patch]"""
-    if len(array.shape) == 3:
-        n_crops, h, w = array.shape
-        h_patches = h // patch_size
-        w_patches = w // patch_size
-        array = np.reshape(array, [n_crops, h_patches, patch_size, w_patches, patch_size])
-        array = np.transpose(array, [0, 1, 3, 2, 4])
-        array = np.reshape(array, [n_crops, h_patches * w_patches, patch_size * patch_size])
-        return array
-    else:
-        n_crops, h, w, c = array.shape
-        h_patches = h // patch_size
-        w_patches = w // patch_size
-        array = np.reshape(array, [n_crops, h_patches, patch_size, w_patches, patch_size, c])
-        array = np.transpose(array, [0, 1, 3, 2, 4, 5])
-        array = np.reshape(array, [n_crops, h_patches * w_patches, patch_size * patch_size * c])
-        return array
-
-
-def arange_for_pooling(
-    idx_arr: np.ndarray,
-    pool_h: int,
-    pool_w: int,
-) -> np.ndarray:
-    h_pad = pool_h * ((idx_arr.shape[0] + pool_h - 1) // pool_h) - idx_arr.shape[0]
-    w_pad = pool_w * ((idx_arr.shape[1] + pool_w - 1) // pool_w) - idx_arr.shape[1]
-    idx_arr = np.pad(
-        idx_arr,
-        [[h_pad // 2, (h_pad + 1) // 2], [w_pad // 2, (w_pad + 1) // 2]],
-        mode="constant",
-        constant_values=-1,
-    )
-    return einops.rearrange(idx_arr, "(h dh) (w dw) -> h w (dh dw)", dh=pool_h, dw=pool_w)
-
-
-def image_to_patches_and_grids(
-    image: np.ndarray,
-    max_crops: int,
-    overlap_margins: list[int],
-    base_image_input_size: list[int],
-    resample: PILImageResampling,
-    image_mean: list[float],
-    image_std: list[float],
-    image_patch_size: int,
-    image_pooling_w: int,
-    image_pooling_h: int,
-    crop_mode: str = "overlap-and-resize-c2",
-) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
-    """
-    :return image_grids, the shape of each (low-res, high-res) image after pooling
-    :return crops, the image crops to process with the ViT
-    :return pooled_patch_idx, for each patch_id tokens in `image_tokens`, the indices of the
-                                patches in `crops` to pool for that token, masked with -1
-    """
-    if isinstance(base_image_input_size, int):
-        base_image_input_size = (base_image_input_size, base_image_input_size)
-
-    base_image_input_d = image_patch_size
-    pooling_w = image_pooling_w
-    pooling_h = image_pooling_h
-    crop_patch_w = base_image_input_size[1] // base_image_input_d
-    crop_patch_h = base_image_input_size[0] // base_image_input_d
-
-    if crop_mode == "resize":
-        resized, resize_idx = build_resized_image(
-            image,
-            base_image_input_size,
-            resample,
-            image_mean,
-            image_std,
-            image_patch_size,
-        )
-        resize_idx = arange_for_pooling(resize_idx, pooling_h, pooling_w)
-        resized_h, resized_w = resize_idx.shape[:2]
-        resize_idx = resize_idx.reshape([-1, pooling_h * pooling_w])
-        image_grid = [np.array([resized_h, resized_w, 0, 0])]
-        return (
-            np.stack(image_grid, 0),
-            batch_pixels_to_patches(resized, image_patch_size),
-            resize_idx,
-        )
-
-    if crop_mode not in {"overlap-and-resize-c2", "overlap-and-resize"}:
-        raise ValueError(f"Unsupported MolmoAct2 image crop_mode {crop_mode!r}.")
-
-    crop_arr, patch_idx_arr = build_overlapping_crops(
-        image,
-        max_crops,
-        overlap_margins,
-        base_image_input_size,
-        resample,
-        image_mean,
-        image_std,
-        image_patch_size,
-    )
-    pooling_idx = arange_for_pooling(patch_idx_arr, pooling_h, pooling_w)
-    h, w = pooling_idx.shape[:2]
-    pooling_idx = pooling_idx.reshape([-1, pooling_h * pooling_w])
-
-    # Finally do the same for the global image
-    resized, resize_idx = build_resized_image(
-        image,
-        base_image_input_size,
-        resample,
-        image_mean,
-        image_std,
-        image_patch_size,
-    )
-    crop_arr = np.concatenate([resized, crop_arr], 0)
-
-    resize_idx = arange_for_pooling(resize_idx, pooling_h, pooling_w)
-    resized_h, resized_w = resize_idx.shape[:2]
-    resize_idx = resize_idx.reshape([-1, pooling_h * pooling_w])
-
-    # The global image goes first, so patch indices of the crops are offset by its patch count
-    pooling_idx = np.where(pooling_idx >= 0, pooling_idx + crop_patch_h * crop_patch_w, -1)
-    pooling_idx = np.concatenate([resize_idx, pooling_idx])
-    image_grid = [np.array([resized_h, resized_w, h, w])]
-
-    return (np.stack(image_grid, 0), batch_pixels_to_patches(crop_arr, image_patch_size), pooling_idx)
-
-
-class MolmoAct2ImagesKwargs(ImagesKwargs, total=False):
-    max_crops: int | None
-    overlap_margins: list[int] | None
-    crop_mode: str | None
-    patch_size: int | None
-    pooling_size: list[int] | None
-
-
-class MolmoAct2ImageProcessor(BaseImageProcessor):
-    r"""
-    Constructs a MolmoAct2 image processor that preprocesses images for the model.
-
-    Args:
-        size (`dict[str, int]` *optional*, defaults to `{"height": 378, "width": 378}`):
-            Size of the image after resizing.
-        resample (`PILImageResampling`, *optional*, defaults to `Resampling.BILINEAR`):
-            Resampling filter to use when resizing the image.
-        image_mean (`float` or `list[float]`, *optional*, defaults to `[0.5, 0.5, 0.5]`):
-            Mean to use if normalizing the image. This is a float or list of floats for each channel in the image.
-        image_std (`float` or `list[float]`, *optional*, defaults to `[0.5, 0.5, 0.5]`):
-            Standard deviation to use if normalizing the image. This is a float or list of floats for each channel in the image.
-        do_convert_rgb (`bool`, *optional*, defaults to `True`):
-            Whether to convert the image to RGB.
-        max_crops (`int`, *optional*, defaults to `8`):
-            Maximum number of crops to use per image.
-        overlap_margins (`list[int]`, *optional*, defaults to `[4, 4]`):
-            Overlap margins to use.
-        patch_size (`int`, *optional*, defaults to 14):
-            The spatial patch size of the vision encoder.
-        pooling_size (`list[int]`, *optional*, defaults to `[2, 2]`):
-            The pooling size of the vision adapter.
-    """
-
-    model_input_names = ["pixel_values", "image_token_pooling", "image_grids", "image_num_crops"]
-
-    def __init__(
-        self,
-        size: dict[str, int] | None = None,
-        resample: PILImageResampling = PILImageResampling.BILINEAR,
-        image_mean: float | list[float] | None = None,
-        image_std: float | list[float] | None = None,
-        do_convert_rgb: bool = True,
-        max_crops: int = 8,
-        overlap_margins: list[int] = [4, 4],
-        crop_mode: str = "overlap-and-resize-c2",
-        patch_size: int = 14,
-        pooling_size: list[int] = [2, 2],
-        **kwargs,
-    ) -> None:
-        super().__init__(**kwargs)
-        size = size if size is not None else {"height": 378, "width": 378}
-        size = get_size_dict(size, default_to_square=True)
-        self.size = size
-
-        self.resample = resample
-        self.image_mean = image_mean if image_mean is not None else IMAGENET_STANDARD_MEAN
-        self.image_std = image_std if image_std is not None else IMAGENET_STANDARD_STD
-        self.do_convert_rgb = do_convert_rgb
-
-        self.max_crops = max_crops
-        self.overlap_margins = overlap_margins
-        self.crop_mode = crop_mode
-        self.patch_size = patch_size
-        self.pooling_size = pooling_size
-
-    def preprocess(
-        self,
-        images: ImageInput,
-        size: dict[str, int] | None = None,
-        resample: PILImageResampling | None = None,
-        image_mean: float | list[float] | None = None,
-        image_std: float | list[float] | None = None,
-        do_convert_rgb: bool | None = None,
-        max_crops: int | None = None,
-        overlap_margins: list[int] | None = None,
-        crop_mode: str | None = None,
-        patch_size: int | None = None,
-        pooling_size: list[int] | None = None,
-        return_tensors: str | TensorType | None = None,
-        **kwargs,
-    ) -> BatchFeature:
-        """
-        Args:
-            images (`ImageInput`):
-                Image to preprocess.
-            size (`dict[str, int]`, *optional*, defaults to `self.size`):
-                Size of the image after resizing.
-            resample (`PILImageResampling`, *optional*, defaults to `self.resample`):
-                Resampling filter to use when resizing the image. Must be a member of the
-                `PILImageResampling` enum.
-            image_mean (`float` or `list[float]`, *optional*, defaults to `self.image_mean`):
-                Image mean to use for normalization.
-            image_std (`float` or `list[float]`, *optional*, defaults to `self.image_std`):
-                Image standard deviation to use for normalization, as a single float or one value per
-                channel.
-            do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
-                Whether to convert the image to RGB.
-            max_crops (`int`, *optional*, defaults to `self.max_crops`):
-                Maximum number of crops to use per image.
-            overlap_margins (`list[int]`, *optional*, defaults to `self.overlap_margins`):
-                Overlap margins to use.
-            patch_size (`int`, *optional*, defaults to `self.patch_size`):
-                The spatial patch size of the vision encoder.
-            pooling_size (`list[int]`, *optional*, defaults to `self.pooling_size`):
-                The pooling size of the vision adapter.
-            return_tensors (`str` or `TensorType`, *optional*):
-                The type of tensors to return. Can be one of:
-                - Unset: Return a list of `np.ndarray`.
-                - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
-                - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
-                - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
-                - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
-
-        Returns:
-            A `BatchFeature` containing the following keys:
-                - `pixel_values`: The preprocessed images.
-                - `image_token_pooling`: The indices of the patches in `crops` to pool for each token in `image_tokens`.
-                - `image_grids`: The image grids.
-                - `image_num_crops`: The number of crops for each image.
-        """
-        if size is not None:
-            if "height" not in size or "width" not in size:
-                raise ValueError("size must contain 'height' and 'width' keys.")
-        else:
-            size = {**self.size}
-
-        base_image_input_size = [size["height"], size["width"]]
-
-        resample = resample or self.resample
-        image_mean = image_mean or self.image_mean
-        image_std = image_std or self.image_std
-        do_convert_rgb = self.do_convert_rgb if do_convert_rgb is None else do_convert_rgb
-
-        max_crops = max_crops or self.max_crops
-        overlap_margins = overlap_margins or self.overlap_margins
-        crop_mode = crop_mode or self.crop_mode
-        patch_size = patch_size or self.patch_size
-        pooling_size = pooling_size or self.pooling_size
-
-        image_pooling_h, image_pooling_w = pooling_size
-
-        if images is not None:
-            images = self.fetch_images(images)
-            images = make_flat_list_of_images(images)
-
-        if images is not None and not valid_images(images):
-            raise ValueError(
-                "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
-                "torch.Tensor, tf.Tensor or jax.ndarray."
-            )
-
-        if do_convert_rgb:
-            images = [convert_to_rgb(image) for image in images]
-
-        # All transformations expect numpy arrays.
-        images = [to_numpy_array(image) for image in images]
-
-        data = {}
-        if images is not None:
-            batch_grids = []
-            batch_crops = []
-            batch_pooled_patches_idx = []
-            batch_num_crops = []
-
-            for image in images:
-                image_grid, crops, pooled_idx = image_to_patches_and_grids(
-                    image,
-                    max_crops,
-                    overlap_margins,
-                    base_image_input_size,
-                    resample,
-                    image_mean,
-                    image_std,
-                    patch_size,
-                    image_pooling_w,
-                    image_pooling_h,
-                    crop_mode,
-                )
-                batch_grids.append(image_grid)
-                batch_crops.append(crops)
-                batch_pooled_patches_idx.append(pooled_idx)
-                batch_num_crops.append(crops.shape[0])
-
-            pixel_values = np.concatenate(batch_crops, 0)
-            image_token_pooling = np.concatenate(batch_pooled_patches_idx, 0)
-            image_grids = np.concatenate(batch_grids, 0)
-            image_num_crops = np.array(batch_num_crops)
-
-            data.update(
-                pixel_values=pixel_values,
-                image_token_pooling=image_token_pooling,
-                image_grids=image_grids,
-                image_num_crops=image_num_crops,
-            )
-
-        return BatchFeature(data, tensor_type=return_tensors)
-
-
-MolmoAct2ImageProcessor.register_for_auto_class()
--- a/src/lerobot/policies/molmoact2/hf_model/inference.py
+++ b/src/lerobot/policies/molmoact2/hf_model/inference.py
@@ -1,748 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-# ruff: noqa
-
-"""Inference utilities for MolmoAct2"""
-
-from dataclasses import dataclass
-from typing import Any, Optional, Tuple
-from collections.abc import Iterable, Sequence
-
-import torch
-from torch.nn import functional as F
-from transformers.cache_utils import Cache
-from transformers.configuration_utils import PretrainedConfig
-
-
-@dataclass
-class _ActionFlowInputs:
-    trajectory: torch.Tensor
-    context: Any
-    modulations: Sequence[Any]
-    action_dim_is_pad: torch.Tensor | None
-
-
-@dataclass
-class _ActionFlowCudaGraph:
-    key: tuple[Any, ...]
-    graph: torch.cuda.CUDAGraph
-    static_inputs: _ActionFlowInputs
-    output: torch.Tensor
-
-
-@dataclass
-class _DepthDecodeCudaGraphLayerStage:
-    residual: torch.Tensor
-    query: torch.Tensor
-    key: torch.Tensor
-    value: torch.Tensor
-
-
-@dataclass
-class _DepthDecodeCudaGraphPostStage:
-    graph: torch.cuda.CUDAGraph
-    attn_context: torch.Tensor
-
-
-@dataclass
-class _DepthDecodeCudaGraph:
-    cache_key: tuple[Any, ...]
-    pre_graph: torch.cuda.CUDAGraph
-    token_ids: torch.Tensor
-    cos: torch.Tensor
-    sin: torch.Tensor
-    positions: torch.Tensor
-    stages: Sequence[_DepthDecodeCudaGraphLayerStage]
-    post_graphs: Sequence[_DepthDecodeCudaGraphPostStage]
-    output: torch.Tensor
-
-
-@dataclass
-class _DepthDecodeCudaGraphSpec:
-    eligible: bool
-    cache_key_prefix: tuple[Any, ...]
-    num_hidden_layers: int
-    head_dim: int
-    num_attention_heads: int
-
-
-def _cache_seq_len_int(past_key_values: Cache | None) -> int:
-    if past_key_values is None:
-        return 0
-    seq_len = past_key_values.get_seq_length()
-    if torch.is_tensor(seq_len):
-        return int(seq_len.item())
-    return int(seq_len)
-
-
-def _cache_max_len_int(past_key_values: Cache | None) -> int:
-    if past_key_values is None:
-        return -1
-    max_len = past_key_values.get_max_cache_shape()
-    if torch.is_tensor(max_len):
-        return int(max_len.item())
-    return int(max_len)
-
-
-def _iter_cache_key_values(
-    past_key_values: Cache,
-) -> Iterable[tuple[torch.Tensor | None, torch.Tensor | None]]:
-    layers = getattr(past_key_values, "layers", None)
-    if layers is not None:
-        for layer in layers:
-            yield getattr(layer, "keys", None), getattr(layer, "values", None)
-        return
-    for layer in past_key_values:
-        yield layer[0], layer[1]
-
-
-class _DepthDecodeStaticLayerCache:
-    is_compileable = False
-    is_sliding = False
-
-    def __init__(self, max_cache_len: int) -> None:
-        self.max_cache_len = int(max_cache_len)
-        self.cumulative_length = 0
-        self.keys: torch.Tensor | None = None
-        self.values: torch.Tensor | None = None
-
-    def _allocate(self, key_states: torch.Tensor, value_states: torch.Tensor) -> None:
-        bsz, n_heads = key_states.shape[:2]
-        self.keys = torch.empty(
-            (bsz, n_heads, self.max_cache_len, key_states.shape[-1]),
-            dtype=key_states.dtype,
-            device=key_states.device,
-        )
-        self.values = torch.empty(
-            (bsz, n_heads, self.max_cache_len, value_states.shape[-1]),
-            dtype=value_states.dtype,
-            device=value_states.device,
-        )
-
-    def update(
-        self,
-        key_states: torch.Tensor,
-        value_states: torch.Tensor,
-        *args,
-        **kwargs,
-    ) -> tuple[torch.Tensor, torch.Tensor]:
-        if self.keys is None:
-            self._allocate(key_states, value_states)
-        start = self.cumulative_length
-        end = start + key_states.shape[-2]
-        if end > self.max_cache_len:
-            raise RuntimeError(f"KV cache length {end} exceeds max_cache_len={self.max_cache_len}.")
-        self.keys[:, :, start:end, :].copy_(key_states)
-        self.values[:, :, start:end, :].copy_(value_states)
-        self.cumulative_length = end
-        return self.keys[:, :, :end, :], self.values[:, :, :end, :]
-
-    def get_seq_length(self) -> int:
-        return self.cumulative_length
-
-    def get_max_cache_shape(self) -> int:
-        return -1
-
-    def reset(self) -> None:
-        self.cumulative_length = 0
-
-
-class _DepthDecodeStaticCache(Cache):
-    def __init__(self, config: PretrainedConfig, max_cache_len: int) -> None:
-        text_config = config.get_text_config(decoder=True)
-        super().__init__(
-            layers=[
-                _DepthDecodeStaticLayerCache(max_cache_len=max_cache_len)
-                for _ in range(text_config.num_hidden_layers)
-            ]
-        )
-
-    def get_seq_length(self, layer_idx: int = 0) -> int:
-        return self.layers[layer_idx].get_seq_length()
-
-    def get_max_cache_shape(self, layer_idx: int = 0) -> int:
-        return self.layers[layer_idx].get_max_cache_shape()
-
-    def reset(self) -> None:
-        for layer in self.layers:
-            layer.reset()
-
-
-class ActionCudaGraphManager:
-    def __init__(self, model: Any) -> None:
-        self.model = model
-        self.enabled = True
-        self.action_flow_graph: _ActionFlowCudaGraph | None = None
-
-    def set_enabled(self, enabled: bool) -> None:
-        self.enabled = bool(enabled)
-
-    def can_use_action_flow(self, inputs: _ActionFlowInputs) -> bool:
-        action_model = self.model
-        if not self.enabled:
-            return False
-        if action_model.training or action_model._require_action_expert().training:
-            return False
-        if inputs.trajectory.device.type != "cuda":
-            return False
-
-        def all_on_cuda():
-            yield inputs.trajectory
-            for k, v in inputs.context.kv_contexts:
-                yield k
-                yield v
-            for t in (
-                inputs.context.cross_mask,
-                inputs.context.self_mask,
-                inputs.context.valid_action,
-                inputs.action_dim_is_pad,
-            ):
-                if t is not None:
-                    yield t
-            if inputs.context.rope_cache is not None:
-                yield from inputs.context.rope_cache
-            for step in inputs.modulations:
-                yield step.conditioning
-                for block_modulation in step.block_modulations:
-                    yield from block_modulation
-                yield from step.final_modulation
-
-        return all(t.device.type == "cuda" for t in all_on_cuda())
-
-    def run_action_flow(
-        self,
-        inputs: _ActionFlowInputs,
-        steps: int,
-        run_loop,
-    ) -> torch.Tensor:
-        key = _cuda_graph_key(inputs, steps)
-        cache = self.action_flow_graph
-        if cache is None or cache.key != key:
-            static_inputs = _clone_static_inputs(inputs)
-            graph, output = _capture_cuda_graph(
-                lambda: run_loop(static_inputs, steps),
-                inputs.trajectory.device,
-                after_warmup=lambda: static_inputs.trajectory.copy_(inputs.trajectory),
-            )
-            cache = _ActionFlowCudaGraph(
-                key=key,
-                graph=graph,
-                static_inputs=static_inputs,
-                output=output,
-            )
-            self.action_flow_graph = cache
-        else:
-            _copy_inputs_(cache.static_inputs, inputs)
-
-        cache.graph.replay()
-        return cache.output.clone()
-
-
-class DepthDecodeCudaGraphManager:
-    def __init__(self, model: Any) -> None:
-        self.model = model
-        self.backbone = model.model
-        self.enabled = True
-        self.graph: _DepthDecodeCudaGraph | None = None
-        self.graph_spec: _DepthDecodeCudaGraphSpec | None = None
-
-    def set_enabled(self, enabled: bool) -> None:
-        self.enabled = bool(enabled)
-
-    def make_static_cache(self, max_cache_len: int) -> _DepthDecodeStaticCache:
-        return _DepthDecodeStaticCache(
-            config=self.model.config.text_config,
-            max_cache_len=max_cache_len,
-        )
-
-    def _depth_decode_spec(self) -> _DepthDecodeCudaGraphSpec:
-        static = self.graph_spec
-        if static is None:
-            cfg = self.backbone.transformer.config
-            rotary_emb = getattr(self.backbone.transformer, "rotary_emb", None)
-            static = _DepthDecodeCudaGraphSpec(
-                eligible=(
-                    not cfg.norm_after
-                    and cfg.rope_scaling_layers is None
-                    and getattr(rotary_emb, "rope_type", None) == "default"
-                    and cfg._attn_implementation == "sdpa"
-                ),
-                cache_key_prefix=(
-                    cfg.hidden_size,
-                    cfg.num_attention_heads,
-                    cfg.num_key_value_heads,
-                    cfg.head_dim,
-                    cfg.num_hidden_layers,
-                    cfg.use_qk_norm,
-                    cfg.qk_norm_type,
-                    cfg._attn_implementation,
-                ),
-                num_hidden_layers=cfg.num_hidden_layers,
-                head_dim=cfg.head_dim,
-                num_attention_heads=cfg.num_attention_heads,
-            )
-            self.graph_spec = static
-        return static
-
-    def can_use(
-        self,
-        next_input_ids: torch.Tensor,
-        *,
-        past_key_values: Cache,
-        attention_bias: torch.Tensor,
-    ) -> bool:
-        if not self.enabled or self.model.training or self.backbone.transformer.training:
-            return False
-        if next_input_ids.device.type != "cuda":
-            return False
-        if next_input_ids.ndim != 2 or next_input_ids.shape[0] != 1 or next_input_ids.shape[1] != 1:
-            return False
-        if not isinstance(past_key_values, _DepthDecodeStaticCache):
-            return False
-        if not torch.is_tensor(attention_bias) or attention_bias.device != next_input_ids.device:
-            return False
-        return self._depth_decode_spec().eligible
-
-    def _depth_decode_key(
-        self,
-        next_input_ids: torch.Tensor,
-        attention_bias: torch.Tensor,
-    ) -> tuple[Any, ...]:
-        device = next_input_ids.device
-        return (
-            self._depth_decode_spec().cache_key_prefix,
-            device.type,
-            device.index,
-            self.model.lm_head.weight.dtype,
-            attention_bias.shape[-1],
-        )
-
-    def _select_depth_decode_rope(self, cos: torch.Tensor, sin: torch.Tensor, *, past_length: int) -> None:
-        emb = self.backbone.transformer.rotary_emb
-        cos.copy_(emb._pos_cos_cache[0, :, past_length : past_length + 1, :])
-        sin.copy_(emb._pos_sin_cache[0, :, past_length : past_length + 1, :])
-
-    def _depth_decode_pre_layer(
-        self,
-        layer_idx: int,
-        hidden_states: torch.Tensor,
-        cos: torch.Tensor,
-        sin: torch.Tensor,
-    ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
-        block = self.backbone.transformer.blocks[layer_idx]
-        attention = block.self_attn
-        residual = hidden_states
-        hidden_states = block.attn_norm(hidden_states)
-
-        input_shape = hidden_states.shape[:-1]
-        hidden_shape = (*input_shape, -1, attention.head_dim)
-        qkv = attention.att_proj(hidden_states)
-        query_states, key_states, value_states = qkv.split(attention.fused_dims, dim=-1)
-        value_states = value_states.view(hidden_shape)
-
-        apply_qk_norm = attention.q_norm is not None and attention.k_norm is not None
-        norm_after_view = apply_qk_norm and attention.qk_norm_type == "qwen3"
-
-        if apply_qk_norm and not norm_after_view:
-            query_states = attention.q_norm(query_states)
-            key_states = attention.k_norm(key_states)
-
-        query_states = query_states.view(hidden_shape)
-        key_states = key_states.view(hidden_shape)
-
-        if norm_after_view:
-            query_states = attention.q_norm(query_states)
-            key_states = attention.k_norm(key_states)
-
-        query_states = query_states.transpose(1, 2)
-        key_states = key_states.transpose(1, 2)
-        value_states = value_states.transpose(1, 2)
-        query_states, key_states = _apply_rotary_pos_emb(query_states, key_states, cos, sin)
-        return residual, query_states, key_states, value_states
-
-    def _depth_decode_pre0(
-        self,
-        token_ids: torch.Tensor,
-        cos: torch.Tensor,
-        sin: torch.Tensor,
-    ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
-        inputs_embeds = self.model._embed_base_tokens(token_ids)
-        return self._depth_decode_pre_layer(0, inputs_embeds, cos, sin)
-
-    def _depth_decode_post_layer(
-        self,
-        layer_idx: int,
-        residual: torch.Tensor,
-        attn_context: torch.Tensor,
-    ) -> torch.Tensor:
-        block = self.backbone.transformer.blocks[layer_idx]
-        attention = block.self_attn
-        input_shape = residual.shape[:-1]
-        attn_output = attn_context.reshape(*input_shape, -1).contiguous()
-        attn_output = attention.attn_out(attn_output)
-        hidden_states = residual + block.dropout(attn_output)
-
-        residual = hidden_states
-        hidden_states = block.ff_norm(hidden_states)
-        hidden_states = block.mlp(hidden_states)
-        hidden_states = residual + block.dropout(hidden_states)
-        return hidden_states
-
-    def _depth_decode_post_and_pre_next(
-        self,
-        layer_idx: int,
-        residual: torch.Tensor,
-        attn_context: torch.Tensor,
-        cos: torch.Tensor,
-        sin: torch.Tensor,
-    ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
-        hidden_states = self._depth_decode_post_layer(layer_idx, residual, attn_context)
-        return self._depth_decode_pre_layer(layer_idx + 1, hidden_states, cos, sin)
-
-    def _depth_decode_last_post(
-        self,
-        layer_idx: int,
-        residual: torch.Tensor,
-        attn_context: torch.Tensor,
-    ) -> torch.Tensor:
-        hidden_states = self._depth_decode_post_layer(layer_idx, residual, attn_context)
-        return self.backbone.transformer.ln_f(hidden_states)
-
-    def _build_depth_decode_graph(
-        self,
-        next_input_ids: torch.Tensor,
-        *,
-        past_length: int,
-        attention_bias: torch.Tensor,
-    ) -> _DepthDecodeCudaGraph:
-        text_config = self.backbone.transformer.config
-        device = next_input_ids.device
-        dtype = self.model.lm_head.weight.dtype
-        static = self._depth_decode_spec()
-        num_layers = static.num_hidden_layers
-        head_dim = static.head_dim
-        max_cache_len = int(attention_bias.shape[-1])
-        max_rope_len = max(int(text_config.max_position_embeddings or 0), max_cache_len)
-        self.backbone.transformer.prepare_rope_cache(device=device, max_seq_len=max_rope_len)
-
-        token_ids = torch.empty((1, 1), device=device, dtype=torch.long)
-        cos = torch.empty((1, 1, head_dim), device=device, dtype=dtype)
-        sin = torch.empty_like(cos)
-        positions = torch.arange(max_cache_len, device=device, dtype=torch.long)
-        context_shape = (1, 1, static.num_attention_heads, head_dim)
-
-        token_ids.copy_(next_input_ids)
-        self._select_depth_decode_rope(cos, sin, past_length=past_length)
-
-        pre_graph, pre_output = _capture_cuda_graph(
-            lambda: self._depth_decode_pre0(token_ids, cos, sin),
-            device,
-        )
-        stages = [_DepthDecodeCudaGraphLayerStage(*pre_output)]
-        post_graphs = []
-        for layer_idx in range(num_layers - 1):
-            stage = stages[-1]
-            attn_context = torch.empty(context_shape, device=device, dtype=dtype)
-            graph, output = _capture_cuda_graph(
-                lambda layer_idx=layer_idx, stage=stage, attn_context=attn_context: (
-                    self._depth_decode_post_and_pre_next(
-                        layer_idx,
-                        stage.residual,
-                        attn_context,
-                        cos,
-                        sin,
-                    )
-                ),
-                device,
-            )
-            post_graphs.append(_DepthDecodeCudaGraphPostStage(graph=graph, attn_context=attn_context))
-            stages.append(_DepthDecodeCudaGraphLayerStage(*output))
-
-        last_stage = stages[-1]
-        last_attn_context = torch.empty(context_shape, device=device, dtype=dtype)
-        last_graph, last_output = _capture_cuda_graph(
-            lambda: self._depth_decode_last_post(
-                num_layers - 1,
-                last_stage.residual,
-                last_attn_context,
-            ),
-            device,
-        )
-        post_graphs.append(_DepthDecodeCudaGraphPostStage(graph=last_graph, attn_context=last_attn_context))
-        return _DepthDecodeCudaGraph(
-            cache_key=self._depth_decode_key(next_input_ids, attention_bias),
-            pre_graph=pre_graph,
-            token_ids=token_ids,
-            cos=cos,
-            sin=sin,
-            positions=positions,
-            stages=tuple(stages),
-            post_graphs=tuple(post_graphs),
-            output=last_output,
-        )
-
-    def _get_depth_decode_graph(
-        self,
-        next_input_ids: torch.Tensor,
-        *,
-        past_length: int,
-        attention_bias: torch.Tensor,
-    ) -> _DepthDecodeCudaGraph:
-        key = self._depth_decode_key(next_input_ids, attention_bias)
-        decode_graph = self.graph
-        if decode_graph is None or decode_graph.cache_key != key:
-            decode_graph = self._build_depth_decode_graph(
-                next_input_ids,
-                past_length=past_length,
-                attention_bias=attention_bias,
-            )
-            self.graph = decode_graph
-        else:
-            decode_graph.token_ids.copy_(next_input_ids)
-            self._select_depth_decode_rope(decode_graph.cos, decode_graph.sin, past_length=past_length)
-        return decode_graph
-
-    def _run_depth_decode_attention_core(
-        self,
-        layer_idx: int,
-        stage: _DepthDecodeCudaGraphLayerStage,
-        *,
-        past_key_values: Cache,
-        attention_bias: torch.Tensor,
-        cache_position: torch.Tensor,
-        cos: torch.Tensor,
-        sin: torch.Tensor,
-    ) -> torch.Tensor:
-        attention = self.backbone.transformer.blocks[layer_idx].self_attn
-        cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
-        key_states, value_states = past_key_values.update(
-            stage.key,
-            stage.value,
-            layer_idx,
-            cache_kwargs,
-        )
-        key_states = _repeat_kv(key_states, attention.num_key_value_groups)
-        value_states = _repeat_kv(value_states, attention.num_key_value_groups)
-        attn_output = F.scaled_dot_product_attention(
-            stage.query,
-            key_states,
-            value_states,
-            attn_mask=attention_bias,
-            dropout_p=0.0,
-            is_causal=False,
-        )
-        return attn_output.transpose(1, 2)
-
-    def run(
-        self,
-        next_input_ids: torch.Tensor,
-        *,
-        past_key_values: Cache,
-        attention_bias: torch.Tensor,
-        past_length: int,
-    ) -> tuple[torch.Tensor, Cache]:
-        end = past_length + 1
-        decode_graph = self._get_depth_decode_graph(
-            next_input_ids,
-            past_length=past_length,
-            attention_bias=attention_bias,
-        )
-        cache_position = decode_graph.positions[past_length:end]
-        attention_bias_q = attention_bias[:, :, past_length:end, :end]
-
-        decode_graph.pre_graph.replay()
-
-        for layer_idx, post_graph in enumerate(decode_graph.post_graphs):
-            attn_context = self._run_depth_decode_attention_core(
-                layer_idx,
-                decode_graph.stages[layer_idx],
-                past_key_values=past_key_values,
-                attention_bias=attention_bias_q,
-                cache_position=cache_position,
-                cos=decode_graph.cos,
-                sin=decode_graph.sin,
-            )
-            post_graph.attn_context.copy_(attn_context)
-            post_graph.graph.replay()
-
-        return decode_graph.output, past_key_values
-
-
-def _cuda_graph_tensor_signature(
-    tensor: torch.Tensor | None,
-) -> tuple[Any, ...] | None:
-    if tensor is None:
-        return None
-    return (
-        tuple(tensor.shape),
-        tuple(tensor.stride()),
-        str(tensor.dtype),
-        str(tensor.device),
-    )
-
-
-def _cuda_graph_context_signature(context: Any) -> tuple[Any, ...]:
-    sig = _cuda_graph_tensor_signature
-    return (
-        tuple((sig(k), sig(v)) for k, v in context.kv_contexts),
-        sig(context.cross_mask),
-        sig(context.self_mask),
-        sig(context.valid_action),
-        None if context.rope_cache is None else tuple(sig(t) for t in context.rope_cache),
-    )
-
-
-def _cuda_graph_modulation_signature(modulations: Sequence[Any]) -> tuple[Any, ...]:
-    sig = _cuda_graph_tensor_signature
-    return tuple(
-        (
-            sig(step.conditioning),
-            tuple(tuple(sig(t) for t in block_modulation) for block_modulation in step.block_modulations),
-            tuple(sig(t) for t in step.final_modulation),
-        )
-        for step in modulations
-    )
-
-
-def _cuda_graph_key(inputs: _ActionFlowInputs, steps: int) -> tuple[Any, ...]:
-    sig = _cuda_graph_tensor_signature
-    return (
-        sig(inputs.trajectory),
-        _cuda_graph_context_signature(inputs.context),
-        _cuda_graph_modulation_signature(inputs.modulations),
-        sig(inputs.action_dim_is_pad),
-        int(steps),
-    )
-
-
-def _clone_static_tensor(tensor: torch.Tensor | None) -> torch.Tensor | None:
-    if tensor is None:
-        return None
-    static = torch.empty_strided(
-        tuple(tensor.shape),
-        tuple(tensor.stride()),
-        device=tensor.device,
-        dtype=tensor.dtype,
-    )
-    static.copy_(tensor)
-    return static
-
-
-def _clone_static_context(context: Any) -> Any:
-    rope_cache = None
-    if context.rope_cache is not None:
-        rope_cache = tuple(_clone_static_tensor(t) for t in context.rope_cache)
-    return context.__class__(
-        kv_contexts=tuple((_clone_static_tensor(k), _clone_static_tensor(v)) for k, v in context.kv_contexts),
-        cross_mask=_clone_static_tensor(context.cross_mask),
-        self_mask=_clone_static_tensor(context.self_mask),
-        valid_action=_clone_static_tensor(context.valid_action),
-        rope_cache=rope_cache,
-    )
-
-
-def _clone_static_modulations(modulations: Sequence[Any]) -> Sequence[Any]:
-    return tuple(
-        step.__class__(
-            conditioning=_clone_static_tensor(step.conditioning),
-            block_modulations=tuple(
-                tuple(_clone_static_tensor(t) for t in block_modulation)
-                for block_modulation in step.block_modulations
-            ),
-            final_modulation=tuple(_clone_static_tensor(t) for t in step.final_modulation),
-        )
-        for step in modulations
-    )
-
-
-def _clone_static_inputs(inputs: _ActionFlowInputs) -> _ActionFlowInputs:
-    return _ActionFlowInputs(
-        trajectory=_clone_static_tensor(inputs.trajectory),
-        context=_clone_static_context(inputs.context),
-        modulations=_clone_static_modulations(inputs.modulations),
-        action_dim_is_pad=_clone_static_tensor(inputs.action_dim_is_pad),
-    )
-
-
-def _copy_context_(dst: Any, src: Any) -> None:
-    for (dst_k, dst_v), (src_k, src_v) in zip(dst.kv_contexts, src.kv_contexts):
-        dst_k.copy_(src_k)
-        dst_v.copy_(src_v)
-    if src.cross_mask is not None:
-        dst.cross_mask.copy_(src.cross_mask)
-    if src.self_mask is not None:
-        dst.self_mask.copy_(src.self_mask)
-    if src.valid_action is not None:
-        dst.valid_action.copy_(src.valid_action)
-    if src.rope_cache is not None:
-        for dst_tensor, src_tensor in zip(dst.rope_cache, src.rope_cache):
-            dst_tensor.copy_(src_tensor)
-
-
-def _copy_inputs_(dst: _ActionFlowInputs, src: _ActionFlowInputs) -> None:
-    dst.trajectory.copy_(src.trajectory)
-    _copy_context_(dst.context, src.context)
-    if src.action_dim_is_pad is not None:
-        dst.action_dim_is_pad.copy_(src.action_dim_is_pad)
-
-
-def _rotate_half(x: torch.Tensor) -> torch.Tensor:
-    x1 = x[..., : x.shape[-1] // 2]
-    x2 = x[..., x.shape[-1] // 2 :]
-    return torch.cat((-x2, x1), dim=-1)
-
-
-def _apply_rotary_pos_emb(
-    q: torch.Tensor,
-    k: torch.Tensor,
-    cos: torch.Tensor,
-    sin: torch.Tensor,
-    unsqueeze_dim: int = 1,
-) -> tuple[torch.Tensor, torch.Tensor]:
-    cos = cos.unsqueeze(unsqueeze_dim)
-    sin = sin.unsqueeze(unsqueeze_dim)
-    q_embed = (q * cos) + (_rotate_half(q) * sin)
-    k_embed = (k * cos) + (_rotate_half(k) * sin)
-    return q_embed, k_embed
-
-
-def _repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
-    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
-    if n_rep == 1:
-        return hidden_states
-    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
-    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
-
-
-def _capture_cuda_graph(
-    fn,
-    device: torch.device,
-    *,
-    after_warmup=None,
-) -> tuple[torch.cuda.CUDAGraph, Any]:
-    warmup_stream = torch.cuda.Stream(device=device)
-    warmup_stream.wait_stream(torch.cuda.current_stream(device))
-    with torch.cuda.stream(warmup_stream):
-        fn()
-    torch.cuda.current_stream(device).wait_stream(warmup_stream)
-    if after_warmup is not None:
-        after_warmup()
-
-    graph = torch.cuda.CUDAGraph()
-    with torch.cuda.graph(graph):
-        output = fn()
-    return graph, output
--- a/src/lerobot/policies/molmoact2/hf_model/modeling_molmoact2.py
+++ b/src/lerobot/policies/molmoact2/hf_model/modeling_molmoact2.py
--- a/src/lerobot/policies/molmoact2/hf_model/processing_molmoact2.py
+++ b/src/lerobot/policies/molmoact2/hf_model/processing_molmoact2.py
@@ -1,431 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-# ruff: noqa
-
-"""
-Processor class for MolmoAct2.
-"""
-
-from typing import Optional, Union
-import dataclasses
-
-import numpy as np
-
-from transformers.image_utils import ImageInput
-from transformers.video_utils import VideoInput
-from transformers.processing_utils import (
-    Unpack,
-    ProcessingKwargs,
-    ProcessorMixin,
-)
-from transformers.feature_extraction_utils import BatchFeature
-from transformers.tokenization_utils_base import TextInput, PreTokenizedInput
-from transformers.utils import logging
-
-from transformers import AutoTokenizer
-from .image_processing_molmoact2 import MolmoAct2ImagesKwargs, MolmoAct2ImageProcessor
-from .video_processing_molmoact2 import MolmoAct2VideoProcessorKwargs, MolmoAct2VideoProcessor
-
-
-logger = logging.get_logger(__name__)
-
-
-# Special tokens. These must be present in any tokenizer we use, since the preprocessor relies on them.
-IMAGE_PATCH_TOKEN = f"<im_patch>"  # Where to insert high-res tokens
-IMAGE_LOW_RES_TOKEN = f"<im_low>"  # Where to insert low-res tokens
-IM_START_TOKEN = f"<im_start>"
-LOW_RES_IMAGE_START_TOKEN = f"<low_res_im_start>"
-FRAME_START_TOKEN = f"<frame_start>"
-IM_END_TOKEN = f"<im_end>"
-FRAME_END_TOKEN = f"<frame_end>"
-IM_COL_TOKEN = f"<im_col>"
-IMAGE_PROMPT = "<|image|>"
-VIDEO_PROMPT = "<|video|>"
-
-IMAGE_TOKENS = [
-    IMAGE_PATCH_TOKEN,
-    IM_COL_TOKEN,
-    IM_START_TOKEN,
-    LOW_RES_IMAGE_START_TOKEN,
-    FRAME_START_TOKEN,
-    IM_END_TOKEN,
-    FRAME_END_TOKEN,
-    IMAGE_LOW_RES_TOKEN,
-]
-
-
-class MolmoAct2ProcessorKwargs(ProcessingKwargs, total=False):
-    """MolmoAct2 processor kwargs"""
-
-    images_kwargs: MolmoAct2ImagesKwargs
-    videos_kwargs: MolmoAct2VideoProcessorKwargs
-    _defaults = {
-        "text_kwargs": {
-            "padding": False,
-            "return_mm_token_type_ids": True,
-        },
-        "videos_kwargs": {"return_metadata": True},
-    }
-
-
-class MolmoAct2Processor(ProcessorMixin):
-    attributes = ["image_processor", "video_processor", "tokenizer"]
-    optional_attributes = [
-        "chat_template",
-        "time_mode",
-        "image_use_col_tokens",
-        "use_single_crop_col_tokens",
-        "use_single_crop_start_token",
-        "video_use_col_tokens",
-        "use_frame_special_tokens",
-    ]
-    image_processor_class = "AutoImageProcessor"
-    video_processor_class = "AutoVideoProcessor"
-    tokenizer_class = "AutoTokenizer"
-
-    def __init__(
-        self,
-        image_processor: MolmoAct2ImageProcessor = None,
-        video_processor: MolmoAct2VideoProcessor = None,
-        tokenizer: AutoTokenizer = None,
-        chat_template: str | None = None,
-        image_use_col_tokens: bool | None = True,
-        use_single_crop_col_tokens: bool | None = None,
-        use_single_crop_start_token: bool | None = True,
-        video_use_col_tokens: bool | None = False,
-        use_frame_special_tokens: bool | None = True,
-        **kwargs,
-    ) -> None:
-        super().__init__(
-            image_processor,
-            video_processor,
-            tokenizer,
-            chat_template=chat_template,
-        )
-        self.image_use_col_tokens = image_use_col_tokens
-        self.use_single_crop_col_tokens = use_single_crop_col_tokens
-        self.use_single_crop_start_token = use_single_crop_start_token
-        self.video_use_col_tokens = video_use_col_tokens
-        self.use_frame_special_tokens = use_frame_special_tokens
-
-        self.image_placeholder_token = IMAGE_PROMPT
-        self.video_placeholder_token = VIDEO_PROMPT
-        self.image_token_ids = [tokenizer.convert_tokens_to_ids(token) for token in IMAGE_TOKENS]
-
-    def get_image_tokens(self, image_grid: np.ndarray):
-        resized_h, resized_w, height, width = image_grid
-        if int(height) == 0 or int(width) == 0:
-            per_row = np.full(resized_w, IMAGE_PATCH_TOKEN)
-            use_single_crop_col_tokens = (
-                self.image_use_col_tokens
-                if self.use_single_crop_col_tokens is None
-                else self.use_single_crop_col_tokens
-            )
-            if use_single_crop_col_tokens:
-                per_row = np.concatenate([per_row, [IM_COL_TOKEN]], 0)
-            joint = [
-                [IM_START_TOKEN],
-                np.tile(per_row, [resized_h]),
-                [IM_END_TOKEN],
-            ]
-            return np.concatenate(joint)
-        per_row = np.full(width, IMAGE_PATCH_TOKEN)
-        if self.image_use_col_tokens:
-            per_row = np.concatenate([per_row, [IM_COL_TOKEN]], 0)
-        joint = [
-            [IM_START_TOKEN],
-            np.tile(per_row, [height]),
-            [IM_END_TOKEN],
-        ]
-        per_row = np.full(resized_w, IMAGE_PATCH_TOKEN)
-        use_single_crop_col_tokens = (
-            self.image_use_col_tokens
-            if self.use_single_crop_col_tokens is None
-            else self.use_single_crop_col_tokens
-        )
-        image_start_token = LOW_RES_IMAGE_START_TOKEN if self.use_single_crop_start_token else IM_START_TOKEN
-        if use_single_crop_col_tokens:
-            per_row = np.concatenate([per_row, [IM_COL_TOKEN]], 0)
-        joint = [
-            [image_start_token],
-            np.tile(per_row, [resized_h]),
-            [IM_END_TOKEN],
-        ] + joint
-
-        return np.concatenate(joint)
-
-    def get_video_string(
-        self,
-        video_grid: np.ndarray,
-        timestamps: np.ndarray,
-    ):
-        if self.use_frame_special_tokens:
-            start_token_id = FRAME_START_TOKEN
-            end_token_id = FRAME_END_TOKEN
-        else:
-            start_token_id = IM_START_TOKEN
-            end_token_id = IM_END_TOKEN
-
-        num_frames, h, w = video_grid
-        video_string: str = ""
-        for frame_idx, frame_time in enumerate(timestamps):
-            # `per-frame-compact` time mode
-            prev_space = " " if frame_idx > 0 else ""
-            frame_prefix = prev_space + f"{frame_time:.1f} "  # explicit whitespace before/after image tokens
-
-            video_string += frame_prefix
-            per_row = np.full(w, IMAGE_PATCH_TOKEN)
-            if self.video_use_col_tokens:
-                per_row = np.concatenate([per_row, [IM_COL_TOKEN]], 0)
-            extra_tokens = np.tile(per_row, [h])
-            video_tokens = [
-                [start_token_id],
-                extra_tokens,
-                [end_token_id],
-            ]
-            video_string += "".join(np.concatenate(video_tokens, 0))
-
-        return video_string
-
-    def insert_bos(
-        self,
-        input_ids: np.ndarray,
-        attention_mask: np.ndarray,
-        bos_token_id: int,
-        pad_token_id: int,
-    ):
-        """
-        Args:
-            input_ids: [B, S] array with left padding
-            attention_mask: [B, S] array (0 for pad, 1 for valid)
-            bos_token_id: int
-            pad_token_id: int
-        Returns:
-            input_ids_out: [B, S] or [B, S+1] array with bos inserted if needed
-            attention_mask_out: same shape as input_ids_out
-        """
-
-        need_to_expand = len(input_ids.shape) == 1
-        if need_to_expand:
-            input_ids = input_ids[None, :]
-            attention_mask = attention_mask[None, :]
-
-        B, S = input_ids.shape
-
-        # Handle zero-length sequence
-        if S == 0:
-            new_input_ids = np.full((B, 1), bos_token_id, dtype=input_ids.dtype)
-            new_attention_mask = np.ones((B, 1), dtype=attention_mask.dtype)
-            if need_to_expand:
-                new_input_ids = new_input_ids[0]
-                new_attention_mask = new_attention_mask[0]
-            return new_input_ids, new_attention_mask
-
-        first_valid_index = (attention_mask == 1).argmax(axis=-1)  # [B]
-        bos_already_present = np.all(input_ids[np.arange(B), first_valid_index] == bos_token_id)
-
-        if bos_already_present:
-            if need_to_expand:
-                input_ids = input_ids[0]
-                attention_mask = attention_mask[0]
-            return input_ids, attention_mask
-        else:
-            new_input_ids = np.full((B, S + 1), pad_token_id, dtype=input_ids.dtype)
-            new_attention_mask = np.zeros((B, S + 1), dtype=attention_mask.dtype)
-
-            src_idx = np.tile(np.arange(S), (B, 1))  # [B, S]
-            valid_mask = src_idx >= first_valid_index[:, None]  # [B, S]
-            tgt_idx = src_idx + 1  # shift right
-            batch_idx = np.tile(np.arange(B)[:, None], (1, S))  # [B, S]
-
-            # flatten valid_positions
-            flat_vals = input_ids[valid_mask]
-            flat_batch = batch_idx[valid_mask]
-            flat_tgt = tgt_idx[valid_mask]
-
-            new_input_ids[flat_batch, flat_tgt] = flat_vals
-            new_attention_mask[flat_batch, flat_tgt] = 1
-
-            insert_pos = first_valid_index
-            new_input_ids[np.arange(B), insert_pos] = bos_token_id
-            new_attention_mask[np.arange(B), insert_pos] = 1
-
-            if need_to_expand:
-                new_input_ids = new_input_ids[0]
-                new_attention_mask = new_attention_mask[0]
-
-            return new_input_ids, new_attention_mask
-
-    def __call__(
-        self,
-        text: TextInput | PreTokenizedInput | list[TextInput] | list[PreTokenizedInput] = None,
-        images: ImageInput = None,
-        videos: VideoInput = None,
-        **kwargs: Unpack[MolmoAct2ProcessorKwargs],
-    ) -> BatchFeature:
-        """
-
-        Args:
-            text (`str`, `list[str]`, `list[list[str]]`):
-                The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
-                (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
-                `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
-            images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `list[PIL.Image.Image]`, `list[np.ndarray]`, `list[torch.Tensor]`):
-                The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
-                tensor. Both channels-first and channels-last formats are supported.
-            videos (`dict[str, Any]` or `list[dict[str, Any]]`):
-                The video or batch of videos to be prepared. Each video can be a dictionary with the following keys:
-                - `"frames"`: `np.ndarray` of shape (T, H, W, 3)
-                - `"timestamps"`: `np.ndarray` of shape (T,)
-                - `"sampled_fps"`: `float` (optional)
-                - `"sampling_augmentation"`: `str` (optional)
-            return_tensors (`str` or [`~utils.TensorType`], *optional*):
-                If set, will return tensors of a particular framework. Acceptable values are:
-                - `'tf'`: Return TensorFlow `tf.constant` objects.
-                - `'pt'`: Return PyTorch `torch.Tensor` objects.
-                - `'np'`: Return NumPy `np.ndarray` objects.
-                - `'jax'`: Return JAX `jnp.ndarray` objects.
-
-        Returns:
-            `BatchFeature`: A [`BatchFeature`] with the following fields:
-            - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
-            - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
-              `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not `None`).
-            - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
-            - **image_token_pooling** -- Indices of the patches in `image_grids` to pool for each token in `image_tokens`.
-              Returned when `images` is not `None`.
-            - **image_grids** -- Grids of images. Returned when `images` is not `None`.
-            - **image_num_crops** -- Number of crops for each image. Returned when `images` is not `None`.
-            - **pixel_values_videos** -- Pixel values of videos to be fed to a model. Returned when `videos` is not `None`.
-            - **video_token_pooling** -- Indices of the patches in `video_grids` to pool for each token in `video_tokens`.
-              Returned when `videos` is not `None`.
-            - **video_grids** -- Grids of videos. Returned when `videos` is not `None`.
-        """
-
-        output_kwargs = self._merge_kwargs(
-            MolmoAct2ProcessorKwargs,
-            tokenizer_init_kwargs=self.tokenizer.init_kwargs,
-            **kwargs,
-        )
-
-        if images is not None:
-            image_inputs = self.image_processor(images, **output_kwargs["images_kwargs"])
-            image_grids = image_inputs["image_grids"]
-        else:
-            image_inputs = {}
-            image_grids = None
-
-        if videos is not None:
-            videos_inputs = self.video_processor(videos=videos, **output_kwargs["videos_kwargs"])
-            video_grids = videos_inputs["video_grids"]
-            # If user has not requested video metadata, pop it
-            if "return_metadata" not in kwargs:
-                video_metadata = videos_inputs.pop("video_metadata")
-            else:
-                video_metadata = videos_inputs["video_metadata"]
-        else:
-            videos_inputs = {}
-            video_grids = None
-
-        if not isinstance(text, list):
-            text = [text]
-
-        text = text.copy()  # below lines change text in-place
-
-        if image_grids is not None:
-            index = 0
-            for i in range(len(text)):
-                num_images = text[i].count(self.image_placeholder_token)
-                image_grids_i = image_grids[index : index + num_images]
-                for image_grid in image_grids_i:
-                    image_tokens = self.get_image_tokens(image_grid)
-                    image_string = "".join(image_tokens)
-                    text[i] = text[i].replace(self.image_placeholder_token, image_string, 1)
-                index += num_images
-
-        if video_grids is not None:
-            index = 0
-            for i in range(len(text)):
-                num_videos = text[i].count(self.video_placeholder_token)
-                assert num_videos in {0, 1}, "At most one video is supported for now"
-                video_grids_i = video_grids[index : index + num_videos]
-                metadata_i = video_metadata[index : index + num_videos]
-                for video_grid, metadata in zip(video_grids_i, metadata_i):
-                    video_string = self.get_video_string(
-                        video_grid,
-                        metadata.timestamps,
-                    )
-                    text[i] = text[i].replace(self.video_placeholder_token, video_string, 1)
-                index += num_videos
-
-        return_tensors = output_kwargs["text_kwargs"].pop("return_tensors", None)
-        return_mm_token_type_ids = output_kwargs["text_kwargs"].pop("return_mm_token_type_ids", False)
-        text_inputs = self.tokenizer(text, **output_kwargs["text_kwargs"])
-
-        input_ids = text_inputs["input_ids"]
-        attention_mask = text_inputs["attention_mask"]
-
-        input_ids = np.array(input_ids)
-        attention_mask = np.array(attention_mask)
-
-        bos = self.tokenizer.bos_token_id or self.tokenizer.eos_token_id
-        input_ids, attention_mask = self.insert_bos(
-            input_ids, attention_mask, bos, self.tokenizer.pad_token_id
-        )
-
-        if return_mm_token_type_ids:
-            image_tokens = np.array(self.image_token_ids).astype(input_ids.dtype)
-            token_type_ids = np.any(input_ids[:, :, None] == image_tokens[None, None, :], axis=-1)
-            text_inputs["token_type_ids"] = token_type_ids.tolist()
-
-        text_inputs["input_ids"] = input_ids.tolist()
-        text_inputs["attention_mask"] = attention_mask.tolist()
-
-        return BatchFeature(
-            data={**text_inputs, **image_inputs, **videos_inputs},
-            tensor_type=return_tensors,
-        )
-
-    def post_process_image_text_to_text(
-        self, generated_outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False, **kwargs
-    ):
-        """
-        Post-process the output of the model to decode the text.
-
-        Args:
-            generated_outputs (`torch.Tensor` or `np.ndarray`):
-                The output of the model `generate` function. The output is expected to be a tensor of shape `(batch_size, sequence_length)`
-                or `(sequence_length,)`.
-            skip_special_tokens (`bool`, *optional*, defaults to `True`):
-                Whether or not to remove special tokens in the output. Argument passed to the tokenizer's `batch_decode` method.
-            clean_up_tokenization_spaces (`bool`, *optional*, defaults to `False`):
-                Whether or not to clean up the tokenization spaces. Argument passed to the tokenizer's `batch_decode` method.
-            **kwargs:
-                Additional arguments to be passed to the tokenizer's `batch_decode` method.
-
-        Returns:
-            `list[str]`: The decoded text.
-        """
-        return self.tokenizer.batch_decode(
-            generated_outputs,
-            skip_special_tokens=skip_special_tokens,
-            clean_up_tokenization_spaces=clean_up_tokenization_spaces,
-            **kwargs,
-        )
-
-
-MolmoAct2Processor.register_for_auto_class()
--- a/src/lerobot/policies/molmoact2/hf_model/video_processing_molmoact2.py
+++ b/src/lerobot/policies/molmoact2/hf_model/video_processing_molmoact2.py
@@ -1,997 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-# ruff: noqa
-
-"""Video processor class for MolmoAct2"""
-
-from functools import partial
-import os
-import warnings
-from contextlib import redirect_stdout
-from io import BytesIO
-from urllib.parse import urlparse
-from typing import Optional, Union
-from collections.abc import Callable
-
-import numpy as np
-import requests
-import einops
-import torch
-import torchvision.transforms
-
-from transformers.image_utils import (
-    IMAGENET_STANDARD_MEAN,
-    IMAGENET_STANDARD_STD,
-    ImageInput,
-    PILImageResampling,
-    SizeDict,
-    validate_kwargs,
-)
-from transformers.video_utils import (
-    VideoInput,
-    is_valid_video,
-    make_batched_videos,
-    make_batched_metadata,
-    VideoMetadata,
-)
-from transformers.processing_utils import Unpack, VideosKwargs
-from transformers.video_processing_utils import BaseVideoProcessor
-from transformers.utils import logging
-from transformers.feature_extraction_utils import BatchFeature
-from transformers.utils import (
-    is_av_available,
-    is_decord_available,
-    is_torchcodec_available,
-    is_yt_dlp_available,
-    TensorType,
-    logging,
-    to_numpy,
-)
-
-
-logger = logging.get_logger(__name__)
-
-MAX_VIDEO_FPS = 8
-
-
-def normalize_image(
-    image: np.ndarray,
-    image_mean: list[float],
-    image_std: list[float],
-) -> np.ndarray:
-    if np.allclose(image_mean, [0.5, 0.5, 0.5]) and np.allclose(image_std, [0.5, 0.5, 0.5]):
-        return image * np.asarray(2.0, dtype=np.float32) - np.asarray(1.0, dtype=np.float32)
-    image -= np.array(image_mean, dtype=np.float32)[None, None, :]
-    image /= np.array(image_std, dtype=np.float32)[None, None, :]
-    return image
-
-
-def resize_image(
-    image: np.ndarray,
-    desired_output_size: list[int],
-    resample: PILImageResampling,
-) -> np.ndarray:
-    if len(image.shape) == 3:
-        is_video = False
-        image = torch.permute(torch.from_numpy(image), [2, 0, 1])
-    else:
-        is_video = True
-        image = torch.permute(torch.from_numpy(image), [0, 3, 1, 2])
-    dtype = image.dtype
-    if torch.is_floating_point(image):
-        in_min = 0.0
-        in_max = 1.0
-        resized = torchvision.transforms.Resize(
-            desired_output_size,
-            resample,
-            antialias=False,
-        )(image)
-        resized = torch.clip(resized, 0.0, 1.0).to(dtype)
-    else:
-        assert image.dtype == torch.uint8, "SigLIP expects float images or uint8 images, but got {}".format(
-            image.dtype
-        )
-        in_min = 0.0
-        in_max = 255.0
-        resized = torchvision.transforms.Resize(
-            desired_output_size,
-            resample,
-            antialias=False,
-        )(image)
-        resized = torch.clip(resized, 0, 255).to(dtype)
-
-    resized = resized.to(torch.float32)
-    resized = (resized - in_min) / (in_max - in_min)
-
-    if is_video:
-        resized = torch.permute(resized, [0, 2, 3, 1]).numpy()
-    else:
-        resized = torch.permute(resized, [1, 2, 0]).numpy()
-
-    return resized
-
-
-def build_resized_image(
-    image: np.ndarray,
-    base_image_input_size: list[int],
-    resample: PILImageResampling,
-    image_mean: list[float],
-    image_std: list[float],
-    image_patch_size: int,
-) -> tuple[np.ndarray, np.ndarray]:
-    resized = resize_image(
-        image,
-        base_image_input_size,
-        resample,
-    )
-    resized = normalize_image(resized, image_mean, image_std)
-    if len(resized.shape) == 3:
-        resized = np.expand_dims(resized, 0)
-    crop_patch_w = base_image_input_size[1] // image_patch_size
-    crop_patch_h = base_image_input_size[0] // image_patch_size
-    resize_idx = np.arange(crop_patch_w * crop_patch_h).reshape([crop_patch_h, crop_patch_w])
-    return resized, resize_idx
-
-
-def batch_pixels_to_patches(array: np.ndarray, patch_size: int) -> np.ndarray:
-    """Reshape images of [n_images, h, w, 3] -> [n_images, n_patches, pixels_per_patch]"""
-    if len(array.shape) == 3:
-        n_crops, h, w = array.shape
-        h_patches = h // patch_size
-        w_patches = w // patch_size
-        array = np.reshape(array, [n_crops, h_patches, patch_size, w_patches, patch_size])
-        array = np.transpose(array, [0, 1, 3, 2, 4])
-        array = np.reshape(array, [n_crops, h_patches * w_patches, patch_size * patch_size])
-        return array
-    else:
-        n_crops, h, w, c = array.shape
-        h_patches = h // patch_size
-        w_patches = w // patch_size
-        array = np.reshape(array, [n_crops, h_patches, patch_size, w_patches, patch_size, c])
-        array = np.transpose(array, [0, 1, 3, 2, 4, 5])
-        array = np.reshape(array, [n_crops, h_patches * w_patches, patch_size * patch_size * c])
-        return array
-
-
-def arange_for_pooling(
-    idx_arr: np.ndarray,
-    pool_h: int,
-    pool_w: int,
-) -> np.ndarray:
-    h_pad = pool_h * ((idx_arr.shape[0] + pool_h - 1) // pool_h) - idx_arr.shape[0]
-    w_pad = pool_w * ((idx_arr.shape[1] + pool_w - 1) // pool_w) - idx_arr.shape[1]
-    idx_arr = np.pad(
-        idx_arr,
-        [[h_pad // 2, (h_pad + 1) // 2], [w_pad // 2, (w_pad + 1) // 2]],
-        mode="constant",
-        constant_values=-1,
-    )
-    return einops.rearrange(idx_arr, "(h dh) (w dw) -> h w (dh dw)", dh=pool_h, dw=pool_w)
-
-
-def image_to_patches_and_grids(
-    image: ImageInput,
-    base_image_input_size: list[int],
-    resample: PILImageResampling,
-    image_mean: list[float],
-    image_std: list[float],
-    image_patch_size: int,
-    image_pooling_w: int,
-    image_pooling_h: int,
-) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
-    """
-    :return image_grids, the shape of each image after pooling
-    :return crops, the image crops to process with the ViT
-    :return pooled_patch_idx, for each patch_id tokens in `image_tokens`, the indices of the
-                                patches in `crops` to pool for that token, masked with -1
-    """
-    if isinstance(base_image_input_size, int):
-        base_image_input_size = (base_image_input_size, base_image_input_size)
-
-    pooling_w = image_pooling_w
-    pooling_h = image_pooling_h
-
-    resized, resize_idx = build_resized_image(
-        image,
-        base_image_input_size,
-        resample,
-        image_mean,
-        image_std,
-        image_patch_size,
-    )
-    pooling_idx = arange_for_pooling(resize_idx, pooling_h, pooling_w)
-    h, w = pooling_idx.shape[:2]
-    pooling_idx = pooling_idx.reshape([-1, pooling_h * pooling_w])
-    image_grid = [h, w]
-    return (
-        image_grid,
-        batch_pixels_to_patches(resized, image_patch_size),
-        pooling_idx,
-    )
-
-
-def get_candidate_target_fps(
-    video_fps: int | float,
-    sampling_fps: int | float,
-    max_fps: int | float = MAX_VIDEO_FPS,
-) -> list[float]:
-    """
-    Return the subset of `video_fps` factors that remain multiples of `sampling_fps`.
-
-    Examples:
-        >>> get_candidate_target_fps(video_fps=6, sampling_fps=2)
-        [2.0, 6.0]
-        >>> get_candidate_target_fps(video_fps=5, sampling_fps=1)
-        [1.0, 5.0]
-        >>> get_candidate_target_fps(video_fps=2, sampling_fps=2)
-        [2.0]
-        >>> get_candidate_target_fps(video_fps=5, sampling_fps=2)
-        Traceback (most recent call last):
-            ...
-        ValueError: sampling_fps=2 must divide video_fps=5.
-    """
-    if sampling_fps is None:
-        raise ValueError("sampling_fps must be provided")
-
-    video_fps = int(video_fps)
-    sampling_fps = int(sampling_fps)
-    max_fps = int(max_fps)
-    if video_fps <= 0 or sampling_fps <= 0:
-        raise ValueError(f"video_fps and sampling_fps must be positive (got {video_fps}, {sampling_fps})")
-    if video_fps % sampling_fps != 0:
-        raise ValueError(f"sampling_fps={sampling_fps} must divide video_fps={video_fps}.")
-
-    candidates = []
-    for candidate in range(sampling_fps, video_fps + 1, sampling_fps):
-        if candidate > max_fps:
-            break
-        if video_fps % candidate == 0:
-            candidates.append(float(candidate))
-
-    return candidates
-
-
-def read_video_decord(
-    video_path,
-    sample_timestamps_fn: Callable,
-    **kwargs,
-) -> tuple[np.ndarray, VideoMetadata]:
-    """
-    Decode a video using the Decord backend.
-
-    Args:
-        video_path (`str`):
-            Path to the video file.
-        sample_timestamps_fn (`Callable`):
-            A callable function that will return timestamps at which the video should be sampled.
-
-    Returns:
-        tuple[`np.array`, `VideoMetadata`]: A tuple containing:
-            - Numpy array of frames in RGB (shape: [num_frames, height, width, 3]).
-            - `VideoMetadata` object.
-    """
-    # Lazy import from decord
-    import importlib
-
-    decord = importlib.import_module("decord")
-
-    vr = decord.VideoReader(uri=video_path, ctx=decord.cpu(0))  # decord has problems with gpu
-    video_fps = vr.get_avg_fps()
-    total_num_frames = len(vr)
-    time_stamps = vr.get_frame_timestamp(list(range(len(vr))))
-    duration = time_stamps[-1][1] - time_stamps[0][0]
-
-    metadata = VideoMetadata(
-        total_num_frames=int(total_num_frames),
-        fps=float(video_fps),
-        duration=float(duration),
-        video_backend="decord",
-    )
-
-    target_timestamps = sample_timestamps_fn(metadata=metadata, **kwargs)
-    target_timestamps = np.array(target_timestamps)
-    offset = time_stamps[0, 0]
-
-    ix = np.searchsorted(time_stamps[:, 1], target_timestamps + offset, side="right")
-    ix = np.minimum(ix, len(time_stamps) - 1)
-
-    video = vr.get_batch(ix).asnumpy()
-    metadata.update(
-        {
-            "frames_indices": target_timestamps * video_fps,
-            "height": video.shape[1],
-            "width": video.shape[2],
-        }
-    )
-    return video, metadata
-
-
-def read_video_torchcodec(
-    video_path,
-    sample_timestamps_fn: Callable,
-    **kwargs,
-) -> tuple[np.ndarray, VideoMetadata]:
-    """
-    Decode a video using torchcodec decoder.
-
-    Args:
-        video_path (`str`):
-            Path to the video file.
-        sample_timestamps_fn (`Callable`):
-            A callable function that will return timestamps at which the video should be sampled.
-
-    Returns:
-        tuple[`np.array`, `VideoMetadata`]: A tuple containing:
-            - Numpy array of frames in RGB (shape: [num_frames, height, width, 3]).
-            - `VideoMetadata` object.
-    """
-    # Lazy import torchcodec
-    import importlib
-
-    torchcodec = importlib.import_module("torchcodec")
-
-    decoder = torchcodec.decoders.VideoDecoder(
-        video_path,
-        # Interestingly, `exact` mode takes less time than `approximate` when we load the whole video
-        seek_mode="exact",
-        # Allow FFmpeg to decide on the number of threads for efficiency
-        num_ffmpeg_threads=0,
-    )
-    # If the first frame starts at > 0, we effectively clip the video starting at that time
-    # since (most) video players would also skip to that time
-    time_offset = decoder.metadata.begin_stream_seconds_from_content
-    # Note this duration does assume we started playing at `time_offset`
-    duration = decoder.metadata.duration_seconds
-
-    metadata = VideoMetadata(
-        total_num_frames=decoder.metadata.num_frames,
-        fps=decoder.metadata.average_fps,
-        duration=duration,
-        video_backend="torchcodec",
-        height=decoder.metadata.height,
-        width=decoder.metadata.width,
-    )
-
-    target_timestamps = sample_timestamps_fn(metadata=metadata, **kwargs)
-
-    # Floating point/rounding issues might cause `target_timestamps` to be very slightly
-    # out-of-bounds, to handle this we sanity check then clip them
-    assert all(x >= 0 for x in target_timestamps)
-    assert all(x < duration + 1e-6 for x in target_timestamps)
-    # 1e-6 padding since torchcodec can throw out-of-bounds errors even if you ask for the
-    # exact boundary value, we should still get the first/last frame anyway
-    max_timestamp = decoder.metadata.end_stream_seconds_from_content - 1e-6
-    min_timestamp = decoder.metadata.begin_stream_seconds_from_content + 1e-6
-    # Note we avoid using numpy ops here to reduce floating precision issues
-    timestamps = [x + time_offset for x in target_timestamps]
-    timestamps = [max(min_timestamp, min(max_timestamp, x)) for x in timestamps]
-
-    video = (
-        decoder.get_frames_played_at(timestamps).data.numpy().transpose(0, 2, 3, 1)
-    )  # Convert to THWC format
-    target_timestamps = np.array(target_timestamps)
-    metadata.frames_indices = target_timestamps * metadata.fps
-
-    return video, metadata
-
-
-def read_video_pyav(
-    video_path,
-    sample_timestamps_fn: Callable,
-    **kwargs,
-) -> tuple[np.ndarray, VideoMetadata]:
-    """
-    Decode a video using the PyAV backend.
-
-    Args:
-        video_path (`str`):
-            Path to the video file.
-        sample_timestamps_fn (`Callable`):
-            A callable function that will return timestamps at which the video should be sampled.
-
-    Returns:
-        tuple[`np.array`, `VideoMetadata`]: A tuple containing:
-            - Numpy array of frames in RGB (shape: [num_frames, height, width, 3]).
-            - `VideoMetadata` object.
-    """
-    # Lazy import av (PyAV)
-    import importlib
-
-    av = importlib.import_module("av")
-
-    with av.open(video_path) as container:
-        video_stream = container.streams.video[0]
-        fps = video_stream.average_rate or video_stream.guessed_rate
-        it = container.decode(video=0)
-        frames = list(it)
-
-        stream = container.streams.video[0]
-        start = frames[0].pts * stream.time_base
-        container_end = stream.duration
-        if container_end is not None:
-            container_end *= stream.time_base
-        if container_end is None or container_end < frames[-1].pts:
-            # Some problem with stream duration, so use the frame PTS directly
-            # and guess the duration of the last frame
-            end = frames[-1].pts * stream.time_base + 1 / fps
-        else:
-            end = container_end
-        duration = float(end - start)
-
-        metadata = VideoMetadata(
-            total_num_frames=len(frames),
-            fps=float(fps),
-            duration=float(duration),
-            video_backend="pyav",
-            height=video_stream.height,
-            width=video_stream.width,
-        )
-
-        target_timestamps = sample_timestamps_fn(metadata=metadata, **kwargs)
-        offset = float(start)
-
-        target_timestamps = np.array(target_timestamps)
-        end_time_stamps = np.array([float(frame.pts * stream.time_base) for frame in frames[1:]] + [duration])
-        indices = np.searchsorted(end_time_stamps, target_timestamps + offset, side="right")
-        indices = np.minimum(indices, len(end_time_stamps) - 1)
-
-        video = np.stack(
-            [frames[i].to_ndarray(format="rgb24", channel_last=True) for i in indices],
-            axis=0,
-        )
-
-        metadata.frames_indices = target_timestamps * fps
-
-        return video, metadata
-
-
-VIDEO_DECODERS = {
-    "decord": read_video_decord,
-    "torchcodec": read_video_torchcodec,
-    "pyav": read_video_pyav,
-}
-
-
-def load_video(
-    video: VideoInput,
-    backend: str = "decord",
-    sample_timestamps_fn: Callable | None = None,
-    **kwargs,
-):
-    """
-    Loads `video` to a numpy array.
-
-    Args:
-        video (`VideoInput`):
-            The video to convert to the numpy array format. Can be a link to video or local path.
-        backend (`str`, *optional*, defaults to `"decord"`):
-            The backend to use when loading the video. Can be any of ["decord", "pyav", "torchcodec"]. Defaults to "decord".
-        sample_timestamps_fn (`Callable`):
-            A callable function that will return timestamps at which the video should be sampled.
-    """
-
-    # Early exit if provided an array or `PIL` frames
-    if not isinstance(video, str):
-        metadata = [None] * len(video)
-        return video, metadata
-
-    if urlparse(video).netloc in ["www.youtube.com", "youtube.com"]:
-        if not is_yt_dlp_available():
-            raise ImportError("To load a video from a YouTube URL you have to install `yt_dlp` first.")
-        # Lazy import from yt_dlp
-        import importlib
-
-        yt_dlp = importlib.import_module("yt_dlp")
-
-        buffer = BytesIO()
-        with redirect_stdout(buffer), yt_dlp.YoutubeDL() as f:
-            f.download([video])
-        bytes_obj = buffer.getvalue()
-        file_obj = BytesIO(bytes_obj)
-    elif video.startswith("http://") or video.startswith("https://"):
-        file_obj = BytesIO(requests.get(video, timeout=10).content)
-    elif os.path.isfile(video):
-        file_obj = video
-    else:
-        raise TypeError(
-            "Incorrect format used for video. Should be a URL linking to a video or a local path."
-        )
-
-    # can also load with decord, but not cv2/torchvision
-    # both will fail in case of url links
-    video_is_url = video.startswith("http://") or video.startswith("https://")
-    if video_is_url and backend == "opencv":
-        raise ValueError("If you are trying to load a video from URL, you cannot use 'opencv' as backend")
-
-    if (
-        (not is_decord_available() and backend == "decord")
-        or (not is_torchcodec_available() and backend == "torchcodec")
-        or (not is_av_available() and backend == "pyav")
-    ):
-        raise ImportError(
-            f"You chose backend={backend} for loading the video but the required library is not found in your environment. "
-            f"Make sure to install {backend} before loading the video."
-        )
-
-    video_decoder = VIDEO_DECODERS[backend]
-    video, metadata = video_decoder(file_obj, sample_timestamps_fn, **kwargs)
-    return video, metadata
-
-
-def get_target_fps(
-    video_fps: float,
-    max_frames: int,
-    total_frames: int,
-    frame_sample_mode: str,
-    candidate_target_fps: tuple[float],
-) -> float:
-    """
-    Get the target fps that best spans the video and has the most frames sampled
-    """
-    num_frames_sampled = 0
-    selected_target_fps = None
-    for target_fps in candidate_target_fps:
-        step_size = max(int(video_fps / target_fps), 1)
-        num_frames_sampled_at_fps = int(total_frames / step_size)
-        if num_frames_sampled == 0:
-            if "uniform" in frame_sample_mode:
-                if num_frames_sampled_at_fps > max_frames:
-                    break
-            selected_target_fps = target_fps
-            num_frames_sampled = num_frames_sampled_at_fps
-
-        else:
-            # the candidate sampling fps increases so frame count can't decrease
-            assert num_frames_sampled <= num_frames_sampled_at_fps
-            if num_frames_sampled_at_fps > max_frames:
-                # choose the sampling fps that spans the video
-                continue
-
-            elif num_frames_sampled_at_fps > num_frames_sampled:
-                # both are less than max_frames, choose the one with higher density of frames sampled
-                selected_target_fps = target_fps
-                num_frames_sampled = num_frames_sampled_at_fps
-    return selected_target_fps
-
-
-def get_frame_times_and_chosen_fps(selected_target_fps, total_frames, max_frames, video_fps):
-    if selected_target_fps is None:
-        frame_indices = np.linspace(0, total_frames, max_frames, endpoint=False, dtype=int)
-    else:
-        step_size = max(int(video_fps / selected_target_fps), 1)
-        frame_indices = np.arange(0, total_frames, step_size)
-    if len(frame_indices) > max_frames:
-        frame_indices = frame_indices[:max_frames]
-    return selected_target_fps, frame_indices
-
-
-class MolmoAct2VideoProcessorKwargs(VideosKwargs, total=False):
-    patch_size: int | None
-    pooling_size: list[int] | None
-    frame_sample_mode: str | None
-    max_fps: int | None
-    sampling_fps: int | None
-
-
-class MolmoAct2VideoProcessor(BaseVideoProcessor):
-    resample = PILImageResampling.BILINEAR
-    size = {"height": 378, "width": 378}
-    image_mean = IMAGENET_STANDARD_MEAN
-    image_std = IMAGENET_STANDARD_STD
-    do_resize = True
-    do_rescale = True
-    do_normalize = True
-    do_convert_rgb = True
-    patch_size = 14
-    pooling_size = [3, 3]
-    do_sample_frames = True
-    frame_sample_mode = "uniform_last_frame"
-    max_fps = 2
-    sampling_fps = 2
-    valid_kwargs = MolmoAct2VideoProcessorKwargs
-    model_input_names = ["pixel_values_videos", "video_token_pooling", "video_grids"]
-
-    def __init__(self, **kwargs: Unpack[MolmoAct2VideoProcessorKwargs]):
-        super().__init__(**kwargs)
-        if self.size is not None and (
-            self.size.get("height", None) is None or self.size.get("width", None) is None
-        ):
-            raise ValueError("size must contain 'height' and 'width' keys.")
-
-    def _further_process_kwargs(
-        self,
-        size: SizeDict | None = None,
-        **kwargs,
-    ) -> dict:
-        """
-        Update kwargs that need further processing before being validated.
-        Can be overridden by subclasses to customize the processing of kwargs.
-        """
-        if size is not None and ("height" not in size or "width" not in size):
-            raise ValueError("size must contain 'height' and 'width' keys.")
-
-        return super()._further_process_kwargs(size=size, **kwargs)
-
-    def sample_times(
-        self,
-        metadata: VideoMetadata,
-        frame_sample_mode: str,
-        num_frames: int,
-        max_fps: int | None = None,
-        sampling_fps: int | None = None,
-        **kwargs,
-    ) -> np.ndarray:
-        """
-        Time-based sampling, used when the video is decoded from a path or URL
-        Args:
-            metadata (`VideoMetadata`):
-                Metadata of the video containing information about total duration, fps and total number of frames.
-            frame_sample_mode (`str`, *optional*):
-                Mode to sample frames. Defaults to `self.frame_sample_mode`.
-            num_frames (`int`, *optional*):
-                Maximum number of frames to sample. Defaults to `self.num_frames`.
-            max_fps (`int`, *optional*):
-                Maximum frames per second to sample.
-            sampling_fps (`int`, *optional*):
-                Sampling frames per second. Defaults to `self.sampling_fps`.
-                Used when `frame_sample_mode` is `"fps"`.
-        """
-        frame_sample_mode = frame_sample_mode or self.frame_sample_mode
-        num_frames = num_frames or self.num_frames
-        sampling_fps = sampling_fps or self.sampling_fps
-
-        duration = metadata.duration or metadata.total_num_frames / metadata.fps
-        if frame_sample_mode == "fps":
-            candidate_target_fps = get_candidate_target_fps(metadata.fps, sampling_fps)
-            # Try larger and larger FPSs until we hit one that can't span the video
-            target_fps = candidate_target_fps[0]
-            for candidate_fps in candidate_target_fps[1:]:
-                if num_frames / candidate_fps < duration:
-                    break
-                target_fps = candidate_fps
-            times = np.arange(0, num_frames) / target_fps
-            times = times[times < duration]
-            return times
-        elif frame_sample_mode == "uniform_last_frame":
-            if max_fps is not None:
-                max_duration = (num_frames - 1) / max_fps  # -1 to include the last frame
-                if max_duration < duration:
-                    times = np.linspace(0, duration, num=num_frames, endpoint=True, dtype=np.float64)
-                else:
-                    times = np.arange(0.0, stop=duration, step=1 / max_fps)
-                    times = np.concatenate([times, [duration]], axis=0)
-                    assert len(times) <= num_frames
-            else:
-                times = np.linspace(0, duration, num=num_frames, endpoint=True, dtype=np.float64)
-            return times
-        else:
-            raise NotImplementedError(frame_sample_mode)
-
-    def sample_frames(
-        self,
-        metadata: VideoMetadata,
-        frame_sample_mode: str | None = None,
-        num_frames: int | None = None,
-        max_fps: int | None = None,
-        sampling_fps: int | None = None,
-        **kwargs,
-    ) -> np.ndarray:
-        """
-        Frame-based sampling if an array video is passed
-        Args:
-            metadata (`VideoMetadata`):
-                Metadata of the video containing information about total duration, fps and total number of frames.
-            frame_sample_mode (`str`, *optional*):
-                Mode to sample frames. Defaults to `self.frame_sample_mode`.
-            num_frames (`int`, *optional*):
-                Maximum number of frames to sample. Defaults to `self.num_frames`.
-            max_fps (`int`, *optional*):
-                Maximum frames per second to sample.
-            sampling_fps (`int`, *optional*):
-                Sampling frames per second. Defaults to `self.sampling_fps`.
-                Used when `frame_sample_mode` is `"fps"`.
-        """
-        frame_sample_mode = frame_sample_mode or self.frame_sample_mode
-        num_frames = num_frames or self.num_frames
-        sampling_fps = sampling_fps or self.sampling_fps
-
-        total_num_frames = metadata.total_num_frames
-        if frame_sample_mode == "uniform_last_frame" and max_fps is not None:
-            duration = total_num_frames / metadata.fps
-            if total_num_frames <= 2:
-                return np.arange(total_num_frames).astype(int)
-            if duration > (num_frames - 1) / max_fps:  # -1 to include the last frame
-                # uniform fallback
-                indices = np.linspace(
-                    0,
-                    total_num_frames - 1,
-                    num=min(num_frames, total_num_frames),
-                    endpoint=True,
-                ).astype(int)
-                return indices
-            else:
-                float_indices = np.arange(
-                    0.0,
-                    stop=total_num_frames - 1,
-                    step=float(metadata.fps / max_fps),
-                )
-                if np.round(float_indices[-1]) != total_num_frames - 1:
-                    float_indices = np.concatenate([float_indices, [total_num_frames - 1]], axis=0)
-                indices = np.round(float_indices).astype(int)
-                assert indices[-1] < total_num_frames
-                assert len(float_indices) <= num_frames
-                return indices
-        elif frame_sample_mode == "uniform_last_frame":
-            indices = np.linspace(
-                0,
-                total_num_frames - 1,
-                num=min(num_frames, total_num_frames),
-                endpoint=True,
-            ).astype(int)
-            return indices
-        elif frame_sample_mode == "fps":
-            candidate_target_fps = get_candidate_target_fps(metadata.fps, sampling_fps)
-            selected_target_fps = get_target_fps(
-                metadata.fps,
-                num_frames,
-                total_num_frames,
-                frame_sample_mode,
-                candidate_target_fps,
-            )
-            _, indices = get_frame_times_and_chosen_fps(
-                selected_target_fps,
-                total_num_frames,
-                num_frames,
-                metadata.fps,
-            )
-            return indices
-        else:
-            raise NotImplementedError(frame_sample_mode)
-
-    def fetch_videos(self, video_url_or_urls: str | list[str] | list[list[str]], sample_timestamps_fn=None):
-        """
-        Convert a single URL or a list of URLs into the corresponding `np.array` objects.
-
-        If a single url is passed, the return value will be a single object. If a list is passed a list of objects is
-        returned.
-        """
-        if (not is_decord_available()) and (not is_torchcodec_available()) and (not is_av_available()):
-            raise ImportError(
-                "MolmoAct2VideoProcessor requires `decord`, `torchcodec`, or `av` to be installed."
-            )
-
-        if is_decord_available():
-            backend = "decord"
-        elif is_torchcodec_available():
-            warnings.warn(
-                "`decord` is not installed and cannot be used to decode the video by default. "
-                "Falling back to `torchcodec`."
-            )
-            backend = "torchcodec"
-        else:
-            warnings.warn(
-                "`decord` is not installed and cannot be used to decode the video by default. "
-                "Falling back to `PyAV`."
-            )
-            backend = "pyav"
-
-        if isinstance(video_url_or_urls, list):
-            return list(
-                zip(
-                    *[
-                        self.fetch_videos(x, sample_timestamps_fn=sample_timestamps_fn)
-                        for x in video_url_or_urls
-                    ]
-                )
-            )
-        else:
-            return load_video(video_url_or_urls, backend=backend, sample_timestamps_fn=sample_timestamps_fn)
-
-    def _decode_and_sample_videos(
-        self,
-        videos: VideoInput,
-        video_metadata: VideoMetadata | dict,
-        do_sample_frames: bool | None = None,
-        sample_indices_fn: Callable | None = None,
-        sample_timestamps_fn: Callable | None = None,
-    ):
-        """
-        Decode input videos and sample frames if needed.
-        """
-        videos = make_batched_videos(videos)
-        video_metadata = make_batched_metadata(videos, video_metadata=video_metadata)
-
-        # Frame-based sampling if an array video is passed
-        # Otherwise, time-based sampling with decoding
-        if is_valid_video(videos[0]) and do_sample_frames:
-            assert video_metadata[0].fps is not None, "FPS must be provided for video input"
-            sampled_videos = []
-            sampled_metadata = []
-            for video, metadata in zip(videos, video_metadata):
-                indices = sample_indices_fn(metadata=metadata)
-                metadata.frames_indices = indices
-                sampled_videos.append(video[indices])
-                sampled_metadata.append(metadata)
-            videos = sampled_videos
-            video_metadata = sampled_metadata
-        elif not is_valid_video(videos[0]):
-            if sample_indices_fn is None:
-                logger.warning(
-                    "do_sample_frames is False, but video array is not provided: "
-                    "Will decode the video and sample frames using MolmoAct2's default sampling mode"
-                )
-            if isinstance(videos[0], list):
-                raise ValueError("A list of images is not supported for video input!")
-            else:
-                videos, video_metadata = self.fetch_videos(videos, sample_timestamps_fn=sample_timestamps_fn)
-
-        return videos, video_metadata
-
-    def _prepare_input_videos(
-        self,
-        videos: VideoInput,
-        **kwargs,
-    ) -> list[np.ndarray]:
-        processed_videos = [to_numpy(video) for video in videos]
-        return processed_videos
-
-    def preprocess(
-        self,
-        videos: VideoInput,
-        **kwargs: Unpack[MolmoAct2VideoProcessorKwargs],
-    ) -> BatchFeature:
-        validate_kwargs(
-            captured_kwargs=kwargs.keys(),
-            valid_processor_keys=list(self.valid_kwargs.__annotations__.keys()) + ["return_tensors"],
-        )
-
-        # Set default kwargs from self. This ensures that if a kwarg is not provided
-        # by the user, it gets its default value from the instance, or is set to None.
-        for kwarg_name in self.valid_kwargs.__annotations__:
-            kwargs.setdefault(kwarg_name, getattr(self, kwarg_name, None))
-
-        do_sample_frames = kwargs.pop("do_sample_frames")
-        video_metadata = kwargs.pop("video_metadata")
-
-        sample_indices_fn = partial(self.sample_frames, **kwargs) if do_sample_frames else None
-        sample_timestamps_fn = partial(self.sample_times, **kwargs)
-        videos, video_metadata = self._decode_and_sample_videos(
-            videos,
-            video_metadata=video_metadata,
-            do_sample_frames=do_sample_frames,
-            sample_indices_fn=sample_indices_fn,
-            sample_timestamps_fn=sample_timestamps_fn,
-        )
-        videos = self._prepare_input_videos(videos=videos)
-
-        kwargs = self._further_process_kwargs(**kwargs)
-
-        return_metadata = kwargs.pop("return_metadata")
-        preprocessed_videos = self._preprocess(videos=videos, **kwargs)
-        if return_metadata:
-            preprocessed_videos["video_metadata"] = video_metadata
-        return preprocessed_videos
-
-    def _preprocess(
-        self,
-        videos: list[np.ndarray],
-        size: SizeDict | None = None,
-        resample: PILImageResampling | None = None,
-        image_mean: float | list[float] | None = None,
-        image_std: float | list[float] | None = None,
-        do_convert_rgb: bool | None = None,
-        patch_size: int | None = None,
-        pooling_size: list[int] | None = None,
-        return_tensors: str | TensorType | None = None,
-        **kwargs,
-    ) -> BatchFeature:
-        """
-        Preprocess a video for the model.
-        Args:
-            videos (`VideoInput`):
-                Video to preprocess.
-            size (`SizeDict`, *optional*, defaults to `self.size`):
-                Size of the image after resizing.
-            resample (`PILImageResampling`, *optional*, defaults to `self.resample`):
-                Resampling filter to use when resizing the image. This can be one of the enum `PILImageResampling`. Only
-                has an effect if `do_resize` is set to `True`.
-            image_mean (`float` or `list[float]`, *optional*, defaults to `self.image_mean`):
-                Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
-            image_std (`float` or `list[float]`, *optional*, defaults to `self.image_std`):
-                Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to
-                `True`.
-            do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
-                Whether to convert the image to RGB.
-            patch_size (`int`, *optional*, defaults to `self.patch_size`):
-                The spatial patch size of the vision encoder.
-            pooling_size (`list[int]`, *optional*, defaults to `self.pooling_size`):
-                The pooling size of the vision adapter.
-            return_tensors (`str` or `TensorType`, *optional*):
-                The type of tensors to return. Can be one of:
-                - Unset: Return a list of `np.ndarray`.
-                - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
-                - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
-                - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
-                - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
-
-        Returns:
-            A `BatchFeature` containing the following keys:
-                - `pixel_values_videos`: The preprocessed videos.
-                - `video_token_pooling`: The indices of the patches in `crops` to pool for each token in `video_tokens`.
-                - `video_grids`: The video grids.
-        """
-        if size.height is None or size.width is None:
-            raise ValueError("size must contain 'height' and 'width' keys.")
-
-        base_image_input_size = [size.height, size.width]
-
-        resample = resample or self.resample
-        image_mean = image_mean or self.image_mean
-        image_std = image_std or self.image_std
-        do_convert_rgb = do_convert_rgb or self.do_convert_rgb
-
-        patch_size = patch_size or self.patch_size
-        pooling_size = pooling_size or self.pooling_size
-
-        image_pooling_h, image_pooling_w = pooling_size
-
-        batch_grids = []
-        batch_crops = []
-        batch_pooled_patches_idx = []
-
-        for video in videos:
-            all_crops = []
-            pooled_patches_idx = []
-
-            for frame in video:
-                image_grid, crops, pooled_idx = image_to_patches_and_grids(
-                    frame,
-                    base_image_input_size,
-                    resample,
-                    image_mean,
-                    image_std,
-                    patch_size,
-                    image_pooling_w,
-                    image_pooling_h,
-                )
-                offset = sum(np.prod(x.shape[:2]) for x in all_crops)
-                pooled_idx_with_offset = np.where(pooled_idx >= 0, pooled_idx + offset, pooled_idx)
-                pooled_patches_idx.append(pooled_idx_with_offset)
-                all_crops.append(crops)
-
-            video_grid = np.array([len(video), image_grid[0], image_grid[1]])
-            all_crops = np.concatenate(all_crops, 0)
-            pooled_patches_idx = np.concatenate(pooled_patches_idx, 0)
-
-            batch_grids.append(video_grid)
-            batch_crops.append(all_crops)
-            batch_pooled_patches_idx.append(pooled_patches_idx)
-
-        video_grids = np.stack(batch_grids, 0)
-        pixel_values_videos = np.concatenate(batch_crops, 0)
-        video_token_pooling = np.concatenate(batch_pooled_patches_idx, 0)
-
-        data = dict(
-            pixel_values_videos=pixel_values_videos,
-            video_token_pooling=video_token_pooling,
-            video_grids=video_grids,
-        )
-
-        return BatchFeature(data, tensor_type=return_tensors)
-
-
-MolmoAct2VideoProcessor.register_for_auto_class()
--- a/src/lerobot/policies/molmoact2/modeling_molmoact2.py
+++ b/src/lerobot/policies/molmoact2/modeling_molmoact2.py
--- a/src/lerobot/policies/molmoact2/processor_molmoact2.py
+++ b/src/lerobot/policies/molmoact2/processor_molmoact2.py
--- a/src/lerobot/policies/pi0/modeling_pi0.py
+++ b/src/lerobot/policies/pi0/modeling_pi0.py
@@ -15,6 +15,7 @@
 # limitations under the License.

 import builtins
+import copy
 import logging
 import math
 from collections import deque
@@ -29,7 +30,6 @@ from lerobot.utils.import_utils import _transformers_available, require_package

 # Conditional import for type checking and lazy loading
 if TYPE_CHECKING or _transformers_available:
-    from transformers.cache_utils import DynamicCache
    from transformers.models.auto import CONFIG_MAPPING
    from transformers.models.gemma import modeling_gemma

@@ -41,7 +41,6 @@ if TYPE_CHECKING or _transformers_available:
    )
 else:
    CONFIG_MAPPING = None
-    DynamicCache = None
    modeling_gemma = None
    PiGemmaForCausalLM = None
    _gated_residual = None
@@ -142,15 +141,6 @@ def make_att_2d_masks(pad_masks, att_masks):  # see openpi `make_att_2d_masks` (
    return att_2d_masks & pad_2d_masks


-def clone_past_key_values(past_key_values):
-    """Clone the DynamicCache returned by prefix prefill for compiled denoising."""
-    return DynamicCache(
-        tuple(
-            (keys.clone(), values.clone(), sliding_window) for keys, values, sliding_window in past_key_values
-        )
-    )
-
-
 def pad_vector(vector, new_dim):
    """Pad the last dimension of a vector to new_dim with zeros.

@@ -237,13 +227,16 @@ def resize_with_pad_torch(  # see openpi `resize_with_pad_torch` (exact copy)


 # Define the complete layer computation function for gradient checkpointing
-def compute_layer_complete(inputs_embeds, attention_mask, position_ids, adarms_cond, layers, rotary_emb):
+def compute_layer_complete(
+    layer_idx, inputs_embeds, attention_mask, position_ids, adarms_cond, paligemma, gemma_expert
+):
+    models = [paligemma.model.language_model, gemma_expert.model]
    query_states = []
    key_states = []
    value_states = []
    gates = []
    for i, hidden_states in enumerate(inputs_embeds):
-        layer = layers[i]
+        layer = models[i].layers[layer_idx]
        hidden_states, gate = layernorm_forward(layer.input_layernorm, hidden_states, adarms_cond[i])
        gates.append(gate)
        input_shape = hidden_states.shape[:-1]
@@ -265,16 +258,15 @@ def compute_layer_complete(inputs_embeds, attention_mask, position_ids, adarms_c
        device=query_states.device,
        dtype=query_states.dtype,
    )
-    cos, sin = rotary_emb(dummy_tensor, position_ids)
+    cos, sin = paligemma.model.language_model.rotary_emb(dummy_tensor, position_ids)
    query_states, key_states = modeling_gemma.apply_rotary_pos_emb(
        query_states, key_states, cos, sin, unsqueeze_dim=1
    )
    batch_size = query_states.shape[0]
-    paligemma_layer = layers[0]
-    scaling = paligemma_layer.self_attn.scaling
+    scaling = paligemma.model.language_model.layers[layer_idx].self_attn.scaling
    # Attention computation
    att_output, _ = modeling_gemma.eager_attention_forward(
-        paligemma_layer.self_attn,
+        paligemma.model.language_model.layers[layer_idx].self_attn,
        query_states,
        key_states,
        value_states,
@@ -282,13 +274,13 @@ def compute_layer_complete(inputs_embeds, attention_mask, position_ids, adarms_c
        scaling,
    )
    # Get head_dim from the current layer, not from the model
-    head_dim = paligemma_layer.self_attn.head_dim
+    head_dim = paligemma.model.language_model.layers[layer_idx].self_attn.head_dim
    att_output = att_output.reshape(batch_size, -1, 1 * 8 * head_dim)
    # Process layer outputs
    outputs_embeds = []
    start_pos = 0
    for i, hidden_states in enumerate(inputs_embeds):
-        layer = layers[i]
+        layer = models[i].layers[layer_idx]
        end_pos = start_pos + hidden_states.shape[1]
        if att_output.dtype != layer.self_attn.o_proj.weight.dtype:
            att_output = att_output.to(layer.self_attn.o_proj.weight.dtype)
@@ -496,9 +488,8 @@ class PaliGemmaWithExpertModel(
            prefix_output = None
            prefix_past_key_values = None
        else:
-            paligemma_layers = self.paligemma.model.language_model.layers
-            gemma_expert_layers = self.gemma_expert.model.layers
-            rotary_emb = self.paligemma.model.language_model.rotary_emb
+            models = [self.paligemma.model.language_model, self.gemma_expert.model]
+            num_layers = self.paligemma.config.text_config.num_hidden_layers

            # Check if gradient checkpointing is enabled for any of the models
            use_gradient_checkpointing = (
@@ -508,39 +499,36 @@ class PaliGemmaWithExpertModel(
            ) or (hasattr(self, "gradient_checkpointing") and self.gradient_checkpointing and self.training)

            # Process all layers with gradient checkpointing if enabled
-            for layers in zip(paligemma_layers, gemma_expert_layers, strict=True):
+            for layer_idx in range(num_layers):
                if use_gradient_checkpointing:
                    inputs_embeds = torch.utils.checkpoint.checkpoint(
                        compute_layer_complete,
+                        layer_idx,
                        inputs_embeds,
                        attention_mask,
                        position_ids,
                        adarms_cond,
                        use_reentrant=False,
                        preserve_rng_state=False,
-                        layers=layers,
-                        rotary_emb=rotary_emb,
+                        paligemma=self.paligemma,
+                        gemma_expert=self.gemma_expert,
                    )
                else:
                    inputs_embeds = compute_layer_complete(
+                        layer_idx,
                        inputs_embeds,
                        attention_mask,
                        position_ids,
                        adarms_cond,
-                        layers=layers,
-                        rotary_emb=rotary_emb,
+                        paligemma=self.paligemma,
+                        gemma_expert=self.gemma_expert,
                    )

            # final norm
-            final_norms = (
-                self.paligemma.model.language_model.norm,
-                self.gemma_expert.model.norm,
-            )
-
            def compute_final_norms(inputs_embeds, adarms_cond):
                outputs_embeds = []
                for i, hidden_states in enumerate(inputs_embeds):
-                    out_emb, _ = layernorm_forward(final_norms[i], hidden_states, adarms_cond[i])
+                    out_emb, _ = layernorm_forward(models[i].norm, hidden_states, adarms_cond[i])
                    outputs_embeds.append(out_emb)
                return outputs_embeds

@@ -919,7 +907,7 @@ class PI0Pytorch(nn.Module):  # see openpi `PI0Pytorch`
        full_att_2d_masks_4d = self._prepare_attention_masks_4d(full_att_2d_masks)
        self.paligemma_with_expert.gemma_expert.model.config._attn_implementation = "eager"  # noqa: SLF001

-        past_key_values = clone_past_key_values(past_key_values)
+        past_key_values = copy.deepcopy(past_key_values)
        outputs_embeds, _ = self.paligemma_with_expert.forward(
            attention_mask=full_att_2d_masks_4d,
            position_ids=position_ids,
--- a/src/lerobot/policies/pi05/modeling_pi05.py
+++ b/src/lerobot/policies/pi05/modeling_pi05.py
@@ -15,6 +15,7 @@
 # limitations under the License.

 import builtins
+import copy
 import logging
 import math
 from collections import deque
@@ -29,7 +30,6 @@ from lerobot.utils.import_utils import _transformers_available, require_package

 # Conditional import for type checking and lazy loading
 if TYPE_CHECKING or _transformers_available:
-    from transformers.cache_utils import DynamicCache
    from transformers.models.auto import CONFIG_MAPPING
    from transformers.models.gemma import modeling_gemma

@@ -41,7 +41,6 @@ if TYPE_CHECKING or _transformers_available:
    )
 else:
    CONFIG_MAPPING = None
-    DynamicCache = None
    modeling_gemma = None
    PiGemmaForCausalLM = None
    _gated_residual = None
@@ -139,15 +138,6 @@ def make_att_2d_masks(pad_masks, att_masks):  # see openpi `make_att_2d_masks` (
    return att_2d_masks & pad_2d_masks


-def clone_past_key_values(past_key_values):
-    """Clone the DynamicCache returned by prefix prefill for compiled denoising."""
-    return DynamicCache(
-        tuple(
-            (keys.clone(), values.clone(), sliding_window) for keys, values, sliding_window in past_key_values
-        )
-    )
-
-
 def pad_vector(vector, new_dim):
    """Pad the last dimension of a vector to new_dim with zeros.

@@ -234,13 +224,16 @@ def resize_with_pad_torch(  # see openpi `resize_with_pad_torch` (exact copy)


 # Define the complete layer computation function for gradient checkpointing
-def compute_layer_complete(inputs_embeds, attention_mask, position_ids, adarms_cond, layers, rotary_emb):
+def compute_layer_complete(
+    layer_idx, inputs_embeds, attention_mask, position_ids, adarms_cond, paligemma, gemma_expert
+):
+    models = [paligemma.model.language_model, gemma_expert.model]
    query_states = []
    key_states = []
    value_states = []
    gates = []
    for i, hidden_states in enumerate(inputs_embeds):
-        layer = layers[i]
+        layer = models[i].layers[layer_idx]
        hidden_states, gate = layernorm_forward(layer.input_layernorm, hidden_states, adarms_cond[i])
        gates.append(gate)
        input_shape = hidden_states.shape[:-1]
@@ -262,16 +255,15 @@ def compute_layer_complete(inputs_embeds, attention_mask, position_ids, adarms_c
        device=query_states.device,
        dtype=query_states.dtype,
    )
-    cos, sin = rotary_emb(dummy_tensor, position_ids)
+    cos, sin = paligemma.model.language_model.rotary_emb(dummy_tensor, position_ids)
    query_states, key_states = modeling_gemma.apply_rotary_pos_emb(
        query_states, key_states, cos, sin, unsqueeze_dim=1
    )
    batch_size = query_states.shape[0]
-    paligemma_layer = layers[0]
-    scaling = paligemma_layer.self_attn.scaling
+    scaling = paligemma.model.language_model.layers[layer_idx].self_attn.scaling
    # Attention computation
    att_output, _ = modeling_gemma.eager_attention_forward(
-        paligemma_layer.self_attn,
+        paligemma.model.language_model.layers[layer_idx].self_attn,
        query_states,
        key_states,
        value_states,
@@ -279,13 +271,13 @@ def compute_layer_complete(inputs_embeds, attention_mask, position_ids, adarms_c
        scaling,
    )
    # Get head_dim from the current layer, not from the model
-    head_dim = paligemma_layer.self_attn.head_dim
+    head_dim = paligemma.model.language_model.layers[layer_idx].self_attn.head_dim
    att_output = att_output.reshape(batch_size, -1, 1 * 8 * head_dim)
    # Process layer outputs
    outputs_embeds = []
    start_pos = 0
    for i, hidden_states in enumerate(inputs_embeds):
-        layer = layers[i]
+        layer = models[i].layers[layer_idx]
        end_pos = start_pos + hidden_states.shape[1]
        if att_output.dtype != layer.self_attn.o_proj.weight.dtype:
            att_output = att_output.to(layer.self_attn.o_proj.weight.dtype)
@@ -493,9 +485,8 @@ class PaliGemmaWithExpertModel(
            prefix_output = None
            prefix_past_key_values = None
        else:
-            paligemma_layers = self.paligemma.model.language_model.layers
-            gemma_expert_layers = self.gemma_expert.model.layers
-            rotary_emb = self.paligemma.model.language_model.rotary_emb
+            models = [self.paligemma.model.language_model, self.gemma_expert.model]
+            num_layers = self.paligemma.config.text_config.num_hidden_layers

            # Check if gradient checkpointing is enabled for any of the models
            use_gradient_checkpointing = (
@@ -505,39 +496,36 @@ class PaliGemmaWithExpertModel(
            ) or (hasattr(self, "gradient_checkpointing") and self.gradient_checkpointing and self.training)

            # Process all layers with gradient checkpointing if enabled
-            for layers in zip(paligemma_layers, gemma_expert_layers, strict=True):
+            for layer_idx in range(num_layers):
                if use_gradient_checkpointing:
                    inputs_embeds = torch.utils.checkpoint.checkpoint(
                        compute_layer_complete,
+                        layer_idx,
                        inputs_embeds,
                        attention_mask,
                        position_ids,
                        adarms_cond,
                        use_reentrant=False,
                        preserve_rng_state=False,
-                        layers=layers,
-                        rotary_emb=rotary_emb,
+                        paligemma=self.paligemma,
+                        gemma_expert=self.gemma_expert,
                    )
                else:
                    inputs_embeds = compute_layer_complete(
+                        layer_idx,
                        inputs_embeds,
                        attention_mask,
                        position_ids,
                        adarms_cond,
-                        layers=layers,
-                        rotary_emb=rotary_emb,
+                        paligemma=self.paligemma,
+                        gemma_expert=self.gemma_expert,
                    )

            # final norm
-            final_norms = (
-                self.paligemma.model.language_model.norm,
-                self.gemma_expert.model.norm,
-            )
-
            def compute_final_norms(inputs_embeds, adarms_cond):
                outputs_embeds = []
                for i, hidden_states in enumerate(inputs_embeds):
-                    out_emb, _ = layernorm_forward(final_norms[i], hidden_states, adarms_cond[i])
+                    out_emb, _ = layernorm_forward(models[i].norm, hidden_states, adarms_cond[i])
                    outputs_embeds.append(out_emb)
                return outputs_embeds

@@ -892,7 +880,7 @@ class PI05Pytorch(nn.Module):  # see openpi `PI0Pytorch`
        full_att_2d_masks_4d = self._prepare_attention_masks_4d(full_att_2d_masks)
        self.paligemma_with_expert.gemma_expert.model.config._attn_implementation = "eager"  # noqa: SLF001

-        past_key_values = clone_past_key_values(past_key_values)
+        past_key_values = copy.deepcopy(past_key_values)
        outputs_embeds, _ = self.paligemma_with_expert.forward(
            attention_mask=full_att_2d_masks_4d,
            position_ids=position_ids,
--- a/src/lerobot/policies/pretrained.py
+++ b/src/lerobot/policies/pretrained.py
@@ -248,7 +248,13 @@ class PreTrainedPolicy(nn.Module, HubMixin, abc.ABC):
    def generate_model_card(
        self, dataset_repo_id: str, model_type: str, license: str | None, tags: list[str] | None
    ) -> ModelCard:
-        base_model = "lerobot/smolvla_base" if model_type == "smolvla" else None  # Set a base model
+        base_model_mapping = {
+            "smolvla": "lerobot/smolvla_base",
+            "pi0": "lerobot/pi0_base",
+            "pi05": "lerobot/pi05_base",
+            "pi0_fast": "lerobot/pi0fast-base",
+            "xvla": "lerobot/xvla-base",
+        }

        card_data = ModelCardData(
            license=license or "apache-2.0",
@@ -257,7 +263,7 @@ class PreTrainedPolicy(nn.Module, HubMixin, abc.ABC):
            tags=list(set(tags or []).union({"robotics", "lerobot", model_type})),
            model_name=model_type,
            datasets=dataset_repo_id,
-            base_model=base_model,
+            base_model=base_model_mapping.get(model_type, None),
        )

        template_card = (
--- a/src/lerobot/rewards/__init__.py
+++ b/src/lerobot/rewards/__init__.py
@@ -20,16 +20,12 @@ from .factory import (
    make_reward_pre_post_processors as make_reward_pre_post_processors,
 )
 from .pretrained import PreTrainedRewardModel as PreTrainedRewardModel
-from .robometer.configuration_robometer import RobometerConfig as RobometerConfig
 from .sarm.configuration_sarm import SARMConfig as SARMConfig
-from .topreward.configuration_topreward import TOPRewardConfig as TOPRewardConfig

 __all__ = [
    # Configuration classes
    "RewardClassifierConfig",
-    "RobometerConfig",
    "SARMConfig",
-    "TOPRewardConfig",
    # Base class
    "PreTrainedRewardModel",
    # Factory functions
--- a/src/lerobot/rewards/factory.py
+++ b/src/lerobot/rewards/factory.py
@@ -25,9 +25,7 @@ from lerobot.processor import PolicyAction, PolicyProcessorPipeline

 from .classifier.configuration_classifier import RewardClassifierConfig
 from .pretrained import PreTrainedRewardModel
-from .robometer.configuration_robometer import RobometerConfig
 from .sarm.configuration_sarm import SARMConfig
-from .topreward.configuration_topreward import TOPRewardConfig


 def get_reward_model_class(name: str) -> type[PreTrainedRewardModel]:
@@ -39,7 +37,7 @@ def get_reward_model_class(name: str) -> type[PreTrainedRewardModel]:

    Args:
        name: The name of the reward model. Supported names are "reward_classifier",
-              "sarm", "robometer", "topreward".
+              "sarm".

    Returns:
        The reward model class corresponding to the given name.
@@ -55,14 +53,6 @@ def get_reward_model_class(name: str) -> type[PreTrainedRewardModel]:
        from lerobot.rewards.sarm.modeling_sarm import SARMRewardModel

        return SARMRewardModel
-    elif name == "robometer":
-        from lerobot.rewards.robometer.modeling_robometer import RobometerRewardModel
-
-        return RobometerRewardModel
-    elif name == "topreward":
-        from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel
-
-        return TOPRewardModel
    else:
        try:
            return _get_reward_model_cls_from_name(name=name)
@@ -79,7 +69,7 @@ def make_reward_model_config(reward_type: str, **kwargs) -> RewardModelConfig:

    Args:
        reward_type: The type of the reward model. Supported types include
-                     "reward_classifier", "sarm", "robometer", "topreward".
+                     "reward_classifier", "sarm".
        **kwargs: Keyword arguments to be passed to the configuration class constructor.

    Returns:
@@ -92,10 +82,6 @@ def make_reward_model_config(reward_type: str, **kwargs) -> RewardModelConfig:
        return RewardClassifierConfig(**kwargs)
    elif reward_type == "sarm":
        return SARMConfig(**kwargs)
-    elif reward_type == "robometer":
-        return RobometerConfig(**kwargs)
-    elif reward_type == "topreward":
-        return TOPRewardConfig(**kwargs)
    else:
        try:
            config_cls = RewardModelConfig.get_choice_class(reward_type)
@@ -175,21 +161,6 @@ def make_reward_pre_post_processors(
            dataset_stats=kwargs.get("dataset_stats"),
            dataset_meta=kwargs.get("dataset_meta"),
        )
-    elif isinstance(reward_cfg, RobometerConfig):
-        from lerobot.rewards.robometer.processor_robometer import make_robometer_pre_post_processors
-
-        return make_robometer_pre_post_processors(
-            config=reward_cfg,
-            dataset_stats=kwargs.get("dataset_stats"),
-        )
-
-    elif isinstance(reward_cfg, TOPRewardConfig):
-        from lerobot.rewards.topreward.processor_topreward import make_topreward_pre_post_processors
-
-        return make_topreward_pre_post_processors(
-            config=reward_cfg,
-            dataset_stats=kwargs.get("dataset_stats"),
-        )

    else:
        try:
--- a/src/lerobot/rewards/robometer/__init__.py
+++ b/src/lerobot/rewards/robometer/__init__.py
@@ -1,19 +0,0 @@
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from .configuration_robometer import RobometerConfig
-from .modeling_robometer import RobometerRewardModel
-from .processor_robometer import make_robometer_pre_post_processors
-
-__all__ = ["RobometerConfig", "RobometerRewardModel", "make_robometer_pre_post_processors"]
--- a/src/lerobot/rewards/robometer/compute_rabc_weights.py
+++ b/src/lerobot/rewards/robometer/compute_rabc_weights.py
@@ -1,320 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-"""Compute per-frame Robometer progress and success curves for a LeRobot dataset.
-
-For each episode, builds per-frame sub-samples using the frame-steps
-strategy from the Robometer eval server: for each original frame ``t``,
-linspace-subsample ``[0, t]`` into ``K`` frames (default 4, matching
-``NUM_SUBSAMPLED_FRAMES`` in the eval server), run one forward through
-the Robometer processor + model, and keep the last-frame progress value.
-All sub-samples are the same size ``K`` so they batch cleanly.
-
-The parquet uses the same schema as SARM's
-:mod:`lerobot.rewards.sarm.compute_rabc_weights` so existing consumers —
-:class:`lerobot.rewards.sarm.rabc.RABCWeights` (which reads
-``progress_sparse``) and the progress-overlay script in
-``examples/dataset/create_progress_videos.py`` — work without modification.
-
-Usage:
-    # Dense per-frame progress for one episode
-    python -m lerobot.rewards.robometer.compute_rabc_weights \\
-        --dataset-repo-id lerobot/libero_10_image \\
-        --reward-model-path lerobot/Robometer-4B \\
-        --episodes 0
-
-    # All episodes with batching
-    python -m lerobot.rewards.robometer.compute_rabc_weights \\
-        --dataset-repo-id lerobot/libero_10_image \\
-        --reward-model-path lerobot/Robometer-4B \\
-        --batch-size 16
-"""
-
-from __future__ import annotations
-
-import argparse
-import logging
-from pathlib import Path
-from typing import Any
-
-import numpy as np
-import pyarrow as pa
-import pyarrow.parquet as pq
-import torch
-from tqdm import tqdm
-
-from lerobot.datasets import LeRobotDataset
-from lerobot.rewards.robometer.configuration_robometer import RobometerConfig
-from lerobot.rewards.robometer.modeling_robometer import RobometerRewardModel
-from lerobot.rewards.robometer.processor_robometer import RobometerEncoderProcessorStep
-from lerobot.types import TransitionKey
-
-DEFAULT_OUTPUT_FILENAME = "robometer_progress.parquet"
-
-# Upstream Robometer eval server uses K=4 for frame-steps sub-samples.
-DEFAULT_NUM_SUBSAMPLED_FRAMES = 4
-
-
-def get_reward_model_path_from_parquet(parquet_path: Path) -> str | None:
-    """Read ``reward_model_path`` from parquet metadata if available."""
-    if not parquet_path.exists():
-        return None
-    try:
-        metadata = pq.read_metadata(parquet_path).schema.to_arrow_schema().metadata
-        if metadata and b"reward_model_path" in metadata:
-            return metadata[b"reward_model_path"].decode()
-    except Exception:  # nosec B110
-        return None
-    return None
-
-
-def _resolve_task(sample: dict[str, Any], default: str) -> str:
-    """Best-effort task extraction from a dataset sample."""
-    task = sample.get("task")
-    if isinstance(task, str) and task:
-        return task
-    return default
-
-
-def _build_subsample_indices(num_frames: int, num_subsampled_frames: int) -> list[np.ndarray]:
-    """Frame-steps linspace expansion.
-
-    For each ``t in [0, num_frames - 1]`` returns ``num_subsampled_frames``
-    indices from ``np.linspace(0, t, num_subsampled_frames)`` — the first
-    and last frames are always included. Each entry is a fixed-size array
-    so the model can batch them.
-    """
-    return [np.linspace(0, t, num_subsampled_frames).round().astype(np.int64) for t in range(num_frames)]
-
-
-def compute_robometer_progress(
-    dataset_repo_id: str,
-    reward_model_path: str,
-    output_path: str | None = None,
-    device: str = "cuda",
-    batch_size: int = 32,
-    num_subsampled_frames: int = DEFAULT_NUM_SUBSAMPLED_FRAMES,
-    episodes: list[int] | None = None,
-    image_key: str | None = None,
-) -> Path:
-    """Run Robometer over a dataset and write the per-frame reward (progress by default) to parquet."""
-    logging.info(f"Loading Robometer: {reward_model_path}")
-    config = RobometerConfig(pretrained_path=reward_model_path, device=device)
-    if image_key is not None:
-        config.image_key = image_key
-    model = RobometerRewardModel.from_pretrained(reward_model_path, config=config)
-    model.to(device).eval()
-
-    encoder = RobometerEncoderProcessorStep(
-        base_model_id=config.base_model_id,
-        image_key=config.image_key,
-        task_key=config.task_key,
-        default_task=config.default_task,
-        max_frames=num_subsampled_frames,
-        use_multi_image=config.use_multi_image,
-        use_per_frame_progress_token=config.use_per_frame_progress_token,
-    )
-
-    image_key = config.image_key
-
-    logging.info(f"Loading dataset: {dataset_repo_id}")
-    dataset = LeRobotDataset(dataset_repo_id, download_videos=True)
-    logging.info(f"Dataset: {dataset.num_episodes} episodes, {dataset.num_frames} frames")
-
-    episode_indices = list(range(dataset.num_episodes)) if episodes is None else episodes
-    logging.info(f"Processing {len(episode_indices)} episode(s)")
-
-    all_index: list[int] = []
-    all_episode: list[int] = []
-    all_frame: list[int] = []
-    all_progress: list[float] = []
-
-    for episode_idx in tqdm(episode_indices, desc="Episodes"):
-        ep = dataset.meta.episodes[episode_idx]
-        ep_start = int(ep["dataset_from_index"])
-        ep_end = int(ep["dataset_to_index"])
-        num_frames = ep_end - ep_start
-        if num_frames <= 0:
-            continue
-
-        first_sample = dataset[ep_start]
-        task = _resolve_task(first_sample, default=config.default_task or "perform the task")
-
-        ep_frames = torch.stack([dataset[ep_start + i][image_key] for i in range(num_frames)])
-
-        sub_indices = _build_subsample_indices(num_frames, num_subsampled_frames)
-
-        progress_per_frame = np.zeros(num_frames, dtype=np.float32)
-
-        for start in tqdm(range(0, num_frames, batch_size), desc=f"  Ep {episode_idx}", leave=False):
-            end = min(start + batch_size, num_frames)
-            frames_batch = torch.stack([ep_frames[sub_indices[i]] for i in range(start, end)])
-
-            transition = {
-                TransitionKey.OBSERVATION: {image_key: frames_batch},
-                TransitionKey.COMPLEMENTARY_DATA: {"task": task},
-            }
-            encoded = encoder(transition)
-            obs = encoded[TransitionKey.OBSERVATION]
-            batch = {
-                key: value.to(device) if isinstance(value, torch.Tensor) else value
-                for key, value in obs.items()
-            }
-
-            with torch.no_grad():
-                rewards = model.compute_reward(batch)
-            progress_per_frame[start:end] = rewards.cpu().numpy()
-
-        for local in range(num_frames):
-            all_index.append(ep_start + local)
-            all_episode.append(episode_idx)
-            all_frame.append(local)
-            all_progress.append(float(progress_per_frame[local]))
-
-        if device.startswith("cuda"):
-            torch.cuda.empty_cache()
-
-    table = pa.table(
-        {
-            "index": np.asarray(all_index, dtype=np.int64),
-            "episode_index": np.asarray(all_episode, dtype=np.int64),
-            "frame_index": np.asarray(all_frame, dtype=np.int64),
-            "progress_sparse": np.asarray(all_progress, dtype=np.float32),
-        }
-    ).replace_schema_metadata({b"reward_model_path": reward_model_path.encode()})
-
-    out = Path(dataset.root) / DEFAULT_OUTPUT_FILENAME if output_path is None else Path(output_path)
-    out.parent.mkdir(parents=True, exist_ok=True)
-    pq.write_table(table, out)
-    logging.info(f"Saved {len(table)} frame values to {out}")
-
-    progress_arr = np.asarray(all_progress, dtype=np.float32)
-    if progress_arr.size:
-        logging.info(
-            f"Progress: mean={float(progress_arr.mean()):.4f}, "
-            f"std={float(progress_arr.std()):.4f}, "
-            f"min={float(progress_arr.min()):.4f}, "
-            f"max={float(progress_arr.max()):.4f}"
-        )
-    return out
-
-
-def main():
-    parser = argparse.ArgumentParser(
-        description="Compute per-frame Robometer progress curves for RA-BC weighting.",
-        formatter_class=argparse.RawDescriptionHelpFormatter,
-        epilog="""
-Examples:
-    # Dense per-frame progress for one episode
-    python -m lerobot.rewards.robometer.compute_rabc_weights \\
-        --dataset-repo-id lerobot/libero_10_image \\
-        --reward-model-path lerobot/Robometer-4B \\
-        --episodes 0
-
-    # All episodes, smaller batches for memory-constrained GPUs
-    python -m lerobot.rewards.robometer.compute_rabc_weights \\
-        --dataset-repo-id lerobot/libero_10_image \\
-        --reward-model-path lerobot/Robometer-4B \\
-        --batch-size 16
-        """,
-    )
-    parser.add_argument(
-        "--dataset-repo-id", type=str, required=True, help="HuggingFace dataset repo id or local path."
-    )
-    parser.add_argument(
-        "--reward-model-path", type=str, default=None, help="Robometer checkpoint repo id or local path."
-    )
-    parser.add_argument("--output-path", type=str, default=None, help="Output parquet path.")
-    parser.add_argument("--device", type=str, default="cuda", help="Device to use (default: cuda).")
-    parser.add_argument(
-        "--batch-size", type=int, default=32, help="Sub-samples per Qwen forward (default: 32)."
-    )
-    parser.add_argument(
-        "--num-subsampled-frames",
-        type=int,
-        default=DEFAULT_NUM_SUBSAMPLED_FRAMES,
-        help=f"Frames per sub-sample (default: {DEFAULT_NUM_SUBSAMPLED_FRAMES}, matches eval server).",
-    )
-    parser.add_argument(
-        "--episodes", type=int, nargs="+", default=None, help="Process only these episode indices."
-    )
-    parser.add_argument(
-        "--image-key", type=str, default=None, help="Image observation key (default: from config)."
-    )
-    parser.add_argument(
-        "--push-to-hub", action="store_true", help="Upload to the dataset repo on HuggingFace Hub."
-    )
-
-    args = parser.parse_args()
-
-    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
-
-    reward_model_path = args.reward_model_path
-    if reward_model_path is None:
-        temp_dataset = LeRobotDataset(args.dataset_repo_id, download_videos=False)
-        parquet_path = Path(temp_dataset.root) / DEFAULT_OUTPUT_FILENAME
-        reward_model_path = get_reward_model_path_from_parquet(parquet_path)
-        if reward_model_path:
-            logging.info(f"Using reward model from parquet metadata: {reward_model_path}")
-        else:
-            raise ValueError(
-                "--reward-model-path is required (no existing parquet with model metadata found)."
-            )
-
-    output_path = compute_robometer_progress(
-        dataset_repo_id=args.dataset_repo_id,
-        reward_model_path=reward_model_path,
-        output_path=args.output_path,
-        device=args.device,
-        batch_size=args.batch_size,
-        num_subsampled_frames=args.num_subsampled_frames,
-        episodes=args.episodes,
-        image_key=args.image_key,
-    )
-
-    print(f"\nRobometer progress saved to: {output_path}")
-
-    if args.push_to_hub:
-        from huggingface_hub import HfApi
-
-        api = HfApi()
-        hub_path = DEFAULT_OUTPUT_FILENAME
-
-        print(f"\nUploading to Hub: {args.dataset_repo_id}/{hub_path}")
-        api.upload_file(
-            path_or_fileobj=str(output_path),
-            path_in_repo=hub_path,
-            repo_id=args.dataset_repo_id,
-            repo_type="dataset",
-        )
-        print(
-            "Successfully uploaded to: "
-            f"https://huggingface.co/datasets/{args.dataset_repo_id}/blob/main/{hub_path}"
-        )
-
-        print("\nTo use in training, add to your config:")
-        print("  use_rabc: true")
-        print(f"  rabc_progress_path: hf://datasets/{args.dataset_repo_id}/{hub_path}")
-        print("  rabc_head_mode: sparse")
-    else:
-        print("\nTo use in training, add to your config:")
-        print("  use_rabc: true")
-        print(f"  rabc_progress_path: {output_path}")
-        print("  rabc_head_mode: sparse")
-
-
-if __name__ == "__main__":
-    main()
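For reference, the frame-steps expansion performed by `_build_subsample_indices` in the removed script can be illustrated without NumPy. This is a minimal sketch, not the removed implementation: `build_subsample_indices` is a hypothetical pure-Python stand-in that mirrors `np.linspace(0, t, K).round()` for the small inputs shown, using the default K = 4.

```python
def build_subsample_indices(num_frames: int, num_subsampled_frames: int) -> list[list[int]]:
    """Hypothetical NumPy-free stand-in for the removed helper.

    For each frame t, pick `num_subsampled_frames` evenly spaced indices
    between 0 and t (inclusive), rounded to the nearest integer.
    """
    if num_subsampled_frames < 1:
        raise ValueError("num_subsampled_frames must be >= 1")
    last = num_subsampled_frames - 1
    rows: list[list[int]] = []
    for t in range(num_frames):
        if last == 0:
            # A single sample degenerates to the first frame only.
            rows.append([0])
        else:
            rows.append([round(t * k / last) for k in range(num_subsampled_frames)])
    return rows


rows = build_subsample_indices(6, 4)
for t, row in enumerate(rows):
    print(t, row)
```

Every row has exactly K entries, so the rows can be stacked into one batch. For early frames (t < K - 1) some indices simply repeat.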
--- a/src/lerobot/rewards/robometer/configuration_robometer.py
+++ b/src/lerobot/rewards/robometer/configuration_robometer.py
@@ -1,158 +0,0 @@
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from __future__ import annotations
-
-from copy import deepcopy
-from dataclasses import dataclass, field
-from typing import TYPE_CHECKING, Any
-
-from lerobot.configs import FeatureType, NormalizationMode, PolicyFeature
-from lerobot.configs.rewards import RewardModelConfig
-from lerobot.utils.constants import OBS_IMAGES
-from lerobot.utils.import_utils import _transformers_available, require_package
-
-if TYPE_CHECKING or _transformers_available:
-    from transformers import AutoConfig, AutoTokenizer
-else:
-    AutoConfig = None  # type: ignore[assignment]
-    AutoTokenizer = None  # type: ignore[assignment]
-
-
-# Special tokens Robometer adds to the Qwen-VL tokenizer at construction time.
-# The order is part of the data contract: upstream resized ``embed_tokens``
-# after adding these tokens in this exact order, so changing the set or order
-# would silently misalign the saved embedding rows with their token ids.
-# ``<|reward_token|>`` and ``<|sim_token|>`` are leftover from earlier upstream
-# heads (never read at inference) but still occupy rows the checkpoint expects.
-ROBOMETER_SPECIAL_TOKENS = (
-    "<|split_token|>",
-    "<|reward_token|>",
-    "<|pref_token|>",
-    "<|sim_token|>",
-    "<|prog_token|>",
-)
-
-
-@RewardModelConfig.register_subclass("robometer")
-@dataclass
-class RobometerConfig(RewardModelConfig):
-    """Configuration for the Robometer reward model."""
-
-    pretrained_path: str | None = "lerobot/Robometer-4B"
-    image_key: str = OBS_IMAGES + ".top"
-    task_key: str = "task"
-    default_task: str | None = None
-
-    max_frames: int | None = 8
-    reward_output: str = "progress"  # "progress" or "success"
-    success_threshold: float = 0.5
-
-    license: str | None = "apache-2.0"
-    tags: list[str] | None = field(
-        default_factory=lambda: ["reward-model", "vision-language", "qwen3-vl", "zero-shot"]
-    )
-
-    base_model_id: str = "Qwen/Qwen3-VL-4B-Instruct"
-    torch_dtype: str = "bfloat16"
-    use_multi_image: bool = True
-    use_per_frame_progress_token: bool = True
-    average_temporal_patches: bool = True
-    frame_pooling: str = "mean"  # "mean" | "boundary" | "attention"
-    frame_pooling_attn_temperature: float = 1.0
-    progress_loss_type: str = "discrete"  # "l1" | "l2" | "discrete"
-    progress_discrete_bins: int = 10
-
-    # Serialised Qwen backbone config (post-resize). Always populated by
-    # ``__post_init__`` from ``base_model_id`` + ``len(tokenizer) + 5``, so it
-    # is non-empty after construction. Saved into ``config.json`` automatically
-    # by the base ``_save_pretrained``.
-    vlm_config: dict[str, Any] = field(default_factory=dict)
-
-    input_features: dict[str, PolicyFeature] = field(default_factory=dict)
-    output_features: dict[str, PolicyFeature] = field(default_factory=dict)
-    normalization_mapping: dict[str, NormalizationMode] = field(
-        default_factory=lambda: {
-            "VISUAL": NormalizationMode.IDENTITY,
-            "REWARD": NormalizationMode.IDENTITY,
-        }
-    )
-
-    def __post_init__(self) -> None:
-        super().__post_init__()
-        if self.reward_output not in {"progress", "success"}:
-            raise ValueError(f"reward_output must be 'progress' or 'success', got {self.reward_output!r}")
-        if self.max_frames is not None and self.max_frames < 1:
-            raise ValueError(f"max_frames must be >= 1, got {self.max_frames}")
-        if self.frame_pooling not in {"mean", "boundary", "attention"}:
-            raise ValueError(f"frame_pooling must be mean/boundary/attention; got {self.frame_pooling!r}")
-        if self.frame_pooling_attn_temperature <= 0:
-            raise ValueError("frame_pooling_attn_temperature must be > 0")
-        if self.progress_loss_type not in {"l1", "l2", "discrete"}:
-            raise ValueError(f"progress_loss_type must be l1/l2/discrete; got {self.progress_loss_type!r}")
-        if self.use_per_frame_progress_token and not self.use_multi_image:
-            raise ValueError("use_per_frame_progress_token=True requires use_multi_image=True")
-
-        if self.image_key not in self.input_features:
-            self.input_features[self.image_key] = PolicyFeature(shape=(3, 224, 224), type=FeatureType.VISUAL)
-        self.output_features.setdefault("progress", PolicyFeature(shape=(1,), type=FeatureType.REWARD))
-        self.output_features.setdefault("success", PolicyFeature(shape=(1,), type=FeatureType.REWARD))
-
-        # Deterministically populate ``vlm_config`` so it is non-empty after
-        # construction. For ``Qwen/Qwen3-VL-4B-Instruct`` this gives
-        # ``len(tokenizer) + 5 = 151,669 + 5 = 151,674`` — the exact post-resize
-        # vocab the published ``Robometer-4B`` checkpoint was saved with.
-        if not self.vlm_config:
-            require_package("transformers", extra="robometer")
-            vlm = AutoConfig.from_pretrained(self.base_model_id).to_dict()
-            tokenizer = AutoTokenizer.from_pretrained(self.base_model_id)
-            text_config = vlm.get("text_config")
-            if not isinstance(text_config, dict):
-                raise ValueError(
-                    f"Backbone config for {self.base_model_id!r} has no nested `text_config`; "
-                    "Robometer expects a Qwen-VL-style config."
-                )
-            text_config["vocab_size"] = len(tokenizer) + len(ROBOMETER_SPECIAL_TOKENS)
-            self.vlm_config = vlm
-
-    @property
-    def use_discrete_progress(self) -> bool:
-        """Whether the progress head outputs distribution logits over bins."""
-        return self.progress_loss_type.lower() == "discrete"
-
-    @property
-    def vlm_backbone_config(self):
-        """Reconstruct the Qwen backbone config from :attr:`vlm_config`."""
-        require_package("transformers", extra="robometer")
-        config_dict = deepcopy(self.vlm_config)
-        model_type = config_dict.pop("model_type", None)
-        if model_type is None:
-            raise ValueError("vlm_config must include `model_type` to reconstruct the backbone config")
-        return AutoConfig.for_model(model_type, **config_dict)
-
-    @property
-    def observation_delta_indices(self) -> list[int] | None:
-        return None
-
-    @property
-    def action_delta_indices(self) -> None:
-        return None
-
-    @property
-    def reward_delta_indices(self) -> None:
-        return None
-
-    def validate_features(self) -> None:
-        if self.image_key not in self.input_features:
-            raise ValueError(f"Robometer requires image input feature {self.image_key!r}")
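The vocabulary arithmetic described around `ROBOMETER_SPECIAL_TOKENS` and `__post_init__` can be sketched as follows. It assumes the five tokens are appended after the base vocabulary and receive contiguous ids in tuple order, which is what the removed comments describe as the checkpoint contract. `special_token_ids` is a hypothetical helper, and the base size 151,669 is the figure quoted in the removed comment rather than a value checked against the tokenizer.

```python
ROBOMETER_SPECIAL_TOKENS = (
    "<|split_token|>",
    "<|reward_token|>",
    "<|pref_token|>",
    "<|sim_token|>",
    "<|prog_token|>",
)


def special_token_ids(base_vocab_size: int, tokens: tuple[str, ...]) -> dict[str, int]:
    """Hypothetical helper: assign contiguous ids after the base vocabulary.

    Assumes tokens are appended in order, so the i-th token lands at
    `base_vocab_size + i`.
    """
    return {token: base_vocab_size + i for i, token in enumerate(tokens)}


BASE_VOCAB_SIZE = 151_669  # len(tokenizer) quoted in the removed comment
ids = special_token_ids(BASE_VOCAB_SIZE, ROBOMETER_SPECIAL_TOKENS)
resized_vocab = BASE_VOCAB_SIZE + len(ROBOMETER_SPECIAL_TOKENS)

print(ids["<|prog_token|>"], resized_vocab)
```

Under that assumption, reordering or dropping a token shifts every later id by one, which is why the comment treats the tuple order as part of the data contract.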
--- a/src/lerobot/rewards/robometer/modeling_robometer.py
+++ b/src/lerobot/rewards/robometer/modeling_robometer.py
@@ -1,481 +0,0 @@
-# Copyright 2026 Anthony Liang, Yigit Korkmaz, Stephen Tu, Erdem Bıyık, Jesse Zhang
-# and The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-"""ROBOMETER: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons.
-
-Paper:         https://arxiv.org/abs/2603.02115
-Project:       https://robometer.github.io
-Original code: https://github.com/aliang8/robometer
-Model:         https://huggingface.co/robometer/Robometer-4B
-
-Robometer is a general-purpose, video-language-input reward model built on
-``Qwen/Qwen3-VL-4B-Instruct``. It is trained with a dual reward-prediction
-objective:
-
-- A frame-level progress loss anchoring reward magnitude on expert data.
-- A trajectory-comparison preference loss imposing global ordering constraints
-  across trajectories sharing the same instruction.
-
-To support downstream RL it also predicts a frame-level binary success. The
-training prompt inserts three learnable tokens:
-
-- ``<|prog_token|>`` after each frame to read per-frame progress and success.
-- ``<|pref_token|>`` at the end to read pairwise preference (training-only).
-- ``<|split_token|>`` between two trajectories in preference samples
-  (training-only).
-
-Progress is modeled as a categorical distribution over ``progress_discrete_bins``
-uniformly-spaced centers in ``[0, 1]`` (C51-style), and the continuous estimate
-is recovered as the softmax-weighted mean of those centers — see
-:func:`convert_bins_to_continuous`.
-
-This LeRobot port is **inference-only**: the preference head is preserved in
-the state dict for byte-equivalence with the published ``Robometer-4B``
-checkpoint but is not queried by :meth:`RobometerRewardModel.compute_reward`,
-which returns the last-frame progress (clamped to ``[0, 1]``) or the thresholded
-last-frame success probability, depending on :attr:`RobometerConfig.reward_output`.
-"""
-
-from __future__ import annotations
-
-import logging
-from typing import TYPE_CHECKING, Any
-
-import torch
-from torch import Tensor, nn
-
-from lerobot.rewards.pretrained import PreTrainedRewardModel
-from lerobot.rewards.robometer.configuration_robometer import RobometerConfig
-from lerobot.utils.constants import OBS_PREFIX
-from lerobot.utils.import_utils import _transformers_available, require_package
-
-if TYPE_CHECKING or _transformers_available:
-    from transformers import AutoModelForImageTextToText
-else:
-    AutoModelForImageTextToText = None  # type: ignore[assignment]
-
-logger = logging.getLogger(__name__)
-
-# Namespace for Robometer's pre-encoded Qwen-VL observation tensors.
-ROBOMETER_FEATURE_PREFIX = f"{OBS_PREFIX}robometer."
-ROBOMETER_QWEN_INPUT_KEYS = (
-    "input_ids",
-    "attention_mask",
-    "pixel_values",
-    "pixel_values_videos",
-    "image_grid_thw",
-    "video_grid_thw",
-    "second_per_grid_ts",
-    "mm_token_type_ids",
-)
-ROBOMETER_METADATA_KEYS = (
-    "prog_token_id",
-    "vision_start_token_id",
-    "vision_end_token_id",
-    "video_merge_size",
-)
-ROBOMETER_INPUT_KEYS = ROBOMETER_QWEN_INPUT_KEYS + ROBOMETER_METADATA_KEYS
-
-
-def convert_bins_to_continuous(bin_logits: Tensor) -> Tensor:
-    """Collapse per-bin logits into a single value in ``[0, 1]``.
-
-    The discrete progress head outputs ``num_bins`` logits per frame. Bins are
-    evenly spaced centers in ``[0, 1]``; the continuous prediction is the
-    softmax-weighted mean of those centers.
-    """
-    bin_probs = torch.softmax(bin_logits, dim=-1)
-    num_bins = bin_logits.shape[-1]
-    bin_centers = torch.linspace(0.0, 1.0, num_bins, device=bin_logits.device, dtype=bin_logits.dtype)
-    return (bin_probs * bin_centers).sum(dim=-1)
-
-
-def _squeeze_last_safe(x: Tensor) -> Tensor:
-    """Drop a trailing singleton dim only when present."""
-    return x.squeeze(-1) if x.ndim > 1 and x.shape[-1] == 1 else x
-
-
-def _torch_dtype(name: str) -> torch.dtype:
-    dtype = getattr(torch, name, None)
-    if isinstance(dtype, torch.dtype):
-        return dtype
-    raise ValueError(f"Unknown torch dtype: {name!r}")
-
-
-class RobometerPredictionHead(nn.Sequential):
-    """Small MLP head used for Robometer's progress / success / preference outputs."""
-
-    def __init__(self, hidden_dim: int, output_size: int, *, dropout: float, with_sigmoid: bool) -> None:
-        layers: list[nn.Module] = [
-            nn.Linear(hidden_dim, hidden_dim // 2),
-            nn.LayerNorm(hidden_dim // 2),
-            nn.GELU(),
-            nn.Dropout(dropout),
-            nn.Linear(hidden_dim // 2, output_size),
-        ]
-        if with_sigmoid:
-            layers.append(nn.Sigmoid())
-        super().__init__(*layers)
-
-
-def decode_progress_outputs(
-    progress_logits: Tensor | None,
-    success_logits: Tensor | None,
-    *,
-    is_discrete_mode: bool,
-) -> dict[str, list[list[float]]]:
-    """Decode RBM head outputs into per-frame floats.
-
-    Args:
-        progress_logits: ``(B, T)`` (continuous) or ``(B, T, num_bins)`` (discrete).
-        success_logits: ``(B, T)`` raw logits, ``sigmoid``-ed to probabilities.
-        is_discrete_mode: if True the progress logits get a softmax over bins
-            and are projected onto bin centers via :func:`convert_bins_to_continuous`.
-
-    Returns:
-        Dict with ``progress_pred`` and ``success_probs``, each a list of
-        length ``B`` of per-frame float lists.
-    """
-    progress_pred: list[list[float]] = []
-    success_probs: list[list[float]] = []
-
-    if progress_logits is not None:
-        for sample_logits in progress_logits:
-            if is_discrete_mode:
-                continuous = convert_bins_to_continuous(sample_logits.detach().float().cpu())
-                progress_pred.append(continuous.flatten().tolist())
-            else:
-                progress_pred.append(sample_logits.detach().float().cpu().flatten().tolist())
-
-    if success_logits is not None:
-        for sample_logits in success_logits:
-            success_probs.append(torch.sigmoid(sample_logits.detach().float().cpu()).flatten().tolist())
-
-    return {"progress_pred": progress_pred, "success_probs": success_probs}
-
-
-class RobometerRewardModel(PreTrainedRewardModel):
-    """Robometer (RBM) reward model — inference-only LeRobot port.
-
-    Wraps a Qwen-VL backbone (default: ``Qwen/Qwen3-VL-4B-Instruct``) with three
-    prediction heads from the paper (progress, success, preference). At
-    inference time only the progress and success heads are queried; the
-    preference head is kept on the module so the published ``Robometer-4B``
-    safetensors load unchanged.
-    """
-
-    name = "robometer"
-    config_class = RobometerConfig
-
-    def __init__(self, config: RobometerConfig, *, dropout: float = 0.1) -> None:
-        require_package("transformers", extra="robometer")
-        super().__init__(config)
-        self.config = config
-
-        # Two backbone-build paths (EO-1 style, branched on ``pretrained_path``):
-        #
-        #   - Fresh training (``pretrained_path is None``): download the base
-        #     Qwen weights and resize the embed table to match
-        #     ``vlm_config.text_config.vocab_size`` — populated deterministically
-        #     in ``RobometerConfig.__post_init__`` as
-        #     ``len(tokenizer) + len(ROBOMETER_SPECIAL_TOKENS)``
-        #
-        #   - Loading a saved checkpoint (``pretrained_path`` is set): rebuild
-        #     the empty architecture from ``vlm_config`` via
-        #     ``AutoModelForImageTextToText.from_config`` so the subsequent
-        #     ``model.safetensors`` load is a direct fill of the right shape —
-        #     no redundant Qwen weight download.
-        torch_dtype = _torch_dtype(config.torch_dtype)
-        if config.pretrained_path is None:
-            self.model = AutoModelForImageTextToText.from_pretrained(
-                config.base_model_id,
-                dtype=torch_dtype,
-                trust_remote_code=True,
-            )
-            target_vocab = config.vlm_config["text_config"]["vocab_size"]
-            self.model.resize_token_embeddings(target_vocab)
-        else:
-            self.model = AutoModelForImageTextToText.from_config(
-                config.vlm_backbone_config,
-                dtype=torch_dtype,
-                trust_remote_code=True,
-            )
-
-        # All Qwen-VL backbones Robometer supports expose `text_config.hidden_size`.
-        # Falls back to the top-level `hidden_size` so future non-multimodal
-        # variants would still resolve.
-        backbone_config = self.model.config
-        text_config = getattr(backbone_config, "text_config", None)
-        hidden_size = getattr(text_config, "hidden_size", None) if text_config is not None else None
-        if hidden_size is None:
-            hidden_size = getattr(backbone_config, "hidden_size", None)
-        if hidden_size is None:
-            raise AttributeError(
-                f"Could not infer hidden_size from backbone config of {config.base_model_id}"
-            )
-        hidden_dim = int(hidden_size)
-
-        # Robometer's three prediction heads + frame-pool attention.
-        progress_output = config.progress_discrete_bins if config.use_discrete_progress else 1
-        self.progress_head = RobometerPredictionHead(
-            hidden_dim,
-            progress_output,
-            dropout=dropout,
-            with_sigmoid=not config.use_discrete_progress,
-        )
-        self.preference_head = RobometerPredictionHead(hidden_dim, 1, dropout=dropout, with_sigmoid=False)
-        self.success_head = RobometerPredictionHead(hidden_dim, 1, dropout=dropout, with_sigmoid=False)
-        self.frame_pool_attn = nn.Linear(hidden_dim, 1, bias=False)
-
-        # Match the dtype of the loaded base model so weight loading is a no-op cast.
-        model_dtype = next(self.model.parameters()).dtype
-        self.progress_head.to(dtype=model_dtype)
-        self.preference_head.to(dtype=model_dtype)
-        self.success_head.to(dtype=model_dtype)
-        self.frame_pool_attn.to(dtype=model_dtype)
-
-    def compute_reward(self, batch: dict[str, Tensor]) -> Tensor:
-        inputs = {
-            key: batch[f"{ROBOMETER_FEATURE_PREFIX}{key}"]
-            for key in ROBOMETER_INPUT_KEYS
-            if f"{ROBOMETER_FEATURE_PREFIX}{key}" in batch
-        }
-        if "input_ids" not in inputs:
-            raise KeyError(
-                f"Robometer batch missing pre-encoded inputs (expected "
-                f"`{ROBOMETER_FEATURE_PREFIX}input_ids`). Make sure the "
-                "RobometerEncoderProcessorStep ran before `compute_reward`."
-            )
-
-        device = next(self.model.parameters()).device
-        inputs = {key: value.to(device) if hasattr(value, "to") else value for key, value in inputs.items()}
-
-        self.eval()
-        with torch.no_grad():
-            progress_logits, success_logits = self._compute_rbm_logits(inputs)
-
-        decoded = decode_progress_outputs(
-            progress_logits,
-            success_logits,
-            is_discrete_mode=self.config.use_discrete_progress,
-        )
-        values = (
-            decoded["success_probs"] if self.config.reward_output == "success" else decoded["progress_pred"]
-        )
-
-        rewards = torch.stack([torch.as_tensor(seq, dtype=torch.float32)[-1] for seq in values])
-        if self.config.reward_output == "success":
-            rewards = (rewards > self.config.success_threshold).float()
-        else:
-            # Match upstream Robometer's ``extract_rewards_from_output``: per-frame
-            # progress predictions are clamped to ``[0, 1]`` before being returned.
-            rewards = rewards.clamp(0.0, 1.0)
-        return rewards.to(self.config.device or "cpu")
-
-    def _compute_rbm_logits(
-        self,
-        inputs: dict[str, Any],
-    ) -> tuple[Tensor, Tensor]:
-        """Run the Qwen3-VL backbone and apply Robometer's heads.
-
-        ``inputs`` is the encoded batch produced by
-        :class:`RobometerEncoderProcessorStep`. It carries Qwen tensors as well
-        as Robometer-specific metadata (``prog_token_id``,
-        ``vision_start_token_id``, ``vision_end_token_id``, ``video_merge_size``)
-        — the metadata is popped here so the rest can be forwarded straight to
-        the Qwen model.
-
-        Returns ``(progress_logits, success_logits)``. Shapes:
-
-        - ``progress_logits``: ``(B, T)`` (continuous) or ``(B, T, num_bins)`` (discrete).
-        - ``success_logits``: ``(B, T)`` raw logits (sigmoid happens at decode time).
-        """
-        prog_token_id = inputs.pop("prog_token_id", None)
-        vision_start_token_id = inputs.pop("vision_start_token_id", None)
-        vision_end_token_id = inputs.pop("vision_end_token_id", None)
-        video_merge_size = inputs.pop("video_merge_size", 14)
-
-        # Qwen3-VL doesn't reliably populate `last_hidden_state`; ask for the
-        # full hidden-state tuple and take the last layer. This matches the
-        # `is_qwen3` path in upstream Robometer's `RBM.forward_qwen` (main).
-        outputs = self.model(**inputs, output_hidden_states=True, return_dict=True)
-        hidden_state = (
-            outputs.hidden_states[-1]
-            if getattr(outputs, "hidden_states", None)
-            else outputs.last_hidden_state
-        )
-
-        input_ids = inputs["input_ids"]
-        if self.config.use_per_frame_progress_token:
-            if prog_token_id is None:
-                raise KeyError("`prog_token_id` missing in batch (run RobometerEncoderProcessorStep first)")
-            return self._process_token_extraction(hidden_state, input_ids, prog_token_id=prog_token_id)
-        if self.config.use_multi_image:
-            if vision_start_token_id is None or vision_end_token_id is None:
-                raise KeyError(
-                    "`vision_start_token_id` / `vision_end_token_id` missing in batch "
-                    "(run RobometerEncoderProcessorStep first)"
-                )
-            return self._process_multi_image_frames(
-                hidden_state,
-                input_ids,
-                start_id=vision_start_token_id,
-                end_id=vision_end_token_id,
-            )
-        video_grid_thw = inputs.get("video_grid_thw")
-        if video_grid_thw is None:
-            raise ValueError("video_grid_thw is required for video-mode Robometer inference")
-        if vision_start_token_id is None:
-            raise KeyError("`vision_start_token_id` missing in batch")
-        return self._process_video_frames(
-            hidden_state,
-            input_ids,
-            video_grid_thw,
-            start_id=vision_start_token_id,
-            merge_size=video_merge_size,
-        )
-
-    def _apply_heads_to_hidden_states(self, frame_embeddings: Tensor) -> tuple[Tensor, Tensor]:
-        """Apply progress + success heads to a tensor of frame embeddings."""
-        progress_out = self.progress_head(frame_embeddings)
-        progress = progress_out if self.config.use_discrete_progress else _squeeze_last_safe(progress_out)
-        success = _squeeze_last_safe(self.success_head(frame_embeddings))
-        return progress, success
-
-    def _process_token_extraction(
-        self,
-        hidden_state: Tensor,
-        input_ids: Tensor,
-        *,
-        prog_token_id: int,
-    ) -> tuple[Tensor, Tensor]:
-        """Per-frame progress/success from ``<|prog_token|>`` positions."""
-        token_mask = input_ids == prog_token_id
-        batch_indices, positions = token_mask.nonzero(as_tuple=True)
-        if positions.numel() == 0:
-            raise ValueError("`<|prog_token|>` not found in any sequence")
-
-        per_sample_hidden = [
-            hidden_state[i, positions[batch_indices == i]] for i in range(input_ids.shape[0])
-        ]
-        progress_list, success_list = [], []
-        for embeddings in per_sample_hidden:
-            if embeddings.shape[0] == 0:
-                raise ValueError("`<|prog_token|>` missing in a sequence")
-            progress, success = self._apply_heads_to_hidden_states(embeddings)
-            progress_list.append(progress)
-            success_list.append(success)
-
-        return torch.stack(progress_list), torch.stack(success_list)
-
-    def _process_multi_image_frames(
-        self,
-        hidden_state: Tensor,
-        input_ids: Tensor,
-        *,
-        start_id: int,
-        end_id: int,
-    ) -> tuple[Tensor, Tensor]:
-        """Per-frame progress/success in multi-image mode (Qwen-VL)."""
-        progress_list, success_list = [], []
-        for batch_idx in range(input_ids.shape[0]):
-            seq_ids = input_ids[batch_idx]
-            seq_hidden = hidden_state[batch_idx]
-            frame_embeddings = self._extract_hidden_states_from_token_pairs(
-                seq_hidden, seq_ids, start_id, end_id
-            )
-            progress, success = self._apply_heads_to_hidden_states(frame_embeddings)
-            progress_list.append(progress)
-            success_list.append(success)
-
-        return torch.stack(progress_list), torch.stack(success_list)
-
-    def _extract_hidden_states_from_token_pairs(
-        self,
-        hidden_state: Tensor,
-        input_ids: Tensor,
-        start_id: int,
-        end_id: int,
-    ) -> Tensor:
-        start_positions = (input_ids == start_id).nonzero(as_tuple=True)[0]
-        end_positions = (input_ids == end_id).nonzero(as_tuple=True)[0]
-        if start_positions.numel() == 0:
-            raise ValueError("`<|vision_start|>` not found in sequence")
-        if start_positions.numel() != end_positions.numel():
-            raise ValueError(
-                f"Mismatched vision token counts: {start_positions.numel()} start vs "
-                f"{end_positions.numel()} end"
-            )
-
-        frames: list[Tensor] = []
-        for start, end in zip(start_positions.tolist(), end_positions.tolist(), strict=True):
-            if start >= end:
-                raise ValueError(f"Invalid vision token pair: start={start} end={end}")
-            patch_tokens = hidden_state[start + 1 : end]
-            if patch_tokens.shape[0] == 0:
-                frames.append((hidden_state[start] + hidden_state[end]) / 2.0)
-                continue
-
-            pooling = self.config.frame_pooling
-            if pooling == "mean":
-                frames.append(patch_tokens.mean(dim=0))
-            elif pooling == "boundary":
-                frames.append(patch_tokens[-1])
-            else:  # attention
-                scores = (
-                    self.frame_pool_attn(patch_tokens).squeeze(-1)
-                    / self.config.frame_pooling_attn_temperature
-                )
-                weights = torch.softmax(scores, dim=0).unsqueeze(-1)
-                frames.append((weights * patch_tokens).sum(dim=0))
-
-        return torch.stack(frames)
-
-    def _process_video_frames(
-        self,
-        hidden_state: Tensor,
-        input_ids: Tensor,
-        video_grid_thw: Tensor,
-        *,
-        start_id: int,
-        merge_size: int,
-    ) -> tuple[Tensor, Tensor]:
-        """Per-frame progress/success in video mode (Qwen-VL)."""
-        progress_list, success_list = [], []
-        for batch_idx in range(input_ids.shape[0]):
-            seq_ids = input_ids[batch_idx]
-            seq_hidden = hidden_state[batch_idx]
-            start_positions = (seq_ids == start_id).nonzero(as_tuple=True)[0]
-            if start_positions.numel() == 0:
-                raise ValueError("`<|vision_start|>` not found in sequence")
-            t_dim, h_dim, w_dim = (int(x) for x in video_grid_thw[batch_idx].tolist())
-            tokens_per_frame = (h_dim * w_dim) // (merge_size**2)
-
-            cursor = start_positions[0].item()
-            frame_embeddings: list[Tensor] = []
-            for _ in range(t_dim):
-                if self.config.average_temporal_patches:
-                    patch = seq_hidden[cursor : cursor + tokens_per_frame]
-                    frame_embeddings.append(patch.mean(dim=0))
-                else:
-                    frame_embeddings.append(seq_hidden[cursor + tokens_per_frame])
-                cursor += tokens_per_frame
-
-            stacked = torch.stack(frame_embeddings)
-            progress, success = self._apply_heads_to_hidden_states(stacked)
-            progress_list.append(progress)
-            success_list.append(success)
-
-        return torch.stack(progress_list), torch.stack(success_list)
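Reviewer note on the removed `_process_video_frames` above: its index arithmetic can be sketched without tensors, as a reading aid only. The sketch assumes plain Python integers in place of tensor positions, and `frame_token_spans` is a hypothetical helper name that does not exist in LeRobot or Robometer. As in the removed code, the cursor starts at the position of the first `<|vision_start|>` token itself.

```python
def frame_token_spans(
    start_pos: int,
    grid_thw: tuple[int, int, int],
    merge_size: int,
) -> list[tuple[int, int]]:
    """Return one (begin, end) index pair per temporal step, end exclusive.

    Hypothetical helper mirroring the cursor walk in the removed method.
    """
    t_dim, h_dim, w_dim = grid_thw
    # Each temporal step covers h*w patches, merged merge_size x merge_size at a time.
    tokens_per_frame = (h_dim * w_dim) // (merge_size**2)

    spans: list[tuple[int, int]] = []
    cursor = start_pos
    for _ in range(t_dim):
        spans.append((cursor, cursor + tokens_per_frame))
        cursor += tokens_per_frame
    return spans


# Illustrative values only: 3 temporal steps on an 8x8 patch grid, merge size 2.
spans = frame_token_spans(start_pos=5, grid_thw=(3, 8, 8), merge_size=2)
```

With `average_temporal_patches` enabled, the removed code averages the hidden states in `[begin, end)`; otherwise it takes the single hidden state at index `end`.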
--- a/src/lerobot/rewards/robometer/processor_robometer.py
+++ b/src/lerobot/rewards/robometer/processor_robometer.py
@@ -1,338 +0,0 @@
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-"""Robometer pre/post processing pipelines."""
-
-from __future__ import annotations
-
-from dataclasses import dataclass, field
-from typing import TYPE_CHECKING, Any
-
-import numpy as np
-import torch
-from PIL import Image
-from torch import Tensor
-
-from lerobot.configs import PipelineFeatureType, PolicyFeature
-from lerobot.processor import (
-    AddBatchDimensionProcessorStep,
-    DeviceProcessorStep,
-    PolicyAction,
-    PolicyProcessorPipeline,
-    ProcessorStep,
-    ProcessorStepRegistry,
-    policy_action_to_transition,
-)
-from lerobot.rewards.robometer.configuration_robometer import (
-    ROBOMETER_SPECIAL_TOKENS,
-    RobometerConfig,
-)
-from lerobot.rewards.robometer.modeling_robometer import ROBOMETER_FEATURE_PREFIX
-from lerobot.types import EnvTransition, TransitionKey
-from lerobot.utils.constants import (
-    OBS_IMAGES,
-    POLICY_POSTPROCESSOR_DEFAULT_NAME,
-    POLICY_PREPROCESSOR_DEFAULT_NAME,
-)
-from lerobot.utils.import_utils import _transformers_available, require_package
-
-if TYPE_CHECKING or _transformers_available:
-    from transformers import AutoProcessor
-else:
-    AutoProcessor = None
-
-PROGRESS_PROMPT = (
-    "The task for the robot is '{task}'. Given the trajectory video, predict "
-    "the task progress at each frame, how far along the robot is towards "
-    "completing the task, a float between 0 and 1, where 0 is the starting "
-    "state and 1 is when the task is completed. If the robot is not "
-    "performing the same task, predict 0 progress."
-)
-
-
-def _frames_to_pil(frames: np.ndarray) -> list[Image.Image]:
-    """Convert ``(T, H, W, C)`` uint8 frames to a list of PIL images."""
-    if frames.ndim != 4:
-        raise ValueError(f"Expected (T,H,W,C) frames; got shape {frames.shape}")
-    if frames.dtype != np.uint8:
-        frames = np.clip(frames, 0, 255).astype(np.uint8)
-    return [Image.fromarray(frames[i]) for i in range(frames.shape[0])]
-
-
-def _video_to_numpy(video: Tensor, *, max_frames: int | None) -> np.ndarray:
-    """Convert one trajectory tensor to a ``(T, H, W, C) uint8`` numpy array."""
-    if max_frames is not None:
-        video = video[-max_frames:]
-    if video.shape[1] in (1, 3):
-        video = video.permute(0, 2, 3, 1)
-    elif video.shape[-1] not in (1, 3):
-        raise ValueError(f"Expected channel dim of size 1 or 3, got shape {tuple(video.shape)}")
-
-    array = video.detach().cpu().numpy()
-    if np.issubdtype(array.dtype, np.floating) and array.size > 0 and array.max() <= 1.0:
-        array = array * 255.0
-    return np.clip(array, 0, 255).astype(np.uint8)
-
-
-def _expand_tasks(task: Any, *, batch_size: int, default: str | None) -> list[str]:
-    if task is None:
-        task = default
-    if task is None:
-        raise KeyError("Robometer expected a task description in complementary data")
-    if isinstance(task, str):
-        return [task] * batch_size
-    if isinstance(task, tuple):
-        task = list(task)
-    if not (isinstance(task, list) and all(isinstance(item, str) for item in task)):
-        raise TypeError(f"Robometer task must be a string or list of strings, got {type(task)}")
-    if len(task) == 1 and batch_size > 1:
-        return task * batch_size
-    if len(task) != batch_size:
-        raise ValueError(f"Expected {batch_size} tasks, got {len(task)}")
-    return task
-
-
-@dataclass
-@ProcessorStepRegistry.register(name="robometer_encoder")
-class RobometerEncoderProcessorStep(ProcessorStep):
-    """Encode raw frames + task into Qwen-VL tensors for the Robometer model.
-
-    Loads a :class:`~transformers.AutoProcessor` matching ``base_model_id`` and
-    registers Robometer's special tokens on the tokenizer. The matching
-    embedding resize happens model-side in
-    :meth:`RobometerRewardModel.__init__`.
-
-    At call time the step reads:
-
-    - ``observation[image_key]``: ``(B, T, C, H, W)`` or ``(B, C, H, W)`` frames.
-    - ``complementary_data[task_key]``: a string or list of strings.
-
-    and writes ``observation[f"{ROBOMETER_FEATURE_PREFIX}<name>"]`` for:
-
-    - the Qwen-VL processor outputs: ``input_ids``, ``attention_mask``,
-      ``pixel_values``, ``image_grid_thw``, ``video_grid_thw``, ...
-    - Robometer-specific token ids consumed by the model heads:
-      ``prog_token_id``, ``vision_start_token_id``, ``vision_end_token_id``,
-      ``video_merge_size``.
-    """
-
-    base_model_id: str = "Qwen/Qwen3-VL-4B-Instruct"
-    image_key: str = OBS_IMAGES + ".top"
-    task_key: str = "task"
-    default_task: str | None = None
-    max_frames: int | None = 8
-    use_multi_image: bool = True
-    use_per_frame_progress_token: bool = True
-    max_length: int = 1024
-
-    _processor: Any = field(default=None, init=False, repr=False)
-
-    def __post_init__(self) -> None:
-        require_package("transformers", extra="robometer")
-        require_package("qwen-vl-utils", extra="robometer", import_name="qwen_vl_utils")
-
-        self._processor = AutoProcessor.from_pretrained(
-            self.base_model_id,
-            trust_remote_code=True,
-            do_sample_frames=False,
-            padding_side="right",
-        )
-
-        # Register Robometer's special tokens on the tokenizer. The matching
-        # embedding resize happens model-side in `RobometerRewardModel.__init__`.
-        tokenizer = self._processor.tokenizer
-        # Qwen tokenizers may not define a pad token, but batched prompts/videos
-        # require padding, so reuse EOS as the padding token.
-        if tokenizer.pad_token is None:
-            tokenizer.pad_token = tokenizer.eos_token
-        for token in ROBOMETER_SPECIAL_TOKENS:
-            if token not in tokenizer.get_vocab():
-                tokenizer.add_special_tokens({"additional_special_tokens": [token]})
-
-    def __call__(self, transition: EnvTransition) -> EnvTransition:
-        observation = transition.get(TransitionKey.OBSERVATION)
-        complementary = transition.get(TransitionKey.COMPLEMENTARY_DATA) or {}
-        if not isinstance(observation, dict):
-            raise ValueError("RobometerEncoderProcessorStep requires an observation dict")
-
-        if self.image_key not in observation:
-            raise KeyError(f"Robometer expected image key {self.image_key!r} in observation")
-
-        frames = observation[self.image_key]
-        tensor = frames.detach().cpu() if isinstance(frames, Tensor) else torch.as_tensor(frames)
-        if tensor.ndim == 4:
-            tensor = tensor.unsqueeze(1)
-        elif tensor.ndim != 5:
-            raise ValueError(
-                f"Expected Robometer frames with shape (B,C,H,W) or (B,T,C,H,W); got {tuple(tensor.shape)}"
-            )
-
-        batch_size = tensor.shape[0]
-        tasks = _expand_tasks(
-            complementary.get(self.task_key, self.default_task),
-            batch_size=batch_size,
-            default=self.default_task,
-        )
-
-        samples = [
-            (_video_to_numpy(tensor[i], max_frames=self.max_frames), tasks[i]) for i in range(batch_size)
-        ]
-        encoded = self.encode_samples(samples)
-
-        new_observation = dict(observation)
-        for key, value in encoded.items():
-            new_observation[f"{ROBOMETER_FEATURE_PREFIX}{key}"] = value
-
-        new_transition = transition.copy()
-        new_transition[TransitionKey.OBSERVATION] = new_observation
-        return new_transition
-
-    def encode_samples(self, samples: list[tuple[np.ndarray, str]]) -> dict[str, Tensor]:
-        """Run the Qwen-VL processor on a list of ``(frames, task)`` samples."""
-        from qwen_vl_utils import process_vision_info
-
-        conversations = [self._build_conversation(frames, task) for frames, task in samples]
-
-        texts = [
-            self._processor.apply_chat_template(
-                msg,
-                tokenize=False,
-                add_generation_prompt=False,
-                add_vision_id=True,
-                enable_thinking=False,
-                fps=1,
-            )
-            for msg in conversations
-        ]
-
-        process_kwargs: dict[str, Any] = {
-            "return_video_kwargs": True,
-            "return_video_metadata": True,
-        }
-        image_processor = getattr(self._processor, "image_processor", None)
-        if image_processor is not None and hasattr(image_processor, "patch_size"):
-            process_kwargs["image_patch_size"] = image_processor.patch_size
-
-        image_inputs, video_inputs, video_kwargs = process_vision_info(conversations, **process_kwargs)
-
-        videos: list[Any] | None = None
-        video_metadatas: list[Any] | None = None
-        if video_inputs:
-            if isinstance(video_inputs[0], tuple) and len(video_inputs[0]) == 2:
-                videos_seq, metadatas_seq = zip(*video_inputs, strict=False)
-                videos = list(videos_seq)
-                video_metadatas = list(metadatas_seq)
-            else:
-                videos = list(video_inputs)
-
-        processor_kwargs: dict[str, Any] = {
-            "text": texts,
-            "images": image_inputs,
-            "padding": True,
-            "truncation": False,
-            "max_length": self.max_length,
-            "return_tensors": "pt",
-            "do_resize": False,
-        }
-        if videos is not None:
-            processor_kwargs["videos"] = videos
-        if video_metadatas is not None:
-            processor_kwargs["video_metadata"] = video_metadatas
-        if video_kwargs:
-            processor_kwargs.update(video_kwargs)
-
-        encoded = self._processor(**processor_kwargs)
-
-        # Write Robometer-specific token ids and the video patch merge size into
-        # the encoded batch so `RobometerRewardModel` doesn't need its own
-        # tokenizer at inference (EO1-style separation: the processor owns the
-        # tokenizer, the model owns the backbone and heads).
-        tokenizer = self._processor.tokenizer
-        encoded["prog_token_id"] = tokenizer.convert_tokens_to_ids("<|prog_token|>")
-        encoded["vision_start_token_id"] = tokenizer.convert_tokens_to_ids("<|vision_start|>")
-        encoded["vision_end_token_id"] = tokenizer.convert_tokens_to_ids("<|vision_end|>")
-        video_processor = getattr(self._processor, "video_processor", None)
-        encoded["video_merge_size"] = int(getattr(video_processor, "merge_size", 14))
-        return encoded
-
-    def _build_conversation(self, frames: np.ndarray, task: str) -> list[dict[str, Any]]:
-        pil_frames = _frames_to_pil(frames)
-        prompt = PROGRESS_PROMPT.format(task=task)
-        content: list[dict[str, Any]] = [{"type": "text", "text": prompt}]
-
-        if self.use_multi_image:
-            for image in pil_frames:
-                content.append({"type": "image", "image": image})
-                if self.use_per_frame_progress_token:
-                    content.append({"type": "text", "text": "<|prog_token|>"})
-        else:
-            content.append({"type": "video", "video": pil_frames, "sample_fps": 1.0})
-
-        return [{"role": "user", "content": content}]
-
-    def transform_features(
-        self, features: dict[PipelineFeatureType, dict[str, PolicyFeature]]
-    ) -> dict[PipelineFeatureType, dict[str, PolicyFeature]]:
-        return features
-
-    def get_config(self) -> dict[str, Any]:
-        return {
-            "base_model_id": self.base_model_id,
-            "image_key": self.image_key,
-            "task_key": self.task_key,
-            "default_task": self.default_task,
-            "max_frames": self.max_frames,
-            "use_multi_image": self.use_multi_image,
-            "use_per_frame_progress_token": self.use_per_frame_progress_token,
-            "max_length": self.max_length,
-        }
-
-
-def make_robometer_pre_post_processors(
-    config: RobometerConfig,
-    dataset_stats: dict[str, dict[str, Any]] | None = None,
-) -> tuple[
-    PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
-    PolicyProcessorPipeline[PolicyAction, PolicyAction],
-]:
-    """Pipeline that pre-encodes frames + task into Qwen-VL tensors.
-
-    The preprocessor adds a batch dimension if needed, runs Robometer's
-    encoder, and moves everything to the configured device. The
-    postprocessor is the identity since Robometer outputs a single reward
-    tensor.
-    """
-    del dataset_stats  # Robometer has its own normalisation inside the Qwen-VL processor.
-
-    preprocessor = PolicyProcessorPipeline[dict[str, Any], dict[str, Any]](
-        steps=[
-            AddBatchDimensionProcessorStep(),
-            RobometerEncoderProcessorStep(
-                base_model_id=config.base_model_id,
-                image_key=config.image_key,
-                task_key=config.task_key,
-                default_task=config.default_task,
-                max_frames=config.max_frames,
-                use_multi_image=config.use_multi_image,
-                use_per_frame_progress_token=config.use_per_frame_progress_token,
-            ),
-            DeviceProcessorStep(device=config.device or "cpu"),
-        ],
-        name=POLICY_PREPROCESSOR_DEFAULT_NAME,
-    )
-    postprocessor = PolicyProcessorPipeline(
-        name=POLICY_POSTPROCESSOR_DEFAULT_NAME,
-        to_transition=policy_action_to_transition,
-    )
-    return preprocessor, postprocessor
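Reviewer note on the removed `_build_conversation` above: the message layout it produced in multi-image mode can be sketched as follows. This is an illustration under assumptions, not the removed implementation: string placeholders stand in for PIL images, and `build_content` is a hypothetical name.

```python
from typing import Any

PROG_TOKEN = "<|prog_token|>"


def build_content(
    prompt: str,
    frames: list[Any],
    *,
    multi_image: bool = True,
    per_frame_token: bool = True,
) -> list[dict[str, Any]]:
    """Build a single-turn user message that follows the removed layout."""
    content: list[dict[str, Any]] = [{"type": "text", "text": prompt}]
    if multi_image:
        for frame in frames:
            content.append({"type": "image", "image": frame})
            if per_frame_token:
                # One progress token after each frame; the model reads its
                # hidden state to predict per-frame progress and success.
                content.append({"type": "text", "text": PROG_TOKEN})
    else:
        # Video mode: all frames go into a single video entry.
        content.append({"type": "video", "video": list(frames), "sample_fps": 1.0})
    return [{"role": "user", "content": content}]


# Placeholder frames; the removed code passed PIL images here.
conversation = build_content("pick up the cube", ["frame0", "frame1"])
types = [item["type"] for item in conversation[0]["content"]]
```

Interleaving one `<|prog_token|>` per image is what lets `_process_token_extraction` recover exactly one embedding per frame from the token positions.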
--- a/src/lerobot/rewards/topreward/__init__.py
+++ b/src/lerobot/rewards/topreward/__init__.py
@@ -1,19 +0,0 @@
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from .configuration_topreward import TOPRewardConfig
-from .modeling_topreward import TOPRewardModel
-from .processor_topreward import make_topreward_pre_post_processors
-
-__all__ = ["TOPRewardConfig", "TOPRewardModel", "make_topreward_pre_post_processors"]
--- a/src/lerobot/rewards/topreward/compute_rabc_weights.py
+++ b/src/lerobot/rewards/topreward/compute_rabc_weights.py
@@ -1,353 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-"""Compute per-frame TOPReward progress curves for a LeRobot dataset.
-
-For each episode, scores trajectory prefixes of increasing length using
-the TOPReward reward model, min-max normalises the raw log-prob rewards per episode,
-and writes a parquet file with one row per frame.
-
-The parquet uses the same schema as SARM's :mod:`lerobot.rewards.sarm.compute_rabc_weights`.
-
-Usage:
-    # Sparse-dense mode (15 anchors per episode, matches upstream)
-    python -m lerobot.rewards.topreward.compute_rabc_weights \\
-        --dataset-repo-id lerobot/libero_10_image \\
-        --num-samples 15
-
-    # Use a different VLM backbone
-    python -m lerobot.rewards.topreward.compute_rabc_weights \\
-        --dataset-repo-id lerobot/libero_10_image \\
-        --vlm-name Qwen/Qwen3-VL-4B-Instruct
-"""
-
-from __future__ import annotations
-
-import argparse
-import logging
-from pathlib import Path
-from typing import Any
-
-import numpy as np
-import pyarrow as pa
-import pyarrow.parquet as pq
-import torch
-from tqdm import tqdm
-
-from lerobot.datasets import LeRobotDataset
-from lerobot.rewards.topreward.configuration_topreward import TOPRewardConfig
-from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel
-from lerobot.rewards.topreward.processor_topreward import TOPRewardEncoderProcessorStep
-from lerobot.types import TransitionKey
-
-DEFAULT_OUTPUT_FILENAME = "topreward_progress.parquet"
-
-
-def get_reward_model_path_from_parquet(parquet_path: Path) -> str | None:
-    """Read ``reward_model_path`` from parquet metadata if available."""
-    if not parquet_path.exists():
-        return None
-    try:
-        metadata = pq.read_metadata(parquet_path).schema.to_arrow_schema().metadata
-        if metadata and b"reward_model_path" in metadata:
-            return metadata[b"reward_model_path"].decode()
-    except Exception:  # nosec B110
-        return None
-    return None
-
-
-def _resolve_task(sample: dict[str, Any], default: str) -> str:
-    """Best-effort task extraction from a dataset sample."""
-    task = sample.get("task")
-    if isinstance(task, str) and task:
-        return task
-    return default
-
-
-def normalize_rewards(rewards: list[float] | np.ndarray) -> np.ndarray:
-    """Min-max normalise raw log-prob rewards into ``[0, 1]``."""
-    rewards_arr = np.asarray(rewards, dtype=np.float64)
-    if rewards_arr.size == 0:
-        return rewards_arr.astype(np.float32)
-    if rewards_arr.size == 1:
-        return np.array([1.0], dtype=np.float32)
-    r_min, r_max = rewards_arr.min(), rewards_arr.max()
-    if r_max == r_min:
-        return np.ones_like(rewards_arr, dtype=np.float32)
-    return ((rewards_arr - r_min) / (r_max - r_min)).astype(np.float32)
-
-
-def compute_instruction_rewards_for_prefixes(
-    model: TOPRewardModel,
-    encoder: TOPRewardEncoderProcessorStep,
-    dataset: LeRobotDataset,
-    ep_start: int,
-    num_frames: int,
-    task: str,
-    image_key: str,
-    num_samples: int | None,
-    device: str,
-) -> np.ndarray:
-    """Score an episode via prefix sweep and return a per-frame normalised curve."""
-    if num_samples is None or num_samples >= num_frames:
-        prefix_lengths = np.arange(1, num_frames + 1, dtype=np.int64)
-    else:
-        prefix_lengths = np.unique(np.linspace(1, num_frames, num_samples).round().astype(np.int64))
-
-    episode_frames = torch.stack([dataset[ep_start + i][image_key] for i in range(num_frames)])
-    rewards: list[float] = []
-    for length in prefix_lengths:
-        frames = episode_frames[: int(length)].unsqueeze(0)  # (1, T, C, H, W)
-
-        transition = {
-            TransitionKey.OBSERVATION: {image_key: frames},
-            TransitionKey.COMPLEMENTARY_DATA: {"task": task},
-        }
-        encoded = encoder(transition)
-        obs = encoded[TransitionKey.OBSERVATION]
-        batch = {
-            key: value.to(device) if isinstance(value, torch.Tensor) else value for key, value in obs.items()
-        }
-
-        with torch.no_grad():
-            reward = model.compute_reward(batch)
-        rewards.append(float(reward.item()))
-
-    normalized_rewards = normalize_rewards(rewards)
-
-    if prefix_lengths.shape[0] == num_frames:
-        return normalized_rewards
-
-    return np.interp(
-        np.arange(1, num_frames + 1, dtype=np.float64),
-        prefix_lengths.astype(np.float64),
-        normalized_rewards.astype(np.float64),
-    ).astype(np.float32)
-
-
-def compute_topreward_progress(
-    dataset_repo_id: str,
-    reward_model_path: str | None = None,
-    vlm_name: str | None = None,
-    output_path: str | None = None,
-    device: str = "cuda",
-    num_samples: int | None = None,
-    fps: float | None = None,
-    episodes: list[int] | None = None,
-) -> Path:
-    """Run TOPReward over a dataset and write per-frame progress."""
-    if reward_model_path is not None:
-        logging.info(f"Loading TOPReward config from: {reward_model_path}")
-        model = TOPRewardModel.from_pretrained(reward_model_path)
-        config = model.config
-        config.device = device
-        if vlm_name is not None and vlm_name != config.vlm_name:
-            logging.info(f"Overriding vlm_name from config: {config.vlm_name} -> {vlm_name}")
-            config.vlm_name = vlm_name
-            model = TOPRewardModel(config)
-    else:
-        config_kwargs: dict[str, Any] = {"device": device}
-        if vlm_name is not None:
-            config_kwargs["vlm_name"] = vlm_name
-        if fps is not None:
-            config_kwargs["fps"] = fps
-        config = TOPRewardConfig(**config_kwargs)
-        logging.info(f"Constructing TOPReward with VLM: {config.vlm_name}")
-        model = TOPRewardModel(config)
-
-    model.to(device).eval()
-
-    encoder = TOPRewardEncoderProcessorStep(
-        vlm_name=config.vlm_name,
-        image_key=config.image_key,
-        task_key=config.task_key,
-        default_task=config.default_task,
-        max_frames=None,  # no tail-crop: we control prefix length explicitly
-        fps=config.fps,
-        prompt_prefix=config.prompt_prefix,
-        prompt_suffix_template=config.prompt_suffix_template,
-        add_chat_template=config.add_chat_template,
-        max_length=config.max_input_length,
-    )
-
-    image_key = config.image_key
-
-    logging.info(f"Loading dataset: {dataset_repo_id}")
-    dataset = LeRobotDataset(dataset_repo_id, download_videos=True)
-    logging.info(f"Dataset: {dataset.num_episodes} episodes, {dataset.num_frames} frames")
-
-    episode_indices = list(range(dataset.num_episodes)) if episodes is None else episodes
-    logging.info(f"Processing {len(episode_indices)} episode(s)")
-
-    all_index: list[int] = []
-    all_episode: list[int] = []
-    all_frame: list[int] = []
-    all_progress: list[float] = []
-
-    for episode_idx in tqdm(episode_indices, desc="Episodes"):
-        ep = dataset.meta.episodes[episode_idx]
-        ep_start = int(ep["dataset_from_index"])
-        ep_end = int(ep["dataset_to_index"])
-        num_frames = ep_end - ep_start
-        if num_frames <= 0:
-            continue
-
-        first_sample = dataset[ep_start]
-        task = _resolve_task(first_sample, default=config.default_task or "perform the task")
-
-        per_frame = compute_instruction_rewards_for_prefixes(
-            model=model,
-            encoder=encoder,
-            dataset=dataset,
-            ep_start=ep_start,
-            num_frames=num_frames,
-            task=task,
-            image_key=image_key,
-            num_samples=num_samples,
-            device=device,
-        )
-
-        for local in range(num_frames):
-            all_index.append(ep_start + local)
-            all_episode.append(episode_idx)
-            all_frame.append(local)
-            all_progress.append(float(per_frame[local]))
-
-        if device.startswith("cuda"):
-            torch.cuda.empty_cache()
-
-    table = pa.table(
-        {
-            "index": np.asarray(all_index, dtype=np.int64),
-            "episode_index": np.asarray(all_episode, dtype=np.int64),
-            "frame_index": np.asarray(all_frame, dtype=np.int64),
-            "progress_sparse": np.asarray(all_progress, dtype=np.float32),
-        }
-    )
-
-    schema_metadata: dict[bytes, bytes] = {b"vlm_name": config.vlm_name.encode()}
-    if reward_model_path is not None:
-        schema_metadata[b"reward_model_path"] = reward_model_path.encode()
-    table = table.replace_schema_metadata(schema_metadata)
-
-    out = Path(dataset.root) / DEFAULT_OUTPUT_FILENAME if output_path is None else Path(output_path)
-    out.parent.mkdir(parents=True, exist_ok=True)
-    pq.write_table(table, out)
-    logging.info(f"Saved {len(table)} frame values to {out}")
-
-    progress_arr = np.asarray(all_progress, dtype=np.float32)
-    if progress_arr.size:
-        logging.info(
-            f"Progress: mean={float(progress_arr.mean()):.4f}, "
-            f"std={float(progress_arr.std()):.4f}, "
-            f"min={float(progress_arr.min()):.4f}, "
-            f"max={float(progress_arr.max()):.4f}"
-        )
-    return out
-
-
-def main():
-    parser = argparse.ArgumentParser(
-        description="Compute per-frame TOPReward progress curves for RA-BC weighting.",
-        formatter_class=argparse.RawDescriptionHelpFormatter,
-        epilog="""
-Examples:
-    # Sparse-dense mode (matches upstream TOPReward num_samples=15)
-    python -m lerobot.rewards.topreward.compute_rabc_weights \\
-        --dataset-repo-id lerobot/libero_10_image \\
-        --num-samples 15
-
-    # Use a smaller VLM
-    python -m lerobot.rewards.topreward.compute_rabc_weights \\
-        --dataset-repo-id lerobot/libero_10_image \\
-        --vlm-name Qwen/Qwen3-VL-4B-Instruct
-        """,
-    )
-    parser.add_argument(
-        "--dataset-repo-id", type=str, required=True, help="HuggingFace dataset repo id or local path."
-    )
-    parser.add_argument(
-        "--reward-model-path", type=str, default=None, help="Optional path or Hub id of a pretrained TOPReward model."
-    )
-    parser.add_argument("--vlm-name", type=str, default=None, help="Override the VLM backbone (HF Hub id).")
-    parser.add_argument("--output-path", type=str, default=None, help="Output parquet path.")
-    parser.add_argument("--device", type=str, default="cuda", help="Device to use (default: cuda).")
-    parser.add_argument(
-        "--num-samples",
-        type=int,
-        default=None,
-        help="Anchor prefix samples per episode. None = dense. 15 matches upstream.",
-    )
-    parser.add_argument(
-        "--episodes",
-        type=int,
-        nargs="+",
-        default=None,
-        help="Process only these episode indices (e.g. --episodes 0 or --episodes 0 5 10).",
-    )
-    parser.add_argument("--fps", type=float, default=None, help="Override TOPRewardConfig.fps.")
-    parser.add_argument(
-        "--push-to-hub", action="store_true", help="Upload to the dataset repo on HuggingFace Hub."
-    )
-
-    args = parser.parse_args()
-
-    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
-
-    output_path = compute_topreward_progress(
-        dataset_repo_id=args.dataset_repo_id,
-        reward_model_path=args.reward_model_path,
-        vlm_name=args.vlm_name,
-        output_path=args.output_path,
-        device=args.device,
-        num_samples=args.num_samples,
-        fps=args.fps,
-        episodes=args.episodes,
-    )
-
-    print(f"\nTOPReward progress saved to: {output_path}")
-
-    if args.push_to_hub:
-        from huggingface_hub import HfApi
-
-        api = HfApi()
-        hub_path = DEFAULT_OUTPUT_FILENAME
-
-        print(f"\nUploading to Hub: {args.dataset_repo_id}/{hub_path}")
-        api.upload_file(
-            path_or_fileobj=str(output_path),
-            path_in_repo=hub_path,
-            repo_id=args.dataset_repo_id,
-            repo_type="dataset",
-        )
-        print(
-            "Successfully uploaded to: "
-            f"https://huggingface.co/datasets/{args.dataset_repo_id}/blob/main/{hub_path}"
-        )
-
-        print("\nTo use in training, add to your config:")
-        print("  use_rabc: true")
-        print(f"  rabc_progress_path: hf://datasets/{args.dataset_repo_id}/{hub_path}")
-        print("  rabc_head_mode: sparse")
-    else:
-        print("\nTo use in training, add to your config:")
-        print("  use_rabc: true")
-        print(f"  rabc_progress_path: {output_path}")
-        print("  rabc_head_mode: sparse")
-
-
-if __name__ == "__main__":
-    main()
--- a/src/lerobot/rewards/topreward/configuration_topreward.py
+++ b/src/lerobot/rewards/topreward/configuration_topreward.py
@@ -1,146 +0,0 @@
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from __future__ import annotations
-
-from dataclasses import dataclass, field
-
-from lerobot.configs import FeatureType, NormalizationMode, PolicyFeature
-from lerobot.configs.rewards import RewardModelConfig
-from lerobot.utils.constants import OBS_IMAGES
-
-# Default prompt scaffolding from the upstream TOPReward paper / reference
-# implementation (``QwenClient.compute_instruction_reward``). The prompt
-# scores the terminal ``True`` token in ``f"{instruction} ... True"``
-# given the video.
-DEFAULT_PROMPT_PREFIX = (
-    "The above video shows a robot manipulation trajectory that completes the following task: "
-)
-DEFAULT_PROMPT_SUFFIX_TEMPLATE = (
-    "{instruction} Decide whether the above statement is True or not. The answer is: True"
-)
-
-
-@RewardModelConfig.register_subclass("topreward")
-@dataclass
-class TOPRewardConfig(RewardModelConfig):
-    """Configuration for the TOPReward zero-shot reward model.
-
-    TOPReward is **zero-shot**: it has no learnable parameters of its own.
-    The "model" is a generic vision-language model (default
-    ``Qwen/Qwen3-VL-8B-Instruct``) used with a fixed prompt to extract
-    token log-probabilities as a reward signal. There is therefore no
-    fine-tuned checkpoint to host: ``pretrained_path`` is unused at
-    runtime — the model identity is :attr:`vlm_name` (an HF Hub id).
-
-    Args:
-        vlm_name: Hugging Face Hub id of the underlying VLM. Must be a
-            Qwen3-VL family model (the only client implemented in this
-            LeRobot port).
-        torch_dtype: Torch dtype name passed to the VLM loader
-            (``"auto"``, ``"bfloat16"``, ``"float16"``, ...).
-        attn_implementation: ``transformers`` attention implementation
-            (e.g. ``"flash_attention_2"``, ``"sdpa"``). Defaults to
-            ``None`` so the upstream picks the best available.
-        image_key: Observation key that holds the trajectory frames.
-        task_key: Complementary-data key that holds the task instruction.
-        default_task: Fallback instruction when ``task_key`` is absent.
-        max_frames: Cap on the number of frames fed to the VLM per
-            sample. ``None`` = use all frames.
-        fps: Frames-per-second metadata for the Qwen video processor.
-        prompt_prefix: Text shown to the VLM right after the video and
-            before the suffix template.
-        prompt_suffix_template: Suffix appended after ``prompt_prefix``.
-            Must contain ``{instruction}``; the VLM scores the
-            log-likelihood of the tokens that follow the prefix.
-        add_chat_template: If ``True``, wrap the full prompt with the
-            tokenizer's chat template before tokenisation (matches
-            upstream ``add_chat_template=True``).
-        success_threshold: Optional log-prob threshold. If finite,
-            :meth:`TOPRewardModel.compute_reward` returns
-            ``(reward > success_threshold).float()`` instead of the raw
-            log-prob.
-        max_input_length: Hard limit on the total tokenized input length;
-            samples that exceed it raise a ``ValueError``.
-    """
-
-    # Path to a local LeRobot dir or HF repo that holds a ``config.json``
-    # snapshot of this TOPRewardConfig. The VLM weights themselves are
-    # always identified by ``vlm_name``.
-    pretrained_path: str | None = None
-
-    vlm_name: str = "Qwen/Qwen3-VL-8B-Instruct"
-    torch_dtype: str = "auto"
-    attn_implementation: str | None = None
-
-    image_key: str = OBS_IMAGES + ".top"
-    task_key: str = "task"
-    default_task: str | None = None
-    max_frames: int | None = 16
-    fps: float = 2.0
-
-    prompt_prefix: str = DEFAULT_PROMPT_PREFIX
-    prompt_suffix_template: str = DEFAULT_PROMPT_SUFFIX_TEMPLATE
-    add_chat_template: bool = False
-
-    success_threshold: float = float("-inf")
-    max_input_length: int = 32768
-
-    license: str | None = "mit"  # matches upstream TOPReward
-    tags: list[str] | None = field(
-        default_factory=lambda: ["reward-model", "vision-language", "qwen3-vl", "zero-shot"]
-    )
-
-    input_features: dict[str, PolicyFeature] = field(default_factory=dict)
-    output_features: dict[str, PolicyFeature] = field(default_factory=dict)
-    normalization_mapping: dict[str, NormalizationMode] = field(
-        default_factory=lambda: {
-            "VISUAL": NormalizationMode.IDENTITY,
-            "REWARD": NormalizationMode.IDENTITY,
-        }
-    )
-
-    def __post_init__(self) -> None:
-        super().__post_init__()
-        if self.max_frames is not None and self.max_frames < 1:
-            raise ValueError(f"max_frames must be >= 1, got {self.max_frames}")
-        if self.fps <= 0:
-            raise ValueError(f"fps must be > 0, got {self.fps}")
-        if "{instruction}" not in self.prompt_suffix_template:
-            raise ValueError(
-                "prompt_suffix_template must contain `{instruction}` so the model "
-                "scores the log-likelihood of the task suffix."
-            )
-        if self.max_input_length <= 0:
-            raise ValueError(f"max_input_length must be > 0, got {self.max_input_length}")
-
-        if self.image_key not in self.input_features:
-            self.input_features[self.image_key] = PolicyFeature(shape=(3, 224, 224), type=FeatureType.VISUAL)
-        self.output_features.setdefault("reward", PolicyFeature(shape=(1,), type=FeatureType.REWARD))
-
-    @property
-    def observation_delta_indices(self) -> list[int] | None:
-        return None
-
-    @property
-    def action_delta_indices(self) -> None:
-        return None
-
-    @property
-    def reward_delta_indices(self) -> None:
-        return None
-
-    def validate_features(self) -> None:
-        if self.image_key not in self.input_features:
-            raise ValueError(f"TOPReward requires image input feature {self.image_key!r}")
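Note on the removed `TOPRewardConfig`: the prompt it describes is the prefix followed by the formatted suffix template, so the scored text always ends in the literal `True`. The sketch below shows only that string assembly, under stated assumptions: `build_prompt` is a hypothetical helper that is not part of the removed code, and the video tokens and chat template that the real processor adds are left out.

```python
# Default prompt pieces, copied from the removed configuration file.
PROMPT_PREFIX = (
    "The above video shows a robot manipulation trajectory that completes the following task: "
)
PROMPT_SUFFIX_TEMPLATE = (
    "{instruction} Decide whether the above statement is True or not. The answer is: True"
)


def build_prompt(instruction, prefix=PROMPT_PREFIX, suffix_template=PROMPT_SUFFIX_TEMPLATE):
    """Hypothetical helper: join prefix and formatted suffix into one prompt string."""
    # Same check as the removed __post_init__: the template must carry the task.
    if "{instruction}" not in suffix_template:
        raise ValueError("suffix template must contain `{instruction}`")
    return prefix + suffix_template.format(instruction=instruction)


prompt = build_prompt("Put the lego brick into the box.")
```

With the default template, the last token of `prompt` is the answer word, which is why the removed model only needs the log-probability at the final position.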
--- a/src/lerobot/rewards/topreward/modeling_topreward.py
+++ b/src/lerobot/rewards/topreward/modeling_topreward.py
@@ -1,238 +0,0 @@
-# Copyright 2026 Shirui Chen, Cole Harrison, Ying-Chun Lee, Angela Jin Yang,
-# Zhongzheng Ren, Lillian J. Ratliff, Jiafei Duan, Dieter Fox, Ranjay Krishna
-# and The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-"""TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics.
-
-Paper:         https://arxiv.org/abs/2602.19313
-Project:       https://topreward.github.io/webpage/
-Original code: https://github.com/TOPReward/TOPReward
-Backbone:      https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct  (default)
-
-TOPReward is a **zero-shot** reward model: it has no fine-tuned weights of
-its own. Given a video trajectory and a task instruction, it asks an
-off-the-shelf VLM how likely the instruction is, conditioned on the video,
-and returns that log-likelihood as the reward signal.
-
-Inference recipe:
-
-1. The processor builds a chat-style prompt, tokenises it, and emits
-   ``input_ids``, ``attention_mask``, vision tensors, and ``labels``.
-   The processor label-masks everything except the terminal answer token with
-   ``-100``.
-2. Forward the full token sequence through the VLM.
-3. Read the terminal answer token log-probability from the logits as the
-   scalar reward.
-
-With the default ``prompt_suffix_template``, the only unmasked token is the
-literal ``"True"`` at the end — the reward is
-``log P("True" | video + prompt + instruction)``.
-
-This LeRobot port is **inference-only and not trainable** — :meth:`forward`
-is intentionally inherited from :class:`PreTrainedRewardModel` and raises
-``NotImplementedError``, making :attr:`PreTrainedRewardModel.is_trainable`
-return ``False``.
-
-Because the VLM weights live on the Hugging Face Hub under their canonical
-id (``Qwen/Qwen3-VL-8B-Instruct`` etc.) and TOPReward never modifies them,
-:meth:`_save_pretrained` and :meth:`from_pretrained` are overridden so a
-TOPReward LeRobot "checkpoint" is a single ``config.json`` (the VLM is
-re-fetched from the Hub at load time).
-"""
-
-from __future__ import annotations
-
-import builtins
-import logging
-import os
-from pathlib import Path
-from tempfile import TemporaryDirectory
-from typing import TYPE_CHECKING, Any, TypeVar
-
-import numpy as np
-import torch
-from huggingface_hub import HfApi, hf_hub_download
-from huggingface_hub.constants import CONFIG_NAME
-from huggingface_hub.errors import HfHubHTTPError
-from torch import Tensor
-from torch.nn.functional import cross_entropy
-
-from lerobot.configs.rewards import RewardModelConfig
-from lerobot.rewards.pretrained import PreTrainedRewardModel
-from lerobot.rewards.topreward.configuration_topreward import TOPRewardConfig
-from lerobot.rewards.topreward.processor_topreward import TOPREWARD_FEATURE_PREFIX, TOPREWARD_INPUT_KEYS
-from lerobot.utils.import_utils import _transformers_available, require_package
-
-if TYPE_CHECKING:
-    from lerobot.configs.train import TrainPipelineConfig
-
-if TYPE_CHECKING or _transformers_available:
-    from transformers import Qwen3VLForConditionalGeneration
-else:
-    Qwen3VLForConditionalGeneration = None  # type: ignore[assignment]
-
-logger = logging.getLogger(__name__)
-
-T = TypeVar("T", bound="TOPRewardModel")
-
-
-def _torch_dtype(name: str) -> torch.dtype | str:
-    """Resolve a torch dtype name; ``"auto"`` is passed through verbatim."""
-    if name == "auto":
-        return "auto"
-    dtype = getattr(torch, name, None)
-    if isinstance(dtype, torch.dtype):
-        return dtype
-    raise ValueError(f"Unknown torch dtype: {name!r}")
-
-
-class TOPRewardModel(PreTrainedRewardModel):
-    """TOPReward zero-shot reward model."""
-
-    name = "topreward"
-    config_class = TOPRewardConfig
-
-    def __init__(self, config: TOPRewardConfig) -> None:
-        require_package("transformers", extra="topreward")
-        super().__init__(config)
-        self.config = config
-
-        torch_dtype = _torch_dtype(config.torch_dtype)
-        model_kwargs: dict[str, Any] = {"dtype": torch_dtype, "trust_remote_code": True}
-        if config.attn_implementation is not None:
-            model_kwargs["attn_implementation"] = config.attn_implementation
-
-        self.model = Qwen3VLForConditionalGeneration.from_pretrained(config.vlm_name, **model_kwargs)
-
-    def compute_reward(self, batch: dict[str, Any]) -> Tensor:
-        """Return one log-prob reward per sample in the batch."""
-        inputs: dict[str, Any] = {}
-        for key in TOPREWARD_INPUT_KEYS:
-            batch_key = f"{TOPREWARD_FEATURE_PREFIX}{key}"
-            if batch_key not in batch:
-                raise KeyError(
-                    f"TOPReward batch missing `{batch_key}`. Make sure the "
-                    "TOPRewardEncoderProcessorStep ran before `compute_reward`."
-                )
-            inputs[key] = batch[batch_key]
-
-        device = next(self.model.parameters()).device
-        inputs = {key: value.to(device) if hasattr(value, "to") else value for key, value in inputs.items()}
-        labels = inputs.pop("labels")
-        inputs["logits_to_keep"] = 2
-
-        self.eval()
-        with torch.no_grad():
-            outputs = self.model(**inputs)
-        logits = outputs.logits
-        rewards = -cross_entropy(logits[:, -2, :].float(), labels[:, -1], reduction="none")
-        if np.isfinite(self.config.success_threshold):
-            rewards = (rewards > self.config.success_threshold).float()
-        return rewards.to(self.config.device or "cpu")
-
-    def _save_pretrained(self, save_directory: Path) -> None:
-        """Save ``config.json`` only."""
-        self.config._save_pretrained(save_directory)
-
-    @classmethod
-    def from_pretrained(
-        cls: builtins.type[T],
-        pretrained_name_or_path: str | Path,
-        *,
-        config: RewardModelConfig | None = None,
-        force_download: bool = False,
-        resume_download: bool | None = None,
-        proxies: dict | None = None,
-        token: str | bool | None = None,
-        cache_dir: str | Path | None = None,
-        local_files_only: bool = False,
-        revision: str | None = None,
-        strict: bool = False,  # noqa: ARG003 — accepted for API parity; unused (no safetensors to load)
-        **kwargs: Any,
-    ) -> T:
-        """Load a TOPReward configuration and instantiate the wrapped VLM."""
-        if config is None:
-            config = RewardModelConfig.from_pretrained(
-                pretrained_name_or_path=pretrained_name_or_path,
-                force_download=force_download,
-                resume_download=resume_download,
-                proxies=proxies,
-                token=token,
-                cache_dir=cache_dir,
-                local_files_only=local_files_only,
-                revision=revision,
-                **kwargs,
-            )
-        if not isinstance(config, TOPRewardConfig):
-            raise TypeError(
-                f"Expected a TOPRewardConfig, got {type(config).__name__}. Make sure "
-                f"`pretrained_name_or_path={pretrained_name_or_path!r}` points at a "
-                "TOPReward checkpoint."
-            )
-
-        model_id = str(pretrained_name_or_path)
-        if not os.path.isdir(model_id):
-            try:
-                hf_hub_download(
-                    repo_id=model_id,
-                    filename=CONFIG_NAME,
-                    revision=revision,
-                    cache_dir=cache_dir,
-                    force_download=force_download,
-                    proxies=proxies,
-                    resume_download=resume_download,
-                    token=token,
-                    local_files_only=local_files_only,
-                )
-            except HfHubHTTPError as e:
-                raise FileNotFoundError(
-                    f"{CONFIG_NAME} not found on the HuggingFace Hub in {model_id}"
-                ) from e
-
-        instance = cls(config, **kwargs)
-        instance.to(config.device)
-        instance.eval()
-        return instance
-
-    def push_model_to_hub(self, cfg: TrainPipelineConfig):
-        """Push the TOPReward ``config.json`` + model card to the Hub."""
-        api = HfApi()
-        repo_id = api.create_repo(
-            repo_id=self.config.repo_id, private=self.config.private, exist_ok=True
-        ).repo_id
-
-        with TemporaryDirectory(ignore_cleanup_errors=True) as tmp:
-            saved_path = Path(tmp) / repo_id
-            saved_path.mkdir(parents=True, exist_ok=True)
-
-            self.config._save_pretrained(saved_path)
-
-            card = self.generate_model_card(
-                cfg.dataset.repo_id, self.config.type, self.config.license, self.config.tags
-            )
-            card.save(str(saved_path / "README.md"))
-
-            cfg.save_pretrained(saved_path)
-
-            commit_info = api.upload_folder(
-                repo_id=repo_id,
-                repo_type="model",
-                folder_path=saved_path,
-                commit_message="Upload TOPReward config and readme",
-                allow_patterns=["*.json", "*.yaml", "*.md"],
-                ignore_patterns=["*.tmp", "*.log", "*.safetensors"],
-            )
-
-            logger.info(f"Model pushed to {commit_info.repo_url.url}")
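Note on the removed `compute_reward`: it takes the reward as `-cross_entropy(...)` between the logits one position before the end and the final label. A minimal sketch of why that equals the log-probability of the answer token, written in plain Python rather than torch. The logits and the `true_id` index are made up for illustration, and the helper names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def cross_entropy(logits, target_id):
    """Cross-entropy for a single target: minus the target's log-probability."""
    return -log_softmax(logits)[target_id]


# Toy vocabulary of three tokens; index 0 stands in for the "True" token.
logits = [2.0, 0.5, -1.0]
true_id = 0

# Negating the loss recovers log P(target), which is always <= 0.
reward = -cross_entropy(logits, true_id)
```

A finite `success_threshold`, as in the removed config, would then turn `reward` into a 0/1 value via `reward > success_threshold`.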
--- a/src/lerobot/rewards/topreward/processor_topreward.py
+++ b/src/lerobot/rewards/topreward/processor_topreward.py
@@ -1,305 +0,0 @@
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-"""TOPReward pre/post processing pipeline."""
-
-from __future__ import annotations
-
-from dataclasses import dataclass, field
-from typing import TYPE_CHECKING, Any
-
-import torch
-from torch import Tensor
-
-from lerobot.configs import PipelineFeatureType, PolicyFeature
-from lerobot.processor import (
-    AddBatchDimensionProcessorStep,
-    DeviceProcessorStep,
-    PolicyAction,
-    PolicyProcessorPipeline,
-    ProcessorStep,
-    ProcessorStepRegistry,
-    policy_action_to_transition,
-)
-from lerobot.rewards.topreward.configuration_topreward import (
-    DEFAULT_PROMPT_PREFIX,
-    DEFAULT_PROMPT_SUFFIX_TEMPLATE,
-    TOPRewardConfig,
-)
-from lerobot.types import EnvTransition, TransitionKey
-from lerobot.utils.constants import (
-    OBS_IMAGES,
-    OBS_PREFIX,
-    POLICY_POSTPROCESSOR_DEFAULT_NAME,
-    POLICY_PREPROCESSOR_DEFAULT_NAME,
-)
-from lerobot.utils.import_utils import _transformers_available, require_package
-
-if TYPE_CHECKING or _transformers_available:
-    from transformers import AutoProcessor
-else:
-    AutoProcessor = None
-
-TOPREWARD_FEATURE_PREFIX = f"{OBS_PREFIX}topreward."
-
-_TRUE_ANSWER = "True"
-
-TOPREWARD_VLM_INPUT_KEYS = (
-    "input_ids",
-    "attention_mask",
-    "pixel_values_videos",
-    "video_grid_thw",
-    "mm_token_type_ids",
-)
-TOPREWARD_INPUT_KEYS = TOPREWARD_VLM_INPUT_KEYS + ("labels",)
-
-
-def _prepare_video_batch(video: Tensor, *, max_frames: int | None) -> Tensor:
-    """Return videos as ``(B, T, C, H, W)`` uint8 tensors for Qwen3-VL."""
-    if video.ndim == 4:
-        video = video.unsqueeze(1)
-    elif video.ndim != 5:
-        raise ValueError(
-            f"Expected TOPReward frames with shape (B,C,H,W) or (B,T,C,H,W); got {tuple(video.shape)}"
-        )
-
-    if max_frames is not None:
-        video = video[:, -max_frames:]
-    if video.shape[-1] in (1, 3):
-        video = video.permute(0, 1, 4, 2, 3)
-    elif video.shape[2] not in (1, 3):
-        raise ValueError(f"Expected channel dim of size 1 or 3, got shape {tuple(video.shape)}")
-
-    if video.is_floating_point():
-        video = video * 255.0
-
-    return video.clamp(0, 255).to(torch.uint8).contiguous()
-
-
-def _expand_tasks(task: Any, *, batch_size: int, default: str | None) -> list[str]:
-    if task is None:
-        task = default
-    if task is None:
-        raise KeyError("TOPReward expected a task description in complementary data")
-    if isinstance(task, str):
-        return [task] * batch_size
-    if isinstance(task, tuple):
-        task = list(task)
-    if not (isinstance(task, list) and all(isinstance(item, str) for item in task)):
-        raise TypeError(f"TOPReward task must be a string or list of strings, got {type(task)}")
-    if len(task) == 1 and batch_size > 1:
-        return task * batch_size
-    if len(task) != batch_size:
-        raise ValueError(f"Expected {batch_size} tasks, got {len(task)}")
-    return task
-
-
-@dataclass
-@ProcessorStepRegistry.register(name="topreward_encoder")
-class TOPRewardEncoderProcessorStep(ProcessorStep):
-    """Encode raw frames + task into Qwen-VL tensors for the TOPReward model.
-
-    Loads a :class:`~transformers.AutoProcessor` matching ``vlm_name`` and
-    builds the full chat prompt including the instruction suffix. The
-    resulting ``input_ids``, ``attention_mask``, vision tensors, and
-    ``labels`` are written under the ``observation.topreward.*`` namespace
-    so the model can score without re-tokenising.
-
-    At call time the step reads:
-
-    - ``observation[image_key]``: ``(B, T, C, H, W)`` or ``(B, C, H, W)`` frames.
-    - ``complementary_data[task_key]``: a string or list of strings.
-
-    and writes ``observation[f"{TOPREWARD_FEATURE_PREFIX}<name>"]`` for the
-    Qwen-VL tensors plus ``labels``.
-    """
-
-    vlm_name: str = "Qwen/Qwen3-VL-8B-Instruct"
-    image_key: str = OBS_IMAGES + ".top"
-    task_key: str = "task"
-    default_task: str | None = None
-    max_frames: int | None = 16
-    fps: float = 2.0
-    prompt_prefix: str = DEFAULT_PROMPT_PREFIX
-    prompt_suffix_template: str = DEFAULT_PROMPT_SUFFIX_TEMPLATE
-    add_chat_template: bool = False
-    max_length: int = 32768
-
-    _processor: Any = field(default=None, init=False, repr=False)
-
-    def __post_init__(self) -> None:
-        require_package("transformers", extra="topreward")
-        self._processor = AutoProcessor.from_pretrained(self.vlm_name, trust_remote_code=True)
-
-    def __call__(self, transition: EnvTransition) -> EnvTransition:
-        observation = transition.get(TransitionKey.OBSERVATION)
-        complementary = transition.get(TransitionKey.COMPLEMENTARY_DATA) or {}
-        if self.image_key not in observation:
-            raise KeyError(f"TOPReward expected image key {self.image_key!r} in observation")
-
-        frames = observation[self.image_key]
-        videos = frames.detach().cpu() if isinstance(frames, Tensor) else torch.as_tensor(frames)
-        videos = _prepare_video_batch(videos, max_frames=self.max_frames)
-
-        batch_size = videos.shape[0]
-        tasks = _expand_tasks(
-            complementary.get(self.task_key, self.default_task),
-            batch_size=batch_size,
-            default=self.default_task,
-        )
-
-        encoded = self._encode_batch(videos, tasks, batch_size)
-
-        new_observation = dict(observation)
-        for key, value in encoded.items():
-            new_observation[f"{TOPREWARD_FEATURE_PREFIX}{key}"] = value
-
-        new_transition = transition.copy()
-        new_transition[TransitionKey.OBSERVATION] = new_observation
-        return new_transition
-
-    def _encode_batch(self, videos: Tensor, tasks: list[str], batch_size) -> dict[str, Any]:
-        """Tokenise a batch of (frames, task) pairs into Qwen-VL tensors.
-
-        The loop only builds per-sample chat strings. Tokenisation, padding,
-        video preprocessing, and label construction are batched.
-        """
-
-        texts: list[str] = []
-        video_metadata = [
-            {
-                "total_num_frames": int(videos.shape[1]),
-                "fps": float(self.fps),
-                "frames_indices": list(range(int(videos.shape[1]))),
-            }
-            for _ in range(batch_size)
-        ]
-        eos_token = self._processor.tokenizer.eos_token
-
-        for i in range(batch_size):
-            instruction_suffix = self.prompt_suffix_template.format(instruction=tasks[i])
-            if self.add_chat_template:
-                suffix_for_template = instruction_suffix.removesuffix(_TRUE_ANSWER).rstrip()
-                templated_messages = [
-                    {
-                        "role": "user",
-                        "content": [
-                            {"type": "video", "video": videos[i], "fps": self.fps},
-                            {"type": "text", "text": f"{self.prompt_prefix}{suffix_for_template}"},
-                        ],
-                    }
-                ]
-                prompt_chat = self._processor.apply_chat_template(
-                    templated_messages, tokenize=False, add_generation_prompt=True
-                )
-                full_text = f"{prompt_chat}{_TRUE_ANSWER}"
-            else:
-                user_messages = [
-                    {
-                        "role": "user",
-                        "content": [
-                            {"type": "video", "video": videos[i], "fps": self.fps},
-                            {"type": "text", "text": self.prompt_prefix},
-                        ],
-                    }
-                ]
-                prompt_chat = self._processor.apply_chat_template(
-                    user_messages, tokenize=False, add_generation_prompt=False
-                )
-                if eos_token is not None:
-                    prompt_chat = prompt_chat.split(eos_token)[0]
-                full_text = f"{prompt_chat}{instruction_suffix}"
-
-            texts.append(full_text)
-
-        result = self._processor(
-            text=texts,
-            videos=videos,
-            video_metadata=video_metadata,
-            do_sample_frames=False,
-            padding=True,
-            padding_side="left",
-            return_tensors="pt",
-        )
-        input_ids = result["input_ids"]
-
-        if input_ids.shape[-1] > self.max_length:
-            raise ValueError(
-                f"TOPReward input length {input_ids.shape[-1]} exceeds max_length "
-                f"{self.max_length}; lower `max_frames` or raise `max_length`."
-            )
-
-        labels = torch.full_like(input_ids, -100)
-        labels[:, -1] = input_ids[:, -1]
-        result["labels"] = labels
-        return result
-
-    def transform_features(
-        self, features: dict[PipelineFeatureType, dict[str, PolicyFeature]]
-    ) -> dict[PipelineFeatureType, dict[str, PolicyFeature]]:
-        return features
-
-    def get_config(self) -> dict[str, Any]:
-        return {
-            "vlm_name": self.vlm_name,
-            "image_key": self.image_key,
-            "task_key": self.task_key,
-            "default_task": self.default_task,
-            "max_frames": self.max_frames,
-            "fps": self.fps,
-            "prompt_prefix": self.prompt_prefix,
-            "prompt_suffix_template": self.prompt_suffix_template,
-            "add_chat_template": self.add_chat_template,
-            "max_length": self.max_length,
-        }
-
-
-def make_topreward_pre_post_processors(
-    config: TOPRewardConfig,
-    dataset_stats: dict[str, dict[str, Any]] | None = None,
-) -> tuple[
-    PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
-    PolicyProcessorPipeline[PolicyAction, PolicyAction],
-]:
-    """Pipeline that pre-encodes frames + task into Qwen-VL tensors.
-
-    The preprocessor adds a batch dimension if needed, runs TOPReward's
-    encoder (which tokenises the full prompt and emits ``labels``), and
-    moves everything to the configured device. The postprocessor is
-    the identity since TOPReward outputs a single reward tensor.
-    """
-    preprocessor = PolicyProcessorPipeline[dict[str, Any], dict[str, Any]](
-        steps=[
-            AddBatchDimensionProcessorStep(),
-            TOPRewardEncoderProcessorStep(
-                vlm_name=config.vlm_name,
-                image_key=config.image_key,
-                task_key=config.task_key,
-                default_task=config.default_task,
-                max_frames=config.max_frames,
-                fps=config.fps,
-                prompt_prefix=config.prompt_prefix,
-                prompt_suffix_template=config.prompt_suffix_template,
-                add_chat_template=config.add_chat_template,
-                max_length=config.max_input_length,
-            ),
-            DeviceProcessorStep(device=config.device or "cpu"),
-        ],
-        name=POLICY_PREPROCESSOR_DEFAULT_NAME,
-    )
-    postprocessor = PolicyProcessorPipeline(
-        name=POLICY_POSTPROCESSOR_DEFAULT_NAME,
-        to_transition=policy_action_to_transition,
-    )
-    return preprocessor, postprocessor
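Note on the removed `_encode_batch`: it pads on the left and masks every label with `-100` except the last position, so the answer token sits at index `-1` in every row regardless of prompt length. A sketch of that layout with plain lists; `PAD_ID` and the token ids are hypothetical, and the helpers are illustrative rather than the removed implementation.

```python
IGNORE_INDEX = -100  # label value used for masking in the removed processor
PAD_ID = 0  # hypothetical padding token id


def left_pad(sequences, pad_id=PAD_ID):
    """Pad each sequence on the left so all rows share the longest length."""
    width = max(len(seq) for seq in sequences)
    return [[pad_id] * (width - len(seq)) + list(seq) for seq in sequences]


def terminal_labels(input_ids):
    """Mask every position except the last one in each row."""
    return [[IGNORE_INDEX] * (len(row) - 1) + [row[-1]] for row in input_ids]


# Two prompts of different length, both ending in the answer token id 99.
batch = left_pad([[11, 12, 99], [21, 22, 23, 24, 99]])
labels = terminal_labels(batch)
```

Right padding would instead push the answer token of the shorter row away from the last column, so reading `labels[:, -1]` would no longer pick it up.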
--- a/src/lerobot/templates/lerobot_modelcard_template.md
+++ b/src/lerobot/templates/lerobot_modelcard_template.md
@@ -73,14 +73,17 @@ _Writes checkpoints to `outputs/train/<desired_policy_repo_id>/checkpoints/`._
 ### Evaluate the policy/run inference

 ```bash
-lerobot-record \
-  --robot.type=so100_follower \
-  --dataset.repo_id=<hf_user>/eval_<dataset> \
+lerobot-rollout \
+  --strategy.type=base \
+  --robot.type=so101_follower \
+  --robot.port=/dev/ttyACM0 \
+  --robot.cameras="{ up: {type: opencv, index_or_path: /dev/video1, width: 640, height: 480, fps: 30}, side: {type: opencv, index_or_path: /dev/video5, width: 640, height: 480, fps: 30}}" \
  --policy.path=<hf_user>/<desired_policy_repo_id> \
-  --episodes=10
+  --task="Put lego brick into the transparent box" \
+  --duration=60
 ```

-Prefix the dataset repo with **eval\_** and supply `--policy.path` pointing to a local or hub checkpoint.
+To record a dataset while testing the policy, add `--dataset.repo_id=<hf_user>/eval_dataset_name`; it is important that the dataset name uses the **eval\_** prefix. `--policy.path` accepts a policy from the Hugging Face Hub or a local one. If `--duration` is omitted, the policy runs indefinitely.

 ---

--- a/src/lerobot/templates/lerobot_rewardmodel_modelcard_template.md
+++ b/src/lerobot/templates/lerobot_rewardmodel_modelcard_template.md
@@ -13,10 +13,6 @@
 A reward classifier is a lightweight neural network that scores observations or trajectories for task success, providing a learned reward signal or offline evaluation when explicit rewards are unavailable.
 {% elif model_name == "sarm" %}
 A Success-Aware Reward Model (SARM) predicts a dense reward signal from observations, typically used downstream for reinforcement learning or human-in-the-loop fine-tuning when task success is not directly observable.
-{% elif model_name == "robometer" %}
-ROBOMETER is a general-purpose video-language robotic reward model built on a fine-tuned Qwen3-VL-4B backbone with progress, preference, and success heads. Given a trajectory video and a task description, it predicts dense, frame-level task progress in [0, 1] and frame-level success probabilities for downstream robot learning, including offline RL, online RL, data filtering and retrieval, and automated failure detection.
-{% elif model_name == "topreward" %}
-TOPReward is a **zero-shot** reward model that extracts token log-probabilities from an off-the-shelf vision-language model (default Qwen3-VL) as a reward signal. Given a video trajectory and a task instruction, it returns the VLM's log-likelihood of the instruction being true, with no fine-tuning required.
 {% else %}
 _Reward model type not recognized — please update this template._
 {% endif %}
--- a/tests/policies/groot/test_groot_n1_7.py
+++ b/tests/policies/groot/test_groot_n1_7.py
--- a/tests/policies/molmoact2/test_molmoact2.py
+++ b/tests/policies/molmoact2/test_molmoact2.py
--- a/tests/policies/pi0_pi05/openpi_pytorch/init.py
+++ b/tests/policies/pi0_pi05/openpi_pytorch/init.py
@@ -1 +0,0 @@
-"""Lightweight vendored OpenPI PyTorch modules for PI0/PI05 parity tests."""
--- a/tests/policies/pi0_pi05/openpi_pytorch/gemma.py
+++ b/tests/policies/pi0_pi05/openpi_pytorch/gemma.py
@@ -1,22 +0,0 @@
-from dataclasses import dataclass
-
-
-@dataclass
-class Config:
-    width: int
-    depth: int
-    mlp_dim: int
-    num_heads: int
-    num_kv_heads: int
-    head_dim: int
-
-
-def get_config(variant: str) -> Config:
-    """Return the Gemma shape config needed by the OpenPI PyTorch model."""
-    if variant == "dummy":
-        return Config(width=64, depth=4, mlp_dim=128, num_heads=8, num_kv_heads=1, head_dim=16)
-    if variant == "gemma_300m":
-        return Config(width=1024, depth=18, mlp_dim=4096, num_heads=8, num_kv_heads=1, head_dim=256)
-    if variant == "gemma_2b":
-        return Config(width=2048, depth=18, mlp_dim=16_384, num_heads=8, num_kv_heads=1, head_dim=256)
-    raise ValueError(f"Unknown variant: {variant}")
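Note on the removed `gemma.py`: the sketch below assumes the usual Gemma convention that the query projection has `num_heads * head_dim` outputs and the key/value projections have `num_kv_heads * head_dim`. It re-declares `Config` locally and uses a hypothetical helper, `attn_dims`, which is not part of the removed file. The two variants differ in `width` but end up with the same attention shapes, which appears to be what lets `gemma_pytorch.py` concatenate their query, key, and value states into a single attention call.

```python
from dataclasses import dataclass


@dataclass
class Config:
    width: int
    depth: int
    mlp_dim: int
    num_heads: int
    num_kv_heads: int
    head_dim: int


# Shapes copied from the removed get_config(); attn_dims below is hypothetical.
GEMMA_300M = Config(width=1024, depth=18, mlp_dim=4096, num_heads=8, num_kv_heads=1, head_dim=256)
GEMMA_2B = Config(width=2048, depth=18, mlp_dim=16_384, num_heads=8, num_kv_heads=1, head_dim=256)


def attn_dims(cfg: Config) -> tuple[int, int]:
    """Return (query projection size, key/value projection size)."""
    return cfg.num_heads * cfg.head_dim, cfg.num_kv_heads * cfg.head_dim


print(attn_dims(GEMMA_300M))  # (2048, 256)
print(attn_dims(GEMMA_2B))    # (2048, 256)
```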
--- a/tests/policies/pi0_pi05/openpi_pytorch/gemma_pytorch.py
+++ b/tests/policies/pi0_pi05/openpi_pytorch/gemma_pytorch.py
@@ -1,300 +0,0 @@
-from typing import Literal
-
-import torch
-from torch import nn
-from transformers.models.auto import CONFIG_MAPPING
-from transformers.models.gemma import modeling_gemma
-
-from lerobot.policies.pi_gemma import (
-    PaliGemmaForConditionalGenerationWithPiGemma,
-    PiGemmaForCausalLM,
-    _gated_residual,
-    layernorm_forward,
-)
-
-
-class PaliGemmaWithExpertModel(nn.Module):
-    def __init__(
-        self,
-        vlm_config,
-        action_expert_config,
-        use_adarms=None,
-        precision: Literal["bfloat16", "float32"] = "bfloat16",
-    ):
-        if use_adarms is None:
-            use_adarms = [False, False]
-        super().__init__()
-
-        vlm_config_hf = CONFIG_MAPPING["paligemma"]()
-        vlm_config_hf._vocab_size = 257152  # noqa: SLF001
-        vlm_config_hf.image_token_index = 257152
-        vlm_config_hf.text_config.hidden_size = vlm_config.width
-        vlm_config_hf.text_config.intermediate_size = vlm_config.mlp_dim
-        vlm_config_hf.text_config.num_attention_heads = vlm_config.num_heads
-        vlm_config_hf.text_config.head_dim = vlm_config.head_dim
-        vlm_config_hf.text_config.num_hidden_layers = vlm_config.depth
-        vlm_config_hf.text_config.num_key_value_heads = vlm_config.num_kv_heads
-        vlm_config_hf.text_config.hidden_activation = "gelu_pytorch_tanh"
-        vlm_config_hf.text_config.dtype = "float32"
-        vlm_config_hf.text_config.vocab_size = 257152
-        vlm_config_hf.text_config.use_adarms = use_adarms[0]
-        vlm_config_hf.text_config.adarms_cond_dim = vlm_config.width if use_adarms[0] else None
-        vlm_config_hf.vision_config.intermediate_size = 4304
-        vlm_config_hf.vision_config.projection_dim = 2048
-        vlm_config_hf.vision_config.projector_hidden_act = "gelu_fast"
-        vlm_config_hf.vision_config.dtype = "float32"
-
-        action_expert_config_hf = CONFIG_MAPPING["gemma"](
-            head_dim=action_expert_config.head_dim,
-            hidden_size=action_expert_config.width,
-            intermediate_size=action_expert_config.mlp_dim,
-            num_attention_heads=action_expert_config.num_heads,
-            num_hidden_layers=action_expert_config.depth,
-            num_key_value_heads=action_expert_config.num_kv_heads,
-            vocab_size=257152,
-            hidden_activation="gelu_pytorch_tanh",
-            dtype="float32",
-            use_adarms=use_adarms[1],
-            adarms_cond_dim=action_expert_config.width if use_adarms[1] else None,
-        )
-
-        self.paligemma = PaliGemmaForConditionalGenerationWithPiGemma(config=vlm_config_hf)
-        self.gemma_expert = PiGemmaForCausalLM(config=action_expert_config_hf)
-        self.gemma_expert.model.embed_tokens = None
-
-        self.to_bfloat16_for_selected_params(precision)
-
-    def to_bfloat16_for_selected_params(self, precision: Literal["bfloat16", "float32"] = "bfloat16"):
-        if precision == "bfloat16":
-            self.to(dtype=torch.bfloat16)
-        elif precision == "float32":
-            self.to(dtype=torch.float32)
-            return
-        else:
-            raise ValueError(f"Invalid precision: {precision}")
-
-        params_to_keep_float32 = [
-            "vision_tower",
-            "multi_modal_projector",
-            "input_layernorm",
-            "post_attention_layernorm",
-            "model.norm",
-        ]
-
-        for name, param in self.named_parameters():
-            if any(selector in name for selector in params_to_keep_float32):
-                param.data = param.data.to(dtype=torch.float32)
-
-    def embed_image(self, image: torch.Tensor):
-        # Transformers 5.4 no longer divides PaliGemma image features by sqrt(hidden_size),
-        # so the upstream helper now matches OpenPI's patched PaliGemma image-scale semantics.
-        # See https://github.com/huggingface/transformers/pull/44432/changes#diff-c916907e7e52ac85ee1a1527560eae4656cd6c76141ceb1fe3da61bd5f697d2a
-        out_dtype = image.dtype
-        if image.dtype != torch.float32:
-            image = image.to(torch.float32)
-        image_outputs = self.paligemma.model.get_image_features(image)
-        features = image_outputs.pooler_output
-        if features.dtype != out_dtype:
-            features = features.to(out_dtype)
-        return features
-
-    def embed_language_tokens(self, tokens: torch.Tensor):
-        return self.paligemma.model.language_model.get_input_embeddings()(tokens)
-
-    def forward(
-        self,
-        attention_mask: torch.Tensor | None = None,
-        position_ids: torch.LongTensor | None = None,
-        past_key_values: list[torch.FloatTensor] | None = None,
-        inputs_embeds: list[torch.FloatTensor] | None = None,
-        use_cache: bool | None = None,
-        adarms_cond: list[torch.Tensor] | None = None,
-    ):
-        if adarms_cond is None:
-            adarms_cond = [None, None]
-        if inputs_embeds[1] is None:
-            prefix_output = self.paligemma.model.language_model.forward(
-                inputs_embeds=inputs_embeds[0],
-                attention_mask=attention_mask,
-                position_ids=position_ids,
-                past_key_values=past_key_values,
-                use_cache=use_cache,
-                adarms_cond=adarms_cond[0] if adarms_cond is not None else None,
-            )
-            prefix_past_key_values = prefix_output.past_key_values
-            prefix_output = prefix_output.last_hidden_state
-            suffix_output = None
-        elif inputs_embeds[0] is None:
-            suffix_output = self.gemma_expert.model.forward(
-                inputs_embeds=inputs_embeds[1],
-                attention_mask=attention_mask,
-                position_ids=position_ids,
-                past_key_values=past_key_values,
-                use_cache=use_cache,
-                adarms_cond=adarms_cond[1] if adarms_cond is not None else None,
-            )
-            suffix_output = suffix_output.last_hidden_state
-            prefix_output = None
-            prefix_past_key_values = None
-        else:
-            models = [self.paligemma.model.language_model, self.gemma_expert.model]
-            num_layers = self.paligemma.config.text_config.num_hidden_layers
-
-            # Check if gradient checkpointing is enabled for any of the models
-            use_gradient_checkpointing = (
-                hasattr(self.gemma_expert.model, "gradient_checkpointing")
-                and self.gemma_expert.model.gradient_checkpointing
-                and self.training
-            ) or (hasattr(self, "gradient_checkpointing") and self.gradient_checkpointing and self.training)
-
-            # Force enable gradient checkpointing if we're in training mode and the model supports it
-            if self.training and hasattr(self.gemma_expert.model, "gradient_checkpointing"):
-                if not self.gemma_expert.model.gradient_checkpointing:
-                    print("Forcing gradient checkpointing to be enabled for Gemma expert model")
-                    self.gemma_expert.model.gradient_checkpointing = True
-                use_gradient_checkpointing = True
-
-            # Debug gradient checkpointing status
-            if hasattr(self, "_debug_gc_printed") and not self._debug_gc_printed:
-                print(f"Gemma expert model gradient checkpointing: {use_gradient_checkpointing}")
-                print(f"Model training mode: {self.training}")
-                print(
-                    f"Gemma expert model has gradient_checkpointing attr: {hasattr(self.gemma_expert.model, 'gradient_checkpointing')}"
-                )
-                if hasattr(self.gemma_expert.model, "gradient_checkpointing"):
-                    print(
-                        f"Gemma expert model gradient_checkpointing value: {self.gemma_expert.model.gradient_checkpointing}"
-                    )
-                self._debug_gc_printed = True
-
-            # Define the complete layer computation function for gradient checkpointing
-            def compute_layer_complete(layer_idx, inputs_embeds, attention_mask, position_ids, adarms_cond):
-                models = [self.paligemma.model.language_model, self.gemma_expert.model]
-
-                query_states = []
-                key_states = []
-                value_states = []
-                gates = []
-                for i, hidden_states in enumerate(inputs_embeds):
-                    layer = models[i].layers[layer_idx]
-                    hidden_states, gate = layernorm_forward(
-                        layer.input_layernorm, hidden_states, adarms_cond[i]
-                    )
-                    gates.append(gate)
-
-                    input_shape = hidden_states.shape[:-1]
-                    hidden_shape = (*input_shape, -1, layer.self_attn.head_dim)
-                    query_state = layer.self_attn.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
-                    key_state = layer.self_attn.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
-                    value_state = layer.self_attn.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
-
-                    query_states.append(query_state)
-                    key_states.append(key_state)
-                    value_states.append(value_state)
-
-                # Concatenate and process attention
-                query_states = torch.cat(query_states, dim=2)
-                key_states = torch.cat(key_states, dim=2)
-                value_states = torch.cat(value_states, dim=2)
-
-                dummy_tensor = torch.zeros(
-                    query_states.shape[0],
-                    query_states.shape[2],
-                    query_states.shape[-1],
-                    device=query_states.device,
-                    dtype=query_states.dtype,
-                )
-                cos, sin = self.paligemma.model.language_model.rotary_emb(dummy_tensor, position_ids)
-                query_states, key_states = modeling_gemma.apply_rotary_pos_emb(
-                    query_states, key_states, cos, sin, unsqueeze_dim=1
-                )
-
-                batch_size = query_states.shape[0]
-                scaling = self.paligemma.model.language_model.layers[layer_idx].self_attn.scaling
-
-                # Attention computation
-                att_output, _ = modeling_gemma.eager_attention_forward(
-                    self.paligemma.model.language_model.layers[layer_idx].self_attn,
-                    query_states,
-                    key_states,
-                    value_states,
-                    attention_mask,
-                    scaling,
-                )
-                # Get head_dim from the current layer, not from the model
-                head_dim = self.paligemma.model.language_model.layers[layer_idx].self_attn.head_dim
-                att_output = att_output.reshape(batch_size, -1, 1 * 8 * head_dim)
-
-                # Process layer outputs
-                outputs_embeds = []
-                start_pos = 0
-                for i, hidden_states in enumerate(inputs_embeds):
-                    layer = models[i].layers[layer_idx]
-                    end_pos = start_pos + hidden_states.shape[1]
-
-                    if att_output.dtype != layer.self_attn.o_proj.weight.dtype:
-                        att_output = att_output.to(layer.self_attn.o_proj.weight.dtype)
-                    out_emb = layer.self_attn.o_proj(att_output[:, start_pos:end_pos])
-
-                    # first residual
-                    out_emb = _gated_residual(hidden_states, out_emb, gates[i])
-                    after_first_residual = out_emb.clone()
-                    out_emb, gate = layernorm_forward(layer.post_attention_layernorm, out_emb, adarms_cond[i])
-                    # Convert to bfloat16 if the next layer (mlp) uses bfloat16
-                    if layer.mlp.up_proj.weight.dtype == torch.bfloat16:
-                        out_emb = out_emb.to(dtype=torch.bfloat16)
-
-                    out_emb = layer.mlp(out_emb)
-                    # second residual
-                    out_emb = _gated_residual(after_first_residual, out_emb, gate)
-                    outputs_embeds.append(out_emb)
-                    start_pos = end_pos
-
-                return outputs_embeds
-
-            # Process all layers with gradient checkpointing if enabled
-            for layer_idx in range(num_layers):
-                if use_gradient_checkpointing:
-                    inputs_embeds = torch.utils.checkpoint.checkpoint(
-                        compute_layer_complete,
-                        layer_idx,
-                        inputs_embeds,
-                        attention_mask,
-                        position_ids,
-                        adarms_cond,
-                        use_reentrant=False,
-                        preserve_rng_state=False,
-                    )
-                else:
-                    inputs_embeds = compute_layer_complete(
-                        layer_idx, inputs_embeds, attention_mask, position_ids, adarms_cond
-                    )
-
-                # Old code removed - now using compute_layer_complete function above
-
-            # final norm
-            # Define final norm computation function for gradient checkpointing
-            def compute_final_norms(inputs_embeds, adarms_cond):
-                outputs_embeds = []
-                for i, hidden_states in enumerate(inputs_embeds):
-                    out_emb, _ = layernorm_forward(models[i].norm, hidden_states, adarms_cond[i])
-                    outputs_embeds.append(out_emb)
-                return outputs_embeds
-
-            # Apply gradient checkpointing to final norm if enabled
-            if use_gradient_checkpointing:
-                outputs_embeds = torch.utils.checkpoint.checkpoint(
-                    compute_final_norms,
-                    inputs_embeds,
-                    adarms_cond,
-                    use_reentrant=False,
-                    preserve_rng_state=False,
-                )
-            else:
-                outputs_embeds = compute_final_norms(inputs_embeds, adarms_cond)
-
-            prefix_output = outputs_embeds[0]
-            suffix_output = outputs_embeds[1]
-            prefix_past_key_values = None
-
-        return [prefix_output, suffix_output], prefix_past_key_values
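Note on the removed `to_bfloat16_for_selected_params`: after casting the whole module to bfloat16, it restores float32 for every parameter whose name contains one of five substrings. The sketch below reproduces only that name matching in plain Python, without PyTorch. The parameter names in `names` are hypothetical and chosen for illustration, and `keeps_float32` is not a function from the removed file.

```python
# Selectors copied from the removed to_bfloat16_for_selected_params().
PARAMS_TO_KEEP_FLOAT32 = [
    "vision_tower",
    "multi_modal_projector",
    "input_layernorm",
    "post_attention_layernorm",
    "model.norm",
]


def keeps_float32(param_name: str) -> bool:
    """True if any selector occurs as a substring of the parameter name."""
    return any(selector in param_name for selector in PARAMS_TO_KEEP_FLOAT32)


# Hypothetical parameter names, for illustration only.
names = [
    "paligemma.model.language_model.layers.0.input_layernorm.weight",
    "paligemma.model.language_model.layers.0.self_attn.q_proj.weight",
    "gemma_expert.model.norm.weight",
    "gemma_expert.model.layers.0.mlp.up_proj.weight",
]
for name in names:
    print(name, "->", "float32" if keeps_float32(name) else "bfloat16")
```

Because the match is a substring test rather than an exact name lookup, any parameter whose qualified name happens to contain one of these strings stays in float32.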
--- a/tests/policies/pi0_pi05/openpi_pytorch/image_tools.py
+++ b/tests/policies/pi0_pi05/openpi_pytorch/image_tools.py
@@ -1,79 +0,0 @@
-import torch
-import torch.nn.functional as F  # noqa: N812
-
-
-def resize_with_pad_torch(
-    images: torch.Tensor,
-    height: int,
-    width: int,
-    mode: str = "bilinear",
-) -> torch.Tensor:
-    """PyTorch version of resize_with_pad. Resizes an image to a target height and width without distortion
-    by padding with black. If the image is float32, it must be in the range [-1, 1].
-
-    Args:
-        images: Tensor of shape [*b, h, w, c] or [*b, c, h, w]
-        height: Target height
-        width: Target width
-        mode: Interpolation mode ('bilinear', 'nearest', etc.)
-
-    Returns:
-        Resized and padded tensor with same shape format as input
-    """
-    # Check if input is in channels-last format [*b, h, w, c] or channels-first [*b, c, h, w]
-    if images.shape[-1] <= 4:  # Assume channels-last format
-        channels_last = True
-        # Convert to channels-first for torch operations
-        if images.dim() == 3:
-            images = images.unsqueeze(0)  # Add batch dimension
-        images = images.permute(0, 3, 1, 2)  # [b, h, w, c] -> [b, c, h, w]
-    else:
-        channels_last = False
-        if images.dim() == 3:
-            images = images.unsqueeze(0)  # Add batch dimension
-
-    batch_size, channels, cur_height, cur_width = images.shape
-
-    # Calculate resize ratio
-    ratio = max(cur_width / width, cur_height / height)
-    resized_height = int(cur_height / ratio)
-    resized_width = int(cur_width / ratio)
-
-    # Resize
-    resized_images = F.interpolate(
-        images,
-        size=(resized_height, resized_width),
-        mode=mode,
-        align_corners=False if mode == "bilinear" else None,
-    )
-
-    # Handle dtype-specific clipping
-    if images.dtype == torch.uint8:
-        resized_images = torch.round(resized_images).clamp(0, 255).to(torch.uint8)
-    elif images.dtype == torch.float32:
-        resized_images = resized_images.clamp(-1.0, 1.0)
-    else:
-        raise ValueError(f"Unsupported image dtype: {images.dtype}")
-
-    # Calculate padding
-    pad_h0, remainder_h = divmod(height - resized_height, 2)
-    pad_h1 = pad_h0 + remainder_h
-    pad_w0, remainder_w = divmod(width - resized_width, 2)
-    pad_w1 = pad_w0 + remainder_w
-
-    # Pad
-    constant_value = 0 if images.dtype == torch.uint8 else -1.0
-    padded_images = F.pad(
-        resized_images,
-        (pad_w0, pad_w1, pad_h0, pad_h1),  # left, right, top, bottom
-        mode="constant",
-        value=constant_value,
-    )
-
-    # Convert back to original format if needed
-    if channels_last:
-        padded_images = padded_images.permute(0, 2, 3, 1)  # [b, c, h, w] -> [b, h, w, c]
-        if batch_size == 1 and images.shape[0] == 1:
-            padded_images = padded_images.squeeze(0)  # Remove batch dimension if it was added
-
-    return padded_images
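Note on the removed `resize_with_pad_torch`: the resize and padding sizes can be checked without tensors. The sketch below copies only the arithmetic into a hypothetical helper, `letterbox_geometry`, which is not part of the removed file. The example sizes are chosen so that the ratio is exactly representable in floating point.

```python
def letterbox_geometry(cur_height: int, cur_width: int, height: int, width: int):
    """Mirror the resize and padding arithmetic of the removed helper."""
    # The larger ratio wins, so the image fits inside the target on both axes.
    ratio = max(cur_width / width, cur_height / height)
    resized_height = int(cur_height / ratio)
    resized_width = int(cur_width / ratio)

    # Split the leftover space; an odd remainder goes to the bottom/right side.
    pad_h0, remainder_h = divmod(height - resized_height, 2)
    pad_h1 = pad_h0 + remainder_h
    pad_w0, remainder_w = divmod(width - resized_width, 2)
    pad_w1 = pad_w0 + remainder_w
    # Same order as the F.pad call: left, right, top, bottom.
    return (resized_height, resized_width), (pad_w0, pad_w1, pad_h0, pad_h1)


# A 300x800 image into a 200x200 canvas: ratio is 4.0, so the image becomes 75x200.
print(letterbox_geometry(300, 800, 200, 200))  # ((75, 200), (0, 0, 62, 63))
```

Since `int()` truncates, sizes whose ratio is not exactly representable may come out one pixel smaller than expected, with the difference absorbed by the padding.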
--- a/tests/policies/pi0_pi05/openpi_pytorch/pi0_pytorch.py
+++ b/tests/policies/pi0_pi05/openpi_pytorch/pi0_pytorch.py
@@ -1,471 +0,0 @@
-import copy
-import logging
-import math
-
-import torch
-import torch.nn.functional as F  # noqa: N812
-from torch import Tensor, nn
-
-import tests.policies.pi0_pi05.openpi_pytorch.gemma as _gemma
-from tests.policies.pi0_pi05.openpi_pytorch import preprocessing_pytorch as _preprocessing
-from tests.policies.pi0_pi05.openpi_pytorch.gemma_pytorch import PaliGemmaWithExpertModel
-
-
-def get_safe_dtype(target_dtype, device_type):
-    """Get a safe dtype for the given device type."""
-    if device_type == "cpu":
-        # CPU doesn't support bfloat16, use float32 instead
-        if target_dtype == torch.bfloat16:
-            return torch.float32
-        if target_dtype == torch.float64:
-            return torch.float64
-    return target_dtype
-
-
-def create_sinusoidal_pos_embedding(
-    time: torch.tensor, dimension: int, min_period: float, max_period: float, device="cpu"
-) -> Tensor:
-    """Computes sine-cosine positional embedding vectors for scalar positions."""
-    if dimension % 2 != 0:
-        raise ValueError(f"dimension ({dimension}) must be divisible by 2")
-
-    if time.ndim != 1:
-        raise ValueError("The time tensor is expected to be of shape `(batch_size, )`.")
-
-    dtype = get_safe_dtype(torch.float64, device.type)
-    fraction = torch.linspace(0.0, 1.0, dimension // 2, dtype=dtype, device=device)
-    period = min_period * (max_period / min_period) ** fraction
-
-    # Compute the outer product
-    scaling_factor = 1.0 / period * 2 * math.pi
-    sin_input = scaling_factor[None, :] * time[:, None]
-    return torch.cat([torch.sin(sin_input), torch.cos(sin_input)], dim=1)
-
-
-def sample_beta(alpha, beta, bsize, device):
-    alpha_t = torch.as_tensor(alpha, dtype=torch.float32, device=device)
-    beta_t = torch.as_tensor(beta, dtype=torch.float32, device=device)
-    dist = torch.distributions.Beta(alpha_t, beta_t)
-    return dist.sample((bsize,))
-
-
-def make_att_2d_masks(pad_masks, att_masks):
-    """Copied from big_vision.
-
-    Tokens can attend to valid inputs tokens which have a cumulative mask_ar
-    smaller or equal to theirs. This way `mask_ar` int[B, N] can be used to
-    setup several types of attention, for example:
-
-      [[1 1 1 1 1 1]]: pure causal attention.
-
-      [[0 0 0 1 1 1]]: prefix-lm attention. The first 3 tokens can attend between
-          themselves and the last 3 tokens have a causal attention. The first
-          entry could also be a 1 without changing behaviour.
-
-      [[1 0 1 0 1 0 0 1 0 0]]: causal attention between 4 blocks. Tokens of a
-          block can attend all previous blocks and all tokens on the same block.
-
-    Args:
-      input_mask: bool[B, N] true if its part of the input, false if padding.
-      mask_ar: int32[B, N] mask that's 1 where previous tokens cannot depend on
-        it and 0 where it shares the same attention mask as the previous token.
-    """
-    if att_masks.ndim != 2:
-        raise ValueError(att_masks.ndim)
-    if pad_masks.ndim != 2:
-        raise ValueError(pad_masks.ndim)
-
-    cumsum = torch.cumsum(att_masks, dim=1)
-    att_2d_masks = cumsum[:, None, :] <= cumsum[:, :, None]
-    pad_2d_masks = pad_masks[:, None, :] * pad_masks[:, :, None]
-    return att_2d_masks & pad_2d_masks
-
-
-class PI0Pytorch(nn.Module):
-    def __init__(self, config):
-        super().__init__()
-        self.config = config
-        self.pi05 = config.pi05
-
-        paligemma_config = _gemma.get_config(config.paligemma_variant)
-        action_expert_config = _gemma.get_config(config.action_expert_variant)
-
-        self.paligemma_with_expert = PaliGemmaWithExpertModel(
-            paligemma_config,
-            action_expert_config,
-            use_adarms=[False, True] if self.pi05 else [False, False],
-            precision=config.dtype,
-        )
-
-        self.action_in_proj = nn.Linear(config.action_dim, action_expert_config.width)
-        self.action_out_proj = nn.Linear(action_expert_config.width, config.action_dim)
-
-        if self.pi05:
-            self.time_mlp_in = nn.Linear(action_expert_config.width, action_expert_config.width)
-            self.time_mlp_out = nn.Linear(action_expert_config.width, action_expert_config.width)
-        else:
-            self.state_proj = nn.Linear(config.action_dim, action_expert_config.width)
-            self.action_time_mlp_in = nn.Linear(2 * action_expert_config.width, action_expert_config.width)
-            self.action_time_mlp_out = nn.Linear(action_expert_config.width, action_expert_config.width)
-
-        torch.set_float32_matmul_precision("high")
-        if config.pytorch_compile_mode is not None:
-            self.sample_actions = torch.compile(self.sample_actions, mode=config.pytorch_compile_mode)
-
-        # Initialize gradient checkpointing flag
-        self.gradient_checkpointing_enabled = False
-
-        # The upstream OpenPI module verifies a site-package Transformers patch here.
-        # This vendored test copy instead routes through LeRobot's local PiGemma compatibility layer.
-
-    def gradient_checkpointing_enable(self):
-        """Enable gradient checkpointing for memory optimization."""
-        self.gradient_checkpointing_enabled = True
-        self.paligemma_with_expert.paligemma.model.language_model.gradient_checkpointing = True
-        self.paligemma_with_expert.paligemma.model.vision_tower.gradient_checkpointing = True
-        self.paligemma_with_expert.gemma_expert.model.gradient_checkpointing = True
-
-        logging.info("Enabled gradient checkpointing for PI0Pytorch model")
-
-    def gradient_checkpointing_disable(self):
-        """Disable gradient checkpointing."""
-        self.gradient_checkpointing_enabled = False
-        self.paligemma_with_expert.paligemma.model.language_model.gradient_checkpointing = False
-        self.paligemma_with_expert.paligemma.model.vision_tower.gradient_checkpointing = False
-        self.paligemma_with_expert.gemma_expert.model.gradient_checkpointing = False
-
-        logging.info("Disabled gradient checkpointing for PI0Pytorch model")
-
-    def is_gradient_checkpointing_enabled(self):
-        """Check if gradient checkpointing is enabled."""
-        return self.gradient_checkpointing_enabled
-
-    def _apply_checkpoint(self, func, *args, **kwargs):
-        """Helper method to apply gradient checkpointing if enabled."""
-        if self.gradient_checkpointing_enabled and self.training:
-            return torch.utils.checkpoint.checkpoint(
-                func, *args, use_reentrant=False, preserve_rng_state=False, **kwargs
-            )
-        return func(*args, **kwargs)
-
-    def _prepare_attention_masks_4d(self, att_2d_masks):
-        """Helper method to prepare 4D attention masks for transformer."""
-        att_2d_masks_4d = att_2d_masks[:, None, :, :]
-        return torch.where(att_2d_masks_4d, 0.0, -2.3819763e38)
-
-    def _preprocess_observation(self, observation, *, train=True):
-        """Helper method to preprocess observation."""
-        observation = _preprocessing.preprocess_observation_pytorch(observation, train=train)
-        return (
-            list(observation.images.values()),
-            list(observation.image_masks.values()),
-            observation.tokenized_prompt,
-            observation.tokenized_prompt_mask,
-            observation.state,
-        )
-
-    def sample_noise(self, shape, device):
-        return torch.normal(
-            mean=0.0,
-            std=1.0,
-            size=shape,
-            dtype=torch.float32,
-            device=device,
-        )
-
-    def sample_time(self, bsize, device):
-        time_beta = sample_beta(1.5, 1.0, bsize, device)
-        time = time_beta * 0.999 + 0.001
-        return time.to(dtype=torch.float32, device=device)
-
-    def embed_prefix(
-        self, images, img_masks, lang_tokens, lang_masks
-    ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
-        """Embed images with SigLIP and language tokens with embedding layer to prepare
-        for PaliGemma transformer processing.
-        """
-        embs = []
-        pad_masks = []
-        att_masks = []
-
-        # Process images
-        for img, img_mask in zip(images, img_masks, strict=True):
-
-            def image_embed_func(img):
-                return self.paligemma_with_expert.embed_image(img)
-
-            img_emb = self._apply_checkpoint(image_embed_func, img)
-
-            bsize, num_img_embs = img_emb.shape[:2]
-
-            embs.append(img_emb)
-            pad_masks.append(img_mask[:, None].expand(bsize, num_img_embs))
-
-            # Create attention masks so that image tokens attend to each other
-            att_masks += [0] * num_img_embs
-
-        # Process language tokens
-        def lang_embed_func(lang_tokens):
-            lang_emb = self.paligemma_with_expert.embed_language_tokens(lang_tokens)
-            # Transformers > 5.4 scales Gemma token embeddings inside embed_tokens, matching
-            # OpenPI's former explicit sqrt(hidden_size) multiply without applying it twice.
-            # See https://github.com/huggingface/transformers/pull/44432/changes#diff-5f76eac6f18f4b491521314c318a9692318feb4d19228e9576cce7bde4240834
-            return lang_emb
-
-        lang_emb = self._apply_checkpoint(lang_embed_func, lang_tokens)
-
-        embs.append(lang_emb)
-        pad_masks.append(lang_masks)
-
-        # full attention between image and language inputs
-        num_lang_embs = lang_emb.shape[1]
-        att_masks += [0] * num_lang_embs
-
-        embs = torch.cat(embs, dim=1)
-        pad_masks = torch.cat(pad_masks, dim=1)
-        att_masks = torch.tensor(att_masks, dtype=torch.bool, device=pad_masks.device)
-
-        # Get batch size from the first dimension of the concatenated tensors
-        bsize = pad_masks.shape[0]
-        att_masks = att_masks[None, :].expand(bsize, len(att_masks))
-
-        return embs, pad_masks, att_masks
-
-    def embed_suffix(self, state, noisy_actions, timestep):
-        """Embed state, noisy_actions, timestep to prepare for Expert Gemma processing."""
-        embs = []
-        pad_masks = []
-        att_masks = []
-
-        if not self.pi05:
-            if self.state_proj.weight.dtype == torch.float32:
-                state = state.to(torch.float32)
-
-            # Embed state
-            def state_proj_func(state):
-                return self.state_proj(state)
-
-            state_emb = self._apply_checkpoint(state_proj_func, state)
-
-            embs.append(state_emb[:, None, :])
-            bsize = state_emb.shape[0]
-            device = state_emb.device
-
-            state_mask = torch.ones(bsize, 1, dtype=torch.bool, device=device)
-            pad_masks.append(state_mask)
-
-            # Set attention masks so that image and language inputs do not attend to state or actions
-            att_masks += [1]
-
-        # Embed timestep using sine-cosine positional encoding with sensitivity in the range [0, 1]
-        time_emb = create_sinusoidal_pos_embedding(
-            timestep,
-            self.action_in_proj.out_features,
-            min_period=4e-3,
-            max_period=4.0,
-            device=timestep.device,
-        )
-        time_emb = time_emb.type(dtype=timestep.dtype)
-
-        # Fuse timestep + action information using an MLP
-        def action_proj_func(noisy_actions):
-            return self.action_in_proj(noisy_actions)
-
-        action_emb = self._apply_checkpoint(action_proj_func, noisy_actions)
-
-        if not self.pi05:
-            time_emb = time_emb[:, None, :].expand_as(action_emb)
-            action_time_emb = torch.cat([action_emb, time_emb], dim=2)
-
-            # Apply MLP layers
-            def mlp_func(action_time_emb):
-                x = self.action_time_mlp_in(action_time_emb)
-                x = F.silu(x)  # swish == silu
-                return self.action_time_mlp_out(x)
-
-            action_time_emb = self._apply_checkpoint(mlp_func, action_time_emb)
-            adarms_cond = None
-        else:
-            # time MLP (for adaRMS)
-            def time_mlp_func(time_emb):
-                x = self.time_mlp_in(time_emb)
-                x = F.silu(x)  # swish == silu
-                x = self.time_mlp_out(x)
-                return F.silu(x)
-
-            time_emb = self._apply_checkpoint(time_mlp_func, time_emb)
-            action_time_emb = action_emb
-            adarms_cond = time_emb
-
-        # Add to input tokens
-        embs.append(action_time_emb)
-
-        bsize, action_time_dim = action_time_emb.shape[:2]
-        action_time_mask = torch.ones(bsize, action_time_dim, dtype=torch.bool, device=timestep.device)
-        pad_masks.append(action_time_mask)
-
-        # Set attention masks so that image, language and state inputs do not attend to action tokens
-        att_masks += [1] + ([0] * (self.config.action_horizon - 1))
-
-        embs = torch.cat(embs, dim=1)
-        pad_masks = torch.cat(pad_masks, dim=1)
-        att_masks = torch.tensor(att_masks, dtype=embs.dtype, device=embs.device)
-        att_masks = att_masks[None, :].expand(bsize, len(att_masks))
-
-        return embs, pad_masks, att_masks, adarms_cond
-
-    def forward(self, observation, actions, noise=None, time=None) -> Tensor:
-        """Do a full training forward pass and compute the loss (batch_size x num_steps x num_motors)"""
-        images, img_masks, lang_tokens, lang_masks, state = self._preprocess_observation(
-            observation, train=True
-        )
-
-        if noise is None:
-            noise = self.sample_noise(actions.shape, actions.device)
-
-        if time is None:
-            time = self.sample_time(actions.shape[0], actions.device)
-
-        time_expanded = time[:, None, None]
-        x_t = time_expanded * noise + (1 - time_expanded) * actions
-        u_t = noise - actions
-
-        prefix_embs, prefix_pad_masks, prefix_att_masks = self.embed_prefix(
-            images, img_masks, lang_tokens, lang_masks
-        )
-        suffix_embs, suffix_pad_masks, suffix_att_masks, adarms_cond = self.embed_suffix(state, x_t, time)
-        if (
-            self.paligemma_with_expert.paligemma.model.language_model.layers[0].self_attn.q_proj.weight.dtype
-            == torch.bfloat16
-        ):
-            suffix_embs = suffix_embs.to(dtype=torch.bfloat16)
-            prefix_embs = prefix_embs.to(dtype=torch.bfloat16)
-
-        pad_masks = torch.cat([prefix_pad_masks, suffix_pad_masks], dim=1)
-        att_masks = torch.cat([prefix_att_masks, suffix_att_masks], dim=1)
-
-        att_2d_masks = make_att_2d_masks(pad_masks, att_masks)
-        position_ids = torch.cumsum(pad_masks, dim=1) - 1
-
-        # Prepare attention masks
-        att_2d_masks_4d = self._prepare_attention_masks_4d(att_2d_masks)
-
-        # Apply gradient checkpointing if enabled
-        def forward_func(prefix_embs, suffix_embs, att_2d_masks_4d, position_ids, adarms_cond):
-            (_, suffix_out), _ = self.paligemma_with_expert.forward(
-                attention_mask=att_2d_masks_4d,
-                position_ids=position_ids,
-                past_key_values=None,
-                inputs_embeds=[prefix_embs, suffix_embs],
-                use_cache=False,
-                adarms_cond=[None, adarms_cond],
-            )
-            return suffix_out
-
-        suffix_out = self._apply_checkpoint(
-            forward_func, prefix_embs, suffix_embs, att_2d_masks_4d, position_ids, adarms_cond
-        )
-
-        suffix_out = suffix_out[:, -self.config.action_horizon :]
-        suffix_out = suffix_out.to(dtype=torch.float32)
-
-        # Apply gradient checkpointing to final action projection if enabled
-        def action_out_proj_func(suffix_out):
-            return self.action_out_proj(suffix_out)
-
-        v_t = self._apply_checkpoint(action_out_proj_func, suffix_out)
-
-        return F.mse_loss(u_t, v_t, reduction="none")
-
-    @torch.no_grad()
-    def sample_actions(self, device, observation, noise=None, num_steps=10) -> Tensor:
-        """Do a full inference forward and compute the action (batch_size x num_steps x num_motors)"""
-        bsize = observation.state.shape[0]
-        if noise is None:
-            actions_shape = (bsize, self.config.action_horizon, self.config.action_dim)
-            noise = self.sample_noise(actions_shape, device)
-
-        images, img_masks, lang_tokens, lang_masks, state = self._preprocess_observation(
-            observation, train=False
-        )
-
-        prefix_embs, prefix_pad_masks, prefix_att_masks = self.embed_prefix(
-            images, img_masks, lang_tokens, lang_masks
-        )
-        prefix_att_2d_masks = make_att_2d_masks(prefix_pad_masks, prefix_att_masks)
-        prefix_position_ids = torch.cumsum(prefix_pad_masks, dim=1) - 1
-
-        # Compute image and language key value cache
-        prefix_att_2d_masks_4d = self._prepare_attention_masks_4d(prefix_att_2d_masks)
-        self.paligemma_with_expert.paligemma.model.language_model.config._attn_implementation = "eager"  # noqa: SLF001
-
-        _, past_key_values = self.paligemma_with_expert.forward(
-            attention_mask=prefix_att_2d_masks_4d,
-            position_ids=prefix_position_ids,
-            past_key_values=None,
-            inputs_embeds=[prefix_embs, None],
-            use_cache=True,
-        )
-
-        dt = -1.0 / num_steps
-        dt = torch.tensor(dt, dtype=torch.float32, device=device)
-
-        x_t = noise
-        time = torch.tensor(1.0, dtype=torch.float32, device=device)
-        while time >= -dt / 2:
-            expanded_time = time.expand(bsize)
-            v_t = self.denoise_step(
-                state,
-                prefix_pad_masks,
-                past_key_values,
-                x_t,
-                expanded_time,
-            )
-
-            # Euler step - use new tensor assignment instead of in-place operation
-            x_t = x_t + dt * v_t
-            time += dt
-        return x_t
-
-    def denoise_step(
-        self,
-        state,
-        prefix_pad_masks,
-        past_key_values,
-        x_t,
-        timestep,
-    ):
-        """Apply one denoising step of the noise `x_t` at a given timestep."""
-        suffix_embs, suffix_pad_masks, suffix_att_masks, adarms_cond = self.embed_suffix(state, x_t, timestep)
-
-        suffix_len = suffix_pad_masks.shape[1]
-        batch_size = prefix_pad_masks.shape[0]
-        prefix_len = prefix_pad_masks.shape[1]
-
-        prefix_pad_2d_masks = prefix_pad_masks[:, None, :].expand(batch_size, suffix_len, prefix_len)
-
-        suffix_att_2d_masks = make_att_2d_masks(suffix_pad_masks, suffix_att_masks)
-
-        full_att_2d_masks = torch.cat([prefix_pad_2d_masks, suffix_att_2d_masks], dim=2)
-
-        prefix_offsets = torch.sum(prefix_pad_masks, dim=-1)[:, None]
-        position_ids = prefix_offsets + torch.cumsum(suffix_pad_masks, dim=1) - 1
-
-        # Prepare attention masks
-        full_att_2d_masks_4d = self._prepare_attention_masks_4d(full_att_2d_masks)
-        self.paligemma_with_expert.gemma_expert.model.config._attn_implementation = "eager"  # noqa: SLF001
-
-        past_key_values = copy.deepcopy(past_key_values)
-        outputs_embeds, _ = self.paligemma_with_expert.forward(
-            attention_mask=full_att_2d_masks_4d,
-            position_ids=position_ids,
-            past_key_values=past_key_values,
-            inputs_embeds=[None, suffix_embs],
-            use_cache=False,
-            adarms_cond=[None, adarms_cond],
-        )
-
-        suffix_out = outputs_embeds[1]
-        suffix_out = suffix_out[:, -self.config.action_horizon :]
-        suffix_out = suffix_out.to(dtype=torch.float32)
-        return self.action_out_proj(suffix_out)
--- a/tests/policies/pi0_pi05/openpi_pytorch/preprocessing_pytorch.py
+++ b/tests/policies/pi0_pi05/openpi_pytorch/preprocessing_pytorch.py
@@ -1,179 +0,0 @@
-import logging
-from collections.abc import Sequence
-
-import torch
-
-from tests.policies.pi0_pi05.openpi_pytorch import image_tools
-
-logger = logging.getLogger("openpi")
-
-# Constants moved from model.py
-IMAGE_KEYS = (
-    "base_0_rgb",
-    "left_wrist_0_rgb",
-    "right_wrist_0_rgb",
-)
-
-IMAGE_RESOLUTION = (224, 224)
-
-
-def preprocess_observation_pytorch(
-    observation,
-    *,
-    train: bool = False,
-    image_keys: Sequence[str] = IMAGE_KEYS,
-    image_resolution: tuple[int, int] = IMAGE_RESOLUTION,
-):
-    """Torch.compile-compatible version of preprocess_observation_pytorch with simplified type annotations.
-
-    This function avoids complex type annotations that can cause torch.compile issues.
-    """
-    if not set(image_keys).issubset(observation.images):
-        raise ValueError(f"images dict missing keys: expected {image_keys}, got {list(observation.images)}")
-
-    batch_shape = observation.state.shape[:-1]
-
-    out_images = {}
-    for key in image_keys:
-        image = observation.images[key]
-
-        # TODO: This is a hack to handle both [B, C, H, W] and [B, H, W, C] formats
-        # Handle both [B, C, H, W] and [B, H, W, C] formats
-        is_channels_first = image.shape[1] == 3  # Check if channels are in dimension 1
-
-        if is_channels_first:
-            # Convert [B, C, H, W] to [B, H, W, C] for processing
-            image = image.permute(0, 2, 3, 1)
-
-        if image.shape[1:3] != image_resolution:
-            logger.info(f"Resizing image {key} from {image.shape[1:3]} to {image_resolution}")
-            image = image_tools.resize_with_pad_torch(image, *image_resolution)
-
-        if train:
-            # Convert from [-1, 1] to [0, 1] for PyTorch augmentations
-            image = image / 2.0 + 0.5
-
-            # Apply PyTorch-based augmentations
-            if "wrist" not in key:
-                # Geometric augmentations for non-wrist cameras
-                height, width = image.shape[1:3]
-
-                # Random crop and resize
-                crop_height = int(height * 0.95)
-                crop_width = int(width * 0.95)
-
-                # Random crop
-                max_h = height - crop_height
-                max_w = width - crop_width
-                if max_h > 0 and max_w > 0:
-                    # Use tensor operations instead of .item() for torch.compile compatibility
-                    start_h = torch.randint(0, max_h + 1, (1,), device=image.device)
-                    start_w = torch.randint(0, max_w + 1, (1,), device=image.device)
-                    image = image[:, start_h : start_h + crop_height, start_w : start_w + crop_width, :]
-
-                # Resize back to original size
-                image = torch.nn.functional.interpolate(
-                    image.permute(0, 3, 1, 2),  # [b, h, w, c] -> [b, c, h, w]
-                    size=(height, width),
-                    mode="bilinear",
-                    align_corners=False,
-                ).permute(0, 2, 3, 1)  # [b, c, h, w] -> [b, h, w, c]
-
-                # Random rotation (small angles)
-                # Use tensor operations instead of .item() for torch.compile compatibility
-                angle = torch.rand(1, device=image.device) * 10 - 5  # Random angle between -5 and 5 degrees
-                if torch.abs(angle) > 0.1:  # Only rotate if angle is significant
-                    # Convert to radians
-                    angle_rad = angle * torch.pi / 180.0
-
-                    # Create rotation matrix
-                    cos_a = torch.cos(angle_rad)
-                    sin_a = torch.sin(angle_rad)
-
-                    # Apply rotation using grid_sample
-                    grid_x = torch.linspace(-1, 1, width, device=image.device)
-                    grid_y = torch.linspace(-1, 1, height, device=image.device)
-
-                    # Create meshgrid
-                    grid_y, grid_x = torch.meshgrid(grid_y, grid_x, indexing="ij")
-
-                    # Expand to batch dimension
-                    grid_x = grid_x.unsqueeze(0).expand(image.shape[0], -1, -1)
-                    grid_y = grid_y.unsqueeze(0).expand(image.shape[0], -1, -1)
-
-                    # Apply rotation transformation
-                    grid_x_rot = grid_x * cos_a - grid_y * sin_a
-                    grid_y_rot = grid_x * sin_a + grid_y * cos_a
-
-                    # Stack and reshape for grid_sample
-                    grid = torch.stack([grid_x_rot, grid_y_rot], dim=-1)
-
-                    image = torch.nn.functional.grid_sample(
-                        image.permute(0, 3, 1, 2),  # [b, h, w, c] -> [b, c, h, w]
-                        grid,
-                        mode="bilinear",
-                        padding_mode="zeros",
-                        align_corners=False,
-                    ).permute(0, 2, 3, 1)  # [b, c, h, w] -> [b, h, w, c]
-
-            # Color augmentations for all cameras
-            # Random brightness
-            # Use tensor operations instead of .item() for torch.compile compatibility
-            brightness_factor = (
-                0.7 + torch.rand(1, device=image.device) * 0.6
-            )  # Random factor between 0.7 and 1.3
-            image = image * brightness_factor
-
-            # Random contrast
-            # Use tensor operations instead of .item() for torch.compile compatibility
-            contrast_factor = (
-                0.6 + torch.rand(1, device=image.device) * 0.8
-            )  # Random factor between 0.6 and 1.4
-            mean = image.mean(dim=[1, 2, 3], keepdim=True)
-            image = (image - mean) * contrast_factor + mean
-
-            # Random saturation (convert to HSV, modify S, convert back)
-            # For simplicity, we'll just apply a random scaling to the color channels
-            # Use tensor operations instead of .item() for torch.compile compatibility
-            saturation_factor = (
-                0.5 + torch.rand(1, device=image.device) * 1.0
-            )  # Random factor between 0.5 and 1.5
-            gray = image.mean(dim=-1, keepdim=True)
-            image = gray + (image - gray) * saturation_factor
-
-            # Clamp values to [0, 1]
-            image = torch.clamp(image, 0, 1)
-
-            # Back to [-1, 1]
-            image = image * 2.0 - 1.0
-
-        # Convert back to [B, C, H, W] format if it was originally channels-first
-        if is_channels_first:
-            image = image.permute(0, 3, 1, 2)  # [B, H, W, C] -> [B, C, H, W]
-
-        out_images[key] = image
-
-    # obtain mask
-    out_masks = {}
-    for key in out_images:
-        if key not in observation.image_masks:
-            # do not mask by default
-            out_masks[key] = torch.ones(batch_shape, dtype=torch.bool, device=observation.state.device)
-        else:
-            out_masks[key] = observation.image_masks[key]
-
-    # Create a simple object with the required attributes instead of using the complex Observation class
-    class SimpleProcessedObservation:
-        def __init__(self, **kwargs):
-            for key, value in kwargs.items():
-                setattr(self, key, value)
-
-    return SimpleProcessedObservation(
-        images=out_images,
-        image_masks=out_masks,
-        state=observation.state,
-        tokenized_prompt=observation.tokenized_prompt,
-        tokenized_prompt_mask=observation.tokenized_prompt_mask,
-        token_ar_mask=observation.token_ar_mask,
-        token_loss_mask=observation.token_loss_mask,
-    )
--- a/tests/policies/pi0_pi05/test_pi05_compile.py
+++ b/tests/policies/pi0_pi05/test_pi05_compile.py
@@ -1,101 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import os
-
-import pytest
-import torch
-
-pytest.importorskip("transformers")
-
-from lerobot.policies.pi05 import PI05Config  # noqa: E402
-from lerobot.policies.pi05.modeling_pi05 import PI05Pytorch  # noqa: E402
-from tests.policies.pi0_pi05.utils.torch_compile import (  # noqa: E402
-    assert_cache_stability,
-    assert_compiled_output_matches_eager,
-    assert_explain_has_no_graph_breaks,
-    benchmark_runtime,
-    make_compile_config,
-    reset_compile_state,
-)
-from tests.utils import require_cuda  # noqa: E402
-
-pytestmark = pytest.mark.skipif(
-    os.environ.get("CI") == "true" or os.environ.get("GITHUB_ACTIONS") == "true",
-    reason="torch.compile benchmark is too slow for CI; run manually on GPU nodes",
-)
-
-
-def _make_model(*, compile_model):
-    return PI05Pytorch(make_compile_config(PI05Config, compile_model=compile_model)).cuda().eval()
-
-
-def _make_dummy_inputs(config):
-    device = torch.device("cuda")
-    common = {
-        "images": [torch.randn(1, 3, *config.image_resolution, device=device)],
-        "img_masks": [torch.ones(1, dtype=torch.bool, device=device)],
-        "tokens": torch.randint(0, 1024, (1, 5), dtype=torch.long, device=device),
-        "masks": torch.ones(1, 5, dtype=torch.bool, device=device),
-    }
-    forward_kwargs = {
-        **common,
-        "actions": torch.randn(1, config.chunk_size, config.max_action_dim, device=device),
-        "noise": torch.randn(1, config.chunk_size, config.max_action_dim, device=device),
-        "time": torch.rand(1, device=device),
-    }
-    sample_kwargs = {
-        **common,
-        "noise": torch.randn(1, config.chunk_size, config.max_action_dim, device=device),
-        "num_steps": config.num_inference_steps,
-    }
-    return forward_kwargs, sample_kwargs
-
-
-@require_cuda
-def test_pi05_torch_compile_forward_and_sample_actions():
-    if not hasattr(torch, "compile"):
-        pytest.skip("torch.compile is not available")
-    if not torch._dynamo.is_dynamo_supported():
-        pytest.skip("torch._dynamo is not supported on this platform")
-
-    torch.manual_seed(0)
-    eager_model = _make_model(compile_model=False)
-    torch.manual_seed(0)
-    compiled_model = _make_model(compile_model=True)
-    forward_kwargs, sample_kwargs = _make_dummy_inputs(compiled_model.config)
-
-    try:
-        assert_compiled_output_matches_eager(eager_model, compiled_model, forward_kwargs, sample_kwargs)
-
-        assert_explain_has_no_graph_breaks(eager_model.forward, forward_kwargs, "pi05.forward")
-        assert_explain_has_no_graph_breaks(eager_model.sample_actions, sample_kwargs, "pi05.sample_actions")
-
-        assert_cache_stability(compiled_model.forward, forward_kwargs, "pi05.forward")
-        assert_cache_stability(compiled_model.sample_actions, sample_kwargs, "pi05.sample_actions")
-
-        benchmark_runtime(eager_model.forward, compiled_model.forward, forward_kwargs, "pi05.forward")
-        benchmark_runtime(
-            eager_model.sample_actions,
-            compiled_model.sample_actions,
-            sample_kwargs,
-            "pi05.sample_actions",
-        )
-    finally:
-        reset_compile_state()
-        del eager_model
-        del compiled_model
-        torch.cuda.empty_cache()
--- a/tests/policies/pi0_pi05/test_pi05_original_vs_lerobot.py
+++ b/tests/policies/pi0_pi05/test_pi05_original_vs_lerobot.py
@@ -14,56 +14,52 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-"""Compare LeRobot PI0.5 against the vendored OpenPI PyTorch reference."""
+"""Test script to verify PI0OpenPI policy integration with LeRobot vs the original implementation"""

-import gc
 import os
+from copy import deepcopy
+from typing import Any

+import numpy as np
 import pytest
 import torch

+# Skip if openpi or transformers is not available
+pytest.importorskip("openpi")
 pytest.importorskip("transformers")

-from lerobot.configs import PreTrainedConfig  # noqa: E402
-from lerobot.policies.pi05 import PI05Policy  # noqa: E402
-from lerobot.policies.pi05.processor_pi05 import make_pi05_pre_post_processors  # noqa: E402
-from lerobot.utils.constants import ACTION, OBS_STATE  # noqa: E402
-from tests.policies.pi0_pi05.openpi_pytorch.pi0_pytorch import PI0Pytorch  # noqa: E402
-from tests.policies.pi0_pi05.utils.openpi_parity import (  # noqa: E402
-    assert_processor_inputs_match_lerobot,
-    clone_batch,
-    deterministic_openpi_forward_preprocess,
-    fix_reference_state_dict,
-    fixed_flow_sampling,
-    load_openpi_reference_state_dict,
-    make_openpi_observation_from_raw,
-    openpi_model_actions_from_raw,
-)
-
+# Skip this entire module in CI
 pytestmark = pytest.mark.skipif(
    os.environ.get("CI") == "true" or os.environ.get("GITHUB_ACTIONS") == "true",
-    reason="OpenPI parity and torch.compile checks are too slow for CI; run manually on GPU nodes",
+    reason="This test requires local OpenPI installation and is not meant for CI",
 )

+from openpi.models_pytorch import preprocessing_pytorch as openpi_preprocessing  # noqa: E402
+
+# NOTE: Assumes PYTHONPATH is set to include OpenPI src as per instructions.
+from openpi.models_pytorch.pi0_pytorch import PI0Pytorch  # noqa: E402
+from transformers import AutoTokenizer  # noqa: E402
+
+from lerobot.policies.pi05 import PI05Config, PI05Policy  # noqa: E402
+from lerobot.policies.pi05.processor_pi05 import make_pi05_pre_post_processors  # noqa: E402
+from lerobot.processor import PolicyProcessorPipeline  # noqa: E402
+from lerobot.types import PolicyAction  # noqa: E402
+
+# TODO: ADDING DEFAULT IMAGES_FEATURES TO CONFIG
 DUMMY_ACTION_DIM = 32
 DUMMY_STATE_DIM = 32
 DUMMY_ACTION_HORIZON = 50
 DUMMY_MAX_TOKEN_LEN = 200
-DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-COMPILE_MODE = "default"
-FORWARD_RTOL = 1e-4
-FORWARD_ATOL = 1e-4
-SAMPLE_RTOL = 1e-2
-SAMPLE_ATOL = 5e-3
+DEVICE = "cpu"  # Use CPU to avoid memory issues for testing

 DUMMY_DATASET_STATS = {
-    OBS_STATE: {
+    "observation.state": {
        "mean": torch.zeros(DUMMY_STATE_DIM),
        "std": torch.ones(DUMMY_STATE_DIM),
        "q01": torch.zeros(DUMMY_STATE_DIM),
        "q99": torch.ones(DUMMY_STATE_DIM),
    },
-    ACTION: {
+    "action": {
        "mean": torch.zeros(DUMMY_ACTION_DIM),
        "std": torch.ones(DUMMY_ACTION_DIM),
        "q01": torch.zeros(DUMMY_ACTION_DIM),
@@ -92,15 +88,6 @@ DUMMY_DATASET_STATS = {
 }


-@pytest.fixture(autouse=True)
-def cleanup_cuda_after_test():
-    yield
-    gc.collect()
-    if torch.cuda.is_available():
-        torch.cuda.empty_cache()
-        torch.cuda.ipc_collect()
-
-
 class PI05BaseOriginalConfig:
    action_dim: int = DUMMY_ACTION_DIM
    action_horizon: int = DUMMY_ACTION_HORIZON
@@ -109,163 +96,341 @@ class PI05BaseOriginalConfig:
    precision: str = "float32"
    pi05: bool = True
    dtype: str = "float32"
-    pytorch_compile_mode: str | None = None


-def instantiate_lerobot_pi05(*, compile_model: bool = False, gradient_checkpointing: bool = False):
-    config = PreTrainedConfig.from_pretrained("lerobot/pi05_base")
-    config.device = str(DEVICE)
-    config.dtype = "float32"
-    config.compile_model = compile_model
-    config.compile_mode = COMPILE_MODE
-    config.gradient_checkpointing = gradient_checkpointing
+def instantiate_lerobot_pi05(
+    from_pretrained: bool = False,
+) -> tuple[
+    PI05Policy,
+    PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
+    PolicyProcessorPipeline[PolicyAction, PolicyAction],
+]:
+    if from_pretrained:
+        # Load the policy first
+        policy = PI05Policy.from_pretrained(pretrained_name_or_path="lerobot/pi05_base", strict=True)
+    else:
+        config = PI05Config(max_action_dim=DUMMY_ACTION_DIM, max_state_dim=DUMMY_STATE_DIM, dtype="float32")
+        policy = PI05Policy(config)

-    policy = PI05Policy.from_pretrained("lerobot/pi05_base", config=config, strict=True)
    policy.to(DEVICE)
-    policy.config.device = str(DEVICE)
-    preprocessor, _ = make_pi05_pre_post_processors(config=policy.config, dataset_stats=DUMMY_DATASET_STATS)
-    return policy, preprocessor
+    policy.config.device = DEVICE
+    preprocessor, postprocessor = make_pi05_pre_post_processors(
+        config=policy.config, dataset_stats=DUMMY_DATASET_STATS
+    )
+    return (policy, preprocessor, postprocessor)


-def instantiate_original_pi05():
-    policy = PI0Pytorch(PI05BaseOriginalConfig()).to(DEVICE)
+def instantiate_original_pi05(from_pretrained: bool = False, model_path: str | None = None):
+    config = PI05BaseOriginalConfig()
+    policy = PI0Pytorch(config)

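-    # NOTE: As with PI0, the LeRobot loader for `lerobot/pi05_base` performs key-compatibility
-    # conversion before the strict load, so no missing_keys or unexpected_keys are expected. The vendored
-    # reference is a bare `nn.Module`, so the test must bridge the minimal checkpoint/module naming gaps.
-    # NOTE: `lm_head.weight` is the saved name of PaliGemma's tied embedding; LeRobot's
-    # from_pretrained maps it to the internal `embed_tokens.weight`, whereas the reference model has no
-    # such loader, so we manually reuse the same tensor to avoid mistaking an alias difference for a model difference.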
-    state_dict = fix_reference_state_dict(load_openpi_reference_state_dict("lerobot/pi05_base"))
-    missing_keys, unexpected_keys = policy.load_state_dict(state_dict, strict=False)
-    assert missing_keys == []
-    assert unexpected_keys == []
+    if from_pretrained:
+        try:
+            print("Loading converted PyTorch weights from HuggingFace Hub (lerobot/pi05_base)...")
+
+            # Download the model from HuggingFace Hub
+            import safetensors.torch
+            from huggingface_hub import snapshot_download
+
+            # Download the entire repository
+            if model_path and os.path.exists(model_path):
+                cache_dir = model_path
+                print(f"Using cached model from: {cache_dir}")
+            else:
+                cache_dir = snapshot_download(repo_id="lerobot/pi05_base", repo_type="model")
+                print(f"Downloaded model to: {cache_dir}")
+
+            # Load weights from the safetensors file (no other format is attempted)
+            model_file = os.path.join(cache_dir, "model.safetensors")
+            if os.path.exists(model_file):
+                state_dict = safetensors.torch.load_file(model_file)
+                print(f"Loaded {len(state_dict)} parameters from safetensors")
+            else:
+                raise FileNotFoundError(f"No safetensors file found in {cache_dir}")
+
+            # Load the state dict into the model
+            missing_keys, unexpected_keys = policy.load_state_dict(state_dict, strict=False)
+
+            if missing_keys:
+                print(f"Missing keys: {len(missing_keys)}")
+                if len(missing_keys) <= 5:
+                    for key in missing_keys:
+                        print(f"    - {key}")
+                else:
+                    for key in missing_keys[:5]:
+                        print(f"    - {key}")
+                    print(f"    ... and {len(missing_keys) - 5} more")
+
+            if unexpected_keys:
+                print(f"Unexpected keys: {len(unexpected_keys)}")
+                if len(unexpected_keys) <= 5:
+                    for key in unexpected_keys:
+                        print(f"    - {key}")
+                else:
+                    for key in unexpected_keys[:5]:
+                        print(f"    - {key}")
+                    print(f"    ... and {len(unexpected_keys) - 5} more")
+
+            if not missing_keys and not unexpected_keys:
+                print("All pretrained weights loaded successfully!")
+            else:
+                print("Pretrained weights loaded with some missing/unexpected keys (this may be normal)")
+
+        except Exception as e:
+            print(f"Failed to load pretrained weights: {e}")
+            print("   Using randomly initialized weights...")
+            import traceback
+
+            traceback.print_exc()
+
+    policy.to(DEVICE)
    return policy


 def create_dummy_data():
-    batch_size = 2
+    batch_size = 2  # Reduce batch size for testing
+    device = DEVICE
+
+    # Use the exact same prompt for both implementations
    prompt = "Pick up the red block and place it in the bin"
-    return {
-        OBS_STATE: torch.randn(batch_size, DUMMY_STATE_DIM, dtype=torch.float32, device=DEVICE),
-        ACTION: torch.randn(
-            batch_size, DUMMY_ACTION_HORIZON, DUMMY_ACTION_DIM, dtype=torch.float32, device=DEVICE
+
+    batch = {
+        "observation.state": torch.randn(batch_size, DUMMY_STATE_DIM, dtype=torch.float32, device=device),
+        "action": torch.randn(
+            batch_size, DUMMY_ACTION_HORIZON, DUMMY_ACTION_DIM, dtype=torch.float32, device=device
        ),
+        # Create images in [0, 1] range as expected by LeRobot (will be converted to [-1, 1] internally)
        "observation.images.base_0_rgb": torch.rand(
-            batch_size, 3, 224, 224, dtype=torch.float32, device=DEVICE
+            batch_size, 3, 224, 224, dtype=torch.float32, device=device
        ),
        "observation.images.left_wrist_0_rgb": torch.rand(
-            batch_size, 3, 224, 224, dtype=torch.float32, device=DEVICE
+            batch_size, 3, 224, 224, dtype=torch.float32, device=device
        ),
        "observation.images.right_wrist_0_rgb": torch.rand(
-            batch_size, 3, 224, 224, dtype=torch.float32, device=DEVICE
+            batch_size, 3, 224, 224, dtype=torch.float32, device=device
        ),
+        # Add the task prompt for LeRobot as a list with one entry per batch element
        "task": [prompt for _ in range(batch_size)],
    }
+    return batch


-def prepare_parity_inputs(lerobot_pi05, lerobot_preprocessor):
-    torch.manual_seed(0)
-    raw_batch = create_dummy_data()
-    lerobot_batch = lerobot_preprocessor(clone_batch(raw_batch))
-    openpi_observation = make_openpi_observation_from_raw(
-        raw_batch,
-        action_dim=DUMMY_ACTION_DIM,
-        max_token_len=DUMMY_MAX_TOKEN_LEN,
-        dataset_stats=DUMMY_DATASET_STATS,
-        pi05=True,
-    )
-    openpi_actions = openpi_model_actions_from_raw(
-        raw_batch,
-        action_dim=DUMMY_ACTION_DIM,
-        dataset_stats=DUMMY_DATASET_STATS,
-        pi05=True,
-    )
-    assert_processor_inputs_match_lerobot(
-        lerobot_pi05,
-        lerobot_batch,
-        openpi_observation,
-        compare_state=False,
-    )
-    batch_size = raw_batch[OBS_STATE].shape[0]
-    noise = torch.randn(
-        batch_size,
-        DUMMY_ACTION_HORIZON,
-        DUMMY_ACTION_DIM,
-        dtype=torch.float32,
-        device=DEVICE,
-    )
-    time = torch.linspace(0.2, 0.8, batch_size, dtype=torch.float32, device=DEVICE)
-    return lerobot_batch, openpi_observation, openpi_actions, noise, time
+def extract_lerobot_processed_inputs(lerobot_pi0, batch):
+    """Extract the exact same processed inputs that LeRobot uses internally."""
+    # Get the tokenized language from LeRobot's internal method
+    lang_tokens, lang_masks = lerobot_pi0._tokenize_language(batch)
+
+    # Get the preprocessed images from LeRobot's internal method
+    images, img_masks = lerobot_pi0._preprocess_images(batch, train=False)
+
+    # Create dummy token_ar_mask and token_loss_mask for original implementation
+    token_ar_mask = torch.zeros_like(lang_tokens, dtype=torch.int32)
+    token_loss_mask = torch.ones_like(lang_masks, dtype=torch.bool)
+
+    return images, img_masks, lang_tokens, lang_masks, token_ar_mask, token_loss_mask


-def assert_forward_matches(*, compile_model: bool = False, gradient_checkpointing: bool = False):
-    lerobot_pi05, lerobot_preprocessor = instantiate_lerobot_pi05(
-        compile_model=compile_model,
-        gradient_checkpointing=gradient_checkpointing,
-    )
-    original_pi05 = instantiate_original_pi05()
-    lerobot_batch, openpi_observation, openpi_actions, noise, time = prepare_parity_inputs(
-        lerobot_pi05,
-        lerobot_preprocessor,
+class PI05Observation:
+    """Observation class that matches the original OpenPI format."""
+
+    def __init__(
+        self,
+        state,
+        images,
+        image_masks,
+        tokenized_prompt,
+        tokenized_prompt_mask,
+        token_ar_mask,
+        token_loss_mask,
+    ):
+        self.state = state
+        self.images = images
+        self.image_masks = image_masks
+        self.tokenized_prompt = tokenized_prompt
+        self.tokenized_prompt_mask = tokenized_prompt_mask
+        self.token_ar_mask = token_ar_mask
+        self.token_loss_mask = token_loss_mask
+
+
+def create_original_observation_with_openpi_preprocessing(batch):
+    """Create observation object for OpenPI using OpenPI's own preprocessing with pi05 state tokenizer."""
+    batch_size = batch["observation.state"].shape[0]
+    device = batch["observation.state"].device
+
+    # Create tokenizer for OpenPI (same as LeRobot uses)
+    tokenizer = AutoTokenizer.from_pretrained("google/paligemma-3b-pt-224")
+
+    # Get task description (pi05 processor handles all text formatting)
+    tasks = batch.get("task", ["Pick up the object"] * batch_size)
+    if isinstance(tasks, str):
+        tasks = [tasks] * batch_size
+    elif len(tasks) == 1:
+        tasks = tasks * batch_size
+
+    # Use pi05 state and input tokenizer logic (same as Pi05PrepareStateTokenizerProcessorStep)
+    state = batch["observation.state"]
+    state = deepcopy(state)
+
+    # Prepare state (pad to max_state_dim)
+    from lerobot.policies.pi05.modeling_pi05 import pad_vector
+
+    state = pad_vector(state, DUMMY_STATE_DIM)
+
+    # State is assumed to be already normalized to [-1, 1] (no normalization is applied here)
+    # Discretize into 256 bins (see openpi `PaligemmaTokenizer.tokenize()`)
+    state_np = state.cpu().numpy()
+    discretized_states = np.digitize(state_np, bins=np.linspace(-1, 1, 256 + 1)[:-1]) - 1
+
+    # Create pi05-formatted prompts that include state information
+    full_prompts = []
+    for i, task in enumerate(tasks):
+        cleaned_text = task.strip().replace("_", " ").replace("\n", " ")
+        state_str = " ".join(map(str, discretized_states[i]))
+        full_prompt = f"Task: {cleaned_text}, State: {state_str};\nAction: "
+        full_prompts.append(full_prompt)
+
+    # Tokenize with max_length padding to match OpenPI's expected format
+    tokenized = tokenizer(
+        full_prompts,
+        padding="max_length",
+        padding_side="right",
+        truncation=True,
+        max_length=DUMMY_MAX_TOKEN_LEN,
+        return_tensors="pt",
    )

-    if gradient_checkpointing:
-        lerobot_pi05.train()
-    else:
-        lerobot_pi05.eval()
-    original_pi05.eval()
+    lang_tokens = tokenized["input_ids"].to(device)
+    lang_masks = tokenized["attention_mask"].to(device, dtype=torch.bool)

-    with fixed_flow_sampling(lerobot_pi05.model, noise=noise, time=time):
-        lerobot_loss, _ = lerobot_pi05(lerobot_batch, reduction="none")
-    with deterministic_openpi_forward_preprocess(original_pi05):
-        openpi_losses = original_pi05(openpi_observation, openpi_actions, noise=noise, time=time)
-    openpi_loss = openpi_losses.mean(dim=(1, 2))
+    # Create dummy token_ar_mask and token_loss_mask for OpenPI
+    token_ar_mask = torch.zeros_like(lang_tokens, dtype=torch.int32)
+    token_loss_mask = torch.ones_like(lang_masks, dtype=torch.bool)

-    torch.testing.assert_close(lerobot_loss, openpi_loss, rtol=FORWARD_RTOL, atol=FORWARD_ATOL)
+    # Convert LeRobot images format to OpenPI format (convert [0,1] to [-1,1] range)
+    image_dict = {
+        "base_0_rgb": batch["observation.images.base_0_rgb"] * 2.0 - 1.0,
+        "left_wrist_0_rgb": batch["observation.images.left_wrist_0_rgb"] * 2.0 - 1.0,
+        "right_wrist_0_rgb": batch["observation.images.right_wrist_0_rgb"] * 2.0 - 1.0,
+    }

+    # Create image masks (all ones for real images)
+    image_masks_dict = {}
+    for key in image_dict:
+        image_masks_dict[key] = torch.ones(batch_size, dtype=torch.bool, device=device)

-def assert_sample_actions_match_openpi(*, compile_model: bool = False):
-    lerobot_pi05, lerobot_preprocessor = instantiate_lerobot_pi05(compile_model=compile_model)
-    original_pi05 = instantiate_original_pi05()
-    lerobot_batch, openpi_observation, _openpi_actions, noise, _time = prepare_parity_inputs(
-        lerobot_pi05,
-        lerobot_preprocessor,
+    # Create raw observation object (before preprocessing)
+    raw_observation = PI05Observation(
+        state=batch["observation.state"],
+        images=image_dict,
+        image_masks=image_masks_dict,
+        tokenized_prompt=lang_tokens,
+        tokenized_prompt_mask=lang_masks,
+        token_ar_mask=token_ar_mask,
+        token_loss_mask=token_loss_mask,
    )

-    lerobot_pi05.eval()
-    original_pi05.eval()
+    # Now use OpenPI's preprocessing
+    processed_obs = openpi_preprocessing.preprocess_observation_pytorch(raw_observation, train=False)
+
+    return processed_obs
+
+
+def create_original_observation_from_lerobot(lerobot_pi0, batch):
+    """Create observation object compatible with original OpenPI using the exact same inputs as LeRobot."""
+    _batch_size = batch["observation.state"].shape[0]
+    _device = batch["observation.state"].device
+
+    # Extract the exact same processed inputs that LeRobot uses
+    images, img_masks, lang_tokens, lang_masks, token_ar_mask, token_loss_mask = (
+        extract_lerobot_processed_inputs(lerobot_pi0, batch)
+    )
+
+    # Convert images list to dict with original OpenPI keys
+    image_dict = {
+        "base_0_rgb": images[0],
+        "left_wrist_0_rgb": images[1],
+        "right_wrist_0_rgb": images[2],
+    }
+
+    # Convert image masks list to dict with original OpenPI keys
+    image_masks_dict = {
+        "base_0_rgb": img_masks[0],
+        "left_wrist_0_rgb": img_masks[1],
+        "right_wrist_0_rgb": img_masks[2],
+    }
+
+    return PI05Observation(
+        state=batch["observation.state"],
+        images=image_dict,
+        image_masks=image_masks_dict,
+        tokenized_prompt=lang_tokens,
+        tokenized_prompt_mask=lang_masks,
+        token_ar_mask=token_ar_mask,
+        token_loss_mask=token_loss_mask,
+    )
+
+
+def test_pi05_original_vs_lerobot():
+    """Test PI05 original implementation vs LeRobot implementation."""
+    print("Initializing models...")
+    lerobot_pi05, lerobot_preprocessor, lerobot_postprocessor = instantiate_lerobot_pi05(
+        from_pretrained=True
+    )  # Load pretrained LeRobot model
+    original_pi0 = instantiate_original_pi05(
+        from_pretrained=True
+    )  # Load pretrained OpenPI model from HuggingFace Hub
+
+    print("Creating dummy data...")
+    batch = create_dummy_data()
+    batch_lerobot = deepcopy(batch)
+
+    # Test each model with its own preprocessing (more realistic end-to-end test)
+    print("\nTest each model with its own preprocessing")
+    print("Creating observation for OpenPI using OpenPI's own preprocessing...")
+    pi0_obs_openpi = create_original_observation_with_openpi_preprocessing(batch)
+
+    print(f"Task prompt: '{batch['task'][0]}'")
+    print(f"OpenPI tokenized prompt shape: {pi0_obs_openpi.tokenized_prompt.shape}")
+    print(f"OpenPI image shapes: {[img.shape for img in pi0_obs_openpi.images.values()]}")
+    print(f"OpenPI state shape: {pi0_obs_openpi.state.shape}")
+
+    print("Testing OpenPI with own preprocessing...")
+    original_pi0.eval()
+    torch.manual_seed(42)  # Set seed for reproducibility
+    batch_size = batch["observation.state"].shape[0]
+    noise_shape = (batch_size, DUMMY_ACTION_HORIZON, DUMMY_ACTION_DIM)
+    fixed_noise = torch.randn(noise_shape, dtype=torch.float32, device=DEVICE)
+
    with torch.no_grad():
-        lerobot_actions = lerobot_pi05.predict_action_chunk(lerobot_batch, noise=noise, num_steps=10)
-        openpi_actions = original_pi05.sample_actions(
-            device=DEVICE,
-            observation=openpi_observation,
-            noise=noise,
-            num_steps=10,
+        openpi_actions = original_pi0.sample_actions(
+            device=DEVICE, observation=pi0_obs_openpi, noise=fixed_noise, num_steps=10
        )
+        openpi_actions_unit = openpi_actions[:, 0, :]
+    print(f"OpenPI (own preprocessing) Actions shape: {openpi_actions.shape}")
+    print(f"OpenPI (own preprocessing) Actions unit shape: {openpi_actions_unit.shape}")
+    print(f"OpenPI (own preprocessing) Actions mean: {openpi_actions.mean().item():.6f}")
+    print(f"OpenPI (own preprocessing) Actions std: {openpi_actions.std().item():.6f}")

-    torch.testing.assert_close(lerobot_actions, openpi_actions, rtol=SAMPLE_RTOL, atol=SAMPLE_ATOL)
+    print("Testing LeRobot with own preprocessing...")
+    lerobot_pi05.eval()
+    torch.manual_seed(42)  # Set the same seed

+    batch_lerobot_processed = lerobot_preprocessor(batch_lerobot)
+    with torch.no_grad():
+        lerobot_actions_own = lerobot_pi05.predict_action_chunk(
+            batch_lerobot_processed
+        )  # batch_size, n_action_steps, action_dim
+        lerobot_actions_unit = lerobot_actions_own[:, 0, :]
+    print(f"LeRobot (own preprocessing) Actions shape: {lerobot_actions_own.shape}")
+    print(f"LeRobot (own preprocessing) Actions unit shape: {lerobot_actions_unit.shape}")
+    print(f"LeRobot (own preprocessing) Actions mean: {lerobot_actions_own.mean().item():.6f}")
+    print(f"LeRobot (own preprocessing) Actions std: {lerobot_actions_own.std().item():.6f}")

-def test_pi05_forward_matches_openpi():
-    assert_forward_matches()
+    print("\nComparing end-to-end implementations:")
+    print(f"Actions close (atol=1e-4): {torch.allclose(lerobot_actions_own, openpi_actions, atol=1e-4)}")
+    print(f"Actions close (atol=1e-2): {torch.allclose(lerobot_actions_own, openpi_actions, atol=1e-2)}")
+    print(f"Max absolute difference: {torch.abs(lerobot_actions_own - openpi_actions).max().item():.6f}")

-
-def test_pi05_sample_actions_match_openpi():
-    assert_sample_actions_match_openpi()
-
-
-def test_pi05_gradient_checkpointing_forward_matches_openpi():
-    assert_forward_matches(gradient_checkpointing=True)
-
-
-def test_pi05_compile_forward_matches_openpi():
-    assert_forward_matches(compile_model=True)
-
-
-def test_pi05_compile_sample_actions_match_openpi():
-    assert_sample_actions_match_openpi(compile_model=True)
-
-
-def test_pi05_compile_gradient_checkpointing_forward_matches_openpi():
-    assert_forward_matches(compile_model=True, gradient_checkpointing=True)
+    assert torch.allclose(lerobot_actions_own, openpi_actions, atol=1e-4)
+    assert torch.allclose(lerobot_actions_own, openpi_actions, atol=1e-2)
+    assert torch.abs(lerobot_actions_own - openpi_actions).max().item() < 1e-4
--- a/tests/policies/pi0_pi05/test_pi0_compile.py
+++ b/tests/policies/pi0_pi05/test_pi0_compile.py
@@ -1,99 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import os
-
-import pytest
-import torch
-
-pytest.importorskip("transformers")
-
-from lerobot.policies.pi0 import PI0Config  # noqa: E402
-from lerobot.policies.pi0.modeling_pi0 import PI0Pytorch  # noqa: E402
-from tests.policies.pi0_pi05.utils.torch_compile import (  # noqa: E402
-    assert_cache_stability,
-    assert_compiled_output_matches_eager,
-    assert_explain_has_no_graph_breaks,
-    benchmark_runtime,
-    make_compile_config,
-    reset_compile_state,
-)
-from tests.utils import require_cuda  # noqa: E402
-
-pytestmark = pytest.mark.skipif(
-    os.environ.get("CI") == "true" or os.environ.get("GITHUB_ACTIONS") == "true",
-    reason="torch.compile benchmark is too slow for CI; run manually on GPU nodes",
-)
-
-
-def _make_model(*, compile_model):
-    return PI0Pytorch(make_compile_config(PI0Config, compile_model=compile_model)).cuda().eval()
-
-
-def _make_dummy_inputs(config):
-    device = torch.device("cuda")
-    common = {
-        "images": [torch.randn(1, 3, *config.image_resolution, device=device)],
-        "img_masks": [torch.ones(1, dtype=torch.bool, device=device)],
-        "lang_tokens": torch.randint(0, 1024, (1, 5), dtype=torch.long, device=device),
-        "lang_masks": torch.ones(1, 5, dtype=torch.bool, device=device),
-        "state": torch.randn(1, config.max_state_dim, device=device),
-    }
-    forward_kwargs = {
-        **common,
-        "actions": torch.randn(1, config.chunk_size, config.max_action_dim, device=device),
-        "noise": torch.randn(1, config.chunk_size, config.max_action_dim, device=device),
-        "time": torch.rand(1, device=device),
-    }
-    sample_kwargs = {
-        **common,
-        "noise": torch.randn(1, config.chunk_size, config.max_action_dim, device=device),
-        "num_steps": config.num_inference_steps,
-    }
-    return forward_kwargs, sample_kwargs
-
-
-@require_cuda
-def test_pi0_torch_compile_forward_and_sample_actions():
-    if not hasattr(torch, "compile"):
-        pytest.skip("torch.compile is not available")
-    if not torch._dynamo.is_dynamo_supported():
-        pytest.skip("torch._dynamo is not supported on this platform")
-
-    torch.manual_seed(0)
-    eager_model = _make_model(compile_model=False)
-    torch.manual_seed(0)
-    compiled_model = _make_model(compile_model=True)
-    forward_kwargs, sample_kwargs = _make_dummy_inputs(compiled_model.config)
-
-    try:
-        assert_compiled_output_matches_eager(eager_model, compiled_model, forward_kwargs, sample_kwargs)
-
-        assert_explain_has_no_graph_breaks(eager_model.forward, forward_kwargs, "pi0.forward")
-        assert_explain_has_no_graph_breaks(eager_model.sample_actions, sample_kwargs, "pi0.sample_actions")
-
-        assert_cache_stability(compiled_model.forward, forward_kwargs, "pi0.forward")
-        assert_cache_stability(compiled_model.sample_actions, sample_kwargs, "pi0.sample_actions")
-
-        benchmark_runtime(eager_model.forward, compiled_model.forward, forward_kwargs, "pi0.forward")
-        benchmark_runtime(
-            eager_model.sample_actions, compiled_model.sample_actions, sample_kwargs, "pi0.sample_actions"
-        )
-    finally:
-        reset_compile_state()
-        del eager_model
-        del compiled_model
-        torch.cuda.empty_cache()
--- a/tests/policies/pi0_pi05/test_pi0_original_vs_lerobot.py
+++ b/tests/policies/pi0_pi05/test_pi0_original_vs_lerobot.py
@@ -14,56 +14,51 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-"""Compare LeRobot PI0 against the vendored OpenPI PyTorch reference."""
+"""Test script to verify PI0 policy integration with LeRobot vs the original implementation"""

-import gc
 import os
+from copy import deepcopy
+from typing import Any

 import pytest
 import torch

+# Skip if openpi or transformers is not available
+pytest.importorskip("openpi")
 pytest.importorskip("transformers")

-from lerobot.configs import PreTrainedConfig  # noqa: E402
-from lerobot.policies.pi0 import PI0Policy  # noqa: E402
-from lerobot.policies.pi0.processor_pi0 import make_pi0_pre_post_processors  # noqa: E402
-from lerobot.utils.constants import ACTION, OBS_STATE  # noqa: E402
-from tests.policies.pi0_pi05.openpi_pytorch.pi0_pytorch import PI0Pytorch  # noqa: E402
-from tests.policies.pi0_pi05.utils.openpi_parity import (  # noqa: E402
-    assert_processor_inputs_match_lerobot,
-    clone_batch,
-    deterministic_openpi_forward_preprocess,
-    fix_reference_state_dict,
-    fixed_flow_sampling,
-    load_openpi_reference_state_dict,
-    make_openpi_observation_from_raw,
-    openpi_model_actions_from_raw,
-)
-
+# Skip this entire module in CI
 pytestmark = pytest.mark.skipif(
    os.environ.get("CI") == "true" or os.environ.get("GITHUB_ACTIONS") == "true",
-    reason="OpenPI parity and torch.compile checks are too slow for CI; run manually on GPU nodes",
+    reason="This test requires local OpenPI installation and is not meant for CI",
 )

+from openpi.models_pytorch import preprocessing_pytorch as openpi_preprocessing  # noqa: E402
+
+# NOTE: Assumes PYTHONPATH is set to include OpenPI src as per instructions.
+from openpi.models_pytorch.pi0_pytorch import PI0Pytorch  # noqa: E402
+from transformers import AutoTokenizer  # noqa: E402
+
+from lerobot.policies.pi0 import PI0Config, PI0Policy  # noqa: E402
+from lerobot.policies.pi0.processor_pi0 import make_pi0_pre_post_processors  # noqa: E402
+from lerobot.processor import PolicyProcessorPipeline  # noqa: E402
+from lerobot.types import PolicyAction  # noqa: E402
+
+# TODO: add default image features to the config
 DUMMY_ACTION_DIM = 32
 DUMMY_STATE_DIM = 32
 DUMMY_ACTION_HORIZON = 50
-DUMMY_MAX_TOKEN_LEN = 48
-DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-COMPILE_MODE = "default"
-FORWARD_RTOL = 1e-4
-FORWARD_ATOL = 1e-4
-SAMPLE_RTOL = 1e-2
-SAMPLE_ATOL = 5e-3
+DUMMY_MAX_TOKEN_LEN = 48  # Default for PI0 (non-pi05)
+DEVICE = "cpu"  # Use CPU to avoid memory issues for testing

 DUMMY_DATASET_STATS = {
-    OBS_STATE: {
+    "observation.state": {
        "mean": torch.zeros(DUMMY_STATE_DIM),
        "std": torch.ones(DUMMY_STATE_DIM),
        "q01": torch.zeros(DUMMY_STATE_DIM),
        "q99": torch.ones(DUMMY_STATE_DIM),
    },
-    ACTION: {
+    "action": {
        "mean": torch.zeros(DUMMY_ACTION_DIM),
        "std": torch.ones(DUMMY_ACTION_DIM),
        "q01": torch.zeros(DUMMY_ACTION_DIM),
@@ -92,15 +87,6 @@ DUMMY_DATASET_STATS = {
 }


-@pytest.fixture(autouse=True)
-def cleanup_cuda_after_test():
-    yield
-    gc.collect()
-    if torch.cuda.is_available():
-        torch.cuda.empty_cache()
-        torch.cuda.ipc_collect()
-
-
 class PI0BaseOriginalConfig:
    action_dim: int = DUMMY_ACTION_DIM
    action_horizon: int = DUMMY_ACTION_HORIZON
@@ -109,156 +95,333 @@ class PI0BaseOriginalConfig:
    precision: str = "float32"
    pi05: bool = False
    dtype: str = "float32"
-    pytorch_compile_mode: str | None = None


-def instantiate_lerobot_pi0(*, compile_model: bool = False, gradient_checkpointing: bool = False):
-    config = PreTrainedConfig.from_pretrained("lerobot/pi0_base")
-    config.device = str(DEVICE)
-    config.dtype = "float32"
-    config.compile_model = compile_model
-    config.compile_mode = COMPILE_MODE
-    config.gradient_checkpointing = gradient_checkpointing
+def instantiate_lerobot_pi0(
+    from_pretrained: bool = False,
+) -> tuple[
+    PI0Policy,
+    PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
+    PolicyProcessorPipeline[PolicyAction, PolicyAction],
+]:
+    if from_pretrained:
+        # Load the policy first
+        policy = PI0Policy.from_pretrained(pretrained_name_or_path="lerobot/pi0_base", strict=True)
+    else:
+        config = PI0Config(max_action_dim=DUMMY_ACTION_DIM, max_state_dim=DUMMY_STATE_DIM, dtype="float32")
+        policy = PI0Policy(config)

-    policy = PI0Policy.from_pretrained("lerobot/pi0_base", config=config, strict=True)
    policy.to(DEVICE)
-    policy.config.device = str(DEVICE)
-    preprocessor, _ = make_pi0_pre_post_processors(config=policy.config, dataset_stats=DUMMY_DATASET_STATS)
-    return policy, preprocessor
+    policy.config.device = DEVICE
+    preprocessor, postprocessor = make_pi0_pre_post_processors(
+        config=policy.config, dataset_stats=DUMMY_DATASET_STATS
+    )
+    return (policy, preprocessor, postprocessor)


-def instantiate_original_pi0():
-    policy = PI0Pytorch(PI0BaseOriginalConfig()).to(DEVICE)
-    state_dict = fix_reference_state_dict(load_openpi_reference_state_dict("lerobot/pi0_base"))
-    missing_keys, unexpected_keys = policy.load_state_dict(state_dict, strict=False)
-    assert missing_keys == []
-    assert unexpected_keys == []
+def instantiate_original_pi0(from_pretrained: bool = False, model_path: str | None = None):
+    config = PI0BaseOriginalConfig()
+    policy = PI0Pytorch(config)
+
+    if from_pretrained:
+        try:
+            print("Loading converted PyTorch weights from HuggingFace Hub (lerobot/pi0_base)...")
+
+            # Download the model from HuggingFace Hub
+            import safetensors.torch
+            from huggingface_hub import snapshot_download
+
+            # Download the entire repository
+            if model_path and os.path.exists(model_path):
+                cache_dir = model_path
+                print(f"Using cached model from: {cache_dir}")
+            else:
+                cache_dir = snapshot_download(repo_id="lerobot/pi0_base", repo_type="model")
+                print(f"Downloaded model to: {cache_dir}")
+
+            # Load the weights from the safetensors file (no other format is attempted)
+            model_file = os.path.join(cache_dir, "model.safetensors")
+            if os.path.exists(model_file):
+                state_dict = safetensors.torch.load_file(model_file)
+                print(f"Loaded {len(state_dict)} parameters from safetensors")
+            else:
+                raise FileNotFoundError(f"No safetensors file found in {cache_dir}")
+
+            # Load the state dict into the model
+            missing_keys, unexpected_keys = policy.load_state_dict(state_dict, strict=False)
+
+            if missing_keys:
+                print(f"Missing keys: {len(missing_keys)}")
+                if len(missing_keys) <= 5:
+                    for key in missing_keys:
+                        print(f"    - {key}")
+                else:
+                    for key in missing_keys[:5]:
+                        print(f"    - {key}")
+                    print(f"    ... and {len(missing_keys) - 5} more")
+
+            if unexpected_keys:
+                print(f"Unexpected keys: {len(unexpected_keys)}")
+                if len(unexpected_keys) <= 5:
+                    for key in unexpected_keys:
+                        print(f"    - {key}")
+                else:
+                    for key in unexpected_keys[:5]:
+                        print(f"    - {key}")
+                    print(f"    ... and {len(unexpected_keys) - 5} more")
+
+            if not missing_keys and not unexpected_keys:
+                print("All pretrained weights loaded successfully!")
+            else:
+                print("Pretrained weights loaded with some missing/unexpected keys (this may be normal)")
+
+        except Exception as e:
+            print(f"Failed to load pretrained weights: {e}")
+            print("   Using randomly initialized weights...")
+            import traceback
+
+            traceback.print_exc()
+
+    policy.to(DEVICE)
    return policy


 def create_dummy_data():
-    batch_size = 2
+    batch_size = 2  # Reduce batch size for testing
+    device = DEVICE
+
+    # Use the exact same prompt for both implementations
    prompt = "Pick up the red block and place it in the bin"
-    return {
-        OBS_STATE: torch.randn(batch_size, DUMMY_STATE_DIM, dtype=torch.float32, device=DEVICE),
-        ACTION: torch.randn(
-            batch_size, DUMMY_ACTION_HORIZON, DUMMY_ACTION_DIM, dtype=torch.float32, device=DEVICE
+
+    batch = {
+        "observation.state": torch.randn(batch_size, DUMMY_STATE_DIM, dtype=torch.float32, device=device),
+        "action": torch.randn(
+            batch_size, DUMMY_ACTION_HORIZON, DUMMY_ACTION_DIM, dtype=torch.float32, device=device
        ),
+        # Create images in [0, 1] range as expected by LeRobot (will be converted to [-1, 1] internally)
        "observation.images.base_0_rgb": torch.rand(
-            batch_size, 3, 224, 224, dtype=torch.float32, device=DEVICE
+            batch_size, 3, 224, 224, dtype=torch.float32, device=device
        ),
        "observation.images.left_wrist_0_rgb": torch.rand(
-            batch_size, 3, 224, 224, dtype=torch.float32, device=DEVICE
+            batch_size, 3, 224, 224, dtype=torch.float32, device=device
        ),
        "observation.images.right_wrist_0_rgb": torch.rand(
-            batch_size, 3, 224, 224, dtype=torch.float32, device=DEVICE
+            batch_size, 3, 224, 224, dtype=torch.float32, device=device
        ),
+        # Task prompt for LeRobot: one copy of the prompt per batch element
        "task": [prompt for _ in range(batch_size)],
    }
+    return batch


-def prepare_parity_inputs(lerobot_pi0, lerobot_preprocessor):
-    torch.manual_seed(0)
-    raw_batch = create_dummy_data()
-    lerobot_batch = lerobot_preprocessor(clone_batch(raw_batch))
-    openpi_observation = make_openpi_observation_from_raw(
-        raw_batch,
-        action_dim=DUMMY_ACTION_DIM,
-        max_token_len=DUMMY_MAX_TOKEN_LEN,
-        dataset_stats=DUMMY_DATASET_STATS,
-        pi05=False,
-    )
-    openpi_actions = openpi_model_actions_from_raw(
-        raw_batch,
-        action_dim=DUMMY_ACTION_DIM,
-        dataset_stats=DUMMY_DATASET_STATS,
-        pi05=False,
-    )
-    assert_processor_inputs_match_lerobot(
-        lerobot_pi0,
-        lerobot_batch,
-        openpi_observation,
-        compare_state=True,
-    )
-    batch_size = raw_batch[OBS_STATE].shape[0]
-    noise = torch.randn(
-        batch_size,
-        DUMMY_ACTION_HORIZON,
-        DUMMY_ACTION_DIM,
-        dtype=torch.float32,
-        device=DEVICE,
-    )
-    time = torch.linspace(0.2, 0.8, batch_size, dtype=torch.float32, device=DEVICE)
-    return lerobot_batch, openpi_observation, openpi_actions, noise, time
+def extract_lerobot_processed_inputs(lerobot_pi0, batch):
+    """Extract the exact same processed inputs that LeRobot uses internally."""
+    # Get the tokenized language from LeRobot's internal method
+    lang_tokens, lang_masks = lerobot_pi0._tokenize_language(batch)
+
+    # Get the preprocessed images from LeRobot's internal method
+    images, img_masks = lerobot_pi0._preprocess_images(batch, train=False)
+
+    # Create dummy token_ar_mask and token_loss_mask for original implementation
+    token_ar_mask = torch.zeros_like(lang_tokens, dtype=torch.int32)
+    token_loss_mask = torch.ones_like(lang_masks, dtype=torch.bool)
+
+    return images, img_masks, lang_tokens, lang_masks, token_ar_mask, token_loss_mask


-def assert_forward_matches(*, compile_model: bool = False, gradient_checkpointing: bool = False):
-    lerobot_pi0, lerobot_preprocessor = instantiate_lerobot_pi0(
-        compile_model=compile_model,
-        gradient_checkpointing=gradient_checkpointing,
-    )
-    original_pi0 = instantiate_original_pi0()
-    lerobot_batch, openpi_observation, openpi_actions, noise, time = prepare_parity_inputs(
-        lerobot_pi0,
-        lerobot_preprocessor,
-    )
+class PI0Observation:
+    """Observation class that matches the original OpenPI format."""

-    if gradient_checkpointing:
-        lerobot_pi0.train()
+    def __init__(
+        self,
+        state,
+        images,
+        image_masks,
+        tokenized_prompt,
+        tokenized_prompt_mask,
+        token_ar_mask,
+        token_loss_mask,
+    ):
+        self.state = state
+        self.images = images
+        self.image_masks = image_masks
+        self.tokenized_prompt = tokenized_prompt
+        self.tokenized_prompt_mask = tokenized_prompt_mask
+        self.token_ar_mask = token_ar_mask
+        self.token_loss_mask = token_loss_mask
+
+
+def create_original_observation_with_openpi_preprocessing(batch):
+    """Create observation object for OpenPI using OpenPI's own preprocessing."""
+    batch_size = batch["observation.state"].shape[0]
+    device = batch["observation.state"].device
+
+    # Create tokenizer for OpenPI (same as LeRobot uses)
+    tokenizer = AutoTokenizer.from_pretrained("google/paligemma-3b-pt-224")
+
+    # Get task description
+    if "task" in batch:
+        tasks = batch["task"]
+        if isinstance(tasks, str):
+            # Single string: add newline if not present, then repeat it for every batch element
+            if not tasks.endswith("\n"):
+                tasks = f"{tasks}\n"
+            tasks = [tasks] * batch_size
+        elif isinstance(tasks, list) and all(isinstance(t, str) for t in tasks):
+            # List of strings: add newline to each if not present
+            tasks = [t if t.endswith("\n") else f"{t}\n" for t in tasks]
+            if len(tasks) == 1:
+                # Expand to batch size
+                tasks = tasks * batch_size
+            if len(tasks) != batch_size:
+                raise ValueError(f"Expected batch size {batch_size}, got {len(tasks)}")
+        # If task is neither string nor list of strings, leave unchanged
    else:
-        lerobot_pi0.eval()
-    original_pi0.eval()
+        # Default task if not provided
+        tasks = ["Pick up the object\n"] * batch_size

-    with fixed_flow_sampling(lerobot_pi0.model, noise=noise, time=time):
-        lerobot_loss, _ = lerobot_pi0(lerobot_batch, reduction="none")
-    with deterministic_openpi_forward_preprocess(original_pi0):
-        openpi_losses = original_pi0(openpi_observation, openpi_actions, noise=noise, time=time)
-    openpi_loss = openpi_losses.mean(dim=(1, 2))
-
-    torch.testing.assert_close(lerobot_loss, openpi_loss, rtol=FORWARD_RTOL, atol=FORWARD_ATOL)
-
-
-def assert_sample_actions_match_openpi(*, compile_model: bool = False):
-    lerobot_pi0, lerobot_preprocessor = instantiate_lerobot_pi0(compile_model=compile_model)
-    original_pi0 = instantiate_original_pi0()
-    lerobot_batch, openpi_observation, _openpi_actions, noise, _time = prepare_parity_inputs(
-        lerobot_pi0,
-        lerobot_preprocessor,
+    # Tokenize with max_length padding to match OpenPI's expected format
+    tokenized = tokenizer(
+        tasks,
+        padding="max_length",
+        padding_side="right",
+        truncation=True,
+        max_length=DUMMY_MAX_TOKEN_LEN,
+        return_tensors="pt",
    )

-    lerobot_pi0.eval()
+    lang_tokens = tokenized["input_ids"].to(device)
+    lang_masks = tokenized["attention_mask"].to(device, dtype=torch.bool)
+
+    # Create dummy token_ar_mask and token_loss_mask for OpenPI
+    token_ar_mask = torch.zeros_like(lang_tokens, dtype=torch.int32)
+    token_loss_mask = torch.ones_like(lang_masks, dtype=torch.bool)
+
+    # Convert LeRobot images format to OpenPI format (convert [0,1] to [-1,1] range)
+    image_dict = {
+        "base_0_rgb": batch["observation.images.base_0_rgb"] * 2.0 - 1.0,
+        "left_wrist_0_rgb": batch["observation.images.left_wrist_0_rgb"] * 2.0 - 1.0,
+        "right_wrist_0_rgb": batch["observation.images.right_wrist_0_rgb"] * 2.0 - 1.0,
+    }
+
+    # Create image masks (all ones for real images)
+    image_masks_dict = {}
+    for key in image_dict:
+        image_masks_dict[key] = torch.ones(batch_size, dtype=torch.bool, device=device)
+
+    # Create raw observation object (before preprocessing)
+    raw_observation = PI0Observation(
+        state=batch["observation.state"],
+        images=image_dict,
+        image_masks=image_masks_dict,
+        tokenized_prompt=lang_tokens,
+        tokenized_prompt_mask=lang_masks,
+        token_ar_mask=token_ar_mask,
+        token_loss_mask=token_loss_mask,
+    )
+
+    # Now use OpenPI's preprocessing
+    processed_obs = openpi_preprocessing.preprocess_observation_pytorch(raw_observation, train=False)
+
+    return processed_obs
+
+
+def create_original_observation_from_lerobot(lerobot_pi0, batch):
+    """Create observation object compatible with original OpenPI using the exact same inputs as LeRobot."""
+    _batch_size = batch["observation.state"].shape[0]
+    _device = batch["observation.state"].device
+
+    # Extract the exact same processed inputs that LeRobot uses
+    images, img_masks, lang_tokens, lang_masks, token_ar_mask, token_loss_mask = (
+        extract_lerobot_processed_inputs(lerobot_pi0, batch)
+    )
+
+    # Convert images list to dict with original OpenPI keys
+    image_dict = {
+        "base_0_rgb": images[0],
+        "left_wrist_0_rgb": images[1],
+        "right_wrist_0_rgb": images[2],
+    }
+
+    # Convert image masks list to dict with original OpenPI keys
+    image_masks_dict = {
+        "base_0_rgb": img_masks[0],
+        "left_wrist_0_rgb": img_masks[1],
+        "right_wrist_0_rgb": img_masks[2],
+    }
+
+    return PI0Observation(
+        state=batch["observation.state"],
+        images=image_dict,
+        image_masks=image_masks_dict,
+        tokenized_prompt=lang_tokens,
+        tokenized_prompt_mask=lang_masks,
+        token_ar_mask=token_ar_mask,
+        token_loss_mask=token_loss_mask,
+    )
+
+
+def test_pi0_original_vs_lerobot():
+    """Test PI0 original implementation vs LeRobot implementation."""
+    print("Initializing models...")
+    lerobot_pi0, lerobot_preprocessor, lerobot_postprocessor = instantiate_lerobot_pi0(
+        from_pretrained=True
+    )  # Load pretrained LeRobot model
+    original_pi0 = instantiate_original_pi0(
+        from_pretrained=True
+    )  # Load pretrained OpenPI model from HuggingFace Hub
+
+    print("Creating dummy data...")
+    batch = create_dummy_data()
+    batch_lerobot = deepcopy(batch)
+
+    # Test each model with its own preprocessing (more realistic end-to-end test)
+    print("\nTest each model with its own preprocessing")
+    print("Creating observation for OpenPI using OpenPI's own preprocessing...")
+    pi0_obs_openpi = create_original_observation_with_openpi_preprocessing(batch)
+
+    print(f"Task prompt: '{batch['task'][0]}'")
+    print(f"OpenPI tokenized prompt shape: {pi0_obs_openpi.tokenized_prompt.shape}")
+    print(f"OpenPI image shapes: {[img.shape for img in pi0_obs_openpi.images.values()]}")
+    print(f"OpenPI state shape: {pi0_obs_openpi.state.shape}")
+
+    print("Testing OpenPI with own preprocessing...")
    original_pi0.eval()
+    torch.manual_seed(42)  # Set seed for reproducibility
+    batch_size = batch["observation.state"].shape[0]
+    noise_shape = (batch_size, DUMMY_ACTION_HORIZON, DUMMY_ACTION_DIM)
+    fixed_noise = torch.randn(noise_shape, dtype=torch.float32, device=DEVICE)
+
    with torch.no_grad():
-        lerobot_actions = lerobot_pi0.predict_action_chunk(lerobot_batch, noise=noise, num_steps=10)
        openpi_actions = original_pi0.sample_actions(
-            device=DEVICE,
-            observation=openpi_observation,
-            noise=noise,
-            num_steps=10,
+            device=DEVICE, observation=pi0_obs_openpi, noise=fixed_noise, num_steps=10
        )
+        openpi_actions_unit = openpi_actions[:, 0, :]
+    print(f"OpenPI (own preprocessing) Actions shape: {openpi_actions.shape}")
+    print(f"OpenPI (own preprocessing) Actions unit shape: {openpi_actions_unit.shape}")
+    print(f"OpenPI (own preprocessing) Actions mean: {openpi_actions.mean().item():.6f}")
+    print(f"OpenPI (own preprocessing) Actions std: {openpi_actions.std().item():.6f}")

-    torch.testing.assert_close(lerobot_actions, openpi_actions, rtol=SAMPLE_RTOL, atol=SAMPLE_ATOL)
+    print("Testing LeRobot with own preprocessing...")
+    lerobot_pi0.eval()
+    torch.manual_seed(42)  # Set the same seed

+    batch_lerobot_processed = lerobot_preprocessor(batch_lerobot)
+    with torch.no_grad():
+        lerobot_actions_own = lerobot_pi0.predict_action_chunk(
+            batch_lerobot_processed
+        )  # batch_size, n_action_steps, action_dim
+        lerobot_actions_unit = lerobot_actions_own[:, 0, :]
+    print(f"LeRobot (own preprocessing) Actions shape: {lerobot_actions_own.shape}")
+    print(f"LeRobot (own preprocessing) Actions unit shape: {lerobot_actions_unit.shape}")
+    print(f"LeRobot (own preprocessing) Actions mean: {lerobot_actions_own.mean().item():.6f}")
+    print(f"LeRobot (own preprocessing) Actions std: {lerobot_actions_own.std().item():.6f}")

-def test_pi0_forward_matches_openpi():
-    assert_forward_matches()
+    print("\nComparing end-to-end implementations:")
+    print(f"Actions close (atol=1e-4): {torch.allclose(lerobot_actions_own, openpi_actions, atol=1e-4)}")
+    print(f"Actions close (atol=1e-2): {torch.allclose(lerobot_actions_own, openpi_actions, atol=1e-2)}")
+    print(f"Max absolute difference: {torch.abs(lerobot_actions_own - openpi_actions).max().item():.6f}")

-
-def test_pi0_sample_actions_match_openpi():
-    assert_sample_actions_match_openpi()
-
-
-def test_pi0_gradient_checkpointing_forward_matches_openpi():
-    assert_forward_matches(gradient_checkpointing=True)
-
-
-def test_pi0_compile_forward_matches_openpi():
-    assert_forward_matches(compile_model=True)
-
-
-def test_pi0_compile_sample_actions_match_openpi():
-    assert_sample_actions_match_openpi(compile_model=True)
-
-
-def test_pi0_compile_gradient_checkpointing_forward_matches_openpi():
-    assert_forward_matches(compile_model=True, gradient_checkpointing=True)
+    assert torch.allclose(lerobot_actions_own, openpi_actions, atol=1e-4)
+    assert torch.allclose(lerobot_actions_own, openpi_actions, atol=1e-2)
+    assert torch.abs(lerobot_actions_own - openpi_actions).max().item() < 1e-4
--- a/tests/policies/pi0_pi05/utils/__init__.py
+++ b/tests/policies/pi0_pi05/utils/__init__.py
@@ -1 +0,0 @@
-"""Utilities shared by PI0/PI05 policy tests."""
--- a/tests/policies/pi0_pi05/utils/openpi_parity.py
+++ b/tests/policies/pi0_pi05/utils/openpi_parity.py
@@ -1,291 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from __future__ import annotations
-
-from collections.abc import Iterator
-from contextlib import contextmanager
-from dataclasses import dataclass
-from functools import lru_cache
-from pathlib import Path
-
-import numpy as np
-import safetensors.torch
-import torch
-import torch.nn.functional as F  # noqa: N812
-from huggingface_hub import snapshot_download
-from transformers import AutoTokenizer
-
-from lerobot.utils.constants import (
-    ACTION,
-    OBS_LANGUAGE_ATTENTION_MASK,
-    OBS_LANGUAGE_TOKENS,
-    OBS_STATE,
-)
-from tests.policies.pi0_pi05.openpi_pytorch import preprocessing_pytorch as openpi_preprocessing
-
-IMAGE_KEYS = ("base_0_rgb", "left_wrist_0_rgb", "right_wrist_0_rgb")
-TOKENIZER_NAME = "google/paligemma-3b-pt-224"
-
-
-@dataclass
-class OpenPIObservation:
-    state: torch.Tensor
-    images: dict[str, torch.Tensor]
-    image_masks: dict[str, torch.Tensor]
-    tokenized_prompt: torch.Tensor
-    tokenized_prompt_mask: torch.Tensor
-    token_ar_mask: torch.Tensor
-    token_loss_mask: torch.Tensor
-
-
-@lru_cache(maxsize=1)
-def paligemma_tokenizer():
-    return AutoTokenizer.from_pretrained(TOKENIZER_NAME)
-
-
-def clone_batch(batch: dict) -> dict:
-    return {
-        key: value.clone() if isinstance(value, torch.Tensor) else list(value) for key, value in batch.items()
-    }
-
-
-def pad_last_dim(tensor: torch.Tensor, target_dim: int) -> torch.Tensor:
-    if tensor.shape[-1] > target_dim:
-        raise ValueError(f"Cannot pad last dimension {tensor.shape[-1]} down to {target_dim}")
-    return F.pad(tensor, (0, target_dim - tensor.shape[-1]))
-
-
-def mean_std_normalize(tensor: torch.Tensor, stats: dict[str, torch.Tensor]) -> torch.Tensor:
-    mean = stats["mean"].to(device=tensor.device, dtype=tensor.dtype)
-    std = stats["std"].to(device=tensor.device, dtype=tensor.dtype)
-    return (tensor - mean) / (std + 1e-8)
-
-
-def quantile_normalize(tensor: torch.Tensor, stats: dict[str, torch.Tensor]) -> torch.Tensor:
-    q01 = stats["q01"].to(device=tensor.device, dtype=tensor.dtype)
-    q99 = stats["q99"].to(device=tensor.device, dtype=tensor.dtype)
-    denom = torch.where(q99 == q01, torch.full_like(q99, 1e-8), q99 - q01)
-    return 2.0 * (tensor - q01) / denom - 1.0
-
-
-def openpi_model_state_from_raw(
-    batch: dict[str, torch.Tensor],
-    *,
-    action_dim: int,
-    dataset_stats: dict[str, dict[str, torch.Tensor]],
-    pi05: bool,
-) -> torch.Tensor:
-    state = batch[OBS_STATE].to(dtype=torch.float32)
-    if pi05:
-        state = quantile_normalize(state, dataset_stats[OBS_STATE])
-    else:
-        state = mean_std_normalize(state, dataset_stats[OBS_STATE])
-    return pad_last_dim(state, action_dim)
-
-
-def openpi_model_actions_from_raw(
-    batch: dict[str, torch.Tensor],
-    *,
-    action_dim: int,
-    dataset_stats: dict[str, dict[str, torch.Tensor]],
-    pi05: bool,
-) -> torch.Tensor:
-    actions = batch[ACTION].to(dtype=torch.float32)
-    if pi05:
-        actions = quantile_normalize(actions, dataset_stats[ACTION])
-    else:
-        actions = mean_std_normalize(actions, dataset_stats[ACTION])
-    return pad_last_dim(actions, action_dim)
-
-
-def _tasks_from_raw(batch: dict, batch_size: int) -> list[str]:
-    tasks = batch.get("task")
-    if tasks is None:
-        raise ValueError("The parity batch must include a task prompt.")
-    if isinstance(tasks, str):
-        return [tasks] * batch_size
-    if len(tasks) == 1:
-        return [tasks[0]] * batch_size
-    if len(tasks) != batch_size:
-        raise ValueError(f"Expected {batch_size} task prompts, got {len(tasks)}")
-    return list(tasks)
-
-
-def _format_pi0_prompts(tasks: list[str]) -> list[str]:
-    return [f"{task.strip().replace('_', ' ').replace(chr(10), ' ')}\n" for task in tasks]
-
-
-def _format_pi05_prompts(tasks: list[str], normalized_state: torch.Tensor) -> list[str]:
-    state_np = normalized_state.detach().cpu().numpy()
-    discretized_states = np.digitize(state_np, bins=np.linspace(-1, 1, 256 + 1)[:-1]) - 1
-    prompts = []
-    for task, state in zip(tasks, discretized_states, strict=True):
-        cleaned_text = task.strip().replace("_", " ").replace("\n", " ")
-        state_str = " ".join(map(str, state))
-        prompts.append(f"Task: {cleaned_text}, State: {state_str};\nAction: ")
-    return prompts
-
-
-def _tokenize_prompts(prompts: list[str], *, max_token_len: int, device: torch.device | str):
-    tokenized = paligemma_tokenizer()(
-        prompts,
-        padding="max_length",
-        padding_side="right",
-        truncation=True,
-        max_length=max_token_len,
-        return_tensors="pt",
-    )
-    tokens = tokenized["input_ids"].to(device)
-    masks = tokenized["attention_mask"].to(device=device, dtype=torch.bool)
-    return tokens, masks
-
-
-def make_openpi_observation_from_raw(
-    batch: dict[str, torch.Tensor],
-    *,
-    action_dim: int,
-    max_token_len: int,
-    dataset_stats: dict[str, dict[str, torch.Tensor]],
-    pi05: bool,
-) -> OpenPIObservation:
-    batch_size = batch[OBS_STATE].shape[0]
-    device = batch[OBS_STATE].device
-    state = openpi_model_state_from_raw(
-        batch,
-        action_dim=action_dim,
-        dataset_stats=dataset_stats,
-        pi05=pi05,
-    )
-
-    tasks = _tasks_from_raw(batch, batch_size)
-    prompts = _format_pi05_prompts(tasks, state) if pi05 else _format_pi0_prompts(tasks)
-    tokens, masks = _tokenize_prompts(prompts, max_token_len=max_token_len, device=device)
-
-    images = {
-        key: batch[f"observation.images.{key}"].to(device=device, dtype=torch.float32) * 2.0 - 1.0
-        for key in IMAGE_KEYS
-    }
-    image_masks = {key: torch.ones(batch_size, dtype=torch.bool, device=device) for key in IMAGE_KEYS}
-
-    return OpenPIObservation(
-        state=state,
-        images=images,
-        image_masks=image_masks,
-        tokenized_prompt=tokens,
-        tokenized_prompt_mask=masks,
-        token_ar_mask=torch.zeros_like(tokens, dtype=torch.int32),
-        token_loss_mask=torch.ones_like(masks, dtype=torch.bool),
-    )
-
-
-def assert_processor_inputs_match_lerobot(
-    lerobot_policy,
-    lerobot_batch: dict[str, torch.Tensor],
-    openpi_observation: OpenPIObservation,
-    *,
-    compare_state: bool,
-):
-    openpi_processed = openpi_preprocessing.preprocess_observation_pytorch(openpi_observation, train=False)
-    lerobot_images, lerobot_image_masks = lerobot_policy._preprocess_images(lerobot_batch)
-
-    # Token IDs, token masks, images, image masks, and PI0 state are intentionally built from the same
-    # raw batch through independent LeRobot/OpenPI-style processor logic. They must be bitwise equal.
-    torch.testing.assert_close(
-        openpi_observation.tokenized_prompt, lerobot_batch[OBS_LANGUAGE_TOKENS], rtol=0, atol=0
-    )
-    torch.testing.assert_close(
-        openpi_observation.tokenized_prompt_mask,
-        lerobot_batch[OBS_LANGUAGE_ATTENTION_MASK],
-        rtol=0,
-        atol=0,
-    )
-
-    for openpi_image, lerobot_image in zip(openpi_processed.images.values(), lerobot_images, strict=True):
-        torch.testing.assert_close(openpi_image, lerobot_image, rtol=0, atol=0)
-
-    for openpi_mask, lerobot_mask in zip(
-        openpi_processed.image_masks.values(), lerobot_image_masks, strict=True
-    ):
-        torch.testing.assert_close(openpi_mask, lerobot_mask, rtol=0, atol=0)
-
-    if compare_state:
-        torch.testing.assert_close(
-            openpi_processed.state, lerobot_policy.prepare_state(lerobot_batch), rtol=0, atol=0
-        )
-
-
-def load_openpi_reference_state_dict(repo_id: str) -> dict[str, torch.Tensor]:
-    cache_dir = Path(snapshot_download(repo_id=repo_id, repo_type="model"))
-    return safetensors.torch.load_file(cache_dir / "model.safetensors")
-
-
-def fix_reference_state_dict(state_dict: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
-    fixed_state_dict = dict(state_dict)
-    lm_head_key = "paligemma_with_expert.paligemma.lm_head.weight"
-    embed_tokens_key = "paligemma_with_expert.paligemma.model.language_model.embed_tokens.weight"
-    if lm_head_key in fixed_state_dict and embed_tokens_key not in fixed_state_dict:
-        fixed_state_dict[embed_tokens_key] = fixed_state_dict[lm_head_key].clone()
-    return fixed_state_dict
-
-
-@contextmanager
-def fixed_flow_sampling(model, *, noise: torch.Tensor, time: torch.Tensor) -> Iterator[None]:
-    original_sample_noise = model.sample_noise
-    original_sample_time = model.sample_time
-
-    def sample_noise(shape, device):
-        if tuple(shape) != tuple(noise.shape):
-            raise ValueError(f"Expected noise shape {tuple(noise.shape)}, got {tuple(shape)}")
-        return noise.to(device=device)
-
-    def sample_time(batch_size, device):
-        if batch_size != time.shape[0]:
-            raise ValueError(f"Expected time batch size {time.shape[0]}, got {batch_size}")
-        return time.to(device=device)
-
-    model.sample_noise = sample_noise
-    model.sample_time = sample_time
-    try:
-        yield
-    finally:
-        model.sample_noise = original_sample_noise
-        model.sample_time = original_sample_time
-
-
-@contextmanager
-def deterministic_openpi_forward_preprocess(openpi_policy) -> Iterator[None]:
-    """Disable OpenPI's training-time image augmentation only inside a parity forward block.
-
-    OpenPI's `forward()` calls `_preprocess_observation(..., train=True)`, which can apply stochastic
-    image augmentation. LeRobot's policy forward path does not apply that augmentation, so parity would
-    otherwise compare two different image tensors rather than two model implementations. The context manager
-    keeps the public `openpi_policy.forward(observation, ...)` call while making preprocessing deterministic.
-
-    `yield` marks the body of the caller's `with` block. The `try/finally` restores the original method even
-    if the assertion inside the block fails, so the temporary monkeypatch cannot leak into later tests.
-    """
-
-    original_preprocess_observation = openpi_policy._preprocess_observation
-
-    def preprocess_observation(observation, *, train=True):
-        return original_preprocess_observation(observation, train=False)
-
-    openpi_policy._preprocess_observation = preprocess_observation
-    try:
-        yield
-    finally:
-        openpi_policy._preprocess_observation = original_preprocess_observation
--- a/tests/policies/pi0_pi05/utils/torch_compile.py
+++ b/tests/policies/pi0_pi05/utils/torch_compile.py
@@ -1,207 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import time
-from collections.abc import Callable
-
-import torch
-from torch._dynamo.utils import counters, guard_failures
-from torch.profiler import ProfilerActivity
-
-FORWARD_RTOL = 1e-5
-FORWARD_ATOL = 5e-2
-SAMPLE_RTOL = 1e-5
-SAMPLE_ATOL = 1e-2
-COMPILE_MODE = "max-autotune"
-STEADY_STATE_WARMUPS = 3
-STEADY_STATE_REPEATS = 3
-
-
-def make_compile_config(config_cls, *, compile_model):
-    return config_cls(device="cuda", compile_model=compile_model, compile_mode=COMPILE_MODE)
-
-
-def counter_total(name):
-    return sum(counters.get(name, {}).values())
-
-
-def compile_snapshot():
-    return {
-        "graph_breaks": counter_total("graph_break"),
-        "recompiles": counter_total("recompiles"),
-        "recompile_limits": counter_total("recompile_limit"),
-        "unique_graphs": counters["stats"].get("unique_graphs", 0),
-    }
-
-
-def reset_compile_state():
-    torch._dynamo.reset()
-    counters.clear()
-    guard_failures.clear()
-
-
-def clone_cuda_graph_output(output):
-    if torch.is_tensor(output):
-        return output.clone()
-    if isinstance(output, tuple):
-        return tuple(clone_cuda_graph_output(item) for item in output)
-    if isinstance(output, list):
-        return [clone_cuda_graph_output(item) for item in output]
-    if isinstance(output, dict):
-        return {key: clone_cuda_graph_output(value) for key, value in output.items()}
-    return output
-
-
-def run_model_step(fn: Callable, kwargs: dict):
-    if hasattr(torch.compiler, "cudagraph_mark_step_begin"):
-        torch.compiler.cudagraph_mark_step_begin()
-    return fn(**kwargs)
-
-
-def assert_explain_has_no_graph_breaks(fn: Callable, kwargs: dict, label: str):
-    reset_compile_state()
-    explanation = torch._dynamo.explain(fn)(**kwargs)
-
-    assert explanation.graph_count > 0, f"{label} was not captured by Dynamo"
-    assert explanation.graph_break_count == 0, (
-        f"{label} has {explanation.graph_break_count} graph break(s): {explanation.break_reasons}"
-    )
-    assert not explanation.break_reasons, f"{label} graph break reasons: {explanation.break_reasons}"
-
-    print(
-        f"{label} capture: graphs={explanation.graph_count}, "
-        f"graph_breaks={explanation.graph_break_count}, ops={explanation.op_count}, "
-        f"guards={len(explanation.out_guards or [])}"
-    )
-    return explanation
-
-
-@torch.no_grad()
-def assert_compiled_output_matches_eager(eager_model, compiled_model, forward_kwargs, sample_kwargs):
-    eager_forward = eager_model.forward(**forward_kwargs)
-    compiled_forward = compiled_model.forward(**forward_kwargs)
-    torch.testing.assert_close(compiled_forward, eager_forward, rtol=FORWARD_RTOL, atol=FORWARD_ATOL)
-
-    eager_actions = eager_model.sample_actions(**sample_kwargs)
-    compiled_actions = compiled_model.sample_actions(**sample_kwargs)
-    torch.testing.assert_close(compiled_actions, eager_actions, rtol=SAMPLE_RTOL, atol=SAMPLE_ATOL)
-
-
-@torch.no_grad()
-def assert_cache_stability(fn: Callable, kwargs: dict, label: str):
-    reset_compile_state()
-
-    first_output = clone_cuda_graph_output(run_model_step(fn, kwargs))
-    first_snapshot = compile_snapshot()
-    second_output = clone_cuda_graph_output(run_model_step(fn, kwargs))
-    second_snapshot = compile_snapshot()
-    third_output = clone_cuda_graph_output(run_model_step(fn, kwargs))
-    third_snapshot = compile_snapshot()
-
-    torch.testing.assert_close(second_output, first_output, rtol=FORWARD_RTOL, atol=FORWARD_ATOL)
-    torch.testing.assert_close(third_output, first_output, rtol=FORWARD_RTOL, atol=FORWARD_ATOL)
-    assert first_snapshot["unique_graphs"] > 0, f"{label} did not compile any graph"
-    assert third_snapshot["graph_breaks"] == 0, f"{label} graph breaks: {third_snapshot}"
-    assert third_snapshot["recompiles"] == 0, f"{label} recompiled: {third_snapshot}"
-    assert third_snapshot["recompile_limits"] == 0, f"{label} hit recompile limit: {third_snapshot}"
-    assert second_snapshot["unique_graphs"] == first_snapshot["unique_graphs"], (
-        f"{label} compiled new graph on second call: first={first_snapshot}, second={second_snapshot}"
-    )
-    assert third_snapshot["unique_graphs"] == first_snapshot["unique_graphs"], (
-        f"{label} compiled new graph on third call: first={first_snapshot}, third={third_snapshot}"
-    )
-    assert not guard_failures, f"{label} guard failures: {dict(guard_failures)}"
-
-    print(f"{label} cache: first={first_snapshot}, third={third_snapshot}")
-
-
-@torch.no_grad()
-def benchmark_runtime(eager_fn: Callable, compiled_fn: Callable, kwargs: dict, label: str):
-    run_warmups(eager_fn, kwargs)
-    run_warmups(compiled_fn, kwargs)
-    torch.cuda.synchronize()
-
-    eager_metrics = profile_callable(eager_fn, kwargs)
-    compiled_metrics = profile_callable(compiled_fn, kwargs)
-    speedup = eager_metrics["cuda_event_ms"] / compiled_metrics["cuda_event_ms"]
-
-    print(
-        f"{label} runtime: eager_cuda={eager_metrics['cuda_event_ms']:.3f} ms, "
-        f"compiled_cuda={compiled_metrics['cuda_event_ms']:.3f} ms, speedup={speedup:.3f}x, "
-        f"host_wall_ms eager/compiled={eager_metrics['host_wall_ms']:.3f}/"
-        f"{compiled_metrics['host_wall_ms']:.3f}, "
-        f"cpu_self_time_ms eager/compiled={eager_metrics['cpu_self_time_ms']:.3f}/"
-        f"{compiled_metrics['cpu_self_time_ms']:.3f}, "
-        f"cuda_launches eager/compiled={eager_metrics['cuda_launch_count']}/"
-        f"{compiled_metrics['cuda_launch_count']}, "
-        f"profiler_events eager/compiled={eager_metrics['profiler_event_count']}/"
-        f"{compiled_metrics['profiler_event_count']}, "
-        f"peak_mem_mib eager/compiled={eager_metrics['peak_mem_mib']:.1f}/"
-        f"{compiled_metrics['peak_mem_mib']:.1f}"
-    )
-
-    assert eager_metrics["cuda_event_ms"] > 0
-    assert compiled_metrics["cuda_event_ms"] > 0
-    assert eager_metrics["profiler_event_count"] > 0
-    assert compiled_metrics["profiler_event_count"] > 0
-    return eager_metrics, compiled_metrics
-
-
-def run_warmups(fn: Callable, kwargs: dict):
-    for _ in range(STEADY_STATE_WARMUPS):
-        run_model_step(fn, kwargs)
-    torch.cuda.synchronize()
-
-
-def profile_callable(fn: Callable, kwargs: dict):
-    torch.cuda.synchronize()
-    torch.cuda.reset_peak_memory_stats()
-
-    start_event = torch.cuda.Event(enable_timing=True)
-    end_event = torch.cuda.Event(enable_timing=True)
-    host_start = time.perf_counter()
-    start_event.record()
-    for _ in range(STEADY_STATE_REPEATS):
-        run_model_step(fn, kwargs)
-    end_event.record()
-    torch.cuda.synchronize()
-    cuda_event_ms = start_event.elapsed_time(end_event) / STEADY_STATE_REPEATS
-    host_wall_ms = (time.perf_counter() - host_start) * 1000 / STEADY_STATE_REPEATS
-    peak_mem_mib = torch.cuda.max_memory_allocated() / 1024**2
-
-    with torch.profiler.profile(
-        activities=[ProfilerActivity.CPU],
-    ) as profiler:
-        run_model_step(fn, kwargs)
-        torch.cuda.synchronize()
-
-    key_averages = profiler.key_averages()
-    cpu_self_time_ms = sum(event.self_cpu_time_total for event in key_averages) / 1000
-    cuda_launch_count = sum(
-        event.count
-        for event in key_averages
-        if event.key in {"cudaLaunchKernel", "cudaGraphLaunch", "cudaLaunchKernelExC"}
-    )
-    profiler_event_count = sum(event.count for event in key_averages)
-
-    return {
-        "cuda_event_ms": cuda_event_ms,
-        "host_wall_ms": host_wall_ms,
-        "cpu_self_time_ms": cpu_self_time_ms,
-        "cuda_launch_count": cuda_launch_count,
-        "profiler_event_count": profiler_event_count,
-        "peak_mem_mib": peak_mem_mib,
-    }
--- a/tests/processor/test_pi05_processor.py
+++ b/tests/processor/test_pi05_processor.py
@@ -1,155 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-"""Compare the PI0.5 processor pipeline against the vendored OpenPI reference processors."""
-
-import os
-
-import pytest
-import torch
-
-pytest.importorskip("transformers")
-
-from lerobot.configs import FeatureType, PolicyFeature  # noqa: E402
-from lerobot.policies.pi05 import PI05Policy  # noqa: E402
-from lerobot.policies.pi05.configuration_pi05 import PI05Config  # noqa: E402
-from lerobot.policies.pi05.processor_pi05 import make_pi05_pre_post_processors  # noqa: E402
-from lerobot.utils.constants import ACTION, OBS_STATE  # noqa: E402
-from tests.policies.pi0_pi05.utils.openpi_parity import (  # noqa: E402
-    IMAGE_KEYS,
-    assert_processor_inputs_match_lerobot,
-    clone_batch,
-    make_openpi_observation_from_raw,
-    openpi_model_actions_from_raw,
-)
-
-pytestmark = pytest.mark.skipif(
-    os.environ.get("CI") == "true" or os.environ.get("GITHUB_ACTIONS") == "true",
-    reason="OpenPI processor parity uses the PaliGemma tokenizer; run manually outside CI.",
-)
-
-DUMMY_ACTION_DIM = 32
-DUMMY_STATE_DIM = 32
-DUMMY_ACTION_HORIZON = 50
-DUMMY_MAX_TOKEN_LEN = 200
-DEVICE = torch.device("cpu")
-
-DUMMY_DATASET_STATS = {
-    OBS_STATE: {
-        "mean": torch.zeros(DUMMY_STATE_DIM),
-        "std": torch.ones(DUMMY_STATE_DIM),
-        "q01": torch.zeros(DUMMY_STATE_DIM),
-        "q99": torch.ones(DUMMY_STATE_DIM),
-    },
-    ACTION: {
-        "mean": torch.zeros(DUMMY_ACTION_DIM),
-        "std": torch.ones(DUMMY_ACTION_DIM),
-        "q01": torch.zeros(DUMMY_ACTION_DIM),
-        "q99": torch.ones(DUMMY_ACTION_DIM),
-    },
-    "images": {
-        key: {
-            "mean": torch.zeros(3, 224, 224),
-            "std": torch.ones(3, 224, 224),
-            "q01": torch.zeros(3, 224, 224),
-            "q99": torch.ones(3, 224, 224),
-        }
-        for key in IMAGE_KEYS
-    },
-}
-
-
-class PI05PolicyInputAdapter(torch.nn.Module):
-    """Minimal adapter exposing PI0.5 policy image preparation without loading model weights."""
-
-    _preprocess_images = PI05Policy._preprocess_images
-
-    def __init__(self, config: PI05Config) -> None:
-        super().__init__()
-        self.config = config
-        self._device_anchor = torch.nn.Parameter(torch.empty((), device=config.device), requires_grad=False)
-
-
-def create_pi05_config() -> PI05Config:
-    config = PI05Config(device=str(DEVICE))
-    config.max_state_dim = DUMMY_STATE_DIM
-    config.max_action_dim = DUMMY_ACTION_DIM
-    config.chunk_size = DUMMY_ACTION_HORIZON
-    config.n_action_steps = DUMMY_ACTION_HORIZON
-    config.tokenizer_max_length = DUMMY_MAX_TOKEN_LEN
-    config.input_features = {
-        OBS_STATE: PolicyFeature(type=FeatureType.STATE, shape=(DUMMY_STATE_DIM,)),
-        **{
-            f"observation.images.{key}": PolicyFeature(type=FeatureType.VISUAL, shape=(3, 224, 224))
-            for key in IMAGE_KEYS
-        },
-    }
-    config.output_features = {
-        ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(DUMMY_ACTION_DIM,)),
-    }
-    return config
-
-
-def create_dummy_data() -> dict:
-    batch_size = 2
-    prompt = "Pick up the red block and place it in the bin"
-    return {
-        OBS_STATE: torch.randn(batch_size, DUMMY_STATE_DIM, dtype=torch.float32, device=DEVICE),
-        ACTION: torch.randn(
-            batch_size, DUMMY_ACTION_HORIZON, DUMMY_ACTION_DIM, dtype=torch.float32, device=DEVICE
-        ),
-        **{
-            f"observation.images.{key}": torch.rand(
-                batch_size, 3, 224, 224, dtype=torch.float32, device=DEVICE
-            )
-            for key in IMAGE_KEYS
-        },
-        "task": [prompt for _ in range(batch_size)],
-    }
-
-
-def test_pi05_processor_inputs_match_openpi_reference():
-    torch.manual_seed(0)
-    config = create_pi05_config()
-    preprocessor, _ = make_pi05_pre_post_processors(config=config, dataset_stats=DUMMY_DATASET_STATS)
-
-    raw_batch = create_dummy_data()
-    lerobot_batch = preprocessor(clone_batch(raw_batch))
-    openpi_observation = make_openpi_observation_from_raw(
-        raw_batch,
-        action_dim=DUMMY_ACTION_DIM,
-        max_token_len=DUMMY_MAX_TOKEN_LEN,
-        dataset_stats=DUMMY_DATASET_STATS,
-        pi05=True,
-    )
-
-    assert_processor_inputs_match_lerobot(
-        PI05PolicyInputAdapter(config),
-        lerobot_batch,
-        openpi_observation,
-        compare_state=False,
-    )
-    torch.testing.assert_close(
-        lerobot_batch[ACTION],
-        openpi_model_actions_from_raw(
-            raw_batch,
-            action_dim=DUMMY_ACTION_DIM,
-            dataset_stats=DUMMY_DATASET_STATS,
-            pi05=True,
-        ),
-        rtol=0,
-        atol=0,
-    )
--- a/tests/processor/test_pi0_processor.py
+++ b/tests/processor/test_pi0_processor.py
@@ -1,156 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-"""Compare the PI0 processor pipeline against the vendored OpenPI reference processors."""
-
-import os
-
-import pytest
-import torch
-
-pytest.importorskip("transformers")
-
-from lerobot.configs import FeatureType, PolicyFeature  # noqa: E402
-from lerobot.policies.pi0 import PI0Policy  # noqa: E402
-from lerobot.policies.pi0.configuration_pi0 import PI0Config  # noqa: E402
-from lerobot.policies.pi0.processor_pi0 import make_pi0_pre_post_processors  # noqa: E402
-from lerobot.utils.constants import ACTION, OBS_STATE  # noqa: E402
-from tests.policies.pi0_pi05.utils.openpi_parity import (  # noqa: E402
-    IMAGE_KEYS,
-    assert_processor_inputs_match_lerobot,
-    clone_batch,
-    make_openpi_observation_from_raw,
-    openpi_model_actions_from_raw,
-)
-
-pytestmark = pytest.mark.skipif(
-    os.environ.get("CI") == "true" or os.environ.get("GITHUB_ACTIONS") == "true",
-    reason="OpenPI processor parity uses the PaliGemma tokenizer; run manually outside CI.",
-)
-
-DUMMY_ACTION_DIM = 32
-DUMMY_STATE_DIM = 32
-DUMMY_ACTION_HORIZON = 50
-DUMMY_MAX_TOKEN_LEN = 48
-DEVICE = torch.device("cpu")
-
-DUMMY_DATASET_STATS = {
-    OBS_STATE: {
-        "mean": torch.zeros(DUMMY_STATE_DIM),
-        "std": torch.ones(DUMMY_STATE_DIM),
-        "q01": torch.zeros(DUMMY_STATE_DIM),
-        "q99": torch.ones(DUMMY_STATE_DIM),
-    },
-    ACTION: {
-        "mean": torch.zeros(DUMMY_ACTION_DIM),
-        "std": torch.ones(DUMMY_ACTION_DIM),
-        "q01": torch.zeros(DUMMY_ACTION_DIM),
-        "q99": torch.ones(DUMMY_ACTION_DIM),
-    },
-    "images": {
-        key: {
-            "mean": torch.zeros(3, 224, 224),
-            "std": torch.ones(3, 224, 224),
-            "q01": torch.zeros(3, 224, 224),
-            "q99": torch.ones(3, 224, 224),
-        }
-        for key in IMAGE_KEYS
-    },
-}
-
-
-class PI0PolicyInputAdapter(torch.nn.Module):
-    """Minimal adapter exposing PI0 policy input-preparation helpers without loading model weights."""
-
-    _preprocess_images = PI0Policy._preprocess_images
-    prepare_state = PI0Policy.prepare_state
-
-    def __init__(self, config: PI0Config) -> None:
-        super().__init__()
-        self.config = config
-        self._device_anchor = torch.nn.Parameter(torch.empty((), device=config.device), requires_grad=False)
-
-
-def create_pi0_config() -> PI0Config:
-    config = PI0Config(device=str(DEVICE))
-    config.max_state_dim = DUMMY_STATE_DIM
-    config.max_action_dim = DUMMY_ACTION_DIM
-    config.chunk_size = DUMMY_ACTION_HORIZON
-    config.n_action_steps = DUMMY_ACTION_HORIZON
-    config.tokenizer_max_length = DUMMY_MAX_TOKEN_LEN
-    config.input_features = {
-        OBS_STATE: PolicyFeature(type=FeatureType.STATE, shape=(DUMMY_STATE_DIM,)),
-        **{
-            f"observation.images.{key}": PolicyFeature(type=FeatureType.VISUAL, shape=(3, 224, 224))
-            for key in IMAGE_KEYS
-        },
-    }
-    config.output_features = {
-        ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(DUMMY_ACTION_DIM,)),
-    }
-    return config
-
-
-def create_dummy_data() -> dict:
-    batch_size = 2
-    prompt = "Pick up the red block and place it in the bin"
-    return {
-        OBS_STATE: torch.randn(batch_size, DUMMY_STATE_DIM, dtype=torch.float32, device=DEVICE),
-        ACTION: torch.randn(
-            batch_size, DUMMY_ACTION_HORIZON, DUMMY_ACTION_DIM, dtype=torch.float32, device=DEVICE
-        ),
-        **{
-            f"observation.images.{key}": torch.rand(
-                batch_size, 3, 224, 224, dtype=torch.float32, device=DEVICE
-            )
-            for key in IMAGE_KEYS
-        },
-        "task": [prompt for _ in range(batch_size)],
-    }
-
-
-def test_pi0_processor_inputs_match_openpi_reference():
-    torch.manual_seed(0)
-    config = create_pi0_config()
-    preprocessor, _ = make_pi0_pre_post_processors(config=config, dataset_stats=DUMMY_DATASET_STATS)
-
-    raw_batch = create_dummy_data()
-    lerobot_batch = preprocessor(clone_batch(raw_batch))
-    openpi_observation = make_openpi_observation_from_raw(
-        raw_batch,
-        action_dim=DUMMY_ACTION_DIM,
-        max_token_len=DUMMY_MAX_TOKEN_LEN,
-        dataset_stats=DUMMY_DATASET_STATS,
-        pi05=False,
-    )
-
-    assert_processor_inputs_match_lerobot(
-        PI0PolicyInputAdapter(config),
-        lerobot_batch,
-        openpi_observation,
-        compare_state=True,
-    )
-    torch.testing.assert_close(
-        lerobot_batch[ACTION],
-        openpi_model_actions_from_raw(
-            raw_batch,
-            action_dim=DUMMY_ACTION_DIM,
-            dataset_stats=DUMMY_DATASET_STATS,
-            pi05=False,
-        ),
-        rtol=0,
-        atol=0,
-    )
--- a/tests/rewards/test_modeling_robometer.py
+++ b/tests/rewards/test_modeling_robometer.py
@@ -1,340 +0,0 @@
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-"""Tests for Robometer reward model."""
-
-from __future__ import annotations
-
-from types import SimpleNamespace
-
-import pytest
-import torch
-
-from lerobot.configs.rewards import RewardModelConfig
-from lerobot.rewards.factory import get_reward_model_class, make_reward_model_config
-from lerobot.rewards.robometer import RobometerConfig
-from lerobot.rewards.robometer.configuration_robometer import ROBOMETER_SPECIAL_TOKENS
-from lerobot.rewards.robometer.modeling_robometer import (
-    ROBOMETER_FEATURE_PREFIX,
-    convert_bins_to_continuous,
-    decode_progress_outputs,
-)
-from tests.utils import skip_if_package_missing
-
-# Length of the fake tokenizer used in `_patch_build`. The deterministic
-# resize target derived in ``RobometerConfig.__post_init__`` is therefore
-# ``_FAKE_TOKENIZER_LEN + len(ROBOMETER_SPECIAL_TOKENS)``.
-_FAKE_TOKENIZER_LEN = 100
-_EXPECTED_RESIZED_VOCAB = _FAKE_TOKENIZER_LEN + len(ROBOMETER_SPECIAL_TOKENS)
-
-
-class _FakeQwenConfig:
-    """Stand-in for a Qwen3-VL config (the `model.config` attribute).
-
-    ``to_dict`` matches HF's ``PretrainedConfig.to_dict`` closely enough for
-    ``RobometerConfig.__post_init__`` to snapshot a meaningful ``vlm_config``
-    into the saved ``config.json`` and for the reload path to round-trip
-    through ``AutoConfig.for_model``.
-    """
-
-    def __init__(self, hidden_dim: int = 8, vocab_size: int = _FAKE_TOKENIZER_LEN) -> None:
-        # `vocab_size` here is the *pre-resize* value the fake backbone advertises.
-        # `__post_init__` is expected to overwrite it with `len(tokenizer) + 5`.
-        self.text_config = SimpleNamespace(hidden_size=hidden_dim, vocab_size=vocab_size)
-        self._hidden_dim = hidden_dim
-        self._vocab_size = vocab_size
-
-    def to_dict(self) -> dict:
-        return {
-            "model_type": "fake_qwen",
-            "text_config": {
-                "hidden_size": self._hidden_dim,
-                "vocab_size": self._vocab_size,
-            },
-        }
-
-
-class _FakeEmbeddings(torch.nn.Module):
-    def __init__(self, num_embeddings: int = _FAKE_TOKENIZER_LEN) -> None:
-        super().__init__()
-        self.num_embeddings = num_embeddings
-
-
-class _FakeBaseModel(torch.nn.Module):
-    """Stand-in for the Qwen3-VL backbone during tests.
-
-    Provides the minimum surface `RobometerRewardModel.__init__` and
-    `_compute_rbm_logits` rely on: a `parameters()` iterator (for dtype +
-    device), a `config.text_config.hidden_size`, a `config.to_dict()` so
-    `_save_pretrained` can snapshot `vlm_config`,
-    `get_input_embeddings()` / `resize_token_embeddings()` so the fresh-init
-    embed resize is a no-op, and a forward that returns a `SimpleNamespace`
-    with a `hidden_states` tuple.
-    """
-
-    def __init__(self, hidden_dim: int = 8) -> None:
-        super().__init__()
-        self._param = torch.nn.Parameter(torch.zeros(1))
-        self.hidden_dim = hidden_dim
-        self.config = _FakeQwenConfig(hidden_dim)
-        self._embeddings = _FakeEmbeddings()
-
-    def get_input_embeddings(self) -> _FakeEmbeddings:
-        return self._embeddings
-
-    def resize_token_embeddings(self, new_size: int) -> None:
-        self._embeddings.num_embeddings = new_size
-
-    def forward(self, **kwargs):  # noqa: ARG002 - intentional kwargs sink
-        input_ids = kwargs["input_ids"]
-        return SimpleNamespace(
-            hidden_states=(torch.zeros(input_ids.shape[0], input_ids.shape[1], self.hidden_dim),),
-            last_hidden_state=torch.zeros(input_ids.shape[0], input_ids.shape[1], self.hidden_dim),
-        )
-
-
-class _FakeTokenizer:
-    """Minimal stand-in for an HF tokenizer.
-
-    ``RobometerConfig.__post_init__`` uses ``len(tokenizer)`` to compute the
-    deterministic resize target ``len(tokenizer) + len(ROBOMETER_SPECIAL_TOKENS)``,
-    so a working ``__len__`` is all we need.
-    """
-
-    def __init__(self, length: int = _FAKE_TOKENIZER_LEN) -> None:
-        self._length = length
-
-    def __len__(self) -> int:
-        return self._length
-
-
-def _patch_build(monkeypatch) -> None:
-    """Stub out the HF AutoX calls so Robometer construction stays cheap in tests.
-
-    Covers (EO-1 style — no model-side override hooks):
-    * ``AutoConfig.from_pretrained`` (config side) — used by
-      ``RobometerConfig.__post_init__`` to snapshot the backbone config.
-    * ``AutoTokenizer.from_pretrained`` (config side) — used by
-      ``__post_init__`` to compute ``len(tokenizer) + 5``.
-    * ``AutoConfig.for_model``                       — used by
-      ``RobometerConfig.vlm_backbone_config`` when rebuilding for ``from_config``.
-    * ``AutoModelForImageTextToText.from_pretrained`` — fresh-training path
-      (``pretrained_path is None``).
-    * ``AutoModelForImageTextToText.from_config``    — checkpoint-reload path
-      (``pretrained_path`` is set).
-    """
-    from lerobot.rewards.robometer import configuration_robometer, modeling_robometer
-
-    monkeypatch.setattr(
-        modeling_robometer.AutoModelForImageTextToText,
-        "from_pretrained",
-        lambda *args, **kwargs: _FakeBaseModel(hidden_dim=8),
-    )
-    monkeypatch.setattr(
-        modeling_robometer.AutoModelForImageTextToText,
-        "from_config",
-        lambda *args, **kwargs: _FakeBaseModel(hidden_dim=8),
-    )
-    monkeypatch.setattr(
-        configuration_robometer.AutoConfig,
-        "for_model",
-        lambda *args, **kwargs: _FakeQwenConfig(hidden_dim=8),
-    )
-    monkeypatch.setattr(
-        configuration_robometer.AutoConfig,
-        "from_pretrained",
-        lambda *args, **kwargs: _FakeQwenConfig(hidden_dim=8),
-    )
-    monkeypatch.setattr(
-        configuration_robometer.AutoTokenizer,
-        "from_pretrained",
-        lambda *args, **kwargs: _FakeTokenizer(length=_FAKE_TOKENIZER_LEN),
-    )
-
-
-def _make_batch(features: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
-    """Build a `compute_reward`-ready batch using Robometer's namespaced keys."""
-    return {f"{ROBOMETER_FEATURE_PREFIX}{key}": value for key, value in features.items()}
-
-
-@skip_if_package_missing("transformers")
-def test_robometer_config_registered(monkeypatch):
-    _patch_build(monkeypatch)
-    assert "robometer" in RewardModelConfig.get_known_choices()
-    assert RewardModelConfig.get_choice_class("robometer") is RobometerConfig
-    assert isinstance(make_reward_model_config("robometer", device="cpu"), RobometerConfig)
-
-
-def test_robometer_factory_returns_in_tree_class():
-    from lerobot.rewards.robometer.modeling_robometer import RobometerRewardModel
-
-    assert get_reward_model_class("robometer") is RobometerRewardModel
-
-
-def test_convert_bins_to_continuous_returns_expected_values():
-    # Two frames: first peaks at bin 0 (center 0.0), second peaks at bin 9 (center 1.0).
-    bin_logits = torch.full((2, 10), -10.0)
-    bin_logits[0, 0] = 10.0
-    bin_logits[1, -1] = 10.0
-    values = convert_bins_to_continuous(bin_logits)
-    assert values.shape == (2,)
-    assert torch.allclose(values, torch.tensor([0.0, 1.0]), atol=1e-3)
-
-
-def test_decode_progress_outputs_returns_last_frame_values():
-    progress = torch.tensor([[0.1, 0.9], [0.4, 0.6]])
-    success_logits = torch.tensor([[0.0, 5.0], [0.0, -5.0]])
-
-    outputs = decode_progress_outputs(progress, success_logits, is_discrete_mode=False)
-
-    assert outputs["progress_pred"] == [pytest.approx([0.1, 0.9]), pytest.approx([0.4, 0.6])]
-    assert outputs["success_probs"][0][-1] == pytest.approx(torch.sigmoid(torch.tensor(5.0)).item(), abs=1e-3)
-    assert outputs["success_probs"][1][-1] == pytest.approx(
-        torch.sigmoid(torch.tensor(-5.0)).item(), abs=1e-3
-    )
-
-
-def test_decode_progress_outputs_discrete_mode_softmaxes_over_bins():
-    # 2 frames, peaks at bin 0 and bin 9 → continuous predictions 0.0 and 1.0
-    bin_logits = torch.full((1, 2, 10), -10.0)
-    bin_logits[0, 0, 0] = 10.0
-    bin_logits[0, 1, -1] = 10.0
-
-    outputs = decode_progress_outputs(bin_logits, success_logits=None, is_discrete_mode=True)
-
-    assert outputs["success_probs"] == []
-    assert outputs["progress_pred"][0] == pytest.approx([0.0, 1.0], abs=1e-3)
-
-
-@skip_if_package_missing("transformers")
-def test_robometer_post_init_overwrites_vocab_size_with_tokenizer_length(monkeypatch):
-    """``RobometerConfig.__post_init__`` must overwrite the backbone's stale
-    ``text_config.vocab_size`` (which on the real Qwen3-VL config is the
-    padded embedding size, ``151,936``) with ``len(tokenizer) + 5``. This is
-    the contract that makes the published ``Robometer-4B`` checkpoint load
-    byte-equivalently."""
-    _patch_build(monkeypatch)
-
-    cfg = RobometerConfig(device="cpu", progress_loss_type="l2")
-
-    assert cfg.vlm_config["text_config"]["vocab_size"] == _EXPECTED_RESIZED_VOCAB
-
-
-@skip_if_package_missing("transformers")
-def test_robometer_compute_reward_reads_pre_encoded_inputs(monkeypatch):
-    from lerobot.rewards.robometer.modeling_robometer import RobometerRewardModel
-
-    progress = torch.tensor([[0.1, 0.9], [0.4, 0.6]])
-    success_logits = torch.tensor([[0.0, 5.0], [0.0, -5.0]])
-    _patch_build(monkeypatch)
-
-    cfg = RobometerConfig(device="cpu", reward_output="progress", progress_loss_type="l2")
-    model = RobometerRewardModel(cfg)
-    # Bypass the Qwen3-VL forward + head extraction with deterministic logits.
-    monkeypatch.setattr(model, "_compute_rbm_logits", lambda _inputs: (progress, success_logits))
-
-    batch = _make_batch({"input_ids": torch.zeros(2, 2, dtype=torch.long)})
-    rewards = model.compute_reward(batch)
-
-    assert torch.allclose(rewards, torch.tensor([0.9, 0.6]))
-
-
-@skip_if_package_missing("transformers")
-def test_robometer_compute_reward_can_return_binary_success(monkeypatch):
-    from lerobot.rewards.robometer.modeling_robometer import RobometerRewardModel
-
-    progress = torch.tensor([[0.1, 0.9], [0.4, 0.6]])
-    success_logits = torch.tensor([[0.0, 5.0], [0.0, -5.0]])  # sigmoid(5) > 0.5; sigmoid(-5) < 0.5
-    _patch_build(monkeypatch)
-
-    cfg = RobometerConfig(
-        device="cpu",
-        reward_output="success",
-        success_threshold=0.5,
-        progress_loss_type="l2",
-    )
-    model = RobometerRewardModel(cfg)
-    monkeypatch.setattr(model, "_compute_rbm_logits", lambda _inputs: (progress, success_logits))
-
-    batch = _make_batch({"input_ids": torch.zeros(2, 2, dtype=torch.long)})
-    rewards = model.compute_reward(batch)
-
-    assert torch.equal(rewards, torch.tensor([1.0, 0.0]))
-
-
-@skip_if_package_missing("transformers")
-def test_robometer_compute_reward_errors_when_inputs_missing(monkeypatch):
-    from lerobot.rewards.robometer.modeling_robometer import RobometerRewardModel
-
-    _patch_build(monkeypatch)
-
-    cfg = RobometerConfig(device="cpu", progress_loss_type="l2")
-    model = RobometerRewardModel(cfg)
-
-    with pytest.raises(KeyError, match=r"observation\.robometer\.input_ids"):
-        model.compute_reward({})
-
-
-@skip_if_package_missing("transformers")
-def test_robometer_save_pretrained_roundtrips(monkeypatch, tmp_path):
-    """Saving and reloading a Robometer model in LeRobot HF format must produce
-    a single ``model.safetensors`` + ``config.json`` (no Hydra ``config.yaml``),
-    must round-trip user-tunable config fields, and must persist all three
-    prediction heads (``progress_head``, ``success_head``, ``preference_head``)
-    so the published ``Robometer-4B`` checkpoint loads byte-equivalently.
-    """
-    from huggingface_hub.constants import CONFIG_NAME, SAFETENSORS_SINGLE_FILE
-    from safetensors.torch import load_file
-
-    from lerobot.rewards.robometer.modeling_robometer import RobometerRewardModel
-
-    _patch_build(monkeypatch)
-    cfg = RobometerConfig(
-        device="cpu",
-        pretrained_path="robometer/Robometer-4B",
-        # Knobs the user might tweak — must survive the round-trip.
-        image_key="observation.images.cam_top",
-        task_key="task",
-        reward_output="success",
-        success_threshold=0.7,
-        progress_loss_type="l2",
-    )
-    model = RobometerRewardModel(cfg)
-    model.save_pretrained(str(tmp_path))
-
-    # Exactly the files LeRobot's HubMixin promises.
-    assert (tmp_path / CONFIG_NAME).exists()
-    assert (tmp_path / SAFETENSORS_SINGLE_FILE).exists()
-    assert not (tmp_path / "config.yaml").exists()  # we want HF-style, not Hydra
-
-    # All three heads must be present in the saved safetensors. The preference
-    # head is unused at inference but the published checkpoint expects its
-    # rows — losing it would silently break weight loading.
-    state = load_file(str(tmp_path / SAFETENSORS_SINGLE_FILE))
-    assert any(k.startswith("progress_head.") for k in state), "progress_head weights missing"
-    assert any(k.startswith("success_head.") for k in state), "success_head weights missing"
-    assert any(k.startswith("preference_head.") for k in state), "preference_head weights missing"
-
-    # Reload from the local directory: no Hub fetch, no YAML overlay. The
-    # base class drives subclass dispatch via the `type` field in config.json.
-    reloaded_cfg = RewardModelConfig.from_pretrained(str(tmp_path))
-    assert isinstance(reloaded_cfg, RobometerConfig)
-    reloaded_cfg.pretrained_path = str(tmp_path)  # mimic lerobot-train's `validate()`
-    reloaded = RobometerRewardModel.from_pretrained(str(tmp_path), config=reloaded_cfg)
-
-    assert reloaded.config.image_key == "observation.images.cam_top"
-    assert reloaded.config.task_key == "task"
-    assert reloaded.config.reward_output == "success"
-    assert reloaded.config.success_threshold == 0.7
-    assert reloaded.config.progress_loss_type == "l2"  # came back from config.json
--- a/tests/rewards/test_modeling_topreward.py
+++ b/tests/rewards/test_modeling_topreward.py
@@ -1,296 +0,0 @@
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-"""Tests for the TOPReward reward model."""
-
-from __future__ import annotations
-
-from types import SimpleNamespace
-
-import pytest
-import torch
-
-from lerobot.configs.rewards import RewardModelConfig
-from lerobot.rewards.factory import get_reward_model_class, make_reward_model_config
-from lerobot.rewards.topreward import TOPRewardConfig
-from lerobot.rewards.topreward.processor_topreward import TOPREWARD_FEATURE_PREFIX, TOPREWARD_INPUT_KEYS
-from tests.utils import skip_if_package_missing
-
-
-class _FakeQwenModel(torch.nn.Module):
-    """Stand-in for ``Qwen3VLForConditionalGeneration``.
-
-    Returns a ``SimpleNamespace`` with ``logits`` of a controlled shape so
-    the log-prob extraction path in ``compute_reward`` can be exercised
-    without downloading real VLM weights.
-    """
-
-    def __init__(self) -> None:
-        super().__init__()
-        self._param = torch.nn.Parameter(torch.zeros(1))
-        self._reward_value: float = -1.5
-
-    @classmethod
-    def from_pretrained(cls, *args, **kwargs):  # noqa: ARG003
-        return cls()
-
-    def forward(  # noqa: ARG002
-        self, input_ids, attention_mask=None, labels=None, logits_to_keep=0, **kwargs
-    ):
-        batch_size, seq_len = input_ids.shape
-        vocab_size = 1000
-        logits = torch.zeros(batch_size, seq_len, vocab_size)
-        # Place a controlled log-prob at the target token position so the
-        # model returns a predictable reward value.
-        # The label-masked suffix is the last token.
-        # After the causal-LM shift (logits[:, :-1], labels[:, 1:]) the scored
-        # position is logits[:, -2, :] predicting labels[:, -1].
-        # We give the target token a large positive logit so its log_softmax is close to 0.
-        for i in range(batch_size):
-            target_idx = int(input_ids[i, -1].item())
-            logits[i, -2, target_idx] = self._reward_value * -10  # high logit -> high log-prob
-        if logits_to_keep:
-            logits = logits[:, -logits_to_keep:, :]
-        return SimpleNamespace(logits=logits)
-
-
-def _patch_build(monkeypatch) -> None:
-    """Stub out ``Qwen3VLForConditionalGeneration`` so TOPReward construction is cheap and offline."""
-    from lerobot.rewards.topreward import modeling_topreward
-
-    monkeypatch.setattr(modeling_topreward, "Qwen3VLForConditionalGeneration", _FakeQwenModel)
-
-
-def _make_batch(
-    input_ids: torch.Tensor,
-    attention_mask: torch.Tensor | None = None,
-    labels: torch.Tensor | None = None,
-    *,
-    omit: str | None = None,
-) -> dict[str, torch.Tensor]:
-    """Build a ``compute_reward``-ready batch using TOPReward's namespaced keys."""
-    batch_size, seq_len = input_ids.shape
-    if attention_mask is None:
-        attention_mask = torch.ones(batch_size, seq_len, dtype=torch.long)
-    batch: dict[str, torch.Tensor] = {}
-    if labels is not None:
-        batch[f"{TOPREWARD_FEATURE_PREFIX}labels"] = labels
-    batch.update(
-        {
-            f"{TOPREWARD_FEATURE_PREFIX}input_ids": input_ids,
-            f"{TOPREWARD_FEATURE_PREFIX}attention_mask": attention_mask,
-            f"{TOPREWARD_FEATURE_PREFIX}pixel_values_videos": torch.zeros(
-                batch_size, 1536, dtype=torch.float32
-            ),
-            f"{TOPREWARD_FEATURE_PREFIX}video_grid_thw": torch.ones(batch_size, 3, dtype=torch.long),
-            f"{TOPREWARD_FEATURE_PREFIX}mm_token_type_ids": torch.zeros_like(input_ids),
-        }
-    )
-    if omit is not None:
-        batch.pop(f"{TOPREWARD_FEATURE_PREFIX}{omit}", None)
-    return batch
-
-
-def _terminal_labels(input_ids: torch.Tensor) -> torch.Tensor:
-    labels = torch.full_like(input_ids, -100)
-    labels[:, -1] = input_ids[:, -1]
-    return labels
-
-
-# ---------------------------------------------------------------------------
-# Registry + factory
-# ---------------------------------------------------------------------------
-
-
-def test_topreward_config_registered():
-    assert "topreward" in RewardModelConfig.get_known_choices()
-    assert RewardModelConfig.get_choice_class("topreward") is TOPRewardConfig
-    assert isinstance(make_reward_model_config("topreward", device="cpu"), TOPRewardConfig)
-
-
-def test_topreward_factory_returns_in_tree_class():
-    from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel
-
-    assert get_reward_model_class("topreward") is TOPRewardModel
-
-
-# ---------------------------------------------------------------------------
-# Config validation
-# ---------------------------------------------------------------------------
-
-
-def test_topreward_config_rejects_zero_max_frames():
-    with pytest.raises(ValueError, match="max_frames must be >= 1"):
-        TOPRewardConfig(device="cpu", max_frames=0)
-
-
-def test_topreward_config_rejects_non_positive_fps():
-    with pytest.raises(ValueError, match="fps must be > 0"):
-        TOPRewardConfig(device="cpu", fps=0.0)
-
-
-def test_topreward_config_rejects_suffix_without_instruction_placeholder():
-    with pytest.raises(ValueError, match=r"\{instruction\}"):
-        TOPRewardConfig(device="cpu", prompt_suffix_template="no placeholder here")
-
-
-# ---------------------------------------------------------------------------
-# compute_reward
-# ---------------------------------------------------------------------------
-
-
-@skip_if_package_missing("transformers")
-def test_topreward_compute_reward_returns_one_scalar_per_sample(monkeypatch):
-    """``compute_reward`` must return a ``(B,)`` float32 tensor with one
-    log-prob reward per sample, consuming pre-encoded Qwen-VL tensors."""
-    from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel
-
-    _patch_build(monkeypatch)
-    cfg = TOPRewardConfig(device="cpu")
-    model = TOPRewardModel(cfg)
-
-    input_ids = torch.randint(0, 100, (2, 10))
-    attention_mask = torch.ones(2, 10, dtype=torch.long)
-    labels = _terminal_labels(input_ids)
-
-    batch = _make_batch(input_ids, attention_mask, labels)
-    rewards = model.compute_reward(batch)
-
-    assert rewards.shape == (2,)
-    assert rewards.dtype == torch.float32
-
-
-@skip_if_package_missing("transformers")
-def test_topreward_compute_reward_applies_success_threshold(monkeypatch):
-    """When ``success_threshold`` is finite, the model returns binary success."""
-    from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel
-
-    _patch_build(monkeypatch)
-    cfg = TOPRewardConfig(device="cpu", success_threshold=0.0)
-    model = TOPRewardModel(cfg)
-
-    input_ids = torch.randint(0, 100, (2, 10))
-    attention_mask = torch.ones(2, 10, dtype=torch.long)
-    labels = _terminal_labels(input_ids)
-
-    batch = _make_batch(input_ids, attention_mask, labels)
-    rewards = model.compute_reward(batch)
-
-    assert rewards.shape == (2,)
-    assert set(rewards.tolist()).issubset({0.0, 1.0})
-
-
-@skip_if_package_missing("transformers")
-def test_topreward_compute_reward_errors_when_inputs_missing(monkeypatch):
-    from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel
-
-    _patch_build(monkeypatch)
-    cfg = TOPRewardConfig(device="cpu")
-    model = TOPRewardModel(cfg)
-
-    with pytest.raises(KeyError, match=r"observation\.topreward\.input_ids"):
-        model.compute_reward(_make_batch(torch.randint(0, 100, (1, 10)), omit="input_ids"))
-
-
-@skip_if_package_missing("transformers")
-def test_topreward_compute_reward_errors_when_labels_missing(monkeypatch):
-    from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel
-
-    _patch_build(monkeypatch)
-    cfg = TOPRewardConfig(device="cpu")
-    model = TOPRewardModel(cfg)
-
-    input_ids = torch.randint(0, 100, (1, 10))
-    with pytest.raises(KeyError, match=r"observation\.topreward\.labels"):
-        model.compute_reward(_make_batch(input_ids, labels=None))
-
-
-@skip_if_package_missing("transformers")
-def test_topreward_compute_reward_requires_all_encoder_keys(monkeypatch):
-    from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel
-
-    _patch_build(monkeypatch)
-    cfg = TOPRewardConfig(device="cpu")
-    model = TOPRewardModel(cfg)
-
-    input_ids = torch.randint(0, 100, (1, 10))
-    labels = _terminal_labels(input_ids)
-    required_encoder_keys = set(TOPREWARD_INPUT_KEYS) - {"input_ids", "labels"}
-
-    for key in required_encoder_keys:
-        with pytest.raises(KeyError, match=rf"observation\.topreward\.{key}"):
-            model.compute_reward(_make_batch(input_ids, labels=labels, omit=key))
-
-
-# ---------------------------------------------------------------------------
-# Save / load — config-only checkpoint
-# ---------------------------------------------------------------------------
-
-
-@skip_if_package_missing("transformers")
-def test_topreward_save_pretrained_writes_only_config_json(monkeypatch, tmp_path):
-    from huggingface_hub.constants import CONFIG_NAME, SAFETENSORS_SINGLE_FILE
-
-    from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel
-
-    _patch_build(monkeypatch)
-    cfg = TOPRewardConfig(
-        device="cpu",
-        vlm_name="Qwen/Qwen3-VL-8B-Instruct",
-        fps=4.0,
-        image_key="observation.images.front",
-    )
-    model = TOPRewardModel(cfg)
-    model.save_pretrained(str(tmp_path))
-
-    assert (tmp_path / CONFIG_NAME).exists()
-    assert not (tmp_path / SAFETENSORS_SINGLE_FILE).exists()
-
-
-@skip_if_package_missing("transformers")
-def test_topreward_from_pretrained_local_dir_roundtrips_config(monkeypatch, tmp_path):
-    from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel
-
-    _patch_build(monkeypatch)
-    cfg = TOPRewardConfig(
-        device="cpu",
-        vlm_name="Qwen/Qwen3-VL-8B-Instruct",
-        fps=4.0,
-        image_key="observation.images.front",
-        add_chat_template=True,
-        success_threshold=-1.5,
-    )
-    TOPRewardModel(cfg).save_pretrained(str(tmp_path))
-
-    reloaded = TOPRewardModel.from_pretrained(str(tmp_path))
-
-    assert isinstance(reloaded.config, TOPRewardConfig)
-    assert reloaded.config.vlm_name == "Qwen/Qwen3-VL-8B-Instruct"
-    assert reloaded.config.fps == 4.0
-    assert reloaded.config.image_key == "observation.images.front"
-    assert reloaded.config.add_chat_template is True
-    assert reloaded.config.success_threshold == -1.5
-
-
-@skip_if_package_missing("transformers")
-def test_topreward_is_not_trainable(monkeypatch):
-    from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel
-
-    _patch_build(monkeypatch)
-    cfg = TOPRewardConfig(device="cpu")
-    model = TOPRewardModel(cfg)
-
-    assert model.is_trainable is False
-    with pytest.raises(NotImplementedError, match="not trainable"):
-        model.forward({"x": torch.zeros(1)})
--- a/tests/rewards/test_robometer_processor.py
+++ b/tests/rewards/test_robometer_processor.py
@@ -1,354 +0,0 @@
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-"""Tests for Robometer's pre-processing helpers and encoder step.
-
-Covers the pure helpers (``_video_to_numpy`` and ``_expand_tasks``) directly,
-and exercises :class:`RobometerEncoderProcessorStep` with a stubbed
-``AutoProcessor`` so we don't need to download Qwen-VL just to test the
-dataclass plumbing (``transform_features`` / ``get_config``).
-
-The full ``__call__`` path that runs ``process_vision_info`` + the Qwen
-processor is intentionally *not* covered here — it is essentially HF glue
-that's exercised by the integration / parity scripts.
-"""
-
-from __future__ import annotations
-
-from typing import Any
-
-import numpy as np
-import pytest
-import torch
-
-from lerobot.configs import FeatureType, PipelineFeatureType, PolicyFeature
-from lerobot.rewards.robometer.processor_robometer import (
-    PROGRESS_PROMPT,
-    _expand_tasks,
-    _frames_to_pil,
-    _video_to_numpy,
-)
-from tests.utils import skip_if_package_missing
-
-
-def _skip_if_robometer_extras_missing(func):
-    """Apply both optional-dependency guards in one shot.
-
-    ``RobometerEncoderProcessorStep.__post_init__`` calls
-    ``require_package("transformers", ...)`` *and*
-    ``require_package("qwen-vl-utils", ...)``, so both need to be present
-    before we can instantiate the step.
-    """
-    func = skip_if_package_missing("qwen-vl-utils", import_name="qwen_vl_utils")(func)
-    func = skip_if_package_missing("transformers")(func)
-    return func
-
-
-# ---------------------------------------------------------------------------
-# _video_to_numpy — pure tensor → uint8 (T, H, W, C) conversion
-# ---------------------------------------------------------------------------
-
-
-def test_video_to_numpy_chw_float_is_converted_to_thwc_uint8():
-    video = torch.rand(4, 3, 8, 8)  # (T, C, H, W) floats in [0, 1]
-    array = _video_to_numpy(video, max_frames=None)
-
-    assert array.shape == (4, 8, 8, 3)
-    assert array.dtype == np.uint8
-    assert array.min() >= 0 and array.max() <= 255
-
-
-def test_video_to_numpy_already_thwc_uint8_passes_through():
-    video = torch.randint(0, 256, (3, 8, 8, 3), dtype=torch.uint8)  # (T, H, W, C)
-    array = _video_to_numpy(video, max_frames=None)
-
-    assert array.shape == (3, 8, 8, 3)
-    assert array.dtype == np.uint8
-
-
-def test_video_to_numpy_max_frames_tail_crops_recent_frames():
-    """``max_frames`` should keep the **last** K frames (most recent)."""
-    video = torch.zeros(10, 3, 4, 4)
-    for t in range(10):
-        video[t] = t / 9.0  # marker: 0 at t=0, ≈1 at t=9
-
-    array = _video_to_numpy(video, max_frames=3)
-
-    assert array.shape == (3, 4, 4, 3)
-    # The first kept frame is t=7 → marker ≈ 7/9 → uint8 ≈ 198
-    assert int(array[0, 0, 0, 0]) == int(round(7 / 9 * 255))
-    # The last kept frame is t=9 → marker = 1.0 → uint8 = 255
-    assert int(array[-1, 0, 0, 0]) == 255
-
-
-def test_video_to_numpy_rejects_3d_input():
-    with pytest.raises(ValueError, match="Expected channel dim"):
-        _video_to_numpy(torch.zeros(4, 8, 8), max_frames=None)
-
-
-def test_video_to_numpy_floats_above_one_pass_through_without_rescaling():
-    """If ``array.max() > 1`` the helper assumes the tensor is already in the
-    [0, 255] range (uint8-as-float), so values pass through unchanged."""
-    video = torch.full((1, 3, 2, 2), 5.0)
-    array = _video_to_numpy(video, max_frames=None)
-
-    assert array.shape == (1, 2, 2, 3)
-    assert int(array.max()) == 5
-
-
-def test_video_to_numpy_clips_very_large_floats_to_uint8_max():
-    """Out-of-uint8-range floats are clipped at 255 before the cast."""
-    video = torch.full((1, 3, 2, 2), 300.0)
-    array = _video_to_numpy(video, max_frames=None)
-
-    assert int(array.max()) == 255
-
-
-# ---------------------------------------------------------------------------
-# _expand_tasks — string / list / tuple broadcasting to batch size
-# ---------------------------------------------------------------------------
-
-
-def test_expand_tasks_string_is_broadcast_to_batch_size():
-    assert _expand_tasks("pick up", batch_size=3, default=None) == ["pick up", "pick up", "pick up"]
-
-
-def test_expand_tasks_list_of_matching_size_passes_through():
-    assert _expand_tasks(["a", "b", "c"], batch_size=3, default=None) == ["a", "b", "c"]
-
-
-def test_expand_tasks_tuple_is_normalised_to_list():
-    assert _expand_tasks(("a", "b"), batch_size=2, default=None) == ["a", "b"]
-
-
-def test_expand_tasks_single_element_list_is_broadcast():
-    assert _expand_tasks(["only one"], batch_size=3, default=None) == ["only one"] * 3
-
-
-def test_expand_tasks_size_mismatch_raises():
-    with pytest.raises(ValueError, match="Expected 3 tasks"):
-        _expand_tasks(["a", "b"], batch_size=3, default=None)
-
-
-def test_expand_tasks_missing_uses_default():
-    assert _expand_tasks(None, batch_size=2, default="fallback") == ["fallback", "fallback"]
-
-
-def test_expand_tasks_missing_without_default_raises():
-    with pytest.raises(KeyError, match="task description"):
-        _expand_tasks(None, batch_size=1, default=None)
-
-
-def test_expand_tasks_wrong_type_raises():
-    with pytest.raises(TypeError, match="must be a string or list"):
-        _expand_tasks(42, batch_size=1, default=None)
-
-
-# ---------------------------------------------------------------------------
-# _frames_to_pil — uint8 (T, H, W, C) → list[PIL.Image]
-# ---------------------------------------------------------------------------
-
-
-def test_frames_to_pil_returns_one_image_per_frame():
-    frames = np.zeros((4, 8, 8, 3), dtype=np.uint8)
-    images = _frames_to_pil(frames)
-
-    assert len(images) == 4
-    assert all(img.size == (8, 8) for img in images)
-
-
-def test_frames_to_pil_casts_floats_to_uint8():
-    frames = np.full((2, 4, 4, 3), 200.0, dtype=np.float32)
-    images = _frames_to_pil(frames)
-
-    assert len(images) == 2
-    # PIL converted from clipped uint8 - sanity check pixel values come through.
-    assert np.asarray(images[0]).dtype == np.uint8
-
-
-def test_frames_to_pil_rejects_non_4d_input():
-    with pytest.raises(ValueError, match=r"\(T,H,W,C\)"):
-        _frames_to_pil(np.zeros((4, 8, 8), dtype=np.uint8))
-
-
-# ---------------------------------------------------------------------------
-# Encoder step plumbing — exercise dataclass surface with a stubbed AutoProcessor
-# ---------------------------------------------------------------------------
-
-
-class _FakeTokenizer:
-    """Tokenizer surface the encoder step touches in ``__post_init__``."""
-
-    def __init__(self) -> None:
-        self.pad_token: str | None = None
-        self.eos_token = "<|endoftext|>"
-        self._vocab: dict[str, int] = {"<|endoftext|>": 0}
-        self.added: list[str] = []
-
-    def get_vocab(self) -> dict[str, int]:
-        return self._vocab
-
-    def add_special_tokens(self, payload: dict[str, Any]) -> int:
-        for token in payload.get("additional_special_tokens", []):
-            if token not in self._vocab:
-                self._vocab[token] = len(self._vocab)
-                self.added.append(token)
-        return len(self.added)
-
-
-class _FakeAutoProcessor:
-    """Stand-in returned by ``AutoProcessor.from_pretrained`` during tests."""
-
-    def __init__(self) -> None:
-        self.tokenizer = _FakeTokenizer()
-        self.image_processor = None
-        self.video_processor = None
-
-    @classmethod
-    def from_pretrained(cls, *args, **kwargs):  # noqa: ARG003
-        return cls()
-
-
-def _build_step(monkeypatch, **overrides):
-    from lerobot.rewards.robometer import processor_robometer
-
-    monkeypatch.setattr(processor_robometer, "AutoProcessor", _FakeAutoProcessor)
-
-    return processor_robometer.RobometerEncoderProcessorStep(**overrides)
-
-
-@_skip_if_robometer_extras_missing
-def test_encoder_step_registers_special_tokens_on_tokenizer(monkeypatch):
-    """``__post_init__`` must register Robometer's five special tokens on the
-    tokenizer that ships with the chosen Qwen-VL checkpoint."""
-    from lerobot.rewards.robometer.configuration_robometer import ROBOMETER_SPECIAL_TOKENS
-
-    step = _build_step(monkeypatch)
-
-    vocab = step._processor.tokenizer.get_vocab()
-    for token in ROBOMETER_SPECIAL_TOKENS:
-        assert token in vocab, f"{token} not registered on the tokenizer"
-
-
-@_skip_if_robometer_extras_missing
-def test_encoder_step_sets_pad_token_to_eos_when_missing(monkeypatch):
-    """Qwen tokenizers ship without a pad token; the step must reuse EOS so
-    batched processing doesn't crash on padding."""
-    step = _build_step(monkeypatch)
-
-    assert step._processor.tokenizer.pad_token == "<|endoftext|>"
-
-
-@_skip_if_robometer_extras_missing
-def test_encoder_step_get_config_roundtrips_user_fields(monkeypatch):
-    """``get_config`` must serialise every user-tunable field — these are what
-    the processor pipeline saves under ``preprocessor_config.json``."""
-    step = _build_step(
-        monkeypatch,
-        base_model_id="Qwen/Qwen3-VL-4B-Instruct",
-        image_key="observation.images.cam_top",
-        task_key="task",
-        default_task="do the thing",
-        max_frames=12,
-        use_multi_image=True,
-        use_per_frame_progress_token=True,
-        max_length=2048,
-    )
-
-    cfg = step.get_config()
-    assert cfg == {
-        "base_model_id": "Qwen/Qwen3-VL-4B-Instruct",
-        "image_key": "observation.images.cam_top",
-        "task_key": "task",
-        "default_task": "do the thing",
-        "max_frames": 12,
-        "use_multi_image": True,
-        "use_per_frame_progress_token": True,
-        "max_length": 2048,
-    }
-
-
-@_skip_if_robometer_extras_missing
-def test_encoder_step_transform_features_is_identity(monkeypatch):
-    """The encoder step writes Qwen tensors into ``observation`` at call time,
-    but it does **not** advertise new typed features at pipeline-build time —
-    the downstream model consumes them via the ``ROBOMETER_FEATURE_PREFIX``
-    namespace, not via the typed feature map.
-    """
-    step = _build_step(monkeypatch)
-
-    features = {
-        PipelineFeatureType.OBSERVATION: {
-            "observation.images.top": PolicyFeature(shape=(3, 224, 224), type=FeatureType.VISUAL),
-        }
-    }
-    assert step.transform_features(features) == features
-
-
-@_skip_if_robometer_extras_missing
-def test_encoder_step_build_conversation_inserts_prog_token_per_frame(monkeypatch):
-    """In multi-image mode with per-frame progress tokens, the conversation
-    must alternate ``image`` and ``<|prog_token|>`` text entries, one pair
-    per frame, after the task prompt."""
-    step = _build_step(
-        monkeypatch,
-        use_multi_image=True,
-        use_per_frame_progress_token=True,
-    )
-
-    frames = np.zeros((3, 8, 8, 3), dtype=np.uint8)
-    conversation = step._build_conversation(frames, task="pick up the cube")
-
-    assert len(conversation) == 1 and conversation[0]["role"] == "user"
-    content = conversation[0]["content"]
-
-    # First entry is the task prompt.
-    assert content[0] == {"type": "text", "text": PROGRESS_PROMPT.format(task="pick up the cube")}
-
-    # Then 3 (image, <|prog_token|>) pairs.
-    expected_tail = [
-        item
-        for _ in range(3)
-        for item in (
-            {"type": "image"},  # value asserted below
-            {"type": "text", "text": "<|prog_token|>"},
-        )
-    ]
-    assert len(content) == 1 + len(expected_tail)
-    for got, exp in zip(content[1:], expected_tail, strict=True):
-        assert got["type"] == exp["type"]
-        if exp["type"] == "text":
-            assert got["text"] == exp["text"]
-
-
-@_skip_if_robometer_extras_missing
-def test_encoder_step_build_conversation_video_mode_uses_single_video_entry(monkeypatch):
-    """When ``use_multi_image=False``, frames are bundled into a single
-    ``video`` content entry instead of individual ``image`` entries."""
-    step = _build_step(
-        monkeypatch,
-        use_multi_image=False,
-        use_per_frame_progress_token=False,
-    )
-
-    frames = np.zeros((4, 8, 8, 3), dtype=np.uint8)
-    conversation = step._build_conversation(frames, task="pour the water")
-
-    content = conversation[0]["content"]
-    # Exactly two entries: the prompt and one video entry.
-    assert len(content) == 2
-    assert content[0]["type"] == "text"
-    assert content[1]["type"] == "video"
-    # The video entry carries all four frames.
-    assert len(content[1]["video"]) == 4
--- a/tests/rewards/test_topreward.py
+++ b/tests/rewards/test_topreward.py
@@ -1,80 +0,0 @@
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-"""End-to-end TOPReward smoke test with the real Qwen3-VL model."""
-
-import os
-
-import pytest
-import torch
-
-pytest.importorskip("transformers")
-
-from lerobot.rewards.topreward.configuration_topreward import TOPRewardConfig  # noqa: E402
-from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel  # noqa: E402
-from lerobot.rewards.topreward.processor_topreward import (  # noqa: E402
-    TOPREWARD_FEATURE_PREFIX,
-    TOPREWARD_INPUT_KEYS,
-    make_topreward_pre_post_processors,
-)
-from tests.utils import require_cuda  # noqa: E402
-
-pytestmark = pytest.mark.skipif(
-    os.environ.get("CI") == "true" or os.environ.get("GITHUB_ACTIONS") == "true",
-    reason="This test requires downloading and loading Qwen3-VL and is not meant for CI",
-)
-
-
-def _make_dummy_topreward_batch(image_key: str, task_key: str) -> dict[str, object]:
-    num_frames = 4
-    image_size = 64
-    frames = torch.zeros(1, num_frames, 3, image_size, image_size, dtype=torch.uint8)
-    for frame_idx in range(num_frames):
-        frames[0, frame_idx, 0].fill_(min(frame_idx * 48, 255))
-        frames[0, frame_idx, 1].fill_(96)
-        frames[0, frame_idx, 2].fill_(192)
-
-    return {
-        image_key: frames,
-        task_key: ["pick up the red cube"],
-    }
-
-
-@require_cuda
-def test_topreward_full_qwen3vl_preprocessor_to_compute_reward():
-    cfg = TOPRewardConfig(
-        vlm_name="Qwen/Qwen3-VL-8B-Instruct",
-        device="cuda",
-        max_frames=4,
-        fps=2.0,
-        max_input_length=4096,
-    )
-
-    preprocessor, _ = make_topreward_pre_post_processors(cfg)
-    encoded_batch = preprocessor(_make_dummy_topreward_batch(cfg.image_key, cfg.task_key))
-    for key in TOPREWARD_INPUT_KEYS:
-        assert f"{TOPREWARD_FEATURE_PREFIX}{key}" in encoded_batch
-
-    model = TOPRewardModel(cfg)
-    try:
-        model.to(cfg.device)
-        model.eval()
-        rewards = model.compute_reward(encoded_batch)
-    finally:
-        del model
-        torch.cuda.empty_cache()
-
-    assert rewards.shape == (1,)
-    assert rewards.dtype == torch.float32
-    assert torch.isfinite(rewards).all()
--- a/tests/rewards/test_topreward_processor.py
+++ b/tests/rewards/test_topreward_processor.py
@@ -1,246 +0,0 @@
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-"""Tests for TOPReward's pre-processing helpers and encoder step."""
-
-from __future__ import annotations
-
-import pytest
-import torch
-
-from lerobot.configs import FeatureType, PipelineFeatureType, PolicyFeature
-from lerobot.rewards.topreward.processor_topreward import (
-    TOPREWARD_FEATURE_PREFIX,
-    TOPREWARD_INPUT_KEYS,
-    _expand_tasks,
-    _prepare_video_batch,
-)
-from lerobot.types import TransitionKey
-from tests.utils import skip_if_package_missing
-
-# ---------------------------------------------------------------------------
-# _prepare_video_batch — raw image/video batch -> (B, T, C, H, W) uint8
-# ---------------------------------------------------------------------------
-
-
-def test_prepare_video_batch_batched_chw_float_is_converted_to_uint8():
-    video = torch.rand(2, 4, 3, 8, 8)
-    tensor = _prepare_video_batch(video, max_frames=None)
-
-    assert tensor.shape == (2, 4, 3, 8, 8)
-    assert tensor.dtype == torch.uint8
-    assert tensor.min() >= 0 and tensor.max() <= 255
-
-
-def test_prepare_video_batch_batched_thwc_uint8_is_permuted_to_channel_first():
-    video = torch.randint(0, 256, (2, 3, 8, 8, 3), dtype=torch.uint8)
-    tensor = _prepare_video_batch(video, max_frames=None)
-
-    assert tensor.shape == (2, 3, 3, 8, 8)
-    assert tensor.dtype == torch.uint8
-
-
-def test_prepare_video_batch_max_frames_tail_crops_recent_frames():
-    video = torch.zeros(1, 10, 3, 4, 4)
-    for t in range(10):
-        video[:, t] = t / 9.0
-
-    tensor = _prepare_video_batch(video, max_frames=3)
-
-    assert tensor.shape == (1, 3, 3, 4, 4)
-    assert int(tensor[0, 0, 0, 0, 0]) == int(7 / 9 * 255)
-    assert int(tensor[0, -1, 0, 0, 0]) == 255
-
-
-def test_prepare_video_batch_rejects_3d_input():
-    with pytest.raises(ValueError, match="Expected TOPReward frames"):
-        _prepare_video_batch(torch.zeros(4, 8, 8), max_frames=None)
-
-
-def test_prepare_video_batch_floats_above_one_are_rescaled_and_clipped():
-    video = torch.full((1, 1, 3, 2, 2), 5.0)
-    tensor = _prepare_video_batch(video, max_frames=None)
-
-    assert tensor.shape == (1, 1, 3, 2, 2)
-    assert int(tensor.max()) == 255
-
-
-def test_prepare_video_batch_clips_very_large_floats_to_uint8_max():
-    video = torch.full((1, 1, 3, 2, 2), 300.0)
-    tensor = _prepare_video_batch(video, max_frames=None)
-
-    assert int(tensor.max()) == 255
-
-
-# ---------------------------------------------------------------------------
-# _expand_tasks — string / list / tuple broadcasting to batch size
-# ---------------------------------------------------------------------------
-
-
-def test_expand_tasks_string_is_broadcast_to_batch_size():
-    assert _expand_tasks("pick up", batch_size=3, default=None) == ["pick up", "pick up", "pick up"]
-
-
-def test_expand_tasks_list_of_matching_size_passes_through():
-    assert _expand_tasks(["a", "b", "c"], batch_size=3, default=None) == ["a", "b", "c"]
-
-
-def test_expand_tasks_tuple_is_normalised_to_list():
-    assert _expand_tasks(("a", "b"), batch_size=2, default=None) == ["a", "b"]
-
-
-def test_expand_tasks_single_element_list_is_broadcast():
-    assert _expand_tasks(["only one"], batch_size=3, default=None) == ["only one"] * 3
-
-
-def test_expand_tasks_size_mismatch_raises():
-    with pytest.raises(ValueError, match="Expected 3 tasks"):
-        _expand_tasks(["a", "b"], batch_size=3, default=None)
-
-
-def test_expand_tasks_missing_uses_default():
-    assert _expand_tasks(None, batch_size=2, default="fallback") == ["fallback", "fallback"]
-
-
-def test_expand_tasks_missing_without_default_raises():
-    with pytest.raises(KeyError, match="task description"):
-        _expand_tasks(None, batch_size=1, default=None)
-
-
-def test_expand_tasks_wrong_type_raises():
-    with pytest.raises(TypeError, match="must be a string or list"):
-        _expand_tasks(42, batch_size=1, default=None)
-
-
-# ---------------------------------------------------------------------------
-# Encoder step — stubbed AutoProcessor
-# ---------------------------------------------------------------------------
-
-
-def _skip_if_topreward_extras_missing(func):
-    func = skip_if_package_missing("transformers")(func)
-    return func
-
-
-class _FakeTokenizer:
-    eos_token = "<|endoftext|>"
-    pad_token = "<|endoftext|>"
-
-    def __call__(self, *args, **kwargs):
-        return {"input_ids": torch.zeros(1, 10, dtype=torch.long)}
-
-
-class _FakeAutoProcessor:
-    def __init__(self) -> None:
-        self.tokenizer = _FakeTokenizer()
-
-    @classmethod
-    def from_pretrained(cls, *args, **kwargs):  # noqa: ARG003
-        return cls()
-
-    def apply_chat_template(self, messages, **kwargs):  # noqa: ARG002
-        return "fake_prompt_text"
-
-    def __call__(self, text=None, images=None, videos=None, **kwargs):  # noqa: ARG002
-        seq_len = 10
-        batch_size = len(text) if isinstance(text, list) else 1
-        return {
-            "input_ids": torch.randint(0, 100, (batch_size, seq_len)),
-            "attention_mask": torch.ones(batch_size, seq_len, dtype=torch.long),
-            "pixel_values_videos": torch.zeros(batch_size, 1536, dtype=torch.float32),
-            "video_grid_thw": torch.ones(batch_size, 3, dtype=torch.long),
-            "mm_token_type_ids": torch.zeros(batch_size, seq_len, dtype=torch.long),
-        }
-
-
-def _build_step(monkeypatch, **overrides):
-    from lerobot.rewards.topreward import processor_topreward
-
-    monkeypatch.setattr(processor_topreward, "AutoProcessor", _FakeAutoProcessor)
-    return processor_topreward.TOPRewardEncoderProcessorStep(**overrides)
-
-
-def _make_transition(observation: dict, complementary: dict | None = None) -> dict:
-    transition: dict = {TransitionKey.OBSERVATION: observation}
-    if complementary is not None:
-        transition[TransitionKey.COMPLEMENTARY_DATA] = complementary
-    return transition
-
-
-@_skip_if_topreward_extras_missing
-def test_encoder_step_emits_input_ids_and_labels(monkeypatch):
-    """The processor must emit Qwen-VL tensors including ``input_ids`` and
-    ``labels`` under the ``observation.topreward.*`` namespace."""
-    step = _build_step(monkeypatch)
-
-    frames_batch = torch.zeros(2, 4, 3, 8, 8)
-    out = step(
-        _make_transition(
-            observation={"observation.images.top": frames_batch},
-            complementary={"task": ["pick", "place"]},
-        )
-    )
-
-    obs_out = out[TransitionKey.OBSERVATION]
-    for key in TOPREWARD_INPUT_KEYS:
-        assert f"{TOPREWARD_FEATURE_PREFIX}{key}" in obs_out
-
-    input_ids = obs_out[f"{TOPREWARD_FEATURE_PREFIX}input_ids"]
-    labels = obs_out[f"{TOPREWARD_FEATURE_PREFIX}labels"]
-    assert labels.dtype == torch.long
-    assert labels.shape == (2, 10)
-    assert labels[:, :-1].eq(-100).all()
-    assert labels[:, -1].equal(input_ids[:, -1])
-
-
-@_skip_if_topreward_extras_missing
-def test_encoder_step_get_config_roundtrips_user_fields(monkeypatch):
-    step = _build_step(
-        monkeypatch,
-        vlm_name="Qwen/Qwen3-VL-8B-Instruct",
-        image_key="observation.images.cam_top",
-        task_key="task",
-        default_task="do the thing",
-        max_frames=8,
-        fps=4.0,
-        add_chat_template=True,
-        max_length=2048,
-    )
-
-    cfg = step.get_config()
-    assert cfg["vlm_name"] == "Qwen/Qwen3-VL-8B-Instruct"
-    assert cfg["image_key"] == "observation.images.cam_top"
-    assert cfg["default_task"] == "do the thing"
-    assert cfg["max_frames"] == 8
-    assert cfg["fps"] == 4.0
-    assert cfg["add_chat_template"] is True
-    assert cfg["max_length"] == 2048
-
-
-@_skip_if_topreward_extras_missing
-def test_encoder_step_transform_features_is_identity(monkeypatch):
-    step = _build_step(monkeypatch)
-    features = {
-        PipelineFeatureType.OBSERVATION: {
-            "observation.images.top": PolicyFeature(shape=(3, 224, 224), type=FeatureType.VISUAL),
-        }
-    }
-    assert step.transform_features(features) == features
-
-
-@_skip_if_topreward_extras_missing
-def test_encoder_step_rejects_missing_image_key(monkeypatch):
-    step = _build_step(monkeypatch, image_key="observation.images.top")
-    with pytest.raises(KeyError, match="image key"):
-        step(_make_transition(observation={}, complementary={"task": "pick"}))
--- a/tests/test_yaml_policy_path.py
+++ b/tests/test_yaml_policy_path.py
@@ -1,14 +1,10 @@
 """Tests for policy.path support in YAML config files (issue #2957)."""

 import json
-import sys
 import tempfile
-from dataclasses import dataclass, field
-from unittest.mock import patch

 import yaml

-from lerobot.configs import parser
 from lerobot.configs.parser import (
    _config_path_args,
    _config_yaml_overrides,
@@ -20,8 +16,7 @@ from lerobot.configs.parser import (


 def test_extract_path_fields_from_yaml():
-    """Test that policy.path is extracted from a YAML config and the policy block
-    is removed entirely (siblings are captured separately as cli_overrides)."""
+    """Test that policy.path is extracted from a YAML config and removed."""
    config = {
        "dataset": {"repo_id": "lerobot/pusht"},
        "policy": {"type": "smolvla", "path": "lerobot/smolvla_base", "push_to_hub": False},
@@ -31,33 +26,26 @@ def test_extract_path_fields_from_yaml():
        config_path = f.name

    _config_path_args.clear()
-    _config_yaml_overrides.clear()
    cleaned_path = extract_path_fields_from_config(config_path, ["policy"])

    # Path should be extracted and stored
    assert _config_path_args["policy"] == "lerobot/smolvla_base"

-    # Cleaned config should not have the policy block at all -- draccus must not
-    # try to decode it as PreTrainedConfig; the actual config comes from
-    # from_pretrained(path) with the captured overrides applied on top.
+    # Cleaned config should not have the path field
    with open(cleaned_path) as f:
        cleaned = yaml.safe_load(f)
-    assert "policy" not in cleaned
+    assert "path" not in cleaned["policy"]
+    assert cleaned["policy"]["type"] == "smolvla"
+    assert cleaned["policy"]["push_to_hub"] is False

    # Original dataset should be untouched
    assert cleaned["dataset"]["repo_id"] == "lerobot/pusht"

-    # Sibling overrides (excluding type/path) captured for from_pretrained.
-    overrides = get_yaml_overrides("policy")
-    assert any("push_to_hub=false" in o for o in overrides)
-
    _config_path_args.clear()
-    _config_yaml_overrides.clear()


 def test_extract_path_fields_from_json():
-    """Test that policy.path is extracted from a JSON config and the policy
-    block is removed entirely."""
+    """Test that policy.path is extracted from a JSON config."""
    config = {
        "policy": {"type": "act", "path": "some/local/path"},
    }
@@ -66,17 +54,15 @@ def test_extract_path_fields_from_json():
        config_path = f.name

    _config_path_args.clear()
-    _config_yaml_overrides.clear()
    cleaned_path = extract_path_fields_from_config(config_path, ["policy"])

    assert _config_path_args["policy"] == "some/local/path"

    with open(cleaned_path) as f:
        cleaned = json.load(f)
-    assert "policy" not in cleaned
+    assert "path" not in cleaned["policy"]

    _config_path_args.clear()
-    _config_yaml_overrides.clear()


 def test_extract_no_path_returns_original():
@@ -230,91 +216,3 @@ def test_flatten_nested_with_bools():
    args = _flatten_to_cli_args(d)
    assert "--optimizer.use_warmup=true" in args
    assert "--optimizer.lr=0.01" in args
-
-
-def test_extract_removes_field_with_siblings_and_no_type():
-    """Regression: when policy.path has siblings but no type:, the entire policy
-    block must still be removed from the cleaned config. Otherwise draccus tries
-    to decode the leftover dict as PreTrainedConfig and crashes on the missing
-    type discriminator.
-    """
-    config = {
-        "dataset": {"repo_id": "lerobot/pusht"},
-        "policy": {
-            "path": "lerobot/smolvla_base",
-            "n_action_steps": 10,
-            "dtype": "bfloat16",
-        },
-    }
-    with tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False) as f:
-        yaml.dump(config, f)
-        config_path = f.name
-
-    _config_path_args.clear()
-    _config_yaml_overrides.clear()
-    cleaned_path = extract_path_fields_from_config(config_path, ["policy"])
-
-    with open(cleaned_path) as f:
-        cleaned = yaml.safe_load(f) or {}
-    assert "policy" not in cleaned, "policy block should be fully removed when path is present"
-    assert cleaned["dataset"]["repo_id"] == "lerobot/pusht"
-    assert _config_path_args["policy"] == "lerobot/smolvla_base"
-    overrides = get_yaml_overrides("policy")
-    assert any("n_action_steps=10" in o for o in overrides)
-    assert any("dtype=bfloat16" in o for o in overrides)
-
-    _config_path_args.clear()
-    _config_yaml_overrides.clear()
-
-
-@dataclass
-class _DummyNested:
-    foo: int = 0
-
-
-@dataclass
-class _DummyConfig:
-    nested: _DummyNested = field(default_factory=_DummyNested)
-    other: str = "default"
-
-    @classmethod
-    def __get_path_fields__(cls):
-        return ["nested"]
-
-
-def test_wrap_uses_cleaned_config_for_draccus_parse():
-    """Regression: wrap() updates config_path_cli to point at the cleaned temp
-    file but must propagate that to the draccus.parse fallback branch. Without
-    the fix, cli_args still contains --config_path=<original> and draccus reads
-    the original YAML with `path:` still in it, crashing on the unknown field.
-    """
-    config = {
-        "nested": {"path": "some/checkpoint", "foo": 42},
-        "other": "set-via-yaml",
-    }
-    with tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False) as f:
-        yaml.dump(config, f)
-        config_path = f.name
-
-    _config_path_args.clear()
-    _config_yaml_overrides.clear()
-
-    captured: dict = {}
-
-    @parser.wrap()
-    def main(cfg: _DummyConfig) -> _DummyConfig:
-        captured["cfg"] = cfg
-        return cfg
-
-    with patch.object(sys, "argv", ["prog", f"--config_path={config_path}"]):
-        main()
-
-    assert captured["cfg"].other == "set-via-yaml"
-    assert _config_path_args["nested"] == "some/checkpoint"
-    # Cleaned config dropped `nested:` entirely; defaults stand for this wrapper
-    # class (a real PreTrainedConfig would now load the checkpoint and apply
-    # the captured yaml_overrides via from_pretrained()).
-    assert captured["cfg"].nested.foo == 0
-
-    _config_path_args.clear()
-    _config_yaml_overrides.clear()
--- a/uv.lock
+++ b/uv.lock
@@ -2915,11 +2915,6 @@ metaworld = [
    { name = "scipy" },
    { name = "torchcodec", marker = "(platform_machine == 'arm64' and sys_platform == 'darwin') or (platform_machine == 'AMD64' and sys_platform == 'linux') or (platform_machine == 'aarch64' and sys_platform == 'linux') or (platform_machine == 'arm64' and sys_platform == 'linux') or (platform_machine == 'x86_64' and sys_platform == 'linux') or sys_platform == 'win32'" },
 ]
-molmoact2 = [
-    { name = "peft" },
-    { name = "scipy" },
-    { name = "transformers" },
-]
 motorbridge-dep = [
    { name = "motorbridge" },
 ]
@@ -2989,11 +2984,6 @@ rebot = [
    { name = "motorbridge" },
    { name = "motorbridge-smart-servo" },
 ]
-robometer = [
-    { name = "peft" },
-    { name = "qwen-vl-utils" },
-    { name = "transformers" },
-]
 robstride = [
    { name = "python-can" },
 ]
@@ -3019,9 +3009,6 @@ test = [
    { name = "pytest-cov" },
    { name = "pytest-timeout" },
 ]
-topreward = [
-    { name = "transformers" },
-]
 training = [
    { name = "accelerate" },
    { name = "av" },
@@ -3141,7 +3128,6 @@ requires-dist = [
    { name = "lerobot", extras = ["matplotlib-dep"], marker = "extra == 'sarm'" },
    { name = "lerobot", extras = ["matplotlib-dep"], marker = "extra == 'unitree-g1'" },
    { name = "lerobot", extras = ["metaworld"], marker = "extra == 'all'" },
-    { name = "lerobot", extras = ["molmoact2"], marker = "extra == 'all'" },
    { name = "lerobot", extras = ["motorbridge-dep"], marker = "extra == 'rebot'" },
    { name = "lerobot", extras = ["motorbridge-smart-servo-dep"], marker = "extra == 'rebot'" },
    { name = "lerobot", extras = ["multi-task-dit"], marker = "extra == 'all'" },
@@ -3149,9 +3135,7 @@ requires-dist = [
    { name = "lerobot", extras = ["openarms"], marker = "extra == 'all'" },
    { name = "lerobot", extras = ["peft"], marker = "extra == 'all'" },
    { name = "lerobot", extras = ["peft-dep"], marker = "extra == 'groot'" },
-    { name = "lerobot", extras = ["peft-dep"], marker = "extra == 'molmoact2'" },
    { name = "lerobot", extras = ["peft-dep"], marker = "extra == 'peft'" },
-    { name = "lerobot", extras = ["peft-dep"], marker = "extra == 'robometer'" },
    { name = "lerobot", extras = ["peft-dep"], marker = "extra == 'wallx'" },
    { name = "lerobot", extras = ["phone"], marker = "extra == 'all'" },
    { name = "lerobot", extras = ["pi"], marker = "extra == 'all'" },
@@ -3169,37 +3153,30 @@ requires-dist = [
    { name = "lerobot", extras = ["pyzmq-dep"], marker = "extra == 'lekiwi'" },
    { name = "lerobot", extras = ["pyzmq-dep"], marker = "extra == 'unitree-g1'" },
    { name = "lerobot", extras = ["qwen-vl-utils-dep"], marker = "extra == 'eo1'" },
-    { name = "lerobot", extras = ["qwen-vl-utils-dep"], marker = "extra == 'robometer'" },
    { name = "lerobot", extras = ["qwen-vl-utils-dep"], marker = "extra == 'sarm'" },
    { name = "lerobot", extras = ["qwen-vl-utils-dep"], marker = "extra == 'wallx'" },
    { name = "lerobot", extras = ["reachy2"], marker = "extra == 'all'" },
    { name = "lerobot", extras = ["rebot"], marker = "extra == 'all'" },
-    { name = "lerobot", extras = ["robometer"], marker = "extra == 'all'" },
    { name = "lerobot", extras = ["robstride"], marker = "extra == 'all'" },
    { name = "lerobot", extras = ["sarm"], marker = "extra == 'all'" },
    { name = "lerobot", extras = ["scipy-dep"], marker = "extra == 'aloha'" },
    { name = "lerobot", extras = ["scipy-dep"], marker = "extra == 'libero'" },
    { name = "lerobot", extras = ["scipy-dep"], marker = "extra == 'metaworld'" },
-    { name = "lerobot", extras = ["scipy-dep"], marker = "extra == 'molmoact2'" },
    { name = "lerobot", extras = ["scipy-dep"], marker = "extra == 'phone'" },
    { name = "lerobot", extras = ["scipy-dep"], marker = "extra == 'pi'" },
    { name = "lerobot", extras = ["scipy-dep"], marker = "extra == 'wallx'" },
    { name = "lerobot", extras = ["smolvla"], marker = "extra == 'all'" },
    { name = "lerobot", extras = ["test"], marker = "extra == 'all'" },
-    { name = "lerobot", extras = ["topreward"], marker = "extra == 'all'" },
    { name = "lerobot", extras = ["training"], marker = "extra == 'all'" },
    { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'eo1'" },
    { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'groot'" },
    { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'hilserl'" },
    { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'libero'" },
-    { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'molmoact2'" },
    { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'multi-task-dit'" },
    { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'peft'" },
    { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'pi'" },
-    { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'robometer'" },
    { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'sarm'" },
    { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'smolvla'" },
-    { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'topreward'" },
    { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'wallx'" },
    { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'xvla'" },
    { name = "lerobot", extras = ["video-benchmark"], marker = "extra == 'all'" },
@@ -3226,7 +3203,7 @@ requires-dist = [
    { name = "pandas", marker = "extra == 'video-benchmark'", specifier = ">=2.2.2,<2.4.0" },
    { name = "peft", marker = "extra == 'peft-dep'", specifier = ">=0.18.0,<1.0.0" },
    { name = "pillow", specifier = ">=10.0.0,<13.0.0" },
-    { name = "placo", marker = "extra == 'placo-dep'", specifier = ">=0.9.6,<0.9.16" },
+    { name = "placo", marker = "extra == 'placo-dep'", specifier = ">=0.9.6,<0.9.17" },
    { name = "pre-commit", marker = "extra == 'dev'", specifier = ">=3.7.0,<5.0.0" },
    { name = "protobuf", marker = "extra == 'grpcio-dep'", specifier = ">=6.31.1,<6.32.0" },
    { name = "pyarrow", marker = "extra == 'dataset'", specifier = ">=21.0.0,<30.0.0" },
@@ -3267,7 +3244,7 @@ requires-dist = [
    { name = "transformers", marker = "extra == 'transformers-dep'", specifier = ">=5.4.0,<5.6.0" },
    { name = "wandb", marker = "extra == 'training'", specifier = ">=0.24.0,<0.25.0" },
 ]
-provides-extras = ["dataset", "training", "hardware", "viz", "core-scripts", "evaluation", "dataset-viz", "av-dep", "pygame-dep", "placo-dep", "transformers-dep", "grpcio-dep", "can-dep", "peft-dep", "scipy-dep", "diffusers-dep", "qwen-vl-utils-dep", "matplotlib-dep", "pyserial-dep", "deepdiff-dep", "pynput-dep", "pyzmq-dep", "motorbridge-dep", "motorbridge-smart-servo-dep", "feetech", "dynamixel", "damiao", "robstride", "openarms", "gamepad", "hopejr", "lekiwi", "unitree-g1", "reachy2", "rebot", "kinematics", "intelrealsense", "phone", "diffusion", "wallx", "pi", "molmoact2", "smolvla", "multi-task-dit", "groot", "sarm", "robometer", "topreward", "xvla", "eo1", "hilserl", "async", "peft", "dev", "notebook", "test", "video-benchmark", "aloha", "pusht", "libero", "metaworld", "all"]
+provides-extras = ["dataset", "training", "hardware", "viz", "core-scripts", "evaluation", "dataset-viz", "av-dep", "pygame-dep", "placo-dep", "transformers-dep", "grpcio-dep", "can-dep", "peft-dep", "scipy-dep", "diffusers-dep", "qwen-vl-utils-dep", "matplotlib-dep", "pyserial-dep", "deepdiff-dep", "pynput-dep", "pyzmq-dep", "motorbridge-dep", "motorbridge-smart-servo-dep", "feetech", "dynamixel", "damiao", "robstride", "openarms", "gamepad", "hopejr", "lekiwi", "unitree-g1", "reachy2", "rebot", "kinematics", "intelrealsense", "phone", "diffusion", "wallx", "pi", "smolvla", "multi-task-dit", "groot", "sarm", "xvla", "eo1", "hilserl", "async", "peft", "dev", "notebook", "test", "video-benchmark", "aloha", "pusht", "libero", "metaworld", "all"]

 [[package]]
 name = "librt"
@@ -4615,7 +4592,7 @@ wheels = [

 [[package]]
 name = "placo"
-version = "0.9.15"
+version = "0.9.16"
 source = { registry = "https://pypi.org/simple" }
 dependencies = [
    { name = "cmeel" },
@@ -4625,16 +4602,16 @@ dependencies = [
    { name = "pin" },
    { name = "rhoban-cmeel-jsoncpp" },
 ]
-sdist = { url = "https://files.pythonhosted.org/packages/40/c4/a33a0ee2ad798471a1c43a96109d28f358fd95c78a56f8cff57acb66d2bc/placo-0.9.15.tar.gz", hash = "sha256:df47f1154bae305c943bd20ba4f56d50ffc65625efc98679fefb11e8ff3c462c", size = 136856, upload-time = "2025-11-03T10:49:13.151Z" }
+sdist = { url = "https://files.pythonhosted.org/packages/9e/0a/36c5b729d0d69075e7dfafd1b36c4df6fbb8c1ff1585e88d3c56d4c15010/placo-0.9.16.tar.gz", hash = "sha256:5314faaf6442e7ffe17347680d236af953951813bbfb1c09c4a27f7388d332e4", size = 136871, upload-time = "2025-11-07T14:24:58.811Z" }
 wheels = [
-    { url = "https://files.pythonhosted.org/packages/ef/03/207b1c087996b918fdbaa5a3a685e3b14b068cd303bf87affdf83f722b33/placo-0.9.15-0-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:eab7a299e73291fe631c02448b9e9826539f4824e198bcf85f7c91fdd77d054b", size = 1641975, upload-time = "2025-11-03T10:48:48.887Z" },
-    { url = "https://files.pythonhosted.org/packages/92/55/40432b26bb1c5b9e677fbc41e8d85b54fa8897b7daebb2a22d410b0a7f7b/placo-0.9.15-0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:23f9dd19b8d15fa9d86968948b57981ebc6f1decafeffc2d646d8b56f685b50d", size = 1515448, upload-time = "2025-11-03T10:48:50.562Z" },
-    { url = "https://files.pythonhosted.org/packages/fd/8e/e6283201d329409dccf2045b5c1efd73b3dad5268143bbea4668029ca9c6/placo-0.9.15-0-cp312-cp312-manylinux_2_28_aarch64.whl", hash = "sha256:2680a2166c23a0a2aa6226ad75c63a2b2310c812673a5db296616d9af053e076", size = 2106550, upload-time = "2025-11-03T10:48:52.364Z" },
-    { url = "https://files.pythonhosted.org/packages/51/c3/77efe4c999e1d80ec14879ef73ea2a2144aa12db2b67870a562f87ed5b43/placo-0.9.15-0-cp312-cp312-manylinux_2_28_x86_64.whl", hash = "sha256:1a2202a78bcd2874ca09a9a6526a95b38874803923cb9b3b4b96cd68ab4b7217", size = 2178531, upload-time = "2025-11-03T10:48:53.932Z" },
-    { url = "https://files.pythonhosted.org/packages/fe/e7/b5cc5ad53414ff7af3357e0c9d97d902a3ce276e7810f8814fe9f0c1fb70/placo-0.9.15-0-cp313-cp313-macosx_10_9_x86_64.whl", hash = "sha256:84a445a99b059a512d1b4c64841a91d6f50149c7be9255c65bedeebbe6663989", size = 1641982, upload-time = "2025-11-03T10:48:55.277Z" },
-    { url = "https://files.pythonhosted.org/packages/ad/1c/1c9163d941698a077617f218041efc573d3bf5a1c169a284112bd622fccd/placo-0.9.15-0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:b3106e7e6b05cbfa494239d8aa14795f7da8ee5dec851602f0d6297e311d7334", size = 1515447, upload-time = "2025-11-03T10:48:56.975Z" },
-    { url = "https://files.pythonhosted.org/packages/cd/22/3d9b9045b89248c8476dd42243bc9821a123d9199e4e96a944124ad80cf1/placo-0.9.15-0-cp313-cp313-manylinux_2_28_aarch64.whl", hash = "sha256:66c3d099e87551401aace04f1293a3c3563b1399319976647846845bf92c3ccf", size = 2106558, upload-time = "2025-11-03T10:48:58.667Z" },
-    { url = "https://files.pythonhosted.org/packages/20/0b/45dbdd2c378a7cece578b7344fda493d5a2aa6777089798a315ce4f97c22/placo-0.9.15-0-cp313-cp313-manylinux_2_28_x86_64.whl", hash = "sha256:0e06b7d3d618ddc2b649ab8b0b46db8001fe72fe2fbcc801524df0ccc8a3da40", size = 2178531, upload-time = "2025-11-03T10:49:00.533Z" },
+    { url = "https://files.pythonhosted.org/packages/a4/95/8a85b58033303fd354a680e1494f47801abdca9133c222ae1c2473983f25/placo-0.9.16-0-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:417a89920b340e3aec19f1f49e1fb06789c679a807450157af8bdf4aef4bc82b", size = 1641806, upload-time = "2025-11-07T14:24:34.736Z" },
+    { url = "https://files.pythonhosted.org/packages/92/bd/2fb3556c71b0689b3168c0e85fce5befb605affcfe4afb3b5e7b5ba6749f/placo-0.9.16-0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:a7ef7ac33ba889d2122db0d7ed55eeecdffed020e2282712989bb11e408bab40", size = 1515468, upload-time = "2025-11-07T14:24:36.587Z" },
+    { url = "https://files.pythonhosted.org/packages/ea/fd/7dba380720dfb89df582a51d0b2cb43957a36849f676baa3dfc74704e67f/placo-0.9.16-0-cp312-cp312-manylinux_2_28_aarch64.whl", hash = "sha256:885773fe8a8e809022451ec16d47479562a042596f663b8c5bbe762cd616f573", size = 2106540, upload-time = "2025-11-07T14:24:38.149Z" },
+    { url = "https://files.pythonhosted.org/packages/7a/40/97c7c799fe4f89111b973d7a5f86626a2ec1d0e6e20ce2988e0a2bda66f5/placo-0.9.16-0-cp312-cp312-manylinux_2_28_x86_64.whl", hash = "sha256:19f097305c714e539fbf19e761897f6daab2ff73f639319431b144e77dd3852e", size = 2178511, upload-time = "2025-11-07T14:24:40.04Z" },
+    { url = "https://files.pythonhosted.org/packages/f7/4d/f1700aae269584477b5d72561d2fc5ace37b4bca167892a74a369849c67e/placo-0.9.16-0-cp313-cp313-macosx_10_9_x86_64.whl", hash = "sha256:be11fa987702114097ccf3d94e1c4a891796878429e25c8d88b187ecc652e7ae", size = 1641812, upload-time = "2025-11-07T14:24:41.308Z" },
+    { url = "https://files.pythonhosted.org/packages/43/d7/21d1d0dd1311c0cbd9ccd233cdae520bbe2370095e3c831059d6077c90bd/placo-0.9.16-0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:c2d65aeb4844eae28006ad3a50c8519b27c701912cc99c46c95e33ed049f3635", size = 1515457, upload-time = "2025-11-07T14:24:42.758Z" },
+    { url = "https://files.pythonhosted.org/packages/0f/e8/939ba23bfa539fb90ab9ab1c2c59ff9a9a46e24699fc90e8ca3ff2948646/placo-0.9.16-0-cp313-cp313-manylinux_2_28_aarch64.whl", hash = "sha256:a7633aff1c592c1f45e86a174a372d5d7972673935cb9151391277ff49ec2072", size = 2106538, upload-time = "2025-11-07T14:24:44.517Z" },
+    { url = "https://files.pythonhosted.org/packages/08/00/ad24cc0ad85fbe12267df28c2061e1eaef8f852146c467fcd7a681e11028/placo-0.9.16-0-cp313-cp313-manylinux_2_28_x86_64.whl", hash = "sha256:0d97a7284b65fc45aef27865c80cf7e53f04646d35bb18494ab62dfbbc9a35bd", size = 2178514, upload-time = "2025-11-07T14:24:45.994Z" },
 ]

 [[package]]
Author	SHA1	Message	Date
Nikodem Bartnik	99c0d93b34	add more base models to generate model card	2026-05-20 12:24:32 +02:00
Nikodem Bartnik	c62784e14c	add port and fix formatting	2026-05-20 10:56:59 +02:00
Nikodem Bartnik	cc6a2cac43	update policy deployment instruction with rollout	2026-05-20 10:55:18 +02:00
@@ -1 +0,0 @@
-../../../../docs/source/policy_molmoact2_README.md
@@ -1 +0,0 @@
-"""Lightweight vendored OpenPI PyTorch modules for PI0/PI05 parity tests."""
@@ -1 +0,0 @@
-"""Utilities shared by PI0/PI05 policy tests."""