feat(annotate): add plan toggle, drop subtask verify pass, 4xH200 job

- PlanConfig.emit_plan (default True): keep subtasks + memory but skip the
  per-boundary "plan" rows and their VLM call when False.
- Remove the subtask_verify pass entirely: pruning dropped legitimate subtasks
  and the stitch step already guarantees full-episode coverage. Deletes
  _verify_subtasks, both call sites, and the now-unused
  module_1_subtask_verify prompt.
- run_hf_job example: 4xH200 (4 vllm servers), emit_plan=false, vqa off.

Co-authored-by: Cursor <cursoragent@cursor.com>
docs(annotate): prettier format annotation_pipeline.mdx
2026-06-02 20:01:25 +00:00 · 2026-06-02 18:02:13 +02:00 · 2026-06-02 17:41:46 +02:00 · 2026-06-02 17:38:18 +02:00 · 2026-06-02 17:36:07 +02:00 · 2026-06-02 16:26:14 +02:00
73 changed files with 23769 additions and 245 deletions
--- a/6
+++ b/6
@@ -178,3 +178,9 @@ test-smolvla-ete-eval:
 		--env.episode_length=5 \
 		--eval.n_episodes=1 \
 		--eval.batch_size=1
+
+# E2E annotation pipeline smoke test against a tiny in-memory fixture
+# dataset. Opt-in (not part of `make test-end-to-end`) and uses a stub VLM
+# backend, so it does not require a real model checkpoint or GPU.
+annotation-e2e:
+	uv run python -m tests.annotations.run_e2e_smoke
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -43,6 +43,8 @@
    title: Language Columns and Recipes
  - local: tools
    title: Tools
+  - local: annotation_pipeline
+    title: Annotation Pipeline
  - local: video_encoding_parameters
    title: Video encoding parameters
  - local: streaming_video_encoding
@@ -59,6 +61,8 @@
    title: π₀-FAST (Pi0Fast)
  - local: pi05
    title: π₀.₅ (Pi05)
+  - local: molmoact2
+    title: MolmoAct2
  - local: eo1
    title: EO-1
  - local: groot
@@ -73,6 +77,8 @@
 - sections:
  - local: sarm
    title: SARM
+  - local: robometer
+    title: ROBOMETER
  - local: topreward
    title: TOPReward
  title: "Reward Models"
--- a/docs/source/annotation_pipeline.mdx
+++ b/docs/source/annotation_pipeline.mdx
@@ -0,0 +1,198 @@
+# Annotation Pipeline
+
+`lerobot-annotate` populates the two language columns introduced by the
+[Language Columns and Recipes](./language_and_recipes) page —
+`language_persistent` and `language_events` — directly into
+`data/chunk-*/file-*.parquet`.
+
+## What the pipeline produces
+
+A vocabulary-discovery phase derives a small canonical wording, then three
+modules write into a per-episode staging tree, then a single writer
+rewrites the data shards in place:
+
+| Style / atom                                | Column                | Module          |
+| ------------------------------------------- | --------------------- | --------------- |
+| `subtask` (Pi0.7-style "how, not what")     | `language_persistent` | `plan`          |
+| `plan` (initial + refresh on interjection)  | `language_persistent` | `plan`          |
+| `memory` (MEM-style compression)            | `language_persistent` | `plan`          |
+| `task_aug` (rephrasings of canonical task)  | `language_persistent` | `plan`          |
+| `interjection`                              | `language_events`     | `interjections` |
+| speech tool-call atom (`style=null`, `say`) | `language_events`     | `interjections` |
+| `vqa` (user / assistant pair)               | `language_events`     | `vqa`           |
+
+The `plan` module is constrained to a **canonical vocabulary** discovered
+once per dataset by the `vocabulary` module (phase 0). It watches a few
+sample episode videos (`--vocabulary.sample_episodes`, default `3`) and
+asks the VLM to derive a small set of imperative subtask labels and
+first-person memory milestones that recur across the demos. The VLM
+picks the right number of entries itself based on what it sees in the
+clips — short pick-and-place demos get ~6 subtask labels, longer
+multi-step recipes get more. The result lands at
+`meta/canonical_vocabulary.json` (human-readable / hand-editable) and
+is reused on every subsequent run. The `plan` module then constrains
+both subtask + memory generation to those exact strings — the
+downstream low-level policy sees a small, repeatable target
+distribution instead of thousands of LLM paraphrases. Disable with
+`--vocabulary.enabled=False` to fall back to free-form generation.
+
+The writer does **not** add a `tools` column to the parquet — the tool
+catalog lives at `meta/info.json["tools"]` instead (see
+[Tools](./tools)). After every annotation run the pipeline ensures the
+canonical `say` schema is present in that list, preserving any tools the
+user pre-declared.
+
+If you want to declare additional tools for a dataset before annotation
+runs, edit `meta/info.json["tools"]` directly — the pipeline preserves
+anything already there. Implementations of those tools live under
+`src/lerobot/tools/`; one file per tool, registered via
+`TOOL_REGISTRY`. See the [Tools](./tools) doc for the authoring guide.
+
+## Running locally
+
+Install the extra and invoke the console script. Episode-level
+concurrency comes from `--executor.episode_parallelism` (default 16);
+that is the only knob the in-process executor exposes.
+
+```bash
+uv sync --extra annotations
+uv run lerobot-annotate \
+  --root=/path/to/dataset \
+  --vlm.model_id=Qwen/Qwen2.5-VL-7B-Instruct
+```
+
+The pipeline attaches actual camera footage to every `plan` /
+`interjections` / `vqa` prompt by default, decoded from the dataset's
+first `observation.images.*` stream. Override with
+`--vlm.camera_key=observation.images.<name>` to pin a specific
+viewpoint. Datasets with no video tracks fall back to text-only prompts
+automatically.
+
+**The `plan` module sees the whole episode as one video block.** Subtask
+decomposition gets a `{"type":"video", "video":[<frames>]}` block
+covering the entire demonstration; Qwen-VL pools temporally on its own
+and decides where to cut. There is no keyframe stride or count knob —
+`--plan.max_video_frames` (default 128) only caps the frames packed
+into the video block as a model-capacity bound. The `interjections`
+module attaches a short window of frames straddling the interjection
+timestamp. The `vqa` module grounds each VQA pair on a single frame —
+its `--vqa.K` knob sets how many consecutive frames each emission tick
+anchors, and every anchored frame gets its own VQA pair on that one
+frame (there is no per-pair frame window).
+
+## Running on Hugging Face Jobs
+
+Distributed annotation is delegated to
+[Hugging Face Jobs](https://huggingface.co/docs/hub/en/jobs). The repo
+ships a launcher script you copy and edit for your dataset:
+
+```bash
+HF_TOKEN=hf_... uv run python examples/annotations/run_hf_job.py
+```
+
+[`examples/annotations/run_hf_job.py`](https://github.com/huggingface/lerobot/blob/main/examples/annotations/run_hf_job.py)
+spawns one `h200x2` job that:
+
+1. installs the branch under test plus the annotation extras,
+2. boots two vllm servers (one per GPU) for the chosen model,
+3. runs the `plan` / `interjections` / `vqa` modules across the dataset
+   via `lerobot-annotate`,
+4. uploads the annotated dataset to `--push_to_hub`.
+
+To target a different dataset, model, or hub repo, edit the `CMD` block
+inside the script — every flag in there maps directly onto a CLI flag of
+`lerobot-annotate` (see `lerobot-annotate --help` for the full list).
+
+## Style-to-recipe consumer mapping
+
+The pipeline's outputs are designed to be consumed by recipes (see
+[Language Columns and Recipes](./language_and_recipes)) — typically:
+
+- Low-level / high-level / memory-update branches consume
+  `subtask`/`plan`/`memory` from `language_persistent`.
+- An interjection-response branch consumes `interjection` events plus
+  the paired speech atom (merged into one assistant target turn via
+  `tool_calls_from`) and the same-timestamp `plan` refresh.
+- A VQA branch consumes the `(vqa, user)` and `(vqa, assistant)` pairs
+  from `language_events`.
+
+## Why the design splits state from events
+
+Two things drive the scope:
+
+1. **Persistent state vs exact-event split.** Persistent rows
+   (`subtask`, `plan`, `memory`) broadcast per episode and answer "what
+   state is in force at this frame?". Event rows (`interjection`, `vqa`,
+   speech) only appear on the exact frame whose timestamp matches the
+   emission. The pipeline writes timestamps taken straight from the
+   source parquet — no floating-point recomputation.
+2. **One Qwen-VL pass.** All three modules share a single VLM client
+   (vLLM if available, transformers fallback) so the cost is one model
+   load per dataset, not three.
+
+## Module independence and staged reruns
+
+Each module writes its raw output to
+`<root>/.annotate_staging/episode_{N:06d}/<module>.jsonl`. That makes
+prompt iteration cheap — re-running one module overwrites only its own
+JSONL file before the writer composes the final parquet. Modules can be
+disabled via `--plan.enabled=false` (and likewise `--interjections.enabled`
+/ `--vqa.enabled`) to
+test them in isolation.
+
+## Validation/report checks before final write
+
+Before the writer runs, `StagingValidator` checks:
+
+- exact frame-timestamp alignment for every event row;
+- no orphan speech / interjection pairs;
+- `plan` is refreshed at every interjection timestamp;
+- `memory` rows fall on subtask boundaries (warning, not error);
+- VQA assistant `content` parses as JSON in one of the
+  bbox / keypoint / count / attribute / spatial shapes;
+- every row routes to the column dictated by `column_for_style(style)`.
+
+Errors abort the writer (`--skip_validation=true` overrides for debugging).
+
+## Paper inspirations per module
+
+- **`plan` module — subtasks.** Hi Robot ([Shi 2025](https://arxiv.org/abs/2502.19417))
+  atom granularity ("pick up one piece of lettuce", "place bowl to box");
+  Pi0.7 ([Physical Intelligence 2025](https://pi.website/pi07)) "how, not
+  what" detail.
+- **`plan` module — memory.** MEM ([Torne 2026](https://arxiv.org/abs/2603.03596))
+  compression directive: keep only minimal relevant information; functional
+  outcomes preserved, specific attributes dropped.
+- **`interjections` module.** Hi Robot scenario taxonomy: negative task,
+  situated correction, specific constraint, preference. Speech is a
+  tool-call-only atom (`tool_calls=[{type:function, function:{name:"say",
+arguments:{text:...}}}]`).
+- **`vqa` module.** ECoT ([Zawalski 2024](https://arxiv.org/abs/2407.08693))
+  grounded features (bounding boxes in pixel `[x_min, y_min, x_max, y_max]`,
+  keypoints) and Steerable VLA Policies ([Zhao 2025](https://arxiv.org/abs/2509.07626))
+  multi-abstraction grounding. Pi0.7 also grounds answers across
+  multiple abstraction levels.
+
+Future maintainers should adjust the prompt templates in
+`src/lerobot/annotations/steerable_pipeline/prompts/` against these
+references rather than rewriting from scratch.
+
+## Compute and list-size estimates
+
+Per episode, the pipeline issues O(`max_steps`) `plan`-module calls,
+O(`max_interjections_per_episode`) `interjections`-module calls, and
+O(`vqa_emission_hz × episode_seconds`) `vqa`-module calls. With defaults
+(8 subtasks, 1 interjection, 1 Hz × 3 pairs) and 30-second episodes, that
+is ~50 VLM calls per episode. `language_persistent` per episode is ~10s of
+KB at most (parquet dictionary-encodes one entry per episode);
+`language_events` is empty on most frames and is bounded by the number of
+emissions, not `num_frames × num_emissions`.
+
+## Reproducibility via seed and prompt hashes
+
+`--seed` (default 1729) feeds the per-episode RNGs that select interjection
+timestamps and VQA question types. Combined with the deterministic prompt
+templates checked into `prompts/`, two runs at the same seed against the
+same dataset and the same model checkpoint produce byte-identical staging
+artifacts. Prompt edits are recorded by file hash; future tooling can pin
+expected `(seed, prompt_hash)` pairs into the dataset card.
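Note on the "Reproducibility via seed and prompt hashes" section that closes `annotation_pipeline.mdx` above: the scheme it describes can be illustrated with a minimal sketch. This assumes SHA-256 is used both to derive a per-episode RNG from `--seed` and to fingerprint a prompt template. The helper names `episode_rng`, `prompt_hash`, and `pick_interjection_timestamp` are hypothetical and are not taken from the pipeline; the actual seeding and hashing code may differ.

```python
import hashlib
import random

DEFAULT_SEED = 1729  # default value of --seed quoted in the doc


def episode_rng(seed: int, episode_index: int) -> random.Random:
    """Hypothetical derivation: hash (seed, episode) into a stable RNG seed."""
    digest = hashlib.sha256(f"{seed}:{episode_index}".encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def prompt_hash(prompt_text: str) -> str:
    """Fingerprint a prompt template so that any edit changes the hash."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()


def pick_interjection_timestamp(seed, episode_index, frame_timestamps):
    """Select one existing frame timestamp; the value is never recomputed."""
    return episode_rng(seed, episode_index).choice(frame_timestamps)


if __name__ == "__main__":
    # Stand-in for timestamps read from the source parquet (assumed 30 fps).
    timestamps = [i / 30 for i in range(90)]
    first = pick_interjection_timestamp(DEFAULT_SEED, 0, timestamps)
    second = pick_interjection_timestamp(DEFAULT_SEED, 0, timestamps)
    print(first == second)  # same seed and episode give the same selection
    print(prompt_hash("Describe the subtask.")[:12])
```

Deriving the RNG from both the seed and the episode index keeps each episode's choices independent of the order in which episodes are processed, which matters when episodes run in parallel.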
--- a/docs/source/molmoact2.mdx
+++ b/docs/source/molmoact2.mdx
@@ -0,0 +1,433 @@
+# MolmoAct2 Policy
+
+MolmoAct2 is the LeRobot policy implementation of
+[MolmoAct2](https://allenai.org/blog/molmoact2), ported into the LeRobot
+training, evaluation, checkpointing, and dataset interfaces for easier use with
+LeRobot datasets.
+
+This implementation currently supports training and evaluation for the regular
+MolmoAct2 model. MolmoAct2-Think, which supports adaptive depth reasoning, is
+not included in this LeRobot policy yet and is coming soon.
+
+For the original MolmoAct2 training code used for the experiments reported in
+the paper, see [allenai/molmoact2](https://github.com/allenai/molmoact2).
+
+## Installation Requirements
+
+Install LeRobot with the MolmoAct2 optional dependencies:
+
+```bash
+pip install -e ".[molmoact2]"
+```
+
+To run the models in this repository, you need an NVIDIA GPU. The measurements
+below were taken on a single NVIDIA H100 80GB with bf16 model loading, on LIBERO with two RGB cameras. MolmoAct2 rows use `chunk_size=10`, action dim 7
+padded to `expected_max_action_dim=32`, and `num_flow_timesteps=8`. Training measurements use
+`gradient_checkpointing=true` and include the forward pass, backward pass,
+gradient clipping, optimizer step, and optimizer state allocation. Values are
+peak GPU memory sampled with `nvidia-smi`. Leave a few GiB of headroom for
+dataloader workers, CUDA context, and fragmentation.
+
+Multi-GPU training through `accelerate` increases throughput and global batch
+size, but this LeRobot port does not currently expose the original MolmoAct2
+`fsdp_devices` model-parallel training path. The current training script has
+not been tested for multi-node training.
+
+| Mode                                             | Peak Memory, bs=8 | Peak Memory, bs=16 | Peak Memory, bs=32 |
+| ------------------------------------------------ | ----------------: | -----------------: | -----------------: |
+| Inference, continuous, CUDA graph enabled (bs=1) |          12.1 GiB |                  - |                  - |
+| Fine-tuning, action expert only, continuous      |          16.5 GiB |           18.3 GiB |           21.4 GiB |
+| Fine-tuning, LoRA VLM, both action modes         |          20.2 GiB |           26.8 GiB |           41.3 GiB |
+| Fine-tuning, full model, both action modes       |          48.3 GiB |           49.8 GiB |           60.1 GiB |
+
+The repo has been tested with Ubuntu 22.04.
+
+## Usage
+
+To use MolmoAct2 in a LeRobot training config, set:
+
+```bash
+policy.type=molmoact2
+```
+
+## Training
+
+MolmoAct2 can be fine-tuned from either the released MolmoAct2 Hugging Face
+checkpoint format or from a checkpoint already saved by LeRobot. Both routes use
+the same LeRobot training loop, dataset transforms, checkpoint saving, and
+logging. The difference is only how the initial policy weights and processor
+state are loaded.
+
+### Training With Original MolmoAct2 Weight
+
+Use `policy.checkpoint_path` when starting from a released MolmoAct2 checkpoint,
+for example `allenai/MolmoAct2` or `allenai/MolmoAct2-LIBERO`. LeRobot will load
+the original HF model files, then build its own policy processor from the
+dataset metadata and the policy options below.
+
+The command below shows full fine-tuning on the merged LIBERO dataset. It uses
+bf16 model loading, 8 flow timesteps, LeRobot dataset statistics, image
+augmentation, and LeRobot's checkpointing/logging path.
+
+```bash
+accelerate launch \
+  --num_processes=8 \
+  --mixed_precision=bf16 \
+  -m lerobot.scripts.lerobot_train \
+  --dataset.repo_id=allenai/MolmoAct2-LIBERO-Dataset \
+  --dataset.root=/path/to/lerobot/data/allenai/MolmoAct2-LIBERO-Dataset \
+  --dataset.video_backend=pyav \
+  --dataset.image_transforms.enable=true \
+  --policy.type=molmoact2 \
+  --policy.checkpoint_path=allenai/MolmoAct2-LIBERO \
+  --policy.device=cuda \
+  --policy.action_mode=both \
+  --policy.chunk_size=10 \
+  --policy.n_action_steps=10 \
+  --policy.setup_type="single franka robotic arm in libero" \
+  --policy.control_mode="delta end-effector pose" \
+  --policy.image_keys='["observation.images.image","observation.images.wrist_image"]' \
+  --policy.model_dtype=bfloat16 \
+  --policy.num_flow_timesteps=8 \
+  --policy.gradient_checkpointing=true \
+  --policy.freeze_embedding=true \
+  --policy.normalize_gripper=false \
+  --policy.enable_knowledge_insulation=false \
+  --policy.push_to_hub=false \
+  --wandb.enable=true \
+  --wandb.entity=<wandb_entity> \
+  --wandb.project=<wandb_project> \
+  --job_name=<job_name> \
+  --output_dir=outputs/<job_name> \
+  --steps=10000 \
+  --batch_size=32 \
+  --num_workers=4 \
+  --log_freq=20 \
+  --eval_freq=-1 \
+  --save_checkpoint=true \
+  --save_freq=2000
+```
+
+### Training With LeRobot MolmoAct2 Weight
+
+Use `policy.path` when starting from a MolmoAct2 checkpoint that was saved by
+LeRobot, either from a local `pretrained_model` directory or from the Hub. This
+restores the saved LeRobot policy config, model weights, processor, and
+normalization statistics. You can still override training-time options such as
+`batch_size`, `steps`, LoRA flags, or `policy.action_mode`.
+
+```bash
+accelerate launch \
+  --num_processes=8 \
+  --mixed_precision=bf16 \
+  -m lerobot.scripts.lerobot_train \
+  --dataset.repo_id=allenai/MolmoAct2-LIBERO-Dataset \
+  --dataset.root=/path/to/lerobot/data/allenai/MolmoAct2-LIBERO-Dataset \
+  --dataset.video_backend=pyav \
+  --dataset.image_transforms.enable=true \
+  --policy.path=/path/to/pretrained_model \
+  --policy.device=cuda \
+  --policy.action_mode=both \
+  --policy.chunk_size=10 \
+  --policy.n_action_steps=10 \
+  --policy.model_dtype=bfloat16 \
+  --policy.num_flow_timesteps=8 \
+  --policy.gradient_checkpointing=true \
+  --wandb.enable=true \
+  --wandb.entity=<wandb_entity> \
+  --wandb.project=<wandb_project> \
+  --job_name=<job_name> \
+  --output_dir=outputs/<job_name> \
+  --steps=10000 \
+  --batch_size=32 \
+  --num_workers=4 \
+  --log_freq=20 \
+  --eval_freq=-1 \
+  --save_checkpoint=true \
+  --save_freq=2000
+```
+
+### Common Practices
+
+For fine-tuning on a comparatively small dataset, such as a single LIBERO suite
+or a real-world dataset with fewer than 200 demonstrations, a global batch size of
+16 to 32 is a good starting point. In these settings, `policy.enable_lora_vlm=true` or `policy.train_action_expert_only=true` is also a practical choice. In both
+cases, we intentionally keep the action expert fully trainable, which we found
+to be crucial for model performance. For larger fine-tuning datasets, larger
+global batch sizes and full fine-tuning are usually preferred.
+
+### Common Policy Options
+
+- `policy.checkpoint_path`: original MolmoAct2 HF checkpoint to initialize from.
+  Use this for released MolmoAct2 weights.
+- `policy.path`: LeRobot checkpoint to initialize from. Use this for checkpoints
+  created by LeRobot training.
+- `policy.action_mode`: training target, one of `continuous`, `discrete`, or
+  `both`. `both` trains the flow-matching action expert and the discrete
+  action-token loss.
+- `policy.train_action_expert_only`: trains only parameters whose names contain
+  `action_expert`. It requires `policy.action_mode=continuous`.
+- `policy.enable_lora_vlm`: enables LoRA on VLM linear layers. Use
+  `policy.enable_lora_action_expert=true` only if LoRA should also cover action
+  expert linear layers. When `policy.enable_lora_action_expert=false`, the
+  action expert base weights remain fully trainable while the VLM is trained
+  through LoRA adapters. When `policy.enable_lora_action_expert=true`, the
+  action expert is also adapter-tuned instead of fully fine-tuned.
+- `policy.enable_knowledge_insulation`: when `true`, detaches action-expert
+  context K/V states before the action loss. The default is `false`.
+- `policy.chunk_size`: action horizon used by the policy. For LIBERO we use
+  `10`. This LeRobot port overrides the loaded checkpoint's
+  `max_action_horizon` with this value.
+- `policy.n_action_steps`: number of actions consumed from each predicted
+  chunk before querying the policy again. For LIBERO, set it to `chunk_size`.
+- `policy.setup_type`: text inserted into the prompt to describe the robot and
+  scene, e.g. `single franka robotic arm in libero`. More examples are listed
+  in the `metadata_by_tag` entries of
+  [`norm_stats.json`](https://huggingface.co/allenai/MolmoAct2/blob/main/norm_stats.json).
+- `policy.control_mode`: text inserted into the prompt to describe the action
+  space, e.g. `delta end-effector pose` or `absolute joint pose`.
+- `policy.image_keys`: ordered LeRobot image observation keys passed to the
+  processor.
+- `policy.model_dtype`: checkpoint/forward dtype, one of `float32`,
+  `bfloat16`, or `float16`. Use `bfloat16` for normal training.
+- `policy.num_flow_timesteps`: number of flow-matching timesteps sampled per
+  example during training. We use `8` for fine-tuning.
+- `policy.num_inference_steps`: optional override for continuous action
+  generation steps at inference time.
+- `policy.gradient_checkpointing`: enables checkpointing in the VLM/action path
+  to reduce activation memory.
+- `policy.freeze_embedding`: freezes input embeddings. The default is `true`.
+- `policy.normalize_gripper`: controls whether gripper dimensions are included
+  in state/action quantile normalization. The default is `false`.
+- `policy.normalize_language`: normalizes task strings before prompt
+  construction. The default is `true`.
+- `policy.mask_action_dim_padding`: masks padded dimensions in the flow loss.
+  Released checkpoints use `policy.expected_max_action_dim=32`.
+- `policy.max_sequence_length`: optional manual sequence cap. Leave unset to
+  infer it from images, state dimension, action dimension, action horizon, and
+  discrete-action mode.
+
+### Learning Rates
+
+MolmoAct2 uses parameter-group learning rates to match the original MolmoAct2
+fine-tuning experiments.
+
+- Full fine-tuning uses `policy.optimizer_lr=1e-5` for the VLM,
+  `policy.optimizer_vit_lr=5e-6` for the vision tower,
+  `policy.optimizer_connector_lr=5e-6` for image connector layers, and
+  `policy.optimizer_action_expert_lr=5e-5` for the action expert.
+- LoRA VLM fine-tuning sets the VLM, vision, and connector LoRA parameter
+  groups to `5e-5` when `policy.enable_lora_vlm=true`. By default,
+  `policy.enable_lora_action_expert=false`, so the action expert is still fully
+  fine-tuned with `policy.optimizer_action_expert_lr`. If
+  `policy.enable_lora_action_expert=true`, the action expert is trained through
+  LoRA adapters instead.
+- Action-expert-only fine-tuning trains only the action expert and uses
+  `policy.optimizer_action_expert_lr=5e-5`.
+
+You can override the full fine-tuning and action-expert learning rates with
+`policy.optimizer_lr`, `policy.optimizer_vit_lr`,
+`policy.optimizer_connector_lr`, and `policy.optimizer_action_expert_lr`.
+Scheduler settings can be changed with `policy.scheduler_warmup_steps`,
+`policy.scheduler_decay_steps`, and `policy.scheduler_decay_lr`.
+
+### Dataset Quantile Statistics
+
+MolmoAct2 defaults to quantile normalization for state and action features. If
+your dataset has not been converted with quantile statistics, you can add them
+with:
+
+```bash
+python src/lerobot/datasets/v30/augment_dataset_quantile_stats.py \
+  --repo-id=your_dataset
+```
+
+Alternatively, train MolmoAct2 with mean/std normalization:
+
+```bash
+--policy.normalization_mapping='{"ACTION": "MEAN_STD", "STATE": "MEAN_STD", "VISUAL": "IDENTITY"}'
+```
+
+## Evaluation
+
+Evaluation also supports both LeRobot-saved checkpoints and original MolmoAct2
+HF checkpoints. For LIBERO replication, keep the EGL rendering environment
+fixed and use `policy.per_episode_seed=true`.
+
+**Important:** We found that `num_steps_wait=10` does not reliably let the
+LIBERO scene stabilize and can degrade measured success. All LIBERO evaluation
+results reported here use `num_steps_wait=50`.
+
+### Evaluation With LeRobot MolmoAct2 Weight
+
+Use `policy.path` for a checkpoint saved by LeRobot. The saved processor and
+normalization statistics are restored together with the model.
+
+```bash
+export MUJOCO_GL=egl
+export PYOPENGL_PLATFORM=egl
+export OMP_NUM_THREADS=1
+export MKL_NUM_THREADS=1
+
+lerobot-eval \
+  --policy.path=allenai/MolmoAct2-LIBERO-LeRobot \
+  --policy.inference_action_mode=continuous \
+  --policy.model_dtype=bfloat16 \
+  --policy.use_amp=true \
+  --policy.enable_inference_cuda_graph=true \
+  --policy.device=cuda \
+  --policy.per_episode_seed=true \
+  --policy.eval_seed=1000 \
+  --env.type=libero \
+  --env.task=libero_10,libero_goal,libero_object,libero_spatial \
+  --env.camera_name_mapping='{"agentview_image":"image","robot0_eye_in_hand_image":"wrist_image"}' \
+  --eval.batch_size=1 \
+  --eval.n_episodes=50 \
+  --seed=1000
+```
+
+### Evaluation With Original MolmoAct2 Weight
+
+You can evaluate a released Hugging Face checkpoint directly without first
+converting it to a LeRobot checkpoint. In this case, set
+`policy.checkpoint_path` to the HF model repo and provide `policy.norm_tag`.
+For LIBERO, `policy.norm_tag=libero` loads the LIBERO action/state
+normalization statistics, action horizon, prompt metadata, and image-key order
+from the checkpoint's `norm_stats.json`.
+
+To fully replicate the MolmoAct2 paper results with released Hugging Face
+checkpoints, we recommend using the v0.5.1-pinned
+[`allenai/lerobot` `molmoact2-hf-inference`](https://github.com/allenai/lerobot/tree/molmoact2-hf-inference)
+branch. That branch matches the original evaluation settings used for the
+reported numbers.
+
+```bash
+export MUJOCO_GL=egl
+export PYOPENGL_PLATFORM=egl
+export OMP_NUM_THREADS=1
+export MKL_NUM_THREADS=1
+
+lerobot-eval \
+  --policy.type=molmoact2 \
+  --policy.checkpoint_path=allenai/MolmoAct2-LIBERO \
+  --policy.norm_tag=libero \
+  --policy.inference_action_mode=continuous \
+  --policy.model_dtype=float32 \
+  --policy.use_amp=false \
+  --policy.enable_inference_cuda_graph=true \
+  --policy.device=cuda \
+  --policy.per_episode_seed=true \
+  --policy.eval_seed=1000 \
+  --env.type=libero \
+  --env.task=libero_goal \
+  --env.camera_name_mapping='{"agentview_image":"image","robot0_eye_in_hand_image":"wrist_image"}' \
+  --eval.batch_size=1 \
+  --eval.n_episodes=50 \
+  --seed=1000
+```
+
+Use `--env.task=libero_10,libero_goal,libero_object,libero_spatial` to run the
+full LIBERO suite. The same command works for other released MolmoAct2
+checkpoints as long as the requested `policy.norm_tag` exists in that
+checkpoint's `norm_stats.json`.
+
+### Common Evaluation Options
+
+- `policy.inference_action_mode`: required for rollout. Use `continuous` for
+  flow-matching inference or `discrete` for action-token inference. It must be
+  compatible with the training-time `policy.action_mode` saved in the
+  checkpoint.
+- `policy.path`: LeRobot checkpoint path or Hub repo. Use this for checkpoints
+  saved by LeRobot.
+- `policy.checkpoint_path`: original MolmoAct2 HF checkpoint path or Hub repo.
+  Use this with `policy.type=molmoact2` and `policy.norm_tag`.
+- `policy.norm_tag`: selects normalization statistics, prompt metadata,
+  image-key order, and action horizon from the original checkpoint's
+  `norm_stats.json`. It is required for direct original-HF checkpoint
+  evaluation.
+- `policy.model_dtype`: model load/forward dtype. Use `bfloat16` for normal
+  GPU evaluation. Use `float32` only when you explicitly want fp32 inference.
+- `policy.use_amp`: runs the policy forward under autocast during eval. For
+  `model_dtype=bfloat16`, keep this enabled.
+- `policy.enable_inference_cuda_graph`: enables the MolmoAct2 inference CUDA
+  graph path for faster repeated continuous-action rollout.
+- `policy.per_episode_seed` and `policy.eval_seed`: make stochastic continuous
+  action generation deterministic per episode for replication.
+- `env.task`: comma-separated LIBERO suites or a single suite. Use
+  `libero_10,libero_goal,libero_object,libero_spatial` for the full benchmark.
+- `env.camera_name_mapping`: maps LIBERO camera names to the image keys expected
+  by the policy processor.
+
+## Performance Results
+
+### LIBERO Benchmark Results
+
+MolmoAct2 has demonstrated strong performance on the LIBERO benchmark suite. To
+compare and test its LeRobot implementation, we fine-tuned
+[`allenai/MolmoAct2-LIBERO`](https://huggingface.co/allenai/MolmoAct2-LIBERO)
+for an additional 10k steps on the LIBERO dataset with per-GPU batch size 32 on
+8 H100 GPUs, then compared the results to the original MolmoAct2 reference
+results.
+
+The LeRobot fine-tuned checkpoint reported here is available at
+[`allenai/MolmoAct2-LIBERO-LeRobot`](https://huggingface.co/allenai/MolmoAct2-LIBERO-LeRobot)
+and was trained on
+[`allenai/MolmoAct2-LIBERO-Dataset`](https://huggingface.co/datasets/allenai/MolmoAct2-LIBERO-Dataset).
+
+| Benchmark      | LeRobot Implementation | MolmoAct2 Original |
+| -------------- | ---------------------: | -----------------: |
+| LIBERO Spatial |                  98.4% |              97.8% |
+| LIBERO Object  |                 100.0% |             100.0% |
+| LIBERO Goal    |                  98.0% |              97.8% |
+| LIBERO 10      |                  96.6% |              93.2% |
+| Average        |                 98.25% |             97.20% |
+
+These results demonstrate MolmoAct2's strong performance across diverse robotic
+manipulation tasks. To reproduce them, follow the instructions in the LIBERO
+evaluation section.
+
+## Differences From the Original Implementation
+
+This LeRobot port is intended to match MolmoAct2 behavior while using LeRobot's
+dataset, training, evaluation, checkpoint, and logging infrastructure. The main
+differences from the original training repository are:
+
+- The original paper training stack loads the model in fp32 and trains under
+  mixed precision. This LeRobot port usually loads the checkpoint directly in
+  `policy.model_dtype=bfloat16` for lower memory use.
+- The original repository uses its own FSDP/model-parallel training path. The
+  LeRobot port uses the standard LeRobot/Accelerate training path and has not
+  been tested for multi-node training.
+- The original repository supports sequence packing. The LeRobot port trains on
+  one LeRobot sample per item and pads to an inferred fixed sequence budget.
+- The LeRobot port follows LeRobot's optimizer, scheduler, checkpoint saving,
+  dataset transforms, image augmentation, and Weights & Biases logging
+  conventions.
+- The original training path supports mixed action horizons by padding to
+  `max_action_horizon` and masking padded horizon slots in the action expert
+  self-attention. This is useful when training across datasets with different
+  control frequencies. The LeRobot port currently targets single-dataset
+  fine-tuning, so `policy.chunk_size` overrides the checkpoint
+  `max_action_horizon` and horizon masking is not implemented yet. Support for
+  this mixed-horizon path is planned.
+
+## Citation
+
+```bibtex
+@misc{fang2026molmoact2actionreasoningmodels,
+      title={MolmoAct2: Action Reasoning Models for Real-world Deployment},
+      author={Haoquan Fang and Jiafei Duan and Donovan Clay and Sam Wang and Shuo Liu and Weikai Huang and Xiang Fan and Wei-Chuan Tsai and Shirui Chen and Yi Ru Wang and Shanli Xing and Jaemin Cho and Jae Sung Park and Ainaz Eftekhar and Peter Sushko and Karen Farley and Angad Wadhwa and Cole Harrison and Winson Han and Ying-Chun Lee and Eli VanderBilt and Rose Hendrix and Suveen Ellawela and Lucas Ngoo and Joyce Chai and Zhongzheng Ren and Ali Farhadi and Dieter Fox and Ranjay Krishna},
+      year={2026},
+      eprint={2605.02881},
+      archivePrefix={arXiv},
+      primaryClass={cs.RO},
+      url={https://arxiv.org/abs/2605.02881},
+}
+```
+
+## License
+
+This model is licensed under Apache 2.0. It is intended for research and
+educational use in accordance with
+[Ai2's Responsible Use Guidelines](https://allenai.org/responsible-use),
+consistent with [allenai/molmoact2](https://github.com/allenai/molmoact2).
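
A minimal sketch, separate from the patch, of the mixed-horizon idea described in the notes above: pad each action chunk to `max_action_horizon` and keep a mask that marks the padded slots. The function name and the list-based representation are assumptions for illustration only; neither the original repository's code nor the LeRobot port is reproduced here.

```python
def pad_action_chunk(actions, max_action_horizon, pad_value=0.0):
    """Pad a (horizon, action_dim) chunk to a fixed horizon.

    Returns (padded_actions, valid_mask). valid_mask[i] is True for real
    action steps and False for padded slots, so an attention layer or a
    loss can ignore the padding. Hypothetical helper, not a LeRobot API.
    """
    horizon = len(actions)
    if horizon == 0:
        raise ValueError("actions must contain at least one step")
    if horizon > max_action_horizon:
        raise ValueError("chunk is longer than max_action_horizon")

    action_dim = len(actions[0])
    n_pad = max_action_horizon - horizon
    padded = [list(step) for step in actions]
    padded += [[pad_value] * action_dim for _ in range(n_pad)]
    valid_mask = [True] * horizon + [False] * n_pad
    return padded, valid_mask


# A 2-step chunk from a low-frequency dataset padded to a 4-step horizon.
padded, mask = pad_action_chunk([[1.0, 2.0], [3.0, 4.0]], max_action_horizon=4)
```

Because the LeRobot port sets the horizon from `policy.chunk_size` and does not implement horizon masking yet, this only illustrates the behaviour attributed to the original training path.
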
--- a/docs/source/policy_molmoact2_README.md
+++ b/docs/source/policy_molmoact2_README.md
@@ -0,0 +1,39 @@
+# MolmoAct2
+
+This repository contains the LeRobot policy implementation of
+[MolmoAct2](https://allenai.org/blog/molmoact2), ported into LeRobot for
+training, evaluation, checkpointing, and dataset compatibility.
+
+This implementation currently supports training and evaluation for the regular
+MolmoAct2 model. MolmoAct2-Think, which supports adaptive depth reasoning, is
+not included in this LeRobot policy yet and is coming soon.
+
+For the original MolmoAct2 training code used for the experiments reported in
+the paper, see [allenai/molmoact2](https://github.com/allenai/molmoact2).
+
+## LIBERO Evaluation
+
+Important: we found that `num_steps_wait=10` does not reliably let the LIBERO
+scene stabilize and can degrade measured success. All LIBERO evaluation results
+reported for this LeRobot implementation use `num_steps_wait=50`.
+
+## Citation
+
+```bibtex
+@misc{fang2026molmoact2actionreasoningmodels,
+      title={MolmoAct2: Action Reasoning Models for Real-world Deployment},
+      author={Haoquan Fang and Jiafei Duan and Donovan Clay and Sam Wang and Shuo Liu and Weikai Huang and Xiang Fan and Wei-Chuan Tsai and Shirui Chen and Yi Ru Wang and Shanli Xing and Jaemin Cho and Jae Sung Park and Ainaz Eftekhar and Peter Sushko and Karen Farley and Angad Wadhwa and Cole Harrison and Winson Han and Ying-Chun Lee and Eli VanderBilt and Rose Hendrix and Suveen Ellawela and Lucas Ngoo and Joyce Chai and Zhongzheng Ren and Ali Farhadi and Dieter Fox and Ranjay Krishna},
+      year={2026},
+      eprint={2605.02881},
+      archivePrefix={arXiv},
+      primaryClass={cs.RO},
+      url={https://arxiv.org/abs/2605.02881},
+}
+```
+
+## License
+
+This model is licensed under Apache 2.0. It is intended for research and
+educational use in accordance with
+[Ai2's Responsible Use Guidelines](https://allenai.org/responsible-use),
+consistent with [allenai/molmoact2](https://github.com/allenai/molmoact2).
--- a/docs/source/robometer.mdx
+++ b/docs/source/robometer.mdx
@@ -0,0 +1,185 @@
+# ROBOMETER
+
+ROBOMETER is a **general-purpose video-language robotic reward model**. It predicts dense, frame-level task progress and frame-level success from a trajectory video and a task description.
+
+**Paper**: [ROBOMETER: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons](https://arxiv.org/abs/2603.02115)
+**Project**: [robometer.github.io](https://robometer.github.io/)
+**Original code**: [github.com/robometer/robometer](https://github.com/robometer/robometer)
+**Checkpoint**: [lerobot/Robometer-4B](https://huggingface.co/lerobot/Robometer-4B)
+
+## Overview
+
+ROBOMETER builds on `Qwen/Qwen3-VL-4B-Instruct` and adds three lightweight prediction heads:
+
+- **Progress head**: predicts per-frame task progress in `[0, 1]`.
+- **Success head**: predicts per-frame task success probability.
+- **Preference head**: used during training to predict which of two trajectories better completes the task.
+
+The paper trains ROBOMETER with a composite objective:
+
+```text
+L = L_pref + L_prog + L_succ
+```
+
+The LeRobot integration is currently **inference-only**. It preserves the preference head so that the published `Robometer-4B` checkpoint loads without remapping, but `compute_reward()` queries the progress or success head only.
+
+## What the LeRobot Integration Covers
+
+- Standard `reward_model.type=robometer` configuration through LeRobot.
+- Qwen3-VL image and text preprocessing through `RobometerEncoderProcessorStep`.
+- LeRobot reward-model save/load APIs through `PreTrainedRewardModel`.
+- Dense, frame-level progress and success predictions internally.
+- A scalar reward through `compute_reward()` for downstream LeRobot reward-model usage.
+
+This page focuses on using the published ROBOMETER checkpoint as a zero-shot reward model. Training ROBOMETER from scratch is outside the current LeRobot integration.
+
+## Installation Requirements
+
+1. Install LeRobot by following the [Installation Guide](./installation).
+2. Install the ROBOMETER dependencies:
+
+```bash
+pip install -e ".[robometer]"
+```
+
+If you use `uv` directly from a source checkout:
+
+```bash
+uv sync --extra robometer
+```
+
+ROBOMETER uses a Qwen3-VL-4B backbone, so GPU inference is strongly recommended.
+
+## Model Inputs and Outputs
+
+ROBOMETER expects:
+
+- A trajectory video or sequence of frames.
+- A natural-language task description.
+
+In LeRobot datasets, the preprocessor reads:
+
+| Config field              | Default                  | Meaning                                               |
+| ------------------------- | ------------------------ | ----------------------------------------------------- |
+| `reward_model.image_key`  | `observation.images.top` | Camera/video observation used by ROBOMETER            |
+| `reward_model.task_key`   | `task`                   | Key in complementary data that stores the task string |
+| `reward_model.max_frames` | `8`                      | Maximum number of frames passed to ROBOMETER          |
+
+The model predicts per-frame progress and success internally. The LeRobot reward API returns a scalar per sample:
+
+- `reward_output="progress"` (default): return the last-frame progress, clamped to `[0, 1]`.
+- `reward_output="success"`: return `1.0` if the last-frame success probability is above `success_threshold`, otherwise `0.0`.
+
+## Usage
+
+### Load the Reward Model Directly
+
+```python
+from lerobot.rewards.robometer import RobometerConfig, RobometerRewardModel
+
+cfg = RobometerConfig(
+    pretrained_path="lerobot/Robometer-4B",
+    device="cuda",
+    reward_output="progress",
+)
+reward_model = RobometerRewardModel.from_pretrained(cfg.pretrained_path, config=cfg)
+```
+
+### Encode Frames and Compute a Reward
+
+For a direct Python call, provide frames as `uint8` arrays with shape `(T, H, W, C)` and a task string:
+
+```python
+from lerobot.rewards.robometer.modeling_robometer import ROBOMETER_FEATURE_PREFIX
+from lerobot.rewards.robometer.processor_robometer import RobometerEncoderProcessorStep
+
+# frames: np.ndarray, shape (T, H, W, C), dtype uint8
+# task: str
+encoder = RobometerEncoderProcessorStep(
+    base_model_id=cfg.base_model_id,
+    use_multi_image=cfg.use_multi_image,
+    use_per_frame_progress_token=cfg.use_per_frame_progress_token,
+    max_frames=cfg.max_frames,
+)
+
+encoded = encoder.encode_samples([(frames, task)])
+batch = {f"{ROBOMETER_FEATURE_PREFIX}{key}": value for key, value in encoded.items()}
+
+reward = reward_model.compute_reward(batch)
+```
+
+`reward` is a tensor of shape `(batch_size,)`.
+
+### Use the Reward Factory
+
+You can also instantiate ROBOMETER through the reward factory:
+
+```python
+from lerobot.rewards import make_reward_model, make_reward_model_config, make_reward_pre_post_processors
+
+cfg = make_reward_model_config(
+    "robometer",
+    pretrained_path="lerobot/Robometer-4B",
+    device="cuda",
+    image_key="observation.images.top",
+)
+reward_model = make_reward_model(cfg)
+preprocessor, postprocessor = make_reward_pre_post_processors(cfg)
+```
+
+The preprocessor writes Qwen-VL tensors under the `observation.robometer.*` namespace, and `compute_reward()` reads those encoded tensors.
+
+## Configuration Notes
+
+### Backbone and Vocabulary
+
+The published checkpoint uses a Qwen3-VL-4B backbone. ROBOMETER adds five special tokens to the tokenizer in a fixed order:
+
+```text
+<|split_token|>
+<|reward_token|>
+<|pref_token|>
+<|sim_token|>
+<|prog_token|>
+```
+
+`<|prog_token|>` is inserted after each frame and is the hidden-state position used for per-frame progress and success prediction. `<|split_token|>` and `<|pref_token|>` are used by the paper's pairwise trajectory preference objective. `<|reward_token|>` and `<|sim_token|>` are preserved for checkpoint compatibility.
+
+The LeRobot config stores a serialized `vlm_config` with the post-resize vocabulary so the model can reload from `config.json` without downloading the base Qwen weights first. For `Qwen/Qwen3-VL-4B-Instruct`, the tokenizer length is `151669`, and the five ROBOMETER tokens produce the checkpoint vocabulary size `151674`.
+
+### Progress Prediction
+
+In the published checkpoint, progress is discrete. The progress head outputs logits over `progress_discrete_bins=10` uniformly spaced bin centers in `[0, 1]`. LeRobot converts these logits into a continuous value by applying a softmax and taking the expectation over bin centers, matching the upstream ROBOMETER implementation.
+
+### Success Prediction
+
+The success head outputs raw logits per frame. LeRobot converts them to probabilities with `sigmoid`. When `reward_output="success"`, `compute_reward()` thresholds the last-frame success probability using `success_threshold`.
+
+## Limitations
+
+- The current LeRobot integration is inference-only; it does not implement ROBOMETER training or preference-pair training.
+- `compute_reward()` returns a scalar per sample for the LeRobot reward-model API, even though ROBOMETER predicts per-frame progress and success internally.
+- ROBOMETER is video-language based; it does not use privileged robot state such as contact forces or object poses.
+
+## References
+
+- [ROBOMETER project](https://robometer.github.io/)
+- [ROBOMETER paper](https://arxiv.org/abs/2603.02115)
+- [Original ROBOMETER code](https://github.com/robometer/robometer)
+- [Published ROBOMETER-4B checkpoint](https://huggingface.co/lerobot/Robometer-4B)
+- [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct)
+
+## Citation
+
+```bibtex
+@inproceedings{liang2026robometer,
+title = {Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons},
+author={Anthony Liang and Yigit Korkmaz and Jiahui Zhang and Minyoung Hwang and Abrar Anwar and Sidhant Kaushik and Aditya Shah and Alex S. Huang and Luke Zettlemoyer and Dieter Fox and Yu Xiang and Anqi Li and Andreea Bobu and Abhishek Gupta and Stephen Tu and Erdem Biyik and Jesse Zhang},
+year={2026},
+booktitle={Robotics: Science and Systems 2026},
+}
+```
+
+## License
+
+This LeRobot integration follows the **Apache 2.0 License** used by LeRobot. Check the upstream ROBOMETER code and model pages for the licenses of the original implementation and released checkpoints.
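
A minimal sketch, separate from the patch, of the reward conversions described in `robometer.mdx`: a softmax expectation over bin centers for progress, and a sigmoid followed by a threshold for success. The function names, the endpoint-inclusive bin centers, and the `0.5` default threshold are assumptions for illustration; they are not the LeRobot or upstream ROBOMETER API.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_progress(logits, centers):
    """Expectation of bin centers under softmax(logits), clamped to [0, 1]."""
    probs = softmax(logits)
    value = sum(p * c for p, c in zip(probs, centers))
    return min(1.0, max(0.0, value))


def success_reward(logit, success_threshold=0.5):
    """Sigmoid the raw logit, then threshold it to a 0/1 reward."""
    if logit >= 0:
        prob = 1.0 / (1.0 + math.exp(-logit))
    else:  # avoids overflow in exp() for very negative logits
        e = math.exp(logit)
        prob = e / (1.0 + e)
    return 1.0 if prob > success_threshold else 0.0


# Assumption: 10 uniformly spaced centers with both endpoints included.
centers = [i / 9 for i in range(10)]
flat = expected_progress([0.0] * 10, centers)            # uniform distribution
peaked = expected_progress([0.0] * 9 + [50.0], centers)  # mass on the last bin
```

With flat logits the expectation sits at the midpoint of the range, and with all mass on the last bin it approaches `1.0`, which is the behaviour the page describes for the last-frame progress reward.
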
--- a/examples/annotations/run_hf_job.py
+++ b/examples/annotations/run_hf_job.py
@@ -0,0 +1,119 @@
+#!/usr/bin/env python
+"""Launch ``lerobot-annotate`` on a Hugging Face job (vllm + Qwen3.6-27B VLM).
+
+Spawns one ``h200x4`` job that:
+
+  1. installs this branch of ``lerobot`` plus the annotation extras,
+  2. boots four vllm servers (one per GPU) with Qwen3.6-27B (dense VLM),
+  3. runs the plan and interjections modules across the dataset in
+     free-form mode (each episode generates its own subtasks +
+     memory); the vqa module is disabled in this example,
+  4. uploads the annotated dataset to ``--dest_repo_id`` (when set)
+     or back to ``--repo_id``.
+
+Usage:
+
+    HF_TOKEN=hf_... uv run python examples/annotations/run_hf_job.py
+
+Adjust ``CMD`` below to point at your own dataset / target hub repo.
+"""
+
+import os
+
+from huggingface_hub import get_token, run_job
+
+token = os.environ.get("HF_TOKEN") or get_token()
+if not token:
+    raise RuntimeError("No HF token. Run `huggingface-cli login` or `export HF_TOKEN=hf_...`")
+
+CMD = (
+    "apt-get update -qq && apt-get install -y -qq git ffmpeg && "
+    "pip install --no-deps "
+    "'lerobot @ git+https://github.com/huggingface/lerobot.git@feat/language-annotation-pipeline' && "
+    "pip install --upgrade-strategy only-if-needed "
+    "datasets pyarrow av jsonlines draccus gymnasium torchcodec mergedeep pyyaml-include toml typing-inspect "
+    "openai && "
+    "export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 && "
+    "export VLLM_VIDEO_BACKEND=pyav && "
+    "lerobot-annotate "
+    "--repo_id=pepijn223/robocasa_pretrain_human300_v4 "
+    "--dest_repo_id=pepijn223/robocasa_pretrain_human300_v4_annotated5 "
+    "--push_to_hub=true "
+    "--vlm.backend=openai "
+    "--vlm.model_id=Qwen/Qwen3.6-27B "
+    "--vlm.parallel_servers=4 "
+    "--vlm.num_gpus=4 "
+    '--vlm.serve_command="vllm serve Qwen/Qwen3.6-27B '
+    "--tensor-parallel-size 1 --max-model-len 32768 "
+    '--gpu-memory-utilization 0.8 --uvicorn-log-level warning --port {port}" '
+    "--vlm.serve_ready_timeout_s=1800 "
+    "--vlm.client_concurrency=128 "
+    "--vlm.max_new_tokens=512 "
+    "--vlm.temperature=0.7 "
+    "--executor.episode_parallelism=16 "
+    "--vlm.chat_template_kwargs='{\"enable_thinking\": false}' "
+    "--vlm.camera_key=observation.images.robot0_agentview_right "
+    # Phase 1: plan module (subtasks + memory; plan rows are skipped, see emit_plan below).
+    # Embed decoded frames directly (use_video_url=false) rather than
+    # handing the server a file:// clip. The embedded path is more
+    # reliable: if clip extraction ever fails, the video_url path would
+    # silently send NO video and the VLM would hallucinate subtasks from
+    # the task text alone.
+    #
+    # CONTEXT BUDGET: with embedded frames, each frame is ~250-320 vision
+    # tokens. The model's context is 32768 (see --max-model-len). 32
+    # frames sampled uniformly across the episode (~8-10k tokens) fits
+    # comfortably alongside the prompt and the describe pass.
+    # Do NOT raise max_video_frames toward 128 with embedded frames — that
+    # is ~33-39k tokens and overflows the context (BadRequestError 400,
+    # "Input length exceeds maximum context length").
+    "--plan.use_video_url=false "
+    "--plan.frames_per_second=1.0 "
+    "--plan.max_video_frames=32 "
+    # Constant 1 fps density via windowing: episodes longer than 32s are
+    # split into 32-second windows (each 32 frames @ 1 fps, fits context),
+    # so long episodes get MORE subtasks instead of a sparser whole-episode
+    # view. describe->segment runs per window; spans are merged +
+    # stitched to a contiguous whole-episode cover. 0 disables.
+    "--plan.subtask_window_seconds=32 "
+    # IMPORTANT for RoboCasa: the dataset's task string ("Navigate to the
+    # stove", "Pick the mug...") is authoritative and is what eval uses.
+    # ``derive_task_from_video=off`` keeps that canonical task driving
+    # subtask generation. Do NOT use ``always`` here — it throws the real
+    # task away, asks the VLM "what is this video about?" with no hint,
+    # and the hallucinated task then poisons every subtask + plan row.
+    "--plan.derive_task_from_video=off "
+    # NO task augmentation for RoboCasa: eval conditions on the exact task
+    # strings, so synthetic rephrasings are unused at best and (when they
+    # drift, e.g. "wander around the kitchen") harmful. 0 rephrasings +
+    # axes disabled = the policy only ever sees the canonical task.
+    "--plan.n_task_rephrasings=0 "
+    # action_records OFF: the structured {verb,object,arm,grasp,dest}
+    # schema is a manipulation schema; RoboCasa navigation / atomic tasks
+    # don't fit it and the VLM hallucinates. When on, records are purely
+    # additive (emitted as style="action_record" rows) and never touch
+    # the subtask text — useful only for long composite manipulation
+    # tasks. Leave off for RoboCasa atomic / navigation.
+    # Keep subtask decomposition tight for atomic tasks:
+    "--plan.plan_max_steps=10 "
+    # Only annotate subtasks + memory — skip the numbered "plan" rows
+    # (and their per-boundary VLM call). Flip to true to re-enable plan.
+    "--plan.emit_plan=false "
+    # NOTE: the grounding pass (describe -> segment, +1 VLM call/episode)
+    # is ON BY DEFAULT. Pass --plan.subtask_describe_first=false to disable
+    # on datasets you've verified are easy and want fewer calls.
+    # Phase 2 — interjections + speech.
+    "--interjections.max_interjections_per_episode=6 "
+    # Phase 4 — general VQA: DISABLED for this run.
+    "--vqa.enabled=false"
+)
+
+job = run_job(
+    image="vllm/vllm-openai:latest",
+    command=["bash", "-c", CMD],
+    flavor="h200x4",
+    secrets={"HF_TOKEN": token},
+    timeout="2h",
+)
+print(f"Job URL: {job.url}")
+print(f"Job ID:  {job.id}")
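
A minimal sketch, separate from the patch, of the context-budget and windowing arithmetic that the comments in `CMD` rely on. It reuses the rough 250-320 vision tokens per embedded frame quoted there. The function names and the `reserved_tokens` allowance for prompt and output are assumptions for illustration, not `lerobot-annotate` options.

```python
import math


def frame_token_range(num_frames, tokens_per_frame=(250, 320)):
    """Return the (low, high) vision-token estimate for embedded frames."""
    low, high = tokens_per_frame
    return num_frames * low, num_frames * high


def fits_context(num_frames, max_model_len=32768, reserved_tokens=4096):
    """Check the worst-case frame cost plus a prompt/output allowance."""
    _, high = frame_token_range(num_frames)
    return high + reserved_tokens <= max_model_len


def num_windows(duration_s, window_s):
    """Number of subtask windows; a non-positive window disables windowing."""
    if window_s <= 0 or duration_s <= window_s:
        return 1
    return math.ceil(duration_s / window_s)


budget_32 = frame_token_range(32)   # the setting used in CMD
ok_32 = fits_context(32)
ok_128 = fits_context(128)          # the setting the comments warn against
windows = num_windows(duration_s=100.0, window_s=32.0)
```

The chain cost per episode then scales with `num_windows(...)`, which is why long episodes cost more VLM calls once `--plan.subtask_window_seconds` is set.
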
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -198,6 +198,7 @@ wallx = [
    "lerobot[qwen-vl-utils-dep]",
 ]
 pi = ["lerobot[transformers-dep]", "lerobot[scipy-dep]"]
+molmoact2 = ["lerobot[transformers-dep]", "lerobot[peft-dep]", "lerobot[scipy-dep]"]
 smolvla = ["lerobot[transformers-dep]", "num2words>=0.5.14,<0.6.0", "accelerate>=1.7.0,<2.0.0"]
 multi_task_dit = ["lerobot[transformers-dep]", "lerobot[diffusers-dep]"]
 groot = [
@@ -211,6 +212,7 @@ groot = [
    "flash-attn>=2.5.9,<3.0.0 ; sys_platform != 'darwin'"
 ]
 sarm = ["lerobot[transformers-dep]", "pydantic>=2.0.0,<3.0.0", "faker>=33.0.0,<35.0.0", "lerobot[matplotlib-dep]", "lerobot[qwen-vl-utils-dep]"]
+robometer = ["lerobot[transformers-dep]", "lerobot[qwen-vl-utils-dep]", "lerobot[peft-dep]"]
 topreward = ["lerobot[transformers-dep]"]
 xvla = ["lerobot[transformers-dep]"]
 eo1 = ["lerobot[transformers-dep]", "lerobot[qwen-vl-utils-dep]"]
@@ -220,6 +222,18 @@ hilserl = ["lerobot[transformers-dep]", "lerobot[dataset]", "gym-hil>=0.1.13,<0.
 async = ["lerobot[grpcio-dep]", "lerobot[matplotlib-dep]"]
 peft = ["lerobot[transformers-dep]", "lerobot[peft-dep]"]

+# Annotation pipeline (lerobot-annotate). vllm is the preferred backend
+# on Linux, with a transformers fallback elsewhere; openai is the default
+# backend and talks to any OpenAI-compatible server (``vllm serve`` /
+# ``transformers serve`` / hosted endpoints). Distributed execution is
+# delegated to Hugging Face Jobs (see examples/annotations/run_hf_job.py).
+annotations = [
+    "lerobot[dataset]",
+    "lerobot[transformers-dep]",
+    "openai>=1.40,<2.0",
+    "vllm>=0.6.0,<1.0.0; sys_platform == 'linux'",
+]
+
 # Development
 dev = ["pre-commit>=3.7.0,<5.0.0", "debugpy>=1.8.1,<1.9.0", "lerobot[grpcio-dep]", "grpcio-tools==1.73.1", "mypy>=1.19.1", "ruff>=0.14.1", "lerobot[notebook]"]
 notebook = ["jupyter>=1.0.0,<2.0.0", "ipykernel>=6.0.0,<7.0.0"]
@@ -275,6 +289,7 @@ all = [
    "lerobot[multi_task_dit]",
    "lerobot[wallx]",
    "lerobot[pi]",
+    "lerobot[molmoact2]",
    "lerobot[smolvla]",
    # "lerobot[groot]", TODO(Steven): Gr00t requires specific installation instructions for flash-attn
    "lerobot[xvla]",
@@ -289,6 +304,7 @@ all = [
    "lerobot[libero]; sys_platform == 'linux'",
    "lerobot[metaworld]",
    "lerobot[sarm]",
+    "lerobot[robometer]",
    "lerobot[topreward]",
    "lerobot[peft]",
    # "lerobot[unitree_g1]", TODO: Unitree requires specific installation instructions for unitree_sdk2
@@ -311,6 +327,7 @@ lerobot-find-joint-limits="lerobot.scripts.lerobot_find_joint_limits:main"
 lerobot-imgtransform-viz="lerobot.scripts.lerobot_imgtransform_viz:main"
 lerobot-edit-dataset="lerobot.scripts.lerobot_edit_dataset:main"
 lerobot-setup-can="lerobot.scripts.lerobot_setup_can:main"
+lerobot-annotate="lerobot.scripts.lerobot_annotate:main"
 lerobot-rollout="lerobot.scripts.lerobot_rollout:main"

 # ---------------- Tool Configurations ----------------
@@ -329,7 +346,7 @@ torch = [{ index = "pytorch-cu128", marker = "sys_platform == 'linux'" }]
 torchvision = [{ index = "pytorch-cu128", marker = "sys_platform == 'linux'" }]

 [tool.setuptools.package-data]
-lerobot = ["envs/*.json"]
+lerobot = ["envs/*.json", "annotations/steerable_pipeline/prompts/*.txt"]

 [tool.setuptools.packages.find]
 where = ["src"]
@@ -390,7 +407,7 @@ exclude_dirs = [
    "benchmarks",
    "src/lerobot/datasets/push_dataset_to_hub",
 ]
-skips = ["B101", "B311", "B404", "B603", "B615"]
+skips = ["B101", "B311", "B404", "B603", "B607", "B615"]

 [tool.typos]
 default.extend-ignore-re = [
@@ -405,8 +422,11 @@ default.extend-ignore-identifiers-re = [
    "ein",
    "thw",
    "inpt",
+    "arange",
+    "is_compileable",
    "ROBOTIS",
-    "OT_VALUE"
+    "OT_VALUE",
+    "VanderBilt"
 ]

 # TODO: Uncomment when ready to use
--- a/src/lerobot/annotations/__init__.py
+++ b/src/lerobot/annotations/__init__.py
@@ -0,0 +1,15 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
--- a/src/lerobot/annotations/steerable_pipeline/__init__.py
+++ b/src/lerobot/annotations/steerable_pipeline/__init__.py
@@ -0,0 +1,36 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Steerable annotation pipeline producing ``language_persistent`` and
+``language_events`` columns for LeRobot datasets.
+
+The pipeline is decomposed into three independently runnable modules whose
+outputs are staged per-episode before a final parquet rewrite:
+
+- :mod:`.modules.plan_subtasks_memory` (the ``plan`` module) — persistent styles
+- :mod:`.modules.interjections_and_speech` (the ``interjections`` module) — event styles + speech
+- :mod:`.modules.general_vqa` (the ``vqa`` module) — event-style VQA pairs
+"""
+
+from .config import AnnotationPipelineConfig
+from .validator import StagingValidator, ValidationReport
+from .writer import LanguageColumnsWriter
+
+__all__ = [
+    "AnnotationPipelineConfig",
+    "LanguageColumnsWriter",
+    "StagingValidator",
+    "ValidationReport",
+]
--- a/src/lerobot/annotations/steerable_pipeline/config.py
+++ b/src/lerobot/annotations/steerable_pipeline/config.py
@@ -0,0 +1,412 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import annotations
+
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Any
+
+
+@dataclass
+class PlanConfig:
+    """``plan`` module: plan + subtasks + memory + task augmentation.
+
+    The ``plan`` module attaches the whole episode as one Qwen-VL video
+    block; ``max_video_frames`` only caps the frames packed in (a
+    model-capacity bound, not an annotation-logic knob).
+    """
+
+    enabled: bool = True
+
+    # Number of ``task_aug`` rephrasings emitted at ``t=0``. The renderer's
+    # ``${task}`` binding rotates among them per ``sample_idx``. ``0`` disables.
+    n_task_rephrasings: int = 10
+
+    # When to derive the task from the video instead of using
+    # ``record.episode_task``: ``off``, ``if_short`` (short / placeholder /
+    # missing canonical task), or ``always``. The derived task replaces the
+    # canonical one for every ``plan``-module prompt; ``meta/tasks.parquet``
+    # is never modified.
+    derive_task_from_video: str = "if_short"
+    derive_task_min_words: int = 3
+
+    # Frame sampling for the subtask-decomposition prompt. Frames are
+    # sampled uniformly across the whole episode up to ``max_video_frames``
+    # (so longer episodes are subsampled, not truncated).
+    #
+    # ``max_video_frames`` is a HARD context-budget cap. With the embedded-
+    # frame path (use_video_url=false), every frame becomes ~250-320 vision
+    # tokens, so 128 frames ≈ 33-39k tokens — over a 32k-context VLM. 32
+    # frames (~8-10k tokens) leaves ample room for the prompt + the
+    # describe pass. Raise only if your serving context is
+    # larger AND your episodes need finer temporal resolution; if you hit
+    # "Input length exceeds maximum context length", lower this.
+    frames_per_second: float = 1.0
+    max_video_frames: int = 32
+
+    # Windowed subtask generation for CONSTANT temporal density. When > 0
+    # and an episode is longer than this many seconds, the plan module
+    # processes the episode in consecutive windows of this length, each
+    # sampled at ``frames_per_second``, instead of subsampling the whole
+    # episode to ``max_video_frames`` (which makes long episodes sparse).
+    # The describe -> segment chain runs per window; results are offset to
+    # absolute time, merged, and stitched into a contiguous whole-episode
+    # cover. Cost scales with episode length (≈ chain calls ×
+    # ceil(duration / window)). Set to about ``max_video_frames /
+    # frames_per_second`` (e.g. 32s at 1 fps) so each window fills, but never
+    # exceeds, the per-call frame budget. 0 disables (single whole-episode call).
+    subtask_window_seconds: float = 0.0
+
+    min_subtask_seconds: float = 1.5
+    plan_max_steps: int = 8
+
+    # ``subtask_describe_first``: run a grounding pass that narrates ONLY
+    # what is visible in the video (no subtask JSON yet), then inject that
+    # description into the segmentation prompt. Forces the model to observe
+    # before committing to structured output — the strongest lever against
+    # subtasks invented from the task text. ON by default; +1 VLM call/ep.
+    # Set False to trade quality for fewer calls on easy datasets.
+    subtask_describe_first: bool = True
+
+    # Emit ``style="plan"`` rows (the numbered still-todo list re-emitted at
+    # every subtask boundary). Set False to keep only subtasks + memory and
+    # skip the plan rows entirely — saves one ``_generate_plan`` VLM call per
+    # subtask boundary. Subtask and memory generation are unaffected.
+    emit_plan: bool = True
+
+    # NOTE: subtask spans are ALWAYS stitched into a contiguous
+    # full-episode cover (first subtask pulled back to t0, gaps closed,
+    # last span extended to t_last) as a deterministic post-step in
+    # ``_generate_subtasks._stitch_full_coverage``. This is not
+    # configurable — a sparse / gap-ridden subtask timeline is never
+    # desirable for conditioning, so it is unconditional.
+
+    # When True (and backend supports it, e.g. ``openai``), the ``plan``
+    # module sends a ``video_url`` block pointing at a per-episode mp4
+    # subclip and lets the server sample frames at ``use_video_url_fps``.
+    use_video_url: bool = False
+    use_video_url_fps: float = 1.0
+
+    # Structured per-subtask action records (Phase 1a + 1b, inspired by
+    # EgoMimic's annotator form). For each generated subtask span, the
+    # VLM extracts a typed record (verb / object / arm / grasp_type /
+    # destination / mistake). The record is emitted as a separate
+    # ``style="action_record"`` row and is purely additive: it never
+    # rewrites the subtask text. See ``ActionRecordsConfig`` for
+    # details. Off by default (back-compat).
+    action_records: ActionRecordsConfig = field(default_factory=lambda: ActionRecordsConfig())
+
+    # Structured 5-axis augmentation taxonomy for the t=0 task variants
+    # (replaces the free-form ``n_task_rephrasings`` flow when enabled).
+    # Mirrors EgoMimic's ``augment_prompt.txt`` taxonomy: instead of N
+    # free-form rephrasings, the VLM produces variants along named
+    # axes (synonym / omit_arm / omit_orientation / omit_grasp_method /
+    # combined). Off by default (back-compat).
+    task_aug_axes: TaskAugAxesConfig = field(default_factory=lambda: TaskAugAxesConfig())
+
+
+@dataclass
+class ActionRecordsConfig:
+    """Structured per-subtask action record extraction.
+
+    When ``enabled=True``, after the existing subtask-span generation in
+    ``plan_subtasks_memory.py``, the module makes one extra VLM call per
+    subtask to extract a typed record::
+
+        {
+            "verb": "pick" | "place" | "press" | ...,  # closed vocabulary
+            "object": "<canonical_object_name>",
+            "arm": "left" | "right" | "both" | null,
+            "grasp_type": "pinch" | "wrap" | "hook" | ... | null,
+            "destination": "<canonical_destination>" | null,
+            "mistake": "<short text>" | null,
+        }
+
+    The record is emitted as a separate row with ``style="action_record"``
+    (``content=json.dumps(record)``) at the subtask's start timestamp.
+    It is PURELY ADDITIVE — it never touches the VLM's subtask text.
+    Downstream training can consume the typed schema directly (e.g.
+    auxiliary supervision on verb / arm / grasp classification heads)
+    while the subtask string the policy conditions on stays exactly what
+    the subtask module produced. (Reconstructing subtask text from these
+    fields was too easy for the VLM to hallucinate on tasks that don't
+    fit the manipulation schema — navigation tasks yielded nonsense like
+    ``move stove to stove`` — so that path was removed.)
+
+    Cost: one extra VLM call per subtask. For an 8-subtask episode this
+    means 8 additional VLM calls in the plan module, which is still cheap
+    relative to the action-expert training cost, but worth knowing.
+    """
+
+    enabled: bool = False
+
+    # When True (default), emit a separate row with ``style="action_record"``
+    # and ``content=json.dumps(record)`` at the subtask's start timestamp.
+    # This is the only output of the feature — set ``enabled=False`` to
+    # skip the extra VLM calls entirely.
+    emit_record_row: bool = True
+
+    # Frame sampling for the per-subtask VLM call (similar to the
+    # interjection module's window). Anchored to the subtask span.
+    frames_per_subtask: int = 4
+
+    # Closed verb vocabulary. The prompt instructs the VLM to pick
+    # exactly one. Override per-dataset (e.g. ``["pick", "place", "open",
+    # "close"]`` for door-only manipulation) for tighter constraint.
+    verb_vocabulary: tuple[str, ...] = (
+        "pick",
+        "place",
+        "push",
+        "pull",
+        "open",
+        "close",
+        "turn",
+        "press",
+        "lift",
+        "insert",
+        "pour",
+        "move",
+        "reach",
+        "grasp",
+        "release",
+        "wipe",
+        "dump",
+    )
+
+    # Closed grasp-type vocabulary. ``null`` is always allowed (no
+    # contact / unclear). Adjust per-hardware (e.g. drop ``hook`` /
+    # ``key`` for parallel-jaw grippers).
+    grasp_vocabulary: tuple[str, ...] = (
+        "pinch",
+        "wrap",
+        "hook",
+        "key",
+        "lateral",
+    )
+
+
+@dataclass
+class TaskAugAxesConfig:
+    """Structured 5-axis augmentation taxonomy for t=0 task variants.
+
+    When ``enabled=True``, replaces the free-form ``n_task_rephrasings``
+    flow with a structured prompt that produces variants along five
+    named axes (mirroring EgoMimic's ``augment_prompt.txt``):
+
+      * ``synonym_paraphrase`` — different wording / verbs, all
+        information preserved.
+      * ``omit_arm`` — drop the left/right/both arm specification.
+      * ``omit_orientation`` — drop orientation cues (upright,
+        sideways, ...).
+      * ``omit_grasp_method`` — drop grip / grasp method specification.
+      * ``combined_omissions`` — combine two of the above
+        simultaneously.
+
+    Default counts (3+3+2+2+2 = 12 variants per task) match EgoMimic.
+    Axes that have nothing to omit in the source task (e.g. ``omit_arm``
+    when the task doesn't mention an arm) emit fewer entries rather
+    than pad — the prompt instructs the VLM accordingly.
+
+    Each variant is emitted as a ``task_aug`` row at ``t=0`` (same
+    style as the free-form variants), so the rest of the pipeline /
+    training recipe doesn't need to know about the taxonomy.
+    """
+
+    enabled: bool = False
+
+    synonym_paraphrase: int = 3
+    omit_arm: int = 3
+    omit_orientation: int = 2
+    omit_grasp_method: int = 2
+    combined_omissions: int = 2
+
+    @property
+    def total(self) -> int:
+        """Sum of requested variants across all axes (upper bound)."""
+        return (
+            self.synonym_paraphrase
+            + self.omit_arm
+            + self.omit_orientation
+            + self.omit_grasp_method
+            + self.combined_omissions
+        )
+
+
+@dataclass
+class InterjectionsConfig:
+    """``interjections`` module: interjections + paired speech."""
+
+    enabled: bool = True
+
+    # Each interjection emits a paired ``(interjection, speech)`` event row
+    # and triggers a ``plan`` refresh at the same timestamp via the
+    # ``plan`` module.
+    max_interjections_per_episode: int = 3
+    interjection_min_t: float = 2.0
+
+    # Visual context attached to the interjection prompt: a short window
+    # of frames centered on the chosen timestamp so the VLM sees the
+    # ongoing motion rather than a single frozen frame.
+    interjection_window_seconds: float = 2.0
+    interjection_window_frames: int = 4
+
+
+@dataclass
+class VqaConfig:
+    """``vqa`` module: general VQA."""
+
+    enabled: bool = True
+    vqa_emission_hz: float = 1.0
+    K: int = 1
+    """How many *consecutive* frames each emission tick anchors a VQA pair
+    to. The VLM grounds its answer (bbox / keypoint coordinates, count, …)
+    against the *first* anchored frame's image, so anchoring K>1 frames
+    copies that same answer onto later frames where the scene has already
+    moved — stale labels. Default ``1``: a VQA pair lands on exactly its
+    emission frame, no temporal smear. Raise it only to trade label
+    precision for more (noisier) VQA frames."""
+    question_types: tuple[str, ...] = ("bbox", "keypoint", "count", "attribute", "spatial")
+
+    # Camera restriction. By default VQA iterates EVERY camera the
+    # dataset declares (one VQA pair per camera per emission tick). Set
+    # ``restrict_to_default_camera=True`` to ground VQA on only the
+    # single ``--vlm.camera_key`` stream — the same camera the plan /
+    # interjection modules use — so the whole pipeline focuses on one
+    # view. Use this when you want every annotation grounded on, e.g.,
+    # ``observation.images.base`` and nothing else.
+    restrict_to_default_camera: bool = False
+
+
+@dataclass
+class VlmConfig:
+    """Shared Qwen-VL client configuration."""
+
+    # One of ``vllm``, ``transformers``, ``openai``, or ``stub`` (tests).
+    # ``openai`` talks to a local OpenAI-compatible server; the CLI
+    # auto-spawns one when ``auto_serve=True``.
+    backend: str = "openai"
+    model_id: str = "Qwen/Qwen3.6-35B-A3B-FP8"
+
+    # OpenAI-compatible server endpoint; ``EMPTY`` works for local servers.
+    api_base: str = "http://localhost:8000/v1"
+    api_key: str = "EMPTY"
+
+    # When True with ``backend=openai``, the CLI probes ``api_base`` and
+    # spawns a server if none answers (default: ``transformers serve``).
+    # Set to False to fail fast when pointing at a remote endpoint.
+    auto_serve: bool = True
+    serve_port: int = 8000
+    # Override the auto-serve command. ``{port}`` is substituted per replica
+    # when ``parallel_servers > 1``.
+    serve_command: str | None = None
+
+    # Run multiple independent inference servers for round-robin client
+    # routing (each pinned to a GPU via ``CUDA_VISIBLE_DEVICES`` and bound
+    # to ``serve_port + i``). ``num_gpus=0`` means one GPU per replica.
+    parallel_servers: int = 1
+    num_gpus: int = 0
+    client_concurrency: int = 16
+    serve_ready_timeout_s: float = 600.0
+
+    max_new_tokens: int = 512
+    temperature: float = 0.2
+    json_mode: bool = True
+    batch_size: int = 4
+    tensor_parallel_size: int = 1
+
+    # Fraction of GPU memory vllm allocates for weights + KV cache.
+    gpu_memory_utilization: float = 0.9
+    # Cap context length (None = model default). On 80 GB H100 a 30B BF16
+    # model often needs <= 8192 to leave KV-cache headroom.
+    max_model_len: int | None = None
+    trust_remote_code: bool = False
+
+    # Override the camera stream used for keyframe attachment. None picks
+    # the first ``observation.images.*`` key the dataset declares.
+    camera_key: str | None = None
+    # Forwarded as ``extra_body.chat_template_kwargs`` on every chat call;
+    # use to pass model-specific flags such as ``{"enable_thinking": false}``.
+    chat_template_kwargs: dict[str, Any] | None = None
+
+
+@dataclass
+class ExecutorConfig:
+    """Executor settings.
+
+    Distributed execution is provided by Hugging Face Jobs (see
+    ``examples/annotation/run_hf_job.py``); this config only controls
+    intra-process episode concurrency.
+    """
+
+    # Episodes processed concurrently within each module phase. Each
+    # in-flight episode dispatches 3-5 dependent VLM calls, so this is the
+    # main knob for saturating ``parallel_servers`` and ``client_concurrency``.
+    episode_parallelism: int = 16
+
+
+@dataclass
+class AnnotationPipelineConfig:
+    """Top-level config for ``lerobot-annotate``.
+
+    The writer rewrites ``data/chunk-*/file-*.parquet`` in place. Multiple
+    revisions of the same dataset live in separate copies.
+    """
+
+    # Hub dataset id. Used as the download source when ``root`` is unset,
+    # and as the destination repo when ``push_to_hub`` is enabled and
+    # ``dest_repo_id`` is unset.
+    repo_id: str | None = None
+
+    # Optional separate Hub dataset id to push the annotated result to. When
+    # unset, ``push_to_hub`` uploads back to ``repo_id`` (annotate in place);
+    # when set, the source ``repo_id`` is left untouched.
+    dest_repo_id: str | None = None
+
+    root: Path | None = None
+
+    # Defaults to ``<root>/.annotate_staging/`` when unset.
+    staging_dir: Path | None = None
+
+    seed: int = 1729
+
+    plan: PlanConfig = field(default_factory=PlanConfig)
+    interjections: InterjectionsConfig = field(default_factory=InterjectionsConfig)
+    vqa: VqaConfig = field(default_factory=VqaConfig)
+
+    vlm: VlmConfig = field(default_factory=VlmConfig)
+    executor: ExecutorConfig = field(default_factory=ExecutorConfig)
+
+    skip_validation: bool = False
+    only_episodes: tuple[int, ...] | None = None
+
+    # Keyframe decode backend. When unset, the pipeline decodes with the
+    # ffmpeg CLI: it decodes AV1 and runs each decode as an isolated child
+    # process, which is both crash-safe and safe under the concurrent
+    # decode the executor performs (torchcodec is not thread-safe and
+    # SIGSEGVs there). Set to ``"torchcodec"`` or ``"pyav"`` to pin an
+    # in-process decoder when its build is known thread-safe.
+    video_backend: str | None = None
+
+    # When True, upload the annotated dataset to the Hugging Face Hub:
+    # to ``dest_repo_id`` if set, otherwise back to ``repo_id``. One of
+    # the two must be set for this to take effect.
+    push_to_hub: bool = False
+    push_private: bool = False
+    push_commit_message: str | None = None
+
+    def resolved_staging_dir(self, root: Path) -> Path:
+        return self.staging_dir if self.staging_dir is not None else root / ".annotate_staging"
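
Reviewer sketch (not part of the patch): the `parallel_servers` comments in `VlmConfig` describe replicas bound to `serve_port + i`, each pinned to a GPU through `CUDA_VISIBLE_DEVICES`, with `{port}` substituted into `serve_command` and clients routed round-robin. The snippet below is one possible reading of that layout. It assumes GPU index equals replica index and that `{port}` is filled by plain string replacement; `plan_replicas`, `round_robin`, and the `my-server` command are hypothetical names for illustration, not the pipeline's API.

```python
from itertools import cycle, islice


def plan_replicas(serve_port: int, parallel_servers: int, serve_command: str) -> list[dict]:
    """Describe each replica: its port, GPU pinning, command, and endpoint."""
    replicas = []
    for i in range(parallel_servers):
        port = serve_port + i  # documented: replica i binds to serve_port + i
        replicas.append(
            {
                "port": port,
                # Assumption: replica i is pinned to GPU i (one GPU per replica).
                "env": {"CUDA_VISIBLE_DEVICES": str(i)},
                # Assumption: ``{port}`` is filled by plain string replacement.
                "command": serve_command.replace("{port}", str(port)),
                "api_base": f"http://localhost:{port}/v1",
            }
        )
    return replicas


def round_robin(replicas: list[dict], n_requests: int) -> list[str]:
    """Return the endpoint each of ``n_requests`` calls would be routed to."""
    return [r["api_base"] for r in islice(cycle(replicas), n_requests)]


# Hypothetical command string; only the ``{port}`` placeholder matters here.
replicas = plan_replicas(8000, 4, "my-server --port {port}")
print([r["port"] for r in replicas])
print(round_robin(replicas, 5))
```

With four replicas the fifth request wraps back to the first endpoint, which is why `client_concurrency` and `episode_parallelism` need to be large enough to keep every replica busy.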
--- a/src/lerobot/annotations/steerable_pipeline/executor.py
+++ b/src/lerobot/annotations/steerable_pipeline/executor.py
@@ -0,0 +1,259 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""In-process executor that runs the annotation phases.
+
+The pipeline is organised into **seven phases**, listed in dependency order:
+
+    phase 0: vocabulary discovery — derive a small canonical vocabulary
+             from the first few sample-episode videos (subtask labels +
+             memory milestones) and persist it next to the dataset; the
+             ``plan`` module then constrains every per-episode generation
+             to those strings, so the downstream policy sees a small,
+             repeatable conditioning distribution
+    phase 1: ``plan`` module (plan + subtasks + memory)
+    phase 2: ``interjections`` module (interjections + speech)
+    phase 3: ``plan`` plan-update pass — re-runs plan emission at every
+             interjection timestamp produced by phase 2
+    phase 4: ``vqa`` module (VQA)
+    phase 5: validator
+    phase 6: writer
+
+Phase 3 is why the ``plan`` module must be re-entered after the
+``interjections`` module — to refresh ``plan`` rows at interjection
+timestamps.
+
+Distributed execution is provided by Hugging Face Jobs (see
+``examples/annotations/run_hf_job.py``); the runner inside the job
+invokes ``lerobot-annotate`` which uses this in-process executor.
+Episode-level concurrency is controlled by
+``ExecutorConfig.episode_parallelism``.
+"""
+
+from __future__ import annotations
+
+import logging
+import time
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any
+
+from .config import AnnotationPipelineConfig
+from .reader import EpisodeRecord, iter_episodes
+from .staging import EpisodeStaging
+from .validator import StagingValidator
+from .writer import LanguageColumnsWriter
+
+logger = logging.getLogger(__name__)
+
+
+@dataclass
+class PhaseResult:
+    """Summary of one pipeline phase across all episodes."""
+
+    name: str
+    episodes_processed: int
+    episodes_skipped: int
+
+
+@dataclass
+class PipelineRunSummary:
+    """Aggregated result returned by :meth:`Executor.run`."""
+
+    phases: list[PhaseResult]
+    written_paths: list[Path]
+    validation_report: Any  # ValidationReport, kept Any to avoid import cycle
+
+
+@dataclass
+class Executor:
+    """Run phases 1-6 (module passes, validator, writer) over a dataset root in-process.
+
+    Episode-level concurrency comes from ``ExecutorConfig.episode_parallelism``
+    (a thread pool); cluster-level concurrency comes from running this
+    executor inside a Hugging Face Job. Tests construct the executor
+    directly with stub modules.
+    """
+
+    config: AnnotationPipelineConfig
+    plan: Any  # PlanSubtasksMemoryModule
+    interjections: Any  # InterjectionsAndSpeechModule
+    vqa: Any  # GeneralVqaModule
+    writer: LanguageColumnsWriter
+    validator: StagingValidator
+
+    def run(self, root: Path) -> PipelineRunSummary:
+        records = list(iter_episodes(root, only_episodes=self.config.only_episodes))
+        n = len(records)
+        if n == 0:
+            raise ValueError(f"No episodes found under {root}/data/")
+
+        print(f"[annotate] {n} episodes total", flush=True)
+
+        staging_dir = self.config.resolved_staging_dir(root)
+        staging_dir.mkdir(parents=True, exist_ok=True)
+
+        phases: list[PhaseResult] = []
+
+        # Phase 1: ``plan`` module (plan + subtasks + memory)
+        phases.append(self._run_module_phase("plan", records, staging_dir, self.plan))
+        # Phase 2: ``interjections`` module (interjections + speech). It
+        # reads the ``plan`` module's subtask rows from the same staging
+        # tree to ground the interjection prompt in the correct local subtask.
+        phases.append(self._run_module_phase("interjections", records, staging_dir, self.interjections))
+        # Phase 3: ``plan`` plan-update pass at interjection timestamps.
+        phases.append(self._run_plan_update_phase(records, staging_dir))
+        # Phase 4: ``vqa`` module (VQA)
+        phases.append(self._run_module_phase("vqa", records, staging_dir, self.vqa))
+
+        print("[annotate] running validator...", flush=True)
+        report = self.validator.validate(records, staging_dir)
+        if not report.ok and not self.config.skip_validation:
+            raise RuntimeError(f"Staging validation failed: {report.summary()}")
+        print(f"[annotate] validator: {report.summary()}", flush=True)
+
+        print(f"[annotate] writing parquet shards into {root}/data/...", flush=True)
+        written = self.writer.write_all(records, staging_dir, root)
+        print(f"[annotate] wrote {len(written)} shard(s); pipeline complete", flush=True)
+
+        # Keep meta/info.json aligned with the parquet schema we just wrote.
+        # Idempotent and additive: existing user metadata is preserved.
+        self._ensure_annotation_metadata_in_info(root)
+
+        return PipelineRunSummary(phases=phases, written_paths=written, validation_report=report)
+
+    @staticmethod
+    def _ensure_annotation_metadata_in_info(root: Path) -> None:
+        """Write language features and canonical tools to ``meta/info.json``.
+
+        ``LanguageColumnsWriter`` adds ``language_persistent`` and
+        ``language_events`` to parquet shards. The metadata must advertise
+        those columns too, otherwise non-streaming ``LeRobotDataset`` loads
+        cast against the old schema and fail on the extra parquet columns.
+        """
+        from lerobot.datasets.io_utils import load_info, write_info  # noqa: PLC0415
+        from lerobot.datasets.language import SAY_TOOL_SCHEMA, language_feature_info  # noqa: PLC0415
+
+        info_path = root / "meta" / "info.json"
+        if not info_path.exists():
+            return
+        try:
+            info = load_info(root)
+        except Exception as exc:  # noqa: BLE001
+            print(f"[annotate] could not read {info_path}: {exc}", flush=True)
+            return
+
+        changed = False
+
+        merged_features = {**info.features, **language_feature_info()}
+        if merged_features != info.features:
+            info.features = merged_features
+            changed = True
+
+        existing = info.tools or []
+        names = {(t.get("function") or {}).get("name") for t in existing if isinstance(t, dict)}
+        if SAY_TOOL_SCHEMA["function"]["name"] not in names:
+            info.tools = [*existing, SAY_TOOL_SCHEMA]
+            changed = True
+
+        if changed:
+            write_info(info, root)
+            print(
+                "[annotate] meta/info.json: "
+                f"language_features={list(language_feature_info())}, "
+                f"tools={[t['function']['name'] for t in (info.tools or [])]}",
+                flush=True,
+            )
+
+    def _run_module_phase(
+        self,
+        name: str,
+        records: list[EpisodeRecord],
+        staging_dir: Path,
+        module: Any,
+    ) -> PhaseResult:
+        if not module.enabled:
+            print(f"[annotate] phase={name} skipped (module disabled)", flush=True)
+            return PhaseResult(name=name, episodes_processed=0, episodes_skipped=len(records))
+        n = len(records)
+        parallelism = max(1, min(self.config.executor.episode_parallelism, n))
+        print(
+            f"[annotate] phase={name} starting on {n} episode(s) (parallelism={parallelism})",
+            flush=True,
+        )
+        t0 = time.time()
+
+        def _do(idx_record: tuple[int, EpisodeRecord]) -> tuple[int, int, float]:
+            i, record = idx_record
+            ep_start = time.time()
+            staging = EpisodeStaging(staging_dir, record.episode_index)
+            module.run_episode(record, staging)
+            return i, record.episode_index, time.time() - ep_start
+
+        processed = 0
+        if parallelism == 1:
+            for i, record in enumerate(records, 1):
+                _, ep_idx, elapsed = _do((i, record))
+                processed += 1
+                print(
+                    f"[annotate]   {name} episode {i}/{n} (idx={ep_idx}) done in {elapsed:.1f}s",
+                    flush=True,
+                )
+        else:
+            with ThreadPoolExecutor(max_workers=parallelism) as pool:
+                futures = [pool.submit(_do, (i, r)) for i, r in enumerate(records, 1)]
+                for fut in as_completed(futures):
+                    i, ep_idx, elapsed = fut.result()
+                    processed += 1
+                    print(
+                        f"[annotate]   {name} episode {processed}/{n} "
+                        f"(idx={ep_idx}, submit_order={i}) done in {elapsed:.1f}s",
+                        flush=True,
+                    )
+        total = time.time() - t0
+        print(f"[annotate] phase={name} complete: {processed}/{n} in {total:.1f}s", flush=True)
+        return PhaseResult(name=name, episodes_processed=processed, episodes_skipped=0)
+
+    def _run_plan_update_phase(  # noqa: PLR0915
+        self, records: list[EpisodeRecord], staging_dir: Path
+    ) -> PhaseResult:
+        """Re-emit ``plan`` rows at each timestamp the ``interjections`` module produced.
+
+        The ``plan`` module owns the prompt; the ``interjections`` module
+        produced the timestamps. This phase therefore calls back into the
+        ``plan`` module with the interjection timestamps so its existing
+        prompt path is reused.
+        """
+        if not self.plan.enabled or not self.interjections.enabled:
+            return PhaseResult(name="plan_update", episodes_processed=0, episodes_skipped=len(records))
+        processed = 0
+        for record in records:
+            staging = EpisodeStaging(staging_dir, record.episode_index)
+            interjection_rows = [
+                row for row in staging.read("interjections") if row.get("style") == "interjection"
+            ]
+            interjection_times = [float(row["timestamp"]) for row in interjection_rows]
+            interjection_texts = [str(row.get("content") or "") for row in interjection_rows]
+            if interjection_times:
+                self.plan.run_plan_updates(record, staging, interjection_times, interjection_texts)
+                processed += 1
+        # Episodes without any interjections are skipped (no plan refresh
+        # needed); count them so the summary's processed+skipped == total.
+        return PhaseResult(
+            name="plan_update",
+            episodes_processed=processed,
+            episodes_skipped=len(records) - processed,
+        )
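
Reviewer sketch (not part of the patch): `_run_module_phase` clamps the worker count to the number of episodes and collects results in completion order rather than submit order. The standalone snippet below shows that pattern under simplified assumptions. `run_phase` and its `work` callable are hypothetical stand-ins for the executor and `module.run_episode`, and unlike the patch it always goes through the pool instead of special-casing `parallelism == 1`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable


def run_phase(
    episode_indices: list[int],
    work: Callable[[int], str],
    requested: int,
) -> tuple[int, dict[int, str]]:
    """Run ``work`` once per episode index on a clamped thread pool."""
    n = len(episode_indices)
    # Never spawn more workers than episodes, and never fewer than one.
    parallelism = max(1, min(requested, n))
    results: dict[int, str] = {}
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # Key each future by its episode index so completion order does not matter.
        futures = {pool.submit(work, idx): idx for idx in episode_indices}
        for fut in as_completed(futures):
            # ``result()`` re-raises any exception raised inside the worker.
            results[futures[fut]] = fut.result()
    return parallelism, results


parallelism, results = run_phase([0, 1, 2], lambda idx: f"episode-{idx}", requested=16)
print(parallelism, sorted(results.items()))
```

Because `Future.result()` re-raises worker exceptions, a single failing episode aborts the whole phase in this sketch; wrapping the call in `try`/`except` would be the place to add per-episode fault tolerance if that were wanted.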
--- a/src/lerobot/annotations/steerable_pipeline/frames.py
+++ b/src/lerobot/annotations/steerable_pipeline/frames.py
@@ -0,0 +1,494 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Keyframe extraction for the annotation pipeline.
+
+Modules attach decoded camera frames to their VLM prompts so the model can
+ground subtask decomposition, interjection scenarios, and VQA in actual
+visual content. The pipeline shares one provider across modules and one
+episode at a time, with a small per-episode cache so multiple modules
+querying the same timestamp pay decode cost once.
+"""
+
+from __future__ import annotations
+
+import logging
+import threading
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Any, Protocol
+
+import PIL.Image
+import torch
+
+from lerobot.datasets.video_utils import decode_video_frames
+
+from .reader import EpisodeRecord
+
+logger = logging.getLogger(__name__)
+
+
+class FrameProvider(Protocol):
+    """Decodes camera frames at episode-relative timestamps."""
+
+    @property
+    def camera_keys(self) -> list[str]:
+        """All ``observation.images.*`` feature keys this provider can decode."""
+
+    def frames_at(
+        self,
+        record: EpisodeRecord,
+        timestamps: list[float],
+        camera_key: str | None = None,
+    ) -> list[Any]:
+        """Return one decoded frame per timestamp from ``camera_key`` (or default).
+
+        Frames are ``torch.Tensor`` (``C, H, W`` uint8) — the shape
+        :func:`lerobot.datasets.video_utils.decode_video_frames` returns.
+        :func:`to_image_blocks` converts them to PIL only at the VLM-message
+        boundary.
+
+        Empty list if the camera is unavailable. ``camera_key=None`` falls back
+        to the provider's default camera so existing single-camera callers
+        (the ``plan`` and ``interjections`` modules) keep working unchanged.
+        """
+
+    def video_for_episode(
+        self,
+        record: EpisodeRecord,
+        max_frames: int,
+        camera_key: str | None = None,
+    ) -> list[Any]:
+        """Return up to ``max_frames`` decoded frames covering the whole episode.
+
+        Sampling is uniform across the episode duration. Frames are
+        ``torch.Tensor`` (``C, H, W`` uint8); :func:`to_video_block` wraps
+        them into one ``{"type":"video", "video":<list>}`` block for a
+        Qwen-VL-compatible model that pools temporally itself. Empty list if
+        no camera available.
+        """
+
+
+@dataclass
+class _NullProvider:
+    """No-op provider used when the dataset has no video keys or in tests."""
+
+    @property
+    def camera_keys(self) -> list[str]:
+        return []
+
+    def frames_at(
+        self,
+        record: EpisodeRecord,
+        timestamps: list[float],
+        camera_key: str | None = None,
+    ) -> list[Any]:
+        return []
+
+    def video_for_episode(
+        self,
+        record: EpisodeRecord,
+        max_frames: int,
+        camera_key: str | None = None,
+    ) -> list[Any]:
+        return []
+
+
+def null_provider() -> FrameProvider:
+    return _NullProvider()
+
+
+@dataclass
+class VideoFrameProvider:
+    """Decodes frames from the dataset's ``observation.images.*`` streams.
+
+    By default the *first* camera key is used for the ``plan`` module
+    (subtask decomposition) and the ``interjections`` module (interjection
+    scenarios) — those prompts care about *what is happening*, not which
+    angle. The ``vqa`` module instead iterates over every camera in
+    :attr:`camera_keys` so each frame's
+    grounded answer (bbox/keypoint/...) is tagged with the camera it was
+    grounded against.
+
+    ``camera_key`` overrides the default-camera choice but does not restrict
+    :attr:`camera_keys`. Pass ``camera_key`` explicitly to ``frames_at`` /
+    ``video_for_episode`` to read a non-default stream.
+
+    Caches up to ``cache_size`` decoded frames per process to keep
+    co-timestamped ``interjections`` + ``plan`` plan-update calls cheap.
+    """
+
+    root: Path
+    camera_key: str | None = None
+    tolerance_s: float = 1e-2
+    cache_size: int = 256
+    # Keyframe decode backend. ``None`` uses the ffmpeg CLI — the
+    # concurrency- and crash-safe default for the pipeline's threaded
+    # decode. Set to ``"torchcodec"`` or ``"pyav"`` to pin an in-process
+    # decoder when the build is known thread-safe.
+    video_backend: str | None = None
+    _meta: Any = field(default=None, init=False, repr=False)
+    _cache: dict = field(default_factory=dict, init=False, repr=False)
+    _camera_keys: list[str] = field(default_factory=list, init=False, repr=False)
+    # Pipeline runs the three module phases under a ThreadPoolExecutor (see
+    # ``ExecutorConfig.episode_parallelism``); guard the dict cache and the
+    # one-shot warn flag against concurrent updates from worker threads.
+    _lock: threading.Lock = field(default_factory=threading.Lock, init=False, repr=False)
+
+    def __post_init__(self) -> None:
+        from lerobot.datasets.dataset_metadata import LeRobotDatasetMetadata  # noqa: PLC0415
+
+        self._meta = LeRobotDatasetMetadata(repo_id="local", root=self.root)
+        # Only ``video_keys`` are decodable here: the clip/decode paths read
+        # ``videos/<key>/from_timestamp`` from episode metadata, which exists
+        # only for video-stored cameras. Image-stored cameras (also in
+        # ``camera_keys``) would KeyError, so restrict the list — and the
+        # default — to video keys.
+        keys = list(self._meta.video_keys)
+        # Last-resort fallback: if metadata didn't surface any video keys but
+        # the caller explicitly named a camera (``--vlm.camera_key=...``),
+        # trust them — the key is by definition known to exist on the dataset.
+        if not keys and self.camera_key:
+            keys = [self.camera_key]
+        self._camera_keys = keys
+        if self.camera_key is None:
+            self.camera_key = keys[0] if keys else None
+
+    @property
+    def camera_keys(self) -> list[str]:
+        """Video-stored ``observation.images.*`` keys this provider can decode."""
+        return list(self._camera_keys)
+
+    def frames_at(
+        self,
+        record: EpisodeRecord,
+        timestamps: list[float],
+        camera_key: str | None = None,
+    ) -> list[Any]:
+        target = camera_key if camera_key is not None else self.camera_key
+        if not timestamps or target is None:
+            return []
+
+        out: list[Any] = []
+        misses: list[float] = []
+        miss_indices: list[int] = []
+        with self._lock:
+            for i, ts in enumerate(timestamps):
+                key = (record.episode_index, target, round(float(ts), 6))
+                cached = self._cache.get(key)
+                if cached is not None:
+                    out.append(cached)
+                else:
+                    out.append(None)
+                    misses.append(float(ts))
+                    miss_indices.append(i)
+
+        if misses:
+            decoded = self._decode(record.episode_index, misses, target)
+            # ``_decode`` returns exactly one frame per requested timestamp,
+            # or an empty list if decoding failed wholesale. A partial list
+            # would mean a frame/timestamp misalignment, so only pair them up
+            # when the counts match (``strict=True`` then guards regressions).
+            if len(decoded) == len(miss_indices):
+                with self._lock:
+                    for i, frame in zip(miss_indices, decoded, strict=True):
+                        out[i] = frame
+                        key = (record.episode_index, target, round(float(timestamps[i]), 6))
+                        if len(self._cache) >= self.cache_size:
+                            self._cache.pop(next(iter(self._cache)))
+                        self._cache[key] = frame
+        # Drop slots left as None by a failed decode, so the result can be shorter than ``timestamps``.
+        return [frame for frame in out if frame is not None]
+
+    def video_for_episode(
+        self,
+        record: EpisodeRecord,
+        max_frames: int,
+        camera_key: str | None = None,
+    ) -> list[Any]:
+        """Return up to ``max_frames`` frames uniformly sampled across the episode.
+
+        The whole episode duration is covered; the model picks subtask
+        boundaries from the temporal pooling it does internally. Frames are
+        ``torch.Tensor`` (see :meth:`frames_at`).
+        """
+        target = camera_key if camera_key is not None else self.camera_key
+        if max_frames <= 0 or target is None or not record.frame_timestamps:
+            return []
+        n_frames = min(max_frames, len(record.frame_timestamps))
+        if n_frames == len(record.frame_timestamps):
+            timestamps = list(record.frame_timestamps)
+        else:
+            t0 = record.frame_timestamps[0]
+            t_last = record.frame_timestamps[-1]
+            if t_last <= t0:
+                timestamps = [float(t0)] * n_frames
+            else:
+                step = (t_last - t0) / (n_frames - 1) if n_frames > 1 else 0.0
+                timestamps = [float(t0 + i * step) for i in range(n_frames)]
+        return self.frames_at(record, timestamps, camera_key=target)
+
+    def episode_clip_path(self, record: EpisodeRecord, cache_dir: Path) -> Path | None:
+        """Extract the episode's subclip to ``cache_dir/ep_{idx:06d}.mp4``.
+
+        Returns ``None`` if the dataset has no video tracks. Skips
+        re-extract when the cached clip already exists. Re-encodes to
+        H.264 (libx264) so the resulting mp4 is decodable by every
+        downstream video processor — stream-copy would inherit the
+        source codec (often AV1 in modern LeRobot datasets), which
+        vllm's libav build cannot decode.
+        """
+        import subprocess  # noqa: PLC0415
+
+        if self.camera_key is None:
+            return None
+        cache_dir.mkdir(parents=True, exist_ok=True)
+        out_path = cache_dir / f"ep_{record.episode_index:06d}.mp4"
+        if out_path.exists() and out_path.stat().st_size > 0:
+            return out_path
+        ep = self._meta.episodes[record.episode_index]
+        from_timestamp = float(ep[f"videos/{self.camera_key}/from_timestamp"])
+        to_timestamp = float(ep[f"videos/{self.camera_key}/to_timestamp"])
+        src = self.root / self._meta.get_video_file_path(record.episode_index, self.camera_key)
+        cmd = [
+            "ffmpeg",
+            "-y",
+            "-loglevel",
+            "error",
+            "-ss",
+            f"{from_timestamp:.3f}",
+            "-to",
+            f"{to_timestamp:.3f}",
+            "-i",
+            str(src),
+            "-c:v",
+            "libx264",
+            "-preset",
+            "ultrafast",
+            "-crf",
+            "23",
+            "-pix_fmt",
+            "yuv420p",
+            "-an",
+            str(out_path),
+        ]
+        try:
+            subprocess.run(cmd, check=True, timeout=300)
+        except (subprocess.CalledProcessError, subprocess.TimeoutExpired, FileNotFoundError):
+            return None
+        return out_path if out_path.exists() and out_path.stat().st_size > 0 else None
+
+    def _decode(self, episode_index: int, timestamps: list[float], camera_key: str) -> list[Any]:
+        """Decode ``timestamps`` from the episode's video as ``(C, H, W)`` tensors.
+
+        Uses the ffmpeg CLI by default. If ``video_backend`` is set, ``"pyav"`` / ``"av"``
+        use the PyAV helper and any other value goes through
+        :func:`lerobot.datasets.video_utils.decode_video_frames`. Returns one frame per
+        timestamp, or ``[]`` if decoding failed; callers treat ``[]`` as "no frames available".
+        """
+        ep = self._meta.episodes[episode_index]
+        from_timestamp = ep[f"videos/{camera_key}/from_timestamp"]
+        shifted = [from_timestamp + ts for ts in timestamps]
+        video_path = self.root / self._meta.get_video_file_path(episode_index, camera_key)
+
+        # Default to the ffmpeg CLI. The pipeline decodes under a 16-wide
+        # ThreadPoolExecutor and the in-process decoders are unsafe there:
+        # torchcodec is not thread-safe and SIGSEGVs under concurrent decode
+        # (a crash no try/except can catch), PyAV can likewise segfault on
+        # AV1, and lerobot's ``pyav`` backend routes through the removed
+        # ``torchvision.io.VideoReader``. ``_decode_frames_ffmpeg`` shells
+        # out per frame: each decode is an isolated child process, so it is
+        # both crash-safe and concurrency-safe. ``video_backend`` can pin
+        # ``torchcodec`` / ``pyav`` explicitly for callers that know their
+        # build is safe.
+        chain = [self.video_backend] if self.video_backend else ["ffmpeg"]
+
+        exc: Exception | None = None
+        for backend in chain:
+            try:
+                if backend == "ffmpeg":
+                    return _decode_frames_ffmpeg(video_path, shifted)
+                if backend in ("pyav", "av"):
+                    return _decode_frames_av(video_path, shifted)
+                # Stacked ``(N, C, H, W)`` uint8 tensor; one row per timestamp.
+                decoded = decode_video_frames(
+                    video_path, shifted, self.tolerance_s, backend=backend, return_uint8=True
+                )
+                return list(decoded)
+            except Exception as e:  # noqa: PERF203
+                exc = e
+
+        # Every backend raised. Log loudly the first time so a silent
+        # vqa-module no-op (every prompt skipped because frames_at returned
+        # []) is debuggable from the job log instead of post-hoc parquet
+        # inspection. Subsequent failures stay quiet.
+        with self._lock:
+            already_warned = getattr(self, "_warned_decode_fail", False)
+            if not already_warned:
+                self._warned_decode_fail = True
+        if not already_warned:
+            logger.warning(
+                "VideoFrameProvider._decode failed for episode=%s camera=%s video_path=%s backends=%s: %s",
+                episode_index,
+                camera_key,
+                video_path,
+                chain,
+                exc,
+                exc_info=exc,
+            )
+        return []
+
+
+def make_frame_provider(
+    root: Path, camera_key: str | None = None, video_backend: str | None = None
+) -> FrameProvider:
+    """Build a :class:`VideoFrameProvider` if videos are present, else null."""
+    try:
+        provider = VideoFrameProvider(root=root, camera_key=camera_key, video_backend=video_backend)
+    except Exception:
+        return null_provider()
+    if provider.camera_key is None:
+        return null_provider()
+    return provider
+
+
+def _decode_frames_ffmpeg(video_path: Path, timestamps: list[float]) -> list[Any]:
+    """Decode the frames nearest to ``timestamps`` via the ffmpeg CLI.
+
+    Runs one ``ffmpeg`` process per timestamp, seeking with ``-ss`` and
+    piping a single PNG to stdout. Unlike the in-process decoders this
+    survives a hostile container: a full ffmpeg build decodes AV1 (the codec
+    modern LeRobot datasets use) where torchcodec raises and PyAV can
+    SIGSEGV, and a crash stays isolated to the child process — a non-zero
+    exit is a catchable error, not a segfault of the whole job. Returns one
+    ``(C, H, W)`` uint8 tensor per timestamp.
+    """
+    import io  # noqa: PLC0415
+    import subprocess  # noqa: PLC0415
+
+    import numpy as np  # noqa: PLC0415
+
+    frames: list[Any] = []
+    for ts in timestamps:
+        proc = subprocess.run(
+            [
+                "ffmpeg",
+                "-nostdin",
+                "-loglevel",
+                "error",
+                "-ss",
+                f"{max(ts, 0.0):.3f}",
+                "-i",
+                str(video_path),
+                "-frames:v",
+                "1",
+                "-f",
+                "image2pipe",
+                "-vcodec",
+                "png",
+                "pipe:1",
+            ],
+            capture_output=True,
+            check=True,
+            timeout=120,
+        )
+        if not proc.stdout:
+            raise RuntimeError(f"ffmpeg returned no frame for t={ts:.3f}s of {video_path}")
+        img = PIL.Image.open(io.BytesIO(proc.stdout)).convert("RGB")
+        frames.append(torch.from_numpy(np.asarray(img).copy()).permute(2, 0, 1).contiguous())
+    return frames
+
+
+def _decode_frames_av(video_path: Path, timestamps: list[float]) -> list[Any]:
+    """Decode the frames nearest to ``timestamps`` using PyAV directly.
+
+    lerobot's ``decode_video_frames(backend="pyav")`` routes through
+    ``torchvision.io.VideoReader``, removed in torchvision 0.23+. This helper
+    talks to the ``av`` package directly. Note PyAV can SIGSEGV on AV1
+    streams in some builds — prefer ``_decode_frames_ffmpeg`` as the default
+    fallback; this stays available behind ``video_backend="pyav"``. Returns
+    one ``(C, H, W)`` uint8 tensor per timestamp.
+    """
+    import av  # noqa: PLC0415
+
+    first_ts = min(timestamps)
+    last_ts = max(timestamps)
+    loaded_frames: list[torch.Tensor] = []
+    loaded_ts: list[float] = []
+    with av.open(str(video_path)) as container:
+        stream = container.streams.video[0]
+        # Seek to the keyframe at or before the first requested timestamp.
+        offset = max(int(first_ts / stream.time_base), 0) if stream.time_base else 0
+        container.seek(offset, stream=stream, backward=True, any_frame=False)
+        for idx, frame in enumerate(container.decode(stream)):
+            ts = frame.time
+            if ts is None:
+                ts = float(frame.pts * stream.time_base) if frame.pts is not None else float(idx)
+            loaded_ts.append(ts)
+            loaded_frames.append(
+                torch.from_numpy(frame.to_ndarray(format="rgb24")).permute(2, 0, 1).contiguous()
+            )
+            if ts >= last_ts:
+                break
+    if not loaded_frames:
+        raise RuntimeError(f"PyAV decoded no frames from {video_path}")
+    ts_tensor = torch.tensor(loaded_ts)
+    return [loaded_frames[int(torch.argmin((ts_tensor - q).abs()))] for q in timestamps]
+
+
+def _frame_to_pil(frame: Any) -> Any:
+    """Materialise a decoded frame as a ``PIL.Image`` for the VLM message.
+
+    Frames flow through the provider as ``torch.Tensor`` (``C, H, W`` uint8,
+    as returned by the decode helpers above); PIL is only created here, at
+    the VLM-message boundary, because the chat backends expect PIL images /
+    data URLs. Non-tensor inputs (e.g. test stubs) pass through untouched.
+    """
+    if not isinstance(frame, torch.Tensor):
+        return frame
+    array = frame.detach().cpu()
+    if array.ndim == 3 and array.shape[0] in (1, 3):
+        array = array.permute(1, 2, 0)  # (C, H, W) -> (H, W, C)
+    if array.shape[-1] == 1:
+        array = array.squeeze(-1)
+    return PIL.Image.fromarray(array.to(torch.uint8).numpy())
+
+
+def to_image_blocks(frames: list[Any]) -> list[dict[str, Any]]:
+    """Convert decoded frames to Qwen-VL-compatible image content blocks."""
+    return [{"type": "image", "image": _frame_to_pil(frame)} for frame in frames]
+
+
+def to_video_block(frames: list[Any]) -> list[dict[str, Any]]:
+    """Wrap a list of decoded frames as one Qwen-VL video block.
+
+    Returns ``[]`` when the list is empty, so the caller can splat the result
+    into a content array without a separate emptiness check.
+    """
+    if not frames:
+        return []
+    return [{"type": "video", "video": [_frame_to_pil(frame) for frame in frames]}]
+
+
+def to_video_url_block(url: str | None, fps: float = 2.0) -> list[dict[str, Any]]:
+    """Wrap a video file URL as one ``video_url`` block.
+
+    Used by the ``openai`` backend (transformers serve / vllm serve /
+    ktransformers serve), where the server handles frame sampling.
+    Returns ``[]`` when ``url`` is ``None`` or empty so the caller can splat.
+    """
+    if not url:
+        return []
+    return [{"type": "video_url", "video_url": {"url": url}, "fps": fps}]
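Note on the empty-list contract of `to_video_block` / `to_video_url_block` above: the following is a minimal sketch of why returning `[]` lets a caller splat the result without an emptiness check. It assumes a local copy of `to_video_url_block` and a hypothetical caller named `build_content`, which is not part of this diff.

```python
from __future__ import annotations

from typing import Any


def to_video_url_block(url: str | None, fps: float = 2.0) -> list[dict[str, Any]]:
    # Local copy of the helper from the diff: yields zero or one block.
    if not url:
        return []
    return [{"type": "video_url", "video_url": {"url": url}, "fps": fps}]


def build_content(prompt: str, url: str | None) -> list[dict[str, Any]]:
    # Hypothetical caller (illustration only): the video block, if any,
    # is splatted in front of the text block with no "if url:" branch.
    return [*to_video_url_block(url), {"type": "text", "text": prompt}]


with_video = build_content("Describe the episode.", "file:///tmp/ep_000000.mp4")
text_only = build_content("Describe the episode.", None)
```

With a URL the content array holds the video block followed by the text block; without one it degrades to the text block alone, so the message stays well-formed either way.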
--- a/src/lerobot/annotations/steerable_pipeline/modules/__init__.py
+++ b/src/lerobot/annotations/steerable_pipeline/modules/__init__.py
@@ -0,0 +1,25 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from .general_vqa import GeneralVqaModule
+from .interjections_and_speech import InterjectionsAndSpeechModule
+from .plan_subtasks_memory import PlanSubtasksMemoryModule
+
+__all__ = [
+    "GeneralVqaModule",
+    "InterjectionsAndSpeechModule",
+    "PlanSubtasksMemoryModule",
+]
--- a/src/lerobot/annotations/steerable_pipeline/modules/general_vqa.py
+++ b/src/lerobot/annotations/steerable_pipeline/modules/general_vqa.py
@@ -0,0 +1,238 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""``vqa`` module: general VQA at a timed cadence.
+
+Every ``1/hz`` seconds an emission tick fires; each tick anchors ``K``
+consecutive frames, and every anchored frame gets its own VQA pair. Each
+pair is grounded on that single anchor frame — there is no per-pair frame
+window. For datasets with multiple cameras, every anchored frame produces
+one ``(vqa, user)`` + ``(vqa, assistant)`` pair *per camera*: each pair is
+generated against that camera's frame and stamped with the matching
+``camera`` field on the emitted rows. The resolver disambiguates via
+``camera=...``; recipes that consume VQA do so through one sub-recipe
+per camera (see ``recipes/pi05_hirobot.yaml``).
+
+Within a single (frame, camera) we still emit at most one ``(vqa, user)``
+and one ``(vqa, assistant)`` row, so the resolver contract stays scalar.
+
+Question types covered (per the plan's ``vqa`` table): bbox, keypoint,
+count, attribute, spatial. The assistant's ``content`` is a JSON string
+whose schema depends on the question type. Malformed JSON triggers one
+retry inside :meth:`VlmClient.generate_json`.
+"""
+
+from __future__ import annotations
+
+import json
+import logging
+import random
+from collections.abc import Sequence
+from dataclasses import dataclass, field
+from typing import Any
+
+from ..config import VqaConfig
+from ..frames import FrameProvider, null_provider, to_image_blocks
+from ..prompts import load as load_prompt
+from ..reader import EpisodeRecord
+from ..staging import EpisodeStaging
+from ..validator import classify_vqa_answer
+from ..vlm_client import VlmClient
+
+
+def _emission_anchor_indices(frame_timestamps: Sequence[float], hz: float, k: int) -> list[int]:
+    """Return the relative frame indices to anchor VQA emissions to.
+
+    For each emission tick (every ``1/hz`` seconds), we anchor ``k``
+    consecutive frames starting at the tick. Ticks fall on the nearest
+    available source frame timestamp.
+    """
+    if hz <= 0 or k <= 0 or not frame_timestamps:
+        return []
+    t0 = frame_timestamps[0]
+    t_last = frame_timestamps[-1]
+    period = 1.0 / hz
+    indices: list[int] = []
+    t = t0
+    while t <= t_last + 1e-9:
+        # find the index of the nearest frame to t
+        nearest_i = min(range(len(frame_timestamps)), key=lambda i: abs(frame_timestamps[i] - t))
+        for offset in range(k):
+            j = nearest_i + offset
+            if j >= len(frame_timestamps):
+                break
+            if not indices or indices[-1] != j:
+                indices.append(j)
+        t += period
+    # dedupe while preserving order
+    seen: set[int] = set()
+    deduped: list[int] = []
+    for i in indices:
+        if i in seen:
+            continue
+        seen.add(i)
+        deduped.append(i)
+    return deduped
+
+
+@dataclass
+class GeneralVqaModule:
+    """Emit grounded VQA pairs at a timed cadence."""
+
+    vlm: VlmClient
+    config: VqaConfig
+    seed: int = 1729
+    frame_provider: FrameProvider = field(default_factory=null_provider)
+
+    @property
+    def enabled(self) -> bool:
+        return self.config.enabled
+
+    def run_episode(self, record: EpisodeRecord, staging: EpisodeStaging) -> None:
+        if not record.frame_timestamps:
+            staging.write("vqa", [])
+            return
+        rng = random.Random(f"{self.seed}:{record.episode_index}:vqa")
+        anchor_idx = _emission_anchor_indices(
+            record.frame_timestamps, self.config.vqa_emission_hz, self.config.K
+        )
+        cameras = self._target_cameras()
+        if not cameras:
+            # No camera available — emit nothing rather than producing
+            # untagged rows that would fail validation. Surface a loud one-
+            # time warning so this is never silently a no-op.
+            if not getattr(self, "_warned_no_camera", False):
+                logging.getLogger(__name__).warning(
+                    "vqa module found no cameras on the frame provider — "
+                    "every episode will emit zero VQA rows. Check that the "
+                    "dataset declares observation.images.* features in "
+                    "meta/info.json; passing --vlm.camera_key=<key> at the "
+                    "CLI now also seeds the cameras list as a fallback."
+                )
+                self._warned_no_camera = True
+            staging.write("vqa", [])
+            return
+
+        # Build all messages first (one per (frame, camera)), then issue them
+        # as a single batched generate_json call so the client can fan them
+        # out concurrently.
+        per_call: list[tuple[float, str, str, list[dict[str, Any]]]] = []
+        for idx in anchor_idx:
+            ts = float(record.frame_timestamps[idx])
+            qtype = rng.choice(self.config.question_types)
+            for camera in cameras:
+                messages = self._build_messages(record, qtype, ts, camera)
+                # Skip cameras that decoded to zero frames at this ts: no point
+                # asking the VLM to ground a bbox without an image.
+                if not _has_image_block(messages):
+                    continue
+                per_call.append((ts, camera, qtype, messages))
+
+        if not per_call:
+            staging.write("vqa", [])
+            return
+
+        results = self.vlm.generate_json([m for _, _, _, m in per_call])
+
+        rows: list[dict[str, Any]] = []
+        for (ts, camera, _qtype, _messages), result in zip(per_call, results, strict=True):
+            qa = self._postprocess(result)
+            if qa is None:
+                continue
+            question, answer = qa
+            rows.append(
+                {
+                    "role": "user",
+                    "content": question,
+                    "style": "vqa",
+                    "timestamp": ts,
+                    "camera": camera,
+                    "tool_calls": None,
+                }
+            )
+            rows.append(
+                {
+                    "role": "assistant",
+                    "content": json.dumps(answer, sort_keys=True),
+                    "style": "vqa",
+                    "timestamp": ts,
+                    "camera": camera,
+                    "tool_calls": None,
+                }
+            )
+        staging.write("vqa", rows)
+
+    def _target_cameras(self) -> list[str]:
+        """Return the cameras the ``vqa`` module should iterate per anchored frame.
+
+        Defaults to every camera the provider exposes. Datasets with no
+        cameras (or test/null providers) yield an empty list, which makes
+        ``run_episode`` a no-op.
+
+        When ``config.restrict_to_default_camera`` is set, VQA grounds on
+        only the provider's default camera (the single ``--vlm.camera_key``
+        stream), matching the plan / interjection modules so the whole
+        pipeline focuses on one view.
+        """
+        all_cameras = list(getattr(self.frame_provider, "camera_keys", []) or [])
+        if getattr(self.config, "restrict_to_default_camera", False):
+            default = getattr(self.frame_provider, "camera_key", None)
+            # Honour the default camera even when the provider does not list
+            # it in ``camera_keys``; membership does not change the result.
+            if default:
+                return [default]
+        return all_cameras
+
+    def _build_messages(
+        self,
+        record: EpisodeRecord,
+        question_type: str,
+        frame_timestamp: float,
+        camera_key: str,
+    ) -> list[dict[str, Any]]:
+        prompt = load_prompt("module_3_vqa").format(
+            episode_task=record.episode_task,
+            question_type=question_type,
+        )
+        images = self.frame_provider.frames_at(record, [frame_timestamp], camera_key=camera_key)
+        content = [*to_image_blocks(images), {"type": "text", "text": prompt}]
+        return [{"role": "user", "content": content}]
+
+    def _postprocess(self, result: Any) -> tuple[str, dict[str, Any]] | None:
+        if not isinstance(result, dict):
+            return None
+        question = result.get("question")
+        answer = result.get("answer")
+        if not isinstance(question, str) or not question.strip():
+            return None
+        if not isinstance(answer, dict):
+            return None
+        # The validator will enforce shape; here we just sanity-check that the
+        # answer matches *some* known shape so we can drop garbage early.
+        if classify_vqa_answer(answer) is None:
+            return None
+        return question.strip(), answer
+
+
+def _has_image_block(messages: list[dict[str, Any]]) -> bool:
+    """Return True if any user content block is a populated image block."""
+    for msg in messages:
+        content = msg.get("content")
+        if not isinstance(content, list):
+            continue
+        for block in content:
+            if isinstance(block, dict) and block.get("type") == "image":
+                return True
+    return False
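Note on the emission cadence described in the `general_vqa.py` module docstring: the sketch below restates the tick-and-anchor logic as a standalone function so the behaviour can be checked on a toy episode. It is a compact re-statement under the name `emission_anchor_indices` (an illustrative name, not the function in the diff), and the 10 fps timestamps are made up for the example.

```python
from collections.abc import Sequence


def emission_anchor_indices(frame_timestamps: Sequence[float], hz: float, k: int) -> list[int]:
    # One tick every 1/hz seconds; each tick anchors up to k consecutive
    # frames starting at the frame nearest to the tick time.
    if hz <= 0 or k <= 0 or not frame_timestamps:
        return []
    period = 1.0 / hz
    t = frame_timestamps[0]
    t_last = frame_timestamps[-1]
    seen: set[int] = set()
    out: list[int] = []
    while t <= t_last + 1e-9:
        nearest = min(range(len(frame_timestamps)), key=lambda i: abs(frame_timestamps[i] - t))
        # Stop at the end of the episode instead of running past it.
        for j in range(nearest, min(nearest + k, len(frame_timestamps))):
            if j not in seen:  # overlapping ticks never anchor a frame twice
                seen.add(j)
                out.append(j)
        t += period
    return out


# Toy episode: 11 frames at 10 fps (0.0 s .. 1.0 s), 2 ticks per second, K=2.
timestamps = [i / 10 for i in range(11)]
anchors = emission_anchor_indices(timestamps, hz=2.0, k=2)
print(anchors)  # [0, 1, 5, 6, 10]
```

The last tick lands on the final frame, so it contributes a single anchor rather than two. Each anchored index then yields one question/answer pair per camera in `run_episode`.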
--- a/src/lerobot/annotations/steerable_pipeline/modules/interjections_and_speech.py
+++ b/src/lerobot/annotations/steerable_pipeline/modules/interjections_and_speech.py
@@ -0,0 +1,210 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""``interjections`` module: interjections + paired speech (EVENT styles + speech atoms).
+
+Two sub-passes:
+
+1. At ``t=0``, emit ONLY a speech tool-call atom (acknowledgement of the
+   canonical task). No interjection row — the canonical task is already the
+   user utterance from ``meta/tasks.parquet``.
+
+2. For mid-episode interruptions, emit a co-timestamped pair:
+       {role:user, style:interjection, content:<text>}
+       speech atom (role:assistant, style:None, tool_calls=[say(...)])
+   Both rows go in ``language_events`` at the same timestamp.
+
+The ``plan`` module's :meth:`run_plan_updates` reuses this module's
+interjection timestamps to refresh the ``plan`` row at the same instant.
+"""
+
+from __future__ import annotations
+
+import random
+from collections.abc import Sequence
+from dataclasses import dataclass, field
+from typing import Any
+
+from ..config import InterjectionsConfig
+from ..frames import FrameProvider, null_provider, to_image_blocks
+from ..prompts import load as load_prompt
+from ..reader import EpisodeRecord, reconstruct_subtask_spans, snap_to_frame
+from ..staging import EpisodeStaging
+from ..vlm_client import VlmClient
+from ..writer import speech_atom
+
+
+@dataclass
+class InterjectionsAndSpeechModule:
+    """Generate task-start speech and mid-episode interjection/speech pairs."""
+
+    vlm: VlmClient
+    config: InterjectionsConfig
+    seed: int = 1729
+    frame_provider: FrameProvider = field(default_factory=null_provider)
+
+    @property
+    def enabled(self) -> bool:
+        return self.config.enabled
+
+    def run_episode(self, record: EpisodeRecord, staging: EpisodeStaging) -> None:
+        rows: list[dict[str, Any]] = []
+        if record.frame_timestamps:
+            t0 = float(record.frame_timestamps[0])
+            initial = self._initial_speech(record)
+            if initial:
+                rows.append(speech_atom(t0, initial))
+        # Pull the ``plan`` module's subtask spans for this episode so the
+        # interjection prompt can ground itself in the actual current
+        # subtask at each chosen timestamp. The ``plan`` module ran first.
+        episode_end_t = float(record.frame_timestamps[-1]) if record.frame_timestamps else None
+        subtask_spans = reconstruct_subtask_spans(staging.read("plan"), episode_end_t=episode_end_t)
+        rows.extend(self._mid_episode_interjections(record, subtask_spans))
+        staging.write("interjections", rows)
+
+    @staticmethod
+    def _subtask_at(spans: Sequence[dict[str, Any]], t: float) -> str | None:
+        current: str | None = None
+        for span in spans:
+            if float(span["start"]) <= t:
+                current = span.get("text")
+            else:
+                break
+        return current
+
+    def _initial_speech(self, record: EpisodeRecord) -> str | None:
+        prompt = load_prompt("module_2_initial_speech").format(
+            episode_task=record.episode_task,
+        )
+        messages = [{"role": "user", "content": [{"type": "text", "text": prompt}]}]
+        result = self.vlm.generate_json([messages])[0]
+        if isinstance(result, dict) and isinstance(result.get("text"), str):
+            text = result["text"].strip()
+            if text:
+                return text
+        return None
+
+    def _mid_episode_interjections(
+        self,
+        record: EpisodeRecord,
+        subtask_spans: Sequence[dict[str, Any]],
+    ) -> list[dict[str, Any]]:
+        """Generate interjections aligned with the actual demo trajectory.
+
+        Teleop data is frozen — the robot already executed every step in
+        the video. A *counterfactual* interjection like "actually skip
+        the wipe" contradicts what then happens in the video, which is
+        what qwen36moe-10/11 surfaced as low-quality interjections.
+
+        Instead, anchor every interjection at a subtask boundary and
+        write it as a natural user request for the *upcoming* subtask.
+        The robot's visible next behavior IS the interjection's effect,
+        so the training signal stays consistent: interjection text →
+        plan refresh → action stream all line up.
+        """
+        if self.config.max_interjections_per_episode <= 0:
+            return []
+        if len(subtask_spans) < 2:
+            # Need at least one transition (subtask 0 → subtask 1).
+            return []
+        # Deterministic per-episode RNG so reruns are stable across SLURM jobs.
+        rng = random.Random(f"{self.seed}:{record.episode_index}:interjection")
+
+        # Boundaries: the start time of every subtask except the first
+        # (which is just t0 and is covered by the initial-task speech atom).
+        boundaries: list[tuple[float, str, str]] = []
+        for i in range(1, len(subtask_spans)):
+            ts = float(subtask_spans[i]["start"])
+            if ts < self.config.interjection_min_t:
+                continue
+            prev_text = (subtask_spans[i - 1].get("text") or "").strip()
+            next_text = (subtask_spans[i].get("text") or "").strip()
+            if not next_text:
+                continue
+            boundaries.append((ts, prev_text, next_text))
+        if not boundaries:
+            return []
+
+        n = min(self.config.max_interjections_per_episode, len(boundaries))
+        chosen = sorted(rng.sample(boundaries, n), key=lambda b: b[0])
+
+        out: list[dict[str, Any]] = []
+        for t, prev_subtask, next_subtask in chosen:
+            t_snap = snap_to_frame(t, record.frame_timestamps)
+            # Window straddles the boundary so the VLM sees the end of the
+            # previous subtask and the start of the next one — same
+            # conditioning the policy will see at training time.
+            window_ts = self._window_timestamps(t_snap, record.frame_timestamps)
+            prompt = load_prompt("module_2_interjection").format(
+                episode_task=record.episode_task,
+                prev_subtask=prev_subtask or "(starting from initial state)",
+                next_subtask=next_subtask,
+                timestamp=t_snap,
+                window_seconds=self.config.interjection_window_seconds,
+            )
+            images = self.frame_provider.frames_at(record, window_ts)
+            content = [*to_image_blocks(images), {"type": "text", "text": prompt}]
+            messages = [{"role": "user", "content": content}]
+            result = self.vlm.generate_json([messages])[0]
+            if not isinstance(result, dict):
+                continue
+            interjection_text = result.get("interjection")
+            speech_text = result.get("speech")
+            if not isinstance(interjection_text, str) or not interjection_text.strip():
+                continue
+            if not isinstance(speech_text, str) or not speech_text.strip():
+                continue
+            out.append(
+                {
+                    "role": "user",
+                    "content": interjection_text.strip(),
+                    "style": "interjection",
+                    "timestamp": t_snap,
+                    "tool_calls": None,
+                }
+            )
+            out.append(speech_atom(t_snap, speech_text.strip()))
+        return out
+
+    def _window_timestamps(self, t_anchor: float, frame_timestamps: Sequence[float]) -> list[float]:
+        """Return a small set of frame timestamps centered on ``t_anchor``.
+
+        The window straddles the subtask boundary the interjection sits
+        on: roughly half the frames cover the end of the previous
+        subtask, half cover the start of the next one. The VLM therefore
+        sees BOTH what just finished AND what's about to start, which is
+        the conditioning we need to write a natural "now please do X"
+        request that matches the visible upcoming behavior.
+        """
+        if not frame_timestamps:
+            return [t_anchor]
+        n = max(1, int(self.config.interjection_window_frames))
+        if n == 1:
+            return [t_anchor]
+        window = float(self.config.interjection_window_seconds)
+        step = window / max(1, n - 1)
+        # Center the window on the anchor so half lands before, half after.
+        start_offset = -window / 2.0
+        targets = [t_anchor + start_offset + step * i for i in range(n)]
+        last_ts = float(frame_timestamps[-1])
+        snapped: list[float] = []
+        seen: set[float] = set()
+        for tgt in targets:
+            clamped = min(last_ts, max(0.0, tgt))
+            t = snap_to_frame(clamped, frame_timestamps)
+            if t not in seen:
+                seen.add(t)
+                snapped.append(t)
+        return snapped or [t_anchor]
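Note on `_window_timestamps` above: the sketch below reproduces the centred-window arithmetic as a free function so the clamping and de-duplication near the episode start are easy to see. The name `window_timestamps` and its explicit `window_seconds` / `n_frames` parameters are illustrative. `snap_to_frame` here is an assumed stand-in that picks the nearest frame timestamp, since the real helper lives in the `reader` module and is not shown in this hunk.

```python
from collections.abc import Sequence


def snap_to_frame(t: float, frame_timestamps: Sequence[float]) -> float:
    # Assumed behaviour: return the frame timestamp closest to t.
    return min(frame_timestamps, key=lambda ft: abs(ft - t))


def window_timestamps(
    t_anchor: float,
    frame_timestamps: Sequence[float],
    window_seconds: float,
    n_frames: int,
) -> list[float]:
    if not frame_timestamps or n_frames <= 1:
        return [t_anchor]
    step = window_seconds / (n_frames - 1)
    # Centre the window on the anchor: half before, half after.
    targets = [t_anchor - window_seconds / 2.0 + step * i for i in range(n_frames)]
    last_ts = float(frame_timestamps[-1])
    snapped: list[float] = []
    for tgt in targets:
        t = snap_to_frame(min(last_ts, max(0.0, tgt)), frame_timestamps)
        if t not in snapped:  # clamped targets can collapse onto one frame
            snapped.append(t)
    return snapped or [t_anchor]


# Toy episode: 21 frames at 4 fps (0.0 s .. 5.0 s), 1 s window, 5 frames.
frames = [i * 0.25 for i in range(21)]
mid_episode = window_timestamps(2.0, frames, window_seconds=1.0, n_frames=5)
near_start = window_timestamps(0.25, frames, window_seconds=1.0, n_frames=5)
print(mid_episode)  # [1.5, 1.75, 2.0, 2.25, 2.5]
print(near_start)   # [0.0, 0.25, 0.5, 0.75]
```

Mid-episode the window straddles the boundary symmetrically. Near the start the negative target is clamped to 0.0 and merges with its neighbour, so the VLM receives four frames instead of five.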
--- a/src/lerobot/annotations/steerable_pipeline/modules/plan_subtasks_memory.py
+++ b/src/lerobot/annotations/steerable_pipeline/modules/plan_subtasks_memory.py
@@ -0,0 +1,892 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""``plan`` module: subtask decomposition + plan + memory (PERSISTENT styles)."""
+
+from __future__ import annotations
+
+import json
+import logging
+from collections.abc import Sequence
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Any
+
+from ..config import PlanConfig
+from ..frames import (
+    FrameProvider,
+    VideoFrameProvider,
+    null_provider,
+    to_image_blocks,
+    to_video_block,
+    to_video_url_block,
+)
+from ..prompts import load as load_prompt
+from ..reader import EpisodeRecord, reconstruct_subtask_spans, snap_to_frame
+from ..staging import EpisodeStaging
+from ..vlm_client import VlmClient
+
+logger = logging.getLogger(__name__)
+
+
+@dataclass
+class PlanSubtasksMemoryModule:
+    """Generate subtask spans, plan, and memory rows.
+
+    All output is persistent (lives in ``language_persistent``):
+
+    - ``subtask`` rows: one per span, stamped at the span's *start* timestamp
+      (snapped to an exact frame).
+    - ``plan`` rows: emitted at ``t=0``; refreshed at every interjection
+      timestamp via :meth:`run_plan_updates` (called by the executor after
+      the ``interjections`` module completes).
+    - ``memory`` rows: emitted at each subtask boundary (= subtask start
+      timestamp from the second subtask onward).
+    """
+
+    vlm: VlmClient
+    config: PlanConfig
+    frame_provider: FrameProvider = field(default_factory=null_provider)
+
+    @property
+    def enabled(self) -> bool:
+        return self.config.enabled
+
+    def run_episode(self, record: EpisodeRecord, staging: EpisodeStaging) -> None:
+        rows: list[dict[str, Any]] = []
+        # Resolve the task that drives every other ``plan``-module prompt.
+        # May be the canonical ``record.episode_task`` (default), or a fresh
+        # description derived from the video when the canonical task is
+        # empty / placeholder / forced-off (see PlanConfig.derive_task_*).
+        effective_task = self._resolve_effective_task(record)
+        # ``task_aug`` rows at t=0 (role=user), one per rephrasing — the
+        # message renderer rotates ``${task}`` deterministically through
+        # them so the policy sees diverse phrasings during training.
+        # Two paths:
+        #   * ``task_aug_axes.enabled=True`` — structured 5-axis taxonomy
+        #     (synonym / omit_arm / omit_orientation / omit_grasp_method
+        #     / combined). Replaces the free-form rephrasings flow.
+        #   * Otherwise — free-form ``n_task_rephrasings`` (original).
+        t0 = float(record.frame_timestamps[0]) if record.frame_timestamps else 0.0
+        axes_cfg = self.config.task_aug_axes
+        variants: list[str] | None = None
+        if axes_cfg.enabled and effective_task:
+            variants = self._generate_task_aug_by_axes(effective_task, axes_cfg)
+        elif self.config.n_task_rephrasings > 0 and effective_task:
+            variants = self._generate_task_rephrasings(effective_task, n=self.config.n_task_rephrasings)
+        if variants is not None:
+            # Always include the effective task itself as the first variant
+            # so the rotation is guaranteed to cover the source-of-truth
+            # phrasing, not just synthetic alternatives.
+            seen: set[str] = set()
+            for phrasing in [effective_task, *variants]:
+                key = phrasing.strip()
+                if not key or key in seen:
+                    continue
+                seen.add(key)
+                rows.append(
+                    {
+                        "role": "user",
+                        "content": key,
+                        "style": "task_aug",
+                        "timestamp": t0,
+                        "tool_calls": None,
+                    }
+                )
+
+        subtask_spans = self._generate_subtasks(record, task=effective_task)
+
+        # ----------------------------------------------------------------
+        # Phase 1a: structured per-subtask action records (additive)
+        # ----------------------------------------------------------------
+        # When enabled, for every subtask span we ask the VLM for a typed
+        # ActionRecord (verb / object / arm / grasp_type / destination /
+        # mistake) and emit it as a separate ``style="action_record"``
+        # row for downstream use. This is purely additive — it never
+        # touches the VLM's subtask text (reconstructing subtask text
+        # from these fields was too easy to hallucinate on tasks that
+        # don't fit the manipulation schema).
+        records_cfg = self.config.action_records
+        action_records: list[dict[str, Any] | None] = [None] * len(subtask_spans)
+        if records_cfg.enabled and subtask_spans:
+            for i, span in enumerate(subtask_spans):
+                rec = self._extract_action_record(record, span, effective_task)
+                if rec is not None:
+                    action_records[i] = rec
+
+        # subtask rows
+        for i, span in enumerate(subtask_spans):
+            rows.append(
+                {
+                    "role": "assistant",
+                    "content": span["text"],
+                    "style": "subtask",
+                    "timestamp": snap_to_frame(span["start"], record.frame_timestamps),
+                    "tool_calls": None,
+                }
+            )
+            if records_cfg.enabled and records_cfg.emit_record_row and action_records[i] is not None:
+                rows.append(
+                    {
+                        "role": "assistant",
+                        "content": json.dumps(action_records[i], sort_keys=True),
+                        "style": "action_record",
+                        "timestamp": snap_to_frame(span["start"], record.frame_timestamps),
+                        "tool_calls": None,
+                    }
+                )
+        # Plan rows at every subtask boundary — including t=0 (start of
+        # the first subtask). Because the plan is just a numbered list
+        # of *still-todo* subtasks, re-emitting at each boundary makes
+        # the active plan shrink as work progresses: at frame t the
+        # rendered ``${plan}`` is the most recent emission, which
+        # contains exactly the subtasks that started at or after the
+        # current span. Saves the runtime from having to derive
+        # "what's still left" at inference time.
+        if self.config.emit_plan:
+            for span in subtask_spans:
+                boundary_t = snap_to_frame(span["start"], record.frame_timestamps)
+                plan_text = self._generate_plan(
+                    record, subtask_spans, refresh_t=boundary_t, task=effective_task
+                )
+                if plan_text is not None:
+                    rows.append(
+                        {
+                            "role": "assistant",
+                            "content": plan_text,
+                            "style": "plan",
+                            "timestamp": float(boundary_t),
+                            "tool_calls": None,
+                        }
+                    )
+        # memory rows at every subtask boundary except the very first start
+        prior_memory = ""
+        for i, span in enumerate(subtask_spans[1:], start=1):
+            completed = subtask_spans[i - 1]["text"]
+            remaining = [s["text"] for s in subtask_spans[i:]]
+            mem_text = self._generate_memory(record, prior_memory, completed, remaining, task=effective_task)
+            if mem_text:
+                ts = snap_to_frame(span["start"], record.frame_timestamps)
+                rows.append(
+                    {
+                        "role": "assistant",
+                        "content": mem_text,
+                        "style": "memory",
+                        "timestamp": ts,
+                        "tool_calls": None,
+                    }
+                )
+                prior_memory = mem_text
+        staging.write("plan", rows)
+
+    # ------------------------------------------------------------------
+    # Task derivation + rephrasings
+    # ------------------------------------------------------------------
+
+    _PLACEHOLDER_TASKS: frozenset[str] = frozenset(
+        {
+            "debug",
+            "test",
+            "tbd",
+            "todo",
+            "n/a",
+            "na",
+            "untitled",
+            "unnamed",
+            "default",
+            "placeholder",
+        }
+    )
+
+    def _resolve_effective_task(self, record: EpisodeRecord) -> str:
+        """Decide which task string drives the ``plan`` module for this episode.
+
+        Returns the user-supplied ``record.episode_task`` unless
+        ``derive_task_from_video`` says otherwise (see config docstring).
+        Falls back gracefully to the canonical task if video derivation
+        fails.
+        """
+        canonical = (record.episode_task or "").strip()
+        mode = (self.config.derive_task_from_video or "off").strip().lower()
+        if mode == "always":
+            derived = self._derive_task_from_video(record)
+            return derived or canonical
+        if mode == "if_short" and self._task_seems_bad(canonical):
+            derived = self._derive_task_from_video(record)
+            if derived:
+                return derived
+        return canonical
+
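+    def _task_seems_bad(self, task: str) -> bool:
+        """Return True when ``task`` is unusable as a prompt driver.
+
+        A task is considered bad when it is empty, has fewer than
+        ``derive_task_min_words`` words, or is a known placeholder.
+        """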
+        if not task:
+            return True
+        if len(task.split()) < int(self.config.derive_task_min_words):
+            return True
+        return task.lower() in self._PLACEHOLDER_TASKS
+
+    # ------------------------------------------------------------------
+    # VLM call helpers (factored out: every ``plan``-module prompt below follows
+    # the same "build messages → single VLM call → pull a named field"
+    # shape, only differing in field name + post-processing).
+    # ------------------------------------------------------------------
+
+    def _vlm_field(self, messages: list[dict[str, Any]], field: str) -> Any:
+        """Run a single VLM call and return ``result[field]`` or ``None``.
+
+        Centralizes the ``vlm.generate_json([m])[0]`` + ``isinstance(dict)``
+        dance every prompt-call site needs.
+        """
+        result = self.vlm.generate_json([messages])[0]
+        if isinstance(result, dict):
+            return result.get(field)
+        return None
+
+    @staticmethod
+    def _text_message(text: str) -> list[dict[str, Any]]:
+        """One-shot text-only user message wrapped for ``generate_json``."""
+        return [{"role": "user", "content": [{"type": "text", "text": text}]}]
+
+    def _video_message(
+        self,
+        record: EpisodeRecord,
+        prompt: str,
+        window: tuple[float, float] | None = None,
+    ) -> list[dict[str, Any]]:
+        """User message combining the (optionally windowed) video block with ``prompt``."""
+        content = [*self._episode_video_block(record, window=window), {"type": "text", "text": prompt}]
+        return [{"role": "user", "content": content}]
+
+    def _derive_task_from_video(self, record: EpisodeRecord) -> str | None:
+        """Ask the VLM "what is this video about" with no task hint at all."""
+        text = self._vlm_field(self._video_message(record, load_prompt("module_1_video_task")), "task")
+        return text.strip() if isinstance(text, str) and text.strip() else None
+
+    def _generate_task_rephrasings(self, base_task: str, *, n: int) -> list[str]:
+        """Generate ``n`` text-only paraphrases of ``base_task``."""
+        if n <= 0 or not base_task:
+            return []
+        prompt = load_prompt("module_1_task_rephrasings").format(base_task=base_task, n=n)
+        raw = self._vlm_field(self._text_message(prompt), "rephrasings")
+        if not isinstance(raw, list):
+            return []
+        out = [item.strip().strip('"').strip("'") for item in raw if isinstance(item, str)]
+        return [s for s in out if s][:n]
+
+    # ------------------------------------------------------------------
+    # Phase 1a + 1b: structured per-subtask action records
+    # ------------------------------------------------------------------
+
+    def _extract_action_record(
+        self,
+        record: EpisodeRecord,
+        span: dict[str, Any],
+        episode_task: str,
+    ) -> dict[str, Any] | None:
+        """Ask the VLM to extract a typed ``ActionRecord`` from a subtask span.
+
+        Sends ``frames_per_subtask`` frames uniformly sampled from
+        ``[span.start, span.end]`` plus the canonical subtask text. The
+        VLM is constrained to the verb + grasp vocabularies from the
+        config. A missing or out-of-vocabulary ``verb``, or a missing
+        ``object``, rejects the whole record; an out-of-vocabulary
+        ``grasp_type`` or ``arm`` is set to ``None`` (the validator
+        catches structural problems pre-write).
+
+        Returns ``None`` when no frames are available or the VLM returns
+        something unrecognizable. The free-form ``subtask`` row is
+        emitted either way; only the ``action_record`` row is skipped.
+        """
+        cfg = self.config.action_records
+        start_t = float(span.get("start", 0.0))
+        end_t = float(span.get("end", start_t))
+        duration = max(0.0, end_t - start_t)
+
+        # Uniform timestamps within the span; fall back to a single
+        # center frame for very short spans.
+        n = max(1, int(cfg.frames_per_subtask))
+        if n == 1 or duration <= 0.0:
+            timestamps = [0.5 * (start_t + end_t)]
+        else:
+            step = duration / (n - 1)
+            timestamps = [start_t + i * step for i in range(n)]
+        frames = self.frame_provider.frames_at(record, timestamps)
+        if not frames:
+            logger.debug(
+                "action_record: no frames at span %.2f-%.2f for ep %s; skipping",
+                start_t,
+                end_t,
+                record.episode_index,
+            )
+            return None
+
+        prompt = load_prompt("module_1_action_record").format(
+            episode_task=episode_task,
+            subtask_text=span.get("text", ""),
+            start_time=start_t,
+            end_time=end_t,
+            duration=duration,
+            n_frames=len(frames),
+            verb_vocabulary=", ".join(cfg.verb_vocabulary),
+            grasp_vocabulary=" | ".join(f'"{g}"' for g in cfg.grasp_vocabulary),
+        )
+        message = [
+            {
+                "role": "user",
+                "content": [*to_image_blocks(frames), {"type": "text", "text": prompt}],
+            }
+        ]
+        result = self.vlm.generate_json([message])[0]
+        if not isinstance(result, dict):
+            return None
+
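+        # Light validation + normalisation. Verb and object are required;
+        # everything else may be null. Verb / grasp_type are clamped to
+        # the vocabularies (out-of-vocab verb → reject, grasp_type → null).
+        # Non-string values are treated as missing instead of raising
+        # ``AttributeError`` on ``.strip()``.
+        raw_verb = result.get("verb")
+        verb = raw_verb.strip().lower() if isinstance(raw_verb, str) else ""
+        if not verb or verb not in {v.lower() for v in cfg.verb_vocabulary}:
+            return None
+        raw_obj = result.get("object")
+        obj = raw_obj.strip() if isinstance(raw_obj, str) else ""
+        if not obj:
+            return None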
+        grasp = result.get("grasp_type")
+        if isinstance(grasp, str):
+            grasp = grasp.strip().lower()
+            if grasp not in {g.lower() for g in cfg.grasp_vocabulary}:
+                grasp = None
+        else:
+            grasp = None
+        arm = result.get("arm")
+        if isinstance(arm, str):
+            arm = arm.strip().lower()
+            if arm not in {"left", "right", "both"}:
+                arm = None
+        else:
+            arm = None
+        destination = result.get("destination")
+        destination = destination.strip() if isinstance(destination, str) and destination.strip() else None
+        mistake = result.get("mistake")
+        mistake = mistake.strip() if isinstance(mistake, str) and mistake.strip() else None
+
+        return {
+            "verb": verb,
+            "object": obj,
+            "arm": arm,
+            "grasp_type": grasp,
+            "destination": destination,
+            "mistake": mistake,
+        }
+
+    # ------------------------------------------------------------------
+    # Structured 5-axis task augmentation (EgoMimic-style taxonomy)
+    # ------------------------------------------------------------------
+
+    def _generate_task_aug_by_axes(self, base_task: str, axes_cfg: Any) -> list[str]:
+        """One VLM call → variants along the 5-axis taxonomy.
+
+        Variants from all axes are flattened into a single list (the
+        downstream pipeline doesn't need to know about the per-axis
+        bucketing — every variant becomes a ``task_aug`` row). Order
+        is preserved for reproducibility: synonym_paraphrase first,
+        then omit_arm, then omit_orientation, then omit_grasp_method,
+        then combined_omissions.
+        """
+        if not base_task:
+            return []
+        prompt = load_prompt("module_1_task_aug_axes").format(
+            base_task=base_task,
+            n_synonym=axes_cfg.synonym_paraphrase,
+            n_omit_arm=axes_cfg.omit_arm,
+            n_omit_orientation=axes_cfg.omit_orientation,
+            n_omit_grasp_method=axes_cfg.omit_grasp_method,
+            n_combined=axes_cfg.combined_omissions,
+        )
+        result = self.vlm.generate_json([self._text_message(prompt)])[0]
+        if not isinstance(result, dict):
+            return []
+        ordered_axes = (
+            "synonym_paraphrase",
+            "omit_arm",
+            "omit_orientation",
+            "omit_grasp_method",
+            "combined_omissions",
+        )
+        flat: list[str] = []
+        seen: set[str] = set()
+        for axis in ordered_axes:
+            entries = result.get(axis)
+            if not isinstance(entries, list):
+                continue
+            for item in entries:
+                if not isinstance(item, str):
+                    continue
+                key = item.strip().strip('"').strip("'")
+                if not key or key in seen:
+                    continue
+                seen.add(key)
+                flat.append(key)
+        return flat
+
+    def _episode_video_block(
+        self, record: EpisodeRecord, window: tuple[float, float] | None = None
+    ) -> list[dict[str, Any]]:
+        """Video block for the segmentation / describe prompts.
+
+        Always returns a block that actually carries the video. When
+        ``use_video_url`` is set we try the server-side ``video_url``
+        path first, but if clip extraction fails we FALL BACK to
+        decoding + embedding frames rather than returning an empty
+        block — an empty block would leave the VLM with no visual
+        grounding at all and it would hallucinate subtasks purely from
+        the task text.
+
+        When ``window=(w0, w1)`` is given (windowed subtask generation,
+        ``subtask_window_seconds > 0``), embed frames sampled at the FIXED
+        ``frames_per_second`` rate within ``[w0, w1]`` — constant temporal
+        density regardless of episode length, so long episodes are split
+        into windows rather than subsampled to a sparse 32-frame whole-
+        episode view. The ``video_url`` path is skipped for windows (it is
+        a whole-episode clip). ``max_video_frames`` still caps each window
+        as a context-budget safety net.
+        """
+        if not record.frame_timestamps:
+            return []
+        if window is not None:
+            w0, w1 = float(window[0]), float(window[1])
+            dur = max(0.0, w1 - w0)
+            n = max(1, int(round(dur * self.config.frames_per_second)) + 1)
+            n = min(n, self.config.max_video_frames)
+            if n <= 1 or dur <= 0.0:
+                timestamps = [0.5 * (w0 + w1)]
+            else:
+                step = dur / (n - 1)
+                timestamps = [w0 + i * step for i in range(n)]
+            return to_video_block(self.frame_provider.frames_at(record, timestamps))
+        if self.config.use_video_url and isinstance(self.frame_provider, VideoFrameProvider):
+            cache_dir = Path(self.frame_provider.root) / ".annotate_staging" / ".video_clips"
+            clip = self.frame_provider.episode_clip_path(record, cache_dir)
+            if clip is not None:
+                return to_video_url_block(f"file://{clip}", fps=self.config.use_video_url_fps)
+            logger.warning(
+                "episode %d: video_url clip extraction failed — falling back to "
+                "embedded frames so the VLM still sees the demonstration",
+                record.episode_index,
+            )
+        episode_duration = record.frame_timestamps[-1] - record.frame_timestamps[0]
+        target_count = max(1, int(round(episode_duration * self.config.frames_per_second)))
+        target_count = min(target_count, self.config.max_video_frames)
+        video_frames = self.frame_provider.video_for_episode(record, target_count)
+        return to_video_block(video_frames)
+
+    def run_plan_updates(
+        self,
+        record: EpisodeRecord,
+        staging: EpisodeStaging,
+        interjection_times: Sequence[float],
+        interjection_texts: Sequence[str] | None = None,
+    ) -> None:
+        """Append additional ``plan`` rows at every interjection timestamp.
+
+        These rows supplement the per-boundary ``plan`` rows written by
+        :meth:`run_episode`; a timestamp that already carries a ``plan``
+        row is skipped. At inference, subtask generation runs ~1 Hz,
+        but plan re-emission is event-driven. The interjection's own
+        text is forwarded into the prompt so the refreshed plan can
+        reflect the user's correction, not just the fact that an
+        interjection happened. No-op when ``config.emit_plan`` is false.
+        """
+        if not self.config.emit_plan:
+            return
+        existing = staging.read("plan")
+        # Pass the episode's last frame timestamp so the final subtask
+        # span is closed (otherwise its ``end`` equals its ``start``,
+        # zero duration, and the "current subtask at refresh_t" lookup
+        # in ``_generate_plan`` misses any refresh that lands inside it).
+        episode_end_t = float(record.frame_timestamps[-1]) if record.frame_timestamps else None
+        spans = reconstruct_subtask_spans(existing, episode_end_t=episode_end_t)
+        already_planned: set[float] = {float(r["timestamp"]) for r in existing if r.get("style") == "plan"}
+        new_rows = list(existing)
+
+        texts: list[str | None] = (
+            [None] * len(interjection_times)
+            if interjection_texts is None
+            else [str(t) if t else None for t in interjection_texts]
+        )
+        for raw_t, inter_text in zip(interjection_times, texts, strict=True):
+            t = snap_to_frame(raw_t, record.frame_timestamps)
+            if t in already_planned:
+                continue
+            already_planned.add(t)
+            plan_text = self._generate_plan(record, spans, refresh_t=t, interjection=inter_text)
+            if plan_text is not None:
+                new_rows.append(
+                    {
+                        "role": "assistant",
+                        "content": plan_text,
+                        "style": "plan",
+                        "timestamp": t,
+                        "tool_calls": None,
+                    }
+                )
+        staging.write("plan", new_rows)
+
+    def _generate_subtasks(self, record: EpisodeRecord, *, task: str | None = None) -> list[dict[str, Any]]:
+        """Generate subtask spans, optionally via a multi-call quality chain.
+
+        Single call (default): watch video → emit subtask JSON.
+
+        Multi-call (opt-in, higher quality, more VLM calls):
+          1. ``subtask_describe_first`` — a grounding pass that narrates
+             ONLY what is visible (no JSON commitment to subtasks yet);
+             its description is injected into the segmentation prompt so
+             the model segments its own grounded observations instead of
+             pattern-matching the task text.
+          2. segmentation — emit subtask JSON (as before).
+        """
+        if record.row_count == 0 or not record.frame_timestamps:
+            return []
+        episode_duration = record.frame_timestamps[-1] - record.frame_timestamps[0]
+        effective_task = task if task is not None else record.episode_task
+
+        # ---- Windowed path (constant temporal density) ---------------
+        # When ``subtask_window_seconds > 0`` and the episode is longer
+        # than one window, process the episode in fixed-length windows so
+        # the VLM always sees ``frames_per_second`` density (instead of a
+        # sparse 32-frame whole-episode view). Each window runs the full
+        # describe -> segment chain on its own frames; results are merged +
+        # stitched into a contiguous whole-episode cover.
+        window_s = float(getattr(self.config, "subtask_window_seconds", 0.0) or 0.0)
+        if window_s > 0.0 and episode_duration > window_s:
+            return self._generate_subtasks_windowed(record, effective_task, window_s)
+
+        # ---- Pass 1 (optional): grounding description ----------------
+        observation_block = ""
+        if getattr(self.config, "subtask_describe_first", False):
+            description = self._describe_episode(record, effective_task)
+            if description:
+                observation_block = (
+                    "You watched this video and described, chronologically, "
+                    "ONLY what the robot actually does:\n"
+                    f'"""{description}"""\n\n'
+                    "Segment THAT grounded description (cross-checked against "
+                    "the video) into atomic subtasks. Do not introduce any "
+                    "action that is not in your description above.\n\n"
+                )
+
+        # ---- Pass 2: segmentation ------------------------------------
+        prompt = load_prompt("module_1_subtasks").format(
+            episode_task=effective_task,
+            min_subtask_seconds=self.config.min_subtask_seconds,
+            max_steps=self.config.plan_max_steps,
+            episode_duration=f"{episode_duration:.3f}",
+            observation_block=observation_block,
+        )
+        spans = self._vlm_field(self._video_message(record, prompt), "subtasks")
+        cleaned = self._clean_spans(spans, record)
+        if not cleaned:
+            return []
+
+        # ---- Full-episode coverage stitch ----------------------------
+        # The VLM can leave the first subtask starting after t0 or leave
+        # gaps between spans, so the subtask timeline no longer tiles the
+        # whole episode and frames fall through with no active subtask.
+        # Always stitch the surviving spans into a contiguous cover of
+        # [t0, t_last] — there is no scenario where a sparse, gap-ridden
+        # subtask timeline is desirable for conditioning.
+        cleaned = self._stitch_full_coverage(cleaned, record)
+
+        return cleaned
+
+    def _generate_subtasks_windowed(
+        self, record: EpisodeRecord, task: str, window_s: float
+    ) -> list[dict[str, Any]]:
+        """Subtask generation in fixed-length windows at constant fps.
+
+        Splits ``[t0, t_last]`` into consecutive windows of ``window_s``
+        seconds, runs the describe -> segment chain on each window's own
+        frames (sampled at ``frames_per_second``), offsets
+        each window's spans back to absolute episode time, then merges +
+        stitches into a contiguous whole-episode cover.
+        """
+        t0 = float(record.frame_timestamps[0])
+        t_last = float(record.frame_timestamps[-1])
+        all_spans: list[dict[str, Any]] = []
+        w0 = t0
+        n_windows = 0
+        while w0 < t_last - 1e-6:
+            w1 = min(w0 + window_s, t_last)
+            all_spans.extend(self._subtasks_for_window(record, task, w0, w1))
+            n_windows += 1
+            w0 = w1
+        logger.info(
+            "episode %d: windowed subtask gen over %d window(s) of %.1fs -> %d raw spans",
+            record.episode_index,
+            n_windows,
+            window_s,
+            len(all_spans),
+        )
+        # Merge across windows: clamp to the absolute episode, sort, and
+        # frame-snap to distinct starts (handles any boundary collisions).
+        cleaned = self._clean_spans(all_spans, record)
+        if not cleaned:
+            return []
+        return self._stitch_full_coverage(cleaned, record)
+
+    def _subtasks_for_window(
+        self, record: EpisodeRecord, task: str, w0: float, w1: float
+    ) -> list[dict[str, Any]]:
+        """Run describe -> segment on one ``[w0, w1]`` window.
+
+        The model works in window-RELATIVE time ``[0, L]`` (it perceives
+        the window as a clip starting at 0); spans are offset back to
+        absolute ``[w0, w1]`` before returning.
+        """
+        window = (w0, w1)
+        win_len = max(0.0, w1 - w0)
+
+        observation_block = ""
+        if getattr(self.config, "subtask_describe_first", False):
+            description = self._describe_episode(record, task, window=window)
+            if description:
+                observation_block = (
+                    "You watched this video clip and described, chronologically, "
+                    "ONLY what the robot actually does:\n"
+                    f'"""{description}"""\n\n'
+                    "Segment THAT grounded description (cross-checked against "
+                    "the clip) into atomic subtasks. Do not introduce any "
+                    "action that is not in your description above.\n\n"
+                )
+
+        prompt = load_prompt("module_1_subtasks").format(
+            episode_task=task,
+            min_subtask_seconds=self.config.min_subtask_seconds,
+            max_steps=self.config.plan_max_steps,
+            episode_duration=f"{win_len:.3f}",
+            observation_block=observation_block,
+        )
+        spans = self._vlm_field(self._video_message(record, prompt, window=window), "subtasks")
+        # Window-relative clamp; no frame-snap dedupe yet (done on the
+        # merged absolute set).
+        cleaned = self._clean_spans(spans, record, bounds=(0.0, win_len), dedupe=False)
+        if not cleaned:
+            return []
+
+        # Offset window-relative spans back to absolute episode time.
+        for s in cleaned:
+            s["start"] = w0 + float(s["start"])
+            s["end"] = w0 + float(s["end"])
+        return cleaned
+
+    def _stitch_full_coverage(
+        self, spans: list[dict[str, Any]], record: EpisodeRecord
+    ) -> list[dict[str, Any]]:
+        """Make subtask spans tile the full episode with no gaps.
+
+        * The first subtask starts at the episode's first frame ``t0``
+          (any idle / approach before the first labelled action is folded
+          into it), so every early frame has an active subtask.
+        * Each subtask's ``end`` is snapped to the next subtask's
+          ``start`` (gaps between spans are closed), and the final
+          subtask's ``end`` extends to the last frame ``t_last``.
+
+        Starts are otherwise left as the (already frame-snapped, distinct)
+        values the VLM produced — only the FIRST start is pulled
+        back to ``t0``, which can't collide with a later span because it
+        was already the earliest. Purely deterministic; runs after the
+        VLM passes.
+        """
+        if not spans or not record.frame_timestamps:
+            return spans
+        t0 = float(record.frame_timestamps[0])
+        t_last = float(record.frame_timestamps[-1])
+        spans = sorted(spans, key=lambda s: float(s["start"]))
+        spans[0]["start"] = t0
+        for i in range(len(spans) - 1):
+            spans[i]["end"] = float(spans[i + 1]["start"])
+        spans[-1]["end"] = t_last
+        for s in spans:
+            if float(s["end"]) < float(s["start"]):
+                s["end"] = float(s["start"])
+        return spans
+
+    def _clean_spans(
+        self,
+        spans: Any,
+        record: EpisodeRecord,
+        bounds: tuple[float, float] | None = None,
+        dedupe: bool = True,
+    ) -> list[dict[str, Any]]:
+        """Clamp / sort / (optionally) dedupe raw VLM subtask spans into valid rows.
+
+        ``bounds`` overrides the clamp range — pass the window's
+        ``(w_lo, w_hi)`` when cleaning window-relative spans, or leave
+        ``None`` to clamp to the whole episode ``[t0, t_last]``.
+        ``dedupe`` runs the frame-snap distinct-start step; skip it for
+        window-relative spans (frame snapping is done once on the merged,
+        absolute-time set).
+        """
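+        # ``spans`` comes straight from the VLM's JSON, so it may be
+        # missing or not a list at all; treat anything else as empty.
+        if not isinstance(spans, (list, tuple)) or not spans:
+            return []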
+        if bounds is not None:
+            lo, hi = float(bounds[0]), float(bounds[1])
+        else:
+            lo = record.frame_timestamps[0]
+            hi = record.frame_timestamps[-1]
+        cleaned: list[dict[str, Any]] = []
+        for span in spans:
+            try:
+                start = float(span["start"])
+                end = float(span["end"])
+                text = str(span["text"]).strip()
+            except (KeyError, ValueError, TypeError):
+                continue
+            start = max(lo, min(start, hi))
+            end = max(lo, min(end, hi))
+            if end < start:
+                start, end = end, start
+            if not text:
+                continue
+            cleaned.append({"text": text, "start": start, "end": end})
+        cleaned.sort(key=lambda s: s["start"])
+        if dedupe:
+            return self._dedupe_starts_to_distinct_frames(cleaned, record)
+        return cleaned
+
+    def _describe_episode(
+        self, record: EpisodeRecord, task: str, window: tuple[float, float] | None = None
+    ) -> str:
+        """Grounding pass: free-form chronological description of the (windowed) video."""
+        prompt = load_prompt("module_1_subtask_describe").format(episode_task=task)
+        text = self._vlm_field(self._video_message(record, prompt, window=window), "description")
+        return text.strip() if isinstance(text, str) and text.strip() else ""
+
+    @staticmethod
+    def _dedupe_starts_to_distinct_frames(
+        spans: list[dict[str, Any]], record: EpisodeRecord
+    ) -> list[dict[str, Any]]:
+        """Bump same-frame subtask starts onto distinct frames.
+
+        Two consecutive VLM spans whose ``start`` rounds to the same
+        source frame (after :func:`snap_to_frame`) would otherwise emit
+        two ``style=subtask`` rows at the identical persistent
+        timestamp. The training-time renderer's ``active_at(t,
+        style=subtask)`` resolver can't disambiguate that and raises
+        ``Ambiguous resolver for style='subtask'``.
+
+        Walk the (sorted-by-start) spans, snap each to its frame, and
+        if the snapped frame is already taken push the span onto the
+        next unused frame so both subtasks survive on distinct
+        timestamps. If the episode ends before a free frame is found,
+        the trailing span is dropped with a warning — better than
+        poisoning the render.
+        """
+        if not spans:
+            return spans
+        frames = record.frame_timestamps
+        if not frames:
+            return spans
+        used: set[float] = set()
+        out: list[dict[str, Any]] = []
+        for span in spans:
+            ts = snap_to_frame(span["start"], frames)
+            if ts in used:
+                next_ts = next((f for f in frames if f > ts and f not in used), None)
+                if next_ts is None:
+                    logger.warning(
+                        "episode %d: subtask %r snapped to occupied frame "
+                        "%.3f and no free later frame exists — dropping",
+                        record.episode_index,
+                        span.get("text"),
+                        ts,
+                    )
+                    continue
+                ts = next_ts
+            used.add(ts)
+            new_span = {**span, "start": ts}
+            if float(new_span.get("end", ts)) < ts:
+                new_span["end"] = ts
+            out.append(new_span)
+        return out
+
+    def _generate_plan(
+        self,
+        record: EpisodeRecord,  # noqa: ARG002  (kept for signature stability)
+        subtask_spans: Sequence[dict[str, Any]],
+        *,
+        refresh_t: float | None = None,
+        interjection: str | None = None,  # noqa: ARG002
+        task: str | None = None,  # noqa: ARG002
+    ) -> str | None:
+        """Deterministic plan = numbered list of *still-todo* subtasks.
+
+        Previously this called the VLM with a prompt that asked it to
+        compress the subtasks into a "compact hierarchical plan". That
+        produced longer-than-necessary plans, cost an extra VLM round-trip
+        per episode (plus one per interjection on refresh), and could
+        diverge from the actual subtask sequence the model is going to
+        execute. Replacing it with a deterministic numbered list keeps the
+        plan tightly aligned with the upcoming subtasks and removes the VLM
+        call entirely.
+
+        Layout — short imperative fragments prefixed by "N. ":
+
+            1. <subtask 1>
+            2. <subtask 2>
+            ...
+
+        On a refresh at ``refresh_t`` (called from ``run_plan_updates``
+        on interjection events, and from ``run_episode`` at every subtask
+        boundary), only subtasks whose start is at or after ``refresh_t``
+        are included — the plan shrinks as work progresses, so it always
+        describes what's left.
+        """
+        if not subtask_spans:
+            return None
+        remaining = [
+            s for s in subtask_spans if refresh_t is None or float(s.get("start", 0.0)) >= float(refresh_t)
+        ]
+        if not remaining:
+            # Past the last subtask boundary on a late refresh — nothing
+            # left to plan; emit None so the caller skips the row.
+            return None
+        return "\n".join(f"{i}. {span.get('text', '').strip()}" for i, span in enumerate(remaining, start=1))
+
+    def _generate_memory(
+        self,
+        record: EpisodeRecord,
+        prior_memory: str,
+        completed: str,
+        remaining: Sequence[str],
+        *,
+        task: str | None = None,
+    ) -> str:
+        prompt = load_prompt("module_1_memory").format(
+            episode_task=(task if task is not None else record.episode_task),
+            prior_memory=prior_memory or "(none)",
+            completed_subtask=completed,
+            remaining_subtasks=", ".join(remaining) if remaining else "(none)",
+        )
+        memory = self._vlm_field(self._text_message(prompt), "memory")
+        return memory.strip() if isinstance(memory, str) else ""
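Review note: the frame de-duplication and the deterministic plan rendering in this file can be exercised on plain dicts without a VLM. The sketch below mirrors the logic shown in the diff; the names `snap`, `dedupe_starts`, and `render_plan` are hypothetical stand-ins for illustration, not the module's API, and the dropped-span warning is omitted.

```python
from typing import Optional


def snap(t: float, frames: list) -> float:
    """Nearest frame timestamp (stand-in for snap_to_frame)."""
    return float(min(frames, key=lambda f: abs(f - t))) if frames else float(t)


def dedupe_starts(spans: list, frames: list) -> list:
    """Move spans whose snapped start collides onto the next unused frame."""
    used = set()
    out = []
    for span in sorted(spans, key=lambda s: s["start"]):
        ts = snap(span["start"], frames)
        if ts in used:
            ts = next((f for f in frames if f > ts and f not in used), None)
            if ts is None:
                continue  # no free later frame: the span is dropped
        used.add(ts)
        # Keep end >= start after the start has been moved.
        out.append({**span, "start": ts, "end": max(float(span["end"]), ts)})
    return out


def render_plan(spans: list, refresh_t: Optional[float] = None) -> Optional[str]:
    """Numbered list of the subtasks starting at or after refresh_t."""
    remaining = [s for s in spans if refresh_t is None or s["start"] >= refresh_t]
    if not remaining:
        return None
    return "\n".join(f"{i}. {s['text'].strip()}" for i, s in enumerate(remaining, start=1))


frames = [0.0, 0.1, 0.2, 0.3]
spans = [
    {"text": "pick up cube", "start": 0.02, "end": 0.04},
    {"text": "put cube in box", "start": 0.04, "end": 0.3},
]
deduped = dedupe_starts(spans, frames)
print([s["start"] for s in deduped])
print(render_plan(deduped, refresh_t=0.1))
```

Both raw starts snap to frame `0.0`, so the second span is pushed to `0.1`; a refresh at `0.1` then lists only the subtask that is still ahead.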
--- a/src/lerobot/annotations/steerable_pipeline/prompts/__init__.py
+++ b/src/lerobot/annotations/steerable_pipeline/prompts/__init__.py
@@ -0,0 +1,33 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Prompt templates loaded as plain text.
+
+One file per use site. Templates use ``str.format(**vars)`` substitution; we
+intentionally avoid jinja2 here so the templates remain inspectable in
+plain editors and roundtrip cleanly through ``ruff format``.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+_DIR = Path(__file__).parent
+
+
+def load(name: str) -> str:
+    """Read prompt template ``name.txt`` from the ``prompts/`` directory."""
+    path = _DIR / f"{name}.txt"
+    return path.read_text(encoding="utf-8")
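Review note: because the templates are filled with `str.format`, every literal JSON brace in a prompt file has to be doubled, and format specs such as `:.2f` are applied at substitution time. A minimal sketch with an inline template (a hypothetical string, not one of the shipped `.txt` files):

```python
import json

# Hypothetical inline template; the shipped prompts live in ``prompts/*.txt``.
TEMPLATE = (
    'Episode task: "{episode_task}"\n'
    "Time into episode: {timestamp:.2f}s\n"
    "Output strictly valid JSON:\n"
    '  {{ "memory": "<text>" }}\n'
)

rendered = TEMPLATE.format(episode_task="put cube in box", timestamp=3.14159)
print(rendered)

# The doubled braces collapse to single braces, so the schema line is literal JSON.
schema_line = rendered.splitlines()[-1].strip()
print(json.loads(schema_line))
```

Only the template is parsed for replacement fields, so a task string that itself contains braces is substituted verbatim.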
--- a/src/lerobot/annotations/steerable_pipeline/prompts/module_1_action_record.txt
+++ b/src/lerobot/annotations/steerable_pipeline/prompts/module_1_action_record.txt
@@ -0,0 +1,64 @@
+You are extracting a structured action record from a subtask span of a
+teleoperated robot demonstration. This is Phase 1a of a two-step
+process: you extract a typed record; a deterministic template then
+renders it back to canonical subtask text. Your job is the PERCEPTION
+step — not the language step.
+
+The user originally asked: "{episode_task}"
+The subtask span is:        "{subtask_text}"
+Span time window:           [{start_time:.2f}s, {end_time:.2f}s]
+                            ({duration:.2f}s of robot activity)
+
+You are shown {n_frames} frames sampled uniformly from the subtask
+window. Fill in a structured record describing the action that takes
+place between the first and last frame.
+
+Hard rules:
+- Use ONLY information visible in the frames. Do not infer details from
+  outside the span. Do not extrapolate from the original task wording.
+- Use canonical object names from the original task VERBATIM. Never
+  introduce synonyms: if the task says "cube", the record says "cube",
+  never "block" / "object" / "item".
+- For non-applicable fields, use ``null`` (not "n/a", not "none", not
+  an empty string).
+- For ``verb`` and ``grasp_type``, pick EXACTLY one value from the
+  vocabulary below. Never invent a new one.
+
+Field schema:
+
+  verb (required) — the imperative verb of the action. Vocabulary:
+    {verb_vocabulary}
+
+  object (required) — the manipulated object. Use the canonical noun
+    from the original task above.
+
+  arm — which arm performs the action. One of:
+    "left" | "right" | "both" | null
+    Use ``null`` when the source robot is single-arm or when the arm
+    is genuinely not visible in the frames.
+
+  grasp_type — which grip the gripper uses on contact. One of:
+    {grasp_vocabulary} | null
+    Use ``null`` when there is no contact in this span (e.g. a pure
+    ``move`` / ``reach`` subtask) or the grip is genuinely unclear.
+
+  destination — the target location for actions like ``place``,
+    ``move``, ``insert``, ``pour``. Use canonical names from the
+    original task. Use ``null`` for in-place actions (``press``,
+    ``turn``, ``grasp``, ``release``).
+
+  mistake — a brief one-clause description of any visible failure or
+    recovery during the span (e.g. "dropped the cube and re-grasped",
+    "missed the target on first attempt"). Use ``null`` when the span
+    completes cleanly with no visible recovery.
+
+Output strictly valid JSON of shape:
+
+  {{
+    "verb": "<one of vocabulary>",
+    "object": "<canonical noun>",
+    "arm": "left" | "right" | "both" | null,
+    "grasp_type": "<one of vocabulary>" | null,
+    "destination": "<canonical noun>" | null,
+    "mistake": "<short description>" | null
+  }}
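Review note: the hard rules in this prompt (closed vocabularies, `null` for non-applicable fields) are easy to check after parsing. The sketch below is one possible validator; the vocabularies are hypothetical placeholders, since the real values are injected through `{verb_vocabulary}` and `{grasp_vocabulary}` and are not shown in this diff.

```python
import json
from typing import Optional

# Hypothetical vocabularies, for illustration only.
VERBS = {"pick up", "place", "push", "open"}
GRASPS = {"pinch", "wrap"}
ARMS = {"left", "right", "both"}


def _optional_choice(value, allowed) -> bool:
    """True when value is null or a string from the allowed set."""
    return value is None or (isinstance(value, str) and value in allowed)


def validate_record(raw: str) -> Optional[dict]:
    """Return the parsed record, or None when it violates the schema."""
    try:
        rec = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(rec, dict):
        return None
    verb = rec.get("verb")
    if not isinstance(verb, str) or verb not in VERBS:
        return None
    if not isinstance(rec.get("object"), str) or not rec["object"].strip():
        return None
    if not _optional_choice(rec.get("arm"), ARMS):
        return None
    if not _optional_choice(rec.get("grasp_type"), GRASPS):
        return None
    return rec


good = '{"verb": "place", "object": "cube", "arm": null, "grasp_type": "pinch", "destination": "box", "mistake": null}'
bad = '{"verb": "place", "object": "cube", "arm": "n/a", "grasp_type": null}'
print(validate_record(good) is not None)
print(validate_record(bad))
```

The second record is rejected because `"n/a"` is used where the prompt requires `null`.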
--- a/src/lerobot/annotations/steerable_pipeline/prompts/module_1_memory.txt
+++ b/src/lerobot/annotations/steerable_pipeline/prompts/module_1_memory.txt
@@ -0,0 +1,36 @@
+You are updating the robot's compressed semantic memory at the boundary of
+a completed subtask.
+
+Reference (verbatim from MEM, Torne 2026):
+"Remove or compress information in the language memory whenever
+appropriate. Keep ONLY the minimal set of relevant information for future
+task execution. Specific object attributes (colors, precise quantities of
+each item) get discarded when their details won't affect subsequent
+actions. Functional outcomes (where items went, how many) are preserved."
+
+Episode task: "{episode_task}"
+Previous memory: {prior_memory}
+Just-completed subtask: "{completed_subtask}"
+Remaining subtasks (for relevance judgement only): {remaining_subtasks}
+
+Write the memory as a short FIRST-PERSON, PAST-TENSE narrative of what the
+robot has accomplished so far — the running story it would tell itself.
+
+Authoring rules:
+- First person, past tense. Every sentence starts with "I": "I picked
+  up...", "I opened...", "I moved to...".
+- One or two short sentences. Extend the previous memory with the
+  just-completed subtask; do not rewrite it from scratch.
+- Keep WHAT happened (functional outcomes — where items went, how many),
+  drop HOW (grasp details, motions).
+- Compress completed steps and drop object attributes (colors, exact
+  counts) once they no longer affect the remaining subtasks.
+
+Example (MEM, Torne 2026):
+  Before: "I prepared the pot and got the potatoes, milk, and butter. I
+           moved to the drawer."
+  After:  "I prepared the pot and got the ingredients. I opened the
+           drawer with the masher."
+
+Output strictly valid JSON:
+  {{ "memory": "<one or two short first-person past-tense sentences>" }}
--- a/src/lerobot/annotations/steerable_pipeline/prompts/module_1_subtask_describe.txt
+++ b/src/lerobot/annotations/steerable_pipeline/prompts/module_1_subtask_describe.txt
@@ -0,0 +1,27 @@
+You are watching a teleoperated robot demonstration from a single
+camera. The user asked the robot to: "{episode_task}"
+
+This is an OBSERVATION pass. Watch the entire clip and describe, in
+chronological order, ONLY what the robot physically does — the concrete
+motions, approaches, contacts, grasps, releases, and relocations you can
+actually SEE in the frames.
+
+Hard rules:
+- Describe only motion visible in the video. Do NOT use the task
+  instruction to guess steps that aren't shown. The instruction is the
+  goal; the video is ground truth.
+- Do NOT segment into named subtasks yet and do NOT output JSON beyond
+  the single field below. Just narrate what happens.
+- Give an approximate timestamp (in seconds) for each distinct event,
+  e.g. "0.0-1.4s: the base drives forward toward the stove".
+- Do NOT invent objects, grasps, destinations, or steps. If the robot
+  only does one thing (e.g. it just navigates and the clip ends), say
+  exactly that and nothing more.
+- Be concrete and literal. "the gripper closes on the mug" — not "the
+  robot prepares to make coffee".
+
+Output strictly valid JSON:
+
+  {{
+    "description": "<chronological, timestamped description of ONLY what is visible>"
+  }}
--- a/src/lerobot/annotations/steerable_pipeline/prompts/module_1_subtasks.txt
+++ b/src/lerobot/annotations/steerable_pipeline/prompts/module_1_subtasks.txt
@@ -0,0 +1,112 @@
+You are labeling a teleoperated robot demonstration.
+
+The user originally asked: "{episode_task}"
+
+You are shown the entire demonstration as a single video. Watch the
+whole clip, then segment it into a list of consecutive atomic subtasks
+the robot performs.
+
+{observation_block}GROUNDING — read this first, it overrides everything below:
+- Label ONLY what the robot actually does in the video. Every subtask
+  you emit must correspond to motion you can SEE in specific frames.
+- Do NOT invent, anticipate, or pad. If the robot only does one thing
+  (e.g. it just navigates to a location and the clip ends), emit
+  EXACTLY ONE subtask. Many demonstrations are a single atomic skill.
+- ``max_steps`` below is a hard CEILING, not a target. Emitting fewer
+  subtasks than the ceiling is not just allowed, it is expected for
+  short / atomic demonstrations. One correct subtask is far better
+  than several invented ones.
+- If the video does not clearly show the action implied by the task,
+  describe what you actually see — do NOT fabricate the task's steps
+  from the instruction text. The instruction tells you the goal; the
+  VIDEO is the ground truth for what happened.
+
+Authoring rules — Hi Robot atom granularity, pi0.7-style short prompts:
+
+- Each subtask = one COMPOSITE atomic skill the low-level policy can
+  execute end-to-end. A "skill" bundles its own approach motion with
+  its terminal action — do NOT split the approach off as its own
+  subtask. The whole-arm policy already learns to reach as part of
+  every manipulation primitive.
+- Write each subtask as an IMPERATIVE COMMAND, starting with one of
+  these verbs (extend only when none fits):
+    pick up <obj>           — approach + grasp + lift in one subtask
+    put <obj> on/in <loc>   — transport + release in one subtask
+    place <obj> on/in <loc> — synonym of "put"; pick one and stay consistent
+    push <obj>              — contact + linear shove
+    pull <obj>              — contact + linear retract
+    turn <knob/dial/handle> — rotary actuation
+    press <button>          — single-press contact
+    open <drawer/door/lid>  — full open motion
+    close <drawer/door/lid> — full close motion
+    pour <src> into <dst>   — tilt + flow
+    insert <obj> into <slot>— alignment + push-fit
+    go to <loc>             — ONLY when no grasp / actuation follows
+                             (e.g. a pure relocation between phases).
+                             If the next subtask grasps something at
+                             that location, drop "go to ..." and just
+                             write "pick up ..." instead.
+- Forbidden ultra-fine splits: do NOT emit these as standalone
+  subtasks; fold them into the parent composite:
+    "move to X"   → fold into "pick up X" (or whatever follows)
+    "reach for X" → fold into "pick up X"
+    "grasp X"     → fold into "pick up X"
+    "lift X"      → fold into "pick up X" (or "put X on Y" if it's
+                    the transport phase of a place)
+    "release X"   → fold into "put X on Y" (or "place X in Y")
+- Keep it SHORT — a verb phrase, not a sentence. Drop articles
+  ("the", "a") and adverbs ("carefully", "slowly"). Add a "how"
+  detail (which hand, which grasp point) ONLY when it is needed to
+  disambiguate. Every subtask must begin with one of the verbs
+  above (no leading nouns, no "then", no "first").
+- NEVER use third person. Never write "the robot", "the arm", "the
+  gripper moves", "it picks up" — the robot is implied. Command it,
+  do not describe it.
+- Use the exact object nouns from the task above. If the task says
+  "cube", every subtask says "cube" — never switch to "block". If it
+  says "box", never switch to "bin"/"container". Keep vocabulary
+  consistent across the whole episode.
+- Good: "pick up blue cube", "put blue cube in box", "open drawer",
+  "turn red knob", "press start button", "go to sink".
+- Bad: "move to blue cube" (approach as its own subtask — forbidden,
+  must be folded into "pick up blue cube"); "the robot arm moves
+  towards the blue cube" (third person, too long); "carefully pick
+  up the cube" (adverb, article); "release the yellow block"
+  ("block" when the task said "cube", and "release" must be folded
+  into a "put"/"place" subtask).
+- Subtasks are non-overlapping and cover the full episode in order.
+  Choose the cut points yourself based on what you see in the video
+  (gripper open/close events, contact, regrasps, transitions).
+- Each subtask spans at least {min_subtask_seconds} seconds. If a
+  candidate span would be shorter, merge it into its neighbour
+  rather than emitting it.
+- Do not exceed {max_steps} subtasks total. Fewer, larger composites
+  are preferred over many micro-steps.
+- Every subtask's [start_time, end_time] must lie within
+  [0.0, {episode_duration}] seconds.
+
+SPECIAL CASES — verb disambiguation (each rule is narrowly visual and
+fires ONLY on the spatial situation it names; it must not change how you
+label any other situation):
+- STACK vs PUT: if an object is placed ON TOP OF another specific object
+  (not on a flat table / shelf / counter), use "stack ... on ...", not
+  "put". "stack blue book on green book", NOT "put blue book on green book".
+- INSERT vs PUT: if an object goes INTO a fitted slot / hole / socket /
+  receptacle (push-fit), use "insert ... into ...", not "put".
+- RETRIEVE/PICK-UP vs PUT (direction): watch the gripper. If it CLOSES
+  on the object and the object moves WITH the hand, it is "pick up" /
+  "retrieve" (object leaves its location). If the gripper OPENS and the
+  object stays where the hand left it, it is "put" / "place" (object
+  arrives at a location). Decide by which way the object moves, not by
+  where the hand ends up.
+- POUR vs PUT: only use "pour" when the source is tilted and contents
+  flow out; moving a full container without tilting is "put"/"place".
+
+Output strictly valid JSON of shape:
+
+  {{
+    "subtasks": [
+      {{"text": "<short imperative verb phrase>", "start": <float>, "end": <float>}},
+      ...
+    ]
+  }}
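Review note: the structural constraints stated in this prompt (step ceiling, minimum duration, no overlap, spans inside the episode) can be checked mechanically on the parsed output. The function below is a hypothetical diagnostic sketch; the pipeline code shown earlier in this diff clamps and stitches spans rather than rejecting them.

```python
def check_subtasks(subtasks, episode_duration, min_subtask_seconds, max_steps):
    """Return a list of human-readable violations (empty list means OK)."""
    problems = []
    if len(subtasks) > max_steps:
        problems.append(f"{len(subtasks)} subtasks exceeds ceiling {max_steps}")
    prev_end = 0.0
    for i, span in enumerate(subtasks):
        start, end = float(span["start"]), float(span["end"])
        if not 0.0 <= start <= end <= episode_duration:
            problems.append(f"subtask {i}: outside [0.0, {episode_duration}]")
        if end - start < min_subtask_seconds:
            problems.append(f"subtask {i}: shorter than {min_subtask_seconds}s")
        if start < prev_end:
            problems.append(f"subtask {i}: overlaps the previous subtask")
        prev_end = max(prev_end, end)
    return problems


ok = [
    {"text": "pick up cube", "start": 0.0, "end": 2.0},
    {"text": "put cube in box", "start": 2.0, "end": 5.0},
]
bad = [
    {"text": "pick up cube", "start": 0.0, "end": 2.0},
    {"text": "grasp cube", "start": 1.5, "end": 1.75},
]
print(check_subtasks(ok, 5.0, 1.0, 4))
print(check_subtasks(bad, 5.0, 1.0, 4))
```

The second list triggers two violations for its last entry: it is shorter than the minimum and it starts before the previous subtask ends.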
--- a/src/lerobot/annotations/steerable_pipeline/prompts/module_1_task_aug_axes.txt
+++ b/src/lerobot/annotations/steerable_pipeline/prompts/module_1_task_aug_axes.txt
@@ -0,0 +1,67 @@
+You are generating structured augmentations of a robot task instruction
+for training a language-conditioned policy. Unlike free-form rephrasing,
+your variants follow a NAMED 5-axis taxonomy — each axis omits or varies
+a specific element of the task while preserving its meaning.
+
+Original task: "{base_task}"
+
+Produce variants along five named axes. Each axis has a target count.
+The whole batch should expose the policy to maximum linguistic diversity
+WITHOUT changing what the robot is supposed to do.
+
+Axes and target counts:
+
+  synonym_paraphrase ({n_synonym}):
+    Different wording / verbs / sentence structure. ALL information
+    from the original task is preserved — same object, same arm
+    specification if present, same orientation if present, same grasp
+    if present.
+
+  omit_arm ({n_omit_arm}):
+    Drop the left/right/both arm specification from the task. Skip
+    entirely (emit 0 entries) if the original task does NOT mention an
+    arm. Do not invent an arm specification just to omit it.
+
+  omit_orientation ({n_omit_orientation}):
+    Drop orientation cues (upright, sideways, facing the user,
+    long-edge-first, etc.). Skip entirely if no orientation cue is
+    present in the original task.
+
+  omit_grasp_method ({n_omit_grasp_method}):
+    Drop the grip / grasp method specification (pinch, wrap, hold by
+    the rim, etc.). Skip entirely if no grasp method is mentioned.
+
+  combined_omissions ({n_combined}):
+    Combine TWO of the above omissions simultaneously (e.g. drop both
+    arm and orientation). Skip entirely if fewer than two of (arm,
+    orientation, grasp_method) appear in the original task.
+
+Hard rules:
+- Each variant MUST preserve the core action, the target object, AND
+  the goal / destination. Do not change which object is involved, where
+  it goes, or the high-level action. "Navigate to the stove" may become
+  "go to the stove" or "head over to the stove" — it must NEVER become
+  "wander around the kitchen", "explore the room", or anything that
+  drops or generalises the stove destination. If you cannot vary the
+  wording without changing the goal, emit fewer variants.
+- Only what the FIVE listed axes cover (wording, arm, orientation,
+  grasp method, or a combination of omissions) may be varied or
+  omitted. The verb's meaning, the object, and the destination are fixed.
+- Each variant is plain prose, no markdown, no quotes, no list numbers.
+- Each variant must be DISTINCT from every other variant in the entire
+  output, both within and across axes. Near-duplicates are not allowed.
+- If an axis cannot reach its target count because the original task
+  lacks the omittable element, emit fewer entries — do NOT pad the
+  axis with paraphrases that belong to a different axis.
+- Variants should not all start with verbs — vary sentence structure
+  (some imperative, some polite request, some question).
+
+Output strictly valid JSON of shape:
+
+  {{
+    "synonym_paraphrase": ["<v1>", "<v2>", ...],
+    "omit_arm": ["<v1>", "<v2>", ...],
+    "omit_orientation": ["<v1>", ...],
+    "omit_grasp_method": ["<v1>", ...],
+    "combined_omissions": ["<v1>", ...]
+  }}
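Review note: the prompt asks for variants that are distinct within and across axes, and allows axes to come back empty. A consumer could enforce that after parsing, as in the sketch below. The function name and the normalisation rule (lowercase, collapsed whitespace) are assumptions for illustration; catching true near-duplicates would need a fuzzier comparison.

```python
AXES = (
    "synonym_paraphrase",
    "omit_arm",
    "omit_orientation",
    "omit_grasp_method",
    "combined_omissions",
)


def flatten_variants(payload: dict) -> list:
    """Return (axis, text) pairs, dropping duplicates across all axes."""
    seen = set()
    out = []
    for axis in AXES:
        entries = payload.get(axis) or []  # missing or null axis counts as empty
        if not isinstance(entries, list):
            continue
        for variant in entries:
            if not isinstance(variant, str):
                continue
            key = " ".join(variant.lower().split())
            if not key or key in seen:
                continue
            seen.add(key)
            out.append((axis, variant.strip()))
    return out


payload = {
    "synonym_paraphrase": ["Go to the stove", "head over to the stove"],
    "omit_arm": [],
    "omit_orientation": ["go to  the STOVE"],
    "combined_omissions": None,
}
print(flatten_variants(payload))
```

The `omit_orientation` entry is dropped because it normalises to the same string as the first paraphrase.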
--- a/src/lerobot/annotations/steerable_pipeline/prompts/module_1_task_rephrasings.txt
+++ b/src/lerobot/annotations/steerable_pipeline/prompts/module_1_task_rephrasings.txt
@@ -0,0 +1,32 @@
+You are generating training data for a Hi Robot-style policy. We need
+{n} alternative phrasings of the same robot task so the policy sees
+diverse user prompts during training instead of the same canonical
+string repeated every frame.
+
+Original task:
+"{base_task}"
+
+Generate exactly {n} alternative phrasings of the same task. Vary:
+
+- formality (casual / polite / curt)
+- verbosity (mostly short imperative; occasional polite request)
+- word choice (synonyms, different verbs)
+- sentence structure (imperative / question / suggestion)
+
+Hard rules:
+- Each phrasing MUST preserve the exact meaning of the original task.
+  Do not change which object is involved, the destination, or the
+  action. Do not add extra steps. Do not invent new objects.
+- Each phrasing must be a short phrase or sentence, plain prose, no
+  markdown, no quotes, no list numbers.
+- Phrasings must be distinct — no near-duplicates.
+- Output exactly {n} entries.
+
+Output strictly valid JSON:
+  {{
+    "rephrasings": [
+      "<phrasing 1>",
+      "<phrasing 2>",
+      ...
+    ]
+  }}
--- a/src/lerobot/annotations/steerable_pipeline/prompts/module_1_video_task.txt
+++ b/src/lerobot/annotations/steerable_pipeline/prompts/module_1_video_task.txt
@@ -0,0 +1,17 @@
+The video above shows a robot manipulation episode in full. Look at
+the entire video and describe in ONE concise sentence what the robot
+is doing.
+
+Rules:
+- One sentence, in natural English, like a user instruction.
+- Capture the goal of the demonstration, not low-level motions.
+  Example: "place the yellow cube into the red bin" — not "move the
+  end-effector down 5cm and close the gripper".
+- 4 to 15 words. Plain prose, no markdown, no bullets, no quotes.
+- Do not invent objects or actions that aren't visible.
+- Do not output anything other than the JSON object below.
+
+Output strictly valid JSON:
+  {{
+    "task": "<single concise sentence describing what the robot does in this video>"
+  }}
--- a/src/lerobot/annotations/steerable_pipeline/prompts/module_2_initial_speech.txt
+++ b/src/lerobot/annotations/steerable_pipeline/prompts/module_2_initial_speech.txt
@@ -0,0 +1,12 @@
+The user just asked the robot: "{episode_task}".
+
+Generate a short verbal acknowledgement the robot would speak back before
+beginning the task. Style: compact, confident, friendly.
+
+Examples (Hi Robot, Shi 2025): "Sure, I won't put cheese on it.",
+"OK, starting with the sponge.", "Got it.".
+
+Prefer very short replies: "Got it.", "On it.", "OK."
+
+Output strictly valid JSON:
+  {{ "text": "<the spoken acknowledgement>" }}
--- a/src/lerobot/annotations/steerable_pipeline/prompts/module_2_interjection.txt
+++ b/src/lerobot/annotations/steerable_pipeline/prompts/module_2_interjection.txt
@@ -0,0 +1,46 @@
+You are generating training data for a Hi Robot-style hierarchical
+robot policy. The robot in this demonstration has ALREADY executed
+every step shown in the video — we cannot retroactively change the
+action stream. To keep training data consistent with the video, the
+"interjection" must align with what the robot is *about to do next* in
+the demonstration, framed as a natural mid-task user request.
+
+The episode's overall task: "{episode_task}".
+
+The images above show roughly {window_seconds:.1f} seconds straddling a
+subtask boundary in the demonstration:
+
+- Subtask the robot just finished: "{prev_subtask}"
+- Subtask the robot is about to start: "{next_subtask}"
+- Time into episode: {timestamp:.2f}s
+
+Write ONE compact interjection the user would naturally say at this
+moment to prompt / confirm / encourage the robot to do "{next_subtask}".
+Keep it like a mid-task coaching cue, not a full instruction paragraph.
+Also write the robot's compact verbal acknowledgement.
+
+Hard rules:
+
+- The interjection MUST be consistent with the next subtask. The user
+  cannot ask for something different from what the robot then does in
+  the video. If you're tempted to say "actually skip X" or "do Y
+  instead", DO NOT — those would contradict the demonstration.
+- The interjection must reference an object, location, or action that
+  is plausible given the visible scene and the next subtask text.
+- One short phrase or sentence each. Conversational, not robotic.
+- Prefer direct cues: "{next_subtask}, please."; "Now {next_subtask}."
+- Keep robot speech very short: "OK.", "On it.", "Doing that."
+
+Style examples (vary the phrasing — don't reuse these verbatim):
+  - "Now go ahead and {next_subtask}."
+  - "Great, can you {next_subtask} next?"
+  - "{next_subtask}, please."
+  - "Before you continue, please {next_subtask}."
+  - "Looking good — {next_subtask} now."
+  - "Okay, {next_subtask}."
+
+Output strictly valid JSON:
+  {{
+    "interjection": "<short cue from the user, asking for the next subtask>",
+    "speech":       "<short robot acknowledgement>"
+  }}
--- a/src/lerobot/annotations/steerable_pipeline/prompts/module_3_vqa.txt
+++ b/src/lerobot/annotations/steerable_pipeline/prompts/module_3_vqa.txt
@@ -0,0 +1,32 @@
+You are generating a frame-grounded visual question/answer pair for
+chain-of-thought training. Reference: ECoT (Zawalski 2024) and Steerable
+Policies — both train policies on grounded features such as bounding box
+pixel coordinates, keypoints, counts, attributes, and spatial relations.
+
+The frame shows a robot working on: "{episode_task}".
+
+Question types and the EXACT answer JSON shape required for each:
+
+  bbox       => {{"detections": [{{"label": "<obj>", "bbox_format": "xyxy",
+                                    "bbox": [x1, y1, x2, y2]}}, ...]}}
+                bbox is in pixel coordinates (x_min, y_min, x_max, y_max).
+                ECoT example: "a white cup [124, 25, 176, 113]".
+
+  keypoint   => {{"label": "<point>", "point_format": "xy",
+                  "point": [x, y]}}
+
+  count      => {{"label": "<obj>", "count": <int>,
+                  "note": "<optional short note>"}}
+
+  attribute  => {{"label": "<obj>", "attribute": "<color|shape|state|...>",
+                  "value": "<observed value>"}}
+
+  spatial    => {{"subject": "<obj>",
+                  "relation": "<left_of|right_of|on|in|above|below|near>", "object": "<obj>"}}
+
+Generate a question of type "{question_type}". Output strictly valid JSON:
+
+  {{
+    "question": "<short, frame-grounded question>",
+    "answer":   <object whose shape matches the schema above>
+  }}
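Review note: since each question type has an exact answer shape, the parsed `answer` can be checked against the schema before a row is written. The checker below is a hypothetical sketch derived only from the shapes listed in this prompt; it is not the pipeline's own validation.

```python
REQUIRED_KEYS = {
    "bbox": {"detections"},
    "keypoint": {"label", "point_format", "point"},
    "count": {"label", "count"},
    "attribute": {"label", "attribute", "value"},
    "spatial": {"subject", "relation", "object"},
}
RELATIONS = {"left_of", "right_of", "on", "in", "above", "below", "near"}


def _numbers(values, n) -> bool:
    """True when values is a list of exactly n ints or floats."""
    return (
        isinstance(values, list)
        and len(values) == n
        and all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values)
    )


def answer_matches(question_type: str, answer) -> bool:
    """Check that answer has the shape required for question_type."""
    required = REQUIRED_KEYS.get(question_type)
    if required is None or not isinstance(answer, dict) or not required.issubset(answer):
        return False
    if question_type == "bbox":
        dets = answer["detections"]
        return isinstance(dets, list) and all(
            isinstance(d, dict)
            and d.get("bbox_format") == "xyxy"
            and _numbers(d.get("bbox"), 4)
            and d["bbox"][0] <= d["bbox"][2]
            and d["bbox"][1] <= d["bbox"][3]
            for d in dets
        )
    if question_type == "keypoint":
        return answer["point_format"] == "xy" and _numbers(answer["point"], 2)
    if question_type == "count":
        count = answer["count"]
        return isinstance(count, int) and not isinstance(count, bool) and count >= 0
    if question_type == "spatial":
        relation = answer["relation"]
        return isinstance(relation, str) and relation in RELATIONS
    return True  # attribute: key presence is all the schema specifies


cup = {"detections": [{"label": "a white cup", "bbox_format": "xyxy", "bbox": [124, 25, 176, 113]}]}
print(answer_matches("bbox", cup))
print(answer_matches("spatial", {"subject": "cup", "relation": "behind", "object": "box"}))
```

The bounding box reuses the ECoT example from the prompt; the spatial answer fails because `behind` is not among the listed relations.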
--- a/src/lerobot/annotations/steerable_pipeline/reader.py
+++ b/src/lerobot/annotations/steerable_pipeline/reader.py
@@ -0,0 +1,274 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Datatrove-shaped reader.
+
+The reader walks ``data/chunk-*/file-*.parquet`` and yields one record per
+episode containing:
+
+- ``episode_index``: int
+- ``frame_timestamps``: tuple[float, ...]
+- ``frame_indices``: tuple[int, ...]
+- ``episode_task``: str (canonical task from ``meta/tasks.parquet``)
+- ``data_path``: pathlib.Path of the source parquet shard
+- ``frames_df()``: pandas.DataFrame slice for the episode (loaded lazily, then memoized)
+
+This shape lets each module operate per-episode without loading all parquet
+rows into memory at once.
+"""
+
+from __future__ import annotations
+
+from collections.abc import Iterator, Sequence
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Any
+
+import pyarrow.parquet as pq
+
+from lerobot.datasets.io_utils import load_tasks
+from lerobot.datasets.utils import DEFAULT_TASKS_PATH
+
+
+@dataclass
+class EpisodeRecord:
+    """Per-episode record yielded by the reader."""
+
+    episode_index: int
+    episode_task: str
+    frame_timestamps: tuple[float, ...]
+    frame_indices: tuple[int, ...]
+    data_path: Path
+    row_offset: int  # row offset within the parquet file where this episode starts
+    row_count: int  # number of rows for this episode
+
+    # Memoized parquet slice — populated on first ``frames_df()`` call so
+    # repeat queries from different modules don't re-read the whole shard.
+    _frames_df_cache: Any = field(default=None, init=False, repr=False, compare=False)
+
+    def frames_df(self):  # type: ignore[no-untyped-def]
+        """Lazy-load the pandas slice for this episode (memoized)."""
+        if self._frames_df_cache is None:
+            import pandas as pd  # noqa: PLC0415  - deferred for optional dataset extra
+
+            table = pq.read_table(self.data_path)
+            df: pd.DataFrame = table.to_pandas()
+            self._frames_df_cache = df.iloc[self.row_offset : self.row_offset + self.row_count].reset_index(
+                drop=True
+            )
+        return self._frames_df_cache
+
+
+def reconstruct_subtask_spans(
+    rows: Sequence[dict[str, Any]],
+    *,
+    episode_end_t: float | None = None,
+) -> list[dict[str, Any]]:
+    """Turn ``style="subtask"`` rows into ``{text, start, end}`` spans.
+
+    Each span's ``end`` is the next span's ``start``. The final span's
+    ``end`` defaults to its own ``start`` (zero-duration) — pass
+    ``episode_end_t`` to extend it to the episode's last frame instead,
+    which is what downstream consumers (memory, interjection boundary
+    selection) expect.
+
+    Used by the ``plan`` module (plan-update pass) and the
+    ``interjections`` module (interjection anchoring), which both need the
+    same span shape.
+    """
+    sorted_rows = sorted(
+        (r for r in rows if r.get("style") == "subtask"),
+        key=lambda r: float(r["timestamp"]),
+    )
+    spans: list[dict[str, Any]] = []
+    for r in sorted_rows:
+        t = float(r["timestamp"])
+        if spans:
+            spans[-1]["end"] = t
+        spans.append({"text": r.get("content") or "", "start": t, "end": t})
+    if spans and episode_end_t is not None and float(episode_end_t) > spans[-1]["start"]:
+        spans[-1]["end"] = float(episode_end_t)
+    return spans
+
+
+def snap_to_frame(t: float, frame_timestamps: Sequence[float]) -> float:
+    """Snap an arbitrary float to the nearest exact source frame timestamp.
+
+    Modules use this when emitting event-style rows so the row's
+    timestamp matches a real parquet frame: event rows must land on an
+    exact frame, otherwise the per-frame event lookup the writer does
+    would never match them.
+    """
+    if not frame_timestamps:
+        return float(t)
+    nearest = min(frame_timestamps, key=lambda f: abs(f - t))
+    return float(nearest)
+
+
+def _load_tasks_lookup(root: Path) -> dict[int, str]:
+    """Map ``task_index -> task`` from ``meta/tasks.parquet``.
+
+    Returns an empty dict when the file is absent — the task description is
+    derived later from the video if needed. Reuses the library-level
+    :func:`lerobot.datasets.io_utils.load_tasks`, which returns the tasks
+    frame indexed by task string with a ``task_index`` column.
+    """
+    if not (root / DEFAULT_TASKS_PATH).exists():
+        return {}
+    tasks = load_tasks(root)
+    return {int(idx): str(task) for task, idx in zip(tasks.index, tasks["task_index"], strict=True)}
+
+
+def iter_episodes(root: Path, *, only_episodes: tuple[int, ...] | None = None) -> Iterator[EpisodeRecord]:
+    """Yield :class:`EpisodeRecord` for every episode under ``root/data/``.
+
+    Files are visited in sorted path order and episodes in row order, which is
+    ascending ``episode_index`` when shards are written in episode order. The
+    reader scans every ``*.parquet`` under ``data/`` and groups contiguous runs.
+    """
+    tasks = _load_tasks_lookup(root)
+    data_dir = root / "data"
+    parquet_files = sorted(data_dir.rglob("*.parquet"))
+
+    only_set = set(only_episodes) if only_episodes is not None else None
+
+    for path in parquet_files:
+        yield from _iter_one_path(path, tasks, only_set)
+
+
+def _iter_one_path(path: Path, tasks: dict[int, str], only_set: set[int] | None) -> Iterator[EpisodeRecord]:
+    table = pq.read_table(path)
+    names = table.column_names
+    if "episode_index" not in names:
+        return
+    episode_col = table.column("episode_index").to_pylist()
+    timestamp_col = (
+        table.column("timestamp").to_pylist() if "timestamp" in names else [0.0] * len(episode_col)
+    )
+    frame_col = (
+        table.column("frame_index").to_pylist() if "frame_index" in names else list(range(len(episode_col)))
+    )
+    task_col = table.column("task_index").to_pylist() if "task_index" in names else None
+
+    def _build(
+        ep: int,
+        start: int,
+        end: int,
+        task_idx: int | None,
+        ts_buf: list[float],
+        fi_buf: list[int],
+    ) -> EpisodeRecord | None:
+        if only_set is not None and ep not in only_set:
+            return None
+        task = tasks.get(task_idx, "") if task_idx is not None else ""
+        return EpisodeRecord(
+            episode_index=ep,
+            episode_task=task,
+            frame_timestamps=tuple(ts_buf),
+            frame_indices=tuple(fi_buf),
+            data_path=path,
+            row_offset=start,
+            row_count=end - start,
+        )
+
+    cur_ep: int | None = None
+    start_offset = 0
+    ts_buf: list[float] = []
+    fi_buf: list[int] = []
+    cur_task_idx: int | None = None
+
+    for i, ep in enumerate(episode_col):
+        if cur_ep is None:
+            cur_ep = ep
+            start_offset = i
+            ts_buf = [timestamp_col[i]]
+            fi_buf = [frame_col[i]]
+            cur_task_idx = task_col[i] if task_col is not None else None
+            continue
+        if ep != cur_ep:
+            rec = _build(cur_ep, start_offset, i, cur_task_idx, ts_buf, fi_buf)
+            if rec is not None:
+                yield rec
+            cur_ep = ep
+            start_offset = i
+            ts_buf = [timestamp_col[i]]
+            fi_buf = [frame_col[i]]
+            cur_task_idx = task_col[i] if task_col is not None else None
+        else:
+            ts_buf.append(timestamp_col[i])
+            fi_buf.append(frame_col[i])
+
+    if cur_ep is not None:
+        rec = _build(cur_ep, start_offset, len(episode_col), cur_task_idx, ts_buf, fi_buf)
+        if rec is not None:
+            yield rec
+
+
+def gather_data_paths(root: Path) -> list[Path]:
+    """Return every ``*.parquet`` path under ``root/data`` (any layout), sorted."""
+    return sorted((root / "data").rglob("*.parquet"))
+
+
+def episode_offsets_per_path(path: Path) -> dict[int, tuple[int, int]]:
+    """Return ``{episode_index: (row_offset, row_count)}`` for one parquet."""
+    table = pq.read_table(path, columns=["episode_index"])
+    episode_col = table.column("episode_index").to_pylist()
+    out: dict[int, tuple[int, int]] = {}
+    cur_ep: int | None = None
+    start = 0
+    for i, ep in enumerate(episode_col):
+        if cur_ep is None:
+            cur_ep = ep
+            start = i
+            continue
+        if ep != cur_ep:
+            out[cur_ep] = (start, i - start)
+            cur_ep = ep
+            start = i
+    if cur_ep is not None:
+        out[cur_ep] = (start, len(episode_col) - start)
+    return out
+
+
+def keyframe_indices(record: EpisodeRecord, k: int) -> list[int]:
+    """Return ``k`` evenly spaced row indices into the episode (relative)."""
+    n = record.row_count
+    if k <= 0 or n == 0:
+        return []
+    if k >= n:
+        return list(range(n))
+    step = (n - 1) / (k - 1) if k > 1 else 0.0
+    return [int(round(i * step)) for i in range(k)] if k > 1 else [n // 2]
+
+
+def lookup_data_path(root: Path, episode_index: int) -> tuple[Path, int, int] | None:
+    """Find the parquet file containing ``episode_index`` and its slice bounds."""
+    for path in gather_data_paths(root):
+        offsets = episode_offsets_per_path(path)
+        if episode_index in offsets:
+            start, count = offsets[episode_index]
+            return path, start, count
+    return None
+
+
+def episode_frame_timestamps(root: Path, episode_index: int) -> tuple[Path, list[float]]:
+    """Return the parquet path and per-frame timestamps for ``episode_index``."""
+    found = lookup_data_path(root, episode_index)
+    if found is None:
+        raise ValueError(f"Episode {episode_index} not found under {root}/data/")
+    path, start, count = found
+    table = pq.read_table(path, columns=["timestamp"])
+    timestamps = table.column("timestamp").to_pylist()[start : start + count]
+    return path, [float(t) for t in timestamps]
--- a/src/lerobot/annotations/steerable_pipeline/staging.py
+++ b/src/lerobot/annotations/steerable_pipeline/staging.py
@@ -0,0 +1,104 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Per-episode staging.
+
+Each module writes its raw output as a JSONL file under
+``<staging_dir>/episode_{ep:06d}/<module>.jsonl``. The writer reads back this
+staging tree and partitions rows into the two language columns.
+
+JSONL is preferred over parquet here because the staging artifact is meant to
+be human-inspectable, easy to diff between prompt iterations, and trivially
+appended to. The final dataset format is parquet; staging is just an
+intermediate.
+"""
+
+from __future__ import annotations
+
+import json
+from collections.abc import Iterable, Iterator
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any
+
+ModuleName = str
+
+_MODULES: tuple[ModuleName, ...] = (
+    "plan",
+    "interjections",
+    "vqa",
+)
+
+
+@dataclass
+class EpisodeStaging:
+    """Filesystem layout for a single episode's staged module outputs."""
+
+    root: Path
+    episode_index: int
+
+    @property
+    def episode_dir(self) -> Path:
+        return self.root / f"episode_{self.episode_index:06d}"
+
+    def path_for(self, module: ModuleName) -> Path:
+        if module not in _MODULES:
+            raise ValueError(f"Unknown module {module!r}; expected one of {_MODULES}")
+        return self.episode_dir / f"{module}.jsonl"
+
+    def write(self, module: ModuleName, rows: Iterable[dict[str, Any]]) -> Path:
+        path = self.path_for(module)
+        path.parent.mkdir(parents=True, exist_ok=True)
+        # Atomic replace: a crash mid-write would otherwise leave a
+        # half-written JSONL file that ``read()`` would then fail to
+        # parse. Write to a sibling .tmp and rename so the target path
+        # only ever points at a complete file.
+        tmp_path = path.with_suffix(path.suffix + ".tmp")
+        with tmp_path.open("w", encoding="utf-8") as f:
+            for row in rows:
+                f.write(json.dumps(row, ensure_ascii=False, sort_keys=True))
+                f.write("\n")
+        tmp_path.replace(path)
+        return path
+
+    def read(self, module: ModuleName) -> list[dict[str, Any]]:
+        path = self.path_for(module)
+        if not path.exists():
+            return []
+        out: list[dict[str, Any]] = []
+        with path.open(encoding="utf-8") as f:
+            for line in f:
+                line = line.strip()
+                if line:
+                    out.append(json.loads(line))
+        return out
+
+    def read_all(self) -> dict[ModuleName, list[dict[str, Any]]]:
+        return {m: self.read(m) for m in _MODULES}
+
+    def has(self, module: ModuleName) -> bool:
+        return self.path_for(module).exists()
+
+
+def iter_staged_episodes(root: Path) -> Iterator[int]:
+    """Yield episode indices for which any staging artifact exists."""
+    if not root.exists():
+        return
+    for child in sorted(root.iterdir()):
+        if child.is_dir() and child.name.startswith("episode_"):
+            try:
+                yield int(child.name.removeprefix("episode_"))
+            except ValueError:
+                continue
--- a/src/lerobot/annotations/steerable_pipeline/validator.py
+++ b/src/lerobot/annotations/steerable_pipeline/validator.py
@@ -0,0 +1,326 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Pre-write validation against staged outputs.
+
+Runs after all three modules have written their per-episode artifacts but
+*before* the writer rewrites parquet shards. The validator never touches
+parquet; it only inspects the staging tree and the source frame timestamps
+exposed by :class:`EpisodeRecord`.
+
+Checks (per the plan's "Intermediate staging and validation" section):
+
+- exact timestamp alignment against source frame timestamps
+- no orphan speech / interjection pairs
+- plan / memory emission consistency (events have a paired persistent row)
+- VQA assistant ``content`` is valid JSON (one of bbox / keypoint / count /
+  attribute / spatial)
+- every row maps to its correct column under :func:`column_for_style`
+"""
+
+from __future__ import annotations
+
+import json
+import logging
+from collections.abc import Iterable, Sequence
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Any
+
+from lerobot.datasets.language import (
+    LANGUAGE_EVENTS,
+    LANGUAGE_PERSISTENT,
+    column_for_style,
+    is_view_dependent_style,
+    validate_camera_field,
+)
+
+from .reader import EpisodeRecord
+from .staging import EpisodeStaging
+
+logger = logging.getLogger(__name__)
+
+
+@dataclass
+class ValidationReport:
+    """Outcome of one validation pass across all episodes."""
+
+    errors: list[str] = field(default_factory=list)
+    warnings: list[str] = field(default_factory=list)
+    episodes_checked: int = 0
+
+    @property
+    def ok(self) -> bool:
+        return not self.errors
+
+    def add_error(self, message: str) -> None:
+        self.errors.append(message)
+
+    def add_warning(self, message: str) -> None:
+        self.warnings.append(message)
+
+    def summary(self) -> str:
+        return f"checked={self.episodes_checked} errors={len(self.errors)} warnings={len(self.warnings)}"
+
+
+VQA_ANSWER_SHAPES: dict[str, set[str]] = {
+    "bbox": {"detections"},
+    "keypoint": {"label", "point_format", "point"},
+    "count": {"label", "count"},
+    "attribute": {"label", "attribute", "value"},
+    "spatial": {"subject", "relation", "object"},
+}
+
+
+def classify_vqa_answer(payload: Any) -> str | None:
+    """Best-effort classification of a VQA answer payload to a question type."""
+    if not isinstance(payload, dict):
+        return None
+    keys = set(payload.keys())
+    for kind, required in VQA_ANSWER_SHAPES.items():
+        if required.issubset(keys):
+            return kind
+    return None
+
+
+@dataclass
+class StagingValidator:
+    """Walks the staging tree and produces a :class:`ValidationReport`."""
+
+    timestamp_atol: float = 0.0  # exact-match by default
+    dataset_camera_keys: tuple[str, ...] | None = None
+    """Known ``observation.images.*`` keys on the dataset. When set, the
+    validator additionally enforces that every view-dependent row's
+    ``camera`` field references one of these keys. Pass ``None`` (default)
+    to skip that cross-check (e.g. in unit tests with no real dataset)."""
+
+    def validate(
+        self,
+        records: Sequence[EpisodeRecord],
+        staging_dir: Path,
+    ) -> ValidationReport:
+        report = ValidationReport()
+        for record in records:
+            self._validate_episode(record, staging_dir, report)
+            report.episodes_checked += 1
+        return report
+
+    def _validate_episode(
+        self,
+        record: EpisodeRecord,
+        staging_dir: Path,
+        report: ValidationReport,
+    ) -> None:
+        staging = EpisodeStaging(staging_dir, record.episode_index)
+        staged = staging.read_all()
+        all_rows: list[dict[str, Any]] = []
+        for module_name, rows in staged.items():
+            for row in rows:
+                row = {**row, "_module": module_name}
+                all_rows.append(row)
+
+        frame_ts = set(record.frame_timestamps)
+
+        events: list[dict[str, Any]] = []
+        persistent: list[dict[str, Any]] = []
+        for row in all_rows:
+            self._check_column_routing(row, report, record.episode_index)
+            self._check_camera_field(row, report, record.episode_index, self.dataset_camera_keys)
+            try:
+                target_col = column_for_style(row.get("style"))
+            except ValueError:
+                # Unknown style: already reported by _check_column_routing.
+                continue
+            if target_col == LANGUAGE_PERSISTENT:
+                persistent.append(row)
+            else:
+                events.append(row)
+
+        for row in events:
+            self._check_event_timestamp_alignment(row, frame_ts, report, record.episode_index)
+
+        self._check_speech_interjection_pairs(events, report, record.episode_index)
+        self._check_plan_memory_consistency(persistent, events, report, record.episode_index)
+        self._check_vqa_json(events, report, record.episode_index)
+        self._check_vqa_uniqueness_per_frame_camera(events, report, record.episode_index)
+
+    def _check_camera_field(
+        self,
+        row: dict[str, Any],
+        report: ValidationReport,
+        episode_index: int,
+        dataset_camera_keys: Sequence[str] | None,
+    ) -> None:
+        """Enforce the camera invariant + that the key matches the dataset's cameras."""
+        style = row.get("style")
+        camera = row.get("camera")
+        try:
+            validate_camera_field(style, camera)
+        except ValueError as exc:
+            report.add_error(f"ep={episode_index} module={row.get('_module')}: {exc}")
+            return
+        if is_view_dependent_style(style) and dataset_camera_keys and camera not in dataset_camera_keys:
+            report.add_error(
+                f"ep={episode_index} module={row.get('_module')}: camera {camera!r} on style "
+                f"{style!r} is not one of the dataset's video keys {sorted(dataset_camera_keys)!r}"
+            )
+
+    def _check_vqa_uniqueness_per_frame_camera(
+        self,
+        events: Iterable[dict[str, Any]],
+        report: ValidationReport,
+        episode_index: int,
+    ) -> None:
+        """Ensure at most one (vqa, user) and one (vqa, assistant) per (t, camera)."""
+        counts: dict[tuple[float, str, str], int] = {}
+        for row in events:
+            if row.get("style") != "vqa":
+                continue
+            ts = row.get("timestamp")
+            camera = row.get("camera")
+            role = row.get("role")
+            if ts is None or camera is None or role is None:
+                continue  # other validators flag these
+            key = (float(ts), str(camera), str(role))
+            counts[key] = counts.get(key, 0) + 1
+        for (ts, camera, role), n in counts.items():
+            if n > 1:
+                report.add_error(
+                    f"ep={episode_index}: {n} duplicate vqa rows at t={ts} "
+                    f"camera={camera!r} role={role!r}; expected at most one per (t, camera, role)"
+                )
+
+    def _check_column_routing(
+        self,
+        row: dict[str, Any],
+        report: ValidationReport,
+        episode_index: int,
+    ) -> None:
+        style = row.get("style")
+        module = row.get("_module")
+        try:
+            target_col = column_for_style(style)
+        except ValueError:
+            report.add_error(f"ep={episode_index} module={module}: unknown style {style!r}")
+            return
+        if module == "plan" and target_col != LANGUAGE_PERSISTENT:
+            report.add_error(
+                f"ep={episode_index} module=plan emitted style {style!r} that routes to {target_col} (must be persistent)"
+            )
+        if module in {"interjections", "vqa"} and target_col != LANGUAGE_EVENTS:
+            report.add_error(
+                f"ep={episode_index} module={module} emitted style {style!r} that routes to {target_col} (must be events)"
+            )
+
+    def _check_event_timestamp_alignment(
+        self,
+        row: dict[str, Any],
+        frame_ts: set[float],
+        report: ValidationReport,
+        episode_index: int,
+    ) -> None:
+        ts = row.get("timestamp")
+        if ts is None:
+            report.add_error(f"ep={episode_index}: event row missing timestamp: {row!r}")
+            return
+        if self.timestamp_atol == 0.0:
+            if float(ts) not in frame_ts:
+                report.add_error(
+                    f"ep={episode_index}: event row timestamp {ts!r} does not match any source frame timestamp"
+                )
+        else:
+            if not any(abs(float(ts) - f) <= self.timestamp_atol for f in frame_ts):
+                report.add_error(
+                    f"ep={episode_index}: event row timestamp {ts!r} not within {self.timestamp_atol}s of any frame"
+                )
+
+    def _check_speech_interjection_pairs(
+        self,
+        events: Iterable[dict[str, Any]],
+        report: ValidationReport,
+        episode_index: int,
+    ) -> None:
+        speech_ts: dict[float, int] = {}
+        interjection_ts: dict[float, int] = {}
+        for row in events:
+            ts = row.get("timestamp")
+            if ts is None:
+                continue
+            ts_f = float(ts)
+            if row.get("style") is None and row.get("role") == "assistant":
+                speech_ts[ts_f] = speech_ts.get(ts_f, 0) + 1
+            if row.get("style") == "interjection":
+                interjection_ts[ts_f] = interjection_ts.get(ts_f, 0) + 1
+
+        for ts in interjection_ts:
+            if ts not in speech_ts:
+                report.add_error(f"ep={episode_index}: interjection at t={ts} has no paired speech atom")
+
+    def _check_plan_memory_consistency(
+        self,
+        persistent: Sequence[dict[str, Any]],
+        events: Sequence[dict[str, Any]],
+        report: ValidationReport,
+        episode_index: int,
+    ) -> None:
+        plan_ts = sorted({float(r["timestamp"]) for r in persistent if r.get("style") == "plan"})
+        memory_ts = sorted({float(r["timestamp"]) for r in persistent if r.get("style") == "memory"})
+        subtask_ts = sorted({float(r["timestamp"]) for r in persistent if r.get("style") == "subtask"})
+        interjection_ts = sorted(
+            {
+                float(r["timestamp"])
+                for r in events
+                if r.get("style") == "interjection" and r.get("timestamp") is not None
+            }
+        )
+
+        if persistent and not plan_ts:
+            report.add_warning(f"ep={episode_index}: persistent rows present but no plan emitted")
+        # every interjection should have a same-timestamp plan refresh
+        for ts in interjection_ts:
+            if ts not in plan_ts:
+                report.add_error(
+                    f"ep={episode_index}: interjection at t={ts} has no co-timestamped plan update"
+                )
+        # memory should be emitted at subtask boundaries (subset relation)
+        if memory_ts and subtask_ts:
+            mem_set = set(memory_ts)
+            sub_set = set(subtask_ts)
+            stray = sorted(mem_set - sub_set)
+            if stray:
+                report.add_warning(f"ep={episode_index}: memory rows at {stray} not at any subtask boundary")
+
+    def _check_vqa_json(
+        self,
+        events: Iterable[dict[str, Any]],
+        report: ValidationReport,
+        episode_index: int,
+    ) -> None:
+        for row in events:
+            if row.get("style") != "vqa" or row.get("role") != "assistant":
+                continue
+            content = row.get("content")
+            if content is None:
+                report.add_error(
+                    f"ep={episode_index}: VQA assistant row at t={row.get('timestamp')} has null content"
+                )
+                continue
+            try:
+                payload = json.loads(content)
+            except (TypeError, ValueError) as exc:
+                report.add_error(
+                    f"ep={episode_index}: VQA assistant content not valid JSON at t={row.get('timestamp')}: {exc}"
+                )
+                continue
+            shape = classify_vqa_answer(payload)
+            if shape is None:
+                report.add_error(
+                    f"ep={episode_index}: VQA assistant payload at t={row.get('timestamp')} does not match any known shape: keys={list(payload) if isinstance(payload, dict) else type(payload).__name__}"
+                )
--- a/src/lerobot/annotations/steerable_pipeline/vlm_client.py
+++ b/src/lerobot/annotations/steerable_pipeline/vlm_client.py
@@ -0,0 +1,703 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Shared Qwen-VL client.
+
+The pipeline uses a single shared VLM across modules. vLLM is preferred when
+available (high throughput); ``transformers`` is the fallback and an
+``openai`` backend is also available. A ``stub`` backend is used for unit
+tests so fixtures never call into a real model.
+
+The client speaks one method, :meth:`VlmClient.generate_json`, which:
+
+- accepts a list of OpenAI/HF-style multimodal messages,
+- parses each reply as JSON after stripping ``<think>`` blocks and code
+  fences (the vLLM backend does not request guided decoding),
+- batches requests transparently,
+- and reprompts once on a JSON parse failure with an inline correction
+  message, returning ``None`` for that item if the retry also fails.
+"""
+
+from __future__ import annotations
+
+import atexit
+import base64
+import io
+import json
+import os
+import shlex
+import signal
+import subprocess
+import sys
+import threading
+import time
+import urllib.request
+from collections.abc import Callable, Sequence
+from concurrent.futures import ThreadPoolExecutor
+from dataclasses import dataclass
+from typing import Any, Protocol
+
+from .config import VlmConfig
+
+
+class VlmClient(Protocol):
+    """Protocol every backend must implement."""
+
+    def generate_json(
+        self,
+        messages_batch: Sequence[Sequence[dict[str, Any]]],
+        *,
+        max_new_tokens: int | None = None,
+        temperature: float | None = None,
+    ) -> list[Any]:
+        """Generate one JSON-decoded response per messages list."""
+
+
+@dataclass
+class StubVlmClient:
+    """Deterministic stub used in unit tests.
+
+    A test passes a callable that maps the full message list of one request
+    to a JSON-serializable response; the stub calls it once per batch item.
+    """
+
+    responder: Callable[[Sequence[dict[str, Any]]], Any]
+
+    def generate_json(
+        self,
+        messages_batch: Sequence[Sequence[dict[str, Any]]],
+        *,
+        max_new_tokens: int | None = None,
+        temperature: float | None = None,
+    ) -> list[Any]:
+        return [self.responder(list(messages)) for messages in messages_batch]
+
+
+def _strip_to_json(text: str) -> Any:
+    text = text.strip()
+    # Strip <think>...</think> blocks (Qwen3 Thinking style)
+    while "<think>" in text:
+        start = text.find("<think>")
+        close = text.find("</think>", start)
+        if close == -1:
+            # No closing tag after this opener: stop rather than loop forever.
+            break
+        text = (text[:start] + text[close + len("</think>") :]).strip()
+    # Strip ```json ... ``` fences from chat-tuned backbones
+    if text.startswith("```"):
+        first = text.find("\n")
+        last = text.rfind("```")
+        if first != -1 and last != -1 and last > first:
+            text = text[first + 1 : last].strip()
+    try:
+        return json.loads(text)
+    except (ValueError, json.JSONDecodeError):
+        pass
+    # Fall back to extracting the first balanced {...} block.
+    obj_text = _extract_first_json_object(text)
+    if obj_text is None:
+        raise json.JSONDecodeError("No JSON object found", text, 0)
+    return json.loads(obj_text)
+
+
+def _extract_first_json_object(text: str) -> str | None:
+    """Return the first balanced ``{...}`` substring, ignoring braces in
+    string literals. Returns ``None`` if no balanced block is found."""
+    start = text.find("{")
+    if start < 0:
+        return None
+    depth = 0
+    in_string = False
+    escape = False
+    for i in range(start, len(text)):
+        ch = text[i]
+        if escape:
+            escape = False
+            continue
+        if ch == "\\":
+            escape = True
+            continue
+        # Note: ``escape`` is always False here — the ``if escape`` branch
+        # above already handled and reset it.
+        if ch == '"':
+            in_string = not in_string
+            continue
+        if in_string:
+            continue
+        if ch == "{":
+            depth += 1
+        elif ch == "}":
+            depth -= 1
+            if depth == 0:
+                return text[start : i + 1]
+    return None
+
+
+@dataclass
+class _GenericTextClient:
+    """Wraps any text-generation callable in JSON-mode + one-retry semantics."""
+
+    generate_text: Callable[[Sequence[Sequence[dict[str, Any]]], int, float], list[str]]
+    config: VlmConfig
+
+    def generate_json(
+        self,
+        messages_batch: Sequence[Sequence[dict[str, Any]]],
+        *,
+        max_new_tokens: int | None = None,
+        temperature: float | None = None,
+    ) -> list[Any]:
+        max_tok = max_new_tokens if max_new_tokens is not None else self.config.max_new_tokens
+        temp = temperature if temperature is not None else self.config.temperature
+        raw = self.generate_text(messages_batch, max_tok, temp)
+        out: list[Any] = []
+        for messages, text in zip(messages_batch, raw, strict=True):
+            try:
+                out.append(_strip_to_json(text))
+                continue
+            except (ValueError, json.JSONDecodeError):
+                pass
+            retry = list(messages) + [
+                {"role": "assistant", "content": text},
+                {
+                    "role": "user",
+                    "content": (
+                        "Your previous reply was not valid JSON. "
+                        "Reply with strictly valid JSON, no prose, no fences."
+                    ),
+                },
+            ]
+            retry_text = self.generate_text([retry], max_tok, temp)[0]
+            try:
+                out.append(_strip_to_json(retry_text))
+            except (ValueError, json.JSONDecodeError):
+                # After retry: log preview and return None instead of crashing
+                # the whole pipeline. Modules treat None as "skip".
+                preview = retry_text.strip().replace("\n", " ")[:200]
+                print(
+                    f"[vlm] WARNING: failed to parse JSON after retry; preview: {preview!r}",
+                    flush=True,
+                )
+                out.append(None)
+        return out
+
+
+def make_vlm_client(config: VlmConfig) -> VlmClient:
+    """Build the shared VLM client per the configured backend.
+
+    For ``stub``, callers should construct :class:`StubVlmClient` directly with
+    a responder callable. ``stub`` here is rejected to make accidental misuse
+    obvious.
+    """
+    if config.backend == "stub":
+        raise ValueError(
+            "Use StubVlmClient(...) directly for the stub backend; make_vlm_client builds real clients."
+        )
+    if config.backend == "vllm":
+        return _make_vllm_client(config)
+    if config.backend == "transformers":
+        return _make_transformers_client(config)
+    if config.backend == "openai":
+        return _make_openai_client(config)
+    raise ValueError(f"Unknown VLM backend: {config.backend!r}")
+
+
+def _make_vllm_client(config: VlmConfig) -> VlmClient:
+    try:
+        from vllm import LLM, SamplingParams  # type: ignore[import-not-found]
+    except ImportError as exc:
+        raise ImportError(
+            "vllm is required for backend='vllm'. Install with `pip install lerobot[annotations]`."
+        ) from exc
+    # Workaround for cuDNN 9.x + torch 2.8 conv3d regression that surfaces
+    # as CUDNN_STATUS_NOT_INITIALIZED in Qwen-VL vision-tower patch
+    # embedders. Setting LEROBOT_DISABLE_CUDNN=1 forces native PyTorch
+    # convolution kernels — slower but functional.
+    if os.environ.get("LEROBOT_DISABLE_CUDNN", "").lower() in {"1", "true", "yes"}:
+        import torch as _torch  # noqa: PLC0415  - optional GPU dep, deferred
+
+        _torch.backends.cudnn.enabled = False
+    llm_kwargs: dict[str, Any] = {
+        "model": config.model_id,
+        "tensor_parallel_size": config.tensor_parallel_size,
+        "gpu_memory_utilization": config.gpu_memory_utilization,
+        "trust_remote_code": config.trust_remote_code,
+    }
+    if config.max_model_len is not None:
+        llm_kwargs["max_model_len"] = config.max_model_len
+    llm = LLM(**llm_kwargs)
+
+    def _gen(batch: Sequence[Sequence[dict[str, Any]]], max_tok: int, temp: float) -> list[str]:
+        # ``guided_decoding`` would enforce valid JSON but its API differs across
+        # vllm releases (dict vs GuidedDecodingParams). The _GenericTextClient
+        # wrapper already has a one-retry JSON-recovery path, so we skip it.
+        params = SamplingParams(max_tokens=max_tok, temperature=temp)
+        # ``llm.chat`` handles chat-template application + multimodal input
+        # extraction (image/video blocks) internally, which ``llm.generate``
+        # does not.
+        outputs = llm.chat([list(m) for m in batch], params)
+        return [o.outputs[0].text for o in outputs]
+
+    return _GenericTextClient(_gen, config)
+
+
+def _make_transformers_client(config: VlmConfig) -> VlmClient:
+    try:
+        import torch  # type: ignore[import-not-found]
+        import transformers  # type: ignore[import-not-found]
+        from transformers import AutoProcessor  # type: ignore[import-not-found]
+    except ImportError as exc:
+        raise ImportError("transformers + torch are required for backend='transformers'.") from exc
+    auto_cls = getattr(transformers, "AutoModelForImageTextToText", None) or getattr(
+        transformers, "AutoModelForVision2Seq", None
+    )
+    if auto_cls is None:
+        raise ImportError(
+            "Neither AutoModelForImageTextToText nor AutoModelForVision2Seq is available in this "
+            "transformers version. Install transformers>=4.45 (which has AutoModelForImageTextToText) "
+            "for VL models."
+        )
+    processor = AutoProcessor.from_pretrained(config.model_id, trust_remote_code=config.trust_remote_code)
+    use_accelerate = os.environ.get("LEROBOT_TRANSFORMERS_DEVICE_MAP", "manual") != "manual"
+    # ``device_map='auto'`` triggers a known std::bad_alloc on the Qwen3-VL
+    # post-load dispatch path (the alloc fails in accelerate's hook setup
+    # even with TBs of host RAM). Default to manual: load on CPU with
+    # ``low_cpu_mem_usage=True``, then ``.to("cuda")``. Set
+    # ``LEROBOT_TRANSFORMERS_DEVICE_MAP=auto`` to opt back into the old path.
+    if use_accelerate:
+        model = auto_cls.from_pretrained(
+            config.model_id,
+            torch_dtype="auto",
+            device_map="auto",
+            low_cpu_mem_usage=True,
+            trust_remote_code=config.trust_remote_code,
+        )
+    else:
+        import torch as _torch  # noqa: PLC0415  - optional GPU dep, deferred
+
+        model = auto_cls.from_pretrained(
+            config.model_id,
+            torch_dtype=_torch.bfloat16,
+            low_cpu_mem_usage=True,
+            trust_remote_code=config.trust_remote_code,
+        )
+        model = model.to("cuda")
+    model.eval()
+
+    def _gen(batch: Sequence[Sequence[dict[str, Any]]], max_tok: int, temp: float) -> list[str]:
+        outs: list[str] = []
+        for messages in batch:
+            text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
+            inputs = processor(text=[text], return_tensors="pt").to(model.device)
+            with torch.no_grad():
+                gen = model.generate(
+                    **inputs,
+                    max_new_tokens=max_tok,
+                    temperature=temp,
+                    do_sample=temp > 0.0,
+                )
+            decoded = processor.batch_decode(
+                gen[:, inputs["input_ids"].shape[-1] :], skip_special_tokens=True
+            )[0]
+            outs.append(decoded)
+        return outs
+
+    return _GenericTextClient(_gen, config)
+
+
+def _make_openai_client(config: VlmConfig) -> VlmClient:
+    """Backend that talks to any OpenAI-compatible server.
+
+    Compatible with ``vllm serve``, ``transformers serve``,
+    ``ktransformers serve``, and hosted endpoints. By default the server
+    is expected to be already running. Set ``auto_serve=True`` to have
+    this client spawn one (default: ``transformers serve``), wait until
+    it's ready, and tear it down on process exit.
+
+    Image blocks ``{"type":"image", "image":<PIL.Image>}`` are
+    auto-converted to ``image_url`` data-URLs. Video blocks
+    ``{"type":"video", "video":[<PIL>...]}`` are expanded into one
+    ``image_url`` data-URL per frame; ``video_url`` blocks pass through.
+    """
+    try:
+        from openai import OpenAI  # type: ignore[import-not-found]
+    except ImportError as exc:
+        raise ImportError(
+            "openai package is required for backend='openai'. Install with `pip install openai`."
+        ) from exc
+
+    api_base = config.api_base
+    api_key = config.api_key
+    auto_serve = config.auto_serve
+    api_bases: list[str] = [api_base]
+
+    print(
+        f"[lerobot-annotate] backend=openai model={config.model_id} "
+        f"api_base={api_base} auto_serve={auto_serve}",
+        flush=True,
+    )
+    if auto_serve:
+        if config.parallel_servers > 1:
+            print(
+                f"[lerobot-annotate] spawning {config.parallel_servers} parallel servers",
+                flush=True,
+            )
+            api_bases = _spawn_parallel_inference_servers(config)
+        elif _server_is_up(api_base):
+            print(f"[lerobot-annotate] reusing server already up at {api_base}", flush=True)
+        else:
+            print("[lerobot-annotate] no server reachable; spawning one", flush=True)
+            api_base = _spawn_inference_server(config)
+            api_bases = [api_base]
+            print(f"[lerobot-annotate] server ready at {api_base}", flush=True)
+
+    clients = [OpenAI(base_url=base, api_key=api_key) for base in api_bases]
+    # round-robin counter for parallel mode
+    rr_counter = {"i": 0}
+
+    # ``mm_processor_kwargs`` is a vllm-specific extra; transformers serve
+    # rejects it with HTTP 422. Send it only when explicitly opted in via
+    # an env var (e.g. ``LEROBOT_OPENAI_SEND_MM_KWARGS=1`` for vllm).
+    send_mm_kwargs = os.environ.get("LEROBOT_OPENAI_SEND_MM_KWARGS", "").lower() in {"1", "true", "yes"}
+
+    rr_lock = threading.Lock()
+
+    def _one_call(messages: Sequence[dict[str, Any]], max_tok: int, temp: float) -> str:
+        api_messages, mm_kwargs = _to_openai_messages(messages)
+        kwargs: dict[str, Any] = {
+            "model": config.model_id,
+            "messages": api_messages,
+            "max_tokens": max_tok,
+            "temperature": temp,
+        }
+        extra_body: dict[str, Any] = {}
+        if send_mm_kwargs and mm_kwargs:
+            extra_body["mm_processor_kwargs"] = {**mm_kwargs, "do_sample_frames": True}
+        if config.chat_template_kwargs:
+            extra_body["chat_template_kwargs"] = config.chat_template_kwargs
+        if extra_body:
+            kwargs["extra_body"] = extra_body
+        with rr_lock:
+            chosen = clients[rr_counter["i"] % len(clients)]
+            rr_counter["i"] += 1
+        response = chosen.chat.completions.create(**kwargs)
+        return response.choices[0].message.content or ""
+
+    def _gen(batch: Sequence[Sequence[dict[str, Any]]], max_tok: int, temp: float) -> list[str]:
+        if len(batch) <= 1 or config.client_concurrency <= 1:
+            return [_one_call(messages, max_tok, temp) for messages in batch]
+        # Parallel fan-out — vllm batches these on the server side.
+        max_workers = min(config.client_concurrency, len(batch))
+        with ThreadPoolExecutor(max_workers=max_workers) as pool:
+            futures = [pool.submit(_one_call, messages, max_tok, temp) for messages in batch]
+            return [f.result() for f in futures]
+
+    return _GenericTextClient(_gen, config)
+
+
+def _spawn_parallel_inference_servers(config: VlmConfig) -> list[str]:
+    """Spawn ``config.parallel_servers`` independent vllm replicas.
+
+    Each replica:
+    - is pinned to a single GPU via ``CUDA_VISIBLE_DEVICES``
+    - listens on ``serve_port + i``
+    - is shut down by an atexit hook that mirrors the single-server path
+
+    Returns the list of ``api_base`` URLs the client should round-robin
+    across.
+    """
+    n = config.parallel_servers
+    api_bases: list[str] = []
+    procs: list[subprocess.Popen] = []
+    ready_events: list[threading.Event] = []
+    # Multiple readiness signals — uvicorn's own banner is suppressed at
+    # ``--uvicorn-log-level warning``, so we also accept vllm's own
+    # "Starting vLLM API server" line and the route-listing line. The
+    # HTTP probe below is the ultimate fallback.
+    ready_markers = (
+        "Uvicorn running",
+        "Application startup complete",
+        "Starting vLLM API server",
+        "Available routes are",
+    )
+    # Single lock for all server-stream threads so multibyte chars from
+    # different servers don't interleave and tear UTF-8 sequences.
+    print_lock = threading.Lock()
+
+    base_cmd = config.serve_command or (
+        f"vllm serve {shlex.quote(config.model_id)} "
+        f"--tensor-parallel-size 1 "
+        f"--max-model-len {config.max_model_len or 32768} "
+        f"--uvicorn-log-level warning"
+    )
+
+    num_gpus = config.num_gpus if config.num_gpus > 0 else n
+    for i in range(n):
+        port = config.serve_port + i
+        gpu = i % num_gpus
+        env = os.environ.copy()
+        env["CUDA_VISIBLE_DEVICES"] = str(gpu)
+        cmd = base_cmd.replace("{port}", str(port)) if "{port}" in base_cmd else f"{base_cmd} --port {port}"
+        api_base = f"http://localhost:{port}/v1"
+        api_bases.append(api_base)
+        print(f"[server-{i}] launching on GPU {gpu} port {port}: {cmd}", flush=True)
+        proc = subprocess.Popen(
+            shlex.split(cmd),
+            stdout=subprocess.PIPE,
+            stderr=subprocess.STDOUT,
+            text=True,
+            bufsize=1,
+            env=env,
+        )
+        procs.append(proc)
+        ready = threading.Event()
+        ready_events.append(ready)
+
+        def _stream(idx: int, p: subprocess.Popen, ev: threading.Event) -> None:
+            # Read whole lines and emit each line atomically under the
+            # shared print_lock so output from N servers stays readable.
+            assert p.stdout is not None
+            for line in iter(p.stdout.readline, ""):
+                with print_lock:
+                    sys.stdout.write(f"[server-{idx}] {line}")
+                    if not line.endswith(("\n", "\r")):
+                        sys.stdout.write("\n")
+                    sys.stdout.flush()
+                if any(m in line for m in ready_markers):
+                    ev.set()
+
+        threading.Thread(target=_stream, args=(i, proc, ready), daemon=True).start()
+
+        def _probe(idx: int, base: str, ev: threading.Event, p: subprocess.Popen) -> None:
+            while not ev.is_set() and p.poll() is None:
+                if _server_is_up(base):
+                    print(f"[server-{idx}] ready (http probe)", flush=True)
+                    ev.set()
+                    return
+                time.sleep(2)
+
+        threading.Thread(target=_probe, args=(i, api_base, ready, proc), daemon=True).start()
+
+    def _shutdown() -> None:
+        for i, p in enumerate(procs):
+            if p.poll() is None:
+                print(f"[server-{i}] stopping pid={p.pid}", flush=True)
+                p.send_signal(signal.SIGINT)
+        for p in procs:
+            try:
+                p.wait(timeout=15)
+            except subprocess.TimeoutExpired:
+                p.kill()
+                p.wait(timeout=5)
+
+    atexit.register(_shutdown)
+
+    deadline = time.monotonic() + config.serve_ready_timeout_s
+    while any(not ev.is_set() for ev in ready_events) and time.monotonic() < deadline:
+        for i, p in enumerate(procs):
+            if p.poll() is not None:
+                raise RuntimeError(
+                    f"[server-{i}] inference server exited unexpectedly with rc={p.returncode}"
+                )
+        time.sleep(2)
+    if any(not ev.is_set() for ev in ready_events):
+        raise RuntimeError(f"[server] not all replicas became ready within {config.serve_ready_timeout_s}s")
+    print(f"[lerobot-annotate] all {n} servers ready: {api_bases}", flush=True)
+    return api_bases
+
+
+def _server_is_up(api_base: str) -> bool:
+    """Return True if ``api_base/models`` answers 200 within 2 seconds."""
+    url = api_base.rstrip("/") + "/models"
+    # ``api_base`` is the user-configured local-server URL we just spawned
+    # or the user passed in via ``--vlm.api_base``; the bandit B310 warning
+    # is for arbitrary user-controlled URLs with file:/ schemes which
+    # cannot reach this code path.
+    try:
+        with urllib.request.urlopen(url, timeout=2) as resp:  # noqa: S310  # nosec B310
+            return resp.status == 200
+    except Exception:  # noqa: BLE001
+        return False
+
+
+def _spawn_inference_server(config: VlmConfig) -> str:
+    """Spawn ``transformers serve`` (or ``serve_command``), wait until it
+    logs a readiness banner or answers ``/v1/models``, and register a shutdown hook.
+
+    Streams the server's stdout/stderr to the parent terminal in
+    real-time on a background thread so users can see model-load
+    progress and errors as they happen.
+
+    Returns the full ``api_base`` URL the OpenAI client should use.
+    """
+    cmd = config.serve_command
+    if not cmd:
+        cmd = (
+            f"transformers serve {shlex.quote(config.model_id)} "
+            f"--port {config.serve_port} --continuous-batching"
+        )
+    api_base = f"http://localhost:{config.serve_port}/v1"
+    print(f"[server] launching: {cmd}", flush=True)
+    proc = subprocess.Popen(
+        shlex.split(cmd),
+        stdout=subprocess.PIPE,
+        stderr=subprocess.STDOUT,
+        text=True,
+        bufsize=1,
+    )
+
+    # Watch the server output for a readiness banner. This is more
+    # reliable than polling /v1/models alone because transformers serve
+    # rescans its cache on every model-list request, which can exceed
+    # the urllib timeout. The HTTP probe below is only a fallback.
+    ready_event = threading.Event()
+    # See _spawn_parallel_inference_servers for why we accept these.
+    ready_markers = (
+        "Uvicorn running",
+        "Application startup complete",
+        "Starting vLLM API server",
+        "Available routes are",
+    )
+
+    def _probe() -> None:
+        while not ready_event.is_set() and proc.poll() is None:
+            if _server_is_up(api_base):
+                print("[server] ready (http probe)", flush=True)
+                ready_event.set()
+                return
+            time.sleep(2)
+
+    threading.Thread(target=_probe, daemon=True).start()
+
+    def _stream_output() -> None:
+        # Read one character at a time instead of iterating lines so tqdm
+        # progress bars (which overwrite using \r) flush in real time.
+        assert proc.stdout is not None
+        buf = ""
+        prefix_started = False
+        while True:
+            ch = proc.stdout.read(1)
+            if ch == "":
+                # process exited; flush any tail
+                if buf:
+                    sys.stdout.write(buf)
+                    sys.stdout.flush()
+                return
+            if not prefix_started:
+                sys.stdout.write("[server] ")
+                prefix_started = True
+            sys.stdout.write(ch)
+            sys.stdout.flush()
+            buf += ch
+            if ch in ("\n", "\r"):
+                if any(marker in buf for marker in ready_markers):
+                    ready_event.set()
+                buf = ""
+                prefix_started = False
+
+    threading.Thread(target=_stream_output, daemon=True).start()
+
+    def _shutdown() -> None:
+        if proc.poll() is None:
+            print(f"[server] stopping pid={proc.pid}", flush=True)
+            proc.send_signal(signal.SIGINT)
+            try:
+                proc.wait(timeout=15)
+            except subprocess.TimeoutExpired:
+                proc.kill()
+                proc.wait(timeout=5)
+
+    atexit.register(_shutdown)
+
+    deadline = time.monotonic() + config.serve_ready_timeout_s
+    while time.monotonic() < deadline:
+        if proc.poll() is not None:
+            raise RuntimeError(
+                f"[server] inference server exited unexpectedly with rc={proc.returncode}. "
+                f"See [server] log lines above for the cause."
+            )
+        if ready_event.wait(timeout=2):
+            return api_base
+    proc.terminate()
+    raise RuntimeError(f"[server] did not become ready within {config.serve_ready_timeout_s}s")
+
+
+def _to_openai_messages(
+    messages: Sequence[dict[str, Any]],
+) -> tuple[list[dict[str, Any]], dict[str, Any]]:
+    """Convert internal messages to OpenAI chat format.
+
+    Returns ``(api_messages, mm_kwargs)``. Multimodal-processor kwargs
+    (``fps`` from ``video_url`` blocks) are extracted out so the caller
+    can pass them via ``extra_body.mm_processor_kwargs`` rather than
+    inside the content blocks (which transformers serve rejects).
+
+    File-URL video blocks are inlined as base64 data URLs.
+    """
+    out_messages: list[dict[str, Any]] = []
+    mm_kwargs: dict[str, Any] = {}
+    for message in messages:
+        content = message.get("content")
+        if not isinstance(content, list):
+            out_messages.append({"role": message["role"], "content": content})
+            continue
+        out_blocks: list[dict[str, Any]] = []
+        for block in content:
+            block_type = block.get("type") if isinstance(block, dict) else None
+            if block_type == "text":
+                out_blocks.append({"type": "text", "text": block.get("text", "")})
+            elif block_type == "image":
+                out_blocks.append(
+                    {"type": "image_url", "image_url": {"url": _pil_to_data_url(block["image"])}}
+                )
+            elif block_type == "video":
+                frames = block.get("video", [])
+                for img in frames:
+                    out_blocks.append({"type": "image_url", "image_url": {"url": _pil_to_data_url(img)}})
+            elif block_type == "video_url":
+                video_url = dict(block["video_url"])
+                url = video_url.get("url", "")
+                if url.startswith("file://"):
+                    video_url["url"] = _file_to_data_url(url[len("file://") :])
+                out_blocks.append({"type": "video_url", "video_url": video_url})
+                fps = block.get("fps")
+                if fps is not None:
+                    mm_kwargs["fps"] = fps
+            else:
+                out_blocks.append(block)
+        out_messages.append({"role": message["role"], "content": out_blocks})
+    return out_messages, mm_kwargs
+
+
+def _file_to_data_url(path: str) -> str:
+    """Read a local video file and return a base64 ``data:video/mp4`` URL."""
+    with open(path, "rb") as f:
+        b64 = base64.b64encode(f.read()).decode("ascii")
+    return f"data:video/mp4;base64,{b64}"
+
+
+def _pil_to_data_url(image: Any) -> str:
+    """Encode a PIL.Image as a base64 data URL."""
+    buf = io.BytesIO()
+    image.save(buf, format="PNG")
+    b64 = base64.b64encode(buf.getvalue()).decode("ascii")
+    return f"data:image/png;base64,{b64}"
+
+
+def _messages_to_prompt(messages: Sequence[dict[str, Any]]) -> Any:
+    """Pass-through hook used by the vllm backend.
+
+    vllm exposes its own multimodal entry points that vary by version; for the
+    base flow we simply forward the raw message list and let the caller's
+    custom backend handle templating. Real deployments override this.
+    """
+    return list(messages)
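
Reviewer note on the OpenAI backend's dispatch in `_make_openai_client`: the round-robin pick happens under a lock, while the request itself runs outside it, and `_gen` collects futures in submission order so results line up with the input batch. A minimal standalone sketch of that pattern follows, assuming hypothetical names (`RoundRobin`, `fan_out`, `fake_call`) and placeholder endpoint URLs that are not part of the PR:

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from itertools import count


class RoundRobin:
    """Hand out items from a fixed pool in rotation, safely across threads."""

    def __init__(self, items):
        if not items:
            raise ValueError("need at least one item")
        self._items = list(items)
        self._counter = count()
        self._lock = threading.Lock()

    def next(self):
        # Only the counter bump is serialized; callers do their slow work
        # (the HTTP request, in the real client) outside the lock.
        with self._lock:
            idx = next(self._counter)
        return self._items[idx % len(self._items)]


def fan_out(batch, call, concurrency):
    """Run call(item) for each item; results keep the order of batch."""
    if len(batch) <= 1 or concurrency <= 1:
        return [call(item) for item in batch]
    with ThreadPoolExecutor(max_workers=min(concurrency, len(batch))) as pool:
        futures = [pool.submit(call, item) for item in batch]
        # Iterating futures in submission order preserves input order even
        # when the calls finish out of order.
        return [f.result() for f in futures]


# Placeholder endpoints standing in for the spawned replicas.
endpoints = RoundRobin(["http://localhost:8000/v1", "http://localhost:8001/v1"])


def fake_call(prompt):
    # Stand-in for a chat-completions request against the chosen endpoint.
    return f"{prompt}@{endpoints.next()}"


results = fan_out(["a", "b", "c", "d"], fake_call, concurrency=4)
```

Which endpoint serves which prompt depends on thread scheduling, but each endpoint receives the same share of calls and the returned list stays aligned with the batch.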
--- a/src/lerobot/annotations/steerable_pipeline/writer.py
+++ b/src/lerobot/annotations/steerable_pipeline/writer.py
@@ -0,0 +1,356 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Final parquet rewrite.
+
+For every episode the writer:
+
+1. reads the staged module outputs,
+2. partitions them into a persistent slice (PERSISTENT_STYLES) and an event
+   slice (EVENT_ONLY_STYLES + style=None tool-call atoms),
+3. sorts each slice deterministically,
+4. broadcasts the persistent slice across every frame in the episode,
+5. for each frame, materializes the sublist of event rows whose timestamp
+   exactly equals that frame's timestamp,
+6. drops the legacy ``subtask_index`` column,
+7. writes the parquet shard back in place.
+
+The writer does NOT add a dataset-level ``tools`` column. Tool *calls* are
+emitted per-row via the existing ``tool_calls`` field on the v3.1 row
+struct for every speech atom. The tool *schema* (the description
+of the ``say`` function and its parameters) is a fixed code constant —
+``SAY_TOOL_SCHEMA`` below — and downstream chat-template consumers import
+it directly rather than reading a redundant per-row column.
+
+Invariants enforced here (and re-checked by the validator):
+
+- per-episode persistent slice is byte-identical across every frame;
+- ``language_events`` rows on a frame all have ``timestamp == frame_ts``
+  (timestamps come straight from the source parquet — never recomputed);
+- every row's style routes, via ``column_for_style(style)``, to the column it is written to.
+"""
+
+from __future__ import annotations
+
+import logging
+from collections import defaultdict
+from collections.abc import Iterable, Sequence
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any
+
+import pyarrow as pa
+import pyarrow.parquet as pq
+
+from lerobot.datasets.language import (
+    EVENT_ONLY_STYLES,
+    LANGUAGE_EVENTS,
+    LANGUAGE_PERSISTENT,
+    PERSISTENT_STYLES,
+    column_for_style,
+    validate_camera_field,
+)
+
+from .reader import EpisodeRecord
+from .staging import EpisodeStaging
+
+logger = logging.getLogger(__name__)
+
+
+# Tool schema constants live in lerobot.datasets.language — single
+# source of truth. Re-exported here so existing imports
+# (``from lerobot.annotations.steerable_pipeline.writer import SAY_TOOL_SCHEMA``)
+# keep working.
+from lerobot.datasets.language import DEFAULT_TOOLS, SAY_TOOL_SCHEMA  # noqa: F401, E402
+
+
+def _row_persistent_sort_key(row: dict[str, Any]) -> tuple:
+    return (float(row["timestamp"]), row.get("style") or "", row.get("role") or "")
+
+
+def _row_event_sort_key(row: dict[str, Any]) -> tuple:
+    # events are bucketed per-frame, but within a frame we still want determinism
+    return (
+        row.get("style") or "",
+        row.get("role") or "",
+        row.get("camera") or "",
+    )
+
+
+def _normalize_persistent_row(row: dict[str, Any]) -> dict[str, Any]:
+    """Coerce a staged row into the persistent column's struct shape."""
+    style = row.get("style")
+    if style not in PERSISTENT_STYLES:
+        raise ValueError(
+            f"persistent slice contains row with non-persistent style {style!r}; "
+            "row would be misrouted under column_for_style()"
+        )
+    if "timestamp" not in row:
+        raise ValueError(f"persistent row missing timestamp: {row!r}")
+    if "role" not in row:
+        # Surface a friendly error from the writer rather than letting
+        # the raw KeyError bubble out of the dict access below — modules
+        # are expected to always emit ``role``, but the validator
+        # currently doesn't check this so a future bug would otherwise
+        # be hard to triage.
+        raise ValueError(f"persistent row missing role: {row!r}")
+    camera = row.get("camera")
+    validate_camera_field(style, camera)
+    return {
+        "role": str(row["role"]),
+        "content": None if row.get("content") is None else str(row["content"]),
+        "style": style,
+        "timestamp": float(row["timestamp"]),
+        "camera": None if camera is None else str(camera),
+        "tool_calls": _normalize_tool_calls(row.get("tool_calls")),
+    }
+
+
+def _normalize_event_row(row: dict[str, Any]) -> dict[str, Any]:
+    """Coerce a staged row into the event column's struct shape (no timestamp)."""
+    style = row.get("style")
+    if style is not None and style not in EVENT_ONLY_STYLES:
+        raise ValueError(
+            f"event slice contains row with style {style!r}; expected None or one of {EVENT_ONLY_STYLES}"
+        )
+    if column_for_style(style) != LANGUAGE_EVENTS:
+        raise ValueError(f"event row with style {style!r} would not route to language_events")
+    if "role" not in row:
+        raise ValueError(f"event row missing role: {row!r}")
+    camera = row.get("camera")
+    validate_camera_field(style, camera)
+    return {
+        "role": str(row["role"]),
+        "content": None if row.get("content") is None else str(row["content"]),
+        "style": style,
+        "camera": None if camera is None else str(camera),
+        "tool_calls": _normalize_tool_calls(row.get("tool_calls")),
+    }
+
+
+def _normalize_tool_calls(value: Any) -> list[Any] | None:
+    if value is None:
+        return None
+    if not isinstance(value, list):
+        raise ValueError(f"tool_calls must be a list or None, got {type(value).__name__}")
+    return list(value)
+
+
+def _validate_atom_invariants(row: dict[str, Any]) -> None:
+    """At-least-one of content/tool_calls; style=None implies tool_calls."""
+    has_content = row.get("content") is not None
+    has_tools = row.get("tool_calls") is not None
+    if not (has_content or has_tools):
+        raise ValueError(f"row has neither content nor tool_calls: {row!r}")
+    if row.get("style") is None and not has_tools:
+        raise ValueError(f"style=None requires tool_calls: {row!r}")
+
+
+def _validate_speech_atom(row: dict[str, Any]) -> None:
+    """Speech atoms: role=assistant, style=None, content=None, say tool call."""
+    if row.get("style") is not None:
+        return  # not a speech atom
+    if row.get("role") != "assistant":
+        raise ValueError(f"speech atom must have role=assistant: {row!r}")
+    if row.get("content") is not None:
+        raise ValueError(f"speech atom must have content=null: {row!r}")
+    tool_calls = row.get("tool_calls")
+    if not tool_calls or not isinstance(tool_calls, list):
+        raise ValueError(f"speech atom must have non-empty tool_calls list: {row!r}")
+    first = tool_calls[0]
+    if not isinstance(first, dict):
+        raise ValueError(f"speech atom tool_calls[0] must be a dict: {row!r}")
+    if first.get("type") != "function":
+        raise ValueError(f"speech atom tool_calls[0].type must be 'function': {row!r}")
+    fn = first.get("function") or {}
+    if fn.get("name") != "say":
+        raise ValueError(f"speech atom tool_calls[0].function.name must be 'say': {row!r}")
+    args = fn.get("arguments") or {}
+    if not isinstance(args, dict) or "text" not in args or not isinstance(args["text"], str):
+        raise ValueError(f"speech atom must carry 'text' string in arguments: {row!r}")
+
+
+@dataclass
+class LanguageColumnsWriter:
+    """Rewrite ``data/chunk-*/file-*.parquet`` with the two language columns."""
+
+    drop_existing_subtask_index: bool = True
+
+    def write_all(
+        self,
+        records: Sequence[EpisodeRecord],
+        staging_dir: Path,
+        root: Path,
+    ) -> list[Path]:
+        episodes_by_path: dict[Path, list[EpisodeRecord]] = defaultdict(list)
+        for record in records:
+            episodes_by_path[record.data_path].append(record)
+
+        written: list[Path] = []
+        for path, eps in episodes_by_path.items():
+            self._rewrite_one(path, eps, staging_dir, root)
+            written.append(path)
+        return written
+
+    def _rewrite_one(
+        self,
+        path: Path,
+        episodes: Sequence[EpisodeRecord],
+        staging_dir: Path,
+        root: Path,
+    ) -> None:
+        table = pq.read_table(path)
+        n_rows = table.num_rows
+
+        # Ensure we cover every episode in the file. Episodes that don't have
+        # staging artifacts are passed through with empty annotation lists —
+        # this keeps the writer idempotent and safe for partial reruns.
+        staged_per_ep: dict[int, dict[str, list[dict[str, Any]]]] = {}
+        for record in episodes:
+            staging = EpisodeStaging(staging_dir, record.episode_index)
+            staged_per_ep[record.episode_index] = staging.read_all()
+
+        persistent_by_ep: dict[int, list[dict[str, Any]]] = {}
+        events_by_ep_ts: dict[int, dict[float, list[dict[str, Any]]]] = {}
+
+        for ep_index, ep_staged in staged_per_ep.items():
+            persistent_rows: list[dict[str, Any]] = []
+            event_rows: list[dict[str, Any]] = []  # carry timestamp until bucketed
+            for _module_name, rows in ep_staged.items():
+                for row in rows:
+                    style = row.get("style")
+                    if column_for_style(style) == LANGUAGE_PERSISTENT:
+                        persistent_rows.append(row)
+                    else:
+                        event_rows.append(row)
+
+            persistent_rows.sort(key=_row_persistent_sort_key)
+            normalized_persistent = []
+            for r in persistent_rows:
+                _validate_atom_invariants(r)
+                _validate_speech_atom(r)
+                normalized_persistent.append(_normalize_persistent_row(r))
+            persistent_by_ep[ep_index] = normalized_persistent
+
+            buckets: dict[float, list[dict[str, Any]]] = defaultdict(list)
+            for r in event_rows:
+                _validate_atom_invariants(r)
+                _validate_speech_atom(r)
+                ts = float(r["timestamp"])
+                buckets[ts].append(_normalize_event_row(r))
+            for ts in list(buckets.keys()):
+                buckets[ts].sort(key=_row_event_sort_key)
+            events_by_ep_ts[ep_index] = buckets
+
+        episode_col = (
+            table.column("episode_index").to_pylist() if "episode_index" in table.column_names else None
+        )
+        ts_col = table.column("timestamp").to_pylist() if "timestamp" in table.column_names else None
+        if episode_col is None or ts_col is None:
+            raise ValueError(f"{path} is missing 'episode_index' or 'timestamp' — required by the writer.")
+
+        per_row_persistent: list[list[dict[str, Any]]] = []
+        per_row_events: list[list[dict[str, Any]]] = []
+        for i in range(n_rows):
+            ep = episode_col[i]
+            ts = float(ts_col[i])
+            per_row_persistent.append(persistent_by_ep.get(ep, []))
+            buckets = events_by_ep_ts.get(ep, {})
+            per_row_events.append(buckets.get(ts, []))
+
+        new_table = self._materialize_table(
+            table, per_row_persistent, per_row_events, drop_old=self.drop_existing_subtask_index
+        )
+        # Atomic replace: write to a sibling tmp path and rename so a crash
+        # mid-write can't leave a half-written shard that ``pq.read_table``
+        # would then fail to open. ``Path.replace`` is atomic on POSIX +
+        # Windows when source and target sit on the same filesystem.
+        tmp_path = path.with_suffix(path.suffix + ".tmp")
+        pq.write_table(new_table, tmp_path)
+        tmp_path.replace(path)
+
+    def _materialize_table(
+        self,
+        table: pa.Table,
+        persistent: list[list[dict[str, Any]]],
+        events: list[list[dict[str, Any]]],
+        *,
+        drop_old: bool,
+    ) -> pa.Table:
+        cols = []
+        names = []
+        for name in table.column_names:
+            if drop_old and name == "subtask_index":
+                continue
+            if name in (LANGUAGE_PERSISTENT, LANGUAGE_EVENTS):
+                continue  # we'll re-add canonical versions
+            # Strip any legacy ``tools`` column previously emitted by older
+            # writers — the schema no longer uses it (constant lives in
+            # SAY_TOOL_SCHEMA / DEFAULT_TOOLS).
+            if name == "tools":
+                continue
+            cols.append(table.column(name))
+            names.append(name)
+
+        # We let pyarrow infer struct/list schema rather than passing the
+        # canonical type from `lerobot.datasets.language` directly: that type
+        # uses `pa.json_()` for the `tool_calls` element type, which
+        # `pa.array(..., type=...)` cannot materialize from Python lists on
+        # current pyarrow versions. The inferred schema round-trips through
+        # parquet and `LeRobotDataset` correctly — `tests/datasets/test_language.py`
+        # exercises the same flow.
+        persistent_arr = pa.array(persistent)
+        events_arr = pa.array(events)
+
+        cols.extend([persistent_arr, events_arr])
+        names.extend([LANGUAGE_PERSISTENT, LANGUAGE_EVENTS])
+
+        return pa.Table.from_arrays(cols, names=names)
+
+
+def speech_atom(timestamp: float, text: str) -> dict[str, Any]:
+    """Build a canonical speech tool-call atom for the events column."""
+    return {
+        "role": "assistant",
+        "content": None,
+        "style": None,
+        "timestamp": float(timestamp),
+        "camera": None,
+        "tool_calls": [
+            {
+                "type": "function",
+                "function": {
+                    "name": "say",
+                    "arguments": {"text": text},
+                },
+            }
+        ],
+    }
+
+
+def normalize_rows_for_writer(
+    rows: Iterable[dict[str, Any]],
+) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
+    """Helper used by tests/validators to partition a flat row list into
+    (persistent_rows, event_rows) using ``column_for_style``.
+    """
+    persistent: list[dict[str, Any]] = []
+    events: list[dict[str, Any]] = []
+    for row in rows:
+        if column_for_style(row.get("style")) == LANGUAGE_PERSISTENT:
+            persistent.append(row)
+        else:
+            events.append(row)
+    return persistent, events
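A minimal, standalone sketch of the routing the writer above performs may help when reviewing it: rows whose style is persistent are attached to every frame of the episode, while all other rows are bucketed by timestamp and attached only to the frame with exactly that timestamp. This is not the writer itself. `split_and_bucket` and `annotate_frames` are hypothetical names, the membership test is an assumed stand-in for `column_for_style`, and validation, normalization and sorting are omitted.

```python
from collections import defaultdict
from typing import Any

# Copied from PERSISTENT_STYLES in this diff; everything else is illustrative.
PERSISTENT_STYLES = {"subtask", "plan", "memory", "motion", "task_aug", "action_record"}


def split_and_bucket(
    rows: list[dict[str, Any]],
) -> tuple[list[dict[str, Any]], dict[float, list[dict[str, Any]]]]:
    """Partition rows into persistent rows and per-timestamp event buckets."""
    persistent: list[dict[str, Any]] = []
    buckets: dict[float, list[dict[str, Any]]] = defaultdict(list)
    for row in rows:
        # Assumption: any style outside PERSISTENT_STYLES (including None)
        # is treated as an event row.
        if row.get("style") in PERSISTENT_STYLES:
            persistent.append(row)
        else:
            buckets[float(row["timestamp"])].append(row)
    return persistent, dict(buckets)


def annotate_frames(
    frame_timestamps: list[float], rows: list[dict[str, Any]]
) -> list[tuple[list[dict[str, Any]], list[dict[str, Any]]]]:
    """Return (persistent_rows, event_rows) for each frame of one episode."""
    persistent, buckets = split_and_bucket(rows)
    # Events attach only on an exact float match with the frame timestamp.
    return [(persistent, buckets.get(float(ts), [])) for ts in frame_timestamps]


rows = [
    {"style": "subtask", "timestamp": 0.0, "content": "reach for the cup"},
    {"style": "vqa", "timestamp": 0.5, "content": "Is the cup upright?"},
    {"style": None, "timestamp": 0.5, "content": None},
]
frames = annotate_frames([0.0, 0.5, 1.0], rows)
```

Because the lookup is an exact float comparison, an event whose timestamp does not coincide with a frame timestamp is attached to no frame in this sketch; the writer above uses the same `buckets.get(ts, [])` lookup.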
--- a/src/lerobot/datasets/language.py
+++ b/src/lerobot/datasets/language.py
@@ -46,7 +46,7 @@ CORE_STYLES = {
 EXTENDED_STYLES: set[str] = set()
 STYLE_REGISTRY = CORE_STYLES | EXTENDED_STYLES

-PERSISTENT_STYLES = {"subtask", "plan", "memory", "motion", "task_aug"}
+PERSISTENT_STYLES = {"subtask", "plan", "memory", "motion", "task_aug", "action_record"}
 EVENT_ONLY_STYLES = {"interjection", "vqa", "trace"}

 # Styles whose ``content`` is grounded in a specific camera view. Rows of these
--- a/src/lerobot/policies/__init__.py
+++ b/src/lerobot/policies/__init__.py
@@ -20,6 +20,7 @@ from .eo1.configuration_eo1 import EO1Config as EO1Config
 from .factory import get_policy_class, make_policy, make_policy_config, make_pre_post_processors
 from .gaussian_actor.configuration_gaussian_actor import GaussianActorConfig as GaussianActorConfig
 from .groot.configuration_groot import GrootConfig as GrootConfig
+from .molmoact2.configuration_molmoact2 import MolmoAct2Config as MolmoAct2Config
 from .multi_task_dit.configuration_multi_task_dit import MultiTaskDiTConfig as MultiTaskDiTConfig
 from .pi0.configuration_pi0 import PI0Config as PI0Config
 from .pi0_fast.configuration_pi0_fast import PI0FastConfig as PI0FastConfig
@@ -43,6 +44,7 @@ __all__ = [
    "EO1Config",
    "GaussianActorConfig",
    "GrootConfig",
+    "MolmoAct2Config",
    "MultiTaskDiTConfig",
    "PI0Config",
    "PI0FastConfig",
--- a/src/lerobot/policies/factory.py
+++ b/src/lerobot/policies/factory.py
@@ -49,6 +49,7 @@ from .diffusion.configuration_diffusion import DiffusionConfig
 from .eo1.configuration_eo1 import EO1Config
 from .gaussian_actor.configuration_gaussian_actor import GaussianActorConfig
 from .groot.configuration_groot import GrootConfig
+from .molmoact2.configuration_molmoact2 import MolmoAct2Config
 from .multi_task_dit.configuration_multi_task_dit import MultiTaskDiTConfig
 from .pi0.configuration_pi0 import PI0Config
 from .pi05.configuration_pi05 import PI05Config
@@ -88,7 +89,8 @@ def get_policy_class(name: str) -> type[PreTrainedPolicy]:

    Args:
        name: The name of the policy. Supported names are "tdmpc", "diffusion", "act",
-            "multi_task_dit", "vqbet", "pi0", "pi05", "gaussian_actor", "smolvla", "wall_x".
+            "multi_task_dit", "vqbet", "pi0", "pi05", "gaussian_actor", "smolvla", "wall_x",
+            "molmoact2".
    Returns:
        The policy class corresponding to the given name.

@@ -151,6 +153,10 @@ def get_policy_class(name: str) -> type[PreTrainedPolicy]:
        from .eo1.modeling_eo1 import EO1Policy

        return EO1Policy
+    elif name == "molmoact2":
+        from .molmoact2.modeling_molmoact2 import MolmoAct2Policy
+
+        return MolmoAct2Policy
    else:
        try:
            return _get_policy_cls_from_policy_name(name=name)
@@ -168,7 +174,7 @@ def make_policy_config(policy_type: str, **kwargs) -> PreTrainedConfig:
    Args:
        policy_type: The type of the policy. Supported types include "tdmpc",
                     "multi_task_dit", "diffusion", "act", "vqbet", "pi0", "pi05", "gaussian_actor",
-                     "smolvla", "wall_x".
+                     "smolvla", "wall_x", "molmoact2".
        **kwargs: Keyword arguments to be passed to the configuration class constructor.

    Returns:
@@ -203,6 +209,8 @@ def make_policy_config(policy_type: str, **kwargs) -> PreTrainedConfig:
        return WallXConfig(**kwargs)
    elif policy_type == "eo1":
        return EO1Config(**kwargs)
+    elif policy_type == "molmoact2":
+        return MolmoAct2Config(**kwargs)
    else:
        try:
            config_cls = PreTrainedConfig.get_choice_class(policy_type)
@@ -231,6 +239,7 @@ class ProcessorConfigKwargs(TypedDict, total=False):
    preprocessor_overrides: dict[str, Any] | None
    postprocessor_overrides: dict[str, Any] | None
    dataset_stats: dict[str, dict[str, torch.Tensor]] | None
+    dataset_meta: Any | None


 def make_pre_post_processors(
@@ -414,6 +423,15 @@ def make_pre_post_processors(
            dataset_stats=kwargs.get("dataset_stats"),
        )

+    elif isinstance(policy_cfg, MolmoAct2Config):
+        from .molmoact2.processor_molmoact2 import make_molmoact2_pre_post_processors
+
+        processors = make_molmoact2_pre_post_processors(
+            config=policy_cfg,
+            dataset_stats=kwargs.get("dataset_stats"),
+            dataset_meta=kwargs.get("dataset_meta"),
+        )
+
    else:
        try:
            processors = _make_processors_from_policy_config(
@@ -499,6 +517,10 @@ def make_policy(
        action_names = ds_meta.features.get(ACTION, {}).get("names")
        if action_names is not None:
            cfg.action_feature_names = list(action_names)
+    if ds_meta is not None:
+        set_dataset_feature_metadata = getattr(cfg, "set_dataset_feature_metadata", None)
+        if callable(set_dataset_feature_metadata):
+            set_dataset_feature_metadata(ds_meta.features)

    kwargs["config"] = cfg

--- a/src/lerobot/policies/molmoact2/README.md
+++ b/src/lerobot/policies/molmoact2/README.md
@@ -0,0 +1 @@
+../../../../docs/source/policy_molmoact2_README.md
--- a/src/lerobot/policies/molmoact2/__init__.py
+++ b/src/lerobot/policies/molmoact2/__init__.py
@@ -0,0 +1,21 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from .configuration_molmoact2 import MolmoAct2Config
+from .modeling_molmoact2 import MolmoAct2Policy
+from .processor_molmoact2 import make_molmoact2_pre_post_processors
+
+__all__ = ["MolmoAct2Config", "MolmoAct2Policy", "make_molmoact2_pre_post_processors"]
--- a/src/lerobot/policies/molmoact2/configuration_molmoact2.py
+++ b/src/lerobot/policies/molmoact2/configuration_molmoact2.py
@@ -0,0 +1,519 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import annotations
+
+import json
+import math
+import os
+from contextlib import suppress
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Any
+
+from huggingface_hub import snapshot_download
+
+from lerobot.configs import FeatureType, NormalizationMode, PolicyFeature, PreTrainedConfig
+from lerobot.optim import (
+    AdamWConfig,
+    CosineDecayWithWarmupSchedulerConfig,
+    LRSchedulerConfig,
+    OptimizerConfig,
+)
+from lerobot.utils.constants import ACTION, OBS_STATE
+
+from ..rtc.configuration_rtc import RTCConfig
+
+MOLMOACT2_DEFAULT_NUM_IMAGES = 2
+MOLMOACT2_IMAGE_TOKENS_PER_IMAGE = 196
+MOLMOACT2_FIXED_PROMPT_TOKEN_BUDGET = 80
+MOLMOACT2_TASK_TOKEN_BUDGET = 32
+MOLMOACT2_SEQUENCE_LENGTH_MARGIN = 32
+MOLMOACT2_SEQUENCE_LENGTH_MULTIPLE = 64
+MOLMOACT2_DISCRETE_ACTION_WRAPPER_TOKENS = 4
+MOLMOACT2_MIN_DISCRETE_ACTION_TOKENS_PER_STEP = 6
+MOLMOACT2_DISCRETE_ACTION_TOKENS_PER_DIM = 0.95
+
+
+def _hf_token() -> str | None:
+    return os.environ.get("HF_TOKEN") or os.environ.get("HF_ACCESS_TOKEN")
+
+
+def _resolve_checkpoint_location(
+    checkpoint_path: str,
+    *,
+    revision: str | None = None,
+    force_download: bool = False,
+) -> str:
+    checkpoint_path = str(checkpoint_path or "").strip()
+    if not checkpoint_path:
+        raise ValueError("MolmoAct2 policy requires `checkpoint_path`.")
+    local_path = Path(checkpoint_path).expanduser()
+    if local_path.exists():
+        return str(local_path)
+    return snapshot_download(
+        repo_id=checkpoint_path,
+        repo_type="model",
+        revision=revision,
+        force_download=force_download,
+        ignore_patterns=["*.py", "*.pyc", "__pycache__/*"],
+        token=_hf_token(),
+    )
+
+
+def _load_hf_norm_metadata_for_tag(
+    checkpoint_path: str,
+    *,
+    revision: str | None,
+    force_download: bool,
+    norm_tag: str | None,
+) -> dict[str, Any]:
+    norm_tag = str(norm_tag or "").strip()
+    if not norm_tag:
+        return {}
+    checkpoint_location = Path(
+        _resolve_checkpoint_location(
+            checkpoint_path,
+            revision=revision,
+            force_download=force_download,
+        )
+    )
+    norm_stats_filename = "norm_stats.json"
+    config_path = checkpoint_location / "config.json"
+    if config_path.exists():
+        with suppress(OSError, json.JSONDecodeError):
+            norm_stats_filename = str(
+                json.loads(config_path.read_text()).get("norm_stats_filename") or norm_stats_filename
+            )
+    stats_path = checkpoint_location / norm_stats_filename
+    if not stats_path.exists():
+        raise FileNotFoundError(
+            f"MolmoAct2 HF checkpoint is missing {norm_stats_filename!r}; cannot resolve norm_tag={norm_tag!r}."
+        )
+    payload = json.loads(stats_path.read_text())
+    metadata_by_tag = payload.get("metadata_by_tag")
+    if not isinstance(metadata_by_tag, dict):
+        raise ValueError(f"MolmoAct2 norm stats file {stats_path} has no metadata_by_tag mapping.")
+    metadata = metadata_by_tag.get(norm_tag)
+    if not isinstance(metadata, dict):
+        available = sorted(str(tag) for tag in metadata_by_tag)
+        raise ValueError(f"Unknown MolmoAct2 norm_tag={norm_tag!r}. Available tags: {available}.")
+    return metadata
+
+
+@LRSchedulerConfig.register_subclass("molmoact2_cosine_decay_with_warmup")
+@dataclass
+class MolmoAct2CosineDecayWithWarmupSchedulerConfig(CosineDecayWithWarmupSchedulerConfig):
+    """MolmoAct2-local cosine scheduler with optional decay-step auto-match.
+
+    LeRobot's generic cosine scheduler keeps an explicit integer decay length.
+    For MolmoAct2, leaving num_decay_steps unset means "decay across this run's
+    training steps"; build() is the first point where num_training_steps is known.
+    """
+
+    num_decay_steps: int | None
+
+    def build(self, optimizer, num_training_steps: int):
+        return CosineDecayWithWarmupSchedulerConfig(
+            peak_lr=self.peak_lr,
+            decay_lr=self.decay_lr,
+            num_warmup_steps=self.num_warmup_steps,
+            num_decay_steps=num_training_steps if self.num_decay_steps is None else self.num_decay_steps,
+        ).build(optimizer, num_training_steps=num_training_steps)
+
+
+def _round_up(value: int, multiple: int) -> int:
+    return int(math.ceil(value / multiple) * multiple)
+
+
+def infer_molmoact2_max_sequence_length(
+    *,
+    num_images: int,
+    state_dim: int,
+    action_dim: int,
+    action_horizon: int,
+    include_discrete_action: bool,
+) -> int:
+    """Infer the padded text/image sequence cap from MolmoAct2's fixed token layout."""
+    if num_images < 1:
+        num_images = MOLMOACT2_DEFAULT_NUM_IMAGES
+    if state_dim < 0:
+        state_dim = 0
+    if action_dim < 1:
+        action_dim = 1
+    if action_horizon < 1:
+        action_horizon = 1
+
+    image_tokens = num_images * MOLMOACT2_IMAGE_TOKENS_PER_IMAGE
+    prompt_tokens = (
+        MOLMOACT2_FIXED_PROMPT_TOKEN_BUDGET
+        + MOLMOACT2_TASK_TOKEN_BUDGET
+        + state_dim
+        + MOLMOACT2_SEQUENCE_LENGTH_MARGIN
+    )
+    action_tokens = 0
+    if include_discrete_action:
+        action_tokens_per_step = max(
+            MOLMOACT2_MIN_DISCRETE_ACTION_TOKENS_PER_STEP,
+            math.ceil(action_dim * MOLMOACT2_DISCRETE_ACTION_TOKENS_PER_DIM),
+        )
+        action_tokens = MOLMOACT2_DISCRETE_ACTION_WRAPPER_TOKENS + action_horizon * action_tokens_per_step
+
+    return _round_up(
+        image_tokens + prompt_tokens + action_tokens,
+        MOLMOACT2_SEQUENCE_LENGTH_MULTIPLE,
+    )
+
+
+@PreTrainedConfig.register_subclass("molmoact2")
+@dataclass
+class MolmoAct2Config(PreTrainedConfig):
+    """MolmoAct2 policy backed by the converted HF checkpoint implementation."""
+
+    checkpoint_path: str = "allenai/MolmoAct2"
+    checkpoint_revision: str | None = None
+    checkpoint_force_download: bool = False
+
+    n_obs_steps: int = 1
+    chunk_size: int = 30
+    n_action_steps: int = 30
+
+    action_mode: str = "both"
+    inference_action_mode: str | None = None
+    discrete_action_tokenizer: str = "allenai/MolmoAct2-FAST-Tokenizer"
+    discrete_generation_max_steps: int | None = None
+    norm_tag: str | None = None
+
+    setup_type: str = ""
+    control_mode: str = ""
+    image_keys: list[str] = field(default_factory=list)
+    normalize_language: bool = True
+    add_setup_tokens: bool = True
+    add_control_tokens: bool = True
+    normalize_gripper: bool = False
+    num_state_tokens: int = 256
+    # Leave unset for the default MolmoAct2 sequence budget inferred from the fixed
+    # image/prompt/state/action token layout. Override only for unusual long prompts.
+    max_sequence_length: int | None = None
+
+    # Fixed by released MolmoAct2 checkpoints. We validate this at model load.
+    expected_max_action_dim: int = 32
+
+    # Flow-matching training knobs copied from the original MolmoAct2 training path.
+    num_flow_timesteps: int = 8
+    flow_matching_cutoff: float = 1.0
+    flow_matching_time_offset: float = 0.001
+    flow_matching_time_scale: float = 0.999
+    flow_matching_beta_alpha: float = 1.0
+    flow_matching_beta_beta: float = 1.5
+    num_inference_steps: int | None = None
+    mask_action_dim_padding: bool = True
+    enable_inference_cuda_graph: bool = True
+    # MolmoAct2-local eval option. When enabled, stochastic continuous action
+    # generation uses a rollout-local generator derived from eval_seed.
+    per_episode_seed: bool = False
+    eval_seed: int | None = None
+    rtc_config: RTCConfig | None = None
+
+    # Default is full finetuning with gradients from the action expert flowing into the VLM.
+    enable_lora_vlm: bool = False
+    lora_rank: int = 64
+    lora_alpha: int = 16
+    lora_dropout: float = 0.05
+    lora_bias: str = "none"
+    enable_lora_action_expert: bool = False
+    enable_knowledge_insulation: bool = False
+    freeze_embedding: bool = True
+    train_action_expert_only: bool = False
+    gradient_checkpointing: bool = False
+
+    model_dtype: str = "bfloat16"
+    softmax_auxiliary_loss: bool = True
+    softmax_auxiliary_loss_scale: float = 1e-4
+    discrete_loss_token_weighting: str = "root_subsegments_root_tokens"
+
+    optimizer_lr: float = 1e-5
+    optimizer_vit_lr: float = 5e-6
+    optimizer_connector_lr: float = 5e-6
+    optimizer_action_expert_lr: float = 5e-5
+    optimizer_betas: tuple[float, float] = (0.9, 0.95)
+    optimizer_eps: float = 1e-6
+    optimizer_weight_decay: float = 0.0
+    optimizer_grad_clip_norm: float = 1.0
+
+    scheduler_warmup_steps: int = 200
+    scheduler_decay_steps: int | None = None
+    scheduler_decay_lr: float = 1e-6
+
+    normalization_mapping: dict[str, NormalizationMode] = field(
+        default_factory=lambda: {
+            "VISUAL": NormalizationMode.IDENTITY,
+            "STATE": NormalizationMode.QUANTILES,
+            "ACTION": NormalizationMode.QUANTILES,
+        }
+    )
+
+    input_features: dict[str, PolicyFeature] = field(default_factory=dict)
+    output_features: dict[str, PolicyFeature] = field(default_factory=dict)
+    dataset_feature_names: dict[str, Any] = field(default_factory=dict)
+
+    def __post_init__(self) -> None:
+        super().__post_init__()
+        if self.action_mode not in {"continuous", "discrete", "both"}:
+            raise ValueError(
+                f"Unsupported action_mode={self.action_mode!r}. "
+                "Expected one of {'continuous', 'discrete', 'both'}."
+            )
+        if self.inference_action_mode not in {None, "continuous", "discrete"}:
+            raise ValueError(
+                f"Unsupported inference_action_mode={self.inference_action_mode!r}. "
+                "Expected one of {None, 'continuous', 'discrete'}."
+            )
+        if self.inference_action_mode == "continuous" and self.action_mode == "discrete":
+            raise ValueError("MolmoAct2 action_mode='discrete' cannot run continuous inference.")
+        if self.inference_action_mode == "discrete" and self.action_mode == "continuous":
+            raise ValueError("MolmoAct2 action_mode='continuous' cannot run discrete inference.")
+        if self.train_action_expert_only and self.action_mode != "continuous":
+            raise ValueError("MolmoAct2 train_action_expert_only requires action_mode='continuous'.")
+        if self.train_action_expert_only and self.enable_lora_vlm:
+            raise ValueError("MolmoAct2 train_action_expert_only is incompatible with enable_lora_vlm.")
+        if self.enable_lora_action_expert and not self.enable_lora_vlm:
+            raise ValueError("MolmoAct2 enable_lora_action_expert requires enable_lora_vlm.")
+        if self.chunk_size < 1:
+            raise ValueError(f"chunk_size must be >= 1, got {self.chunk_size}.")
+        if self.n_action_steps < 1:
+            raise ValueError(f"n_action_steps must be >= 1, got {self.n_action_steps}.")
+        if self.n_action_steps > self.chunk_size:
+            raise ValueError(
+                f"n_action_steps ({self.n_action_steps}) cannot exceed chunk_size ({self.chunk_size})."
+            )
+        if self.expected_max_action_dim != 32:
+            raise ValueError("MolmoAct2 released checkpoints use expected_max_action_dim=32.")
+        if self.model_dtype not in {"float32", "bfloat16", "float16"}:
+            raise ValueError(
+                f"Unsupported model_dtype={self.model_dtype!r}. Expected 'float32', 'bfloat16', or 'float16'."
+            )
+        if self.lora_rank < 1:
+            raise ValueError(f"lora_rank must be >= 1, got {self.lora_rank}.")
+        if self.lora_alpha < 1:
+            raise ValueError(f"lora_alpha must be >= 1, got {self.lora_alpha}.")
+        if not 0 <= self.lora_dropout <= 1:
+            raise ValueError(f"lora_dropout must be in [0, 1], got {self.lora_dropout}.")
+        if self.lora_bias not in {"none", "all", "lora_only"}:
+            raise ValueError(
+                f"Unsupported lora_bias={self.lora_bias!r}. Expected one of 'none', 'all', or 'lora_only'."
+            )
+        if self.discrete_loss_token_weighting not in {
+            "none",
+            "token",
+            "root_tokens",
+            "root_subsegments",
+            "root_subsegments_root_tokens",
+        }:
+            raise ValueError(
+                f"Unsupported discrete_loss_token_weighting={self.discrete_loss_token_weighting!r}."
+            )
+        if self.discrete_generation_max_steps is not None and self.discrete_generation_max_steps < 1:
+            raise ValueError(
+                f"discrete_generation_max_steps must be >= 1 or None, got {self.discrete_generation_max_steps}."
+            )
+        if self.max_sequence_length is not None and self.max_sequence_length < 1:
+            raise ValueError(f"max_sequence_length must be >= 1 or None, got {self.max_sequence_length}.")
+
+    def inferred_max_sequence_length(
+        self,
+        *,
+        num_images: int | None = None,
+        state_dim: int | None = None,
+        action_dim: int | None = None,
+        action_horizon: int | None = None,
+        include_discrete_action: bool | None = None,
+    ) -> int:
+        if self.max_sequence_length is not None:
+            return int(self.max_sequence_length)
+
+        if num_images is None:
+            num_images = len(self.image_keys) or len(self.image_features) or MOLMOACT2_DEFAULT_NUM_IMAGES
+        if state_dim is None:
+            state_feature = self.robot_state_feature
+            state_dim = int(state_feature.shape[0]) if state_feature is not None else 0
+        if action_dim is None:
+            action_feature = self.action_feature
+            action_dim = (
+                int(action_feature.shape[0]) if action_feature is not None else self.expected_max_action_dim
+            )
+        if action_horizon is None:
+            action_horizon = self.chunk_size
+        if include_discrete_action is None:
+            include_discrete_action = self.action_mode in {"discrete", "both"}
+
+        return infer_molmoact2_max_sequence_length(
+            num_images=int(num_images),
+            state_dim=int(state_dim),
+            action_dim=int(action_dim),
+            action_horizon=int(action_horizon),
+            include_discrete_action=bool(include_discrete_action),
+        )
+
+    @property
+    def observation_delta_indices(self) -> None:
+        return None
+
+    @property
+    def action_delta_indices(self) -> list[int]:
+        return list(range(self.chunk_size))
+
+    @property
+    def reward_delta_indices(self) -> None:
+        return None
+
+    def get_optimizer_preset(self) -> OptimizerConfig:
+        return AdamWConfig(
+            lr=self.optimizer_lr,
+            betas=self.optimizer_betas,
+            eps=self.optimizer_eps,
+            weight_decay=self.optimizer_weight_decay,
+            grad_clip_norm=self.optimizer_grad_clip_norm,
+        )
+
+    def get_scheduler_preset(self) -> LRSchedulerConfig | None:
+        return MolmoAct2CosineDecayWithWarmupSchedulerConfig(
+            peak_lr=self.optimizer_lr,
+            decay_lr=self.scheduler_decay_lr,
+            num_warmup_steps=self.scheduler_warmup_steps,
+            num_decay_steps=self.scheduler_decay_steps,
+        )
+
+    def set_dataset_feature_metadata(self, features: dict[str, Any]) -> None:
+        self.dataset_feature_names = {}
+        for key in (ACTION, OBS_STATE):
+            feature = features.get(key) if isinstance(features, dict) else None
+            if isinstance(feature, dict) and feature.get("names") is not None:
+                self.dataset_feature_names[key] = feature["names"]
+
+    def validate_features(self) -> None:
+        """Validate and set up MolmoAct2 input and output features."""
+        image_features = [key for key, feat in self.input_features.items() if feat.type == FeatureType.VISUAL]
+        if not image_features:
+            raise ValueError(
+                "MolmoAct2 policy requires at least one visual input feature. "
+                "No features of type FeatureType.VISUAL found in input_features."
+            )
+
+        if OBS_STATE not in self.input_features:
+            state_feature = PolicyFeature(
+                type=FeatureType.STATE,
+                shape=(0,),
+            )
+            self.input_features[OBS_STATE] = state_feature
+
+        if ACTION not in self.output_features:
+            action_feature = PolicyFeature(
+                type=FeatureType.ACTION,
+                shape=(self.expected_max_action_dim,),
+            )
+            self.output_features[ACTION] = action_feature
+
+    def apply_norm_tag_metadata(self) -> None:
+        if not str(self.norm_tag or "").strip():
+            return
+        metadata = _load_hf_norm_metadata_for_tag(
+            self.checkpoint_path,
+            revision=self.checkpoint_revision,
+            force_download=bool(self.checkpoint_force_download),
+            norm_tag=self.norm_tag,
+        )
+        if metadata.get("action_horizon") is not None:
+            self.chunk_size = int(metadata["action_horizon"])
+        if metadata.get("n_action_steps") is not None:
+            self.n_action_steps = int(metadata["n_action_steps"])
+        if not self.setup_type and metadata.get("setup_type") is not None:
+            self.setup_type = str(metadata["setup_type"])
+        if not self.control_mode and metadata.get("control_mode") is not None:
+            self.control_mode = str(metadata["control_mode"])
+
+    def saved_policy_action_mode(self) -> str | None:
+        pretrained_path = getattr(self, "pretrained_path", None)
+        if pretrained_path is None:
+            return None
+        config_path = Path(pretrained_path) / "config.json"
+        if not config_path.exists():
+            return None
+        try:
+            mode = json.loads(config_path.read_text()).get("action_mode")
+        except (OSError, json.JSONDecodeError):
+            return None
+        if mode in {"continuous", "discrete", "both"}:
+            return str(mode)
+        return None
+
+    def training_action_mode(self, saved_policy_action_mode: str | None = None) -> str:
+        return saved_policy_action_mode or self.action_mode
+
+    def validate_inference_action_mode(self, saved_policy_action_mode: str | None = None) -> None:
+        requested_mode = self.inference_action_mode
+        if requested_mode is None:
+            return
+        training_mode = self.training_action_mode(saved_policy_action_mode)
+        if requested_mode == "continuous" and training_mode == "discrete":
+            raise ValueError(
+                "MolmoAct2 checkpoint was trained with action_mode='discrete' and cannot run "
+                "continuous inference."
+            )
+        if requested_mode == "discrete" and training_mode == "continuous":
+            raise ValueError(
+                "MolmoAct2 checkpoint was trained with action_mode='continuous' and cannot run "
+                "discrete inference. Train with action_mode='both' or action_mode='discrete' first."
+            )
+
+    def validate_checkpoint_action_mode(
+        self,
+        checkpoint_action_mode: str,
+        *,
+        has_action_expert: bool,
+    ) -> None:
+        if self.action_mode == "both" and checkpoint_action_mode != "both":
+            raise ValueError(
+                f"action_mode='both' requires checkpoint action_mode='both', got {checkpoint_action_mode!r}."
+            )
+        if self.action_mode == "discrete" and checkpoint_action_mode not in {"discrete", "both"}:
+            raise ValueError(
+                f"action_mode='discrete' requires checkpoint action_mode in {{'discrete', 'both'}}, "
+                f"got {checkpoint_action_mode!r}."
+            )
+        if self.action_mode in {"continuous", "both"} and not has_action_expert:
+            raise ValueError("Continuous MolmoAct2 training requires an action expert checkpoint.")
+
+    def resolve_inference_action_mode(
+        self,
+        requested_mode: str | None,
+        saved_policy_action_mode: str | None = None,
+    ) -> str:
+        training_mode = self.training_action_mode(saved_policy_action_mode)
+        if requested_mode is None:
+            requested_mode = self.inference_action_mode
+        if requested_mode is None:
+            raise ValueError(
+                "MolmoAct2 inference requires `inference_action_mode` to be set explicitly "
+                "to either 'continuous' or 'discrete'."
+            )
+        if requested_mode not in {"continuous", "discrete"}:
+            raise ValueError("MolmoAct2 inference_action_mode must be either 'continuous' or 'discrete'.")
+        if requested_mode == "continuous" and training_mode == "discrete":
+            raise ValueError("MolmoAct2 action_mode='discrete' checkpoint cannot run continuous inference.")
+        if requested_mode == "discrete" and training_mode == "continuous":
+            raise ValueError("MolmoAct2 action_mode='continuous' checkpoint cannot run discrete inference.")
+        return requested_mode
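Review note: the compatibility rule enforced by `resolve_inference_action_mode` can be restated compactly. The sketch below is a standalone illustration, not part of the patch. The name `check_inference_mode` is hypothetical, and it assumes the training mode is always one of `'continuous'`, `'discrete'`, or `'both'`.

```python
VALID_INFERENCE_MODES = {"continuous", "discrete"}


def check_inference_mode(requested: str, training: str) -> str:
    """Return `requested` if a checkpoint trained in `training` mode can serve it.

    Hypothetical helper: assumes `training` is 'continuous', 'discrete', or 'both'.
    """
    if requested not in VALID_INFERENCE_MODES:
        raise ValueError(f"inference mode must be 'continuous' or 'discrete', got {requested!r}")
    # A 'both' checkpoint can serve either mode; a single-mode checkpoint
    # can only serve the mode it was trained with.
    if training != "both" and training != requested:
        raise ValueError(f"action_mode={training!r} checkpoint cannot run {requested} inference")
    return requested
```

Under that assumption the two explicit cross-mode checks in the patch collapse into a single inequality test, which may be easier to audit.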
--- a/src/lerobot/policies/molmoact2/hf_model/__init__.py
+++ b/src/lerobot/policies/molmoact2/hf_model/__init__.py
@@ -0,0 +1,17 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# ruff: noqa
--- a/src/lerobot/policies/molmoact2/hf_model/action_tokenizer.py
+++ b/src/lerobot/policies/molmoact2/hf_model/action_tokenizer.py
@@ -0,0 +1,237 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# ruff: noqa
+
+import logging
+import os
+from pathlib import Path
+from typing import ClassVar
+
+import numpy as np
+from tokenizers import ByteLevelBPETokenizer
+from tokenizers.trainers import BpeTrainer
+from huggingface_hub import snapshot_download
+from transformers import PreTrainedTokenizerFast
+from transformers.processing_utils import ProcessorMixin
+
+
+def _hf_token() -> str | None:
+    return os.environ.get("HF_TOKEN") or os.environ.get("HF_ACCESS_TOKEN")
+
+
+def _resolve_tokenizer_location(
+    tokenizer_path: str,
+    *,
+    revision: str | None = None,
+    force_download: bool = False,
+) -> str:
+    local_path = Path(str(tokenizer_path)).expanduser()
+    if local_path.exists():
+        return str(local_path)
+    return snapshot_download(
+        repo_id=str(tokenizer_path),
+        repo_type="model",
+        revision=revision,
+        force_download=force_download,
+        ignore_patterns=["*.py", "*.pyc", "__pycache__/*"],
+        token=_hf_token(),
+    )
+
+
+class UniversalActionProcessor(ProcessorMixin):
+    attributes: ClassVar[list[str]] = ["tokenizer"]
+    tokenizer_class: str = "AutoTokenizer"
+
+    def __init__(
+        self,
+        tokenizer: PreTrainedTokenizerFast,
+        scale: float = 10,
+        vocab_size: int = 1024,
+        min_token: int = 0,
+        *,
+        action_dim: int | None = None,
+        time_horizon: int | None = None,
+    ):
+        self.scale = scale
+        self.vocab_size = vocab_size
+        self.min_token = min_token
+
+        # Action horizon and dimension needed during decoding. These can be specified
+        # in three ways (in order of priority):
+        # 1. passed in as kwargs to decode()
+        # 2. in the constructor
+        # 3. cached from the last time decode() was called
+        self.time_horizon = time_horizon
+        self.action_dim = action_dim
+        self.called_time_horizon = time_horizon
+        self.called_action_dim = action_dim
+
+        super().__init__(tokenizer)
+        self.bpe_tokenizer = self.tokenizer
+
+    def __call__(self, action_chunk: np.ndarray) -> list[list[int]]:
+        from scipy.fft import dct
+
+        assert action_chunk.ndim <= 3, "Only 3 dimensions supported: [batch, timesteps, action_dim]"
+        if action_chunk.ndim == 2:
+            action_chunk = action_chunk[None, ...]
+
+        # Cache the time horizon and action dimension for decoding
+        self.called_time_horizon = action_chunk.shape[-2]
+        self.called_action_dim = action_chunk.shape[-1]
+
+        dct_coeff = dct(action_chunk, axis=1, norm="ortho")
+        dct_coeff = np.around(dct_coeff * self.scale)
+        tokens = []
+        for elem in dct_coeff:
+            token_str = "".join(map(chr, np.maximum(elem.flatten() - self.min_token, 0).astype(int)))
+            tokens.append(self.bpe_tokenizer(token_str)["input_ids"])
+        return tokens
+
+    def decode(
+        self,
+        tokens: list[list[int]],
+        *,
+        time_horizon: int | None = None,
+        action_dim: int | None = None,
+    ) -> np.ndarray:
+        from scipy.fft import idct
+
+        self.time_horizon = time_horizon or self.time_horizon or self.called_time_horizon
+        self.action_dim = action_dim or self.action_dim or self.called_action_dim
+
+        # Cache the time horizon and action dimension for the next call
+        self.called_time_horizon = self.time_horizon
+        self.called_action_dim = self.action_dim
+
+        assert self.time_horizon is not None and self.action_dim is not None, (
+            "Tokenizer not initialized: call the processor on an action chunk once, "
+            "or pass in time_horizon and action_dim."
+        )
+
+        decoded_actions = []
+        for token in tokens:
+            try:
+                decoded_tokens = self.bpe_tokenizer.decode(token)
+                decoded_dct_coeff = np.array(list(map(ord, decoded_tokens))) + self.min_token
+                decoded_dct_coeff = decoded_dct_coeff.reshape(-1, self.action_dim)
+                assert decoded_dct_coeff.shape == (
+                    self.time_horizon,
+                    self.action_dim,
+                ), (
+                    f"Decoded DCT coefficients have shape {decoded_dct_coeff.shape}, expected ({self.time_horizon}, {self.action_dim})"
+                )
+            except Exception as e:
+                print(f"Error decoding tokens: {e}")
+                print(f"Tokens: {token}")
+                decoded_dct_coeff = np.zeros((self.time_horizon, self.action_dim))
+            decoded_actions.append(idct(decoded_dct_coeff / self.scale, axis=0, norm="ortho"))
+        return np.stack(decoded_actions)
+
+    @classmethod
+    def fit(
+        cls,
+        action_data: list[np.ndarray],
+        scale: float = 10,
+        vocab_size: int = 1024,
+        *,
+        time_horizon: int | None = None,
+        action_dim: int | None = None,
+    ) -> "UniversalActionProcessor":
+        from scipy.fft import dct
+
+        # Run DCT over all inputs
+        dct_tokens = [dct(a, axis=0, norm="ortho").flatten() for a in action_data]
+
+        # Quantize and find min token
+        max_token = int(np.around(np.concatenate(dct_tokens) * scale).max())
+        min_token = int(np.around(np.concatenate(dct_tokens) * scale).min())
+        min_vocab_size = max_token - min_token
+
+        assert min_vocab_size <= vocab_size, (
+            f"Vocab size {vocab_size} is too small for the range of tokens {min_vocab_size}"
+        )
+        if min_vocab_size + 100 > vocab_size:
+            logging.warning(
+                f"Initial alphabet size {min_vocab_size} is almost as large as the vocab "
+                f"size {vocab_size}, consider increasing vocab size"
+            )
+
+        # Make token iterator for BPE training
+        def _token_iter():
+            for tokens in dct_tokens:
+                rounded_tokens = np.around(tokens * scale) - min_token
+                rounded_tokens = rounded_tokens.astype(int)
+                string = "".join(map(chr, rounded_tokens))
+                yield string
+
+        # Train BPE tokenizer
+        bpe = ByteLevelBPETokenizer()
+
+        # Set up the entire range of possible tokens as the initial alphabet
+        alphabet = [chr(i) for i in range(max_token - min_token + 1)]
+        trainer = BpeTrainer(
+            vocab_size=vocab_size,
+            min_frequency=2,
+            show_progress=True,
+            special_tokens=[],
+            initial_alphabet=alphabet,
+            max_token_length=10000,
+        )
+
+        # Train the inner tokenizer (don't use ByteLevelBPETokenizer.train_from_iterator()
+        # because it doesn't support custom alphabets)
+        bpe._tokenizer.train_from_iterator(_token_iter(), trainer=trainer)
+
+        return cls(
+            PreTrainedTokenizerFast(tokenizer_object=bpe, clean_up_tokenization_spaces=False),
+            scale=scale,
+            vocab_size=vocab_size,
+            min_token=min_token,
+            time_horizon=time_horizon,
+            action_dim=action_dim,
+        )
+
+    @classmethod
+    def from_pretrained_local(
+        cls,
+        pretrained_model_name_or_path: str,
+        *,
+        revision: str | None = None,
+        force_download: bool = False,
+    ) -> "UniversalActionProcessor":
+        location = Path(
+            _resolve_tokenizer_location(
+                pretrained_model_name_or_path,
+                revision=revision,
+                force_download=force_download,
+            )
+        )
+        processor_config = {}
+        processor_config_path = location / "processor_config.json"
+        if processor_config_path.exists():
+            import json
+
+            processor_config = json.loads(processor_config_path.read_text())
+        tokenizer = PreTrainedTokenizerFast.from_pretrained(str(location))
+        return cls(
+            tokenizer,
+            scale=processor_config.get("scale", 10),
+            vocab_size=processor_config.get("vocab_size", 1024),
+            min_token=processor_config.get("min_token", 0),
+            action_dim=processor_config.get("action_dim"),
+            time_horizon=processor_config.get("time_horizon"),
+        )
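Review note: the quantization step in `UniversalActionProcessor` (orthonormal DCT along the time axis, multiply by `scale`, round) is lossy, with an error that is bounded by the rounding step. The sketch below illustrates this for a single action dimension. It is a standalone, stdlib-only illustration rather than the patch's code: the helper names are hypothetical, the DCT is written out by hand instead of calling `scipy.fft`, and the BPE stage is omitted.

```python
import math


def dct_ortho(x: list[float]) -> list[float]:
    """Orthonormal DCT-II of a 1-D sequence (the norm="ortho" convention)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        norm = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(norm * s)
    return out


def idct_ortho(c: list[float]) -> list[float]:
    """Inverse of dct_ortho (the transform is orthonormal, so this is its transpose)."""
    n = len(c)
    out = []
    for i in range(n):
        s = math.sqrt(1.0 / n) * c[0]
        s += sum(
            math.sqrt(2.0 / n) * c[k] * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out


def quantize(x: list[float], scale: float = 10.0) -> list[int]:
    """Scale the DCT coefficients and round them to integers."""
    return [round(v * scale) for v in dct_ortho(x)]


def dequantize(q: list[int], scale: float = 10.0) -> list[float]:
    """Undo the scaling and invert the DCT."""
    return idct_ortho([v / scale for v in q])


# Hypothetical 8-step trajectory for one action dimension.
trajectory = [0.0, 0.1, 0.25, 0.4, 0.5, 0.55, 0.6, 0.6]
restored = dequantize(quantize(trajectory))
max_error = max(abs(a - b) for a, b in zip(trajectory, restored))
```

Each coefficient is off by at most `0.5 / scale` after rounding, and an orthonormal transform preserves the L2 norm, so the per-step reconstruction error is at most `0.5 / scale * sqrt(n)`. A larger `scale` tightens this bound at the cost of a wider integer range, which is what the vocab-size assertion in `fit` guards against.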
--- a/src/lerobot/policies/molmoact2/hf_model/configuration_molmoact2.py
+++ b/src/lerobot/policies/molmoact2/hf_model/configuration_molmoact2.py
@@ -0,0 +1,553 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# ruff: noqa
+
+"""
+MolmoAct2 configuration
+"""
+
+from typing import Optional, Any
+
+from transformers import PretrainedConfig
+from transformers.modeling_rope_utils import rope_config_validation
+from transformers.utils import logging
+
+logger = logging.get_logger(__name__)
+
+
+class MolmoAct2VitConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`MolmoAct2VisionTransformer`].
+    It is used to instantiate a `MolmoAct2VisionTransformer` according to the specified arguments,
+    defining the model architecture.
+
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+
+    Example:
+    ```python
+    >>> from transformers import MolmoAct2VitConfig, MolmoAct2VisionTransformer
+
+    >>> # Initializing a MolmoAct2VitConfig
+    >>> configuration = MolmoAct2VitConfig()
+
+    >>> # Initializing a MolmoAct2VisionTransformer (with random weights)
+    >>> model = MolmoAct2VisionTransformer(configuration)
+
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```"""
+
+    model_type = "molmoact2"
+    base_config_key = "vit_config"
+
+    def __init__(
+        self,
+        hidden_size: int = 1152,
+        intermediate_size: int = 4304,
+        num_hidden_layers: int = 27,
+        num_attention_heads: int = 16,
+        num_key_value_heads: int = 16,
+        head_dim: int = 72,
+        hidden_act: str = "gelu_pytorch_tanh",
+        layer_norm_eps: float = 1e-6,
+        image_default_input_size: tuple[int, int] = (378, 378),
+        image_patch_size: int = 14,
+        image_num_pos: int = 577,
+        attention_dropout: float = 0.0,
+        residual_dropout: float = 0.0,
+        initializer_range: float = 0.02,
+        float32_attention: bool = True,
+        attn_implementation: str = "eager",
+        **kwargs,
+    ):
+        self.attn_implementation = attn_implementation
+        super().__init__(attn_implementation=attn_implementation, **kwargs)
+        self.hidden_size = hidden_size
+        self.intermediate_size = intermediate_size
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        self.num_key_value_heads = num_key_value_heads
+        self.head_dim = head_dim
+        self.hidden_act = hidden_act
+        self.layer_norm_eps = layer_norm_eps
+        self.image_default_input_size = image_default_input_size
+        self.image_patch_size = image_patch_size
+        self.image_num_pos = image_num_pos
+        self.attention_dropout = attention_dropout
+        self.residual_dropout = residual_dropout
+        self.initializer_range = initializer_range
+        self.float32_attention = float32_attention
+
+    @property
+    def image_num_patch(self):
+        h, w = self.image_default_input_size
+        return h // self.image_patch_size, w // self.image_patch_size
+
+
+class MolmoAct2AdapterConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of MolmoAct2Adapter. Together with
+    MolmoAct2VitConfig, it is used to instantiate a MolmoAct2VisionBackbone according to the specified
+    arguments, defining the model architecture.
+
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+
+    Example:
+
+    ```python
+    >>> from transformers import MolmoAct2VitConfig, MolmoAct2AdapterConfig, MolmoAct2VisionBackbone
+
+    >>> # Initializing a MolmoAct2VitConfig and a MolmoAct2AdapterConfig
+    >>> vit_config = MolmoAct2VitConfig()
+    >>> adapter_config = MolmoAct2AdapterConfig()
+
+    >>> # Initializing a MolmoAct2VisionBackbone (with random weights)
+    >>> model = MolmoAct2VisionBackbone(vit_config, adapter_config)
+
+    >>> # Accessing the model configuration
+    >>> vit_configuration = model.vit_config
+    >>> adapter_configuration = model.adapter_config
+    ```"""
+
+    model_type = "molmoact2"
+    base_config_key = "adapter_config"
+
+    def __init__(
+        self,
+        vit_layers: tuple = (-3, -9),
+        pooling_attention_mask: bool = False,
+        hidden_size: int = 1152,
+        num_attention_heads: int = 16,
+        num_key_value_heads: int = 16,
+        head_dim: int = 72,
+        float32_attention: bool = True,
+        attention_dropout: float = 0.0,
+        residual_dropout: float = 0.0,
+        hidden_act: str = "silu",
+        intermediate_size: int = 18944,
+        text_hidden_size: int = 3584,
+        image_feature_dropout: float = 0.0,
+        initializer_range: float = 0.02,
+        attn_implementation: str = "eager",
+        **kwargs,
+    ):
+        self.attn_implementation = attn_implementation
+        super().__init__(attn_implementation=attn_implementation, **kwargs)
+        self.vit_layers = vit_layers
+        self.pooling_attention_mask = pooling_attention_mask
+        self.hidden_size = hidden_size
+        self.num_attention_heads = num_attention_heads
+        self.num_key_value_heads = num_key_value_heads
+        self.head_dim = head_dim
+        self.float32_attention = float32_attention
+        self.attention_dropout = attention_dropout
+        self.residual_dropout = residual_dropout
+        self.hidden_act = hidden_act
+        self.intermediate_size = intermediate_size
+        self.text_hidden_size = text_hidden_size
+        self.image_feature_dropout = image_feature_dropout
+        self.initializer_range = initializer_range
+
+
+class MolmoAct2TextConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`MolmoAct2TextModel`]. It is used to instantiate a
+    `MolmoAct2TextModel` according to the specified arguments, defining the model architecture.
+
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+
+    Example:
+    ```python
+    >>> from transformers import MolmoAct2TextConfig, MolmoAct2TextModel
+
+    >>> # Initializing a MolmoAct2TextConfig
+    >>> configuration = MolmoAct2TextConfig()
+
+    >>> # Initializing a MolmoAct2TextModel (with random weights)
+    >>> model = MolmoAct2TextModel(configuration)
+
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```"""
+
+    model_type = "molmoact2_text"
+    base_config_key = "text_config"
+    keys_to_ignore_at_inference = ["past_key_values"]
+    base_model_tp_plan = {
+        "blocks.*.self_attn.att_proj": "colwise",
+        "blocks.*.self_attn.attn_out": "rowwise",
+        "blocks.*.mlp.ff_proj": "colwise",
+        "blocks.*.mlp.ff_out": "rowwise",
+    }
+    base_model_pp_plan = {
+        "wte": (["input_ids"], ["inputs_embeds"]),
+        "blocks": (["hidden_states", "attention_mask"], ["hidden_states"]),
+        "ln_f": (["hidden_states"], ["hidden_states"]),
+    }
+
+    def __init__(
+        self,
+        hidden_size: int = 3584,
+        num_attention_heads: int = 28,
+        num_key_value_heads: int | None = 4,
+        head_dim: int = 128,
+        vocab_size: int = 152064,
+        additional_vocab_size: int = 128,
+        qkv_bias: bool = True,
+        num_hidden_layers: int = 48,
+        intermediate_size: int = 18944,
+        hidden_act: str = "silu",
+        embedding_dropout: float = 0.0,
+        attention_dropout: float = 0.0,
+        residual_dropout: float = 0.0,
+        max_position_embeddings: int = 4096,
+        rope_theta: float = 1000000.0,
+        rope_scaling: dict[str, Any] | None = None,
+        rope_scaling_layers: list[int] | None = None,
+        use_qk_norm: bool = False,
+        qk_norm_type: str = "olmo",
+        layer_norm_eps: float = 1e-6,
+        norm_after: bool = False,
+        initializer_range: float = 0.02,
+        use_cache=True,
+        tie_word_embeddings=False,
+        attn_implementation: str = "eager",
+        **kwargs,
+    ):
+        self.attn_implementation = attn_implementation
+        super().__init__(
+            tie_word_embeddings=tie_word_embeddings, attn_implementation=attn_implementation, **kwargs
+        )
+        self.hidden_size = hidden_size
+        self.num_attention_heads = num_attention_heads
+        if num_key_value_heads is None:
+            num_key_value_heads = num_attention_heads
+        self.num_key_value_heads = num_key_value_heads
+        self.head_dim = head_dim
+        self.vocab_size = vocab_size
+        self.additional_vocab_size = additional_vocab_size
+        self.qkv_bias = qkv_bias
+        self.num_hidden_layers = num_hidden_layers
+        self.intermediate_size = intermediate_size
+        self.hidden_act = hidden_act
+        self.embedding_dropout = embedding_dropout
+        self.attention_dropout = attention_dropout
+        self.residual_dropout = residual_dropout
+        self.max_position_embeddings = max_position_embeddings
+        self.rope_theta = rope_theta
+        self.rope_scaling = rope_scaling
+        self.rope_scaling_layers = rope_scaling_layers
+        self.use_qk_norm = use_qk_norm
+        self.qk_norm_type = qk_norm_type
+        self.layer_norm_eps = layer_norm_eps
+        self.norm_after = norm_after
+        self.initializer_range = initializer_range
+        self.use_cache = use_cache
+
+        # Validate the correctness of rotary position embeddings parameters
+        rope_config_validation(self)
+
+
+class MolmoAct2ActionExpertConfig(PretrainedConfig):
+    r"""Configuration for the MolmoAct2 modern action expert."""
+
+    model_type = "molmoact2_action_expert"
+    base_config_key = "action_expert_config"
+
+    def __init__(
+        self,
+        max_action_horizon: int = 32,
+        max_action_dim: int = 32,
+        hidden_size: int = 1024,
+        num_layers: int = 32,
+        num_heads: int = 16,
+        mlp_ratio: float = 8.0 / 3.0,
+        ffn_multiple_of: int = 256,
+        timestep_embed_dim: int = 256,
+        dropout: float = 0.0,
+        attn_dropout: float = 0.0,
+        context_layer_norm: bool = True,
+        qk_norm: bool = True,
+        qk_norm_eps: float = 1e-6,
+        rope: bool = True,
+        causal_attn: bool = False,
+        **kwargs,
+    ):
+        super().__init__(**kwargs)
+        self.max_action_horizon = max_action_horizon
+        self.max_action_dim = max_action_dim
+        self.hidden_size = hidden_size
+        self.num_layers = num_layers
+        self.num_heads = num_heads
+        self.mlp_ratio = mlp_ratio
+        self.ffn_multiple_of = ffn_multiple_of
+        self.timestep_embed_dim = timestep_embed_dim
+        self.dropout = dropout
+        self.attn_dropout = attn_dropout
+        self.context_layer_norm = context_layer_norm
+        self.qk_norm = qk_norm
+        self.qk_norm_eps = qk_norm_eps
+        self.rope = rope
+        self.causal_attn = causal_attn
+
+    def to_dict(self):
+        output = super().to_dict()
+        # These are derived from the parent MolmoAct2Config for HF exports. Keeping
+        # them out of the public nested config avoids duplicated sources of truth.
+        output.pop("max_action_horizon", None)
+        output.pop("max_action_dim", None)
+        return output
+
+
+class MolmoAct2Config(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`MolmoAct2ForConditionalGeneration`].
+    It is used to instantiate a MolmoAct2 model according to the specified arguments, defining the model architecture.
+
+    Example:
+
+    ```python
+    >>> from transformers import MolmoAct2Config, MolmoAct2VitConfig, MolmoAct2AdapterConfig, MolmoAct2TextConfig
+
+    >>> # Initializing a MolmoAct2VitConfig
+    >>> vit_config = MolmoAct2VitConfig()
+
+    >>> # Initializing a MolmoAct2AdapterConfig
+    >>> adapter_config = MolmoAct2AdapterConfig()
+
+    >>> # Initializing a MolmoAct2TextConfig
+    >>> text_config = MolmoAct2TextConfig()
+
+    >>> # Initializing a MolmoAct2Config
+    >>> configuration = MolmoAct2Config(
+    ...     vit_config=vit_config,
+    ...     adapter_config=adapter_config,
+    ...     text_config=text_config,
+    ...     image_start_token_id=151936,
+    ...     image_end_token_id=151937,
+    ...     image_patch_id=151938,
+    ...     image_col_id=151939,
+    ...     low_res_image_start_token_id=151940,
+    ...     image_low_res_id=151942,
+    ...     frame_start_token_id=151943,
+    ...     frame_end_token_id=151944,
+    ... )
+
+    >>> # Initializing a model
+    >>> model = MolmoAct2ForConditionalGeneration(configuration)
+
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```"""
+
+    model_type = "molmoact2"
+    sub_configs = {
+        "text_config": MolmoAct2TextConfig,
+        "vit_config": MolmoAct2VitConfig,
+        "adapter_config": MolmoAct2AdapterConfig,
+        "action_expert_config": MolmoAct2ActionExpertConfig,
+    }
+
+    def __init__(
+        self,
+        vit_config: MolmoAct2VitConfig = None,
+        adapter_config: MolmoAct2AdapterConfig = None,
+        text_config: MolmoAct2TextConfig = None,
+        action_expert_config: MolmoAct2ActionExpertConfig = None,
+        image_start_token_id: int = None,
+        low_res_image_start_token_id: int = None,
+        image_end_token_id: int = None,
+        image_low_res_id: int = None,
+        image_patch_id: int = None,
+        image_col_id: int = None,
+        frame_start_token_id: int = None,
+        frame_end_token_id: int = None,
+        use_frame_special_tokens: bool = True,
+        initializer_range: float = 0.02,
+        add_action_expert: bool = True,
+        max_action_dim: int = 32,
+        max_action_horizon: int = 30,
+        n_obs_steps: int = 30,
+        action_mode: str = "both",
+        state_format: str = "discrete",
+        flow_matching_num_steps: int = 10,
+        flow_matching_cutoff: float = 1.0,
+        flow_matching_time_offset: float = 0.001,
+        flow_matching_time_scale: float = 0.999,
+        flow_matching_beta_alpha: float = 1.0,
+        flow_matching_beta_beta: float = 1.5,
+        mask_action_dim_padding: bool = True,
+        enable_depth_reasoning: bool = False,
+        depth_mode: int = 2,
+        num_depth_codes: int = 100,
+        action_expert_depth_gate: bool = False,
+        action_expert_depth_gate_per_layer: bool = False,
+        action_expert_depth_gate_init_bias: float = -4.0,
+        action_output_token_id: int = None,
+        action_start_token_id: int = None,
+        action_end_token_id: int = None,
+        action_token_start_id: int = None,
+        num_action_tokens: int = 0,
+        depth_output_token_id: int = None,
+        depth_start_token_id: int = None,
+        depth_end_token_id: int = None,
+        depth_token_start_id: int = None,
+        num_depth_tokens: int = 0,
+        state_start_token_id: int = None,
+        state_end_token_id: int = None,
+        state_token_start_id: int = None,
+        num_state_tokens: int = 0,
+        add_setup_tokens: bool = True,
+        add_control_tokens: bool = True,
+        norm_stats_filename: str = "norm_stats.json",
+        **kwargs,
+    ):
+        super().__init__(**kwargs)
+        if vit_config is None:
+            self.vit_config = MolmoAct2VitConfig()
+        elif isinstance(vit_config, dict):
+            self.vit_config = MolmoAct2VitConfig(**vit_config)
+        else:
+            self.vit_config = vit_config
+        if adapter_config is None:
+            self.adapter_config = MolmoAct2AdapterConfig()
+        elif isinstance(adapter_config, dict):
+            self.adapter_config = MolmoAct2AdapterConfig(**adapter_config)
+        else:
+            self.adapter_config = adapter_config
+        if text_config is None:
+            self.text_config = MolmoAct2TextConfig()
+        elif isinstance(text_config, dict):
+            self.text_config = MolmoAct2TextConfig(**text_config)
+        else:
+            self.text_config = text_config
+        self.add_action_expert = bool(add_action_expert)
+        if not self.add_action_expert:
+            self.action_expert_config = None
+        elif action_expert_config is None:
+            self.action_expert_config = MolmoAct2ActionExpertConfig(
+                max_action_horizon=max_action_horizon,
+                max_action_dim=max_action_dim,
+                num_layers=self.text_config.num_hidden_layers,
+            )
+        elif isinstance(action_expert_config, dict):
+            self.action_expert_config = MolmoAct2ActionExpertConfig(**action_expert_config)
+        else:
+            self.action_expert_config = action_expert_config
+        if self.add_action_expert:
+            self.action_expert_config.max_action_dim = int(max_action_dim)
+            self.action_expert_config.max_action_horizon = int(max_action_horizon)
+            self._validate_release_action_config(
+                state_format=state_format,
+            )
+        self.image_start_token_id = image_start_token_id
+        self.low_res_image_start_token_id = low_res_image_start_token_id
+        self.image_end_token_id = image_end_token_id
+        self.image_low_res_id = image_low_res_id
+        self.image_high_res_id = image_patch_id
+        self.image_patch_id = image_patch_id
+        self.image_col_id = image_col_id
+        self.frame_start_token_id = frame_start_token_id
+        self.frame_end_token_id = frame_end_token_id
+        self.use_frame_special_tokens = use_frame_special_tokens
+        self.initializer_range = initializer_range
+        self.max_action_dim = max_action_dim
+        self.max_action_horizon = max_action_horizon
+        self.n_obs_steps = n_obs_steps
+        self.action_mode = action_mode
+        self.state_format = state_format
+        self.flow_matching_num_steps = flow_matching_num_steps
+        self.flow_matching_cutoff = flow_matching_cutoff
+        self.flow_matching_time_offset = flow_matching_time_offset
+        self.flow_matching_time_scale = flow_matching_time_scale
+        self.flow_matching_beta_alpha = flow_matching_beta_alpha
+        self.flow_matching_beta_beta = flow_matching_beta_beta
+        self.mask_action_dim_padding = mask_action_dim_padding
+        self.enable_depth_reasoning = enable_depth_reasoning
+        self.depth_mode = depth_mode
+        self.num_depth_codes = num_depth_codes
+        self.action_expert_depth_gate = action_expert_depth_gate
+        self.action_expert_depth_gate_per_layer = action_expert_depth_gate_per_layer
+        self.action_expert_depth_gate_init_bias = action_expert_depth_gate_init_bias
+        self.action_output_token_id = action_output_token_id
+        self.action_start_token_id = action_start_token_id
+        self.action_end_token_id = action_end_token_id
+        self.action_token_start_id = action_token_start_id
+        self.num_action_tokens = num_action_tokens
+        self.depth_output_token_id = depth_output_token_id
+        self.depth_start_token_id = depth_start_token_id
+        self.depth_end_token_id = depth_end_token_id
+        self.depth_token_start_id = depth_token_start_id
+        self.num_depth_tokens = num_depth_tokens
+        self.state_start_token_id = state_start_token_id
+        self.state_end_token_id = state_end_token_id
+        self.state_token_start_id = state_token_start_id
+        self.num_state_tokens = num_state_tokens
+        self.add_setup_tokens = add_setup_tokens
+        self.add_control_tokens = add_control_tokens
+        self.norm_stats_filename = norm_stats_filename
+
+    @staticmethod
+    def _validate_release_action_config(
+        *,
+        state_format: str,
+    ) -> None:
+        if state_format != "discrete":
+            raise ValueError("MolmoAct2 HF export supports only state_format='discrete'.")
+
+    @property
+    def image_num_patch(self):
+        assert self.vit_config is not None
+        return self.vit_config.image_num_patch
+
+    @property
+    def num_attention_heads(self):
+        return self.text_config.num_attention_heads
+
+    @property
+    def num_key_value_heads(self):
+        return self.text_config.num_key_value_heads
+
+    @property
+    def head_dim(self):
+        return self.text_config.head_dim
+
+    @property
+    def num_hidden_layers(self):
+        return self.text_config.num_hidden_layers
+
+    @property
+    def hidden_size(self):
+        return self.text_config.hidden_size
+
+    @property
+    def vocab_size(self):
+        return self.text_config.vocab_size
+
+    @property
+    def max_position_embeddings(self):
+        return self.text_config.max_position_embeddings
+
+
+MolmoAct2VitConfig.register_for_auto_class()
+MolmoAct2AdapterConfig.register_for_auto_class()
+MolmoAct2TextConfig.register_for_auto_class()
+MolmoAct2ActionExpertConfig.register_for_auto_class()
+MolmoAct2Config.register_for_auto_class()
--- a/src/lerobot/policies/molmoact2/hf_model/image_processing_molmoact2.py
+++ b/src/lerobot/policies/molmoact2/hf_model/image_processing_molmoact2.py
@@ -0,0 +1,564 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# ruff: noqa
+
+"""Image processor class for MolmoAct2"""
+
+from typing import Optional, Union
+import numpy as np
+import einops
+import torch
+import torchvision.transforms
+
+from transformers.image_utils import (
+    IMAGENET_STANDARD_MEAN,
+    IMAGENET_STANDARD_STD,
+    ImageInput,
+    PILImageResampling,
+    make_flat_list_of_images,
+    valid_images,
+    to_numpy_array,
+)
+from transformers.image_transforms import convert_to_rgb
+from transformers.processing_utils import ImagesKwargs
+from transformers.image_processing_utils import BaseImageProcessor, get_size_dict
+from transformers.utils import logging
+from transformers.feature_extraction_utils import BatchFeature
+from transformers.utils import TensorType, logging
+
+
+logger = logging.get_logger(__name__)
+
+
+def normalize_image(
+    image: np.ndarray,
+    image_mean: list[float],
+    image_std: list[float],
+) -> np.ndarray:
+    if np.allclose(image_mean, [0.5, 0.5, 0.5]) and np.allclose(image_std, [0.5, 0.5, 0.5]):
+        return image * np.asarray(2.0, dtype=np.float32) - np.asarray(1.0, dtype=np.float32)
+    image -= np.array(image_mean, dtype=np.float32)[None, None, :]
+    image /= np.array(image_std, dtype=np.float32)[None, None, :]
+    return image
+
+
+def resize_image(
+    image: np.ndarray,
+    desired_output_size: list[int],
+    resample: PILImageResampling,
+) -> np.ndarray:
+    image = torch.permute(torch.from_numpy(image), [2, 0, 1])
+    dtype = image.dtype
+    if torch.is_floating_point(image):
+        in_min = 0.0
+        in_max = 1.0
+        resized = torchvision.transforms.Resize(
+            desired_output_size,
+            resample,
+            antialias=False,
+        )(image)
+        resized = torch.clip(resized, 0.0, 1.0).to(dtype)
+    else:
+        assert image.dtype == torch.uint8, "SigLIP expects float images or uint8 images, but got {}".format(
+            image.dtype
+        )
+        in_min = 0.0
+        in_max = 255.0
+        resized = torchvision.transforms.Resize(
+            desired_output_size,
+            resample,
+            antialias=False,
+        )(image)
+        resized = torch.clip(resized, 0, 255).to(dtype)
+
+    resized = resized.to(torch.float32)
+    resized = (resized - in_min) / (in_max - in_min)
+
+    resized = torch.permute(resized, [1, 2, 0]).numpy()
+
+    return resized
+
+
+def select_tiling(h, w, patch_size, max_num_crops):
+    """Divide an image of size [h, w] into up to max_num_crops tiles of size patch_size"""
+    original_size = np.stack([h, w])  # [2]
+    original_res = h * w
+    tilings = []
+    for i in range(1, max_num_crops + 1):
+        for j in range(1, max_num_crops + 1):
+            if i * j <= max_num_crops:
+                tilings.append((i, j))
+    # sort so argmin and argmax favour smaller tilings in the event of a tie
+    tilings.sort(key=lambda x: (x[0] * x[1], x[0]))
+    candidate_tilings = np.array(tilings, dtype=np.int32)  # [n_resolutions, 2]
+    candidate_resolutions = candidate_tilings * patch_size  # [n_resolutions, 2]
+
+    # How much we would need to scale the image to fit exactly in each tiling
+    original_size = np.stack([h, w], dtype=np.float32)  # [2]
+
+    # The original size can be zero in rare cases if the image is smaller than the margin
+    # In those cases letting the scale become infinite means the tiling is based on the
+    # other side, or falls back to the smallest tiling
+    with np.errstate(divide="ignore"):
+        required_scale_d = candidate_resolutions.astype(np.float32) / original_size
+    required_scale = np.min(required_scale_d, axis=-1, keepdims=True)  # [n_resolutions, 1]
+    if np.all(required_scale < 1):
+        # We are forced to downscale, so try to minimize the amount of downscaling
+        ix = np.argmax(required_scale)
+    else:
+        # Pick the resolution that required the least upscaling so that it most closely fits the image
+        required_scale = np.where(required_scale < 1.0, 10e9, required_scale)
+        ix = np.argmin(required_scale)
+    return candidate_tilings[ix]
+
+
+def build_resized_image(
+    image: np.ndarray,
+    base_image_input_size: list[int],
+    resample: PILImageResampling,
+    image_mean: list[float],
+    image_std: list[float],
+    image_patch_size: int,
+) -> tuple[np.ndarray, np.ndarray]:
+    resized = resize_image(
+        image,
+        base_image_input_size,
+        resample,
+    )
+    resized = normalize_image(resized, image_mean, image_std)
+    if len(resized.shape) == 3:
+        resized = np.expand_dims(resized, 0)
+    crop_patch_w = base_image_input_size[1] // image_patch_size
+    crop_patch_h = base_image_input_size[0] // image_patch_size
+    resize_idx = np.arange(crop_patch_w * crop_patch_h).reshape([crop_patch_h, crop_patch_w])
+    return resized, resize_idx
+
+
+def build_overlapping_crops(
+    image: np.ndarray,
+    max_crops: int,
+    overlap_margins: list[int],
+    base_image_input_size: list[int],
+    resample: PILImageResampling,
+    image_mean: list[float],
+    image_std: list[float],
+    image_patch_size: int,
+) -> tuple[np.ndarray, np.ndarray]:
+    """Decompose an image into a set of overlapping crops
+
+    :return crop_arr: [n_crops, h, w, 3] The crops
+    :return patch_idx: [overlap_patch_h, overlap_patch_w] For each patch in the resized image
+                        the crops were extracted from, what patch in `crop_arr` it corresponds to
+    """
+    original_image_h, original_image_w = image.shape[:2]
+    crop_size = base_image_input_size[0]
+    assert base_image_input_size[0] == base_image_input_size[1]
+
+    left_margin, right_margin = overlap_margins
+    total_margin_pixels = image_patch_size * (right_margin + left_margin)  # pixels removed per dim
+    crop_patches = base_image_input_size[0] // image_patch_size  # patches per crop dim
+    crop_window_patches = crop_patches - (right_margin + left_margin)  # usable patches
+    crop_window_size = crop_window_patches * image_patch_size
+    crop_patch_w = base_image_input_size[1] // image_patch_size
+    crop_patch_h = base_image_input_size[0] // image_patch_size
+    original_image_h, original_image_w = image.shape[:2]
+    crop_size = base_image_input_size[0]
+
+    # Decide how to tile the image. To account for the overlap margins, we compute the tiling
+    # as if we had an image without the margins and were using a crop size without the margins
+    tiling = select_tiling(
+        original_image_h - total_margin_pixels,
+        original_image_w - total_margin_pixels,
+        crop_window_size,
+        max_crops,
+    )
+
+    src = resize_image(
+        image,
+        [
+            tiling[0] * crop_window_size + total_margin_pixels,
+            tiling[1] * crop_window_size + total_margin_pixels,
+        ],
+        resample,
+    )
+    src = normalize_image(src, image_mean, image_std)
+
+    # Now we have to split the image into crops, and track what patches came from
+    # where in `patch_idx_arr`
+    n_crops = tiling[0] * tiling[1]
+    crop_arr = np.zeros([n_crops, crop_size, crop_size, 3], dtype=src.dtype)
+    patch_idx_arr = np.zeros([n_crops, crop_patch_h, crop_patch_w], dtype=np.int32)
+    on_crop = 0
+    for i in range(tiling[0]):
+        # Slide over `src` by `crop_window_size` steps, but extract crops of size `crop_size`
+        # which results in overlapping crop windows
+        y0 = i * crop_window_size
+        for j in range(tiling[1]):
+            x0 = j * crop_window_size
+            crop_arr[on_crop] = src[y0 : y0 + crop_size, x0 : x0 + crop_size]
+            patch_idx = np.arange(crop_patch_w * crop_patch_h).reshape(crop_patch_h, crop_patch_w)
+            patch_idx += on_crop * crop_patch_h * crop_patch_w
+
+            # Mask out idx that are in the overlap region
+            if i != 0:
+                patch_idx[:left_margin, :] = -1
+            if j != 0:
+                patch_idx[:, :left_margin] = -1
+            if i != tiling[0] - 1:
+                patch_idx[-right_margin:, :] = -1
+            if j != tiling[1] - 1:
+                patch_idx[:, -right_margin:] = -1
+            patch_idx_arr[on_crop] = patch_idx
+            on_crop += 1
+
+    # `patch_idx_arr` is ordered crop-by-crop; here we transpose `patch_idx_arr`
+    # so it is in left-to-right order
+    patch_idx_arr = np.reshape(patch_idx_arr, [tiling[0], tiling[1], crop_patch_h, crop_patch_w])
+    patch_idx_arr = np.transpose(patch_idx_arr, [0, 2, 1, 3])
+    patch_idx_arr = np.reshape(patch_idx_arr, [-1])
+
+    # Now get the parts not in the overlap region, so it should map each patch in `src`
+    # to the correct patch it should come from in `crop_arr`
+    patch_idx_arr = patch_idx_arr[patch_idx_arr >= 0].reshape(
+        src.shape[0] // image_patch_size,
+        src.shape[1] // image_patch_size,
+    )
+    return crop_arr, patch_idx_arr
+
+
+def batch_pixels_to_patches(array: np.ndarray, patch_size: int) -> np.ndarray:
+    """Reshape images of [n_images, h, w, 3] -> [n_images, n_patches, pixels_per_patch]"""
+    if len(array.shape) == 3:
+        n_crops, h, w = array.shape
+        h_patches = h // patch_size
+        w_patches = w // patch_size
+        array = np.reshape(array, [n_crops, h_patches, patch_size, w_patches, patch_size])
+        array = np.transpose(array, [0, 1, 3, 2, 4])
+        array = np.reshape(array, [n_crops, h_patches * w_patches, patch_size * patch_size])
+        return array
+    else:
+        n_crops, h, w, c = array.shape
+        h_patches = h // patch_size
+        w_patches = w // patch_size
+        array = np.reshape(array, [n_crops, h_patches, patch_size, w_patches, patch_size, c])
+        array = np.transpose(array, [0, 1, 3, 2, 4, 5])
+        array = np.reshape(array, [n_crops, h_patches * w_patches, patch_size * patch_size * c])
+        return array
+
+
+def arange_for_pooling(
+    idx_arr: np.ndarray,
+    pool_h: int,
+    pool_w: int,
+) -> np.ndarray:
+    h_pad = pool_h * ((idx_arr.shape[0] + pool_h - 1) // pool_h) - idx_arr.shape[0]
+    w_pad = pool_w * ((idx_arr.shape[1] + pool_w - 1) // pool_w) - idx_arr.shape[1]
+    idx_arr = np.pad(
+        idx_arr,
+        [[h_pad // 2, (h_pad + 1) // 2], [w_pad // 2, (w_pad + 1) // 2]],
+        mode="constant",
+        constant_values=-1,
+    )
+    return einops.rearrange(idx_arr, "(h dh) (w dw) -> h w (dh dw)", dh=pool_h, dw=pool_w)
+
+
+def image_to_patches_and_grids(
+    image: np.ndarray,
+    max_crops: int,
+    overlap_margins: list[int],
+    base_image_input_size: list[int],
+    resample: PILImageResampling,
+    image_mean: list[float],
+    image_std: list[float],
+    image_patch_size: int,
+    image_pooling_w: int,
+    image_pooling_h: int,
+    crop_mode: str = "overlap-and-resize-c2",
+) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
+    """
+    :return image_grids, the shape of each (low-res, high-res) image after pooling
+    :return crops, the image crops to process with the ViT
+    :return pooled_patch_idx, for each patch_id token in `image_tokens`, the indices of the
+                                patches in `crops` to pool for that token, masked with -1
+    """
+    if isinstance(base_image_input_size, int):
+        base_image_input_size = (base_image_input_size, base_image_input_size)
+
+    base_image_input_d = image_patch_size
+    pooling_w = image_pooling_w
+    pooling_h = image_pooling_h
+    crop_patch_w = base_image_input_size[1] // base_image_input_d
+    crop_patch_h = base_image_input_size[0] // base_image_input_d
+
+    if crop_mode == "resize":
+        resized, resize_idx = build_resized_image(
+            image,
+            base_image_input_size,
+            resample,
+            image_mean,
+            image_std,
+            image_patch_size,
+        )
+        resize_idx = arange_for_pooling(resize_idx, pooling_h, pooling_w)
+        resized_h, resized_w = resize_idx.shape[:2]
+        resize_idx = resize_idx.reshape([-1, pooling_h * pooling_w])
+        image_grid = [np.array([resized_h, resized_w, 0, 0])]
+        return (
+            np.stack(image_grid, 0),
+            batch_pixels_to_patches(resized, image_patch_size),
+            resize_idx,
+        )
+
+    if crop_mode not in {"overlap-and-resize-c2", "overlap-and-resize"}:
+        raise ValueError(f"Unsupported MolmoAct2 image crop_mode {crop_mode!r}.")
+
+    crop_arr, patch_idx_arr = build_overlapping_crops(
+        image,
+        max_crops,
+        overlap_margins,
+        base_image_input_size,
+        resample,
+        image_mean,
+        image_std,
+        image_patch_size,
+    )
+    pooling_idx = arange_for_pooling(patch_idx_arr, pooling_h, pooling_w)
+    h, w = pooling_idx.shape[:2]
+    pooling_idx = pooling_idx.reshape([-1, pooling_h * pooling_w])
+
+    # Finally do the same for the global image
+    resized, resize_idx = build_resized_image(
+        image,
+        base_image_input_size,
+        resample,
+        image_mean,
+        image_std,
+        image_patch_size,
+    )
+    crop_arr = np.concatenate([resized, crop_arr], 0)
+
+    resize_idx = arange_for_pooling(resize_idx, pooling_h, pooling_w)
+    resized_h, resized_w = resize_idx.shape[:2]
+    resize_idx = resize_idx.reshape([-1, pooling_h * pooling_w])
+
+    # Global image goes first, so the order of patches in previous crops gets increased
+    pooling_idx = np.where(pooling_idx >= 0, pooling_idx + crop_patch_h * crop_patch_w, -1)
+    pooling_idx = np.concatenate([resize_idx, pooling_idx])
+    image_grid = [np.array([resized_h, resized_w, h, w])]
+
+    return (np.stack(image_grid, 0), batch_pixels_to_patches(crop_arr, image_patch_size), pooling_idx)
+
+
+class MolmoAct2ImagesKwargs(ImagesKwargs, total=False):
+    max_crops: int | None
+    overlap_margins: list[int] | None
+    crop_mode: str | None
+    patch_size: int | None
+    pooling_size: list[int] | None
+
+
+class MolmoAct2ImageProcessor(BaseImageProcessor):
+    r"""
+    Constructs a MolmoAct2 image processor that preprocesses images for the model.
+
+    Args:
+        size (`dict[str, int]`, *optional*, defaults to `{"height": 378, "width": 378}`):
+            Size of the image after resizing.
+        resample (`PILImageResampling`, *optional*, defaults to `Resampling.BILINEAR`):
+            Resampling filter to use when resizing the image.
+        image_mean (`float` or `list[float]`, *optional*, defaults to `[0.5, 0.5, 0.5]`):
+            Mean to use if normalizing the image. This is a float or list of floats for each channel in the image.
+        image_std (`float` or `list[float]`, *optional*, defaults to `[0.5, 0.5, 0.5]`):
+            Standard deviation to use if normalizing the image. This is a float or list of floats for each channel in the image.
+        do_convert_rgb (`bool`, *optional*, defaults to `True`):
+            Whether to convert the image to RGB.
+        max_crops (`int`, *optional*, defaults to `8`):
+            Maximum number of crops to use per image.
+        overlap_margins (`list[int]`, *optional*, defaults to `[4, 4]`):
+            Overlap margins to use.
+        patch_size (`int`, *optional*, defaults to `14`):
+            The spatial patch size of the vision encoder.
+        pooling_size (`list[int]`, *optional*, defaults to `[2, 2]`):
+            The pooling size of the vision adapter.
+    """
+
+    model_input_names = ["pixel_values", "image_token_pooling", "image_grids", "image_num_crops"]
+
+    def __init__(
+        self,
+        size: dict[str, int] | None = None,
+        resample: PILImageResampling = PILImageResampling.BILINEAR,
+        image_mean: float | list[float] | None = None,
+        image_std: float | list[float] | None = None,
+        do_convert_rgb: bool = True,
+        max_crops: int = 8,
+        overlap_margins: list[int] = [4, 4],
+        crop_mode: str = "overlap-and-resize-c2",
+        patch_size: int = 14,
+        pooling_size: list[int] = [2, 2],
+        **kwargs,
+    ) -> None:
+        super().__init__(**kwargs)
+        size = size if size is not None else {"height": 378, "width": 378}
+        size = get_size_dict(size, default_to_square=True)
+        self.size = size
+
+        self.resample = resample
+        self.image_mean = image_mean if image_mean is not None else IMAGENET_STANDARD_MEAN
+        self.image_std = image_std if image_std is not None else IMAGENET_STANDARD_STD
+        self.do_convert_rgb = do_convert_rgb
+
+        self.max_crops = max_crops
+        self.overlap_margins = overlap_margins
+        self.crop_mode = crop_mode
+        self.patch_size = patch_size
+        self.pooling_size = pooling_size
+
+    def preprocess(
+        self,
+        images: ImageInput,
+        size: dict[str, int] | None = None,
+        resample: PILImageResampling | None = None,
+        image_mean: float | list[float] | None = None,
+        image_std: float | list[float] | None = None,
+        do_convert_rgb: bool | None = None,
+        max_crops: int | None = None,
+        overlap_margins: list[int] | None = None,
+        crop_mode: str | None = None,
+        patch_size: int | None = None,
+        pooling_size: list[int] | None = None,
+        return_tensors: str | TensorType | None = None,
+        **kwargs,
+    ) -> BatchFeature:
+        """
+        Args:
+            images (`ImageInput`):
+                Image to preprocess.
+            size (`dict[str, int]`, *optional*, defaults to `self.size`):
+                Size of the image after resizing.
+            resample (`PILImageResampling`, *optional*, defaults to `self.resample`):
+                Resampling filter to use when resizing the image. This can be any member of the
+                `PILImageResampling` enum.
+            image_mean (`float` or `list[float]`, *optional*, defaults to `self.image_mean`):
+                Image mean to use for normalization. This is a float or list of floats for each channel in the image.
+            image_std (`float` or `list[float]`, *optional*, defaults to `self.image_std`):
+                Image standard deviation to use for normalization. This is a float or a list of floats, one
+                for each channel in the image.
+            do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
+                Whether to convert the image to RGB.
+            max_crops (`int`, *optional*, defaults to `self.max_crops`):
+                Maximum number of crops to use per image.
+            overlap_margins (`list[int]`, *optional*, defaults to `self.overlap_margins`):
+                Overlap margins to use.
+            patch_size (`int`, *optional*, defaults to `self.patch_size`):
+                The spatial patch size of the vision encoder.
+            pooling_size (`list[int]`, *optional*, defaults to `self.pooling_size`):
+                The pooling size of the vision adapter.
+            return_tensors (`str` or `TensorType`, *optional*):
+                The type of tensors to return. Can be one of:
+                - Unset: Return a list of `np.ndarray`.
+                - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
+                - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
+                - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
+                - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
+
+        Returns:
+            A `BatchFeature` containing the following keys:
+                - `pixel_values`: The preprocessed images.
+                - `image_token_pooling`: The indices of the patches in `crops` to pool for each token in `image_tokens`.
+                - `image_grids`: The image grids.
+                - `image_num_crops`: The number of crops for each image.
+        """
+        if size is not None:
+            if "height" not in size or "width" not in size:
+                raise ValueError("size must contain 'height' and 'width' keys.")
+        else:
+            size = {**self.size}
+
+        base_image_input_size = [size["height"], size["width"]]
+
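+        resample = resample if resample is not None else self.resample
+        image_mean = image_mean if image_mean is not None else self.image_mean
+        image_std = image_std if image_std is not None else self.image_std
+        do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb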
+
+        max_crops = max_crops or self.max_crops
+        overlap_margins = overlap_margins or self.overlap_margins
+        crop_mode = crop_mode or self.crop_mode
+        patch_size = patch_size or self.patch_size
+        pooling_size = pooling_size or self.pooling_size
+
+        image_pooling_h, image_pooling_w = pooling_size
+
+        if images is not None:
+            images = self.fetch_images(images)
+            images = make_flat_list_of_images(images)
+
+        if images is not None and not valid_images(images):
+            raise ValueError(
+                "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
+                "torch.Tensor, tf.Tensor or jax.ndarray."
+            )
+
+        if do_convert_rgb:
+            images = [convert_to_rgb(image) for image in images]
+
+        # All transformations expect numpy arrays.
+        images = [to_numpy_array(image) for image in images]
+
+        data = {}
+        if images is not None:
+            batch_grids = []
+            batch_crops = []
+            batch_pooled_patches_idx = []
+            batch_num_crops = []
+
+            for image in images:
+                image_grid, crops, pooled_idx = image_to_patches_and_grids(
+                    image,
+                    max_crops,
+                    overlap_margins,
+                    base_image_input_size,
+                    resample,
+                    image_mean,
+                    image_std,
+                    patch_size,
+                    image_pooling_w,
+                    image_pooling_h,
+                    crop_mode,
+                )
+                batch_grids.append(image_grid)
+                batch_crops.append(crops)
+                batch_pooled_patches_idx.append(pooled_idx)
+                batch_num_crops.append(crops.shape[0])
+
+            pixel_values = np.concatenate(batch_crops, 0)
+            image_token_pooling = np.concatenate(batch_pooled_patches_idx, 0)
+            image_grids = np.concatenate(batch_grids, 0)
+            image_num_crops = np.array(batch_num_crops)
+
+            data.update(
+                pixel_values=pixel_values,
+                image_token_pooling=image_token_pooling,
+                image_grids=image_grids,
+                image_num_crops=image_num_crops,
+            )
+
+        return BatchFeature(data, tensor_type=return_tensors)
+
+
+MolmoAct2ImageProcessor.register_for_auto_class()
--- a/src/lerobot/policies/molmoact2/hf_model/inference.py
+++ b/src/lerobot/policies/molmoact2/hf_model/inference.py
@@ -0,0 +1,748 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# ruff: noqa
+
+"""Inference utilities for MolmoAct2"""
+
+from dataclasses import dataclass
+from typing import Any, Optional, Tuple
+from collections.abc import Iterable, Sequence
+
+import torch
+from torch.nn import functional as F
+from transformers.cache_utils import Cache
+from transformers.configuration_utils import PretrainedConfig
+
+
+@dataclass
+class _ActionFlowInputs:
+    trajectory: torch.Tensor
+    context: Any
+    modulations: Sequence[Any]
+    action_dim_is_pad: torch.Tensor | None
+
+
+@dataclass
+class _ActionFlowCudaGraph:
+    key: tuple[Any, ...]
+    graph: torch.cuda.CUDAGraph
+    static_inputs: _ActionFlowInputs
+    output: torch.Tensor
+
+
+@dataclass
+class _DepthDecodeCudaGraphLayerStage:
+    residual: torch.Tensor
+    query: torch.Tensor
+    key: torch.Tensor
+    value: torch.Tensor
+
+
+@dataclass
+class _DepthDecodeCudaGraphPostStage:
+    graph: torch.cuda.CUDAGraph
+    attn_context: torch.Tensor
+
+
+@dataclass
+class _DepthDecodeCudaGraph:
+    cache_key: tuple[Any, ...]
+    pre_graph: torch.cuda.CUDAGraph
+    token_ids: torch.Tensor
+    cos: torch.Tensor
+    sin: torch.Tensor
+    positions: torch.Tensor
+    stages: Sequence[_DepthDecodeCudaGraphLayerStage]
+    post_graphs: Sequence[_DepthDecodeCudaGraphPostStage]
+    output: torch.Tensor
+
+
+@dataclass
+class _DepthDecodeCudaGraphSpec:
+    eligible: bool
+    cache_key_prefix: tuple[Any, ...]
+    num_hidden_layers: int
+    head_dim: int
+    num_attention_heads: int
+
+
+def _cache_seq_len_int(past_key_values: Cache | None) -> int:
+    if past_key_values is None:
+        return 0
+    seq_len = past_key_values.get_seq_length()
+    if torch.is_tensor(seq_len):
+        return int(seq_len.item())
+    return int(seq_len)
+
+
+def _cache_max_len_int(past_key_values: Cache | None) -> int:
+    if past_key_values is None:
+        return -1
+    max_len = past_key_values.get_max_cache_shape()
+    if torch.is_tensor(max_len):
+        return int(max_len.item())
+    return int(max_len)
+
+
+def _iter_cache_key_values(
+    past_key_values: Cache,
+) -> Iterable[tuple[torch.Tensor | None, torch.Tensor | None]]:
+    layers = getattr(past_key_values, "layers", None)
+    if layers is not None:
+        for layer in layers:
+            yield getattr(layer, "keys", None), getattr(layer, "values", None)
+        return
+    for layer in past_key_values:
+        yield layer[0], layer[1]
+
+
+class _DepthDecodeStaticLayerCache:
+    is_compileable = False
+    is_sliding = False
+
+    def __init__(self, max_cache_len: int) -> None:
+        self.max_cache_len = int(max_cache_len)
+        self.cumulative_length = 0
+        self.keys: torch.Tensor | None = None
+        self.values: torch.Tensor | None = None
+
+    def _allocate(self, key_states: torch.Tensor, value_states: torch.Tensor) -> None:
+        bsz, n_heads = key_states.shape[:2]
+        self.keys = torch.empty(
+            (bsz, n_heads, self.max_cache_len, key_states.shape[-1]),
+            dtype=key_states.dtype,
+            device=key_states.device,
+        )
+        self.values = torch.empty(
+            (bsz, n_heads, self.max_cache_len, value_states.shape[-1]),
+            dtype=value_states.dtype,
+            device=value_states.device,
+        )
+
+    def update(
+        self,
+        key_states: torch.Tensor,
+        value_states: torch.Tensor,
+        *args,
+        **kwargs,
+    ) -> tuple[torch.Tensor, torch.Tensor]:
+        if self.keys is None:
+            self._allocate(key_states, value_states)
+        start = self.cumulative_length
+        end = start + key_states.shape[-2]
+        if end > self.max_cache_len:
+            raise RuntimeError(f"KV cache length {end} exceeds max_cache_len={self.max_cache_len}.")
+        self.keys[:, :, start:end, :].copy_(key_states)
+        self.values[:, :, start:end, :].copy_(value_states)
+        self.cumulative_length = end
+        return self.keys[:, :, :end, :], self.values[:, :, :end, :]
+
+    def get_seq_length(self) -> int:
+        return self.cumulative_length
+
+    def get_max_cache_shape(self) -> int:
+        return -1
+
+    def reset(self) -> None:
+        self.cumulative_length = 0
+
+
+class _DepthDecodeStaticCache(Cache):
+    def __init__(self, config: PretrainedConfig, max_cache_len: int) -> None:
+        text_config = config.get_text_config(decoder=True)
+        super().__init__(
+            layers=[
+                _DepthDecodeStaticLayerCache(max_cache_len=max_cache_len)
+                for _ in range(text_config.num_hidden_layers)
+            ]
+        )
+
+    def get_seq_length(self, layer_idx: int = 0) -> int:
+        return self.layers[layer_idx].get_seq_length()
+
+    def get_max_cache_shape(self, layer_idx: int = 0) -> int:
+        return self.layers[layer_idx].get_max_cache_shape()
+
+    def reset(self) -> None:
+        for layer in self.layers:
+            layer.reset()
+
+
+class ActionCudaGraphManager:
+    def __init__(self, model: Any) -> None:
+        self.model = model
+        self.enabled = True
+        self.action_flow_graph: _ActionFlowCudaGraph | None = None
+
+    def set_enabled(self, enabled: bool) -> None:
+        self.enabled = bool(enabled)
+
+    def can_use_action_flow(self, inputs: _ActionFlowInputs) -> bool:
+        action_model = self.model
+        if not self.enabled:
+            return False
+        if action_model.training or action_model._require_action_expert().training:
+            return False
+        if inputs.trajectory.device.type != "cuda":
+            return False
+
+        def all_on_cuda():
+            yield inputs.trajectory
+            for k, v in inputs.context.kv_contexts:
+                yield k
+                yield v
+            for t in (
+                inputs.context.cross_mask,
+                inputs.context.self_mask,
+                inputs.context.valid_action,
+                inputs.action_dim_is_pad,
+            ):
+                if t is not None:
+                    yield t
+            if inputs.context.rope_cache is not None:
+                yield from inputs.context.rope_cache
+            for step in inputs.modulations:
+                yield step.conditioning
+                for block_modulation in step.block_modulations:
+                    yield from block_modulation
+                yield from step.final_modulation
+
+        return all(t.device.type == "cuda" for t in all_on_cuda())
+
+    def run_action_flow(
+        self,
+        inputs: _ActionFlowInputs,
+        steps: int,
+        run_loop,
+    ) -> torch.Tensor:
+        key = _cuda_graph_key(inputs, steps)
+        cache = self.action_flow_graph
+        if cache is None or cache.key != key:
+            static_inputs = _clone_static_inputs(inputs)
+            graph, output = _capture_cuda_graph(
+                lambda: run_loop(static_inputs, steps),
+                inputs.trajectory.device,
+                after_warmup=lambda: static_inputs.trajectory.copy_(inputs.trajectory),
+            )
+            cache = _ActionFlowCudaGraph(
+                key=key,
+                graph=graph,
+                static_inputs=static_inputs,
+                output=output,
+            )
+            self.action_flow_graph = cache
+        else:
+            _copy_inputs_(cache.static_inputs, inputs)
+
+        cache.graph.replay()
+        return cache.output.clone()
+
+
+class DepthDecodeCudaGraphManager:
+    def __init__(self, model: Any) -> None:
+        self.model = model
+        self.backbone = model.model
+        self.enabled = True
+        self.graph: _DepthDecodeCudaGraph | None = None
+        self.graph_spec: _DepthDecodeCudaGraphSpec | None = None
+
+    def set_enabled(self, enabled: bool) -> None:
+        self.enabled = bool(enabled)
+
+    def make_static_cache(self, max_cache_len: int) -> _DepthDecodeStaticCache:
+        return _DepthDecodeStaticCache(
+            config=self.model.config.text_config,
+            max_cache_len=max_cache_len,
+        )
+
+    def _depth_decode_spec(self) -> _DepthDecodeCudaGraphSpec:
+        static = self.graph_spec
+        if static is None:
+            cfg = self.backbone.transformer.config
+            rotary_emb = getattr(self.backbone.transformer, "rotary_emb", None)
+            static = _DepthDecodeCudaGraphSpec(
+                eligible=(
+                    not cfg.norm_after
+                    and cfg.rope_scaling_layers is None
+                    and getattr(rotary_emb, "rope_type", None) == "default"
+                    and cfg._attn_implementation == "sdpa"
+                ),
+                cache_key_prefix=(
+                    cfg.hidden_size,
+                    cfg.num_attention_heads,
+                    cfg.num_key_value_heads,
+                    cfg.head_dim,
+                    cfg.num_hidden_layers,
+                    cfg.use_qk_norm,
+                    cfg.qk_norm_type,
+                    cfg._attn_implementation,
+                ),
+                num_hidden_layers=cfg.num_hidden_layers,
+                head_dim=cfg.head_dim,
+                num_attention_heads=cfg.num_attention_heads,
+            )
+            self.graph_spec = static
+        return static
+
+    def can_use(
+        self,
+        next_input_ids: torch.Tensor,
+        *,
+        past_key_values: Cache,
+        attention_bias: torch.Tensor,
+    ) -> bool:
+        if not self.enabled or self.model.training or self.backbone.transformer.training:
+            return False
+        if next_input_ids.device.type != "cuda":
+            return False
+        if next_input_ids.ndim != 2 or next_input_ids.shape[0] != 1 or next_input_ids.shape[1] != 1:
+            return False
+        if not isinstance(past_key_values, _DepthDecodeStaticCache):
+            return False
+        if not torch.is_tensor(attention_bias) or attention_bias.device != next_input_ids.device:
+            return False
+        return self._depth_decode_spec().eligible
+
+    def _depth_decode_key(
+        self,
+        next_input_ids: torch.Tensor,
+        attention_bias: torch.Tensor,
+    ) -> tuple[Any, ...]:
+        device = next_input_ids.device
+        return (
+            self._depth_decode_spec().cache_key_prefix,
+            device.type,
+            device.index,
+            self.model.lm_head.weight.dtype,
+            attention_bias.shape[-1],
+        )
+
+    def _select_depth_decode_rope(self, cos: torch.Tensor, sin: torch.Tensor, *, past_length: int) -> None:
+        emb = self.backbone.transformer.rotary_emb
+        cos.copy_(emb._pos_cos_cache[0, :, past_length : past_length + 1, :])
+        sin.copy_(emb._pos_sin_cache[0, :, past_length : past_length + 1, :])
+
+    def _depth_decode_pre_layer(
+        self,
+        layer_idx: int,
+        hidden_states: torch.Tensor,
+        cos: torch.Tensor,
+        sin: torch.Tensor,
+    ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
+        block = self.backbone.transformer.blocks[layer_idx]
+        attention = block.self_attn
+        residual = hidden_states
+        hidden_states = block.attn_norm(hidden_states)
+
+        input_shape = hidden_states.shape[:-1]
+        hidden_shape = (*input_shape, -1, attention.head_dim)
+        qkv = attention.att_proj(hidden_states)
+        query_states, key_states, value_states = qkv.split(attention.fused_dims, dim=-1)
+        value_states = value_states.view(hidden_shape)
+
+        apply_qk_norm = attention.q_norm is not None and attention.k_norm is not None
+        norm_after_view = apply_qk_norm and attention.qk_norm_type == "qwen3"
+
+        if apply_qk_norm and not norm_after_view:
+            query_states = attention.q_norm(query_states)
+            key_states = attention.k_norm(key_states)
+
+        query_states = query_states.view(hidden_shape)
+        key_states = key_states.view(hidden_shape)
+
+        if norm_after_view:
+            query_states = attention.q_norm(query_states)
+            key_states = attention.k_norm(key_states)
+
+        query_states = query_states.transpose(1, 2)
+        key_states = key_states.transpose(1, 2)
+        value_states = value_states.transpose(1, 2)
+        query_states, key_states = _apply_rotary_pos_emb(query_states, key_states, cos, sin)
+        return residual, query_states, key_states, value_states
+
+    def _depth_decode_pre0(
+        self,
+        token_ids: torch.Tensor,
+        cos: torch.Tensor,
+        sin: torch.Tensor,
+    ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
+        inputs_embeds = self.model._embed_base_tokens(token_ids)
+        return self._depth_decode_pre_layer(0, inputs_embeds, cos, sin)
+
+    def _depth_decode_post_layer(
+        self,
+        layer_idx: int,
+        residual: torch.Tensor,
+        attn_context: torch.Tensor,
+    ) -> torch.Tensor:
+        block = self.backbone.transformer.blocks[layer_idx]
+        attention = block.self_attn
+        input_shape = residual.shape[:-1]
+        attn_output = attn_context.reshape(*input_shape, -1).contiguous()
+        attn_output = attention.attn_out(attn_output)
+        hidden_states = residual + block.dropout(attn_output)
+
+        residual = hidden_states
+        hidden_states = block.ff_norm(hidden_states)
+        hidden_states = block.mlp(hidden_states)
+        hidden_states = residual + block.dropout(hidden_states)
+        return hidden_states
+
+    def _depth_decode_post_and_pre_next(
+        self,
+        layer_idx: int,
+        residual: torch.Tensor,
+        attn_context: torch.Tensor,
+        cos: torch.Tensor,
+        sin: torch.Tensor,
+    ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
+        hidden_states = self._depth_decode_post_layer(layer_idx, residual, attn_context)
+        return self._depth_decode_pre_layer(layer_idx + 1, hidden_states, cos, sin)
+
+    def _depth_decode_last_post(
+        self,
+        layer_idx: int,
+        residual: torch.Tensor,
+        attn_context: torch.Tensor,
+    ) -> torch.Tensor:
+        hidden_states = self._depth_decode_post_layer(layer_idx, residual, attn_context)
+        return self.backbone.transformer.ln_f(hidden_states)
+
+    def _build_depth_decode_graph(
+        self,
+        next_input_ids: torch.Tensor,
+        *,
+        past_length: int,
+        attention_bias: torch.Tensor,
+    ) -> _DepthDecodeCudaGraph:
+        text_config = self.backbone.transformer.config
+        device = next_input_ids.device
+        dtype = self.model.lm_head.weight.dtype
+        static = self._depth_decode_spec()
+        num_layers = static.num_hidden_layers
+        head_dim = static.head_dim
+        max_cache_len = int(attention_bias.shape[-1])
+        max_rope_len = max(int(text_config.max_position_embeddings or 0), max_cache_len)
+        self.backbone.transformer.prepare_rope_cache(device=device, max_seq_len=max_rope_len)
+
+        token_ids = torch.empty((1, 1), device=device, dtype=torch.long)
+        cos = torch.empty((1, 1, head_dim), device=device, dtype=dtype)
+        sin = torch.empty_like(cos)
+        positions = torch.arange(max_cache_len, device=device, dtype=torch.long)
+        context_shape = (1, 1, static.num_attention_heads, head_dim)
+
+        token_ids.copy_(next_input_ids)
+        self._select_depth_decode_rope(cos, sin, past_length=past_length)
+
+        pre_graph, pre_output = _capture_cuda_graph(
+            lambda: self._depth_decode_pre0(token_ids, cos, sin),
+            device,
+        )
+        stages = [_DepthDecodeCudaGraphLayerStage(*pre_output)]
+        post_graphs = []
+        for layer_idx in range(num_layers - 1):
+            stage = stages[-1]
+            attn_context = torch.empty(context_shape, device=device, dtype=dtype)
+            graph, output = _capture_cuda_graph(
+                lambda layer_idx=layer_idx, stage=stage, attn_context=attn_context: (
+                    self._depth_decode_post_and_pre_next(
+                        layer_idx,
+                        stage.residual,
+                        attn_context,
+                        cos,
+                        sin,
+                    )
+                ),
+                device,
+            )
+            post_graphs.append(_DepthDecodeCudaGraphPostStage(graph=graph, attn_context=attn_context))
+            stages.append(_DepthDecodeCudaGraphLayerStage(*output))
+
+        last_stage = stages[-1]
+        last_attn_context = torch.empty(context_shape, device=device, dtype=dtype)
+        last_graph, last_output = _capture_cuda_graph(
+            lambda: self._depth_decode_last_post(
+                num_layers - 1,
+                last_stage.residual,
+                last_attn_context,
+            ),
+            device,
+        )
+        post_graphs.append(_DepthDecodeCudaGraphPostStage(graph=last_graph, attn_context=last_attn_context))
+        return _DepthDecodeCudaGraph(
+            cache_key=self._depth_decode_key(next_input_ids, attention_bias),
+            pre_graph=pre_graph,
+            token_ids=token_ids,
+            cos=cos,
+            sin=sin,
+            positions=positions,
+            stages=tuple(stages),
+            post_graphs=tuple(post_graphs),
+            output=last_output,
+        )
+
+    def _get_depth_decode_graph(
+        self,
+        next_input_ids: torch.Tensor,
+        *,
+        past_length: int,
+        attention_bias: torch.Tensor,
+    ) -> _DepthDecodeCudaGraph:
+        key = self._depth_decode_key(next_input_ids, attention_bias)
+        decode_graph = self.graph
+        if decode_graph is None or decode_graph.cache_key != key:
+            decode_graph = self._build_depth_decode_graph(
+                next_input_ids,
+                past_length=past_length,
+                attention_bias=attention_bias,
+            )
+            self.graph = decode_graph
+        else:
+            decode_graph.token_ids.copy_(next_input_ids)
+            self._select_depth_decode_rope(decode_graph.cos, decode_graph.sin, past_length=past_length)
+        return decode_graph
+
+    def _run_depth_decode_attention_core(
+        self,
+        layer_idx: int,
+        stage: _DepthDecodeCudaGraphLayerStage,
+        *,
+        past_key_values: Cache,
+        attention_bias: torch.Tensor,
+        cache_position: torch.Tensor,
+        cos: torch.Tensor,
+        sin: torch.Tensor,
+    ) -> torch.Tensor:
+        attention = self.backbone.transformer.blocks[layer_idx].self_attn
+        cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
+        key_states, value_states = past_key_values.update(
+            stage.key,
+            stage.value,
+            layer_idx,
+            cache_kwargs,
+        )
+        key_states = _repeat_kv(key_states, attention.num_key_value_groups)
+        value_states = _repeat_kv(value_states, attention.num_key_value_groups)
+        attn_output = F.scaled_dot_product_attention(
+            stage.query,
+            key_states,
+            value_states,
+            attn_mask=attention_bias,
+            dropout_p=0.0,
+            is_causal=False,
+        )
+        return attn_output.transpose(1, 2)
+
+    def run(
+        self,
+        next_input_ids: torch.Tensor,
+        *,
+        past_key_values: Cache,
+        attention_bias: torch.Tensor,
+        past_length: int,
+    ) -> tuple[torch.Tensor, Cache]:
+        end = past_length + 1
+        decode_graph = self._get_depth_decode_graph(
+            next_input_ids,
+            past_length=past_length,
+            attention_bias=attention_bias,
+        )
+        cache_position = decode_graph.positions[past_length:end]
+        attention_bias_q = attention_bias[:, :, past_length:end, :end]
+
+        decode_graph.pre_graph.replay()
+
+        for layer_idx, post_graph in enumerate(decode_graph.post_graphs):
+            attn_context = self._run_depth_decode_attention_core(
+                layer_idx,
+                decode_graph.stages[layer_idx],
+                past_key_values=past_key_values,
+                attention_bias=attention_bias_q,
+                cache_position=cache_position,
+                cos=decode_graph.cos,
+                sin=decode_graph.sin,
+            )
+            post_graph.attn_context.copy_(attn_context)
+            post_graph.graph.replay()
+
+        return decode_graph.output, past_key_values
+
+
+def _cuda_graph_tensor_signature(
+    tensor: torch.Tensor | None,
+) -> tuple[Any, ...] | None:
+    if tensor is None:
+        return None
+    return (
+        tuple(tensor.shape),
+        tuple(tensor.stride()),
+        str(tensor.dtype),
+        str(tensor.device),
+    )
+
+
+def _cuda_graph_context_signature(context: Any) -> tuple[Any, ...]:
+    sig = _cuda_graph_tensor_signature
+    return (
+        tuple((sig(k), sig(v)) for k, v in context.kv_contexts),
+        sig(context.cross_mask),
+        sig(context.self_mask),
+        sig(context.valid_action),
+        None if context.rope_cache is None else tuple(sig(t) for t in context.rope_cache),
+    )
+
+
+def _cuda_graph_modulation_signature(modulations: Sequence[Any]) -> tuple[Any, ...]:
+    sig = _cuda_graph_tensor_signature
+    return tuple(
+        (
+            sig(step.conditioning),
+            tuple(tuple(sig(t) for t in block_modulation) for block_modulation in step.block_modulations),
+            tuple(sig(t) for t in step.final_modulation),
+        )
+        for step in modulations
+    )
+
+
+def _cuda_graph_key(inputs: _ActionFlowInputs, steps: int) -> tuple[Any, ...]:
+    sig = _cuda_graph_tensor_signature
+    return (
+        sig(inputs.trajectory),
+        _cuda_graph_context_signature(inputs.context),
+        _cuda_graph_modulation_signature(inputs.modulations),
+        sig(inputs.action_dim_is_pad),
+        int(steps),
+    )
+
+
+def _clone_static_tensor(tensor: torch.Tensor | None) -> torch.Tensor | None:
+    if tensor is None:
+        return None
+    static = torch.empty_strided(
+        tuple(tensor.shape),
+        tuple(tensor.stride()),
+        device=tensor.device,
+        dtype=tensor.dtype,
+    )
+    static.copy_(tensor)
+    return static
+
+
+def _clone_static_context(context: Any) -> Any:
+    rope_cache = None
+    if context.rope_cache is not None:
+        rope_cache = tuple(_clone_static_tensor(t) for t in context.rope_cache)
+    return context.__class__(
+        kv_contexts=tuple((_clone_static_tensor(k), _clone_static_tensor(v)) for k, v in context.kv_contexts),
+        cross_mask=_clone_static_tensor(context.cross_mask),
+        self_mask=_clone_static_tensor(context.self_mask),
+        valid_action=_clone_static_tensor(context.valid_action),
+        rope_cache=rope_cache,
+    )
+
+
+def _clone_static_modulations(modulations: Sequence[Any]) -> Sequence[Any]:
+    return tuple(
+        step.__class__(
+            conditioning=_clone_static_tensor(step.conditioning),
+            block_modulations=tuple(
+                tuple(_clone_static_tensor(t) for t in block_modulation)
+                for block_modulation in step.block_modulations
+            ),
+            final_modulation=tuple(_clone_static_tensor(t) for t in step.final_modulation),
+        )
+        for step in modulations
+    )
+
+
+def _clone_static_inputs(inputs: _ActionFlowInputs) -> _ActionFlowInputs:
+    return _ActionFlowInputs(
+        trajectory=_clone_static_tensor(inputs.trajectory),
+        context=_clone_static_context(inputs.context),
+        modulations=_clone_static_modulations(inputs.modulations),
+        action_dim_is_pad=_clone_static_tensor(inputs.action_dim_is_pad),
+    )
+
+
+def _copy_context_(dst: Any, src: Any) -> None:
+    for (dst_k, dst_v), (src_k, src_v) in zip(dst.kv_contexts, src.kv_contexts):
+        dst_k.copy_(src_k)
+        dst_v.copy_(src_v)
+    if src.cross_mask is not None:
+        dst.cross_mask.copy_(src.cross_mask)
+    if src.self_mask is not None:
+        dst.self_mask.copy_(src.self_mask)
+    if src.valid_action is not None:
+        dst.valid_action.copy_(src.valid_action)
+    if src.rope_cache is not None:
+        for dst_tensor, src_tensor in zip(dst.rope_cache, src.rope_cache):
+            dst_tensor.copy_(src_tensor)
+
+
+def _copy_inputs_(dst: _ActionFlowInputs, src: _ActionFlowInputs) -> None:
+    dst.trajectory.copy_(src.trajectory)
+    _copy_context_(dst.context, src.context)
+    if src.action_dim_is_pad is not None:
+        dst.action_dim_is_pad.copy_(src.action_dim_is_pad)
+
+
+def _rotate_half(x: torch.Tensor) -> torch.Tensor:
+    x1 = x[..., : x.shape[-1] // 2]
+    x2 = x[..., x.shape[-1] // 2 :]
+    return torch.cat((-x2, x1), dim=-1)
+
+
+def _apply_rotary_pos_emb(
+    q: torch.Tensor,
+    k: torch.Tensor,
+    cos: torch.Tensor,
+    sin: torch.Tensor,
+    unsqueeze_dim: int = 1,
+) -> tuple[torch.Tensor, torch.Tensor]:
+    cos = cos.unsqueeze(unsqueeze_dim)
+    sin = sin.unsqueeze(unsqueeze_dim)
+    q_embed = (q * cos) + (_rotate_half(q) * sin)
+    k_embed = (k * cos) + (_rotate_half(k) * sin)
+    return q_embed, k_embed
+
+
+def _repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
+    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
+    if n_rep == 1:
+        return hidden_states
+    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
+    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
+
+
+def _capture_cuda_graph(
+    fn,
+    device: torch.device,
+    *,
+    after_warmup=None,
+) -> tuple[torch.cuda.CUDAGraph, Any]:
+    warmup_stream = torch.cuda.Stream(device=device)
+    warmup_stream.wait_stream(torch.cuda.current_stream(device))
+    with torch.cuda.stream(warmup_stream):
+        fn()
+    torch.cuda.current_stream(device).wait_stream(warmup_stream)
+    if after_warmup is not None:
+        after_warmup()
+
+    graph = torch.cuda.CUDAGraph()
+    with torch.cuda.graph(graph):
+        output = fn()
+    return graph, output
--- a/src/lerobot/policies/molmoact2/hf_model/modeling_molmoact2.py
+++ b/src/lerobot/policies/molmoact2/hf_model/modeling_molmoact2.py
--- a/src/lerobot/policies/molmoact2/hf_model/processing_molmoact2.py
+++ b/src/lerobot/policies/molmoact2/hf_model/processing_molmoact2.py
@@ -0,0 +1,431 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# ruff: noqa
+
+"""
+Processor class for MolmoAct2.
+"""
+
+from typing import Optional, Union
+import dataclasses
+
+import numpy as np
+
+from transformers.image_utils import ImageInput
+from transformers.video_utils import VideoInput
+from transformers.processing_utils import (
+    Unpack,
+    ProcessingKwargs,
+    ProcessorMixin,
+)
+from transformers.feature_extraction_utils import BatchFeature
+from transformers.tokenization_utils_base import TextInput, PreTokenizedInput
+from transformers.utils import logging
+
+from transformers import AutoTokenizer
+from .image_processing_molmoact2 import MolmoAct2ImagesKwargs, MolmoAct2ImageProcessor
+from .video_processing_molmoact2 import MolmoAct2VideoProcessorKwargs, MolmoAct2VideoProcessor
+
+
+logger = logging.get_logger(__name__)
+
+
+# Special tokens; these must be present in any tokenizer we use, since the preprocessor relies on them.
+IMAGE_PATCH_TOKEN = "<im_patch>"  # Where to insert high-res tokens
+IMAGE_LOW_RES_TOKEN = "<im_low>"  # Where to insert low-res tokens
+IM_START_TOKEN = "<im_start>"
+LOW_RES_IMAGE_START_TOKEN = "<low_res_im_start>"
+FRAME_START_TOKEN = "<frame_start>"
+IM_END_TOKEN = "<im_end>"
+FRAME_END_TOKEN = "<frame_end>"
+IM_COL_TOKEN = "<im_col>"
+IMAGE_PROMPT = "<|image|>"
+VIDEO_PROMPT = "<|video|>"
+
+IMAGE_TOKENS = [
+    IMAGE_PATCH_TOKEN,
+    IM_COL_TOKEN,
+    IM_START_TOKEN,
+    LOW_RES_IMAGE_START_TOKEN,
+    FRAME_START_TOKEN,
+    IM_END_TOKEN,
+    FRAME_END_TOKEN,
+    IMAGE_LOW_RES_TOKEN,
+]
+
+
+class MolmoAct2ProcessorKwargs(ProcessingKwargs, total=False):
+    """MolmoAct2 processor kwargs"""
+
+    images_kwargs: MolmoAct2ImagesKwargs
+    videos_kwargs: MolmoAct2VideoProcessorKwargs
+    _defaults = {
+        "text_kwargs": {
+            "padding": False,
+            "return_mm_token_type_ids": True,
+        },
+        "videos_kwargs": {"return_metadata": True},
+    }
+
+
+class MolmoAct2Processor(ProcessorMixin):
+    attributes = ["image_processor", "video_processor", "tokenizer"]
+    optional_attributes = [
+        "chat_template",
+        "time_mode",
+        "image_use_col_tokens",
+        "use_single_crop_col_tokens",
+        "use_single_crop_start_token",
+        "video_use_col_tokens",
+        "use_frame_special_tokens",
+    ]
+    image_processor_class = "AutoImageProcessor"
+    video_processor_class = "AutoVideoProcessor"
+    tokenizer_class = "AutoTokenizer"
+
+    def __init__(
+        self,
+        image_processor: MolmoAct2ImageProcessor = None,
+        video_processor: MolmoAct2VideoProcessor = None,
+        tokenizer: AutoTokenizer = None,
+        chat_template: str | None = None,
+        image_use_col_tokens: bool | None = True,
+        use_single_crop_col_tokens: bool | None = None,
+        use_single_crop_start_token: bool | None = True,
+        video_use_col_tokens: bool | None = False,
+        use_frame_special_tokens: bool | None = True,
+        **kwargs,
+    ) -> None:
+        super().__init__(
+            image_processor,
+            video_processor,
+            tokenizer,
+            chat_template=chat_template,
+        )
+        self.image_use_col_tokens = image_use_col_tokens
+        self.use_single_crop_col_tokens = use_single_crop_col_tokens
+        self.use_single_crop_start_token = use_single_crop_start_token
+        self.video_use_col_tokens = video_use_col_tokens
+        self.use_frame_special_tokens = use_frame_special_tokens
+
+        self.image_placeholder_token = IMAGE_PROMPT
+        self.video_placeholder_token = VIDEO_PROMPT
+        self.image_token_ids = [tokenizer.convert_tokens_to_ids(token) for token in IMAGE_TOKENS]
+
+    def get_image_tokens(self, image_grid: np.ndarray):
+        resized_h, resized_w, height, width = image_grid
+        if int(height) == 0 or int(width) == 0:
+            per_row = np.full(resized_w, IMAGE_PATCH_TOKEN)
+            use_single_crop_col_tokens = (
+                self.image_use_col_tokens
+                if self.use_single_crop_col_tokens is None
+                else self.use_single_crop_col_tokens
+            )
+            if use_single_crop_col_tokens:
+                per_row = np.concatenate([per_row, [IM_COL_TOKEN]], 0)
+            joint = [
+                [IM_START_TOKEN],
+                np.tile(per_row, [resized_h]),
+                [IM_END_TOKEN],
+            ]
+            return np.concatenate(joint)
+        per_row = np.full(width, IMAGE_PATCH_TOKEN)
+        if self.image_use_col_tokens:
+            per_row = np.concatenate([per_row, [IM_COL_TOKEN]], 0)
+        joint = [
+            [IM_START_TOKEN],
+            np.tile(per_row, [height]),
+            [IM_END_TOKEN],
+        ]
+        per_row = np.full(resized_w, IMAGE_PATCH_TOKEN)
+        use_single_crop_col_tokens = (
+            self.image_use_col_tokens
+            if self.use_single_crop_col_tokens is None
+            else self.use_single_crop_col_tokens
+        )
+        image_start_token = LOW_RES_IMAGE_START_TOKEN if self.use_single_crop_start_token else IM_START_TOKEN
+        if use_single_crop_col_tokens:
+            per_row = np.concatenate([per_row, [IM_COL_TOKEN]], 0)
+        joint = [
+            [image_start_token],
+            np.tile(per_row, [resized_h]),
+            [IM_END_TOKEN],
+        ] + joint
+
+        return np.concatenate(joint)
+
+    def get_video_string(
+        self,
+        video_grid: np.ndarray,
+        timestamps: np.ndarray,
+    ):
+        if self.use_frame_special_tokens:
+            start_token_id = FRAME_START_TOKEN
+            end_token_id = FRAME_END_TOKEN
+        else:
+            start_token_id = IM_START_TOKEN
+            end_token_id = IM_END_TOKEN
+
+        num_frames, h, w = video_grid
+        video_string: str = ""
+        for frame_idx, frame_time in enumerate(timestamps):
+            # `per-frame-compact` time mode
+            prev_space = " " if frame_idx > 0 else ""
+            frame_prefix = prev_space + f"{frame_time:.1f} "  # explicit whitespace before/after image tokens
+
+            video_string += frame_prefix
+            per_row = np.full(w, IMAGE_PATCH_TOKEN)
+            if self.video_use_col_tokens:
+                per_row = np.concatenate([per_row, [IM_COL_TOKEN]], 0)
+            extra_tokens = np.tile(per_row, [h])
+            video_tokens = [
+                [start_token_id],
+                extra_tokens,
+                [end_token_id],
+            ]
+            video_string += "".join(np.concatenate(video_tokens, 0))
+
+        return video_string
+
+    def insert_bos(
+        self,
+        input_ids: np.ndarray,
+        attention_mask: np.ndarray,
+        bos_token_id: int,
+        pad_token_id: int,
+    ):
+        """
+        Args:
+            input_ids: [B, S] array with left padding
+            attention_mask: [B, S] array (0 for pad, 1 for valid)
+            bos_token_id: int
+            pad_token_id: int
+        Returns:
+            input_ids_out: [B, S] or [B, S+1] array with bos inserted if needed
+            attention_mask_out: same shape as input_ids_out
+        """
+
+        need_to_expand = len(input_ids.shape) == 1
+        if need_to_expand:
+            input_ids = input_ids[None, :]
+            attention_mask = attention_mask[None, :]
+
+        B, S = input_ids.shape
+
+        # Handle zero-length sequence
+        if S == 0:
+            new_input_ids = np.full((B, 1), bos_token_id, dtype=input_ids.dtype)
+            new_attention_mask = np.ones((B, 1), dtype=attention_mask.dtype)
+            if need_to_expand:
+                new_input_ids = new_input_ids[0]
+                new_attention_mask = new_attention_mask[0]
+            return new_input_ids, new_attention_mask
+
+        first_valid_index = (attention_mask == 1).argmax(axis=-1)  # [B]
+        bos_already_present = np.all(input_ids[np.arange(B), first_valid_index] == bos_token_id)
+
+        if bos_already_present:
+            if need_to_expand:
+                input_ids = input_ids[0]
+                attention_mask = attention_mask[0]
+            return input_ids, attention_mask
+        else:
+            new_input_ids = np.full((B, S + 1), pad_token_id, dtype=input_ids.dtype)
+            new_attention_mask = np.zeros((B, S + 1), dtype=attention_mask.dtype)
+
+            src_idx = np.tile(np.arange(S), (B, 1))  # [B, S]
+            valid_mask = src_idx >= first_valid_index[:, None]  # [B, S]
+            tgt_idx = src_idx + 1  # shift right
+            batch_idx = np.tile(np.arange(B)[:, None], (1, S))  # [B, S]
+
+            # gather values and target indices at valid (non-pad) positions
+            flat_vals = input_ids[valid_mask]
+            flat_batch = batch_idx[valid_mask]
+            flat_tgt = tgt_idx[valid_mask]
+
+            new_input_ids[flat_batch, flat_tgt] = flat_vals
+            new_attention_mask[flat_batch, flat_tgt] = 1
+
+            insert_pos = first_valid_index
+            new_input_ids[np.arange(B), insert_pos] = bos_token_id
+            new_attention_mask[np.arange(B), insert_pos] = 1
+
+            if need_to_expand:
+                new_input_ids = new_input_ids[0]
+                new_attention_mask = new_attention_mask[0]
+
+            return new_input_ids, new_attention_mask
+
+    def __call__(
+        self,
+        text: TextInput | PreTokenizedInput | list[TextInput] | list[PreTokenizedInput] = None,
+        images: ImageInput = None,
+        videos: VideoInput = None,
+        **kwargs: Unpack[MolmoAct2ProcessorKwargs],
+    ) -> BatchFeature:
+        """Prepare text, images, and videos as model inputs.
+
+        Args:
+            text (`str`, `list[str]`, `list[list[str]]`):
+                The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
+                (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
+                `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
+            images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `list[PIL.Image.Image]`, `list[np.ndarray]`, `list[torch.Tensor]`):
+                The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
+                tensor. Both channels-first and channels-last formats are supported.
+            videos (`dict[str, Any]` or `list[dict[str, Any]]`):
+                The video or batch of videos to be prepared. Each video can be a dictionary with the following keys:
+                - `"frames"`: `np.ndarray` of shape (T, H, W, 3)
+                - `"timestamps"`: `np.ndarray` of shape (T,)
+                - `"sampled_fps"`: `float` (optional)
+                - `"sampling_augmentation"`: `str` (optional)
+            return_tensors (`str` or [`~utils.TensorType`], *optional*):
+                If set, will return tensors of a particular framework. Acceptable values are:
+                - `'tf'`: Return TensorFlow `tf.constant` objects.
+                - `'pt'`: Return PyTorch `torch.Tensor` objects.
+                - `'np'`: Return NumPy `np.ndarray` objects.
+                - `'jax'`: Return JAX `jnp.ndarray` objects.
+
+        Returns:
+            `BatchFeature`: A [`BatchFeature`] with the following fields:
+            - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
+            - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
+              `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not `None`).
+            - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
+            - **image_token_pooling** -- Indices of the patches in `image_grids` to pool for each token in `image_tokens`.
+              Returned when `images` is not `None`.
+            - **image_grids** -- Grids of images. Returned when `images` is not `None`.
+            - **image_num_crops** -- Number of crops for each image. Returned when `images` is not `None`.
+            - **pixel_values_videos** -- Pixel values of videos to be fed to a model. Returned when `videos` is not `None`.
+            - **video_token_pooling** -- Indices of the patches in `video_grids` to pool for each token in `video_tokens`.
+              Returned when `videos` is not `None`.
+            - **video_grids** -- Grids of videos. Returned when `videos` is not `None`.
+        """
+
+        output_kwargs = self._merge_kwargs(
+            MolmoAct2ProcessorKwargs,
+            tokenizer_init_kwargs=self.tokenizer.init_kwargs,
+            **kwargs,
+        )
+
+        if images is not None:
+            image_inputs = self.image_processor(images, **output_kwargs["images_kwargs"])
+            image_grids = image_inputs["image_grids"]
+        else:
+            image_inputs = {}
+            image_grids = None
+
+        if videos is not None:
+            videos_inputs = self.video_processor(videos=videos, **output_kwargs["videos_kwargs"])
+            video_grids = videos_inputs["video_grids"]
+            # If user has not requested video metadata, pop it
+            if "return_metadata" not in kwargs:
+                video_metadata = videos_inputs.pop("video_metadata")
+            else:
+                video_metadata = videos_inputs["video_metadata"]
+        else:
+            videos_inputs = {}
+            video_grids = None
+
+        if not isinstance(text, list):
+            text = [text]
+
+        text = text.copy()  # below lines change text in-place
+
+        if image_grids is not None:
+            index = 0
+            for i in range(len(text)):
+                num_images = text[i].count(self.image_placeholder_token)
+                image_grids_i = image_grids[index : index + num_images]
+                for image_grid in image_grids_i:
+                    image_tokens = self.get_image_tokens(image_grid)
+                    image_string = "".join(image_tokens)
+                    text[i] = text[i].replace(self.image_placeholder_token, image_string, 1)
+                index += num_images
+
+        if video_grids is not None:
+            index = 0
+            for i in range(len(text)):
+                num_videos = text[i].count(self.video_placeholder_token)
+                assert num_videos in {0, 1}, "At most one video is supported for now"
+                video_grids_i = video_grids[index : index + num_videos]
+                metadata_i = video_metadata[index : index + num_videos]
+                for video_grid, metadata in zip(video_grids_i, metadata_i):
+                    video_string = self.get_video_string(
+                        video_grid,
+                        metadata.timestamps,
+                    )
+                    text[i] = text[i].replace(self.video_placeholder_token, video_string, 1)
+                index += num_videos
+
+        return_tensors = output_kwargs["text_kwargs"].pop("return_tensors", None)
+        return_mm_token_type_ids = output_kwargs["text_kwargs"].pop("return_mm_token_type_ids", False)
+        text_inputs = self.tokenizer(text, **output_kwargs["text_kwargs"])
+
+        input_ids = text_inputs["input_ids"]
+        attention_mask = text_inputs["attention_mask"]
+
+        input_ids = np.array(input_ids)
+        attention_mask = np.array(attention_mask)
+
+        bos = self.tokenizer.bos_token_id or self.tokenizer.eos_token_id
+        input_ids, attention_mask = self.insert_bos(
+            input_ids, attention_mask, bos, self.tokenizer.pad_token_id
+        )
+
+        if return_mm_token_type_ids:
+            image_tokens = np.array(self.image_token_ids).astype(input_ids.dtype)
+            token_type_ids = np.any(input_ids[:, :, None] == image_tokens[None, None, :], axis=-1)
+            text_inputs["token_type_ids"] = token_type_ids.tolist()
+
+        text_inputs["input_ids"] = input_ids.tolist()
+        text_inputs["attention_mask"] = attention_mask.tolist()
+
+        return BatchFeature(
+            data={**text_inputs, **image_inputs, **videos_inputs},
+            tensor_type=return_tensors,
+        )
+
+    def post_process_image_text_to_text(
+        self, generated_outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False, **kwargs
+    ):
+        """
+        Post-process the output of the model to decode the text.
+
+        Args:
+            generated_outputs (`torch.Tensor` or `np.ndarray`):
+                The output of the model `generate` function. The output is expected to be a tensor of shape `(batch_size, sequence_length)`
+                or `(sequence_length,)`.
+            skip_special_tokens (`bool`, *optional*, defaults to `True`):
+                Whether or not to remove special tokens in the output. Argument passed to the tokenizer's `batch_decode` method.
+            clean_up_tokenization_spaces (`bool`, *optional*, defaults to `False`):
+                Whether or not to clean up the tokenization spaces. Argument passed to the tokenizer's `batch_decode` method.
+            **kwargs:
+                Additional arguments to be passed to the tokenizer's `batch_decode` method.
+
+        Returns:
+            `list[str]`: The decoded text.
+        """
+        return self.tokenizer.batch_decode(
+            generated_outputs,
+            skip_special_tokens=skip_special_tokens,
+            clean_up_tokenization_spaces=clean_up_tokenization_spaces,
+            **kwargs,
+        )
+
+
+MolmoAct2Processor.register_for_auto_class()
--- a/src/lerobot/policies/molmoact2/hf_model/video_processing_molmoact2.py
+++ b/src/lerobot/policies/molmoact2/hf_model/video_processing_molmoact2.py
@@ -0,0 +1,997 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# ruff: noqa
+
+"""Video processor class for MolmoAct2"""
+
+from functools import partial
+import os
+import warnings
+from contextlib import redirect_stdout
+from io import BytesIO
+from urllib.parse import urlparse
+from typing import Optional, Union
+from collections.abc import Callable
+
+import numpy as np
+import requests
+import einops
+import torch
+import torchvision.transforms
+
+from transformers.image_utils import (
+    IMAGENET_STANDARD_MEAN,
+    IMAGENET_STANDARD_STD,
+    ImageInput,
+    PILImageResampling,
+    SizeDict,
+    validate_kwargs,
+)
+from transformers.video_utils import (
+    VideoInput,
+    is_valid_video,
+    make_batched_videos,
+    make_batched_metadata,
+    VideoMetadata,
+)
+from transformers.processing_utils import Unpack, VideosKwargs
+from transformers.video_processing_utils import BaseVideoProcessor
+from transformers.utils import logging
+from transformers.feature_extraction_utils import BatchFeature
+from transformers.utils import (
+    is_av_available,
+    is_decord_available,
+    is_torchcodec_available,
+    is_yt_dlp_available,
+    TensorType,
+    logging,
+    to_numpy,
+)
+
+
+logger = logging.get_logger(__name__)
+
+MAX_VIDEO_FPS = 8
+
+
+def normalize_image(
+    image: np.ndarray,
+    image_mean: list[float],
+    image_std: list[float],
+) -> np.ndarray:
+    if np.allclose(image_mean, [0.5, 0.5, 0.5]) and np.allclose(image_std, [0.5, 0.5, 0.5]):
+        return image * np.asarray(2.0, dtype=np.float32) - np.asarray(1.0, dtype=np.float32)
+    image -= np.array(image_mean, dtype=np.float32)[None, None, :]
+    image /= np.array(image_std, dtype=np.float32)[None, None, :]
+    return image
+
+
+def resize_image(
+    image: np.ndarray,
+    desired_output_size: list[int],
+    resample: PILImageResampling,
+) -> np.ndarray:
+    if len(image.shape) == 3:
+        is_video = False
+        image = torch.permute(torch.from_numpy(image), [2, 0, 1])
+    else:
+        is_video = True
+        image = torch.permute(torch.from_numpy(image), [0, 3, 1, 2])
+    dtype = image.dtype
+    if torch.is_floating_point(image):
+        in_min = 0.0
+        in_max = 1.0
+        resized = torchvision.transforms.Resize(
+            desired_output_size,
+            resample,
+            antialias=False,
+        )(image)
+        resized = torch.clip(resized, 0.0, 1.0).to(dtype)
+    else:
+        assert image.dtype == torch.uint8, "SigLIP expects float images or uint8 images, but got {}".format(
+            image.dtype
+        )
+        in_min = 0.0
+        in_max = 255.0
+        resized = torchvision.transforms.Resize(
+            desired_output_size,
+            resample,
+            antialias=False,
+        )(image)
+        resized = torch.clip(resized, 0, 255).to(dtype)
+
+    resized = resized.to(torch.float32)
+    resized = (resized - in_min) / (in_max - in_min)
+
+    if is_video:
+        resized = torch.permute(resized, [0, 2, 3, 1]).numpy()
+    else:
+        resized = torch.permute(resized, [1, 2, 0]).numpy()
+
+    return resized
+
+
+def build_resized_image(
+    image: np.ndarray,
+    base_image_input_size: list[int],
+    resample: PILImageResampling,
+    image_mean: list[float],
+    image_std: list[float],
+    image_patch_size: int,
+) -> tuple[np.ndarray, np.ndarray]:
+    resized = resize_image(
+        image,
+        base_image_input_size,
+        resample,
+    )
+    resized = normalize_image(resized, image_mean, image_std)
+    if len(resized.shape) == 3:
+        resized = np.expand_dims(resized, 0)
+    crop_patch_w = base_image_input_size[1] // image_patch_size
+    crop_patch_h = base_image_input_size[0] // image_patch_size
+    resize_idx = np.arange(crop_patch_w * crop_patch_h).reshape([crop_patch_h, crop_patch_w])
+    return resized, resize_idx
+
+
+def batch_pixels_to_patches(array: np.ndarray, patch_size: int) -> np.ndarray:
+    """Reshape images of [n_images, h, w, 3] -> [n_images, n_patches, pixels_per_patch]"""
+    if len(array.shape) == 3:
+        n_crops, h, w = array.shape
+        h_patches = h // patch_size
+        w_patches = w // patch_size
+        array = np.reshape(array, [n_crops, h_patches, patch_size, w_patches, patch_size])
+        array = np.transpose(array, [0, 1, 3, 2, 4])
+        array = np.reshape(array, [n_crops, h_patches * w_patches, patch_size * patch_size])
+        return array
+    else:
+        n_crops, h, w, c = array.shape
+        h_patches = h // patch_size
+        w_patches = w // patch_size
+        array = np.reshape(array, [n_crops, h_patches, patch_size, w_patches, patch_size, c])
+        array = np.transpose(array, [0, 1, 3, 2, 4, 5])
+        array = np.reshape(array, [n_crops, h_patches * w_patches, patch_size * patch_size * c])
+        return array
+
+
+def arange_for_pooling(
+    idx_arr: np.ndarray,
+    pool_h: int,
+    pool_w: int,
+) -> np.ndarray:
+    h_pad = pool_h * ((idx_arr.shape[0] + pool_h - 1) // pool_h) - idx_arr.shape[0]
+    w_pad = pool_w * ((idx_arr.shape[1] + pool_w - 1) // pool_w) - idx_arr.shape[1]
+    idx_arr = np.pad(
+        idx_arr,
+        [[h_pad // 2, (h_pad + 1) // 2], [w_pad // 2, (w_pad + 1) // 2]],
+        mode="constant",
+        constant_values=-1,
+    )
+    return einops.rearrange(idx_arr, "(h dh) (w dw) -> h w (dh dw)", dh=pool_h, dw=pool_w)
+
+
+def image_to_patches_and_grids(
+    image: ImageInput,
+    base_image_input_size: list[int],
+    resample: PILImageResampling,
+    image_mean: list[float],
+    image_std: list[float],
+    image_patch_size: int,
+    image_pooling_w: int,
+    image_pooling_h: int,
+) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
+    """
+    :return image_grids, the shape of each image after pooling
+    :return crops, the image crops to process with the ViT
+    :return pooled_patch_idx, for each patch_id token in `image_tokens`, the indices of the
+                                patches in `crops` to pool for that token, masked with -1
+    """
+    if isinstance(base_image_input_size, int):
+        base_image_input_size = (base_image_input_size, base_image_input_size)
+
+    pooling_w = image_pooling_w
+    pooling_h = image_pooling_h
+
+    resized, resize_idx = build_resized_image(
+        image,
+        base_image_input_size,
+        resample,
+        image_mean,
+        image_std,
+        image_patch_size,
+    )
+    pooling_idx = arange_for_pooling(resize_idx, pooling_h, pooling_w)
+    h, w = pooling_idx.shape[:2]
+    pooling_idx = pooling_idx.reshape([-1, pooling_h * pooling_w])
+    image_grid = [h, w]
+    return (
+        image_grid,
+        batch_pixels_to_patches(resized, image_patch_size),
+        pooling_idx,
+    )
+
+
+def get_candidate_target_fps(
+    video_fps: int | float,
+    sampling_fps: int | float,
+    max_fps: int | float = MAX_VIDEO_FPS,
+) -> list[float]:
+    """
+    Return the factors of `video_fps` that are multiples of `sampling_fps` and do not exceed `max_fps`.
+
+    Examples:
+        >>> get_candidate_target_fps(video_fps=6, sampling_fps=2)
+        [2.0, 6.0]
+        >>> get_candidate_target_fps(video_fps=5, sampling_fps=1)
+        [1.0, 5.0]
+        >>> get_candidate_target_fps(video_fps=2, sampling_fps=2)
+        [2.0]
+        >>> get_candidate_target_fps(video_fps=5, sampling_fps=2)
+        Traceback (most recent call last):
+            ...
+        ValueError: sampling_fps=2 must divide video_fps=5.
+    """
+    if sampling_fps is None:
+        raise ValueError("sampling_fps must be provided")
+
+    video_fps = int(video_fps)
+    sampling_fps = int(sampling_fps)
+    max_fps = int(max_fps)
+    if video_fps <= 0 or sampling_fps <= 0:
+        raise ValueError(f"video_fps and sampling_fps must be positive (got {video_fps}, {sampling_fps})")
+    if video_fps % sampling_fps != 0:
+        raise ValueError(f"sampling_fps={sampling_fps} must divide video_fps={video_fps}.")
+
+    candidates = []
+    for candidate in range(sampling_fps, video_fps + 1, sampling_fps):
+        if candidate > max_fps:
+            break
+        if video_fps % candidate == 0:
+            candidates.append(float(candidate))
+
+    return candidates
+
+
+def read_video_decord(
+    video_path,
+    sample_timestamps_fn: Callable,
+    **kwargs,
+) -> tuple[np.ndarray, VideoMetadata]:
+    """
+    Decode a video using the Decord backend.
+
+    Args:
+        video_path (`str`):
+            Path to the video file.
+        sample_timestamps_fn (`Callable`):
+            A callable function that will return timestamps at which the video should be sampled.
+
+    Returns:
+        tuple[`np.array`, `VideoMetadata`]: A tuple containing:
+            - Numpy array of frames in RGB (shape: [num_frames, height, width, 3]).
+            - `VideoMetadata` object.
+    """
+    # Lazy import from decord
+    import importlib
+
+    decord = importlib.import_module("decord")
+
+    vr = decord.VideoReader(uri=video_path, ctx=decord.cpu(0))  # decord has problems with gpu
+    video_fps = vr.get_avg_fps()
+    total_num_frames = len(vr)
+    time_stamps = vr.get_frame_timestamp(list(range(len(vr))))
+    duration = time_stamps[-1][1] - time_stamps[0][0]
+
+    metadata = VideoMetadata(
+        total_num_frames=int(total_num_frames),
+        fps=float(video_fps),
+        duration=float(duration),
+        video_backend="decord",
+    )
+
+    target_timestamps = sample_timestamps_fn(metadata=metadata, **kwargs)
+    target_timestamps = np.array(target_timestamps)
+    offset = time_stamps[0, 0]
+
+    ix = np.searchsorted(time_stamps[:, 1], target_timestamps + offset, side="right")
+    ix = np.minimum(ix, len(time_stamps) - 1)
+
+    video = vr.get_batch(ix).asnumpy()
+    metadata.update(
+        {
+            "frames_indices": target_timestamps * video_fps,
+            "height": video.shape[1],
+            "width": video.shape[2],
+        }
+    )
+    return video, metadata
+
+
+def read_video_torchcodec(
+    video_path,
+    sample_timestamps_fn: Callable,
+    **kwargs,
+) -> tuple[np.ndarray, VideoMetadata]:
+    """
+    Decode a video using torchcodec decoder.
+
+    Args:
+        video_path (`str`):
+            Path to the video file.
+        sample_timestamps_fn (`Callable`):
+            A callable function that will return timestamps at which the video should be sampled.
+
+    Returns:
+        tuple[`np.array`, `VideoMetadata`]: A tuple containing:
+            - Numpy array of frames in RGB (shape: [num_frames, height, width, 3]).
+            - `VideoMetadata` object.
+    """
+    # Lazy import torchcodec
+    import importlib
+
+    torchcodec = importlib.import_module("torchcodec")
+
+    decoder = torchcodec.decoders.VideoDecoder(
+        video_path,
+        # Interestingly, `exact` mode takes less time than `approximate` when we load the whole video
+        seek_mode="exact",
+        # Allow FFmpeg to decide on the number of threads for efficiency
+        num_ffmpeg_threads=0,
+    )
+    # If the first frame starts at > 0, we effectively clip the video starting at that time
+    # since (most) video players would also skip to that time
+    time_offset = decoder.metadata.begin_stream_seconds_from_content
+    # Note this duration does assume we started playing at `time_offset`
+    duration = decoder.metadata.duration_seconds
+
+    metadata = VideoMetadata(
+        total_num_frames=decoder.metadata.num_frames,
+        fps=decoder.metadata.average_fps,
+        duration=duration,
+        video_backend="torchcodec",
+        height=decoder.metadata.height,
+        width=decoder.metadata.width,
+    )
+
+    target_timestamps = sample_timestamps_fn(metadata=metadata, **kwargs)
+
+    # Floating point/rounding issues might cause `target_timestamps` to be very slightly
+    # out-of-bounds, to handle this we sanity check then clip them
+    assert all(x >= 0 for x in target_timestamps)
+    assert all(x < duration + 1e-6 for x in target_timestamps)
+    # 1e-6 padding since torchcodec can throw out-of-bounds errors even if you ask for the
+    # exact boundary value, we should still get the first/last frame anyway
+    max_timestamp = decoder.metadata.end_stream_seconds_from_content - 1e-6
+    min_timestamp = decoder.metadata.begin_stream_seconds_from_content + 1e-6
+    # Note we avoid using numpy ops here to reduce floating precision issues
+    timestamps = [x + time_offset for x in target_timestamps]
+    timestamps = [max(min_timestamp, min(max_timestamp, x)) for x in timestamps]
+
+    video = (
+        decoder.get_frames_played_at(timestamps).data.numpy().transpose(0, 2, 3, 1)
+    )  # Convert to THWC format
+    target_timestamps = np.array(target_timestamps)
+    metadata.frames_indices = target_timestamps * metadata.fps
+
+    return video, metadata
+
+
+def read_video_pyav(
+    video_path,
+    sample_timestamps_fn: Callable,
+    **kwargs,
+) -> tuple[np.ndarray, VideoMetadata]:
+    """
+    Decode a video using the PyAV backend.
+
+    Args:
+        video_path (`str`):
+            Path to the video file.
+        sample_timestamps_fn (`Callable`):
+            A callable function that will return timestamps at which the video should be sampled.
+
+    Returns:
+        tuple[`np.array`, `VideoMetadata`]: A tuple containing:
+            - Numpy array of frames in RGB (shape: [num_frames, height, width, 3]).
+            - `VideoMetadata` object.
+    """
+    # Lazy import av
+    import importlib
+
+    av = importlib.import_module("av")
+
+    with av.open(video_path) as container:
+        video_stream = container.streams.video[0]
+        fps = video_stream.average_rate or video_stream.guessed_rate
+        it = container.decode(video=0)
+        frames = list(it)
+
+        stream = container.streams.video[0]
+        start = frames[0].pts * stream.time_base
+        container_end = stream.duration
+        if container_end is not None:
+            container_end *= stream.time_base
+        if container_end is None or container_end < frames[-1].pts * stream.time_base:
+            # Some problem with stream duration, so use the frame PTS directly
+            # and guess the duration of the last frame
+            end = frames[-1].pts * stream.time_base + 1 / fps
+        else:
+            end = container_end
+        duration = float(end - start)
+
+        metadata = VideoMetadata(
+            total_num_frames=len(frames),
+            fps=float(fps),
+            duration=float(duration),
+            video_backend="pyav",
+            height=video_stream.height,
+            width=video_stream.width,
+        )
+
+        target_timestamps = sample_timestamps_fn(metadata=metadata, **kwargs)
+        offset = float(start)
+
+        target_timestamps = np.array(target_timestamps)
+        end_time_stamps = np.array([float(frame.pts * stream.time_base) for frame in frames[1:]] + [duration])
+        indices = np.searchsorted(end_time_stamps, target_timestamps + offset, side="right")
+        indices = np.minimum(indices, len(end_time_stamps) - 1)
+
+        video = np.stack(
+            [frames[i].to_ndarray(format="rgb24", channel_last=True) for i in indices],
+            axis=0,
+        )
+
+        metadata.frames_indices = target_timestamps * fps
+
+        return video, metadata
+
+
+VIDEO_DECODERS = {
+    "decord": read_video_decord,
+    "torchcodec": read_video_torchcodec,
+    "pyav": read_video_pyav,
+}
+
+
+def load_video(
+    video: VideoInput,
+    backend: str = "decord",
+    sample_timestamps_fn: Callable | None = None,
+    **kwargs,
+):
+    """
+    Loads `video` to a numpy array.
+
+    Args:
+        video (`VideoInput`):
+            The video to convert to the numpy array format. Can be a link to video or local path.
+        backend (`str`, *optional*, defaults to `"decord"`):
+            The backend to use when loading the video. Can be any of ["decord", "pyav", "torchcodec"]. Defaults to "decord".
+        sample_timestamps_fn (`Callable`):
+            A callable function that will return timestamps at which the video should be sampled.
+    """
+
+    # Early exit if provided an array or `PIL` frames
+    if not isinstance(video, str):
+        metadata = [None] * len(video)
+        return video, metadata
+
+    if urlparse(video).netloc in ["www.youtube.com", "youtube.com"]:
+        if not is_yt_dlp_available():
+            raise ImportError("To load a video from a YouTube URL you have to install `yt_dlp` first.")
+        # Lazy import from yt_dlp
+        import importlib
+
+        yt_dlp = importlib.import_module("yt_dlp")
+
+        buffer = BytesIO()
+        with redirect_stdout(buffer), yt_dlp.YoutubeDL() as f:
+            f.download([video])
+        bytes_obj = buffer.getvalue()
+        file_obj = BytesIO(bytes_obj)
+    elif video.startswith("http://") or video.startswith("https://"):
+        file_obj = BytesIO(requests.get(video, timeout=10).content)
+    elif os.path.isfile(video):
+        file_obj = video
+    else:
+        raise TypeError(
+            "Incorrect format used for video. Should be a URL linking to a video or a local path."
+        )
+
+    # can also load with decord, but not cv2/torchvision
+    # both will fail in case of url links
+    video_is_url = video.startswith("http://") or video.startswith("https://")
+    if video_is_url and backend == "opencv":
+        raise ValueError("If you are trying to load a video from URL, you cannot use 'opencv' as backend")
+
+    if (
+        (not is_decord_available() and backend == "decord")
+        or (not is_torchcodec_available() and backend == "torchcodec")
+        or (not is_av_available() and backend == "pyav")
+    ):
+        raise ImportError(
+            f"You chose backend={backend} for loading the video but the required library is not found in your environment. "
+            f"Make sure to install {backend} before loading the video."
+        )
+
+    video_decoder = VIDEO_DECODERS[backend]
+    video, metadata = video_decoder(file_obj, sample_timestamps_fn, **kwargs)
+    return video, metadata
+
+
+def get_target_fps(
+    video_fps: float,
+    max_frames: int,
+    total_frames: int,
+    frame_sample_mode: str,
+    candidate_target_fps: list[float],
+) -> float:
+    """
+    Get the target fps that best spans the video and has the most frames sampled
+    """
+    num_frames_sampled = 0
+    selected_target_fps = None
+    for target_fps in candidate_target_fps:
+        step_size = max(int(video_fps / target_fps), 1)
+        num_frames_sampled_at_fps = int(total_frames / step_size)
+        if num_frames_sampled == 0:
+            if "uniform" in frame_sample_mode:
+                if num_frames_sampled_at_fps > max_frames:
+                    break
+            selected_target_fps = target_fps
+            num_frames_sampled = num_frames_sampled_at_fps
+
+        else:
+            # the candidate sampling fps increases so frame count can't decrease
+            assert num_frames_sampled <= num_frames_sampled_at_fps
+            if num_frames_sampled_at_fps > max_frames:
+                # choose the sampling fps that spans the video
+                continue
+
+            elif num_frames_sampled_at_fps > num_frames_sampled:
+                # both are less than max_frames, choose the one with higher density of frames sampled
+                selected_target_fps = target_fps
+                num_frames_sampled = num_frames_sampled_at_fps
+    return selected_target_fps
+
+
+def get_frame_times_and_chosen_fps(selected_target_fps, total_frames, max_frames, video_fps):
+    if selected_target_fps is None:
+        frame_indices = np.linspace(0, total_frames, max_frames, endpoint=False, dtype=int)
+    else:
+        step_size = max(int(video_fps / selected_target_fps), 1)
+        frame_indices = np.arange(0, total_frames, step_size)
+    if len(frame_indices) > max_frames:
+        frame_indices = frame_indices[:max_frames]
+    return selected_target_fps, frame_indices
+
+
+class MolmoAct2VideoProcessorKwargs(VideosKwargs, total=False):
+    patch_size: int | None
+    pooling_size: list[int] | None
+    frame_sample_mode: str | None
+    max_fps: int | None
+    sampling_fps: int | None
+
+
+class MolmoAct2VideoProcessor(BaseVideoProcessor):
+    resample = PILImageResampling.BILINEAR
+    size = {"height": 378, "width": 378}
+    image_mean = IMAGENET_STANDARD_MEAN
+    image_std = IMAGENET_STANDARD_STD
+    do_resize = True
+    do_rescale = True
+    do_normalize = True
+    do_convert_rgb = True
+    patch_size = 14
+    pooling_size = [3, 3]
+    do_sample_frames = True
+    frame_sample_mode = "uniform_last_frame"
+    max_fps = 2
+    sampling_fps = 2
+    valid_kwargs = MolmoAct2VideoProcessorKwargs
+    model_input_names = ["pixel_values_videos", "video_token_pooling", "video_grids"]
+
+    def __init__(self, **kwargs: Unpack[MolmoAct2VideoProcessorKwargs]):
+        super().__init__(**kwargs)
+        if self.size is not None and (
+            self.size.get("height", None) is None or self.size.get("width", None) is None
+        ):
+            raise ValueError("size must contain 'height' and 'width' keys.")
+
+    def _further_process_kwargs(
+        self,
+        size: SizeDict | None = None,
+        **kwargs,
+    ) -> dict:
+        """
+        Update kwargs that need further processing before being validated
+        Can be overridden by subclasses to customize the processing of kwargs.
+        """
+        if size is not None and ("height" not in size or "width" not in size):
+            raise ValueError("size must contain 'height' and 'width' keys.")
+
+        return super()._further_process_kwargs(size=size, **kwargs)
+
+    def sample_times(
+        self,
+        metadata: VideoMetadata,
+        frame_sample_mode: str,
+        num_frames: int,
+        max_fps: int | None = None,
+        sampling_fps: int | None = None,
+        **kwargs,
+    ) -> np.ndarray:
+        """
+        Time-based sampling if an array video is passed
+        Args:
+            metadata (`VideoMetadata`):
+                Metadata of the video containing information about total duration, fps and total number of frames.
+            frame_sample_mode (`str`, *optional*):
+                Mode to sample frames. Defaults to `self.frame_sample_mode`.
+            num_frames (`int`, *optional*):
+                Maximum number of frames to sample. Defaults to `self.num_frames`.
+            max_fps (`int`, *optional*):
+                Maximum frames per second to sample.
+            sampling_fps (`int`, *optional*):
+                Sampling frames per second. Defaults to `self.sampling_fps`.
+                Used when `frame_sample_mode` is `"fps"`.
+        """
+        frame_sample_mode = frame_sample_mode or self.frame_sample_mode
+        num_frames = num_frames or self.num_frames
+        sampling_fps = sampling_fps or self.sampling_fps
+
+        duration = metadata.duration or metadata.total_num_frames / metadata.fps
+        if frame_sample_mode == "fps":
+            candidate_target_fps = get_candidate_target_fps(metadata.fps, sampling_fps)
+            # Try larger and larger FPSs until we hit one that can't span the video
+            target_fps = candidate_target_fps[0]
+            for candidate_fps in candidate_target_fps[1:]:
+                if num_frames / candidate_fps < duration:
+                    break
+                target_fps = candidate_fps
+            times = np.arange(0, num_frames) / target_fps
+            times = times[times < duration]
+            return times
+        elif frame_sample_mode == "uniform_last_frame":
+            if max_fps is not None:
+                max_duration = (num_frames - 1) / max_fps  # -1 to include the last frame
+                if max_duration < duration:
+                    times = np.linspace(0, duration, num=num_frames, endpoint=True, dtype=np.float64)
+                else:
+                    times = np.arange(0.0, stop=duration, step=1 / max_fps)
+                    times = np.concatenate([times, [duration]], axis=0)
+                    assert len(times) <= num_frames
+            else:
+                times = np.linspace(0, duration, num=num_frames, endpoint=True, dtype=np.float64)
+            return times
+        else:
+            raise NotImplementedError(frame_sample_mode)
+
+    def sample_frames(
+        self,
+        metadata: VideoMetadata,
+        frame_sample_mode: str | None = None,
+        num_frames: int | None = None,
+        max_fps: int | None = None,
+        sampling_fps: int | None = None,
+        **kwargs,
+    ) -> np.ndarray:
+        """
+        Frame-based sampling if an array video is passed
+        Args:
+            metadata (`VideoMetadata`):
+                Metadata of the video containing information about total duration, fps and total number of frames.
+            frame_sample_mode (`str`, *optional*):
+                Mode to sample frames. Defaults to `self.frame_sample_mode`.
+            num_frames (`int`, *optional*):
+                Maximum number of frames to sample. Defaults to `self.num_frames`.
+            max_fps (`int`, *optional*):
+                Maximum frames per second to sample.
+            sampling_fps (`int`, *optional*):
+                Sampling frames per second. Defaults to `self.sampling_fps`.
+                Used when `frame_sample_mode` is `"fps"`.
+        """
+        frame_sample_mode = frame_sample_mode or self.frame_sample_mode
+        num_frames = num_frames or self.num_frames
+        sampling_fps = sampling_fps or self.sampling_fps
+
+        total_num_frames = metadata.total_num_frames
+        if frame_sample_mode == "uniform_last_frame" and max_fps is not None:
+            duration = total_num_frames / metadata.fps
+            if total_num_frames <= 2:
+                return np.arange(total_num_frames).astype(int)
+            if duration > (num_frames - 1) / max_fps:  # -1 to include the last frame
+                # uniform fallback
+                indices = np.linspace(
+                    0,
+                    total_num_frames - 1,
+                    num=min(num_frames, total_num_frames),
+                    endpoint=True,
+                ).astype(int)
+                return indices
+            else:
+                float_indices = np.arange(
+                    0.0,
+                    stop=total_num_frames - 1,
+                    step=float(metadata.fps / max_fps),
+                )
+                if np.round(float_indices[-1]) != total_num_frames - 1:
+                    float_indices = np.concatenate([float_indices, [total_num_frames - 1]], axis=0)
+                indices = np.round(float_indices).astype(int)
+                assert indices[-1] < total_num_frames
+                assert len(float_indices) <= num_frames
+                return indices
+        elif frame_sample_mode == "uniform_last_frame":
+            indices = np.linspace(
+                0,
+                total_num_frames - 1,
+                num=min(num_frames, total_num_frames),
+                endpoint=True,
+            ).astype(int)
+            return indices
+        elif frame_sample_mode == "fps":
+            candidate_target_fps = get_candidate_target_fps(metadata.fps, sampling_fps)
+            selected_target_fps = get_target_fps(
+                metadata.fps,
+                num_frames,
+                total_num_frames,
+                frame_sample_mode,
+                candidate_target_fps,
+            )
+            _, indices = get_frame_times_and_chosen_fps(
+                selected_target_fps,
+                total_num_frames,
+                num_frames,
+                metadata.fps,
+            )
+            return indices
+        else:
+            raise NotImplementedError(frame_sample_mode)
+
+    def fetch_videos(self, video_url_or_urls: str | list[str] | list[list[str]], sample_timestamps_fn=None):
+        """
+        Convert a single or a list of urls into the corresponding `np.array` objects.
+
+        If a single URL is passed, the return value is a single object. If a list is passed, a list of objects
+        is returned.
+        """
+        if (not is_decord_available()) and (not is_torchcodec_available()) and (not is_av_available()):
+            raise ImportError(
+                "MolmoAct2VideoProcessor requires `decord`, `torchcodec`, or `av` to be installed."
+            )
+
+        if is_decord_available():
+            backend = "decord"
+        elif is_torchcodec_available():
+            warnings.warn(
+                "`decord` is not installed and cannot be used to decode the video by default. "
+                "Falling back to `torchcodec`."
+            )
+            backend = "torchcodec"
+        else:
+            warnings.warn(
+                "`decord` is not installed and cannot be used to decode the video by default. "
+                "Falling back to `PyAV`."
+            )
+            backend = "pyav"
+
+        if isinstance(video_url_or_urls, list):
+            return list(
+                zip(
+                    *[
+                        self.fetch_videos(x, sample_timestamps_fn=sample_timestamps_fn)
+                        for x in video_url_or_urls
+                    ]
+                )
+            )
+        else:
+            return load_video(video_url_or_urls, backend=backend, sample_timestamps_fn=sample_timestamps_fn)
+
+    def _decode_and_sample_videos(
+        self,
+        videos: VideoInput,
+        video_metadata: VideoMetadata | dict,
+        do_sample_frames: bool | None = None,
+        sample_indices_fn: Callable | None = None,
+        sample_timestamps_fn: Callable | None = None,
+    ):
+        """
+        Decode input videos and sample frames if needed.
+        """
+        videos = make_batched_videos(videos)
+        video_metadata = make_batched_metadata(videos, video_metadata=video_metadata)
+
+        # Frame-based sampling if an array video is passed
+        # Otherwise, time-based sampling with decoding
+        if is_valid_video(videos[0]) and do_sample_frames:
+            assert video_metadata[0].fps is not None, "FPS must be provided for video input"
+            sampled_videos = []
+            sampled_metadata = []
+            for video, metadata in zip(videos, video_metadata):
+                indices = sample_indices_fn(metadata=metadata)
+                metadata.frames_indices = indices
+                sampled_videos.append(video[indices])
+                sampled_metadata.append(metadata)
+            videos = sampled_videos
+            video_metadata = sampled_metadata
+        elif not is_valid_video(videos[0]):
+            if sample_indices_fn is None:
+                logger.warning(
+                    "do_sample_frames is False, but video array is not provided: "
+                    "Will decode the video and sample frames using MolmoAct2's default sampling mode"
+                )
+            if isinstance(videos[0], list):
+                raise ValueError("A list of images is not supported for video input!")
+            else:
+                videos, video_metadata = self.fetch_videos(videos, sample_timestamps_fn=sample_timestamps_fn)
+
+        return videos, video_metadata
+
+    def _prepare_input_videos(
+        self,
+        videos: VideoInput,
+        **kwargs,
+    ) -> list[np.ndarray]:
+        processed_videos = [to_numpy(video) for video in videos]
+        return processed_videos
+
+    def preprocess(
+        self,
+        videos: VideoInput,
+        **kwargs: Unpack[MolmoAct2VideoProcessorKwargs],
+    ) -> BatchFeature:
+        validate_kwargs(
+            captured_kwargs=kwargs.keys(),
+            valid_processor_keys=list(self.valid_kwargs.__annotations__.keys()) + ["return_tensors"],
+        )
+
+        # Set default kwargs from self. This ensures that if a kwarg is not provided
+        # by the user, it gets its default value from the instance, or is set to None.
+        for kwarg_name in self.valid_kwargs.__annotations__:
+            kwargs.setdefault(kwarg_name, getattr(self, kwarg_name, None))
+
+        do_sample_frames = kwargs.pop("do_sample_frames")
+        video_metadata = kwargs.pop("video_metadata")
+
+        sample_indices_fn = partial(self.sample_frames, **kwargs) if do_sample_frames else None
+        sample_timestamps_fn = partial(self.sample_times, **kwargs)
+        videos, video_metadata = self._decode_and_sample_videos(
+            videos,
+            video_metadata=video_metadata,
+            do_sample_frames=do_sample_frames,
+            sample_indices_fn=sample_indices_fn,
+            sample_timestamps_fn=sample_timestamps_fn,
+        )
+        videos = self._prepare_input_videos(videos=videos)
+
+        kwargs = self._further_process_kwargs(**kwargs)
+
+        return_metadata = kwargs.pop("return_metadata")
+        preprocessed_videos = self._preprocess(videos=videos, **kwargs)
+        if return_metadata:
+            preprocessed_videos["video_metadata"] = video_metadata
+        return preprocessed_videos
+
+    def _preprocess(
+        self,
+        videos: list[np.ndarray],
+        size: SizeDict | None = None,
+        resample: PILImageResampling | None = None,
+        image_mean: float | list[float] | None = None,
+        image_std: float | list[float] | None = None,
+        do_convert_rgb: bool | None = None,
+        patch_size: int | None = None,
+        pooling_size: list[int] | None = None,
+        return_tensors: str | TensorType | None = None,
+        **kwargs,
+    ) -> BatchFeature:
+        """
+        Preprocess a video for the model.
+        Args:
+            videos (`VideoInput`):
+                Video to preprocess.
+            size (`SizeDict`, *optional*, defaults to `self.size`):
+                Size of the image after resizing.
+            resample (`PILImageResampling`, *optional*, defaults to `self.resample`):
+                Resampling filter to use when resizing the image. This can be one of the enum `PILImageResampling`. Only
+                has an effect if `do_resize` is set to `True`.
+            image_mean (`float` or `list[float]`, *optional*, defaults to `self.image_mean`):
+                Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
+            image_std (`float` or `list[float]`, *optional*, defaults to `self.image_std`):
+                Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to
+                `True`.
+            do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
+                Whether to convert the image to RGB.
+            patch_size (`int`, *optional*, defaults to `self.patch_size`):
+                The spatial patch size of the vision encoder.
+            pooling_size (`list[int]`, *optional*, defaults to `self.pooling_size`):
+                The pooling size of the vision adapter.
+            return_tensors (`str` or `TensorType`, *optional*):
+                The type of tensors to return. Can be one of:
+                - Unset: Return a list of `np.ndarray`.
+                - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
+                - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
+                - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
+                - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
+
+        Returns:
+            A `BatchFeature` containing the following keys:
+                - `pixel_values_videos`: The preprocessed videos.
+                - `video_token_pooling`: The indices of the patches in `crops` to pool for each token in `video_tokens`.
+                - `video_grids`: The video grids.
+        """
+        if size.height is None or size.width is None:
+            raise ValueError("size must contain 'height' and 'width' keys.")
+
+        base_image_input_size = [size.height, size.width]
+
+        resample = resample or self.resample
+        image_mean = image_mean or self.image_mean
+        image_std = image_std or self.image_std
+        do_convert_rgb = self.do_convert_rgb if do_convert_rgb is None else do_convert_rgb
+
+        patch_size = patch_size or self.patch_size
+        pooling_size = pooling_size or self.pooling_size
+
+        image_pooling_h, image_pooling_w = pooling_size
+
+        batch_grids = []
+        batch_crops = []
+        batch_pooled_patches_idx = []
+
+        for video in videos:
+            all_crops = []
+            pooled_patches_idx = []
+
+            for frame in video:
+                image_grid, crops, pooled_idx = image_to_patches_and_grids(
+                    frame,
+                    base_image_input_size,
+                    resample,
+                    image_mean,
+                    image_std,
+                    patch_size,
+                    image_pooling_w,
+                    image_pooling_h,
+                )
+                offset = sum(np.prod(x.shape[:2]) for x in all_crops)
+                pooled_idx_with_offset = np.where(pooled_idx >= 0, pooled_idx + offset, pooled_idx)
+                pooled_patches_idx.append(pooled_idx_with_offset)
+                all_crops.append(crops)
+
+            video_grid = np.array([len(video), image_grid[0], image_grid[1]])
+            all_crops = np.concatenate(all_crops, 0)
+            pooled_patches_idx = np.concatenate(pooled_patches_idx, 0)
+
+            batch_grids.append(video_grid)
+            batch_crops.append(all_crops)
+            batch_pooled_patches_idx.append(pooled_patches_idx)
+
+        video_grids = np.stack(batch_grids, 0)
+        pixel_values_videos = np.concatenate(batch_crops, 0)
+        video_token_pooling = np.concatenate(batch_pooled_patches_idx, 0)
+
+        data = dict(
+            pixel_values_videos=pixel_values_videos,
+            video_token_pooling=video_token_pooling,
+            video_grids=video_grids,
+        )
+
+        return BatchFeature(data, tensor_type=return_tensors)
+
+
+MolmoAct2VideoProcessor.register_for_auto_class()
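Review note: the `uniform_last_frame` branch of `sample_frames` in the hunk above can be sketched as a standalone function. This is only a minimal sketch under stated assumptions: the name `uniform_last_frame_indices` and its flat argument list are hypothetical (the real method reads `total_num_frames` and `fps` from `VideoMetadata`), the asserts from the original are omitted, and only NumPy is needed.

```python
import numpy as np


def uniform_last_frame_indices(total_num_frames, fps, num_frames, max_fps=None):
    """Hypothetical standalone mirror of the `uniform_last_frame` index logic."""
    # With an fps cap, clips of one or two frames are returned unchanged.
    if max_fps is not None and total_num_frames <= 2:
        return np.arange(total_num_frames).astype(int)

    duration = total_num_frames / fps
    if max_fps is None or duration > (num_frames - 1) / max_fps:
        # No cap, or the clip is too long to cover at max_fps within the
        # frame budget: spread the budget uniformly, endpoints included.
        return np.linspace(
            0,
            total_num_frames - 1,
            num=min(num_frames, total_num_frames),
            endpoint=True,
        ).astype(int)

    # Short clip: step through it at max_fps, then force the last frame in.
    float_indices = np.arange(0.0, stop=total_num_frames - 1, step=float(fps / max_fps))
    if np.round(float_indices[-1]) != total_num_frames - 1:
        float_indices = np.concatenate([float_indices, [total_num_frames - 1]], axis=0)
    return np.round(float_indices).astype(int)


long_clip = uniform_last_frame_indices(100, fps=10, num_frames=8, max_fps=2)
short_clip = uniform_last_frame_indices(11, fps=10, num_frames=8, max_fps=5)
print(long_clip.tolist())
print(short_clip.tolist())  # [0, 2, 4, 6, 8, 10]
```

In the long-clip case the 10 s duration exceeds `(8 - 1) / 2 = 3.5` s, so the uniform fallback returns exactly 8 indices ending at frame 99. In the short-clip case the step is `10 / 5 = 2` source frames and the final frame is appended explicitly.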
--- a/src/lerobot/policies/molmoact2/modeling_molmoact2.py
+++ b/src/lerobot/policies/molmoact2/modeling_molmoact2.py
--- a/src/lerobot/policies/molmoact2/processor_molmoact2.py
+++ b/src/lerobot/policies/molmoact2/processor_molmoact2.py
--- a/src/lerobot/rewards/__init__.py
+++ b/src/lerobot/rewards/__init__.py
@@ -20,12 +20,14 @@ from .factory import (
    make_reward_pre_post_processors as make_reward_pre_post_processors,
 )
 from .pretrained import PreTrainedRewardModel as PreTrainedRewardModel
+from .robometer.configuration_robometer import RobometerConfig as RobometerConfig
 from .sarm.configuration_sarm import SARMConfig as SARMConfig
 from .topreward.configuration_topreward import TOPRewardConfig as TOPRewardConfig

 __all__ = [
    # Configuration classes
    "RewardClassifierConfig",
+    "RobometerConfig",
    "SARMConfig",
    "TOPRewardConfig",
    # Base class
--- a/src/lerobot/rewards/factory.py
+++ b/src/lerobot/rewards/factory.py
@@ -25,6 +25,7 @@ from lerobot.processor import PolicyAction, PolicyProcessorPipeline

 from .classifier.configuration_classifier import RewardClassifierConfig
 from .pretrained import PreTrainedRewardModel
+from .robometer.configuration_robometer import RobometerConfig
 from .sarm.configuration_sarm import SARMConfig
 from .topreward.configuration_topreward import TOPRewardConfig

@@ -38,7 +39,7 @@ def get_reward_model_class(name: str) -> type[PreTrainedRewardModel]:

    Args:
        name: The name of the reward model. Supported names are "reward_classifier",
-              "sarm", "topreward".
+              "sarm", "robometer", "topreward".

    Returns:
        The reward model class corresponding to the given name.
@@ -54,6 +55,10 @@ def get_reward_model_class(name: str) -> type[PreTrainedRewardModel]:
        from lerobot.rewards.sarm.modeling_sarm import SARMRewardModel

        return SARMRewardModel
+    elif name == "robometer":
+        from lerobot.rewards.robometer.modeling_robometer import RobometerRewardModel
+
+        return RobometerRewardModel
    elif name == "topreward":
        from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel

@@ -74,7 +79,7 @@ def make_reward_model_config(reward_type: str, **kwargs) -> RewardModelConfig:

    Args:
        reward_type: The type of the reward model. Supported types include
-                     "reward_classifier", "sarm", "topreward".
+                     "reward_classifier", "sarm", "robometer", "topreward".
        **kwargs: Keyword arguments to be passed to the configuration class constructor.

    Returns:
@@ -87,6 +92,8 @@ def make_reward_model_config(reward_type: str, **kwargs) -> RewardModelConfig:
        return RewardClassifierConfig(**kwargs)
    elif reward_type == "sarm":
        return SARMConfig(**kwargs)
+    elif reward_type == "robometer":
+        return RobometerConfig(**kwargs)
    elif reward_type == "topreward":
        return TOPRewardConfig(**kwargs)
    else:
@@ -168,6 +175,13 @@ def make_reward_pre_post_processors(
            dataset_stats=kwargs.get("dataset_stats"),
            dataset_meta=kwargs.get("dataset_meta"),
        )
+    elif isinstance(reward_cfg, RobometerConfig):
+        from lerobot.rewards.robometer.processor_robometer import make_robometer_pre_post_processors
+
+        return make_robometer_pre_post_processors(
+            config=reward_cfg,
+            dataset_stats=kwargs.get("dataset_stats"),
+        )

    elif isinstance(reward_cfg, TOPRewardConfig):
        from lerobot.rewards.topreward.processor_topreward import make_topreward_pre_post_processors
--- a/src/lerobot/rewards/robometer/__init__.py
+++ b/src/lerobot/rewards/robometer/__init__.py
@@ -0,0 +1,19 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from .configuration_robometer import RobometerConfig
+from .modeling_robometer import RobometerRewardModel
+from .processor_robometer import make_robometer_pre_post_processors
+
+__all__ = ["RobometerConfig", "RobometerRewardModel", "make_robometer_pre_post_processors"]
--- a/src/lerobot/rewards/robometer/compute_rabc_weights.py
+++ b/src/lerobot/rewards/robometer/compute_rabc_weights.py
@@ -0,0 +1,320 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Compute per-frame Robometer progress curves for a LeRobot dataset.
+
+For each episode, builds per-frame sub-samples using the frame-steps
+strategy from the Robometer eval server: for each original frame ``t``,
+linspace-subsample ``[0, t]`` into ``K`` frames (default 4, matching
+``NUM_SUBSAMPLED_FRAMES`` in the eval server), run one forward through
+the Robometer processor + model, and keep the last-frame progress value.
+All sub-samples are the same size ``K`` so they batch cleanly.
+
+The parquet uses the same schema as SARM's
+:mod:`lerobot.rewards.sarm.compute_rabc_weights` so existing consumers —
+:class:`lerobot.rewards.sarm.rabc.RABCWeights` (which reads
+``progress_sparse``) and the progress-overlay script in
+``examples/dataset/create_progress_videos.py`` — work without modification.
+
+Usage:
+    # Dense per-frame progress for one episode
+    python -m lerobot.rewards.robometer.compute_rabc_weights \\
+        --dataset-repo-id lerobot/libero_10_image \\
+        --reward-model-path lerobot/Robometer-4B \\
+        --episodes 0
+
+    # All episodes, smaller batches for memory-constrained GPUs
+    python -m lerobot.rewards.robometer.compute_rabc_weights \\
+        --dataset-repo-id lerobot/libero_10_image \\
+        --reward-model-path lerobot/Robometer-4B \\
+        --batch-size 16
+"""
+
+from __future__ import annotations
+
+import argparse
+import logging
+from pathlib import Path
+from typing import Any
+
+import numpy as np
+import pyarrow as pa
+import pyarrow.parquet as pq
+import torch
+from tqdm import tqdm
+
+from lerobot.datasets import LeRobotDataset
+from lerobot.rewards.robometer.configuration_robometer import RobometerConfig
+from lerobot.rewards.robometer.modeling_robometer import RobometerRewardModel
+from lerobot.rewards.robometer.processor_robometer import RobometerEncoderProcessorStep
+from lerobot.types import TransitionKey
+
+DEFAULT_OUTPUT_FILENAME = "robometer_progress.parquet"
+
+# Upstream Robometer eval server uses K=4 for frame-steps sub-samples.
+DEFAULT_NUM_SUBSAMPLED_FRAMES = 4
+
+
+def get_reward_model_path_from_parquet(parquet_path: Path) -> str | None:
+    """Read ``reward_model_path`` from parquet metadata if available."""
+    if not parquet_path.exists():
+        return None
+    try:
+        metadata = pq.read_metadata(parquet_path).schema.to_arrow_schema().metadata
+        if metadata and b"reward_model_path" in metadata:
+            return metadata[b"reward_model_path"].decode()
+    except Exception:  # nosec B110
+        return None
+    return None
+
+
+def _resolve_task(sample: dict[str, Any], default: str) -> str:
+    """Best-effort task extraction from a dataset sample."""
+    task = sample.get("task")
+    if isinstance(task, str) and task:
+        return task
+    return default
+
+
+def _build_subsample_indices(num_frames: int, num_subsampled_frames: int) -> list[np.ndarray]:
+    """Frame-steps linspace expansion.
+
+    For each ``t in [0, num_frames - 1]`` returns ``num_subsampled_frames``
+    indices from ``np.linspace(0, t, num_subsampled_frames)`` — the first
+    and last frames are always included. Each entry is a fixed-size array
+    so the model can batch them.
+    """
+    return [np.linspace(0, t, num_subsampled_frames).round().astype(np.int64) for t in range(num_frames)]
+
+
+def compute_robometer_progress(
+    dataset_repo_id: str,
+    reward_model_path: str,
+    output_path: str | None = None,
+    device: str = "cuda",
+    batch_size: int = 32,
+    num_subsampled_frames: int = DEFAULT_NUM_SUBSAMPLED_FRAMES,
+    episodes: list[int] | None = None,
+    image_key: str | None = None,
+) -> Path:
+    """Run Robometer over a dataset and write per-frame progress to a parquet file."""
+    logging.info(f"Loading Robometer: {reward_model_path}")
+    config = RobometerConfig(pretrained_path=reward_model_path, device=device)
+    if image_key is not None:
+        config.image_key = image_key
+    model = RobometerRewardModel.from_pretrained(reward_model_path, config=config)
+    model.to(device).eval()
+
+    encoder = RobometerEncoderProcessorStep(
+        base_model_id=config.base_model_id,
+        image_key=config.image_key,
+        task_key=config.task_key,
+        default_task=config.default_task,
+        max_frames=num_subsampled_frames,
+        use_multi_image=config.use_multi_image,
+        use_per_frame_progress_token=config.use_per_frame_progress_token,
+    )
+
+    image_key = config.image_key
+
+    logging.info(f"Loading dataset: {dataset_repo_id}")
+    dataset = LeRobotDataset(dataset_repo_id, download_videos=True)
+    logging.info(f"Dataset: {dataset.num_episodes} episodes, {dataset.num_frames} frames")
+
+    episode_indices = list(range(dataset.num_episodes)) if episodes is None else episodes
+    logging.info(f"Processing {len(episode_indices)} episode(s)")
+
+    all_index: list[int] = []
+    all_episode: list[int] = []
+    all_frame: list[int] = []
+    all_progress: list[float] = []
+
+    for episode_idx in tqdm(episode_indices, desc="Episodes"):
+        ep = dataset.meta.episodes[episode_idx]
+        ep_start = int(ep["dataset_from_index"])
+        ep_end = int(ep["dataset_to_index"])
+        num_frames = ep_end - ep_start
+        if num_frames <= 0:
+            continue
+
+        first_sample = dataset[ep_start]
+        task = _resolve_task(first_sample, default=config.default_task or "perform the task")
+
+        ep_frames = torch.stack([dataset[ep_start + i][image_key] for i in range(num_frames)])
+
+        sub_indices = _build_subsample_indices(num_frames, num_subsampled_frames)
+
+        progress_per_frame = np.zeros(num_frames, dtype=np.float32)
+
+        for start in tqdm(range(0, num_frames, batch_size), desc=f"  Ep {episode_idx}", leave=False):
+            end = min(start + batch_size, num_frames)
+            frames_batch = torch.stack([ep_frames[sub_indices[i]] for i in range(start, end)])
+
+            transition = {
+                TransitionKey.OBSERVATION: {image_key: frames_batch},
+                TransitionKey.COMPLEMENTARY_DATA: {"task": task},
+            }
+            encoded = encoder(transition)
+            obs = encoded[TransitionKey.OBSERVATION]
+            batch = {
+                key: value.to(device) if isinstance(value, torch.Tensor) else value
+                for key, value in obs.items()
+            }
+
+            with torch.no_grad():
+                rewards = model.compute_reward(batch)
+            progress_per_frame[start:end] = rewards.cpu().numpy()
+
+        for local in range(num_frames):
+            all_index.append(ep_start + local)
+            all_episode.append(episode_idx)
+            all_frame.append(local)
+            all_progress.append(float(progress_per_frame[local]))
+
+        if device.startswith("cuda"):
+            torch.cuda.empty_cache()
+
+    table = pa.table(
+        {
+            "index": np.asarray(all_index, dtype=np.int64),
+            "episode_index": np.asarray(all_episode, dtype=np.int64),
+            "frame_index": np.asarray(all_frame, dtype=np.int64),
+            "progress_sparse": np.asarray(all_progress, dtype=np.float32),
+        }
+    ).replace_schema_metadata({b"reward_model_path": reward_model_path.encode()})
+
+    out = Path(dataset.root) / DEFAULT_OUTPUT_FILENAME if output_path is None else Path(output_path)
+    out.parent.mkdir(parents=True, exist_ok=True)
+    pq.write_table(table, out)
+    logging.info(f"Saved {len(table)} frame values to {out}")
+
+    progress_arr = np.asarray(all_progress, dtype=np.float32)
+    if progress_arr.size:
+        logging.info(
+            f"Progress: mean={float(progress_arr.mean()):.4f}, "
+            f"std={float(progress_arr.std()):.4f}, "
+            f"min={float(progress_arr.min()):.4f}, "
+            f"max={float(progress_arr.max()):.4f}"
+        )
+    return out
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Compute per-frame Robometer progress curves for RA-BC weighting.",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+Examples:
+    # Dense per-frame progress for one episode
+    python -m lerobot.rewards.robometer.compute_rabc_weights \\
+        --dataset-repo-id lerobot/libero_10_image \\
+        --reward-model-path lerobot/Robometer-4B \\
+        --episodes 0
+
+    # All episodes, smaller batches for memory-constrained GPUs
+    python -m lerobot.rewards.robometer.compute_rabc_weights \\
+        --dataset-repo-id lerobot/libero_10_image \\
+        --reward-model-path lerobot/Robometer-4B \\
+        --batch-size 16
+        """,
+    )
+    parser.add_argument(
+        "--dataset-repo-id", type=str, required=True, help="HuggingFace dataset repo id or local path."
+    )
+    parser.add_argument(
+        "--reward-model-path", type=str, default=None, help="Robometer checkpoint repo id or local path."
+    )
+    parser.add_argument("--output-path", type=str, default=None, help="Output parquet path.")
+    parser.add_argument("--device", type=str, default="cuda", help="Device to use (default: cuda).")
+    parser.add_argument(
+        "--batch-size", type=int, default=32, help="Sub-samples per Qwen forward (default: 32)."
+    )
+    parser.add_argument(
+        "--num-subsampled-frames",
+        type=int,
+        default=DEFAULT_NUM_SUBSAMPLED_FRAMES,
+        help=f"Frames per sub-sample (default: {DEFAULT_NUM_SUBSAMPLED_FRAMES}, matches eval server).",
+    )
+    parser.add_argument(
+        "--episodes", type=int, nargs="+", default=None, help="Process only these episode indices."
+    )
+    parser.add_argument(
+        "--image-key", type=str, default=None, help="Image observation key (default: from config)."
+    )
+    parser.add_argument(
+        "--push-to-hub", action="store_true", help="Upload to the dataset repo on HuggingFace Hub."
+    )
+
+    args = parser.parse_args()
+
+    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
+
+    reward_model_path = args.reward_model_path
+    if reward_model_path is None:
+        temp_dataset = LeRobotDataset(args.dataset_repo_id, download_videos=False)
+        parquet_path = Path(temp_dataset.root) / DEFAULT_OUTPUT_FILENAME
+        reward_model_path = get_reward_model_path_from_parquet(parquet_path)
+        if reward_model_path:
+            logging.info(f"Using reward model from parquet metadata: {reward_model_path}")
+        else:
+            raise ValueError(
+                "--reward-model-path is required (no existing parquet with model metadata found)."
+            )
+
+    output_path = compute_robometer_progress(
+        dataset_repo_id=args.dataset_repo_id,
+        reward_model_path=reward_model_path,
+        output_path=args.output_path,
+        device=args.device,
+        batch_size=args.batch_size,
+        num_subsampled_frames=args.num_subsampled_frames,
+        episodes=args.episodes,
+        image_key=args.image_key,
+    )
+
+    print(f"\nRobometer progress saved to: {output_path}")
+
+    if args.push_to_hub:
+        from huggingface_hub import HfApi
+
+        api = HfApi()
+        hub_path = DEFAULT_OUTPUT_FILENAME
+
+        print(f"\nUploading to Hub: {args.dataset_repo_id}/{hub_path}")
+        api.upload_file(
+            path_or_fileobj=str(output_path),
+            path_in_repo=hub_path,
+            repo_id=args.dataset_repo_id,
+            repo_type="dataset",
+        )
+        print(
+            "Successfully uploaded to: "
+            f"https://huggingface.co/datasets/{args.dataset_repo_id}/blob/main/{hub_path}"
+        )
+
+        print("\nTo use in training, add to your config:")
+        print("  use_rabc: true")
+        print(f"  rabc_progress_path: hf://datasets/{args.dataset_repo_id}/{hub_path}")
+        print("  rabc_head_mode: sparse")
+    else:
+        print("\nTo use in training, add to your config:")
+        print("  use_rabc: true")
+        print(f"  rabc_progress_path: {output_path}")
+        print("  rabc_head_mode: sparse")
+
+
+if __name__ == "__main__":
+    main()
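Review note: the frame-steps expansion described in the module docstring above is easiest to see on a tiny episode. The sketch below reuses the same `np.linspace(...).round()` expression as `_build_subsample_indices`. The public name `build_subsample_indices` and the 5-frame episode are illustrative assumptions, not part of the PR.

```python
import numpy as np


def build_subsample_indices(num_frames, num_subsampled_frames=4):
    # Same expression as `_build_subsample_indices` in the diff: for every
    # frame t, pick K indices evenly spread over the prefix [0, t].
    return [
        np.linspace(0, t, num_subsampled_frames).round().astype(np.int64)
        for t in range(num_frames)
    ]


windows = build_subsample_indices(num_frames=5)
for t, idx in enumerate(windows):
    print(t, idx.tolist())
# 0 [0, 0, 0, 0]
# 1 [0, 0, 1, 1]
# 2 [0, 1, 1, 2]
# 3 [0, 1, 2, 3]
# 4 [0, 1, 3, 4]

# Every window has the same length K, so the windows stack into one array.
stacked = np.stack(windows)
print(stacked.shape)  # (5, 4)
```

Each window always starts at frame 0 and ends at frame `t`, which is why the script keeps only the last-frame progress value of each forward pass. Early frames repeat indices because the prefix is shorter than `K`.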
--- a/src/lerobot/rewards/robometer/configuration_robometer.py
+++ b/src/lerobot/rewards/robometer/configuration_robometer.py
@@ -0,0 +1,158 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import annotations
+
+from copy import deepcopy
+from dataclasses import dataclass, field
+from typing import TYPE_CHECKING, Any
+
+from lerobot.configs import FeatureType, NormalizationMode, PolicyFeature
+from lerobot.configs.rewards import RewardModelConfig
+from lerobot.utils.constants import OBS_IMAGES
+from lerobot.utils.import_utils import _transformers_available, require_package
+
+if TYPE_CHECKING or _transformers_available:
+    from transformers import AutoConfig, AutoTokenizer
+else:
+    AutoConfig = None  # type: ignore[assignment]
+    AutoTokenizer = None  # type: ignore[assignment]
+
+
+# Special tokens Robometer adds to the Qwen-VL tokenizer at construction time.
+# The order is part of the data contract: upstream resized ``embed_tokens``
+# after adding these tokens in this exact order, so changing the set or order
+# would silently misalign the saved embedding rows with their token ids.
+# ``<|reward_token|>`` and ``<|sim_token|>`` are leftover from earlier upstream
+# heads (never read at inference) but still occupy rows the checkpoint expects.
+ROBOMETER_SPECIAL_TOKENS = (
+    "<|split_token|>",
+    "<|reward_token|>",
+    "<|pref_token|>",
+    "<|sim_token|>",
+    "<|prog_token|>",
+)
+
+
+@RewardModelConfig.register_subclass("robometer")
+@dataclass
+class RobometerConfig(RewardModelConfig):
+    """Configuration for the Robometer reward model."""
+
+    pretrained_path: str | None = "lerobot/Robometer-4B"
+    image_key: str = OBS_IMAGES + ".top"
+    task_key: str = "task"
+    default_task: str | None = None
+
+    max_frames: int | None = 8
+    reward_output: str = "progress"  # "progress" or "success"
+    success_threshold: float = 0.5
+
+    license: str | None = "apache-2.0"
+    tags: list[str] | None = field(
+        default_factory=lambda: ["reward-model", "vision-language", "qwen3-vl", "zero-shot"]
+    )
+
+    base_model_id: str = "Qwen/Qwen3-VL-4B-Instruct"
+    torch_dtype: str = "bfloat16"
+    use_multi_image: bool = True
+    use_per_frame_progress_token: bool = True
+    average_temporal_patches: bool = True
+    frame_pooling: str = "mean"  # "mean" | "boundary" | "attention"
+    frame_pooling_attn_temperature: float = 1.0
+    progress_loss_type: str = "discrete"  # "l1" | "l2" | "discrete"
+    progress_discrete_bins: int = 10
+
+    # Serialised Qwen backbone config (post-resize). Populated by ``__post_init__``
+    # from ``base_model_id`` when left empty, with ``vocab_size`` set to
+    # ``len(tokenizer) + len(ROBOMETER_SPECIAL_TOKENS)``, so it is non-empty after
+    # construction. Saved into ``config.json`` by the base ``_save_pretrained``.
+    vlm_config: dict[str, Any] = field(default_factory=dict)
+
+    input_features: dict[str, PolicyFeature] = field(default_factory=dict)
+    output_features: dict[str, PolicyFeature] = field(default_factory=dict)
+    normalization_mapping: dict[str, NormalizationMode] = field(
+        default_factory=lambda: {
+            "VISUAL": NormalizationMode.IDENTITY,
+            "REWARD": NormalizationMode.IDENTITY,
+        }
+    )
+
+    def __post_init__(self) -> None:
+        super().__post_init__()
+        if self.reward_output not in {"progress", "success"}:
+            raise ValueError(f"reward_output must be 'progress' or 'success', got {self.reward_output!r}")
+        if self.max_frames is not None and self.max_frames < 1:
+            raise ValueError(f"max_frames must be >= 1, got {self.max_frames}")
+        if self.frame_pooling not in {"mean", "boundary", "attention"}:
+            raise ValueError(f"frame_pooling must be mean/boundary/attention; got {self.frame_pooling!r}")
+        if self.frame_pooling_attn_temperature <= 0:
+            raise ValueError("frame_pooling_attn_temperature must be > 0")
+        if self.progress_loss_type not in {"l1", "l2", "discrete"}:
+            raise ValueError(f"progress_loss_type must be l1/l2/discrete; got {self.progress_loss_type!r}")
+        if self.use_per_frame_progress_token and not self.use_multi_image:
+            raise ValueError("use_per_frame_progress_token=True requires use_multi_image=True")
+
+        if self.image_key not in self.input_features:
+            self.input_features[self.image_key] = PolicyFeature(shape=(3, 224, 224), type=FeatureType.VISUAL)
+        self.output_features.setdefault("progress", PolicyFeature(shape=(1,), type=FeatureType.REWARD))
+        self.output_features.setdefault("success", PolicyFeature(shape=(1,), type=FeatureType.REWARD))
+
+        # Deterministically populate ``vlm_config`` so it is non-empty after
+        # construction. For ``Qwen/Qwen3-VL-4B-Instruct`` this gives
+        # ``len(tokenizer) + 5 = 151,669 + 5 = 151,674`` — the exact post-resize
+        # vocab the published ``Robometer-4B`` checkpoint was saved with.
+        if not self.vlm_config:
+            require_package("transformers", extra="robometer")
+            vlm = AutoConfig.from_pretrained(self.base_model_id).to_dict()
+            tokenizer = AutoTokenizer.from_pretrained(self.base_model_id)
+            text_config = vlm.get("text_config")
+            if not isinstance(text_config, dict):
+                raise ValueError(
+                    f"Backbone config for {self.base_model_id!r} has no nested `text_config`; "
+                    "Robometer expects a Qwen-VL-style config."
+                )
+            text_config["vocab_size"] = len(tokenizer) + len(ROBOMETER_SPECIAL_TOKENS)
+            self.vlm_config = vlm
+
+    @property
+    def use_discrete_progress(self) -> bool:
+        """Whether the progress head outputs distribution logits over bins."""
+        return self.progress_loss_type.lower() == "discrete"
+
+    @property
+    def vlm_backbone_config(self):
+        """Reconstruct the Qwen backbone config from :attr:`vlm_config`."""
+        require_package("transformers", extra="robometer")
+        config_dict = deepcopy(self.vlm_config)
+        model_type = config_dict.pop("model_type", None)
+        if model_type is None:
+            raise ValueError("vlm_config must include `model_type` to reconstruct the backbone config")
+        return AutoConfig.for_model(model_type, **config_dict)
+
+    @property
+    def observation_delta_indices(self) -> list[int] | None:
+        return None
+
+    @property
+    def action_delta_indices(self) -> None:
+        return None
+
+    @property
+    def reward_delta_indices(self) -> None:
+        return None
+
+    def validate_features(self) -> None:
+        if self.image_key not in self.input_features:
+            raise ValueError(f"Robometer requires image input feature {self.image_key!r}")
--- a/src/lerobot/rewards/robometer/modeling_robometer.py
+++ b/src/lerobot/rewards/robometer/modeling_robometer.py
@@ -0,0 +1,481 @@
+# Copyright 2026 Anthony Liang, Yigit Korkmaz, Stephen Tu, Erdem Bıyık, Jesse Zhang
+# and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""ROBOMETER: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons.
+
+Paper:         https://arxiv.org/abs/2603.02115
+Project:       https://robometer.github.io
+Original code: https://github.com/aliang8/robometer
+Model:         https://huggingface.co/robometer/Robometer-4B
+
+Robometer is a general-purpose, video-language-input reward model built on
+``Qwen/Qwen3-VL-4B-Instruct``. It is trained with a dual reward-prediction
+objective:
+
+- A frame-level progress loss anchoring reward magnitude on expert data.
+- A trajectory-comparison preference loss imposing global ordering constraints
+  across trajectories sharing the same instruction.
+
+To support downstream RL it also predicts a frame-level binary success flag. The
+training prompt inserts three learnable tokens:
+
+- ``<|prog_token|>`` after each frame to read per-frame progress and success.
+- ``<|pref_token|>`` at the end to read pairwise preference (training-only).
+- ``<|split_token|>`` between two trajectories in preference samples
+  (training-only).
+
+Progress is modeled as a categorical distribution over ``progress_discrete_bins``
+uniformly-spaced centers in ``[0, 1]`` (C51-style), and the continuous estimate
+is recovered as the softmax-weighted mean of those centers — see
+:func:`convert_bins_to_continuous`.
+
+This LeRobot port is **inference-only**: the preference head is preserved in
+the state dict for byte-equivalence with the published ``Robometer-4B``
+checkpoint but is not queried by :meth:`RobometerRewardModel.compute_reward`,
+which returns the last-frame progress (clamped to ``[0, 1]``) or a binary success
+flag (sigmoid probability > ``success_threshold``), per :attr:`RobometerConfig.reward_output`.
+"""
+
+from __future__ import annotations
+
+import logging
+from typing import TYPE_CHECKING, Any
+
+import torch
+from torch import Tensor, nn
+
+from lerobot.rewards.pretrained import PreTrainedRewardModel
+from lerobot.rewards.robometer.configuration_robometer import RobometerConfig
+from lerobot.utils.constants import OBS_PREFIX
+from lerobot.utils.import_utils import _transformers_available, require_package
+
+if TYPE_CHECKING or _transformers_available:
+    from transformers import AutoModelForImageTextToText
+else:
+    AutoModelForImageTextToText = None  # type: ignore[assignment]
+
+logger = logging.getLogger(__name__)
+
+# Namespace for Robometer's pre-encoded Qwen-VL observation tensors.
+ROBOMETER_FEATURE_PREFIX = f"{OBS_PREFIX}robometer."
+ROBOMETER_QWEN_INPUT_KEYS = (
+    "input_ids",
+    "attention_mask",
+    "pixel_values",
+    "pixel_values_videos",
+    "image_grid_thw",
+    "video_grid_thw",
+    "second_per_grid_ts",
+    "mm_token_type_ids",
+)
+ROBOMETER_METADATA_KEYS = (
+    "prog_token_id",
+    "vision_start_token_id",
+    "vision_end_token_id",
+    "video_merge_size",
+)
+ROBOMETER_INPUT_KEYS = ROBOMETER_QWEN_INPUT_KEYS + ROBOMETER_METADATA_KEYS
+
+
+def convert_bins_to_continuous(bin_logits: Tensor) -> Tensor:
+    """Collapse per-bin logits into a single value in ``[0, 1]``.
+
+    The discrete progress head outputs ``num_bins`` logits per frame. Bins are
+    evenly spaced centers in ``[0, 1]``; the continuous prediction is the
+    softmax-weighted mean of those centers.
+    """
+    bin_probs = torch.softmax(bin_logits, dim=-1)
+    num_bins = bin_logits.shape[-1]
+    bin_centers = torch.linspace(0.0, 1.0, num_bins, device=bin_logits.device, dtype=bin_logits.dtype)
+    return (bin_probs * bin_centers).sum(dim=-1)
+
+
+def _squeeze_last_safe(x: Tensor) -> Tensor:
+    """Drop a trailing singleton dim only when present."""
+    return x.squeeze(-1) if x.ndim > 1 and x.shape[-1] == 1 else x
+
+
+def _torch_dtype(name: str) -> torch.dtype:
+    dtype = getattr(torch, name, None)
+    if isinstance(dtype, torch.dtype):
+        return dtype
+    raise ValueError(f"Unknown torch dtype: {name!r}")
+
+
+class RobometerPredictionHead(nn.Sequential):
+    """Small MLP head used for Robometer's progress / success / preference outputs."""
+
+    def __init__(self, hidden_dim: int, output_size: int, *, dropout: float, with_sigmoid: bool) -> None:
+        layers: list[nn.Module] = [
+            nn.Linear(hidden_dim, hidden_dim // 2),
+            nn.LayerNorm(hidden_dim // 2),
+            nn.GELU(),
+            nn.Dropout(dropout),
+            nn.Linear(hidden_dim // 2, output_size),
+        ]
+        if with_sigmoid:
+            layers.append(nn.Sigmoid())
+        super().__init__(*layers)
+
+
+def decode_progress_outputs(
+    progress_logits: Tensor | None,
+    success_logits: Tensor | None,
+    *,
+    is_discrete_mode: bool,
+) -> dict[str, list[list[float]]]:
+    """Decode RBM head outputs into per-frame floats.
+
+    Args:
+        progress_logits: ``(B, T)`` (continuous) or ``(B, T, num_bins)`` (discrete).
+        success_logits: ``(B, T)`` raw logits, ``sigmoid``-ed to probabilities.
+        is_discrete_mode: if True the progress logits get a softmax over bins
+            and are projected onto bin centers via :func:`convert_bins_to_continuous`.
+
+    Returns:
+        Dict with ``progress_pred`` and ``success_probs``, each a list of
+        length ``B`` of per-frame float lists.
+    """
+    progress_pred: list[list[float]] = []
+    success_probs: list[list[float]] = []
+
+    if progress_logits is not None:
+        for sample_logits in progress_logits:
+            if is_discrete_mode:
+                continuous = convert_bins_to_continuous(sample_logits.detach().float().cpu())
+                progress_pred.append(continuous.flatten().tolist())
+            else:
+                progress_pred.append(sample_logits.detach().float().cpu().flatten().tolist())
+
+    if success_logits is not None:
+        for sample_logits in success_logits:
+            success_probs.append(torch.sigmoid(sample_logits.detach().float().cpu()).flatten().tolist())
+
+    return {"progress_pred": progress_pred, "success_probs": success_probs}
+
+
+class RobometerRewardModel(PreTrainedRewardModel):
+    """Robometer (RBM) reward model — inference-only LeRobot port.
+
+    Wraps a Qwen-VL backbone (default: ``Qwen/Qwen3-VL-4B-Instruct``) with three
+    prediction heads from the paper (progress, success, preference). At
+    inference time only the progress and success heads are queried; the
+    preference head is kept on the module so the published ``Robometer-4B``
+    safetensors load unchanged.
+    """
+
+    name = "robometer"
+    config_class = RobometerConfig
+
+    def __init__(self, config: RobometerConfig, *, dropout: float = 0.1) -> None:
+        require_package("transformers", extra="robometer")
+        super().__init__(config)
+        self.config = config
+
+        # Two backbone-build paths (EO-1 style, branched on ``pretrained_path``):
+        #
+        #   - Fresh training (``pretrained_path is None``): download the base
+        #     Qwen weights and resize the embed table to match
+        #     ``vlm_config.text_config.vocab_size`` — populated deterministically
+        #     in ``RobometerConfig.__post_init__`` as
+        #     ``len(tokenizer) + len(ROBOMETER_SPECIAL_TOKENS)``.
+        #
+        #   - Loading a saved checkpoint (``pretrained_path`` is set): rebuild
+        #     the empty architecture from ``vlm_config`` via
+        #     ``AutoModelForImageTextToText.from_config`` so the subsequent
+        #     ``model.safetensors`` load is a direct fill of the right shape —
+        #     no redundant Qwen weight download.
+        torch_dtype = _torch_dtype(config.torch_dtype)
+        if config.pretrained_path is None:
+            self.model = AutoModelForImageTextToText.from_pretrained(
+                config.base_model_id,
+                dtype=torch_dtype,
+                trust_remote_code=True,
+            )
+            target_vocab = config.vlm_config["text_config"]["vocab_size"]
+            self.model.resize_token_embeddings(target_vocab)
+        else:
+            self.model = AutoModelForImageTextToText.from_config(
+                config.vlm_backbone_config,
+                dtype=torch_dtype,
+                trust_remote_code=True,
+            )
+
+        # All Qwen-VL backbones Robometer supports expose `text_config.hidden_size`.
+        # Fall back to the top-level `hidden_size` so that future non-multimodal
+        # variants still resolve.
+        backbone_config = self.model.config
+        text_config = getattr(backbone_config, "text_config", None)
+        hidden_size = getattr(text_config, "hidden_size", None) if text_config is not None else None
+        if hidden_size is None:
+            hidden_size = getattr(backbone_config, "hidden_size", None)
+        if hidden_size is None:
+            raise AttributeError(
+                f"Could not infer hidden_size from backbone config of {config.base_model_id}"
+            )
+        hidden_dim = int(hidden_size)
+
+        # Robometer's three prediction heads + frame-pool attention.
+        progress_output = config.progress_discrete_bins if config.use_discrete_progress else 1
+        self.progress_head = RobometerPredictionHead(
+            hidden_dim,
+            progress_output,
+            dropout=dropout,
+            with_sigmoid=not config.use_discrete_progress,
+        )
+        self.preference_head = RobometerPredictionHead(hidden_dim, 1, dropout=dropout, with_sigmoid=False)
+        self.success_head = RobometerPredictionHead(hidden_dim, 1, dropout=dropout, with_sigmoid=False)
+        self.frame_pool_attn = nn.Linear(hidden_dim, 1, bias=False)
+
+        # Match the dtype of the loaded base model so weight loading is a no-op cast.
+        model_dtype = next(self.model.parameters()).dtype
+        self.progress_head.to(dtype=model_dtype)
+        self.preference_head.to(dtype=model_dtype)
+        self.success_head.to(dtype=model_dtype)
+        self.frame_pool_attn.to(dtype=model_dtype)
+
+    def compute_reward(self, batch: dict[str, Tensor]) -> Tensor:
+        inputs = {
+            key: batch[f"{ROBOMETER_FEATURE_PREFIX}{key}"]
+            for key in ROBOMETER_INPUT_KEYS
+            if f"{ROBOMETER_FEATURE_PREFIX}{key}" in batch
+        }
+        if "input_ids" not in inputs:
+            raise KeyError(
+                "Robometer batch missing pre-encoded inputs (expected "
+                f"`{ROBOMETER_FEATURE_PREFIX}input_ids`). Make sure the "
+                "RobometerEncoderProcessorStep ran before `compute_reward`."
+            )
+
+        device = next(self.model.parameters()).device
+        inputs = {key: value.to(device) if hasattr(value, "to") else value for key, value in inputs.items()}
+
+        self.eval()
+        with torch.no_grad():
+            progress_logits, success_logits = self._compute_rbm_logits(inputs)
+
+        decoded = decode_progress_outputs(
+            progress_logits,
+            success_logits,
+            is_discrete_mode=self.config.use_discrete_progress,
+        )
+        values = (
+            decoded["success_probs"] if self.config.reward_output == "success" else decoded["progress_pred"]
+        )
+
+        rewards = torch.stack([torch.as_tensor(seq, dtype=torch.float32)[-1] for seq in values])
+        if self.config.reward_output == "success":
+            rewards = (rewards > self.config.success_threshold).float()
+        else:
+            # Match upstream Robometer's ``extract_rewards_from_output``: per-frame
+            # progress predictions are clamped to ``[0, 1]`` before being returned.
+            rewards = rewards.clamp(0.0, 1.0)
+        return rewards.to(self.config.device or "cpu")
+
+    def _compute_rbm_logits(
+        self,
+        inputs: dict[str, Any],
+    ) -> tuple[Tensor, Tensor]:
+        """Run the Qwen3-VL backbone and apply Robometer's heads.
+
+        ``inputs`` is the encoded batch produced by
+        :class:`RobometerEncoderProcessorStep`. It carries Qwen tensors as well
+        as Robometer-specific metadata (``prog_token_id``,
+        ``vision_start_token_id``, ``vision_end_token_id``, ``video_merge_size``)
+        — the metadata is popped here so the rest can be forwarded straight to
+        the Qwen model.
+
+        Returns ``(progress_logits, success_logits)``. Shapes:
+
+        - ``progress_logits``: ``(B, T)`` (continuous) or ``(B, T, num_bins)`` (discrete).
+        - ``success_logits``: ``(B, T)`` raw logits (sigmoid happens at decode time).
+        """
+        prog_token_id = inputs.pop("prog_token_id", None)
+        vision_start_token_id = inputs.pop("vision_start_token_id", None)
+        vision_end_token_id = inputs.pop("vision_end_token_id", None)
+        video_merge_size = inputs.pop("video_merge_size", 14)
+
+        # Qwen3-VL doesn't reliably populate `last_hidden_state`; ask for the
+        # full hidden-state tuple and take the last layer. This matches the
+        # `is_qwen3` path in upstream Robometer's `RBM.forward_qwen` (main).
+        outputs = self.model(**inputs, output_hidden_states=True, return_dict=True)
+        hidden_state = (
+            outputs.hidden_states[-1]
+            if getattr(outputs, "hidden_states", None)
+            else outputs.last_hidden_state
+        )
+
+        input_ids = inputs["input_ids"]
+        if self.config.use_per_frame_progress_token:
+            if prog_token_id is None:
+                raise KeyError("`prog_token_id` missing in batch (run RobometerEncoderProcessorStep first)")
+            return self._process_token_extraction(hidden_state, input_ids, prog_token_id=prog_token_id)
+        if self.config.use_multi_image:
+            if vision_start_token_id is None or vision_end_token_id is None:
+                raise KeyError(
+                    "`vision_start_token_id` / `vision_end_token_id` missing in batch "
+                    "(run RobometerEncoderProcessorStep first)"
+                )
+            return self._process_multi_image_frames(
+                hidden_state,
+                input_ids,
+                start_id=vision_start_token_id,
+                end_id=vision_end_token_id,
+            )
+        video_grid_thw = inputs.get("video_grid_thw")
+        if video_grid_thw is None:
+            raise ValueError("video_grid_thw is required for video-mode Robometer inference")
+        if vision_start_token_id is None:
+            raise KeyError("`vision_start_token_id` missing in batch")
+        return self._process_video_frames(
+            hidden_state,
+            input_ids,
+            video_grid_thw,
+            start_id=vision_start_token_id,
+            merge_size=video_merge_size,
+        )
+
+    def _apply_heads_to_hidden_states(self, frame_embeddings: Tensor) -> tuple[Tensor, Tensor]:
+        """Apply progress + success heads to a tensor of frame embeddings."""
+        progress_out = self.progress_head(frame_embeddings)
+        progress = progress_out if self.config.use_discrete_progress else _squeeze_last_safe(progress_out)
+        success = _squeeze_last_safe(self.success_head(frame_embeddings))
+        return progress, success
+
+    def _process_token_extraction(
+        self,
+        hidden_state: Tensor,
+        input_ids: Tensor,
+        *,
+        prog_token_id: int,
+    ) -> tuple[Tensor, Tensor]:
+        """Per-frame progress/success from ``<|prog_token|>`` positions."""
+        token_mask = input_ids == prog_token_id
+        batch_indices, positions = token_mask.nonzero(as_tuple=True)
+        if positions.numel() == 0:
+            raise ValueError("`<|prog_token|>` not found in any sequence")
+
+        per_sample_hidden = [
+            hidden_state[i, positions[batch_indices == i]] for i in range(input_ids.shape[0])
+        ]
+        progress_list, success_list = [], []
+        for embeddings in per_sample_hidden:
+            if embeddings.shape[0] == 0:
+                raise ValueError("`<|prog_token|>` missing in a sequence")
+            progress, success = self._apply_heads_to_hidden_states(embeddings)
+            progress_list.append(progress)
+            success_list.append(success)
+
+        return torch.stack(progress_list), torch.stack(success_list)
+
+    def _process_multi_image_frames(
+        self,
+        hidden_state: Tensor,
+        input_ids: Tensor,
+        *,
+        start_id: int,
+        end_id: int,
+    ) -> tuple[Tensor, Tensor]:
+        """Per-frame progress/success in multi-image mode (Qwen-VL)."""
+        progress_list, success_list = [], []
+        for batch_idx in range(input_ids.shape[0]):
+            seq_ids = input_ids[batch_idx]
+            seq_hidden = hidden_state[batch_idx]
+            frame_embeddings = self._extract_hidden_states_from_token_pairs(
+                seq_hidden, seq_ids, start_id, end_id
+            )
+            progress, success = self._apply_heads_to_hidden_states(frame_embeddings)
+            progress_list.append(progress)
+            success_list.append(success)
+
+        return torch.stack(progress_list), torch.stack(success_list)
+
+    def _extract_hidden_states_from_token_pairs(
+        self,
+        hidden_state: Tensor,
+        input_ids: Tensor,
+        start_id: int,
+        end_id: int,
+    ) -> Tensor:
+        start_positions = (input_ids == start_id).nonzero(as_tuple=True)[0]
+        end_positions = (input_ids == end_id).nonzero(as_tuple=True)[0]
+        if start_positions.numel() == 0:
+            raise ValueError("`<|vision_start|>` not found in sequence")
+        if start_positions.numel() != end_positions.numel():
+            raise ValueError(
+                f"Mismatched vision token counts: {start_positions.numel()} start vs "
+                f"{end_positions.numel()} end"
+            )
+
+        frames: list[Tensor] = []
+        for start, end in zip(start_positions.tolist(), end_positions.tolist(), strict=True):
+            if start >= end:
+                raise ValueError(f"Invalid vision token pair: start={start} end={end}")
+            patch_tokens = hidden_state[start + 1 : end]
+            if patch_tokens.shape[0] == 0:
+                frames.append((hidden_state[start] + hidden_state[end]) / 2.0)
+                continue
+
+            pooling = self.config.frame_pooling
+            if pooling == "mean":
+                frames.append(patch_tokens.mean(dim=0))
+            elif pooling == "boundary":
+                frames.append(patch_tokens[-1])
+            else:  # attention
+                scores = (
+                    self.frame_pool_attn(patch_tokens).squeeze(-1)
+                    / self.config.frame_pooling_attn_temperature
+                )
+                weights = torch.softmax(scores, dim=0).unsqueeze(-1)
+                frames.append((weights * patch_tokens).sum(dim=0))
+
+        return torch.stack(frames)
+
+    def _process_video_frames(
+        self,
+        hidden_state: Tensor,
+        input_ids: Tensor,
+        video_grid_thw: Tensor,
+        *,
+        start_id: int,
+        merge_size: int,
+    ) -> tuple[Tensor, Tensor]:
+        """Per-frame progress/success in video mode (Qwen-VL)."""
+        progress_list, success_list = [], []
+        for batch_idx in range(input_ids.shape[0]):
+            seq_ids = input_ids[batch_idx]
+            seq_hidden = hidden_state[batch_idx]
+            start_positions = (seq_ids == start_id).nonzero(as_tuple=True)[0]
+            if start_positions.numel() == 0:
+                raise ValueError("`<|vision_start|>` not found in sequence")
+            t_dim, h_dim, w_dim = (int(x) for x in video_grid_thw[batch_idx].tolist())
+            tokens_per_frame = (h_dim * w_dim) // (merge_size**2)
+
+            cursor = start_positions[0].item()
+            frame_embeddings: list[Tensor] = []
+            for _ in range(t_dim):
+                if self.config.average_temporal_patches:
+                    patch = seq_hidden[cursor : cursor + tokens_per_frame]
+                    frame_embeddings.append(patch.mean(dim=0))
+                else:
+                    frame_embeddings.append(seq_hidden[cursor + tokens_per_frame])
+                cursor += tokens_per_frame
+
+            stacked = torch.stack(frame_embeddings)
+            progress, success = self._apply_heads_to_hidden_states(stacked)
+            progress_list.append(progress)
+            success_list.append(success)
+
+        return torch.stack(progress_list), torch.stack(success_list)
--- a/src/lerobot/rewards/robometer/processor_robometer.py
+++ b/src/lerobot/rewards/robometer/processor_robometer.py
@@ -0,0 +1,338 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Robometer pre/post processing pipelines."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass, field
+from typing import TYPE_CHECKING, Any
+
+import numpy as np
+import torch
+from PIL import Image
+from torch import Tensor
+
+from lerobot.configs import PipelineFeatureType, PolicyFeature
+from lerobot.processor import (
+    AddBatchDimensionProcessorStep,
+    DeviceProcessorStep,
+    PolicyAction,
+    PolicyProcessorPipeline,
+    ProcessorStep,
+    ProcessorStepRegistry,
+    policy_action_to_transition,
+)
+from lerobot.rewards.robometer.configuration_robometer import (
+    ROBOMETER_SPECIAL_TOKENS,
+    RobometerConfig,
+)
+from lerobot.rewards.robometer.modeling_robometer import ROBOMETER_FEATURE_PREFIX
+from lerobot.types import EnvTransition, TransitionKey
+from lerobot.utils.constants import (
+    OBS_IMAGES,
+    POLICY_POSTPROCESSOR_DEFAULT_NAME,
+    POLICY_PREPROCESSOR_DEFAULT_NAME,
+)
+from lerobot.utils.import_utils import _transformers_available, require_package
+
+if TYPE_CHECKING or _transformers_available:
+    from transformers import AutoProcessor
+else:
+    AutoProcessor = None
+
+PROGRESS_PROMPT = (
+    "The task for the robot is '{task}'. Given the trajectory video, predict "
+    "the task progress at each frame, how far along the robot is towards "
+    "completing the task, a float between 0 and 1, where 0 is the starting "
+    "state and 1 is when the task is completed. If the robot is not "
+    "performing the same task, predict 0 progress."
+)
+
+
+def _frames_to_pil(frames: np.ndarray) -> list[Image.Image]:
+    """Convert ``(T, H, W, C)`` uint8 frames to a list of PIL images."""
+    if frames.ndim != 4:
+        raise ValueError(f"Expected (T,H,W,C) frames; got shape {frames.shape}")
+    if frames.dtype != np.uint8:
+        frames = np.clip(frames, 0, 255).astype(np.uint8)
+    return [Image.fromarray(frames[i]) for i in range(frames.shape[0])]
+
+
+def _video_to_numpy(video: Tensor, *, max_frames: int | None) -> np.ndarray:
+    """Convert one trajectory tensor to a ``(T, H, W, C) uint8`` numpy array."""
+    if max_frames is not None:
+        video = video[-max_frames:]
+    if video.shape[1] in (1, 3):
+        video = video.permute(0, 2, 3, 1)
+    elif video.shape[-1] not in (1, 3):
+        raise ValueError(f"Expected channel dim of size 1 or 3, got shape {tuple(video.shape)}")
+
+    array = video.detach().cpu().numpy()
+    if np.issubdtype(array.dtype, np.floating) and array.size > 0 and array.max() <= 1.0:
+        array = array * 255.0
+    return np.clip(array, 0, 255).astype(np.uint8)
+
+
+def _expand_tasks(task: Any, *, batch_size: int, default: str | None) -> list[str]:
+    if task is None:
+        task = default
+    if task is None:
+        raise KeyError("Robometer expected a task description in complementary data")
+    if isinstance(task, str):
+        return [task] * batch_size
+    if isinstance(task, tuple):
+        task = list(task)
+    if not (isinstance(task, list) and all(isinstance(item, str) for item in task)):
+        raise TypeError(f"Robometer task must be a string or list of strings, got {type(task)}")
+    if len(task) == 1 and batch_size > 1:
+        return task * batch_size
+    if len(task) != batch_size:
+        raise ValueError(f"Expected {batch_size} tasks, got {len(task)}")
+    return task
+
+
+@dataclass
+@ProcessorStepRegistry.register(name="robometer_encoder")
+class RobometerEncoderProcessorStep(ProcessorStep):
+    """Encode raw frames + task into Qwen-VL tensors for the Robometer model.
+
+    Loads a :class:`~transformers.AutoProcessor` matching ``base_model_id`` and
+    registers Robometer's special tokens on the tokenizer. The matching
+    embedding resize happens model-side in
+    :meth:`RobometerRewardModel.__init__`.
+
+    At call time the step reads:
+
+    - ``observation[image_key]``: ``(B, T, C, H, W)`` or ``(B, C, H, W)`` frames.
+    - ``complementary_data[task_key]``: a string or list of strings.
+
+    and writes ``observation[f"{ROBOMETER_FEATURE_PREFIX}<name>"]`` for:
+
+    - the Qwen-VL processor outputs: ``input_ids``, ``attention_mask``,
+      ``pixel_values``, ``image_grid_thw``, ``video_grid_thw``, ...
+    - Robometer-specific token ids consumed by the model heads:
+      ``prog_token_id``, ``vision_start_token_id``, ``vision_end_token_id``,
+      ``video_merge_size``.
+    """
+
+    base_model_id: str = "Qwen/Qwen3-VL-4B-Instruct"
+    image_key: str = OBS_IMAGES + ".top"
+    task_key: str = "task"
+    default_task: str | None = None
+    max_frames: int | None = 8
+    use_multi_image: bool = True
+    use_per_frame_progress_token: bool = True
+    max_length: int = 1024
+
+    _processor: Any = field(default=None, init=False, repr=False)
+
+    def __post_init__(self) -> None:
+        require_package("transformers", extra="robometer")
+        require_package("qwen-vl-utils", extra="robometer", import_name="qwen_vl_utils")
+
+        self._processor = AutoProcessor.from_pretrained(
+            self.base_model_id,
+            trust_remote_code=True,
+            do_sample_frames=False,
+            padding_side="right",
+        )
+
+        # Register Robometer's special tokens on the tokenizer. The matching
+        # embedding resize happens model-side in `RobometerRewardModel.__init__`.
+        tokenizer = self._processor.tokenizer
+        # Qwen tokenizers may not define a pad token, but batched prompts/videos
+        # require padding, so reuse EOS as the padding token.
+        if tokenizer.pad_token is None:
+            tokenizer.pad_token = tokenizer.eos_token
+        for token in ROBOMETER_SPECIAL_TOKENS:
+            if token not in tokenizer.get_vocab():
+                tokenizer.add_special_tokens({"additional_special_tokens": [token]})
+
+    def __call__(self, transition: EnvTransition) -> EnvTransition:
+        observation = transition.get(TransitionKey.OBSERVATION)
+        complementary = transition.get(TransitionKey.COMPLEMENTARY_DATA) or {}
+        if not isinstance(observation, dict):
+            raise ValueError("RobometerEncoderProcessorStep requires an observation dict")
+
+        if self.image_key not in observation:
+            raise KeyError(f"Robometer expected image key {self.image_key!r} in observation")
+
+        frames = observation[self.image_key]
+        tensor = frames.detach().cpu() if isinstance(frames, Tensor) else torch.as_tensor(frames)
+        if tensor.ndim == 4:
+            tensor = tensor.unsqueeze(1)
+        elif tensor.ndim != 5:
+            raise ValueError(
+                f"Expected Robometer frames with shape (B,C,H,W) or (B,T,C,H,W); got {tuple(tensor.shape)}"
+            )
+
+        batch_size = tensor.shape[0]
+        tasks = _expand_tasks(
+            complementary.get(self.task_key, self.default_task),
+            batch_size=batch_size,
+            default=self.default_task,
+        )
+
+        samples = [
+            (_video_to_numpy(tensor[i], max_frames=self.max_frames), tasks[i]) for i in range(batch_size)
+        ]
+        encoded = self.encode_samples(samples)
+
+        new_observation = dict(observation)
+        for key, value in encoded.items():
+            new_observation[f"{ROBOMETER_FEATURE_PREFIX}{key}"] = value
+
+        new_transition = transition.copy()
+        new_transition[TransitionKey.OBSERVATION] = new_observation
+        return new_transition
+
+    def encode_samples(self, samples: list[tuple[np.ndarray, str]]) -> dict[str, Tensor]:
+        """Run the Qwen-VL processor on a list of ``(frames, task)`` samples."""
+        from qwen_vl_utils import process_vision_info
+
+        conversations = [self._build_conversation(frames, task) for frames, task in samples]
+
+        texts = [
+            self._processor.apply_chat_template(
+                msg,
+                tokenize=False,
+                add_generation_prompt=False,
+                add_vision_id=True,
+                enable_thinking=False,
+                fps=1,
+            )
+            for msg in conversations
+        ]
+
+        process_kwargs: dict[str, Any] = {
+            "return_video_kwargs": True,
+            "return_video_metadata": True,
+        }
+        image_processor = getattr(self._processor, "image_processor", None)
+        if image_processor is not None and hasattr(image_processor, "patch_size"):
+            process_kwargs["image_patch_size"] = image_processor.patch_size
+
+        image_inputs, video_inputs, video_kwargs = process_vision_info(conversations, **process_kwargs)
+
+        videos: list[Any] | None = None
+        video_metadatas: list[Any] | None = None
+        if video_inputs:
+            if isinstance(video_inputs[0], tuple) and len(video_inputs[0]) == 2:
+                videos_seq, metadatas_seq = zip(*video_inputs, strict=False)
+                videos = list(videos_seq)
+                video_metadatas = list(metadatas_seq)
+            else:
+                videos = list(video_inputs)
+
+        processor_kwargs: dict[str, Any] = {
+            "text": texts,
+            "images": image_inputs,
+            "padding": True,
+            "truncation": False,
+            "max_length": self.max_length,
+            "return_tensors": "pt",
+            "do_resize": False,
+        }
+        if videos is not None:
+            processor_kwargs["videos"] = videos
+        if video_metadatas is not None:
+            processor_kwargs["video_metadata"] = video_metadatas
+        if video_kwargs:
+            processor_kwargs.update(video_kwargs)
+
+        encoded = self._processor(**processor_kwargs)
+
+        # Write Robometer-specific token ids and the video patch merge size into
+        # the encoded batch so `RobometerRewardModel` doesn't need its own
+        # tokenizer at inference (EO1-style separation: the processor owns the
+        # tokenizer, the model owns the backbone and heads).
+        tokenizer = self._processor.tokenizer
+        encoded["prog_token_id"] = tokenizer.convert_tokens_to_ids("<|prog_token|>")
+        encoded["vision_start_token_id"] = tokenizer.convert_tokens_to_ids("<|vision_start|>")
+        encoded["vision_end_token_id"] = tokenizer.convert_tokens_to_ids("<|vision_end|>")
+        video_processor = getattr(self._processor, "video_processor", None)
+        encoded["video_merge_size"] = int(getattr(video_processor, "merge_size", 14))
+        return encoded
+
+    def _build_conversation(self, frames: np.ndarray, task: str) -> list[dict[str, Any]]:
+        pil_frames = _frames_to_pil(frames)
+        prompt = PROGRESS_PROMPT.format(task=task)
+        content: list[dict[str, Any]] = [{"type": "text", "text": prompt}]
+
+        if self.use_multi_image:
+            for image in pil_frames:
+                content.append({"type": "image", "image": image})
+                if self.use_per_frame_progress_token:
+                    content.append({"type": "text", "text": "<|prog_token|>"})
+        else:
+            content.append({"type": "video", "video": pil_frames, "sample_fps": 1.0})
+
+        return [{"role": "user", "content": content}]
+
+    def transform_features(
+        self, features: dict[PipelineFeatureType, dict[str, PolicyFeature]]
+    ) -> dict[PipelineFeatureType, dict[str, PolicyFeature]]:
+        return features
+
+    def get_config(self) -> dict[str, Any]:
+        return {
+            "base_model_id": self.base_model_id,
+            "image_key": self.image_key,
+            "task_key": self.task_key,
+            "default_task": self.default_task,
+            "max_frames": self.max_frames,
+            "use_multi_image": self.use_multi_image,
+            "use_per_frame_progress_token": self.use_per_frame_progress_token,
+            "max_length": self.max_length,
+        }
+
+
+def make_robometer_pre_post_processors(
+    config: RobometerConfig,
+    dataset_stats: dict[str, dict[str, Any]] | None = None,
+) -> tuple[
+    PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
+    PolicyProcessorPipeline[PolicyAction, PolicyAction],
+]:
+    """Pipeline that pre-encodes frames + task into Qwen-VL tensors.
+
+    The preprocessor adds a batch dimension if needed, runs Robometer's
+    encoder, and moves everything to the configured device. The
+    postprocessor is the identity since Robometer outputs a single reward
+    tensor.
+    """
+    del dataset_stats  # Robometer has its own normalisation inside the Qwen-VL processor.
+
+    preprocessor = PolicyProcessorPipeline[dict[str, Any], dict[str, Any]](
+        steps=[
+            AddBatchDimensionProcessorStep(),
+            RobometerEncoderProcessorStep(
+                base_model_id=config.base_model_id,
+                image_key=config.image_key,
+                task_key=config.task_key,
+                default_task=config.default_task,
+                max_frames=config.max_frames,
+                use_multi_image=config.use_multi_image,
+                use_per_frame_progress_token=config.use_per_frame_progress_token,
+            ),
+            DeviceProcessorStep(device=config.device or "cpu"),
+        ],
+        name=POLICY_PREPROCESSOR_DEFAULT_NAME,
+    )
+    postprocessor = PolicyProcessorPipeline(
+        name=POLICY_POSTPROCESSOR_DEFAULT_NAME,
+        to_transition=policy_action_to_transition,
+    )
+    return preprocessor, postprocessor
--- a/src/lerobot/scripts/lerobot_annotate.py
+++ b/src/lerobot/scripts/lerobot_annotate.py
@@ -0,0 +1,201 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""``lerobot-annotate`` — populate ``language_persistent`` and
+``language_events`` columns on a LeRobot dataset.
+
+Annotations live directly in ``data/chunk-*/file-*.parquet``.
+
+Example:
+
+  uv run lerobot-annotate \\
+      --root=/path/to/dataset \\
+      --vlm.model_id=Qwen/Qwen2.5-VL-7B-Instruct
+
+For distributed runs, see ``examples/annotations/run_hf_job.py``.
+"""
+
+import logging
+from pathlib import Path
+
+from lerobot.annotations.steerable_pipeline.config import AnnotationPipelineConfig
+from lerobot.annotations.steerable_pipeline.executor import Executor
+from lerobot.annotations.steerable_pipeline.frames import make_frame_provider
+from lerobot.annotations.steerable_pipeline.modules import (
+    GeneralVqaModule,
+    InterjectionsAndSpeechModule,
+    PlanSubtasksMemoryModule,
+)
+from lerobot.annotations.steerable_pipeline.validator import StagingValidator
+from lerobot.annotations.steerable_pipeline.vlm_client import make_vlm_client
+from lerobot.annotations.steerable_pipeline.writer import LanguageColumnsWriter
+from lerobot.configs import parser
+
+logger = logging.getLogger(__name__)
+
+
+def _resolve_root(cfg: AnnotationPipelineConfig) -> Path:
+    if cfg.root is not None:
+        return Path(cfg.root)
+    if cfg.repo_id is not None:
+        from huggingface_hub import snapshot_download
+
+        return Path(snapshot_download(repo_id=cfg.repo_id, repo_type="dataset"))
+    raise ValueError("Either --root or --repo_id must be provided.")
+
+
+@parser.wrap()
+def annotate(cfg: AnnotationPipelineConfig) -> None:
+    """Run the steerable annotation pipeline against a dataset."""
+    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
+    root = _resolve_root(cfg)
+    logger.info("annotate: root=%s", root)
+
+    vlm = make_vlm_client(cfg.vlm)
+    frame_provider = make_frame_provider(root, camera_key=cfg.vlm.camera_key, video_backend=cfg.video_backend)
+    # Surface the resolved cameras up front so a silent vqa-module no-op
+    # is obvious in job output rather than discovered post-hoc by counting
+    # parquet rows.
+    cam_keys = list(getattr(frame_provider, "camera_keys", []) or [])
+    logger.info(
+        "annotate: frame_provider default camera=%r, all cameras=%s",
+        getattr(frame_provider, "camera_key", None),
+        cam_keys,
+    )
+    if cfg.vqa.enabled and not cam_keys:
+        logger.warning(
+            "annotate: the vqa module is enabled but no cameras were "
+            "resolved — it will produce zero VQA rows. Check "
+            "meta/info.json for observation.images.* features, or pass "
+            "--vlm.camera_key=<key> to seed the cameras list."
+        )
+    plan = PlanSubtasksMemoryModule(vlm=vlm, config=cfg.plan, frame_provider=frame_provider)
+    interjections = InterjectionsAndSpeechModule(
+        vlm=vlm, config=cfg.interjections, seed=cfg.seed, frame_provider=frame_provider
+    )
+    vqa = GeneralVqaModule(vlm=vlm, config=cfg.vqa, seed=cfg.seed, frame_provider=frame_provider)
+    writer = LanguageColumnsWriter()
+    validator = StagingValidator(
+        dataset_camera_keys=tuple(cam_keys) or None,
+    )
+
+    executor = Executor(
+        config=cfg,
+        plan=plan,
+        interjections=interjections,
+        vqa=vqa,
+        writer=writer,
+        validator=validator,
+    )
+    summary = executor.run(root)
+    logger.info("annotate: wrote %d shard(s)", len(summary.written_paths))
+    for phase in summary.phases:
+        logger.info(
+            "annotate: phase=%s processed=%d skipped=%d",
+            phase.name,
+            phase.episodes_processed,
+            phase.episodes_skipped,
+        )
+    if summary.validation_report.warnings:
+        for w in summary.validation_report.warnings:
+            logger.warning(w)
+
+    if cfg.push_to_hub:
+        if cfg.repo_id is None and cfg.dest_repo_id is None:
+            raise ValueError(
+                "--push_to_hub requires --repo_id or --dest_repo_id (the dataset repo to push to)."
+            )
+        _push_to_hub(root, cfg)
+
+
+def _push_to_hub(root: Path, cfg: AnnotationPipelineConfig) -> None:
+    """Upload the annotated dataset directory to the Hub.
+
+    Pushes to ``cfg.dest_repo_id`` when set, otherwise back to ``cfg.repo_id``.
+    """
+    from huggingface_hub import HfApi  # noqa: PLC0415
+
+    repo_id = cfg.dest_repo_id or cfg.repo_id
+    commit_message = cfg.push_commit_message or "Add steerable annotations (lerobot-annotate)"
+    api = HfApi()
+    print(f"[lerobot-annotate] creating/locating dataset repo {repo_id}...", flush=True)
+    api.create_repo(
+        repo_id=repo_id,
+        repo_type="dataset",
+        private=cfg.push_private,
+        exist_ok=True,
+    )
+    print(f"[lerobot-annotate] uploading {root} -> {repo_id}...", flush=True)
+    commit_info = api.upload_folder(
+        folder_path=str(root),
+        repo_id=repo_id,
+        repo_type="dataset",
+        commit_message=commit_message,
+        ignore_patterns=[".annotate_staging/**", "**/.DS_Store"],
+    )
+    print(f"[lerobot-annotate] uploaded to https://huggingface.co/datasets/{repo_id}", flush=True)
+
+    # Tag the upload with the codebase version. ``LeRobotDatasetMetadata``
+    # resolves the dataset revision via ``get_safe_version`` which scans
+    # for tags like ``v3.0``; without a tag it raises
+    # ``RevisionNotFoundError``. Read the version straight from the
+    # dataset's own ``meta/info.json`` so we tag whatever the writer
+    # actually wrote (no accidental drift if the codebase floor moves).
+    from lerobot.datasets.dataset_metadata import CODEBASE_VERSION  # noqa: PLC0415
+
+    info_path = root / "meta" / "info.json"
+    version_tag = CODEBASE_VERSION
+    if info_path.exists():
+        try:
+            from lerobot.utils.io_utils import load_json  # noqa: PLC0415
+
+            info = load_json(info_path)
+            ds_version = info.get("codebase_version")
+            if isinstance(ds_version, str) and ds_version.startswith("v"):
+                version_tag = ds_version
+        except Exception as exc:  # noqa: BLE001
+            print(
+                f"[lerobot-annotate] could not read codebase_version from info.json ({exc}); falling back to {version_tag}",
+                flush=True,
+            )
+    revision = getattr(commit_info, "oid", None)
+    tag_kwargs = {
+        "repo_id": repo_id,
+        "tag": version_tag,
+        "repo_type": "dataset",
+        "exist_ok": True,
+    }
+    if revision is not None:
+        tag_kwargs["revision"] = revision
+
+    try:
+        api.create_tag(**tag_kwargs)
+        print(f"[lerobot-annotate] tagged {repo_id} as {version_tag}", flush=True)
+    except Exception as exc:  # noqa: BLE001
+        print(
+            f"[lerobot-annotate] WARNING: could not create tag {version_tag!r} on {repo_id}: {exc}. "
+            "The dataset is uploaded, but LeRobotDataset cannot load it until it is tagged. "
+            "Run: from huggingface_hub import HfApi; "
+            f"HfApi().create_tag({repo_id!r}, tag={version_tag!r}, repo_type='dataset', exist_ok=True)",
+            flush=True,
+        )
+
+
+def main() -> None:
+    annotate()
+
+
+if __name__ == "__main__":
+    main()
--- a/src/lerobot/templates/lerobot_rewardmodel_modelcard_template.md
+++ b/src/lerobot/templates/lerobot_rewardmodel_modelcard_template.md
@@ -13,6 +13,8 @@
 A reward classifier is a lightweight neural network that scores observations or trajectories for task success, providing a learned reward signal or offline evaluation when explicit rewards are unavailable.
 {% elif model_name == "sarm" %}
 A Success-Aware Reward Model (SARM) predicts a dense reward signal from observations, typically used downstream for reinforcement learning or human-in-the-loop fine-tuning when task success is not directly observable.
+{% elif model_name == "robometer" %}
+ROBOMETER is a general-purpose video-language robotic reward model built on a fine-tuned Qwen3-VL-4B backbone with progress, preference, and success heads. Given a trajectory video and a task description, it predicts dense, frame-level task progress in [0, 1] and frame-level success probabilities for downstream robot learning, including offline RL, online RL, data filtering and retrieval, and automated failure detection.
 {% elif model_name == "topreward" %}
 TOPReward is a **zero-shot** reward model that extracts token log-probabilities from an off-the-shelf vision-language model (default Qwen3-VL) as a reward signal. Given a video trajectory and a task instruction, it returns the VLM's log-likelihood of the instruction being true, with no fine-tuning required.
 {% else %}
--- a/tests/annotations/__init__.py
+++ b/tests/annotations/__init__.py
--- a/tests/annotations/_helpers.py
+++ b/tests/annotations/_helpers.py
@@ -0,0 +1,58 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Helpers shared across annotation-pipeline tests."""
+
+from __future__ import annotations
+
+import json
+from typing import Any
+
+from lerobot.annotations.steerable_pipeline.vlm_client import StubVlmClient
+
+
+def make_canned_responder(
+    responses_by_marker: dict[str, Any],
+    default: Any = None,
+) -> StubVlmClient:
+    """Return a stub that picks a response by inspecting the user prompt.
+
+    For each call the responder examines the last user-message text and
+    returns the response for the first marker, in dict insertion order, that
+    occurs in it. Falls back to ``default`` if no marker matches.
+    """
+
+    def responder(messages: list[dict[str, Any]]) -> Any:
+        last_user_text = ""
+        for message in messages:
+            if message.get("role") != "user":
+                continue
+            content = message.get("content")
+            if isinstance(content, str):
+                last_user_text = content
+            elif isinstance(content, list):
+                for block in content:
+                    if isinstance(block, dict) and block.get("type") == "text":
+                        last_user_text = block.get("text", "")
+        for marker, response in responses_by_marker.items():
+            if marker in last_user_text:
+                return response
+        return default
+
+    return StubVlmClient(responder=responder)
+
+
+def encode_vqa_answer(payload: dict[str, Any]) -> str:
+    return json.dumps(payload, sort_keys=True)
--- a/tests/annotations/conftest.py
+++ b/tests/annotations/conftest.py
@@ -0,0 +1,51 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Shared fixtures for annotation-pipeline tests.
+
+The on-disk dataset builder lives with the other dataset factories in
+``tests/fixtures/dataset_factories.py`` (:func:`build_annotation_dataset`);
+these fixtures only wire it into pytest.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+import pytest
+
+from tests.fixtures.dataset_factories import build_annotation_dataset
+
+
+@pytest.fixture
+def fixture_dataset_root(tmp_path: Path) -> Path:
+    """A tiny dataset with two episodes, 12 frames each at 10 fps."""
+    return build_annotation_dataset(
+        tmp_path / "ds",
+        episode_specs=[
+            (0, 12, "Could you tidy the kitchen please?"),
+            (1, 12, "Please clean up the kitchen"),
+        ],
+        fps=10,
+    )
+
+
+@pytest.fixture
+def single_episode_root(tmp_path: Path) -> Path:
+    return build_annotation_dataset(
+        tmp_path / "ds_one",
+        episode_specs=[(0, 30, "Pour water from the bottle into the cup.")],
+        fps=10,
+    )
--- a/tests/annotations/run_e2e_smoke.py
+++ b/tests/annotations/run_e2e_smoke.py
@@ -0,0 +1,101 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Opt-in E2E smoke run for ``make annotation-e2e``.
+
+Builds the shared annotation fixture (:func:`build_annotation_dataset`),
+runs the full annotation pipeline against it with a stub VLM, and prints a
+short report. This is intentionally a script rather than a pytest test: it
+drives the ``Executor`` end to end with the same modules the CLI builds, and
+it reuses the on-disk dataset builder shared with the pytest fixtures.
+"""
+
+from __future__ import annotations
+
+import sys
+import tempfile
+from pathlib import Path
+
+from lerobot.annotations.steerable_pipeline.config import AnnotationPipelineConfig
+from lerobot.annotations.steerable_pipeline.executor import Executor
+from lerobot.annotations.steerable_pipeline.modules import (
+    GeneralVqaModule,
+    InterjectionsAndSpeechModule,
+    PlanSubtasksMemoryModule,
+)
+from lerobot.annotations.steerable_pipeline.validator import StagingValidator
+from lerobot.annotations.steerable_pipeline.vlm_client import StubVlmClient
+from lerobot.annotations.steerable_pipeline.writer import LanguageColumnsWriter
+from tests.fixtures.dataset_factories import build_annotation_dataset
+
+
+def _stub_responder(messages):
+    text = ""
+    for m in messages:
+        if m.get("role") == "user":
+            content = m.get("content")
+            if isinstance(content, list):
+                for block in content:
+                    if isinstance(block, dict) and block.get("type") == "text":
+                        text = block.get("text", "")
+            elif isinstance(content, str):
+                text = content
+    if "atomic subtasks" in text:
+        return {
+            "subtasks": [
+                {"text": "grasp the bottle", "start": 0.0, "end": 1.0},
+                {"text": "pour into the cup", "start": 1.0, "end": 2.0},
+                {"text": "place the bottle down", "start": 2.0, "end": 3.0},
+            ]
+        }
+    if "concise hierarchical PLAN" in text:
+        return {"plan": "1. grasp\n2. pour\n3. place"}
+    if "Update the memory" in text:
+        return {"memory": "poured once"}
+    if "acknowledgement the robot" in text:
+        return {"text": "Sure."}
+    if "ONE realistic interruption" in text:
+        return {"interjection": "use less water", "speech": "Using less water."}
+    if "frame-grounded visual question" in text:
+        return {"question": "How many cups?", "answer": {"label": "cup", "count": 1}}
+    return None
+
+
+def main() -> int:
+    with tempfile.TemporaryDirectory() as tmp:
+        root = build_annotation_dataset(
+            Path(tmp) / "ds",
+            episode_specs=[(0, 30, "Pour water into the cup.")],
+            fps=10,
+        )
+        vlm = StubVlmClient(responder=_stub_responder)
+        cfg = AnnotationPipelineConfig()
+        executor = Executor(
+            config=cfg,
+            plan=PlanSubtasksMemoryModule(vlm=vlm, config=cfg.plan),
+            interjections=InterjectionsAndSpeechModule(vlm=vlm, config=cfg.interjections, seed=cfg.seed),
+            vqa=GeneralVqaModule(vlm=vlm, config=cfg.vqa, seed=cfg.seed),
+            writer=LanguageColumnsWriter(),
+            validator=StagingValidator(),
+        )
+        summary = executor.run(root)
+        print(f"phases={[(p.name, p.episodes_processed) for p in summary.phases]}")
+        print(f"validation: {summary.validation_report.summary()}")
+        print(f"shards rewritten: {len(summary.written_paths)}")
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
--- a/tests/annotations/test_frames.py
+++ b/tests/annotations/test_frames.py
@@ -0,0 +1,179 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Unit tests for :class:`VideoFrameProvider` method bindings.
+
+These were prompted by a real regression: ``video_for_episode`` was once
+indented one level too deep so it ended up nested *inside* a module-level
+helper (after that function's ``return`` statement) — silently dead code
+that meant production runs with ``use_video_url=False`` would
+``AttributeError`` on ``self.frame_provider.video_for_episode(...)``. The
+existing module tests didn't catch it because they exercise stub providers.
+
+The tests below assert on the class itself (not on an instance), so a
+future reindent regression flips them to red without needing a real
+LeRobot dataset on disk.
+"""
+
+from __future__ import annotations
+
+import shutil
+import subprocess
+from pathlib import Path
+
+import pytest
+import torch
+
+pytest.importorskip("datasets", reason="datasets is required (install lerobot[dataset])")
+
+from lerobot.annotations.steerable_pipeline.frames import (  # noqa: E402
+    VideoFrameProvider,
+    _decode_frames_av,
+    _decode_frames_ffmpeg,
+)
+
+
+class _FakeMeta:
+    """Minimal metadata stub exposing ``video_keys`` / ``camera_keys``."""
+
+    def __init__(self, video_keys: list[str], image_keys: list[str]) -> None:
+        self.video_keys = video_keys
+        self.camera_keys = [*video_keys, *image_keys]
+
+
+def test_default_camera_key_skips_image_only_cameras(tmp_path: Path, monkeypatch) -> None:
+    """The default camera must be a *video* key — image-stored cameras have no
+    ``videos/<key>/from_timestamp`` and would KeyError in the clip/decode path.
+
+    Regression: a dataset whose first ``camera_keys`` entry was an image-stored
+    camera (e.g. ``observation.images.wrist``) crashed at clip extraction.
+    """
+    fake = _FakeMeta(
+        video_keys=["observation.images.robot0_agentview_right"],
+        image_keys=["observation.images.wrist"],
+    )
+    import lerobot.datasets.dataset_metadata as meta_mod
+
+    monkeypatch.setattr(meta_mod, "LeRobotDatasetMetadata", lambda *a, **k: fake, raising=True)
+    provider = VideoFrameProvider(root=tmp_path)
+    assert provider.camera_key == "observation.images.robot0_agentview_right"
+    assert "observation.images.wrist" not in provider.camera_keys
+
+
+def test_video_for_episode_is_a_method_of_videoframeprovider():
+    """``video_for_episode`` must be defined on the class, not nested dead code."""
+    assert callable(getattr(VideoFrameProvider, "video_for_episode", None))
+
+
+def test_episode_clip_path_is_a_method_of_videoframeprovider():
+    """``episode_clip_path`` is now a method (was a free function reaching
+    into ``provider._meta`` from outside the class)."""
+    assert callable(getattr(VideoFrameProvider, "episode_clip_path", None))
+
+
+def test_videoframeprovider_has_a_lock_for_concurrent_use():
+    """A ``ThreadPoolExecutor`` runs the plan / interjections / vqa phases
+    concurrently; the cache + warn-flag accesses must be guarded.
+    """
+    import threading
+
+    # Inspect the dataclass field instead of building an instance, which
+    # avoids touching the hub. The lock is declared with ``init=False`` and a
+    # ``threading.Lock`` default factory, so every instance owns its own lock.
+    lock_field = next(
+        (f for f in VideoFrameProvider.__dataclass_fields__.values() if f.name == "_lock"),
+        None,
+    )
+    assert lock_field is not None
+    assert lock_field.default_factory is threading.Lock
+
+
+@pytest.fixture
+def sample_video(tmp_path: Path) -> Path:
+    """A 3 s 10 fps test-pattern mp4, written with ffmpeg."""
+    if shutil.which("ffmpeg") is None:
+        pytest.skip("ffmpeg not available")
+    out = tmp_path / "sample.mp4"
+    subprocess.run(
+        [
+            "ffmpeg",
+            "-y",
+            "-f",
+            "lavfi",
+            "-i",
+            "testsrc=duration=3:size=160x120:rate=10",
+            "-pix_fmt",
+            "yuv420p",
+            str(out),
+        ],
+        check=True,
+        capture_output=True,
+    )
+    return out
+
+
+def test_decode_frames_av_returns_one_uint8_frame_per_timestamp(sample_video: Path) -> None:
+    """``_decode_frames_av`` decodes via PyAV directly — no torchcodec/torchvision.
+
+    This is the always-available fallback: torchcodec is unusable in some
+    containers and lerobot's ``pyav`` backend routes through the removed
+    ``torchvision.io.VideoReader``.
+    """
+    timestamps = [0.0, 1.0, 2.5]
+    frames = _decode_frames_av(sample_video, timestamps)
+
+    assert len(frames) == len(timestamps)
+    for frame in frames:
+        assert isinstance(frame, torch.Tensor)
+        assert frame.dtype == torch.uint8
+        assert frame.shape == (3, 120, 160)
+
+
+def test_decode_frames_av_picks_nearest_frame(sample_video: Path) -> None:
+    """Repeated and out-of-order timestamps each resolve to the nearest frame."""
+    frames = _decode_frames_av(sample_video, [2.0, 0.0, 2.0])
+
+    assert len(frames) == 3
+    assert torch.equal(frames[0], frames[2])
+    assert not torch.equal(frames[0], frames[1])
+
+
+def test_decode_frames_av_raises_on_missing_file(tmp_path: Path) -> None:
+    """A missing video surfaces as an exception the caller can fall back on."""
+    with pytest.raises(Exception):  # noqa: B017, PT011
+        _decode_frames_av(tmp_path / "does_not_exist.mp4", [0.0])
+
+
+def test_decode_frames_ffmpeg_returns_one_uint8_frame_per_timestamp(sample_video: Path) -> None:
+    """``_decode_frames_ffmpeg`` shells out to the ffmpeg CLI — the always-
+    available fallback that decodes AV1 and isolates crashes to a child
+    process.
+    """
+    timestamps = [0.0, 1.0, 2.5]
+    frames = _decode_frames_ffmpeg(sample_video, timestamps)
+
+    assert len(frames) == len(timestamps)
+    for frame in frames:
+        assert isinstance(frame, torch.Tensor)
+        assert frame.dtype == torch.uint8
+        assert frame.shape == (3, 120, 160)
+
+
+def test_decode_frames_ffmpeg_raises_on_missing_file(tmp_path: Path) -> None:
+    """A missing video raises (non-zero ffmpeg exit), never crashes the job."""
+    if shutil.which("ffmpeg") is None:
+        pytest.skip("ffmpeg not available")
+    with pytest.raises(Exception):  # noqa: B017, PT011
+        _decode_frames_ffmpeg(tmp_path / "does_not_exist.mp4", [0.0])
--- a/tests/annotations/test_modules.py
+++ b/tests/annotations/test_modules.py
@@ -0,0 +1,355 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Module 1/2/3 unit tests with stubbed VLMs."""
+
+from __future__ import annotations
+
+import json
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Any
+
+from lerobot.annotations.steerable_pipeline.config import (
+    InterjectionsConfig,
+    PlanConfig,
+    VqaConfig,
+)
+from lerobot.annotations.steerable_pipeline.modules import (
+    GeneralVqaModule,
+    InterjectionsAndSpeechModule,
+    PlanSubtasksMemoryModule,
+)
+from lerobot.annotations.steerable_pipeline.reader import iter_episodes
+from lerobot.annotations.steerable_pipeline.staging import EpisodeStaging
+from lerobot.annotations.steerable_pipeline.vlm_client import StubVlmClient
+
+from ._helpers import make_canned_responder
+
+
+@dataclass
+class _StubFrameProvider:
+    """Returns one sentinel object per requested timestamp."""
+
+    sentinel: Any = field(default_factory=lambda: object())
+    cameras: tuple[str, ...] = ("observation.images.top",)
+    calls: list[tuple[int, tuple[float, ...], str | None]] = field(default_factory=list)
+    video_calls: list[tuple[int, int, str | None]] = field(default_factory=list)
+
+    @property
+    def camera_keys(self) -> list[str]:
+        return list(self.cameras)
+
+    def frames_at(self, record, timestamps, camera_key=None):
+        self.calls.append((record.episode_index, tuple(timestamps), camera_key))
+        return [self.sentinel] * len(timestamps)
+
+    def video_for_episode(self, record, max_frames, camera_key=None):
+        self.video_calls.append((record.episode_index, max_frames, camera_key))
+        n = min(max_frames, len(record.frame_timestamps))
+        return [self.sentinel] * n
+
+
+def _spy_responder(captured: list[list[dict[str, Any]]], reply: Any):
+    def responder(messages):
+        captured.append(list(messages))
+        return reply
+
+    return StubVlmClient(responder=responder)
+
+
+def test_module1_plan_memory_subtask_smoke(fixture_dataset_root: Path, tmp_path: Path) -> None:
+    vlm = make_canned_responder(
+        {
+            "atomic subtasks": {
+                "subtasks": [
+                    {"text": "grasp the handle of the sponge", "start": 0.0, "end": 0.4},
+                    {"text": "wipe the counter from left to right", "start": 0.4, "end": 0.8},
+                    {"text": "place the sponge into the sink", "start": 0.8, "end": 1.1},
+                ]
+            },
+            "Update the memory": {"memory": "wiped the counter once"},
+        },
+    )
+    module = PlanSubtasksMemoryModule(vlm=vlm, config=PlanConfig())
+    record = next(iter_episodes(fixture_dataset_root))
+    staging = EpisodeStaging(tmp_path / "stage", record.episode_index)
+    module.run_episode(record, staging)
+    rows = staging.read("plan")
+
+    styles = {r["style"] for r in rows}
+    assert {"subtask", "plan", "memory"}.issubset(styles)
+    # subtask timestamps must be exact frame timestamps
+    frame_set = set(record.frame_timestamps)
+    for row in rows:
+        assert row["timestamp"] in frame_set
+    # one plan row per subtask boundary; the first lands at t0 and each
+    # plan is the deterministic numbered list of still-todo subtasks
+    plan_rows = sorted((r for r in rows if r["style"] == "plan"), key=lambda r: r["timestamp"])
+    subtask_rows = [r for r in rows if r["style"] == "subtask"]
+    assert len(plan_rows) == len(subtask_rows)
+    assert plan_rows[0]["timestamp"] == record.frame_timestamps[0]
+    # the t0 plan enumerates all subtasks; later plans shrink
+    assert plan_rows[0]["content"].startswith("1. ")
+    assert len(plan_rows[0]["content"].splitlines()) == len(subtask_rows)
+    assert len(plan_rows[-1]["content"].splitlines()) == 1
+
+
+def test_module2_at_t0_emits_speech_only_no_interjection(fixture_dataset_root: Path, tmp_path: Path) -> None:
+    vlm = make_canned_responder(
+        {"acknowledgement the robot": {"text": "Sure, on it."}},
+    )
+    module = InterjectionsAndSpeechModule(
+        vlm=vlm,
+        config=InterjectionsConfig(max_interjections_per_episode=0),
+    )
+    record = next(iter_episodes(fixture_dataset_root))
+    staging = EpisodeStaging(tmp_path / "stage", record.episode_index)
+    module.run_episode(record, staging)
+    rows = staging.read("interjections")
+    assert len(rows) == 1
+    only = rows[0]
+    assert only["role"] == "assistant"
+    assert only["style"] is None
+    assert only["content"] is None
+    assert only["timestamp"] == record.frame_timestamps[0]
+    assert only["tool_calls"][0]["function"]["name"] == "say"
+
+
+def test_module2_mid_episode_emits_paired_interjection_and_speech(
+    fixture_dataset_root: Path, tmp_path: Path
+) -> None:
+    """Module 2 anchors interjections on Module 1's subtask boundaries.
+
+    The executor runs Module 1 first, then Module 2 reads the subtask
+    rows back from the same staging tree (see
+    ``_mid_episode_interjections``). Reproduce that contract here by
+    seeding the staging with two subtask rows so a single ``0 → 1``
+    boundary exists for Module 2 to anchor on.
+    """
+    vlm = make_canned_responder(
+        {
+            "acknowledgement the robot": {"text": "OK."},
+            # Marker matches the distinctive line of
+            # ``module_2_interjection.txt``. The old marker
+            # ("ONE realistic interruption") came from a previous prompt
+            # version that asked for counterfactual interjections; the
+            # current design anchors on subtask boundaries instead, so
+            # the prompt and its marker changed.
+            "Write ONE interjection": {
+                "interjection": "now wipe the counter please",
+                "speech": "On it.",
+            },
+        },
+    )
+    module = InterjectionsAndSpeechModule(
+        vlm=vlm,
+        config=InterjectionsConfig(max_interjections_per_episode=1, interjection_min_t=0.2),
+        seed=7,
+    )
+    record = next(iter_episodes(fixture_dataset_root))
+    staging = EpisodeStaging(tmp_path / "stage", record.episode_index)
+    # Seed Module 1's subtask staging so Module 2 has a boundary to
+    # anchor on (it bails with zero rows when no spans exist — the
+    # production executor guarantees Module 1 ran first).
+    boundary_ts = float(record.frame_timestamps[len(record.frame_timestamps) // 2])
+    staging.write(
+        "plan",
+        [
+            {
+                "role": "assistant",
+                "content": "grasp the sponge",
+                "style": "subtask",
+                "timestamp": float(record.frame_timestamps[0]),
+                "tool_calls": None,
+            },
+            {
+                "role": "assistant",
+                "content": "wipe the counter",
+                "style": "subtask",
+                "timestamp": boundary_ts,
+                "tool_calls": None,
+            },
+        ],
+    )
+    module.run_episode(record, staging)
+    rows = staging.read("interjections")
+
+    interjections = [r for r in rows if r["style"] == "interjection"]
+    speeches = [r for r in rows if r["style"] is None and r["role"] == "assistant"]
+    assert len(interjections) == 1
+    assert len(speeches) >= 2  # initial t=0 + one paired with the interjection
+    inter_t = interjections[0]["timestamp"]
+    assert any(abs(s["timestamp"] - inter_t) < 1e-9 for s in speeches)
+
+
+def test_module3_vqa_unique_per_frame_and_camera(single_episode_root: Path, tmp_path: Path) -> None:
+    payload = {
+        "question": "How many cups?",
+        "answer": {"label": "cup", "count": 2, "note": "white & blue"},
+    }
+    vlm = make_canned_responder({"frame-grounded visual question": payload})
+    module = GeneralVqaModule(
+        vlm=vlm,
+        config=VqaConfig(vqa_emission_hz=1.0, K=3),
+        seed=1,
+        frame_provider=_StubFrameProvider(cameras=("observation.images.top", "observation.images.wrist")),
+    )
+    record = next(iter_episodes(single_episode_root))
+    staging = EpisodeStaging(tmp_path / "stage", record.episode_index)
+    module.run_episode(record, staging)
+    rows = staging.read("vqa")
+    # every vqa row must carry a camera tag naming one of the configured cameras
+    for r in rows:
+        assert r["style"] == "vqa"
+        assert r.get("camera") in {"observation.images.top", "observation.images.wrist"}
+    # at most one (vqa, user) and one (vqa, assistant) per (timestamp, camera)
+    user_keys = [(r["timestamp"], r["camera"]) for r in rows if r["role"] == "user" and r["style"] == "vqa"]
+    assistant_keys = [
+        (r["timestamp"], r["camera"]) for r in rows if r["role"] == "assistant" and r["style"] == "vqa"
+    ]
+    assert len(user_keys) == len(set(user_keys))
+    assert len(assistant_keys) == len(set(assistant_keys))
+    # both cameras must be represented
+    assert {c for _, c in user_keys} == {"observation.images.top", "observation.images.wrist"}
+    # every emitted timestamp must be an exact source frame timestamp
+    frame_set = set(record.frame_timestamps)
+    for ts, _ in user_keys + assistant_keys:
+        assert ts in frame_set
+
+
+def test_module1_attaches_video_block_to_subtask_prompt(fixture_dataset_root: Path, tmp_path: Path) -> None:
+    """Module 1 sends one ``type=video`` block covering the whole episode."""
+    captured: list[list[dict[str, Any]]] = []
+    payload = {
+        "subtasks": [
+            {"text": "grasp the handle of the sponge", "start": 0.0, "end": 0.5},
+            {"text": "wipe the counter", "start": 0.5, "end": 1.1},
+        ]
+    }
+    plan_payload = {"plan": "1. grasp\n2. wipe"}
+    memory_payload = {"memory": "wiped once"}
+
+    def responder(messages):
+        captured.append(list(messages))
+        text = ""
+        for m in messages:
+            for block in m.get("content", []):
+                if isinstance(block, dict) and block.get("type") == "text":
+                    text = block.get("text", "")
+        if "concise hierarchical PLAN" in text:
+            return plan_payload
+        if "Update the memory" in text:
+            return memory_payload
+        return payload
+
+    provider = _StubFrameProvider()
+    module = PlanSubtasksMemoryModule(
+        vlm=StubVlmClient(responder=responder),
+        # Disable the rephrasings sub-prompt so the test's only video-bearing
+        # call is the subtask one — keeps the assertions below focused on
+        # ``_generate_subtasks`` rather than fighting the order of unrelated
+        # text-only Module-1 sub-prompts.
+        config=PlanConfig(max_video_frames=5, frames_per_second=10.0, n_task_rephrasings=0),
+        frame_provider=provider,
+    )
+    record = next(iter_episodes(fixture_dataset_root))
+    staging = EpisodeStaging(tmp_path / "stage", record.episode_index)
+    module.run_episode(record, staging)
+
+    # Find the call carrying the subtask prompt rather than blindly taking
+    # captured[0] — Module 1 issues several sub-prompts and their order is
+    # not part of the contract.
+    assert captured, "no VLM calls made"
+
+    def _prompt_text(messages):
+        for m in messages:
+            for block in m.get("content", []):
+                if isinstance(block, dict) and block.get("type") == "text":
+                    return block.get("text", "")
+        return ""
+
+    subtask_calls = [m for m in captured if "atomic subtasks" in _prompt_text(m)]
+    assert len(subtask_calls) == 1, "expected exactly one subtask-prompt VLM call"
+    content = subtask_calls[0][0]["content"]
+    video_blocks = [b for b in content if isinstance(b, dict) and b.get("type") == "video"]
+    image_blocks = [b for b in content if isinstance(b, dict) and b.get("type") == "image"]
+    text_blocks = [b for b in content if isinstance(b, dict) and b.get("type") == "text"]
+    assert len(video_blocks) == 1, f"expected exactly 1 video block, got {content}"
+    assert image_blocks == [], "subtask prompt must not mix image blocks with the video block"
+    assert len(text_blocks) == 1
+    # video block must wrap a list of frames covering the episode
+    assert isinstance(video_blocks[0]["video"], list)
+    assert len(video_blocks[0]["video"]) <= 5
+    # provider is called with target_count = min(duration * fps, max). With
+    # fps=10 on a ~1 s episode, duration * fps exceeds max, so max=5 wins.
+    assert provider.video_calls and provider.video_calls[0][0] == record.episode_index
+    assert provider.video_calls[0][1] <= 5
+
+
+def test_module3_attaches_frame_image_block_to_prompt(single_episode_root: Path, tmp_path: Path) -> None:
+    """Each VQA prompt must carry a single image block at the emission frame."""
+    captured: list[list[dict[str, Any]]] = []
+    payload = {
+        "question": "How many cups?",
+        "answer": {"label": "cup", "count": 1},
+    }
+    provider = _StubFrameProvider()
+    module = GeneralVqaModule(
+        vlm=_spy_responder(captured, payload),
+        config=VqaConfig(vqa_emission_hz=1.0, K=1),
+        seed=0,
+        frame_provider=provider,
+    )
+    record = next(iter_episodes(single_episode_root))
+    staging = EpisodeStaging(tmp_path / "stage", record.episode_index)
+    module.run_episode(record, staging)
+
+    assert captured, "no VLM calls made"
+    for messages in captured:
+        content = messages[0]["content"]
+        image_blocks = [b for b in content if isinstance(b, dict) and b.get("type") == "image"]
+        text_blocks = [b for b in content if isinstance(b, dict) and b.get("type") == "text"]
+        assert len(image_blocks) == 1, f"expected 1 image block per VQA prompt, got {content}"
+        assert image_blocks[0]["image"] is provider.sentinel
+        assert len(text_blocks) == 1
+    # provider was called once per emission per camera with the exact emission timestamp
+    for ep_idx, ts_tuple, camera in provider.calls:
+        assert ep_idx == record.episode_index
+        assert len(ts_tuple) == 1
+        assert ts_tuple[0] in record.frame_timestamps
+        assert camera in provider.cameras
+
+
+def test_module3_assistant_content_is_valid_json(single_episode_root: Path, tmp_path: Path) -> None:
+    payload = {
+        "question": "Where is the cup?",
+        "answer": {"detections": [{"label": "cup", "bbox_format": "xyxy", "bbox": [10, 20, 50, 80]}]},
+    }
+    vlm = make_canned_responder({"frame-grounded visual question": payload})
+    module = GeneralVqaModule(
+        vlm=vlm,
+        config=VqaConfig(vqa_emission_hz=1.0, K=2),
+        seed=2,
+        frame_provider=_StubFrameProvider(),
+    )
+    record = next(iter_episodes(single_episode_root))
+    staging = EpisodeStaging(tmp_path / "stage", record.episode_index)
+    module.run_episode(record, staging)
+    rows = staging.read("vqa")
+    for row in rows:
+        if row["role"] == "assistant" and row["style"] == "vqa":
+            decoded = json.loads(row["content"])
+            assert "detections" in decoded
--- a/tests/annotations/test_pipeline_recipe_render.py
+++ b/tests/annotations/test_pipeline_recipe_render.py
@@ -0,0 +1,175 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""End-to-end smoke: pipeline output → PR 1 canonical recipe rendering."""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+import pyarrow.parquet as pq
+
+from lerobot.annotations.steerable_pipeline.config import (
+    AnnotationPipelineConfig,
+    InterjectionsConfig,
+    PlanConfig,
+    VqaConfig,
+)
+from lerobot.annotations.steerable_pipeline.executor import Executor
+from lerobot.annotations.steerable_pipeline.modules import (
+    GeneralVqaModule,
+    InterjectionsAndSpeechModule,
+    PlanSubtasksMemoryModule,
+)
+from lerobot.annotations.steerable_pipeline.validator import StagingValidator
+from lerobot.annotations.steerable_pipeline.writer import LanguageColumnsWriter
+from lerobot.configs.recipe import MessageTurn, TrainingRecipe
+from lerobot.datasets.language_render import render_sample
+
+from ._helpers import make_canned_responder
+
+
+def _build_pr1_style_blend_recipe() -> TrainingRecipe:
+    """Inline blend recipe that consumes every style this pipeline produces.
+
+    PR 1 used to ship ``src/lerobot/configs/recipes/pi05_hirobot.yaml`` as
+    a canonical example, but that file was dropped during PR 1 review. The
+    cross-PR contract this test guards is "the recipe DSL can render
+    non-empty messages from pipeline output", which doesn't require a
+    specific YAML — so we build the equivalent blend in code.
+    """
+    return TrainingRecipe(
+        blend={
+            "low_level_execution": TrainingRecipe(
+                weight=0.35,
+                messages=[
+                    MessageTurn(
+                        role="user",
+                        content="${task}\nPlan: ${plan}\nMemory: ${memory}",
+                        stream="high_level",
+                    ),
+                    MessageTurn(role="assistant", content="${subtask}", stream="low_level", target=True),
+                ],
+            ),
+            "user_interjection_response": TrainingRecipe(
+                weight=0.16,
+                bindings={
+                    "speech": "emitted_at(t, role=assistant, tool_name=say)",
+                    "interjection": "emitted_at(t, style=interjection)",
+                },
+                messages=[
+                    MessageTurn(role="user", content="${task}", stream="high_level"),
+                    MessageTurn(
+                        role="user",
+                        content="${interjection}",
+                        stream="high_level",
+                        if_present="interjection",
+                    ),
+                    MessageTurn(
+                        role="assistant",
+                        content="${plan}",
+                        stream="high_level",
+                        target=True,
+                        if_present="plan",
+                        tool_calls_from="speech",
+                    ),
+                ],
+            ),
+        }
+    )
+
+
+def _build_executor() -> Executor:
+    vlm = make_canned_responder(
+        {
+            "atomic subtasks": {
+                "subtasks": [
+                    {"text": "grasp the bottle", "start": 0.0, "end": 0.5},
+                    {"text": "pour into the cup", "start": 0.5, "end": 1.0},
+                    {"text": "place the bottle down", "start": 1.0, "end": 1.5},
+                ]
+            },
+            "concise hierarchical PLAN": {"plan": "1. grasp\n2. pour\n3. place"},
+            "Update the memory": {"memory": "poured once"},
+            "acknowledgement the robot": {"text": "Sure."},
+            "Write ONE interjection": {
+                "interjection": "use less water",
+                "speech": "Using less water.",
+            },
+            "frame-grounded visual question": {
+                "question": "How many cups?",
+                "answer": {"label": "cup", "count": 1},
+            },
+        },
+    )
+    config = AnnotationPipelineConfig(
+        plan=PlanConfig(),
+        interjections=InterjectionsConfig(max_interjections_per_episode=1, interjection_min_t=0.5),
+        vqa=VqaConfig(vqa_emission_hz=1.0, K=2),
+    )
+    return Executor(
+        config=config,
+        plan=PlanSubtasksMemoryModule(vlm=vlm, config=config.plan),
+        interjections=InterjectionsAndSpeechModule(vlm=vlm, config=config.interjections, seed=config.seed),
+        vqa=GeneralVqaModule(vlm=vlm, config=config.vqa, seed=config.seed),
+        writer=LanguageColumnsWriter(),
+        validator=StagingValidator(),
+    )
+
+
+def test_pr1_canonical_recipe_renders_nonempty_from_pipeline_output(
+    single_episode_root: Path,
+) -> None:
+    executor = _build_executor()
+    summary = executor.run(single_episode_root)
+    # validator may emit warnings but no errors for the synthetic fixture
+    assert summary.validation_report.ok, summary.validation_report.summary()
+
+    table = pq.read_table(single_episode_root / "data" / "chunk-000" / "file-000.parquet")
+    persistent_lists = table.column("language_persistent").to_pylist()
+    events_lists = table.column("language_events").to_pylist()
+    timestamps = table.column("timestamp").to_pylist()
+
+    recipe = _build_pr1_style_blend_recipe()
+
+    rendered_any = False
+    for ts, persistent, events in zip(timestamps, persistent_lists, events_lists, strict=True):
+        result = render_sample(
+            recipe=recipe,
+            persistent=persistent,
+            events=events,
+            t=float(ts),
+            sample_idx=0,
+            dataset_ctx={"task": "Pour water from the bottle into the cup."},
+        )
+        if result is None:
+            continue
+        if result["messages"]:
+            rendered_any = True
+            assert result["target_message_indices"]
+            break
+    assert rendered_any, "PR 1 recipe rendered no messages from pipeline output"
+
+    # Sanity: speech atom appears in events column intact
+    flat_events = [r for ev in events_lists for r in ev]
+    speech_rows = [r for r in flat_events if r.get("style") is None and r.get("role") == "assistant"]
+    assert speech_rows
+    say = speech_rows[0]["tool_calls"][0]
+    assert say["function"]["name"] == "say"
+    assert isinstance(say["function"]["arguments"]["text"], str)
+    # PR 2 no longer writes a ``tools`` column — the say schema lives as a
+    # constant (``SAY_TOOL_SCHEMA``) so PR 1's row struct is the single
+    # source of truth for the v3.1 schema.
+    assert "tools" not in table.column_names
--- a/tests/annotations/test_validator.py
+++ b/tests/annotations/test_validator.py
@@ -0,0 +1,125 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Validator behavior tests."""
+
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+from lerobot.annotations.steerable_pipeline.reader import iter_episodes
+from lerobot.annotations.steerable_pipeline.staging import EpisodeStaging
+from lerobot.annotations.steerable_pipeline.validator import StagingValidator
+from lerobot.annotations.steerable_pipeline.writer import speech_atom
+
+
+def _validate(root: Path, staging_dir: Path):
+    records = list(iter_episodes(root))
+    return StagingValidator().validate(records, staging_dir)
+
+
+def test_validator_catches_misaligned_timestamps(fixture_dataset_root: Path, tmp_path: Path) -> None:
+    staging_dir = tmp_path / "stage"
+    EpisodeStaging(staging_dir, 0).write(
+        "vqa",
+        [
+            {
+                "role": "assistant",
+                "content": json.dumps({"label": "cup", "count": 2}, sort_keys=True),
+                "style": "vqa",
+                "timestamp": 9.999,  # not on any 10 fps frame
+                "tool_calls": None,
+            }
+        ],
+    )
+    report = _validate(fixture_dataset_root, staging_dir)
+    assert not report.ok
+    assert any("does not match any source frame timestamp" in e for e in report.errors)
+
+
+def test_validator_catches_orphan_speech(fixture_dataset_root: Path, tmp_path: Path) -> None:
+    staging_dir = tmp_path / "stage"
+    EpisodeStaging(staging_dir, 0).write(
+        "interjections",
+        [
+            speech_atom(0.0, "Got it."),
+            # interjection at 0.3s with NO paired speech
+            {
+                "role": "user",
+                "content": "skip it",
+                "style": "interjection",
+                "timestamp": 0.3,
+                "tool_calls": None,
+            },
+        ],
+    )
+    report = _validate(fixture_dataset_root, staging_dir)
+    assert not report.ok
+    assert any("paired speech" in e for e in report.errors)
+
+
+def test_validator_catches_inconsistent_plan_memory(fixture_dataset_root: Path, tmp_path: Path) -> None:
+    staging_dir = tmp_path / "stage"
+    EpisodeStaging(staging_dir, 0).write(
+        "plan",
+        [
+            {
+                "role": "assistant",
+                "content": "1. do x",
+                "style": "plan",
+                "timestamp": 0.0,
+                "tool_calls": None,
+            },
+            {
+                "role": "assistant",
+                "content": "do x",
+                "style": "subtask",
+                "timestamp": 0.0,
+                "tool_calls": None,
+            },
+        ],
+    )
+    EpisodeStaging(staging_dir, 0).write(
+        "interjections",
+        [
+            speech_atom(0.0, "Got it."),
+            speech_atom(0.4, "Replanning."),
+            {
+                "role": "user",
+                "content": "replan",
+                "style": "interjection",
+                "timestamp": 0.4,
+                "tool_calls": None,
+            },
+        ],
+    )
+    report = _validate(fixture_dataset_root, staging_dir)
+    # missing co-timestamped plan refresh at 0.4s → error
+    assert not report.ok
+    assert any("co-timestamped plan update" in e for e in report.errors)
+
+
+def test_validator_catches_wrong_column(fixture_dataset_root: Path, tmp_path: Path) -> None:
+    staging_dir = tmp_path / "stage"
+    EpisodeStaging(staging_dir, 0).write(
+        "plan",
+        [
+            {"role": "user", "content": "where?", "style": "vqa", "timestamp": 0.0, "tool_calls": None},
+        ],
+    )
+    report = _validate(fixture_dataset_root, staging_dir)
+    assert not report.ok
+    assert any("plan emitted style 'vqa'" in e or "must be persistent" in e for e in report.errors)
--- a/tests/annotations/test_writer.py
+++ b/tests/annotations/test_writer.py
@@ -0,0 +1,350 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Writer correctness tests."""
+
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+import pyarrow.parquet as pq
+import pytest
+
+from lerobot.annotations.steerable_pipeline.reader import iter_episodes
+from lerobot.annotations.steerable_pipeline.staging import EpisodeStaging
+from lerobot.annotations.steerable_pipeline.writer import (
+    LanguageColumnsWriter,
+    speech_atom,
+)
+
+
+def _stage_episode(
+    staging_dir: Path,
+    episode_index: int,
+    *,
+    plan: list[dict] | None = None,
+    interjections: list[dict] | None = None,
+    vqa: list[dict] | None = None,
+) -> None:
+    staging = EpisodeStaging(staging_dir, episode_index)
+    if plan is not None:
+        staging.write("plan", plan)
+    if interjections is not None:
+        staging.write("interjections", interjections)
+    if vqa is not None:
+        staging.write("vqa", vqa)
+
+
+def test_writer_persistence_identity(fixture_dataset_root: Path, tmp_path: Path) -> None:
+    """Every frame in an episode has a byte-identical persistent list."""
+    staging_dir = tmp_path / "stage"
+    _stage_episode(
+        staging_dir,
+        0,
+        plan=[
+            {
+                "role": "assistant",
+                "content": "grasp the sponge",
+                "style": "subtask",
+                "timestamp": 0.0,
+                "tool_calls": None,
+            },
+            {
+                "role": "assistant",
+                "content": "1. wipe\n2. dry",
+                "style": "plan",
+                "timestamp": 0.0,
+                "tool_calls": None,
+            },
+            {
+                "role": "assistant",
+                "content": "wiped the counter",
+                "style": "memory",
+                "timestamp": 0.5,
+                "tool_calls": None,
+            },
+        ],
+    )
+    records = list(iter_episodes(fixture_dataset_root))
+    LanguageColumnsWriter().write_all(records, staging_dir, fixture_dataset_root)
+
+    table = pq.read_table(fixture_dataset_root / "data" / "chunk-000" / "file-000.parquet")
+    persistent = table.column("language_persistent").to_pylist()
+    first = persistent[0]
+    assert first  # non-empty
+    for row in persistent:
+        assert row == first, "persistent slice must be byte-identical across all frames"
+
+
+def test_writer_events_exact_timestamp(fixture_dataset_root: Path, tmp_path: Path) -> None:
+    staging_dir = tmp_path / "stage"
+    _stage_episode(
+        staging_dir,
+        0,
+        interjections=[
+            speech_atom(0.0, "Got it."),
+            {
+                "role": "user",
+                "content": "skip the dishes",
+                "style": "interjection",
+                "timestamp": 0.5,
+                "tool_calls": None,
+            },
+            speech_atom(0.5, "Skipping the dishes."),
+        ],
+    )
+    records = list(iter_episodes(fixture_dataset_root))
+    LanguageColumnsWriter().write_all(records, staging_dir, fixture_dataset_root)
+
+    table = pq.read_table(fixture_dataset_root / "data" / "chunk-000" / "file-000.parquet")
+    timestamps = table.column("timestamp").to_pylist()
+    events = table.column("language_events").to_pylist()
+    for ts, ev in zip(timestamps, events, strict=True):
+        if abs(ts - 0.0) < 1e-9:
+            assert any(r["role"] == "assistant" and r.get("style") is None for r in ev), ev
+        elif abs(ts - 0.5) < 1e-9:
+            assert any(r.get("style") == "interjection" for r in ev), ev
+            assert any(r.get("style") is None for r in ev), ev
+        else:
+            assert ev == []
+
+
+def test_writer_column_routing(fixture_dataset_root: Path, tmp_path: Path) -> None:
+    staging_dir = tmp_path / "stage"
+    _stage_episode(
+        staging_dir,
+        0,
+        plan=[
+            {
+                "role": "assistant",
+                "content": "do X",
+                "style": "subtask",
+                "timestamp": 0.0,
+                "tool_calls": None,
+            },
+            {
+                "role": "assistant",
+                "content": "1. do X",
+                "style": "plan",
+                "timestamp": 0.0,
+                "tool_calls": None,
+            },
+            {
+                "role": "assistant",
+                "content": "did X",
+                "style": "memory",
+                "timestamp": 0.3,
+                "tool_calls": None,
+            },
+        ],
+        interjections=[
+            speech_atom(0.0, "OK"),
+            {
+                "role": "user",
+                "content": "wait",
+                "style": "interjection",
+                "timestamp": 0.2,
+                "tool_calls": None,
+            },
+            speech_atom(0.2, "Waiting"),
+        ],
+        vqa=[
+            {
+                "role": "user",
+                "content": "where is the cup?",
+                "style": "vqa",
+                "timestamp": 0.4,
+                "camera": "observation.images.front",
+                "tool_calls": None,
+            },
+            {
+                "role": "assistant",
+                "content": json.dumps(
+                    {"detections": [{"label": "cup", "bbox_format": "xyxy", "bbox": [1, 2, 3, 4]}]},
+                    sort_keys=True,
+                ),
+                "style": "vqa",
+                "timestamp": 0.4,
+                "camera": "observation.images.front",
+                "tool_calls": None,
+            },
+        ],
+    )
+    records = list(iter_episodes(fixture_dataset_root))
+    LanguageColumnsWriter().write_all(records, staging_dir, fixture_dataset_root)
+    table = pq.read_table(fixture_dataset_root / "data" / "chunk-000" / "file-000.parquet")
+
+    persistent = table.column("language_persistent").to_pylist()[0]
+    persistent_styles = {r["style"] for r in persistent}
+    assert persistent_styles == {"subtask", "plan", "memory"}
+
+    all_events = [r for ev in table.column("language_events").to_pylist() for r in ev]
+    event_styles = {r.get("style") for r in all_events}
+    assert event_styles == {None, "interjection", "vqa"}
+
+
+def test_writer_drops_subtask_index_idempotent(fixture_dataset_root: Path, tmp_path: Path) -> None:
+    staging_dir = tmp_path / "stage"
+    _stage_episode(
+        staging_dir,
+        0,
+        plan=[
+            {
+                "role": "assistant",
+                "content": "do X",
+                "style": "subtask",
+                "timestamp": 0.0,
+                "tool_calls": None,
+            },
+        ],
+    )
+    records = list(iter_episodes(fixture_dataset_root))
+    writer = LanguageColumnsWriter()
+    writer.write_all(records, staging_dir, fixture_dataset_root)
+
+    path = fixture_dataset_root / "data" / "chunk-000" / "file-000.parquet"
+    table_a = pq.read_table(path)
+    assert "subtask_index" not in table_a.column_names
+    assert "language_persistent" in table_a.column_names
+    assert "language_events" in table_a.column_names
+    # The writer no longer emits a dataset-level ``tools`` column; the
+    # ``say`` tool schema lives as a code constant (``SAY_TOOL_SCHEMA``)
+    # so the parquet stays small and PR 2 doesn't extend PR 1's schema.
+    assert "tools" not in table_a.column_names
+
+    # Second pass: the language columns must be identical to the first pass.
+    records_again = list(iter_episodes(fixture_dataset_root))
+    writer.write_all(records_again, staging_dir, fixture_dataset_root)
+    table_b = pq.read_table(path)
+    assert (
+        table_a.column("language_persistent").to_pylist() == table_b.column("language_persistent").to_pylist()
+    )
+    assert table_a.column("language_events").to_pylist() == table_b.column("language_events").to_pylist()
+
+
+def test_writer_normalize_rejects_misrouted_persistent_style() -> None:
+    """``_normalize_persistent_row`` must reject any non-persistent style."""
+    from lerobot.annotations.steerable_pipeline.writer import _normalize_persistent_row
+
+    with pytest.raises(ValueError, match="non-persistent style"):
+        _normalize_persistent_row(
+            {"role": "assistant", "content": "oops", "style": "vqa", "timestamp": 0.0, "tool_calls": None}
+        )
+
+
+def test_writer_normalize_rejects_misrouted_event_style() -> None:
+    """``_normalize_event_row`` must reject any persistent style."""
+    from lerobot.annotations.steerable_pipeline.writer import _normalize_event_row
+
+    with pytest.raises(ValueError):
+        _normalize_event_row({"role": "assistant", "content": "oops", "style": "subtask", "tool_calls": None})
+
+
+def test_say_tool_schema_constant_is_well_formed() -> None:
+    """``SAY_TOOL_SCHEMA`` (and ``DEFAULT_TOOLS``) replace the parquet
+    ``tools`` column — chat-template consumers import them directly.
+    """
+    from lerobot.annotations.steerable_pipeline.writer import (
+        DEFAULT_TOOLS,
+        SAY_TOOL_SCHEMA,
+    )
+
+    assert DEFAULT_TOOLS == [SAY_TOOL_SCHEMA]
+    assert SAY_TOOL_SCHEMA["function"]["name"] == "say"
+    params = SAY_TOOL_SCHEMA["function"]["parameters"]
+    assert params["properties"]["text"]["type"] == "string"
+    assert params["required"] == ["text"]
+
+
+def test_writer_does_not_add_tools_column(fixture_dataset_root: Path, tmp_path: Path) -> None:
+    """A fresh write must not add a ``tools`` column, so annotated shards
+    stay on the v3.1 schema (``SAY_TOOL_SCHEMA`` lives in code instead).
+    """
+    staging_dir = tmp_path / "stage"
+    _stage_episode(
+        staging_dir,
+        0,
+        plan=[
+            {"role": "assistant", "content": "x", "style": "subtask", "timestamp": 0.0, "tool_calls": None}
+        ],
+    )
+    records = list(iter_episodes(fixture_dataset_root))
+    LanguageColumnsWriter().write_all(records, staging_dir, fixture_dataset_root)
+    table = pq.read_table(fixture_dataset_root / "data" / "chunk-000" / "file-000.parquet")
+    assert "tools" not in table.column_names
+
+
+def test_annotation_metadata_sync_allows_non_streaming_load(
+    fixture_dataset_root: Path, tmp_path: Path
+) -> None:
+    """Annotated parquet columns must be declared in ``meta/info.json``.
+
+    ``LeRobotDataset`` loads non-streaming datasets by casting parquet
+    against metadata-derived HF features. If the annotation writer adds
+    language columns but metadata stays stale, that cast fails with a column
+    mismatch.
+    """
+    from lerobot.annotations.steerable_pipeline.executor import Executor
+    from lerobot.datasets.feature_utils import get_hf_features_from_features
+    from lerobot.datasets.io_utils import load_info, load_nested_dataset
+    from lerobot.datasets.language import LANGUAGE_EVENTS, LANGUAGE_PERSISTENT, language_feature_info
+
+    info_path = fixture_dataset_root / "meta" / "info.json"
+    info = json.loads(info_path.read_text())
+    info["features"] = {
+        "episode_index": {"dtype": "int64", "shape": (1,), "names": None},
+        "frame_index": {"dtype": "int64", "shape": (1,), "names": None},
+        "timestamp": {"dtype": "float32", "shape": (1,), "names": None},
+        "task_index": {"dtype": "int64", "shape": (1,), "names": None},
+    }
+    info_path.write_text(json.dumps(info, indent=2))
+
+    staging_dir = tmp_path / "stage"
+    _stage_episode(
+        staging_dir,
+        0,
+        plan=[
+            {"role": "assistant", "content": "do X", "style": "subtask", "timestamp": 0.0, "tool_calls": None}
+        ],
+    )
+    records = list(iter_episodes(fixture_dataset_root))
+    LanguageColumnsWriter().write_all(records, staging_dir, fixture_dataset_root)
+
+    Executor._ensure_annotation_metadata_in_info(fixture_dataset_root)
+
+    synced = load_info(fixture_dataset_root)
+    for key, feature in language_feature_info().items():
+        assert synced["features"][key] == feature
+
+    hf_features = get_hf_features_from_features(synced["features"])
+    dataset = load_nested_dataset(fixture_dataset_root / "data", features=hf_features)
+
+    assert LANGUAGE_PERSISTENT in dataset.column_names
+    assert LANGUAGE_EVENTS in dataset.column_names
+    assert len(dataset) == 24
+
+
+def test_speech_atom_shape_matches_plan_spec() -> None:
+    atom = speech_atom(2.5, "I'm cleaning up!")
+    assert atom["role"] == "assistant"
+    assert atom["style"] is None
+    assert atom["content"] is None
+    assert atom["timestamp"] == 2.5
+    assert isinstance(atom["tool_calls"], list)
+    call = atom["tool_calls"][0]
+    assert call["type"] == "function"
+    assert call["function"]["name"] == "say"
+    assert call["function"]["arguments"]["text"] == "I'm cleaning up!"
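Review note: the writer tests above pin down a routing contract but never show it in one place. Below is a minimal sketch of that contract, assuming the style sets implied by `test_writer_column_routing` and the atom shape asserted in `test_speech_atom_shape_matches_plan_spec`. `sketch_speech_atom` and `sketch_route` are hypothetical names for illustration; they are not the functions in `lerobot.annotations.steerable_pipeline.writer`, whose implementation may differ.

```python
from __future__ import annotations

# Assumption: style sets inferred from the test assertions, not copied from
# the real writer module.
PERSISTENT_STYLES = {"subtask", "plan", "memory"}
EVENT_STYLES = {None, "interjection", "vqa"}


def sketch_speech_atom(timestamp: float, text: str) -> dict:
    """Hypothetical stand-in mirroring the shape the speech-atom test asserts."""
    return {
        "role": "assistant",
        "content": None,
        "style": None,
        "timestamp": timestamp,
        "tool_calls": [
            {"type": "function", "function": {"name": "say", "arguments": {"text": text}}}
        ],
    }


def sketch_route(
    rows: list[dict], frame_timestamps: list[float], tol: float = 1e-9
) -> tuple[list[list[dict]], list[list[dict]]]:
    """Split staged rows into per-frame persistent and event lists."""
    for row in rows:
        if row.get("style") not in PERSISTENT_STYLES | EVENT_STYLES:
            raise ValueError(f"unknown style: {row.get('style')!r}")
    # Persistent rows are broadcast unchanged to every frame of the episode.
    persistent = sorted(
        (r for r in rows if r.get("style") in PERSISTENT_STYLES),
        key=lambda r: r["timestamp"],
    )
    # Event rows land only on the frame whose timestamp matches.
    events = [
        [r for r in rows if r.get("style") in EVENT_STYLES and abs(r["timestamp"] - ts) < tol]
        for ts in frame_timestamps
    ]
    return [list(persistent) for _ in frame_timestamps], events
```

Calling `sketch_route(rows, timestamps)` returns two lists with one entry per frame. Every persistent entry is equal, and an event entry is empty unless a staged row shares that frame's timestamp.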
--- a/tests/fixtures/dataset_factories.py
+++ b/tests/fixtures/dataset_factories.py
@@ -552,3 +552,64 @@ def lerobot_dataset_factory(
 @pytest.fixture(scope="session")
 def empty_lerobot_dataset_factory() -> LeRobotDatasetFactory:
     return partial(LeRobotDataset.create, repo_id=DUMMY_REPO_ID, fps=DEFAULT_FPS)
+
+
+def build_annotation_dataset(
+    root: Path,
+    episode_specs: list[tuple[int, int, str]],
+    *,
+    fps: int = 10,
+) -> Path:
+    """Build a minimal LeRobot-shaped dataset on disk for annotation tests.
+
+    ``episode_specs`` is a list of ``(episode_index, num_frames, task_text)``.
+    Each episode is written to its own
+    ``data/chunk-000/file-{ep:03d}.parquet`` so the writer's per-shard
+    rewrite path is exercised. The dataset carries the minimum
+    ``meta/tasks.parquet`` + ``meta/info.json`` the reader / executor need;
+    it has no videos, so the modules fall back to text-only prompts.
+
+    Shared by the annotation-pipeline pytest fixtures
+    (``tests/annotations/conftest.py``) and the opt-in E2E smoke run, so the
+    fixture shape lives in exactly one place.
+    """
+    from lerobot.datasets.io_utils import write_tasks
+    from lerobot.utils.io_utils import write_json
+
+    data_dir = root / "data" / "chunk-000"
+    data_dir.mkdir(parents=True, exist_ok=True)
+
+    tasks: dict[int, str] = {}
+    for episode_index, num_frames, task_text in episode_specs:
+        if task_text not in tasks.values():
+            tasks[len(tasks)] = task_text
+        task_index = next(k for k, v in tasks.items() if v == task_text)
+        frame = pd.DataFrame(
+            {
+                "episode_index": [episode_index] * num_frames,
+                "frame_index": list(range(num_frames)),
+                "timestamp": [round(i / fps, 6) for i in range(num_frames)],
+                "task_index": [task_index] * num_frames,
+                "subtask_index": [0] * num_frames,  # legacy column the writer must drop
+            }
+        )
+        frame.to_parquet(data_dir / f"file-{episode_index:03d}.parquet", index=False)
+
+    # Canonical tasks frame: indexed by task string with a ``task_index``
+    # column, matching what ``lerobot.datasets.io_utils.load_tasks`` expects.
+    tasks_df = pd.DataFrame(
+        {"task_index": list(tasks.keys())},
+        index=pd.Index(list(tasks.values()), name="task"),
+    )
+    write_tasks(tasks_df, root)
+
+    write_json(
+        {
+            "codebase_version": "v3.1",
+            "fps": fps,
+            "features": {},
+            "total_episodes": len(episode_specs),
+        },
+        root / "meta" / "info.json",
+    )
+    return root
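Review note: `build_annotation_dataset` deduplicates task strings and derives per-frame timestamps from `fps`. The sketch below isolates those two steps using only the standard library, assuming the same `(episode_index, num_frames, task_text)` spec tuples. `assign_task_indices` and `frame_timestamps` are hypothetical helper names, and the text-to-index dictionary is a swapped-in alternative to the factory's `next(...)` scan, not the code in this diff.

```python
from __future__ import annotations


def assign_task_indices(
    episode_specs: list[tuple[int, int, str]],
) -> tuple[dict[str, int], dict[int, int]]:
    """Give each distinct task string a stable index in first-seen order."""
    index_by_task: dict[str, int] = {}
    task_index_by_episode: dict[int, int] = {}
    for episode_index, _num_frames, task_text in episode_specs:
        # setdefault keeps the first index assigned to a repeated task string.
        task_index = index_by_task.setdefault(task_text, len(index_by_task))
        task_index_by_episode[episode_index] = task_index
    return index_by_task, task_index_by_episode


def frame_timestamps(num_frames: int, fps: int = 10) -> list[float]:
    """Timestamps in seconds, rounded to 6 decimals as in the factory."""
    return [round(i / fps, 6) for i in range(num_frames)]
```

Keying the dictionary by task text makes each lookup constant time. For the handful of episodes in a test fixture, the linear scan in the factory is just as reasonable.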
--- a/tests/policies/molmoact2/test_molmoact2.py
+++ b/tests/policies/molmoact2/test_molmoact2.py
--- a/tests/rewards/test_modeling_robometer.py
+++ b/tests/rewards/test_modeling_robometer.py
@@ -0,0 +1,340 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Tests for Robometer reward model."""
+
+from __future__ import annotations
+
+from types import SimpleNamespace
+
+import pytest
+import torch
+
+from lerobot.configs.rewards import RewardModelConfig
+from lerobot.rewards.factory import get_reward_model_class, make_reward_model_config
+from lerobot.rewards.robometer import RobometerConfig
+from lerobot.rewards.robometer.configuration_robometer import ROBOMETER_SPECIAL_TOKENS
+from lerobot.rewards.robometer.modeling_robometer import (
+    ROBOMETER_FEATURE_PREFIX,
+    convert_bins_to_continuous,
+    decode_progress_outputs,
+)
+from tests.utils import skip_if_package_missing
+
+# Length of the fake tokenizer used in `_patch_build`. The deterministic
+# resize target derived in ``RobometerConfig.__post_init__`` is therefore
+# ``_FAKE_TOKENIZER_LEN + len(ROBOMETER_SPECIAL_TOKENS)``.
+_FAKE_TOKENIZER_LEN = 100
+_EXPECTED_RESIZED_VOCAB = _FAKE_TOKENIZER_LEN + len(ROBOMETER_SPECIAL_TOKENS)
+
+
+class _FakeQwenConfig:
+    """Stand-in for a Qwen3-VL config (the `model.config` attribute).
+
+    ``to_dict`` matches HF's ``PretrainedConfig.to_dict`` closely enough for
+    ``RobometerConfig.__post_init__`` to snapshot a meaningful ``vlm_config``
+    into the saved ``config.json`` and for the reload path to round-trip
+    through ``AutoConfig.for_model``.
+    """
+
+    def __init__(self, hidden_dim: int = 8, vocab_size: int = _FAKE_TOKENIZER_LEN) -> None:
+        # `vocab_size` here is the *pre-resize* value the fake backbone advertises.
+        # `__post_init__` is expected to overwrite it with `len(tokenizer) + len(ROBOMETER_SPECIAL_TOKENS)`.
+        self.text_config = SimpleNamespace(hidden_size=hidden_dim, vocab_size=vocab_size)
+        self._hidden_dim = hidden_dim
+        self._vocab_size = vocab_size
+
+    def to_dict(self) -> dict:
+        return {
+            "model_type": "fake_qwen",
+            "text_config": {
+                "hidden_size": self._hidden_dim,
+                "vocab_size": self._vocab_size,
+            },
+        }
+
+
+class _FakeEmbeddings(torch.nn.Module):
+    def __init__(self, num_embeddings: int = _FAKE_TOKENIZER_LEN) -> None:
+        super().__init__()
+        self.num_embeddings = num_embeddings
+
+
+class _FakeBaseModel(torch.nn.Module):
+    """Stand-in for the Qwen3-VL backbone during tests.
+
+    Provides the minimum surface `RobometerRewardModel.__init__` and
+    `_compute_rbm_logits` rely on: a `parameters()` iterator (for dtype +
+    device), a `config.text_config.hidden_size`, a `config.to_dict()` so
+    `_save_pretrained` can snapshot `vlm_config`,
+    `get_input_embeddings()` / `resize_token_embeddings()` so the fresh-init
+    embed resize only updates a counter, and a forward that returns a
+    `SimpleNamespace` with a `hidden_states` tuple and `last_hidden_state`.
+    """
+
+    def __init__(self, hidden_dim: int = 8) -> None:
+        super().__init__()
+        self._param = torch.nn.Parameter(torch.zeros(1))
+        self.hidden_dim = hidden_dim
+        self.config = _FakeQwenConfig(hidden_dim)
+        self._embeddings = _FakeEmbeddings()
+
+    def get_input_embeddings(self) -> _FakeEmbeddings:
+        return self._embeddings
+
+    def resize_token_embeddings(self, new_size: int) -> None:
+        self._embeddings.num_embeddings = new_size
+
+    def forward(self, **kwargs):  # noqa: ARG002 - intentional kwargs sink
+        input_ids = kwargs["input_ids"]
+        return SimpleNamespace(
+            hidden_states=(torch.zeros(input_ids.shape[0], input_ids.shape[1], self.hidden_dim),),
+            last_hidden_state=torch.zeros(input_ids.shape[0], input_ids.shape[1], self.hidden_dim),
+        )
+
+
+class _FakeTokenizer:
+    """Minimal stand-in for an HF tokenizer.
+
+    ``RobometerConfig.__post_init__`` uses ``len(tokenizer)`` to compute the
+    deterministic resize target ``len(tokenizer) + len(ROBOMETER_SPECIAL_TOKENS)``,
+    so a working ``__len__`` is all we need.
+    """
+
+    def __init__(self, length: int = _FAKE_TOKENIZER_LEN) -> None:
+        self._length = length
+
+    def __len__(self) -> int:
+        return self._length
+
+
+def _patch_build(monkeypatch) -> None:
+    """Stub out the HF AutoX calls so Robometer construction stays cheap in tests.
+
+    Covers (EO-1 style — no model-side override hooks):
+    * ``AutoConfig.from_pretrained`` (config side) — used by
+      ``RobometerConfig.__post_init__`` to snapshot the backbone config.
+    * ``AutoTokenizer.from_pretrained`` (config side) — used by
+      ``__post_init__`` to compute ``len(tokenizer) + len(ROBOMETER_SPECIAL_TOKENS)``.
+    * ``AutoConfig.for_model``                       — used by
+      ``RobometerConfig.vlm_backbone_config`` when rebuilding for ``from_config``.
+    * ``AutoModelForImageTextToText.from_pretrained`` — fresh-training path
+      (``pretrained_path is None``).
+    * ``AutoModelForImageTextToText.from_config``    — checkpoint-reload path
+      (``pretrained_path`` is set).
+    """
+    from lerobot.rewards.robometer import configuration_robometer, modeling_robometer
+
+    monkeypatch.setattr(
+        modeling_robometer.AutoModelForImageTextToText,
+        "from_pretrained",
+        lambda *args, **kwargs: _FakeBaseModel(hidden_dim=8),
+    )
+    monkeypatch.setattr(
+        modeling_robometer.AutoModelForImageTextToText,
+        "from_config",
+        lambda *args, **kwargs: _FakeBaseModel(hidden_dim=8),
+    )
+    monkeypatch.setattr(
+        configuration_robometer.AutoConfig,
+        "for_model",
+        lambda *args, **kwargs: _FakeQwenConfig(hidden_dim=8),
+    )
+    monkeypatch.setattr(
+        configuration_robometer.AutoConfig,
+        "from_pretrained",
+        lambda *args, **kwargs: _FakeQwenConfig(hidden_dim=8),
+    )
+    monkeypatch.setattr(
+        configuration_robometer.AutoTokenizer,
+        "from_pretrained",
+        lambda *args, **kwargs: _FakeTokenizer(length=_FAKE_TOKENIZER_LEN),
+    )
+
+
+def _make_batch(features: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
+    """Build a `compute_reward`-ready batch using Robometer's namespaced keys."""
+    return {f"{ROBOMETER_FEATURE_PREFIX}{key}": value for key, value in features.items()}
+
+
+@skip_if_package_missing("transformers")
+def test_robometer_config_registered(monkeypatch):
+    _patch_build(monkeypatch)
+    assert "robometer" in RewardModelConfig.get_known_choices()
+    assert RewardModelConfig.get_choice_class("robometer") is RobometerConfig
+    assert isinstance(make_reward_model_config("robometer", device="cpu"), RobometerConfig)
+
+
+def test_robometer_factory_returns_in_tree_class():
+    from lerobot.rewards.robometer.modeling_robometer import RobometerRewardModel
+
+    assert get_reward_model_class("robometer") is RobometerRewardModel
+
+
+def test_convert_bins_to_continuous_returns_expected_values():
+    # Two frames: first peaks at bin 0 (center 0.0), second peaks at bin 9 (center 1.0).
+    bin_logits = torch.full((2, 10), -10.0)
+    bin_logits[0, 0] = 10.0
+    bin_logits[1, -1] = 10.0
+    values = convert_bins_to_continuous(bin_logits)
+    assert values.shape == (2,)
+    assert torch.allclose(values, torch.tensor([0.0, 1.0]), atol=1e-3)
+
+
+def test_decode_progress_outputs_returns_last_frame_values():
+    progress = torch.tensor([[0.1, 0.9], [0.4, 0.6]])
+    success_logits = torch.tensor([[0.0, 5.0], [0.0, -5.0]])
+
+    outputs = decode_progress_outputs(progress, success_logits, is_discrete_mode=False)
+
+    assert outputs["progress_pred"] == [pytest.approx([0.1, 0.9]), pytest.approx([0.4, 0.6])]
+    assert outputs["success_probs"][0][-1] == pytest.approx(torch.sigmoid(torch.tensor(5.0)).item(), abs=1e-3)
+    assert outputs["success_probs"][1][-1] == pytest.approx(
+        torch.sigmoid(torch.tensor(-5.0)).item(), abs=1e-3
+    )
+
+
+def test_decode_progress_outputs_discrete_mode_softmaxes_over_bins():
+    # 2 frames, peaks at bin 0 and bin 9 → continuous predictions 0.0 and 1.0
+    bin_logits = torch.full((1, 2, 10), -10.0)
+    bin_logits[0, 0, 0] = 10.0
+    bin_logits[0, 1, -1] = 10.0
+
+    outputs = decode_progress_outputs(bin_logits, success_logits=None, is_discrete_mode=True)
+
+    assert outputs["success_probs"] == []
+    assert outputs["progress_pred"][0] == pytest.approx([0.0, 1.0], abs=1e-3)
+
+
+@skip_if_package_missing("transformers")
+def test_robometer_post_init_overwrites_vocab_size_with_tokenizer_length(monkeypatch):
+    """``RobometerConfig.__post_init__`` must overwrite the backbone's stale
+    ``text_config.vocab_size`` (which on the real Qwen3-VL config is the
+    padded embedding size, ``151,936``) with ``len(tokenizer) + 5``. This is
+    the contract that makes the published ``Robometer-4B`` checkpoint load
+    byte-equivalently."""
+    _patch_build(monkeypatch)
+
+    cfg = RobometerConfig(device="cpu", progress_loss_type="l2")
+
+    assert cfg.vlm_config["text_config"]["vocab_size"] == _EXPECTED_RESIZED_VOCAB
+
+
+@skip_if_package_missing("transformers")
+def test_robometer_compute_reward_reads_pre_encoded_inputs(monkeypatch):
+    from lerobot.rewards.robometer.modeling_robometer import RobometerRewardModel
+
+    progress = torch.tensor([[0.1, 0.9], [0.4, 0.6]])
+    success_logits = torch.tensor([[0.0, 5.0], [0.0, -5.0]])
+    _patch_build(monkeypatch)
+
+    cfg = RobometerConfig(device="cpu", reward_output="progress", progress_loss_type="l2")
+    model = RobometerRewardModel(cfg)
+    # Bypass the Qwen3-VL forward + head extraction with deterministic logits.
+    monkeypatch.setattr(model, "_compute_rbm_logits", lambda _inputs: (progress, success_logits))
+
+    batch = _make_batch({"input_ids": torch.zeros(2, 2, dtype=torch.long)})
+    rewards = model.compute_reward(batch)
+
+    assert torch.allclose(rewards, torch.tensor([0.9, 0.6]))
+
+
+@skip_if_package_missing("transformers")
+def test_robometer_compute_reward_can_return_binary_success(monkeypatch):
+    from lerobot.rewards.robometer.modeling_robometer import RobometerRewardModel
+
+    progress = torch.tensor([[0.1, 0.9], [0.4, 0.6]])
+    success_logits = torch.tensor([[0.0, 5.0], [0.0, -5.0]])  # sigmoid(5) > 0.5; sigmoid(-5) < 0.5
+    _patch_build(monkeypatch)
+
+    cfg = RobometerConfig(
+        device="cpu",
+        reward_output="success",
+        success_threshold=0.5,
+        progress_loss_type="l2",
+    )
+    model = RobometerRewardModel(cfg)
+    monkeypatch.setattr(model, "_compute_rbm_logits", lambda _inputs: (progress, success_logits))
+
+    batch = _make_batch({"input_ids": torch.zeros(2, 2, dtype=torch.long)})
+    rewards = model.compute_reward(batch)
+
+    assert torch.equal(rewards, torch.tensor([1.0, 0.0]))
+
+
+@skip_if_package_missing("transformers")
+def test_robometer_compute_reward_errors_when_inputs_missing(monkeypatch):
+    from lerobot.rewards.robometer.modeling_robometer import RobometerRewardModel
+
+    _patch_build(monkeypatch)
+
+    cfg = RobometerConfig(device="cpu", progress_loss_type="l2")
+    model = RobometerRewardModel(cfg)
+
+    with pytest.raises(KeyError, match=r"observation\.robometer\.input_ids"):
+        model.compute_reward({})
+
+
+@skip_if_package_missing("transformers")
+def test_robometer_save_pretrained_roundtrips(monkeypatch, tmp_path):
+    """Saving and reloading a Robometer model in LeRobot HF format must produce
+    a single ``model.safetensors`` + ``config.json`` (no Hydra ``config.yaml``),
+    must round-trip user-tunable config fields, and must persist all three
+    prediction heads (``progress_head``, ``success_head``, ``preference_head``)
+    so the published ``Robometer-4B`` checkpoint loads byte-equivalently.
+    """
+    from huggingface_hub.constants import CONFIG_NAME, SAFETENSORS_SINGLE_FILE
+    from safetensors.torch import load_file
+
+    from lerobot.rewards.robometer.modeling_robometer import RobometerRewardModel
+
+    _patch_build(monkeypatch)
+    cfg = RobometerConfig(
+        device="cpu",
+        pretrained_path="robometer/Robometer-4B",
+        # Knobs the user might tweak — must survive the round-trip.
+        image_key="observation.images.cam_top",
+        task_key="task",
+        reward_output="success",
+        success_threshold=0.7,
+        progress_loss_type="l2",
+    )
+    model = RobometerRewardModel(cfg)
+    model.save_pretrained(str(tmp_path))
+
+    # Exactly the files LeRobot's HubMixin promises.
+    assert (tmp_path / CONFIG_NAME).exists()
+    assert (tmp_path / SAFETENSORS_SINGLE_FILE).exists()
+    assert not (tmp_path / "config.yaml").exists()  # we want HF-style, not Hydra
+
+    # All three heads must be present in the saved safetensors. The preference
+    # head is unused at inference but the published checkpoint expects its
+    # rows — losing it would silently break weight loading.
+    state = load_file(str(tmp_path / SAFETENSORS_SINGLE_FILE))
+    assert any(k.startswith("progress_head.") for k in state), "progress_head weights missing"
+    assert any(k.startswith("success_head.") for k in state), "success_head weights missing"
+    assert any(k.startswith("preference_head.") for k in state), "preference_head weights missing"
+
+    # Reload from the local directory: no Hub fetch, no YAML overlay. The
+    # base class drives subclass dispatch via the `type` field in config.json.
+    reloaded_cfg = RewardModelConfig.from_pretrained(str(tmp_path))
+    assert isinstance(reloaded_cfg, RobometerConfig)
+    reloaded_cfg.pretrained_path = str(tmp_path)  # mimic lerobot-train's `validate()`
+    reloaded = RobometerRewardModel.from_pretrained(str(tmp_path), config=reloaded_cfg)
+
+    assert reloaded.config.image_key == "observation.images.cam_top"
+    assert reloaded.config.task_key == "task"
+    assert reloaded.config.reward_output == "success"
+    assert reloaded.config.success_threshold == 0.7
+    assert reloaded.config.progress_loss_type == "l2"  # came back from config.json
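Review note: the decoding tests above imply that discrete progress logits become a scalar by taking a softmax-weighted average of bin centers, and that the `success` reward thresholds the last-frame sigmoid. Below is a pure-Python sketch of both, assuming evenly spaced bin centers on [0, 1] and a strict `>` comparison. Both are assumptions inferred from the expected values in the tests, and the `sketch_*` names are hypothetical; the real `convert_bins_to_continuous` and `compute_reward` operate on tensors and may differ in detail.

```python
from __future__ import annotations

import math


def sketch_bins_to_continuous(bin_logits: list[float]) -> float:
    """Expected bin center under softmax(bin_logits)."""
    num_bins = len(bin_logits)
    # Assumption: centers are evenly spaced from 0.0 to 1.0 inclusive.
    centers = [i / (num_bins - 1) for i in range(num_bins)]
    peak = max(bin_logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in bin_logits]
    total = sum(weights)
    return sum(c * w / total for c, w in zip(centers, weights))


def sketch_success_reward(success_logits: list[float], threshold: float = 0.5) -> float:
    """Binary reward from the last frame's success logit."""
    prob = 1.0 / (1.0 + math.exp(-success_logits[-1]))
    # Assumption: strict comparison; the real model may use >=.
    return 1.0 if prob > threshold else 0.0
```

With ten bins and logits of -10.0 everywhere except a single 10.0, nearly all of the softmax mass sits on the peaked bin, so the result lands within the tests' `1e-3` tolerance of that bin's center.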
--- a/tests/rewards/test_robometer_processor.py
+++ b/tests/rewards/test_robometer_processor.py
@@ -0,0 +1,354 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Tests for Robometer's pre-processing helpers and encoder step.
+
+Covers the pure helpers (``_video_to_numpy`` and ``_expand_tasks``) directly,
+and exercises :class:`RobometerEncoderProcessorStep` with a stubbed
+``AutoProcessor`` so we don't need to download Qwen-VL just to test the
+dataclass plumbing (``transform_features`` / ``get_config``).
+
+The full ``__call__`` path that runs ``process_vision_info`` + the Qwen
+processor is intentionally *not* covered here — it is essentially HF glue
+that's exercised by the integration / parity scripts.
+"""
+
+from __future__ import annotations
+
+from typing import Any
+
+import numpy as np
+import pytest
+import torch
+
+from lerobot.configs import FeatureType, PipelineFeatureType, PolicyFeature
+from lerobot.rewards.robometer.processor_robometer import (
+    PROGRESS_PROMPT,
+    _expand_tasks,
+    _frames_to_pil,
+    _video_to_numpy,
+)
+from tests.utils import skip_if_package_missing
+
+
+def _skip_if_robometer_extras_missing(func):
+    """Apply both optional-dependency guards in one shot.
+
+    ``RobometerEncoderProcessorStep.__post_init__`` calls
+    ``require_package("transformers", ...)`` *and*
+    ``require_package("qwen-vl-utils", ...)``, so both need to be present
+    before we can instantiate the step.
+    """
+    func = skip_if_package_missing("qwen-vl-utils", import_name="qwen_vl_utils")(func)
+    func = skip_if_package_missing("transformers")(func)
+    return func
+
+
+# ---------------------------------------------------------------------------
+# _video_to_numpy — pure tensor → uint8 (T, H, W, C) conversion
+# ---------------------------------------------------------------------------
+
+
+def test_video_to_numpy_chw_float_is_converted_to_thwc_uint8():
+    video = torch.rand(4, 3, 8, 8)  # (T, C, H, W) floats in [0, 1]
+    array = _video_to_numpy(video, max_frames=None)
+
+    assert array.shape == (4, 8, 8, 3)
+    assert array.dtype == np.uint8
+    assert array.min() >= 0 and array.max() <= 255
+
+
+def test_video_to_numpy_already_thwc_uint8_passes_through():
+    video = torch.randint(0, 256, (3, 8, 8, 3), dtype=torch.uint8)  # (T, H, W, C)
+    array = _video_to_numpy(video, max_frames=None)
+
+    assert array.shape == (3, 8, 8, 3)
+    assert array.dtype == np.uint8
+
+
+def test_video_to_numpy_max_frames_tail_crops_recent_frames():
+    """``max_frames`` should keep the **last** K frames (most recent)."""
+    video = torch.zeros(10, 3, 4, 4)
+    for t in range(10):
+        video[t] = t / 9.0  # marker: 0 at t=0, ≈1 at t=9
+
+    array = _video_to_numpy(video, max_frames=3)
+
+    assert array.shape == (3, 4, 4, 3)
+    # The first kept frame is t=7 → marker ≈ 7/9 → uint8 ≈ 198
+    assert int(array[0, 0, 0, 0]) == int(round(7 / 9 * 255))
+    # The last kept frame is t=9 → marker = 1.0 → uint8 = 255
+    assert int(array[-1, 0, 0, 0]) == 255
+
+
+def test_video_to_numpy_rejects_3d_input():
+    with pytest.raises(ValueError, match="Expected channel dim"):
+        _video_to_numpy(torch.zeros(4, 8, 8), max_frames=None)
+
+
+def test_video_to_numpy_floats_above_one_pass_through_without_rescaling():
+    """If ``array.max() > 1`` the helper assumes the tensor is already in the
+    [0, 255] range (uint8-as-float), so values pass through unchanged."""
+    video = torch.full((1, 3, 2, 2), 5.0)
+    array = _video_to_numpy(video, max_frames=None)
+
+    assert array.shape == (1, 2, 2, 3)
+    assert int(array.max()) == 5
+
+
+def test_video_to_numpy_clips_very_large_floats_to_uint8_max():
+    """Out-of-uint8-range floats are clipped at 255 before the cast."""
+    video = torch.full((1, 3, 2, 2), 300.0)
+    array = _video_to_numpy(video, max_frames=None)
+
+    assert int(array.max()) == 255
+
+
+# ---------------------------------------------------------------------------
+# _expand_tasks — string / list / tuple broadcasting to batch size
+# ---------------------------------------------------------------------------
+
+
+def test_expand_tasks_string_is_broadcast_to_batch_size():
+    assert _expand_tasks("pick up", batch_size=3, default=None) == ["pick up", "pick up", "pick up"]
+
+
+def test_expand_tasks_list_of_matching_size_passes_through():
+    assert _expand_tasks(["a", "b", "c"], batch_size=3, default=None) == ["a", "b", "c"]
+
+
+def test_expand_tasks_tuple_is_normalised_to_list():
+    assert _expand_tasks(("a", "b"), batch_size=2, default=None) == ["a", "b"]
+
+
+def test_expand_tasks_single_element_list_is_broadcast():
+    assert _expand_tasks(["only one"], batch_size=3, default=None) == ["only one"] * 3
+
+
+def test_expand_tasks_size_mismatch_raises():
+    with pytest.raises(ValueError, match="Expected 3 tasks"):
+        _expand_tasks(["a", "b"], batch_size=3, default=None)
+
+
+def test_expand_tasks_missing_uses_default():
+    assert _expand_tasks(None, batch_size=2, default="fallback") == ["fallback", "fallback"]
+
+
+def test_expand_tasks_missing_without_default_raises():
+    with pytest.raises(KeyError, match="task description"):
+        _expand_tasks(None, batch_size=1, default=None)
+
+
+def test_expand_tasks_wrong_type_raises():
+    with pytest.raises(TypeError, match="must be a string or list"):
+        _expand_tasks(42, batch_size=1, default=None)
+
+
+# ---------------------------------------------------------------------------
+# _frames_to_pil — uint8 (T, H, W, C) → list[PIL.Image]
+# ---------------------------------------------------------------------------
+
+
+def test_frames_to_pil_returns_one_image_per_frame():
+    frames = np.zeros((4, 8, 8, 3), dtype=np.uint8)
+    images = _frames_to_pil(frames)
+
+    assert len(images) == 4
+    assert all(img.size == (8, 8) for img in images)
+
+
+def test_frames_to_pil_casts_floats_to_uint8():
+    frames = np.full((2, 4, 4, 3), 200.0, dtype=np.float32)
+    images = _frames_to_pil(frames)
+
+    assert len(images) == 2
+    # Round-tripping through PIL should yield uint8 pixels; only the dtype is checked here.
+    assert np.asarray(images[0]).dtype == np.uint8
+
+
+def test_frames_to_pil_rejects_non_4d_input():
+    with pytest.raises(ValueError, match=r"\(T,H,W,C\)"):
+        _frames_to_pil(np.zeros((4, 8, 8), dtype=np.uint8))
+
+
+# ---------------------------------------------------------------------------
+# Encoder step plumbing — exercise dataclass surface with a stubbed AutoProcessor
+# ---------------------------------------------------------------------------
+
+
+class _FakeTokenizer:
+    """Tokenizer surface the encoder step touches in ``__post_init__``."""
+
+    def __init__(self) -> None:
+        self.pad_token: str | None = None
+        self.eos_token = "<|endoftext|>"
+        self._vocab: dict[str, int] = {"<|endoftext|>": 0}
+        self.added: list[str] = []
+
+    def get_vocab(self) -> dict[str, int]:
+        return self._vocab
+
+    def add_special_tokens(self, payload: dict[str, Any]) -> int:
+        for token in payload.get("additional_special_tokens", []):
+            if token not in self._vocab:
+                self._vocab[token] = len(self._vocab)
+                self.added.append(token)
+        return len(self.added)
+
+
+class _FakeAutoProcessor:
+    """Stand-in returned by ``AutoProcessor.from_pretrained`` during tests."""
+
+    def __init__(self) -> None:
+        self.tokenizer = _FakeTokenizer()
+        self.image_processor = None
+        self.video_processor = None
+
+    @classmethod
+    def from_pretrained(cls, *args, **kwargs):  # noqa: ARG003
+        return cls()
+
+
+def _build_step(monkeypatch, **overrides):
+    from lerobot.rewards.robometer import processor_robometer
+
+    monkeypatch.setattr(processor_robometer, "AutoProcessor", _FakeAutoProcessor)
+
+    return processor_robometer.RobometerEncoderProcessorStep(**overrides)
+
+
+@_skip_if_robometer_extras_missing
+def test_encoder_step_registers_special_tokens_on_tokenizer(monkeypatch):
+    """``__post_init__`` must register Robometer's five special tokens on the
+    tokenizer that ships with the chosen Qwen-VL checkpoint."""
+    from lerobot.rewards.robometer.configuration_robometer import ROBOMETER_SPECIAL_TOKENS
+
+    step = _build_step(monkeypatch)
+
+    vocab = step._processor.tokenizer.get_vocab()
+    for token in ROBOMETER_SPECIAL_TOKENS:
+        assert token in vocab, f"{token} not registered on the tokenizer"
+
+
+@_skip_if_robometer_extras_missing
+def test_encoder_step_sets_pad_token_to_eos_when_missing(monkeypatch):
+    """Qwen tokenizers ship without a pad token; the step must reuse EOS so
+    batched processing doesn't crash on padding."""
+    step = _build_step(monkeypatch)
+
+    assert step._processor.tokenizer.pad_token == "<|endoftext|>"
+
+
+@_skip_if_robometer_extras_missing
+def test_encoder_step_get_config_roundtrips_user_fields(monkeypatch):
+    """``get_config`` must serialise every user-tunable field — these are what
+    the processor pipeline saves under ``preprocessor_config.json``."""
+    step = _build_step(
+        monkeypatch,
+        base_model_id="Qwen/Qwen3-VL-4B-Instruct",
+        image_key="observation.images.cam_top",
+        task_key="task",
+        default_task="do the thing",
+        max_frames=12,
+        use_multi_image=True,
+        use_per_frame_progress_token=True,
+        max_length=2048,
+    )
+
+    cfg = step.get_config()
+    assert cfg == {
+        "base_model_id": "Qwen/Qwen3-VL-4B-Instruct",
+        "image_key": "observation.images.cam_top",
+        "task_key": "task",
+        "default_task": "do the thing",
+        "max_frames": 12,
+        "use_multi_image": True,
+        "use_per_frame_progress_token": True,
+        "max_length": 2048,
+    }
+
+
+@_skip_if_robometer_extras_missing
+def test_encoder_step_transform_features_is_identity(monkeypatch):
+    """The encoder step writes Qwen tensors into ``observation`` at call time,
+    but it does **not** advertise new typed features at pipeline-build time —
+    the downstream model consumes them via the ``ROBOMETER_FEATURE_PREFIX``
+    namespace, not via the typed feature map.
+    """
+    step = _build_step(monkeypatch)
+
+    features = {
+        PipelineFeatureType.OBSERVATION: {
+            "observation.images.top": PolicyFeature(shape=(3, 224, 224), type=FeatureType.VISUAL),
+        }
+    }
+    assert step.transform_features(features) == features
+
+
+@_skip_if_robometer_extras_missing
+def test_encoder_step_build_conversation_inserts_prog_token_per_frame(monkeypatch):
+    """In multi-image mode with per-frame progress tokens, the conversation
+    must alternate ``image`` and ``<|prog_token|>`` text entries, one pair
+    per frame, after the task prompt."""
+    step = _build_step(
+        monkeypatch,
+        use_multi_image=True,
+        use_per_frame_progress_token=True,
+    )
+
+    frames = np.zeros((3, 8, 8, 3), dtype=np.uint8)
+    conversation = step._build_conversation(frames, task="pick up the cube")
+
+    assert len(conversation) == 1 and conversation[0]["role"] == "user"
+    content = conversation[0]["content"]
+
+    # First entry is the task prompt.
+    assert content[0] == {"type": "text", "text": PROGRESS_PROMPT.format(task="pick up the cube")}
+
+    # Then 3 (image, <|prog_token|>) pairs.
+    expected_tail = [
+        item
+        for _ in range(3)
+        for item in (
+            {"type": "image"},  # image payload is not compared, only the entry type
+            {"type": "text", "text": "<|prog_token|>"},
+        )
+    ]
+    assert len(content) == 1 + len(expected_tail)
+    for got, exp in zip(content[1:], expected_tail, strict=True):
+        assert got["type"] == exp["type"]
+        if exp["type"] == "text":
+            assert got["text"] == exp["text"]
+
+
+@_skip_if_robometer_extras_missing
+def test_encoder_step_build_conversation_video_mode_uses_single_video_entry(monkeypatch):
+    """When ``use_multi_image=False``, frames are bundled into a single
+    ``video`` content entry instead of individual ``image`` entries."""
+    step = _build_step(
+        monkeypatch,
+        use_multi_image=False,
+        use_per_frame_progress_token=False,
+    )
+
+    frames = np.zeros((4, 8, 8, 3), dtype=np.uint8)
+    conversation = step._build_conversation(frames, task="pour the water")
+
+    content = conversation[0]["content"]
+    # Exactly two entries: the prompt and one video entry.
+    assert len(content) == 2
+    assert content[0]["type"] == "text"
+    assert content[1]["type"] == "video"
+    # The video entry carries all four frames.
+    assert len(content[1]["video"]) == 4
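Reviewer note: the contracts that the `_expand_tasks` and `_build_conversation` tests above pin down can be restated as a small standalone sketch. This is not the code in `processor_robometer`; it only mirrors the behaviour the assertions imply. The names `expand_tasks_sketch`, `build_content_sketch`, and `PROMPT_TEMPLATE` are hypothetical, the real `PROGRESS_PROMPT` text is not shown in this diff, and both the `"image"` payload key and the multi-image layout without progress tokens are assumptions.

```python
from __future__ import annotations

from typing import Any

# Hypothetical placeholder: the real PROGRESS_PROMPT wording is not part of this diff.
PROMPT_TEMPLATE = "Task: {task}"
PROG_TOKEN = "<|prog_token|>"


def expand_tasks_sketch(tasks: Any, batch_size: int, default: str | None) -> list[str]:
    """Broadcast a task spec to exactly one string per batch element."""
    if tasks is None:
        if default is None:
            raise KeyError("No task description found and no default was given")
        return [default] * batch_size
    if isinstance(tasks, str):
        return [tasks] * batch_size
    if not isinstance(tasks, (list, tuple)):
        raise TypeError(f"Task must be a string or list, got {type(tasks).__name__}")
    tasks = list(tasks)  # tuples are normalised to lists
    if len(tasks) == 1:
        return tasks * batch_size  # a single entry is broadcast
    if len(tasks) != batch_size:
        raise ValueError(f"Expected {batch_size} tasks, got {len(tasks)}")
    return tasks


def build_content_sketch(
    frames: list[Any], task: str, multi_image: bool, per_frame_token: bool
) -> list[dict[str, Any]]:
    """Lay out the user-message content: prompt first, then the frames."""
    content: list[dict[str, Any]] = [
        {"type": "text", "text": PROMPT_TEMPLATE.format(task=task)}
    ]
    if not multi_image:
        # Video mode: all frames travel in a single entry.
        content.append({"type": "video", "video": list(frames)})
        return content
    for frame in frames:
        # Assumption: the frame is stored under an "image" key.
        content.append({"type": "image", "image": frame})
        if per_frame_token:
            content.append({"type": "text", "text": PROG_TOKEN})
    return content
```

With three frames in multi-image mode and per-frame tokens enabled, the sketch yields seven content entries (one prompt plus three image/token pairs), which matches the length check in `test_encoder_step_build_conversation_inserts_prog_token_per_frame`.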
--- a/tests/scripts/test_lerobot_annotate.py
+++ b/tests/scripts/test_lerobot_annotate.py
@@ -0,0 +1,51 @@
+#!/usr/bin/env python
+
+import json
+from types import SimpleNamespace
+
+
+def test_push_to_hub_tags_uploaded_dataset_revision(tmp_path, monkeypatch):
+    from lerobot.scripts.lerobot_annotate import _push_to_hub
+
+    root = tmp_path / "dataset"
+    (root / "meta").mkdir(parents=True)
+    (root / "meta" / "info.json").write_text(json.dumps({"codebase_version": "v3.0"}))
+
+    calls = {}
+
+    class FakeHfApi:
+        def create_repo(self, **kwargs):
+            calls["create_repo"] = kwargs
+
+        def upload_folder(self, **kwargs):
+            calls["upload_folder"] = kwargs
+            return SimpleNamespace(oid="abc123")
+
+        def create_tag(self, **kwargs):
+            calls["create_tag"] = kwargs
+
+    monkeypatch.setattr("huggingface_hub.HfApi", FakeHfApi)
+
+    cfg = SimpleNamespace(
+        repo_id="source/dataset",
+        dest_repo_id="annotated/dataset",
+        push_private=True,
+        push_commit_message=None,
+    )
+
+    _push_to_hub(root, cfg)
+
+    assert calls["create_repo"] == {
+        "repo_id": "annotated/dataset",
+        "repo_type": "dataset",
+        "private": True,
+        "exist_ok": True,
+    }
+    assert calls["upload_folder"]["repo_id"] == "annotated/dataset"
+    assert calls["create_tag"] == {
+        "repo_id": "annotated/dataset",
+        "tag": "v3.0",
+        "repo_type": "dataset",
+        "exist_ok": True,
+        "revision": "abc123",
+    }
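Reviewer note: the call sequence this test expects from `_push_to_hub` can be sketched as follows. This is a minimal illustration, not the implementation in `lerobot_annotate`. The function name `push_and_tag_sketch` is hypothetical, and the API client is passed in as a parameter (an assumption made so the sketch runs without network access). The three method names and their keyword arguments follow `huggingface_hub.HfApi`.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any


def push_and_tag_sketch(
    api: Any,
    root: Path,
    repo_id: str,
    private: bool,
    commit_message: str | None = None,
) -> str:
    """Upload a dataset folder and tag the resulting commit with its codebase version."""
    # The tag name comes from the dataset's own metadata.
    info = json.loads((root / "meta" / "info.json").read_text())
    version = info["codebase_version"]

    # exist_ok=True makes both repo creation and tagging safe to repeat.
    api.create_repo(repo_id=repo_id, repo_type="dataset", private=private, exist_ok=True)
    commit = api.upload_folder(
        repo_id=repo_id,
        folder_path=str(root),
        repo_type="dataset",
        commit_message=commit_message,
    )
    # Pin the tag to the commit that was just uploaded, not to the branch head.
    api.create_tag(
        repo_id=repo_id,
        tag=version,
        repo_type="dataset",
        exist_ok=True,
        revision=commit.oid,
    )
    return commit.oid
```

Tagging the uploaded commit's `oid` rather than the default branch means the version tag keeps pointing at the annotated snapshot even if another commit lands on the repository between the upload and the tag call.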
--- a/uv.lock
+++ b/uv.lock
				`@@ -0,0 +1 @@`
				`../../../../docs/source/policy_molmoact2_README.md`