chore: enable Dependabot weekly GitHub Actions bumps

2026-05-31 10:51:35 +00:00 · 2026-05-26 10:32:49 +00:00
115 changed files with 366 additions and 19872 deletions
--- a/.github/dependabot.yml
+++ b/.github/dependabot.yml
@@ -0,0 +1,11 @@
+version: 2
+updates:
+  - package-ecosystem: "github-actions"
+    directory: "/"
+    schedule:
+      interval: "weekly"
+    cooldown:
+      default-days: 7
+    groups:
+      actions:
+        patterns: ["*"]
--- a/6
+++ b/6
@@ -178,9 +178,3 @@ test-smolvla-ete-eval:
 		--env.episode_length=5 \
 		--eval.n_episodes=1 \
 		--eval.batch_size=1
-
-# E2E annotation pipeline smoke test against a tiny in-memory fixture
-# dataset. Opt-in (not part of `make test-end-to-end`) and uses a stub VLM
-# backend, so it does not require a real model checkpoint or GPU.
-annotation-e2e:
-	uv run python -m tests.annotations.run_e2e_smoke
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -43,8 +43,6 @@
    title: Language Columns and Recipes
  - local: tools
    title: Tools
-  - local: annotation_pipeline
-    title: Annotation Pipeline
  - local: video_encoding_parameters
    title: Video encoding parameters
  - local: streaming_video_encoding
--- a/docs/source/annotation_pipeline.mdx
+++ b/docs/source/annotation_pipeline.mdx
@@ -1,199 +0,0 @@
-# Annotation Pipeline
-
-`lerobot-annotate` populates the two language columns introduced by the
-[Language Columns and Recipes](./language_and_recipes) page —
-`language_persistent` and `language_events` — directly into
-`data/chunk-*/file-*.parquet`.
-
-## What the pipeline produces
-
-A vocabulary-discovery phase derives a small canonical wording, then three
-modules write into a per-episode staging tree, then a single writer
-rewrites the data shards in place:
-
-| Style / atom                                | Column                | Module         |
-| ------------------------------------------- | --------------------- | -------------- |
-| `subtask` (Pi0.7-style "how, not what")     | `language_persistent` | `plan`         |
-| `plan` (initial + refresh on interjection)  | `language_persistent` | `plan`         |
-| `memory` (MEM-style compression)            | `language_persistent` | `plan`         |
-| `task_aug` (rephrasings of canonical task)  | `language_persistent` | `plan`         |
-| `interjection`                              | `language_events`     | `interjections`|
-| speech tool-call atom (`style=null`, `say`) | `language_events`     | `interjections`|
-| `vqa` (user / assistant pair)               | `language_events`     | `vqa`          |
-
-The `plan` module is constrained to a **canonical vocabulary** discovered
-once per dataset by the `vocabulary` module (phase 0). It watches a few
-sample episode videos (`--vocabulary.sample_episodes`, default `3`) and
-asks the VLM to derive a small set of imperative subtask labels and
-first-person memory milestones that recur across the demos. The VLM
-picks the right number of entries itself based on what it sees in the
-clips — short pick-and-place demos get ~6 subtask labels, longer
-multi-step recipes get more. The result lands at
-`meta/canonical_vocabulary.json` (human-readable / hand-editable) and
-is reused on every subsequent run. The `plan` module then constrains
-both subtask + memory generation to those exact strings — the
-downstream low-level policy sees a small, repeatable target
-distribution instead of thousands of LLM paraphrases. Disable with
-`--vocabulary.enabled=False` to fall back to free-form generation.
-
-The writer does **not** add a `tools` column to the parquet — the tool
-catalog lives at `meta/info.json["tools"]` instead (see
-[Tools](./tools)). After every annotation run the pipeline ensures the
-canonical `say` schema is present in that list, preserving any tools the
-user pre-declared.
-
-If you want to declare additional tools for a dataset before annotation
-runs, edit `meta/info.json["tools"]` directly — the pipeline preserves
-anything already there. Implementations of those tools live under
-`src/lerobot/tools/`; one file per tool, registered via
-`TOOL_REGISTRY`. See the [Tools](./tools) doc for the authoring guide.
-
-## Running locally
-
-Install the extra and invoke the console script. Episode-level
-concurrency comes from `--executor.episode_parallelism` (default 16);
-that is the only knob the in-process executor exposes.
-
-```bash
-uv sync --extra annotations
-uv run lerobot-annotate \
-  --root=/path/to/dataset \
-  --vlm.model_id=Qwen/Qwen2.5-VL-7B-Instruct
-```
-
-The pipeline attaches actual camera footage to every `plan` /
-`interjections` / `vqa` prompt by default, decoded from the dataset's
-first `observation.images.*` stream. Override with
-`--vlm.camera_key=observation.images.<name>` to pin a specific
-viewpoint. Datasets with no video tracks fall back to text-only prompts
-automatically.
-
-**The `plan` module sees the whole episode as one video block.** Subtask
-decomposition gets a `{"type":"video", "video":[<frames>]}` block
-covering the entire demonstration; Qwen-VL pools temporally on its own
-and decides where to cut. There is no keyframe stride or count knob —
-`--plan.max_video_frames` (default 128) only caps the frames packed
-into the video block as a model-capacity bound. The `interjections`
-module attaches a short window of frames straddling the interjection
-timestamp. The `vqa` module grounds each VQA pair on a single frame —
-its `--vqa.K` knob sets how many consecutive frames each emission tick
-anchors, and every anchored frame gets its own VQA pair on that one
-frame (there is no per-pair frame window).
-
-## Running on Hugging Face Jobs
-
-Distributed annotation is delegated to
-[Hugging Face Jobs](https://huggingface.co/docs/hub/en/jobs). The repo
-ships a launcher script you copy and edit for your dataset:
-
-```bash
-HF_TOKEN=hf_... uv run python examples/annotations/run_hf_job.py
-```
-
-[`examples/annotations/run_hf_job.py`](https://github.com/huggingface/lerobot/blob/main/examples/annotations/run_hf_job.py)
-spawns one `h200x4` job that:
-
-1. installs the branch under test plus the annotation extras,
-2. boots four vllm servers (one per GPU) for the chosen model,
-3. runs the `plan` / `interjections` / `vqa` modules across the dataset
-   via `lerobot-annotate`,
-4. uploads the annotated dataset to `--push_to_hub`.
-
-To target a different dataset, model, or hub repo, edit the `CMD` block
-inside the script — every flag in there maps directly onto a CLI flag of
-`lerobot-annotate` (see `lerobot-annotate --help` for the full list).
-
-## Style-to-recipe consumer mapping
-
-The pipeline's outputs are designed to be consumed by recipes (see
-[Language Columns and Recipes](./language_and_recipes)) — for the
-canonical PI052 blend `src/lerobot/configs/recipes/subtask_mem_vqa_speech.yaml`:
-
-- low-level / high-level / memory-update branches consume
-  `subtask`/`plan`/`memory` from `language_persistent`.
-- An interjection-response branch consumes `interjection` events plus
-  the paired speech atom (merged into one assistant target turn via
-  `tool_calls_from`) and the same-timestamp `plan` refresh.
-- A VQA branch consumes the `(vqa, user)` and `(vqa, assistant)` pairs
-  from `language_events`.
-
-## Why the design splits state from events
-
-Two things drive the scope:
-
-1. **Persistent state vs exact-event split.** Persistent rows
-   (`subtask`, `plan`, `memory`) broadcast per episode and answer "what
-   state is in force at this frame?". Event rows (`interjection`, `vqa`,
-   speech) only appear on the exact frame whose timestamp matches the
-   emission. The pipeline writes timestamps taken straight from the
-   source parquet — no floating-point recomputation.
-2. **One Qwen-VL pass.** All three modules share a single VLM client
-   (vLLM if available, transformers fallback) so the cost is one model
-   load per dataset, not three.
-
-## Module independence and staged reruns
-
-Each module writes its raw output to
-`<root>/.annotate_staging/episode_{N:06d}/<module>.jsonl`. That makes
-prompt iteration cheap — re-running one module overwrites only its own
-JSONL file before the writer composes the final parquet. To test one
-module in isolation, disable the others via `--plan.enabled=false`
-(and likewise `--interjections.enabled=false` /
-`--vqa.enabled=false`).
-
-## Validation/report checks before final write
-
-Before the writer runs, `StagingValidator` checks:
-
-- exact frame-timestamp alignment for every event row;
-- no orphan speech / interjection pairs;
-- `plan` is refreshed at every interjection timestamp;
-- `memory` rows fall on subtask boundaries (warning, not error);
-- VQA assistant `content` parses as JSON in one of the
-  bbox / keypoint / count / attribute / spatial shapes;
-- every row routes to the column dictated by `column_for_style(style)`.
-
-Errors abort the writer (`--skip_validation=true` overrides for debugging).
-
-## Paper inspirations per module
-
-- **`plan` module — subtasks.** Hi Robot ([Shi 2025](https://arxiv.org/abs/2502.19417))
-  atom granularity ("pick up one piece of lettuce", "place bowl to box");
-  Pi0.7 ([Physical Intelligence 2025](https://pi.website/pi07)) "how, not
-  what" detail.
-- **`plan` module — memory.** MEM ([Torne 2026](https://arxiv.org/abs/2603.03596))
-  compression directive: keep only minimal relevant information; functional
-  outcomes preserved, specific attributes dropped.
-- **`interjections` module.** Hi Robot scenario taxonomy: negative task,
-  situated correction, specific constraint, preference. Speech is a
-  tool-call-only atom (`tool_calls=[{type:function, function:{name:"say",
-arguments:{text:...}}}]`).
-- **`vqa` module.** ECoT ([Zawalski 2024](https://arxiv.org/abs/2407.08693))
-  grounded features (bounding boxes in pixel `[x_min, y_min, x_max, y_max]`,
-  keypoints) and Steerable VLA Policies ([Zhao 2025](https://arxiv.org/abs/2509.07626))
-  multi-abstraction grounding. Pi0.7 also grounds answers across
-  multiple abstraction levels.
-
-Future maintainers should adjust the prompt templates in
-`src/lerobot/annotations/steerable_pipeline/prompts/` against these
-references rather than rewriting from scratch.
-
-## Compute and list-size estimates
-
-Per episode, the pipeline issues O(`max_steps`) `plan`-module calls,
-O(`max_interjections_per_episode`) `interjections`-module calls, and
-O(`vqa_emission_hz × episode_seconds`) `vqa`-module calls. With defaults
-(8 subtasks, 1 interjection, 1 Hz × 3 pairs) and 30-second episodes, that
-is ~50 VLM calls per episode. `language_persistent` per episode is ~10s of
-KB at most (parquet dictionary-encodes one entry per episode);
-`language_events` is empty on most frames and is bounded by the number of
-emissions, not `num_frames × num_emissions`.
-
-## Reproducibility via seed and prompt hashes
-
-`--seed` (default 1729) feeds the per-episode RNGs that select interjection
-timestamps and VQA question types. Combined with the deterministic prompt
-templates checked into `prompts/`, two runs at the same seed against the
-same dataset and the same model checkpoint produce byte-identical staging
-artifacts. Prompt edits are recorded by file hash; future tooling can pin
-expected `(seed, prompt_hash)` pairs into the dataset card.
--- a/docs/source/language_and_recipes.mdx
+++ b/docs/source/language_and_recipes.mdx
@@ -141,11 +141,6 @@ sample["target_message_indices"]

 The renderer does not apply a tokenizer chat template. Policy processors decide how to serialize the messages for their backbone, which keeps the same dataset usable across SmolVLA, Pi0.5, and any future VLM that expects OpenAI-style chat messages.

-## Blends
-
-Blend recipes select one weighted sub-recipe deterministically from the sample index.
-`recipes/subtasks_vqa.yaml` trains the core blend — high-level subtask prediction, low-level execution, and VQA. `recipes/subtask_mem_vqa_speech.yaml` is the fuller variant that also adds memory updates and spoken interjection responses.
-
 ## Graceful absence

 If both language columns are missing, `None`, or empty, `RenderMessagesStep` is a no-op.
--- a/examples/annotations/run_hf_job.py
+++ b/examples/annotations/run_hf_job.py
@@ -1,121 +0,0 @@
-#!/usr/bin/env python
-"""Launch ``lerobot-annotate`` on a Hugging Face job (vllm + Qwen3.6 MoE).
-
-Spawns one ``h200x4`` job that:
-
-  1. installs this branch of ``lerobot`` plus the annotation extras,
-  2. boots four vllm servers (one per H200) with Qwen3.6-35B-A3B-FP8,
-  3. runs the plan + vqa modules across the dataset in free-form
-     mode — phase 0 (canonical vocabulary discovery) is disabled so
-     every episode's subtasks + memory are generated independently;
-     interjections is also disabled, which short-circuits the
-     plan_update phase that depends on it,
-  4. uploads the annotated dataset to ``--dest_repo_id`` (when set)
-     or back to ``--repo_id``.
-
-Usage:
-
-    HF_TOKEN=hf_... uv run python examples/annotations/run_hf_job.py
-
-Adjust ``CMD`` below to point at your own dataset / target hub repo.
-"""
-
-import os
-
-from huggingface_hub import get_token, run_job
-
-token = os.environ.get("HF_TOKEN") or get_token()
-if not token:
-    raise RuntimeError("No HF token. Run `huggingface-cli login` or `export HF_TOKEN=hf_...`")
-
-CMD = (
-    "apt-get update -qq && apt-get install -y -qq git ffmpeg && "
-    "pip install --no-deps "
-    "'lerobot @ git+https://github.com/huggingface/lerobot.git@feat/language-annotation-pipeline' && "
-    "pip install --upgrade-strategy only-if-needed "
-    "datasets pyarrow av jsonlines draccus gymnasium torchcodec mergedeep pyyaml-include toml typing-inspect "
-    "openai && "
-    "export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 && "
-    "export VLLM_VIDEO_BACKEND=pyav && "
-    "lerobot-annotate "
-    "--repo_id=pepijn223/robocasa_smoke_2atomic_v3 "
-    "--dest_repo_id=pepijn223/robocasa_smoke_2atomic_v3_annotated "
-    "--push_to_hub=true "
-    "--vlm.backend=openai "
-    "--vlm.model_id=Qwen/Qwen3.6-35B-A3B-FP8 "
-    "--vlm.parallel_servers=4 "
-    "--vlm.num_gpus=4 "
-    '--vlm.serve_command="vllm serve Qwen/Qwen3.6-35B-A3B-FP8 '
-    # 4× the context (32768 → 131072) so long episodes at 1 Hz fit even
-    # at full Qwen vision resolution: 90 frames @ ~700 vision tokens/frame
-    # ≈ 63 k tokens, comfortably under 131 k. On 1× H200 (141 GB) the
-    # 35B-FP8 model leaves plenty of room for the bigger KV cache.
-    "--tensor-parallel-size 1 --max-model-len 131072 "
-    '--gpu-memory-utilization 0.85 --uvicorn-log-level warning --port {port}" '
-    "--vlm.serve_ready_timeout_s=1800 "
-    "--vlm.client_concurrency=256 "
-    "--vlm.max_new_tokens=512 "
-    # Low temperature for VQA: bbox + keypoint are coordinate-regression
-    # tasks where sampling noise directly degrades localization
-    # (overlapping boxes, drifted points). 0.2 keeps the model decisive
-    # while still letting question/label phrasing vary across frames.
-    "--vlm.temperature=0.2 "
-    "--executor.episode_parallelism=64 "
-    "--vlm.chat_template_kwargs='{\"enable_thinking\": false}' "
-    # Whole-scene agentview is the right choice for subtask reasoning +
-    # VQA on robocasa: the wrist (``robot0_eye_in_hand``) usually only
-    # sees the gripper + nearby object, which hurts "what is happening
-    # in this episode" decomposition. Override per-dataset if your
-    # cameras are named differently (inspect ``meta/info.json``).
-    "--vlm.camera_key=observation.images.robot0_agentview_left "
-    # Phase 0 — canonical vocabulary discovery DISABLED. This dataset's
-    # episodes span heterogeneous tasks/scenes, so a single shared
-    # subtask + memory vocabulary would be too narrow — each episode
-    # generates its subtasks + memory free-form instead.
-    "--vocabulary.enabled=false "
-    # Phase 1 — plan module (subtasks + plan + memory + task_aug).
-    "--plan.enabled=true "
-    "--plan.frames_per_second=1.0 "
-    "--plan.use_video_url=true "
-    "--plan.use_video_url_fps=1.0 "
-    # Force coarse, composite subtasks (``pick up X`` = approach + grasp
-    # + lift in one span, not three). 3 s is large enough to host a
-    # full grasp-or-place composite at typical 20 fps robocasa speeds;
-    # any candidate span shorter than this gets merged into a neighbour
-    # by the prompt's authoring rules (see module_1_subtasks.txt).
-    "--plan.min_subtask_seconds=3.0 "
-    # Cap so the VLM can't drift into micro-segmentation. Combined with
-    # the composite-action rules in the prompt, this targets ~3-6
-    # meaningful spans per episode for typical pick-and-place demos.
-    "--plan.plan_max_steps=9 "
-    # ``off`` keeps the dataset's canonical ``record.episode_task`` as-is
-    # — no per-episode VLM "what is this video about" call. Switch to
-    # ``if_short`` (default) only if some episodes have placeholder /
-    # missing canonical tasks; ``always`` overrides every episode's task.
-    "--plan.derive_task_from_video=off "
-    # 0 disables the task_aug pass entirely (see PlanConfig.n_task_rephrasings
-    # docstring) — no per-episode paraphrase generation, no task_aug rows.
-    "--plan.n_task_rephrasings=0 "
-    # Phase 2 — interjections OFF (also skips phase 3 plan_update,
-    # see executor.py:_run_plan_update_phase guard).
-    "--interjections.enabled=false "
-    # Phase 4 — general VQA. K=1 keeps each VQA answer on its own
-    # emission frame (no temporal smear); see VqaConfig.K docstring.
-    # 3 Hz cadence: at 20 fps source, that's a VQA tick every ~7 frames.
-    # NOTE: VQA emits per-camera, so for robocasa (3 cameras) each tick
-    # produces 3 (user, assistant) row pairs — total call volume ~= 3 *
-    # 3 Hz * mean_episode_seconds * n_episodes.
-    "--vqa.enabled=true "
-    "--vqa.K=1 "
-    "--vqa.vqa_emission_hz=3.0"
-)
-
-job = run_job(
-    image="vllm/vllm-openai:latest",
-    command=["bash", "-c", CMD],
-    flavor="h200x4",
-    secrets={"HF_TOKEN": token},
-    timeout="24h",
-)
-print(f"Job URL: {job.url}")
-print(f"Job ID:  {job.id}")
--- a/examples/benchmark/bench_pi052_kernels.slurm
+++ b/examples/benchmark/bench_pi052_kernels.slurm
@@ -1,74 +0,0 @@
-#!/bin/bash
-#SBATCH --job-name=bench-pi052-kernels
-#SBATCH --partition=hopper-prod
-#SBATCH --qos=high
-#SBATCH --time=01:30:00
-#SBATCH --ntasks=1
-#SBATCH --gpus-per-task=1
-#SBATCH --output=/fsx/pepijn/logs/bench_pi052_kernels_%j.out
-
-# HF kernels exploration via Liger's apply_liger_kernel_to_paligemma.
-# Baseline (SDPA, no kernels) vs. per-subkernel ablations vs. all-on.
-# Same harness as bench_pi052_step.py — only the --kernels flag varies
-# across runs so any delta is attributable to the patched op(s).
-#
-# Subkernels exercised: rope, rms_norm, geglu, layer_norm.
-# Skipped: cross_entropy / fused_linear_cross_entropy — pi052 calls
-# F.cross_entropy directly and bypasses PaliGemma's forward, so those
-# patches wouldn't fire without model-code changes (separate PR).
-
-set -euo pipefail
-
-cd "${LEROBOT_ROOT:-$HOME/lerobot}"
-
-export PATH="$HOME/miniconda3/bin:$HOME/.local/bin:$PATH"
-export LD_LIBRARY_PATH="$HOME/miniconda3/lib:${LD_LIBRARY_PATH:-}"
-export PYTORCH_CUDA_ALLOC_CONF="${PYTORCH_CUDA_ALLOC_CONF:-expandable_segments:True}"
-
-# /fsx triton cache is shared across nodes with different glibc versions
-# — kernels built on one node trip GLIBC_2.34-not-found on another. Use
-# a node-local cache per job to side-step that.
-export TRITON_CACHE_DIR="/tmp/triton_${SLURM_JOB_ID}"
-export TORCHINDUCTOR_CACHE_DIR="/tmp/torchinductor_${SLURM_JOB_ID}"
-mkdir -p "$TRITON_CACHE_DIR" "$TORCHINDUCTOR_CACHE_DIR"
-
-echo "=== Node: $(hostname) ==="
-nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
-ldd --version | head -1
-
-# Liger isn't in our standard env yet — install on the compute node so
-# the slurm log captures the exact version that produced the numbers.
-python -m pip install -q --upgrade 'liger-kernel'
-python - <<'PY' || true
-from importlib.metadata import version, PackageNotFoundError
-try:
-    print("liger-kernel", version("liger-kernel"))
-except PackageNotFoundError:
-    print("liger-kernel: not importable")
-import liger_kernel.transformers as t
-print("apply_liger_kernel_to_paligemma:", hasattr(t, "apply_liger_kernel_to_paligemma"))
-PY
-
-run() {
-    echo
-    echo "--- $* ---"
-    python examples/benchmark/bench_pi052_step.py "$@" || true
-}
-
-# -- Baseline (no kernels) at the BS we actually train at. --
-run --attn sdpa --batch-size 8  --kernels none
-run --attn sdpa --batch-size 16 --kernels none
-
-# -- Per-subkernel ablations at BS=16 to isolate each contributor. --
-run --attn sdpa --batch-size 16 --kernels rms_norm
-run --attn sdpa --batch-size 16 --kernels geglu
-run --attn sdpa --batch-size 16 --kernels layer_norm
-run --attn sdpa --batch-size 16 --kernels rope
-
-# -- All-on, both BS to compare against the matched baselines above. --
-run --attn sdpa --batch-size 8  --kernels all
-run --attn sdpa --batch-size 16 --kernels all
-
-# -- Headroom check: does kernels-all let BS=24 fit (baseline OOMs near here)? --
-run --attn sdpa --batch-size 24 --kernels none
-run --attn sdpa --batch-size 24 --kernels all
--- a/examples/benchmark/bench_pi052_step.py
+++ b/examples/benchmark/bench_pi052_step.py
@@ -1,338 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-"""Benchmark ``PI052Policy.forward + backward`` on a single GPU.
-
-Compares the new SDPA attention path against the eager baseline by
-monkeypatching ``sdpa_attention_forward`` before the first model
-forward — so both runs share identical Q/K/V plumbing and only the
-attention kernel differs. Reports steps/sec and peak GPU memory.
-
-SLURM-only:
-
-    sbatch examples/benchmark/bench_pi052_step.slurm
-
-Or one-off:
-
-    srun --partition=hopper-prod --qos=high --gpus=1 --time=15 \\
-        python examples/benchmark/bench_pi052_step.py --attn sdpa --batch-size 8
-"""
-
-from __future__ import annotations
-
-import argparse
-import gc
-import math
-import os
-import time
-
-import torch
-
-
-def _maybe_patch_eager() -> None:
-    """Swap ``sdpa_attention_forward`` for the original eager forward.
-
-    Must be called BEFORE PI052Policy is instantiated — the layer
-    compute functions resolve the symbol at call time (module-level
-    lookup), so this patch covers both pi05 and pi052 KI paths."""
-    from transformers.models.gemma import modeling_gemma
-
-    from lerobot.policies.pi05 import modeling_pi05
-
-    modeling_pi05.sdpa_attention_forward = modeling_gemma.eager_attention_forward
-
-
-_LIGER_SUBKERNELS = ("rope", "rms_norm", "geglu", "layer_norm")
-
-
-def _maybe_patch_liger(spec: str) -> dict:
-    """Globally patch PaliGemma/Gemma/Siglip modules with Liger Triton kernels.
-
-    Must be called BEFORE PI052Policy is instantiated — Liger replaces
-    classes inside ``transformers.models.{gemma,gemma2,siglip,paligemma}``,
-    so any model built after the call picks up the fused forwards.
-
-    ``spec`` is a comma-separated subset of {rope, rms_norm, geglu,
-    layer_norm} (also ``all`` and ``none``). ``cross_entropy`` and
-    ``fused_linear_cross_entropy`` are intentionally skipped — pi052's
-    losses use ``F.cross_entropy`` directly (not ``nn.CrossEntropyLoss``)
-    and never traverse ``PaliGemmaForConditionalGeneration.forward``,
-    so neither patch would fire without invasive model-code changes.
-    """
-    enabled = dict.fromkeys(_LIGER_SUBKERNELS, False)
-    if spec in ("", "none"):
-        return enabled
-    tokens = [t.strip() for t in spec.split(",") if t.strip()]
-    if tokens == ["all"]:
-        enabled = dict.fromkeys(_LIGER_SUBKERNELS, True)
-    else:
-        for t in tokens:
-            if t not in enabled:
-                raise SystemExit(f"Unknown liger subkernel: {t!r}. Choose from {_LIGER_SUBKERNELS} or 'all'.")
-            enabled[t] = True
-
-    from liger_kernel.transformers import apply_liger_kernel_to_paligemma
-
-    apply_liger_kernel_to_paligemma(
-        rope=enabled["rope"],
-        rms_norm=enabled["rms_norm"],
-        geglu=enabled["geglu"],
-        layer_norm=enabled["layer_norm"],
-        cross_entropy=False,
-        fused_linear_cross_entropy=False,
-    )
-    return enabled
-
-
-def _maybe_patch_flex() -> None:
-    """Swap ``sdpa_attention_forward`` for a FlexAttention-backed forward.
-
-    Experimental: builds a per-call ``score_mod`` from the additive
-    mask and dispatches to a compiled ``flex_attention`` kernel.
-
-    Known issue on torch 2.7.1: dynamo errors out with
-    ``FlexAttentionHigherOrderVariable() has no type`` when the
-    ``score_mod`` closure captures a per-call bias tensor. A proper
-    port needs ``create_block_mask(mask_mod, ...)`` plumbed at the
-    PI05Pytorch.forward level so a BlockMask object can be passed
-    down to the layer compute, not a per-call closure. Left as
-    future work; keep this stub for benchmark experimentation."""
-    import torch
-    from torch.nn.attention.flex_attention import flex_attention
-
-    from lerobot.policies.pi05 import modeling_pi05
-
-    compiled_flex = torch.compile(flex_attention, dynamic=True)
-
-    def flex_forward(module, query, key, value, attention_mask, scaling, dropout=0.0):
-        n_rep = module.num_key_value_groups
-        if n_rep > 1:
-            key = key.repeat_interleave(n_rep, dim=1)
-            value = value.repeat_interleave(n_rep, dim=1)
-
-        bias = attention_mask  # (B, 1, Lq, Lk) additive
-
-        def score_mod(score, b, h, q_idx, kv_idx):
-            return score + bias[b, 0, q_idx, kv_idx]
-
-        attn_output = compiled_flex(query, key, value, score_mod=score_mod, scale=scaling)
-        return attn_output.transpose(1, 2).contiguous(), None
-
-    modeling_pi05.sdpa_attention_forward = flex_forward
-
-
-def _build_policy(args, device: torch.device):
-    """Random-init PI052Policy at production-relevant shapes."""
-    from lerobot.configs.types import FeatureType, PolicyFeature
-    from lerobot.policies.pi052.configuration_pi052 import PI052Config
-    from lerobot.policies.pi052.modeling_pi052 import PI052Policy
-
-    # Production has ``unfreeze_lm_head=True`` + ``text_loss_weight>0``,
-    # which flips ``train_expert_only=False`` in __post_init__ and
-    # makes the whole PaliGemma + Gemma-expert stack trainable. We
-    # mirror that here so the optimizer-state count reflects reality;
-    # the loss path still goes through ``PI05Policy.forward`` because
-    # ``text_labels`` / FAST tokens are absent from the synthetic batch
-    # (see ``PI052Policy.forward`` early-return).
-    config = PI052Config(
-        max_action_dim=args.action_dim,
-        max_state_dim=args.state_dim,
-        dtype=args.dtype,
-        knowledge_insulation=args.knowledge_insulation,
-        text_loss_weight=1e-3 if args.train_full else 0.0,
-        flow_loss_weight=1.0,
-        enable_fast_action_loss=False,
-        unfreeze_lm_head=args.train_full,
-        tokenizer_max_length=args.lang_tokens,
-        device="cuda",
-        compile_model=args.compile_model,
-        compile_mode=args.compile_mode,
-    )
-    config.input_features = {
-        "observation.state": PolicyFeature(type=FeatureType.STATE, shape=(args.state_dim,)),
-        "observation.images.base_0_rgb": PolicyFeature(type=FeatureType.VISUAL, shape=(3, 224, 224)),
-    }
-    config.output_features = {
-        "action": PolicyFeature(type=FeatureType.ACTION, shape=(args.action_dim,)),
-    }
-    policy = PI052Policy(config)
-    policy.to(device)
-    if args.gradient_checkpointing:
-        policy.model.gradient_checkpointing_enable()
-    policy.train()
-    return policy, config
-
-
-def _build_batch(args, config, device: torch.device) -> dict:
-    """Synthetic batch matching the training-loop input contract."""
-    from lerobot.utils.constants import (
-        ACTION,
-        OBS_LANGUAGE_ATTENTION_MASK,
-        OBS_LANGUAGE_TOKENS,
-    )
-
-    B = args.batch_size
-    L = args.lang_tokens
-    return {
-        OBS_LANGUAGE_TOKENS: torch.randint(0, 250000, (B, L), device=device),
-        OBS_LANGUAGE_ATTENTION_MASK: torch.ones(B, L, dtype=torch.bool, device=device),
-        "observation.images.base_0_rgb": torch.rand(B, 3, 224, 224, device=device),
-        "observation.images.base_0_rgb_padding_mask": torch.ones(B, dtype=torch.bool, device=device),
-        "observation.state": torch.randn(B, args.state_dim, device=device),
-        ACTION: torch.randn(B, config.chunk_size, args.action_dim, device=device),
-        "action_is_pad": torch.zeros(B, config.chunk_size, dtype=torch.bool, device=device),
-        "task": ["bench task"] * B,
-    }
-
-
-def _step(policy, batch, optimizer=None) -> torch.Tensor:
-    loss, _ = policy.forward(batch)
-    loss.backward()
-    if optimizer is not None:
-        optimizer.step()
-        optimizer.zero_grad(set_to_none=True)
-    else:
-        for p in policy.parameters():
-            if p.grad is not None:
-                p.grad = None
-    return loss.detach()
-
-
-def main() -> int:
-    parser = argparse.ArgumentParser()
-    parser.add_argument("--attn", choices=["sdpa", "eager", "flex"], default="sdpa")
-    parser.add_argument(
-        "--kernels",
-        default="none",
-        help=(
-            "Liger sub-kernels to enable, comma-separated. Choose from "
-            f"{_LIGER_SUBKERNELS} or use 'all' / 'none' (default). Applied "
-            "via apply_liger_kernel_to_paligemma() BEFORE model build."
-        ),
-    )
-    parser.add_argument(
-        "--compile",
-        dest="compile_model",
-        action="store_true",
-        help="Set policy.config.compile_model=True (torch.compile the forward).",
-    )
-    parser.add_argument(
-        "--compile-mode",
-        default="default",
-        help="torch.compile mode (default | reduce-overhead | max-autotune).",
-    )
-    parser.add_argument("--batch-size", type=int, default=8)
-    parser.add_argument("--warmup", type=int, default=8)
-    parser.add_argument("--steps", type=int, default=40)
-    parser.add_argument("--lang-tokens", type=int, default=512)
-    parser.add_argument("--dtype", choices=["bfloat16", "float32"], default="bfloat16")
-    parser.add_argument("--action-dim", type=int, default=14)
-    parser.add_argument("--state-dim", type=int, default=14)
-    parser.add_argument("--knowledge-insulation", action=argparse.BooleanOptionalAction, default=True)
-    parser.add_argument(
-        "--gradient-checkpointing",
-        dest="gradient_checkpointing",
-        action=argparse.BooleanOptionalAction,
-        default=True,
-    )
-    parser.add_argument(
-        "--optimizer",
-        choices=["none", "adamw", "adamw_fused"],
-        default="adamw_fused",
-        help=(
-            "Whether to include an AdamW step in the timed iteration. "
-            "'none' mirrors the fwd+bwd-only original bench; 'adamw' / "
-            "'adamw_fused' add the realistic ~2x param-bytes optimizer "
-            "state and ``optimizer.step()`` cost."
-        ),
-    )
-    parser.add_argument(
-        "--train-full",
-        action=argparse.BooleanOptionalAction,
-        default=True,
-        help=(
-            "Mirror production: unfreeze the PaliGemma backbone (full "
-            "~3B trainable params) instead of training only the 300M "
-            "action expert."
-        ),
-    )
-    args = parser.parse_args()
-
-    if not torch.cuda.is_available():
-        raise SystemExit("Benchmark requires CUDA; submit via slurm (srun/sbatch).")
-
-    if args.attn == "eager":
-        _maybe_patch_eager()
-    elif args.attn == "flex":
-        _maybe_patch_flex()
-
-    liger_flags = _maybe_patch_liger(args.kernels)
-
-    device = torch.device("cuda")
-    torch.cuda.reset_peak_memory_stats()
-
-    policy, config = _build_policy(args, device)
-    batch = _build_batch(args, config, device)
-
-    optimizer = None
-    trainable_params = sum(p.numel() for p in policy.parameters() if p.requires_grad)
-    if args.optimizer != "none":
-        trainable = [p for p in policy.parameters() if p.requires_grad]
-        optimizer = torch.optim.AdamW(
-            trainable, lr=5e-5, fused=(args.optimizer == "adamw_fused")
-        )
-
-    for _ in range(args.warmup):
-        _step(policy, batch, optimizer)
-    torch.cuda.synchronize()
-
-    starter = torch.cuda.Event(enable_timing=True)
-    ender = torch.cuda.Event(enable_timing=True)
-    starter.record()
-    for _ in range(args.steps):
-        _step(policy, batch, optimizer)
-    ender.record()
-    torch.cuda.synchronize()
-    total_ms = starter.elapsed_time(ender)
-    step_ms = total_ms / args.steps
-    peak_gb = torch.cuda.max_memory_allocated() / (1024**3)
-    optim_gb = 0.0
-    if optimizer is not None:
-        for st in optimizer.state.values():
-            for v in st.values():
-                if torch.is_tensor(v):
-                    optim_gb += v.numel() * v.element_size() / (1024**3)
-
-    liger_on = ",".join(k for k, v in liger_flags.items() if v) or "none"
-    name = (
-        f"{args.attn:>5} | BS={args.batch_size} | L={args.lang_tokens} | "
-        f"KI={args.knowledge_insulation} | GC={args.gradient_checkpointing} | "
-        f"compile={args.compile_model} | liger={liger_on} | opt={args.optimizer} | dtype={args.dtype}"
-    )
-    print(
-        f"{name}\n  step_ms={step_ms:.1f}  steps/sec={1000.0 / step_ms:.3f}  "
-        f"peak_mem={peak_gb:.2f} GiB  optim_state={optim_gb:.2f} GiB  "
-        f"trainable_params={trainable_params / 1e9:.2f}B"
-    )
-
-    del policy, batch
-    gc.collect()
-    torch.cuda.empty_cache()
-    return 0
-
-
-if __name__ == "__main__":
-    raise SystemExit(main())
--- a/examples/benchmark/bench_pi052_step.slurm
+++ b/examples/benchmark/bench_pi052_step.slurm
@@ -1,36 +0,0 @@
-#!/bin/bash
-#SBATCH --job-name=bench-pi052-attn
-#SBATCH --partition=hopper-prod
-#SBATCH --qos=high
-#SBATCH --time=00:30:00
-#SBATCH --ntasks=1
-#SBATCH --gpus-per-task=1
-#SBATCH --output=/fsx/pepijn/logs/bench_pi052_%j.out
-
-set -euo pipefail
-
-cd "${LEROBOT_ROOT:-$HOME/lerobot}"
-
-export PATH="$HOME/miniconda3/bin:$HOME/.local/bin:$PATH"
-export LD_LIBRARY_PATH="$HOME/miniconda3/lib:${LD_LIBRARY_PATH:-}"
-export PYTORCH_CUDA_ALLOC_CONF="${PYTORCH_CUDA_ALLOC_CONF:-expandable_segments:True}"
-
-echo "=== Node: $(hostname) ==="
-nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
-
-python -c "import torch; print('torch', torch.__version__, 'cuda', torch.version.cuda)"
-
-run() {
-    echo
-    echo "--- $* ---"
-    python examples/benchmark/bench_pi052_step.py "$@" || true
-}
-
-# Attention parity benchmark — same shapes, different attention kernel.
-run --attn eager --batch-size 8
-run --attn sdpa  --batch-size 8
-
-# Headroom benchmark — does SDPA's memory cut allow a bigger micro-batch?
-run --attn sdpa  --batch-size 12
-run --attn sdpa  --batch-size 16
-run --attn sdpa  --batch-size 24
--- a/examples/benchmark/bench_pi052_step_v2.slurm
+++ b/examples/benchmark/bench_pi052_step_v2.slurm
@@ -1,39 +0,0 @@
-#!/bin/bash
-#SBATCH --job-name=bench-pi052-v2
-#SBATCH --partition=hopper-prod
-#SBATCH --qos=high
-#SBATCH --time=00:45:00
-#SBATCH --ntasks=1
-#SBATCH --gpus-per-task=1
-#SBATCH --output=/fsx/pepijn/logs/bench_pi052_v2_%j.out
-
-set -euo pipefail
-
-cd "${LEROBOT_ROOT:-$HOME/lerobot}"
-
-export PATH="$HOME/miniconda3/bin:$HOME/.local/bin:$PATH"
-export LD_LIBRARY_PATH="$HOME/miniconda3/lib:${LD_LIBRARY_PATH:-}"
-export PYTORCH_CUDA_ALLOC_CONF="${PYTORCH_CUDA_ALLOC_CONF:-expandable_segments:True}"
-
-echo "=== Node: $(hostname) ==="
-nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
-
-run() {
-    echo
-    echo "--- $* ---"
-    python examples/benchmark/bench_pi052_step.py "$@" || true
-}
-
-# A: GC ON — see if the selective-AC change (one less recompute level)
-# narrows the eager vs SDPA gap at BS=8.
-run --attn eager --batch-size 8
-run --attn sdpa  --batch-size 8
-
-# B: GC OFF — isolate the raw attention-kernel cost & memory delta.
-run --attn eager --batch-size 4 --no-gradient-checkpointing
-run --attn sdpa  --batch-size 4 --no-gradient-checkpointing
-
-# C: SDPA + GC headroom sweep — where does it OOM?
-run --attn sdpa  --batch-size 16
-run --attn sdpa  --batch-size 24
-run --attn sdpa  --batch-size 32
--- a/examples/benchmark/bench_pi052_step_v3.slurm
+++ b/examples/benchmark/bench_pi052_step_v3.slurm
@@ -1,36 +0,0 @@
-#!/bin/bash
-#SBATCH --job-name=bench-pi052-v3
-#SBATCH --partition=hopper-prod
-#SBATCH --qos=high
-#SBATCH --time=00:45:00
-#SBATCH --ntasks=1
-#SBATCH --gpus-per-task=1
-#SBATCH --output=/fsx/pepijn/logs/bench_pi052_v3_%j.out
-
-set -euo pipefail
-
-cd "${LEROBOT_ROOT:-$HOME/lerobot}"
-
-export PATH="$HOME/miniconda3/bin:$HOME/.local/bin:$PATH"
-export LD_LIBRARY_PATH="$HOME/miniconda3/lib:${LD_LIBRARY_PATH:-}"
-export PYTORCH_CUDA_ALLOC_CONF="${PYTORCH_CUDA_ALLOC_CONF:-expandable_segments:True}"
-
-echo "=== Node: $(hostname) ==="
-nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
-
-run() {
-    echo
-    echo "--- $* ---"
-    python examples/benchmark/bench_pi052_step.py "$@" || true
-}
-
-# Compile sweep: does torch.compile + SDPA give a non-trivial boost on
-# top of the bare SDPA path?
-run --attn sdpa --batch-size 8  --compile
-run --attn sdpa --batch-size 16 --compile
-
-# FlexAttention sweep (experimental): score_mod adds the additive bias
-# in-kernel; expect a long first-step compile, then SDPA-or-better steady
-# state.
-run --attn flex --batch-size 8
-run --attn flex --batch-size 16
--- a/examples/benchmark/bench_pi052_step_v4.slurm
+++ b/examples/benchmark/bench_pi052_step_v4.slurm
@@ -1,41 +0,0 @@
-#!/bin/bash
-#SBATCH --job-name=bench-pi052-v4
-#SBATCH --partition=hopper-prod
-#SBATCH --qos=high
-#SBATCH --time=01:00:00
-#SBATCH --ntasks=1
-#SBATCH --gpus-per-task=1
-#SBATCH --output=/fsx/pepijn/logs/bench_pi052_v4_%j.out
-
-set -euo pipefail
-
-cd "${LEROBOT_ROOT:-$HOME/lerobot}"
-
-export PATH="$HOME/miniconda3/bin:$HOME/.local/bin:$PATH"
-export LD_LIBRARY_PATH="$HOME/miniconda3/lib:${LD_LIBRARY_PATH:-}"
-export PYTORCH_CUDA_ALLOC_CONF="${PYTORCH_CUDA_ALLOC_CONF:-expandable_segments:True}"
-
-# /fsx triton cache is shared across nodes with different glibc versions
-# — kernels built on one node trip GLIBC_2.34-not-found on another. Use
-# a node-local cache per job to side-step that.
-export TRITON_CACHE_DIR="/tmp/triton_${SLURM_JOB_ID}"
-export TORCHINDUCTOR_CACHE_DIR="/tmp/torchinductor_${SLURM_JOB_ID}"
-mkdir -p "$TRITON_CACHE_DIR" "$TORCHINDUCTOR_CACHE_DIR"
-
-echo "=== Node: $(hostname) ==="
-nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
-ldd --version | head -1
-
-run() {
-    echo
-    echo "--- $* ---"
-    python examples/benchmark/bench_pi052_step.py "$@" || true
-}
-
-# compile path on top of SDPA + selective AC
-run --attn sdpa --batch-size 8  --compile
-run --attn sdpa --batch-size 16 --compile
-
-# FlexAttention experimental
-run --attn flex --batch-size 8
-run --attn flex --batch-size 16
--- a/examples/benchmark/bench_pi052_step_v5.slurm
+++ b/examples/benchmark/bench_pi052_step_v5.slurm
@@ -1,33 +0,0 @@
-#!/bin/bash
-#SBATCH --job-name=bench-pi052-v5
-#SBATCH --partition=hopper-prod
-#SBATCH --qos=high
-#SBATCH --time=00:45:00
-#SBATCH --ntasks=1
-#SBATCH --gpus-per-task=1
-#SBATCH --output=/fsx/pepijn/logs/bench_pi052_v5_%j.out
-
-set -euo pipefail
-
-cd "${LEROBOT_ROOT:-$HOME/lerobot}"
-
-export PATH="$HOME/miniconda3/bin:$HOME/.local/bin:$PATH"
-export LD_LIBRARY_PATH="$HOME/miniconda3/lib:${LD_LIBRARY_PATH:-}"
-export PYTORCH_CUDA_ALLOC_CONF="${PYTORCH_CUDA_ALLOC_CONF:-expandable_segments:True}"
-export TRITON_CACHE_DIR="/tmp/triton_${SLURM_JOB_ID}"
-export TORCHINDUCTOR_CACHE_DIR="/tmp/torchinductor_${SLURM_JOB_ID}"
-mkdir -p "$TRITON_CACHE_DIR" "$TORCHINDUCTOR_CACHE_DIR"
-
-echo "=== Node: $(hostname) ==="
-
-run() {
-    echo
-    echo "--- $* ---"
-    python examples/benchmark/bench_pi052_step.py "$@" || true
-}
-
-# compile_mode=default (graph-only, no autotune) is the right knob with
-# gradient checkpointing — max-autotune in v4 was 2x slower than no-compile.
-run --attn sdpa --batch-size 8  --compile --compile-mode default
-run --attn sdpa --batch-size 16 --compile --compile-mode default
-run --attn sdpa --batch-size 8  --compile --compile-mode reduce-overhead
--- a/examples/benchmark/bench_pi052_step_v6.slurm
+++ b/examples/benchmark/bench_pi052_step_v6.slurm
@@ -1,31 +0,0 @@
-#!/bin/bash
-#SBATCH --job-name=bench-pi052-v6-bs32
-#SBATCH --partition=hopper-prod
-#SBATCH --qos=high
-#SBATCH --time=00:30:00
-#SBATCH --ntasks=1
-#SBATCH --gpus-per-task=1
-#SBATCH --output=/fsx/pepijn/logs/bench_pi052_v6_%j.out
-
-set -euo pipefail
-
-cd "${LEROBOT_ROOT:-$HOME/lerobot}"
-
-export PATH="$HOME/miniconda3/bin:$HOME/.local/bin:$PATH"
-export LD_LIBRARY_PATH="$HOME/miniconda3/lib:${LD_LIBRARY_PATH:-}"
-export PYTORCH_CUDA_ALLOC_CONF="${PYTORCH_CUDA_ALLOC_CONF:-expandable_segments:True}"
-export TRITON_CACHE_DIR="/tmp/triton_${SLURM_JOB_ID}"
-export TORCHINDUCTOR_CACHE_DIR="/tmp/torchinductor_${SLURM_JOB_ID}"
-mkdir -p "$TRITON_CACHE_DIR" "$TORCHINDUCTOR_CACHE_DIR"
-
-echo "=== Node: $(hostname) ==="
-nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
-
-run() {
-    echo
-    echo "--- $* ---"
-    python examples/benchmark/bench_pi052_step.py "$@" || true
-}
-
-# BS=32 with the production settings (SDPA + compile=default).
-run --attn sdpa --batch-size 32 --compile --compile-mode default
--- a/examples/benchmark/bench_pi052_step_v7.slurm
+++ b/examples/benchmark/bench_pi052_step_v7.slurm
@@ -1,39 +0,0 @@
-#!/bin/bash
-#SBATCH --job-name=bench-pi052-v7-opt
-#SBATCH --partition=hopper-prod
-#SBATCH --qos=high
-#SBATCH --time=00:45:00
-#SBATCH --ntasks=1
-#SBATCH --gpus-per-task=1
-#SBATCH --output=/fsx/pepijn/logs/bench_pi052_v7_%j.out
-
-set -euo pipefail
-
-cd "${LEROBOT_ROOT:-$HOME/lerobot}"
-
-export PATH="$HOME/miniconda3/bin:$HOME/.local/bin:$PATH"
-export LD_LIBRARY_PATH="$HOME/miniconda3/lib:${LD_LIBRARY_PATH:-}"
-export PYTORCH_CUDA_ALLOC_CONF="${PYTORCH_CUDA_ALLOC_CONF:-expandable_segments:True}"
-export TRITON_CACHE_DIR="/tmp/triton_${SLURM_JOB_ID}"
-export TORCHINDUCTOR_CACHE_DIR="/tmp/torchinductor_${SLURM_JOB_ID}"
-mkdir -p "$TRITON_CACHE_DIR" "$TORCHINDUCTOR_CACHE_DIR"
-
-echo "=== Node: $(hostname) ==="
-nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
-
-run() {
-    echo
-    echo "--- $* ---"
-    python examples/benchmark/bench_pi052_step.py "$@" || true
-}
-
-# Realistic full-step memory: fwd + bwd + AdamW step. The original
-# sweep was fwd+bwd-only and undercounted memory by the optimizer-
-# state size (~2x param bytes for AdamW). This run confirms BS=16
-# and BS=32 still fit with the optimizer in residency.
-run --attn sdpa --batch-size 16 --compile --compile-mode default --optimizer adamw_fused
-run --attn sdpa --batch-size 32 --compile --compile-mode default --optimizer adamw_fused
-
-# Without compile, in case the production cluster has compile issues.
-run --attn sdpa --batch-size 16 --optimizer adamw_fused
-run --attn sdpa --batch-size 32 --optimizer adamw_fused
--- a/examples/benchmark/bench_pi052_step_v8.slurm
+++ b/examples/benchmark/bench_pi052_step_v8.slurm
@@ -1,36 +0,0 @@
-#!/bin/bash
-#SBATCH --job-name=bench-pi052-v8-bs40-dtype
-#SBATCH --partition=hopper-prod
-#SBATCH --qos=high
-#SBATCH --time=00:45:00
-#SBATCH --ntasks=1
-#SBATCH --gpus-per-task=1
-#SBATCH --output=/fsx/pepijn/logs/bench_pi052_v8_%j.out
-
-set -euo pipefail
-
-cd "${LEROBOT_ROOT:-$HOME/lerobot}"
-
-export PATH="$HOME/miniconda3/bin:$HOME/.local/bin:$PATH"
-export LD_LIBRARY_PATH="$HOME/miniconda3/lib:${LD_LIBRARY_PATH:-}"
-export PYTORCH_CUDA_ALLOC_CONF="${PYTORCH_CUDA_ALLOC_CONF:-expandable_segments:True}"
-export TRITON_CACHE_DIR="/tmp/triton_${SLURM_JOB_ID}"
-export TORCHINDUCTOR_CACHE_DIR="/tmp/torchinductor_${SLURM_JOB_ID}"
-mkdir -p "$TRITON_CACHE_DIR" "$TORCHINDUCTOR_CACHE_DIR"
-
-echo "=== Node: $(hostname) ==="
-nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
-
-run() {
-    echo
-    echo "--- $* ---"
-    python examples/benchmark/bench_pi052_step.py "$@" || true
-}
-
-# Confirm BS=40 fits on a single H100 with the optimizer in residency.
-run --attn sdpa --batch-size 40 --compile --compile-mode default --optimizer adamw_fused
-
-# Dtype A/B at modest batch — fp32 needs ~2x the memory of bf16, so we
-# drop to BS=4 to keep both runs comparable instead of OOMing fp32.
-run --attn sdpa --batch-size 4 --optimizer adamw_fused --dtype bfloat16
-run --attn sdpa --batch-size 4 --optimizer adamw_fused --dtype float32
--- a/examples/benchmark/fsdp_pi052.yaml
+++ b/examples/benchmark/fsdp_pi052.yaml
@@ -1,29 +0,0 @@
-compute_environment: LOCAL_MACHINE
-debug: false
-distributed_type: FSDP
-downcast_bf16: 'no'
-enable_cpu_affinity: false
-fsdp_config:
-  fsdp_activation_checkpointing: false
-  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
-  fsdp_backward_prefetch: BACKWARD_PRE
-  fsdp_cpu_ram_efficient_loading: true
-  fsdp_forward_prefetch: false
-  fsdp_offload_params: false
-  fsdp_reshard_after_forward: true
-  fsdp_state_dict_type: SHARDED_STATE_DICT
-  fsdp_sync_module_states: true
-  fsdp_transformer_layer_cls_to_wrap: GemmaDecoderLayer,SiglipEncoderLayer
-  fsdp_use_orig_params: true
-  fsdp_version: 2
-machine_rank: 0
-main_training_function: main
-mixed_precision: bf16
-num_machines: 1
-num_processes: 8
-rdzv_backend: static
-same_network: true
-tpu_env: []
-tpu_use_cluster: false
-tpu_use_sudo: false
-use_cpu: false
--- a/examples/port_datasets/slurm_build_robocasa_composite_seen.py
+++ b/examples/port_datasets/slurm_build_robocasa_composite_seen.py
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -85,11 +85,6 @@ dependencies = [
    "termcolor>=2.4.0,<4.0.0",
    "tqdm>=4.66.0,<5.0.0",

-    # Training utilities
-    # EMA of policy parameters (Diffusion Policy / pi05 style). Tiny
-    # pure-python dependency — preferred over a hand-rolled implementation.
-    "ema-pytorch>=0.7.7,<1.0.0",
-
    # Build tools (required by opencv-python-headless on some platforms)
    "cmake>=3.29.0.1,<4.2.0",
    "setuptools>=71.0.0,<81.0.0",
@@ -147,7 +142,6 @@ pygame-dep = ["pygame>=2.5.1,<2.7.0"]
 # (noble ships urdfdom 3.x). Cap below 0.9.16 until system urdfdom 4.x is broadly available.
 placo-dep = ["placo>=0.9.6,<0.9.16"]
 transformers-dep = ["transformers>=5.4.0,<5.6.0"]
-sentencepiece-dep = ["sentencepiece>=0.2.0,<0.3.0"] # FAST action tokenizer backend (pi052, pi0_fast)
 grpcio-dep = ["grpcio==1.73.1", "protobuf>=6.31.1,<6.32.0"]
 can-dep = ["python-can>=4.2.0,<5.0.0"]
 peft-dep = ["peft>=0.18.0,<1.0.0"]
@@ -203,7 +197,7 @@ wallx = [
    "torchdiffeq>=0.2.4,<0.3.0",
    "lerobot[qwen-vl-utils-dep]",
 ]
-pi = ["lerobot[transformers-dep]", "lerobot[scipy-dep]", "lerobot[sentencepiece-dep]"]
+pi = ["lerobot[transformers-dep]", "lerobot[scipy-dep]"]
 smolvla = ["lerobot[transformers-dep]", "num2words>=0.5.14,<0.6.0", "accelerate>=1.7.0,<2.0.0"]
 multi_task_dit = ["lerobot[transformers-dep]", "lerobot[diffusers-dep]"]
 groot = [
@@ -225,26 +219,6 @@ hilserl = ["lerobot[transformers-dep]", "lerobot[dataset]", "gym-hil>=0.1.13,<0.
 async = ["lerobot[grpcio-dep]", "lerobot[matplotlib-dep]"]
 peft = ["lerobot[transformers-dep]", "lerobot[peft-dep]"]

-# Annotation pipeline (lerobot-annotate). vllm is the preferred backend
-# on Linux, with a transformers fallback elsewhere; openai is the default
-# backend and talks to any OpenAI-compatible server (``vllm serve`` /
-# ``transformers serve`` / hosted endpoints). Distributed execution is
-# delegated to Hugging Face Jobs (see examples/annotations/run_hf_job.py).
-annotations = [
-    "lerobot[dataset]",
-    "lerobot[transformers-dep]",
-    "openai>=1.40,<2.0",
-    "vllm>=0.6.0,<1.0.0; sys_platform == 'linux'",
-]
-
-# Tool implementations under src/lerobot/tools/. Each tool's dependencies
-# are isolated so adding a new tool doesn't bloat the base install.
-# Currently only `say` (Kyutai pocket-tts; CPU-only, ~100M params).
-tools = [
-    "pocket-tts>=1.0.0,<3.0.0",
-    "scipy>=1.11.0,<2.0.0",  # SayTool.output_dir uses scipy.io.wavfile
-]
-
 # Development
 dev = ["pre-commit>=3.7.0,<5.0.0", "debugpy>=1.8.1,<1.9.0", "lerobot[grpcio-dep]", "grpcio-tools==1.73.1", "mypy>=1.19.1", "ruff>=0.14.1", "lerobot[notebook]"]
 notebook = ["jupyter>=1.0.0,<2.0.0", "ipykernel>=6.0.0,<7.0.0"]
@@ -335,10 +309,7 @@ lerobot-find-joint-limits="lerobot.scripts.lerobot_find_joint_limits:main"
 lerobot-imgtransform-viz="lerobot.scripts.lerobot_imgtransform_viz:main"
 lerobot-edit-dataset="lerobot.scripts.lerobot_edit_dataset:main"
 lerobot-setup-can="lerobot.scripts.lerobot_setup_can:main"
-lerobot-annotate="lerobot.scripts.lerobot_annotate:main"
 lerobot-rollout="lerobot.scripts.lerobot_rollout:main"
-# Interactive hierarchical-VLA runtime for PI052 (PaliGemma backbone).
-lerobot-pi052-runtime="lerobot.scripts.lerobot_pi052_runtime:main"

 # ---------------- Tool Configurations ----------------

@@ -356,7 +327,7 @@ torch = [{ index = "pytorch-cu128", marker = "sys_platform == 'linux'" }]
 torchvision = [{ index = "pytorch-cu128", marker = "sys_platform == 'linux'" }]

 [tool.setuptools.package-data]
-lerobot = ["envs/*.json", "annotations/steerable_pipeline/prompts/*.txt"]
+lerobot = ["envs/*.json"]

 [tool.setuptools.packages.find]
 where = ["src"]
--- a/scripts/build_robocasa_smoke.sh
+++ b/scripts/build_robocasa_smoke.sh
@@ -1,47 +0,0 @@
-#!/bin/bash
-# Build a tiny RoboCasa smoke dataset (2 short atomic tasks, all episodes) for
-# fast end-to-end training validation before the real run.
-#
-# Defaults: target/human, OpenStandMixerHead + NavigateKitchen (~1k episodes,
-# ~131k frames, ~109 min @ 20 fps), 2 SLURM workers on hopper-cpu.
-#
-# Override via env: TASKS, REPO_ID, WORK_DIR, WORKERS, CPUS, PARTITION, LOCAL=1.
-
-set -euo pipefail
-
-cd "${LEROBOT_ROOT:-$HOME/lerobot}"
-source ~/miniconda3/etc/profile.d/conda.sh
-conda activate lerobot
-
-REPO_ID="${REPO_ID:-${HF_USER:?HF_USER is unset}/robocasa_smoke_2atomic_v3}"
-WORK_DIR="${WORK_DIR:-/fsx/${USER}/robocasa/datasets/v1.0}"
-ROBOCASA_ROOT="${ROBOCASA_ROOT:-/fsx/${USER}/robocasa}"
-LOGS_DIR="${LOGS_DIR:-/fsx/${USER}/logs/robocasa}"
-TASKS="${TASKS:-OpenStandMixerHead NavigateKitchen}"
-WORKERS="${WORKERS:-2}"
-CPUS="${CPUS:-8}"
-PARTITION="${PARTITION:-hopper-cpu}"
-LOCAL="${LOCAL:-0}"
-
-ARGS=(
-    examples/port_datasets/slurm_build_robocasa_composite_seen.py
-    --repo-id="$REPO_ID"
-    --work-dir="$WORK_DIR"
-    --robocasa-root="$ROBOCASA_ROOT"
-    --split=target --source=human
-    --tasks $TASKS
-    --workers="$WORKERS"
-    --cpus-per-task="$CPUS"
-    --partition="$PARTITION"
-    --mem-per-cpu=4G
-    --time=04:00:00
-    --logs-dir="$LOGS_DIR"
-    --job-name=port_robocasa_smoke
-)
-if [[ "$LOCAL" == "1" ]]; then
-    ARGS+=(--slurm=0)
-fi
-
-echo "Smoke dataset: $REPO_ID"
-echo "Tasks: $TASKS"
-python "${ARGS[@]}"
--- a/src/lerobot/annotations/__init__.py
+++ b/src/lerobot/annotations/__init__.py
@@ -1,15 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
--- a/src/lerobot/annotations/steerable_pipeline/__init__.py
+++ b/src/lerobot/annotations/steerable_pipeline/__init__.py
@@ -1,50 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Steerable annotation pipeline producing ``language_persistent`` and
-``language_events`` columns for LeRobot datasets.
-
-The pipeline is decomposed into three independently runnable modules whose
-outputs are staged per-episode before a final parquet rewrite:
-
- :mod:`.modules.plan_subtasks_memory` (the ``plan`` module) — persistent styles
- :mod:`.modules.interjections_and_speech` (the ``interjections`` module) — event styles + speech
- :mod:`.modules.general_vqa` (the ``vqa`` module) — event-style VQA pairs
-"""
-
-from .config import AnnotationPipelineConfig
-from .validator import StagingValidator, ValidationReport
-from .vocabulary import (
-    VOCABULARY_FILENAME,
-    Vocabulary,
-    VocabularyDiscoveryModule,
-    load_vocabulary,
-    save_vocabulary,
-    vocabulary_path,
-)
-from .writer import LanguageColumnsWriter
-
-__all__ = [
-    "VOCABULARY_FILENAME",
-    "AnnotationPipelineConfig",
-    "LanguageColumnsWriter",
-    "StagingValidator",
-    "ValidationReport",
-    "Vocabulary",
-    "VocabularyDiscoveryModule",
-    "load_vocabulary",
-    "save_vocabulary",
-    "vocabulary_path",
-]
--- a/src/lerobot/annotations/steerable_pipeline/config.py
+++ b/src/lerobot/annotations/steerable_pipeline/config.py
@@ -1,251 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from __future__ import annotations
-
-from dataclasses import dataclass, field
-from pathlib import Path
-from typing import Any
-
-
-@dataclass
-class VocabularyConfig:
-    """Phase 0 — dataset-level canonical vocabulary discovery.
-
-    Watches the first ``sample_episodes`` episode videos and asks the VLM
-    to derive a small canonical vocabulary (subtask labels + memory
-    milestones) that every episode in the dataset will reuse. The VLM
-    decides the count itself from what it sees in the clips — short
-    pick-and-place demos get ~6 labels, longer multi-step recipes more.
-    The output lands at ``meta/canonical_vocabulary.json`` and feeds
-    phase 1's subtask + memory generation as both a prompt-side
-    constraint and a post-VLM validation gate.
-
-    Why this exists: free-form LLM rephrasing per episode produces near-
-    unique subtask strings, which makes the downstream low-level policy's
-    conditioning effectively noise — at inference the policy generates a
-    *new* paraphrase the action expert has never seen and produces tiny
-    cautious actions. Forcing every episode onto the same small set of
-    canonical strings gives the action expert dense supervision per
-    string and a small target distribution to learn against.
-
-    Set ``enabled=False`` to fall back to free-form generation (original
-    behaviour). ``reuse_existing=True`` keeps a hand-edited vocabulary
-    file from being clobbered on re-runs.
-    """
-
-    enabled: bool = True
-    sample_episodes: int = 3
-    max_video_frames_per_episode: int = 32
-    # When True (default), an existing meta/canonical_vocabulary.json is
-    # loaded as-is and no VLM call is made — lets operators hand-edit the
-    # file. Set False to always rediscover from the sample episodes.
-    reuse_existing: bool = True
-
-
-@dataclass
-class PlanConfig:
-    """``plan`` module: plan + subtasks + memory + task augmentation.
-
-    The ``plan`` module attaches the whole episode as one Qwen-VL video
-    block; ``max_video_frames`` only caps the frames packed in (a
-    model-capacity bound, not an annotation-logic knob).
-    """
-
-    enabled: bool = True
-
-    # Number of ``task_aug`` rephrasings emitted at ``t=0``. The renderer's
-    # ``${task}`` binding rotates among them per ``sample_idx``. ``0`` disables.
-    n_task_rephrasings: int = 10
-
-    # When to derive the task from the video instead of using
-    # ``record.episode_task``: ``off``, ``if_short`` (short / placeholder /
-    # missing canonical task), or ``always``. The derived task replaces the
-    # canonical one for every ``plan``-module prompt; ``meta/tasks.parquet``
-    # is never modified.
-    derive_task_from_video: str = "if_short"
-    derive_task_min_words: int = 3
-
-    # Frame sampling for the subtask-decomposition prompt.
-    frames_per_second: float = 1.0
-    max_video_frames: int = 128
-
-    min_subtask_seconds: float = 1.5
-    plan_max_steps: int = 8
-
-    # When True (and backend supports it, e.g. ``openai``), the ``plan``
-    # module sends a ``video_url`` block pointing at a per-episode mp4
-    # subclip and lets the server sample frames at ``use_video_url_fps``.
-    use_video_url: bool = False
-    use_video_url_fps: float = 1.0
-
-
-@dataclass
-class InterjectionsConfig:
-    """``interjections`` module: interjections + paired speech."""
-
-    enabled: bool = True
-
-    # Each interjection emits a paired ``(interjection, speech)`` event row
-    # and triggers a ``plan`` refresh at the same timestamp via the
-    # ``plan`` module.
-    max_interjections_per_episode: int = 3
-    interjection_min_t: float = 2.0
-
-    # Visual context attached to the interjection prompt: a short window
-    # of frames centered on the chosen timestamp so the VLM sees the
-    # ongoing motion rather than a single frozen frame.
-    interjection_window_seconds: float = 2.0
-    interjection_window_frames: int = 4
-
-
-@dataclass
-class VqaConfig:
-    """``vqa`` module: general VQA."""
-
-    enabled: bool = True
-    vqa_emission_hz: float = 1.0
-    K: int = 1
-    """How many *consecutive* frames each emission tick anchors a VQA pair
-    to. The VLM grounds its answer (bbox / keypoint coordinates, count, …)
-    against the *first* anchored frame's image, so anchoring K>1 frames
-    copies that same answer onto later frames where the scene has already
-    moved — stale labels. Default ``1``: a VQA pair lands on exactly its
-    emission frame, no temporal smear. Raise it only to trade label
-    precision for more (noisier) VQA frames."""
-    question_types: tuple[str, ...] = ("bbox", "keypoint", "count", "attribute", "spatial")
-
-
-@dataclass
-class VlmConfig:
-    """Shared Qwen-VL client configuration."""
-
-    # One of ``vllm``, ``transformers``, ``openai``, or ``stub`` (tests).
-    # ``openai`` talks to a local OpenAI-compatible server; the CLI
-    # auto-spawns one when ``auto_serve=True``.
-    backend: str = "openai"
-    model_id: str = "Qwen/Qwen3.6-35B-A3B-FP8"
-
-    # OpenAI-compatible server endpoint; ``EMPTY`` works for local servers.
-    api_base: str = "http://localhost:8000/v1"
-    api_key: str = "EMPTY"
-
-    # When True with ``backend=openai``, the CLI probes ``api_base`` and
-    # spawns a server if none answers (default: ``transformers serve``).
-    # Set to False to fail fast when pointing at a remote endpoint.
-    auto_serve: bool = True
-    serve_port: int = 8000
-    # Override the auto-serve command. ``{port}`` is substituted per replica
-    # when ``parallel_servers > 1``.
-    serve_command: str | None = None
-
-    # Run multiple independent inference servers for round-robin client
-    # routing (each pinned to a GPU via ``CUDA_VISIBLE_DEVICES`` and bound
-    # to ``serve_port + i``). ``num_gpus=0`` means one GPU per replica.
-    parallel_servers: int = 1
-    num_gpus: int = 0
-    client_concurrency: int = 16
-    serve_ready_timeout_s: float = 600.0
-
-    max_new_tokens: int = 512
-    temperature: float = 0.2
-    json_mode: bool = True
-    batch_size: int = 4
-    tensor_parallel_size: int = 1
-
-    # Fraction of GPU memory vllm allocates for weights + KV cache.
-    gpu_memory_utilization: float = 0.9
-    # Cap context length (None = model default). On 80 GB H100 a 30B BF16
-    # model often needs <= 8192 to leave KV-cache headroom.
-    max_model_len: int | None = None
-    trust_remote_code: bool = False
-
-    # Override the camera stream used for keyframe attachment. None picks
-    # the first ``observation.images.*`` key the dataset declares.
-    camera_key: str | None = None
-    # Forwarded as ``extra_body.chat_template_kwargs`` on every chat call;
-    # use to pass model-specific flags such as ``{"enable_thinking": false}``.
-    chat_template_kwargs: dict[str, Any] | None = None
-
-
-@dataclass
-class ExecutorConfig:
-    """Executor settings.
-
-    Distributed execution is provided by Hugging Face Jobs (see
-    ``examples/annotation/run_hf_job.py``); this config only controls
-    intra-process episode concurrency.
-    """
-
-    # Episodes processed concurrently within each module phase. Each
-    # in-flight episode dispatches 3-5 dependent VLM calls, so this is the
-    # main knob for saturating ``parallel_servers`` and ``client_concurrency``.
-    episode_parallelism: int = 16
-
-
-@dataclass
-class AnnotationPipelineConfig:
-    """Top-level config for ``lerobot-annotate``.
-
-    The writer rewrites ``data/chunk-*/file-*.parquet`` in place. Multiple
-    revisions of the same dataset live in separate copies.
-    """
-
-    # Hub dataset id. Used as the download source when ``root`` is unset,
-    # and as the destination repo when ``push_to_hub`` is enabled and
-    # ``dest_repo_id`` is unset.
-    repo_id: str | None = None
-
-    # Optional separate Hub dataset id to push the annotated result to. When
-    # unset, ``push_to_hub`` uploads back to ``repo_id`` (annotate in place);
-    # when set, the source ``repo_id`` is left untouched.
-    dest_repo_id: str | None = None
-
-    root: Path | None = None
-
-    # Defaults to ``<root>/.annotate_staging/`` when unset.
-    staging_dir: Path | None = None
-
-    seed: int = 1729
-
-    vocabulary: VocabularyConfig = field(default_factory=VocabularyConfig)
-    plan: PlanConfig = field(default_factory=PlanConfig)
-    interjections: InterjectionsConfig = field(default_factory=InterjectionsConfig)
-    vqa: VqaConfig = field(default_factory=VqaConfig)
-
-    vlm: VlmConfig = field(default_factory=VlmConfig)
-    executor: ExecutorConfig = field(default_factory=ExecutorConfig)
-
-    skip_validation: bool = False
-    only_episodes: tuple[int, ...] | None = None
-
-    # Keyframe decode backend. When unset, the pipeline decodes with the
-    # ffmpeg CLI: it decodes AV1 and runs each decode as an isolated child
-    # process, which is both crash-safe and safe under the concurrent
-    # decode the executor performs (torchcodec is not thread-safe and
-    # SIGSEGVs there). Set to ``"torchcodec"`` or ``"pyav"`` to pin an
-    # in-process decoder when its build is known thread-safe.
-    video_backend: str | None = None
-
-    # When True, upload the annotated dataset to the Hugging Face Hub:
-    # to ``dest_repo_id`` if set, otherwise back to ``repo_id``. One of
-    # the two must be set for this to take effect.
-    push_to_hub: bool = False
-    push_private: bool = False
-    push_commit_message: str | None = None
-
-    def resolved_staging_dir(self, root: Path) -> Path:
-        return self.staging_dir if self.staging_dir is not None else root / ".annotate_staging"
--- a/src/lerobot/annotations/steerable_pipeline/executor.py
+++ b/src/lerobot/annotations/steerable_pipeline/executor.py
@@ -1,325 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""In-process executor that runs the annotation phases.
-
-The executor runs **seven phases** in dependency order:
-
-    phase 0: vocabulary discovery — derive a small canonical vocabulary
-             from the first few sample-episode videos (subtask labels +
-             memory milestones) and persist it next to the dataset; the
-             ``plan`` module then constrains every per-episode generation
-             to those strings, so the downstream policy sees a small,
-             repeatable conditioning distribution
-    phase 1: ``plan`` module (plan + subtasks + memory)
-    phase 2: ``interjections`` module (interjections + speech)
-    phase 3: ``plan`` plan-update pass — re-runs plan emission at every
-             interjection timestamp produced by phase 2
-    phase 4: ``vqa`` module (VQA)
-    phase 5: validator
-    phase 6: writer
-
-Phase 3 is why the ``plan`` module must be re-entered after the
-``interjections`` module — to refresh ``plan`` rows at interjection
-timestamps.
-
-Distributed execution is provided by Hugging Face Jobs (see
-``examples/annotations/run_hf_job.py``); the runner inside the job
-invokes ``lerobot-annotate`` which uses this in-process executor.
-Episode-level concurrency is controlled by
-``ExecutorConfig.episode_parallelism``.
-"""
-
-from __future__ import annotations
-
-import logging
-import time
-from concurrent.futures import ThreadPoolExecutor, as_completed
-from dataclasses import dataclass
-from pathlib import Path
-from typing import Any
-
-from .config import AnnotationPipelineConfig
-from .reader import EpisodeRecord, iter_episodes
-from .staging import EpisodeStaging
-from .validator import StagingValidator
-from .writer import LanguageColumnsWriter
-
-logger = logging.getLogger(__name__)
-
-
-@dataclass
-class PhaseResult:
-    """Summary of one pipeline phase across all episodes."""
-
-    name: str
-    episodes_processed: int
-    episodes_skipped: int
-
-
-@dataclass
-class PipelineRunSummary:
-    """Aggregated result returned by :meth:`Executor.run`."""
-
-    phases: list[PhaseResult]
-    written_paths: list[Path]
-    validation_report: Any  # ValidationReport, kept Any to avoid import cycle
-
-
-@dataclass
-class Executor:
-    """Run all seven phases over a dataset root in-process.
-
-    Episode-level concurrency comes from ``ExecutorConfig.episode_parallelism``
-    (a thread pool); cluster-level concurrency comes from running this
-    executor inside a Hugging Face Job. Tests construct the executor
-    directly with stub modules.
-    """
-
-    config: AnnotationPipelineConfig
-    plan: Any  # PlanSubtasksMemoryModule
-    interjections: Any  # InterjectionsAndSpeechModule
-    vqa: Any  # GeneralVqaModule
-    writer: LanguageColumnsWriter
-    validator: StagingValidator
-    vocabulary: Any = None  # VocabularyDiscoveryModule | None
-
-    def run(self, root: Path) -> PipelineRunSummary:
-        records = list(iter_episodes(root, only_episodes=self.config.only_episodes))
-        n = len(records)
-        if n == 0:
-            raise ValueError(f"No episodes found under {root}/data/")
-
-        print(f"[annotate] {n} episodes total", flush=True)
-
-        staging_dir = self.config.resolved_staging_dir(root)
-        staging_dir.mkdir(parents=True, exist_ok=True)
-
-        phases: list[PhaseResult] = []
-
-        # Phase 0: vocabulary discovery. Mutates ``self.plan.vocabulary``
-        # so subsequent per-episode plan calls see the canonical labels.
-        phases.append(self._run_vocabulary_phase(records, root))
-
-        # Phase 1: ``plan`` module (plan + subtasks + memory)
-        phases.append(self._run_module_phase("plan", records, staging_dir, self.plan))
-        # Phase 2: ``interjections`` module (interjections + speech). It
-        # reads the ``plan`` module's subtask rows from the same staging
-        # tree to ground the interjection prompt in the correct local subtask.
-        phases.append(self._run_module_phase("interjections", records, staging_dir, self.interjections))
-        # Phase 3: ``plan`` plan-update pass at interjection timestamps.
-        phases.append(self._run_plan_update_phase(records, staging_dir))
-        # Phase 4: ``vqa`` module (VQA)
-        phases.append(self._run_module_phase("vqa", records, staging_dir, self.vqa))
-
-        print("[annotate] running validator...", flush=True)
-        report = self.validator.validate(records, staging_dir)
-        if not report.ok and not self.config.skip_validation:
-            raise RuntimeError(f"Staging validation failed: {report.summary()}")
-        print(f"[annotate] validator: {report.summary()}", flush=True)
-
-        print(f"[annotate] writing parquet shards into {root}/data/...", flush=True)
-        written = self.writer.write_all(records, staging_dir, root)
-        print(f"[annotate] wrote {len(written)} shard(s); pipeline complete", flush=True)
-
-        # Keep meta/info.json aligned with the parquet schema we just wrote
-        # (language columns advertised; canonical ``say`` tool registered for
-        # PI052 / Pi0.5 / dataset-visualizer consumers via
-        # ``LeRobotDatasetMetadata.tools``). Idempotent and additive: existing
-        # user metadata is preserved.
-        self._ensure_annotation_metadata_in_info(root)
-
-        return PipelineRunSummary(phases=phases, written_paths=written, validation_report=report)
-
-    @staticmethod
-    def _ensure_annotation_metadata_in_info(root: Path) -> None:
-        """Write language features and canonical tools to ``meta/info.json``.
-
-        ``LanguageColumnsWriter`` adds ``language_persistent`` and
-        ``language_events`` to parquet shards. The metadata must advertise
-        those columns too, otherwise non-streaming ``LeRobotDataset`` loads
-        cast against the old schema and fail on the extra parquet columns.
-        """
-        from lerobot.datasets.io_utils import load_info, write_info  # noqa: PLC0415
-        from lerobot.datasets.language import SAY_TOOL_SCHEMA, language_feature_info  # noqa: PLC0415
-
-        info_path = root / "meta" / "info.json"
-        if not info_path.exists():
-            return
-        try:
-            info = load_info(root)
-        except Exception as exc:  # noqa: BLE001
-            print(f"[annotate] could not read {info_path}: {exc}", flush=True)
-            return
-
-        changed = False
-
-        merged_features = {**info.features, **language_feature_info()}
-        if merged_features != info.features:
-            info.features = merged_features
-            changed = True
-
-        existing = info.tools or []
-        names = {(t.get("function") or {}).get("name") for t in existing if isinstance(t, dict)}
-        if SAY_TOOL_SCHEMA["function"]["name"] not in names:
-            info.tools = [*existing, SAY_TOOL_SCHEMA]
-            changed = True
-
-        if changed:
-            write_info(info, root)
-            print(
-                "[annotate] meta/info.json: "
-                f"language_features={list(language_feature_info())}, "
-                f"tools={[t['function']['name'] for t in (info.tools or [])]}",
-                flush=True,
-            )
-
-    def _run_vocabulary_phase(
-        self, records: list[EpisodeRecord], root: Path
-    ) -> PhaseResult:
-        """Discover (or load) the canonical vocabulary, wire it into ``self.plan``.
-
-        Returns a ``PhaseResult`` whose ``episodes_processed`` is the number
-        of sample episodes consulted (0 when disabled or no VLM call was
-        needed); ``episodes_skipped`` is always ``0`` because vocabulary is
-        a once-per-dataset artifact, not a per-episode product.
-        """
-        from .vocabulary import load_vocabulary, save_vocabulary  # noqa: PLC0415
-
-        if self.vocabulary is None or not getattr(self.vocabulary, "enabled", False):
-            print(
-                "[annotate] phase=vocabulary skipped (module disabled or unset)",
-                flush=True,
-            )
-            return PhaseResult(name="vocabulary", episodes_processed=0, episodes_skipped=0)
-
-        existing = load_vocabulary(root)
-        if existing is not None and self.config.vocabulary.reuse_existing:
-            print(
-                f"[annotate] phase=vocabulary reusing {root / 'meta' / 'canonical_vocabulary.json'} "
-                f"({len(existing.subtasks)} subtask labels, "
-                f"{len(existing.memory_milestones)} memory milestones)",
-                flush=True,
-            )
-            self.plan.vocabulary = existing
-            return PhaseResult(name="vocabulary", episodes_processed=0, episodes_skipped=0)
-
-        sample_n = max(1, min(int(self.config.vocabulary.sample_episodes), len(records)))
-        print(
-            f"[annotate] phase=vocabulary discovering from {sample_n} sample episode(s)...",
-            flush=True,
-        )
-        t0 = time.time()
-        vocab = self.vocabulary.discover(records[:sample_n], existing=existing)
-        if vocab is None:
-            print(
-                "[annotate] phase=vocabulary returned no vocabulary — "
-                "plan module will fall back to free-form generation",
-                flush=True,
-            )
-            return PhaseResult(name="vocabulary", episodes_processed=0, episodes_skipped=0)
-
-        save_path = save_vocabulary(root, vocab)
-        print(
-            f"[annotate] phase=vocabulary wrote {save_path} "
-            f"({len(vocab.subtasks)} subtask labels, "
-            f"{len(vocab.memory_milestones)} memory milestones) in "
-            f"{time.time() - t0:.1f}s",
-            flush=True,
-        )
-        self.plan.vocabulary = vocab
-        return PhaseResult(name="vocabulary", episodes_processed=sample_n, episodes_skipped=0)
-
-    def _run_module_phase(
-        self,
-        name: str,
-        records: list[EpisodeRecord],
-        staging_dir: Path,
-        module: Any,
-    ) -> PhaseResult:
-        if not module.enabled:
-            print(f"[annotate] phase={name} skipped (module disabled)", flush=True)
-            return PhaseResult(name=name, episodes_processed=0, episodes_skipped=len(records))
-        n = len(records)
-        parallelism = max(1, min(self.config.executor.episode_parallelism, n))
-        print(
-            f"[annotate] phase={name} starting on {n} episode(s) (parallelism={parallelism})",
-            flush=True,
-        )
-        t0 = time.time()
-
-        def _do(idx_record: tuple[int, EpisodeRecord]) -> tuple[int, int, float]:
-            i, record = idx_record
-            ep_start = time.time()
-            staging = EpisodeStaging(staging_dir, record.episode_index)
-            module.run_episode(record, staging)
-            return i, record.episode_index, time.time() - ep_start
-
-        processed = 0
-        if parallelism == 1:
-            for i, record in enumerate(records, 1):
-                _, ep_idx, elapsed = _do((i, record))
-                processed += 1
-                print(
-                    f"[annotate]   {name} episode {i}/{n} (idx={ep_idx}) done in {elapsed:.1f}s",
-                    flush=True,
-                )
-        else:
-            with ThreadPoolExecutor(max_workers=parallelism) as pool:
-                futures = [pool.submit(_do, (i, r)) for i, r in enumerate(records, 1)]
-                for fut in as_completed(futures):
-                    i, ep_idx, elapsed = fut.result()
-                    processed += 1
-                    print(
-                        f"[annotate]   {name} episode {processed}/{n} "
-                        f"(idx={ep_idx}, submit_order={i}) done in {elapsed:.1f}s",
-                        flush=True,
-                    )
-        total = time.time() - t0
-        print(f"[annotate] phase={name} complete: {processed}/{n} in {total:.1f}s", flush=True)
-        return PhaseResult(name=name, episodes_processed=processed, episodes_skipped=0)
-
-    def _run_plan_update_phase(  # noqa: PLR0915
-        self, records: list[EpisodeRecord], staging_dir: Path
-    ) -> PhaseResult:
-        """Re-emit ``plan`` rows at each timestamp the ``interjections`` module produced.
-
-        The ``plan`` module owns the prompt; the ``interjections`` module
-        produced the timestamps. This phase therefore calls back into the
-        ``plan`` module with the interjection timestamps so its existing
-        prompt path is reused.
-        """
-        if not self.plan.enabled or not self.interjections.enabled:
-            return PhaseResult(
-                name="plan_update", episodes_processed=0, episodes_skipped=len(records)
-            )
-        processed = 0
-        for record in records:
-            staging = EpisodeStaging(staging_dir, record.episode_index)
-            interjection_rows = [
-                row for row in staging.read("interjections") if row.get("style") == "interjection"
-            ]
-            interjection_times = [float(row["timestamp"]) for row in interjection_rows]
-            interjection_texts = [str(row.get("content") or "") for row in interjection_rows]
-            if interjection_times:
-                self.plan.run_plan_updates(record, staging, interjection_times, interjection_texts)
-                processed += 1
-        # Episodes without any interjections are skipped (no plan refresh
-        # needed); count them so the summary's processed+skipped == total.
-        return PhaseResult(
-            name="plan_update",
-            episodes_processed=processed,
-            episodes_skipped=len(records) - processed,
-        )
--- a/src/lerobot/annotations/steerable_pipeline/frames.py
+++ b/src/lerobot/annotations/steerable_pipeline/frames.py
@@ -1,483 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Keyframe extraction for the annotation pipeline.
-
-Modules attach decoded camera frames to their VLM prompts so the model can
-ground subtask decomposition, interjection scenarios, and VQA in actual
-visual content. The pipeline shares one provider across modules and one
-episode at a time, with a small per-episode cache so multiple modules
-querying the same timestamp pay decode cost once.
-"""
-
-from __future__ import annotations
-
-import logging
-import threading
-from dataclasses import dataclass, field
-from pathlib import Path
-from typing import Any, Protocol
-
-import PIL.Image
-import torch
-
-from lerobot.datasets.video_utils import decode_video_frames
-
-from .reader import EpisodeRecord
-
-logger = logging.getLogger(__name__)
-
-
-class FrameProvider(Protocol):
-    """Decodes camera frames at episode-relative timestamps."""
-
-    @property
-    def camera_keys(self) -> list[str]:
-        """All ``observation.images.*`` feature keys this provider can decode."""
-
-    def frames_at(
-        self,
-        record: EpisodeRecord,
-        timestamps: list[float],
-        camera_key: str | None = None,
-    ) -> list[Any]:
-        """Return one decoded frame per timestamp from ``camera_key`` (or default).
-
-        Frames are ``torch.Tensor`` (``C, H, W`` uint8) — the shape
-        :func:`lerobot.datasets.video_utils.decode_video_frames` returns.
-        :func:`to_image_blocks` converts them to PIL only at the VLM-message
-        boundary.
-
-        Empty list if the camera is unavailable. ``camera_key=None`` falls back
-        to the provider's default camera so existing single-camera callers
-        (the ``plan`` and ``interjections`` modules) keep working unchanged.
-        """
-
-    def video_for_episode(
-        self,
-        record: EpisodeRecord,
-        max_frames: int,
-        camera_key: str | None = None,
-    ) -> list[Any]:
-        """Return up to ``max_frames`` decoded frames covering the whole episode.
-
-        Sampling is uniform across the episode duration. Frames are
-        ``torch.Tensor`` (``C, H, W`` uint8); :func:`to_video_block` wraps
-        them into one ``{"type":"video", "video":<list>}`` block for a
-        Qwen-VL-compatible model that pools temporally itself. Empty list if
-        no camera available.
-        """
-
-
-@dataclass
-class _NullProvider:
-    """No-op provider used when the dataset has no video keys or in tests."""
-
-    @property
-    def camera_keys(self) -> list[str]:
-        return []
-
-    def frames_at(
-        self,
-        record: EpisodeRecord,
-        timestamps: list[float],
-        camera_key: str | None = None,
-    ) -> list[Any]:
-        return []
-
-    def video_for_episode(
-        self,
-        record: EpisodeRecord,
-        max_frames: int,
-        camera_key: str | None = None,
-    ) -> list[Any]:
-        return []
-
-
-def null_provider() -> FrameProvider:
-    return _NullProvider()
-
-
-@dataclass
-class VideoFrameProvider:
-    """Decodes frames from the dataset's ``observation.images.*`` streams.
-
-    By default the *first* camera key is used for the ``plan`` module
-    (subtask decomposition) and the ``interjections`` module (interjection
-    scenarios) — those prompts care about *what is happening*, not which
-    angle. The ``vqa`` module instead iterates over every camera in
-    :attr:`camera_keys` so each frame's
-    grounded answer (bbox/keypoint/...) is tagged with the camera it was
-    grounded against.
-
-    ``camera_key`` overrides the default-camera choice but does not restrict
-    :attr:`camera_keys`. Pass ``camera_key`` explicitly to ``frames_at`` /
-    ``video_for_episode`` to read a non-default stream.
-
-    Caches up to ``cache_size`` decoded frames per process to keep
-    co-timestamped ``interjections`` + ``plan`` plan-update calls cheap.
-    """
-
-    root: Path
-    camera_key: str | None = None
-    tolerance_s: float = 1e-2
-    cache_size: int = 256
-    # Keyframe decode backend. ``None`` uses the ffmpeg CLI — the
-    # concurrency- and crash-safe default for the pipeline's threaded
-    # decode. Set to ``"torchcodec"`` or ``"pyav"`` to pin an in-process
-    # decoder when the build is known thread-safe.
-    video_backend: str | None = None
-    _meta: Any = field(default=None, init=False, repr=False)
-    _cache: dict = field(default_factory=dict, init=False, repr=False)
-    _camera_keys: list[str] = field(default_factory=list, init=False, repr=False)
-    # The pipeline runs the three module phases under a ThreadPoolExecutor (see
-    # ``ExecutorConfig.episode_parallelism``); guard the dict cache and the
-    # one-shot warn flag against concurrent updates from worker threads.
-    _lock: threading.Lock = field(default_factory=threading.Lock, init=False, repr=False)
-
-    def __post_init__(self) -> None:
-        from lerobot.datasets.dataset_metadata import LeRobotDatasetMetadata  # noqa: PLC0415
-
-        self._meta = LeRobotDatasetMetadata(repo_id="local", root=self.root)
-        # ``camera_keys`` covers both image- and video-stored cameras and is
-        # always defined on the metadata (``[]`` in the worst case), so it is
-        # the single source we need here.
-        keys = list(self._meta.camera_keys)
-        # Last-resort fallback: if metadata didn't surface anything but the
-        # caller explicitly named a camera (``--vlm.camera_key=...``), trust
-        # them — the key is by definition known to exist on the dataset.
-        if not keys and self.camera_key:
-            keys = [self.camera_key]
-        self._camera_keys = keys
-        if self.camera_key is None:
-            self.camera_key = keys[0] if keys else None
-
-    @property
-    def camera_keys(self) -> list[str]:
-        """All ``observation.images.*`` keys available on this dataset."""
-        return list(self._camera_keys)
-
-    def frames_at(
-        self,
-        record: EpisodeRecord,
-        timestamps: list[float],
-        camera_key: str | None = None,
-    ) -> list[Any]:
-        target = camera_key if camera_key is not None else self.camera_key
-        if not timestamps or target is None:
-            return []
-
-        out: list[Any] = []
-        misses: list[float] = []
-        miss_indices: list[int] = []
-        with self._lock:
-            for i, ts in enumerate(timestamps):
-                key = (record.episode_index, target, round(float(ts), 6))
-                cached = self._cache.get(key)
-                if cached is not None:
-                    out.append(cached)
-                else:
-                    out.append(None)
-                    misses.append(float(ts))
-                    miss_indices.append(i)
-
-        if misses:
-            decoded = self._decode(record.episode_index, misses, target)
-            # ``_decode`` returns exactly one frame per requested timestamp,
-            # or an empty list if decoding failed wholesale. A partial list
-            # would mean a frame/timestamp misalignment, so only pair them up
-            # when the counts match (``strict=True`` then guards regressions).
-            if len(decoded) == len(miss_indices):
-                with self._lock:
-                    for i, frame in zip(miss_indices, decoded, strict=True):
-                        out[i] = frame
-                        key = (record.episode_index, target, round(float(timestamps[i]), 6))
-                        if len(self._cache) >= self.cache_size:
-                            self._cache.pop(next(iter(self._cache)))
-                        self._cache[key] = frame
-        # filter out any None left over from decode failures
-        return [frame for frame in out if frame is not None]
-
-    def video_for_episode(
-        self,
-        record: EpisodeRecord,
-        max_frames: int,
-        camera_key: str | None = None,
-    ) -> list[Any]:
-        """Return up to ``max_frames`` frames uniformly sampled across the episode.
-
-        The whole episode duration is covered; the model picks subtask
-        boundaries from the temporal pooling it does internally. Frames are
-        ``torch.Tensor`` (see :meth:`frames_at`).
-        """
-        target = camera_key if camera_key is not None else self.camera_key
-        if max_frames <= 0 or target is None or not record.frame_timestamps:
-            return []
-        n_frames = min(max_frames, len(record.frame_timestamps))
-        if n_frames == len(record.frame_timestamps):
-            timestamps = list(record.frame_timestamps)
-        else:
-            t0 = record.frame_timestamps[0]
-            t_last = record.frame_timestamps[-1]
-            if t_last <= t0:
-                timestamps = [float(t0)] * n_frames
-            else:
-                step = (t_last - t0) / (n_frames - 1) if n_frames > 1 else 0.0
-                timestamps = [float(t0 + i * step) for i in range(n_frames)]
-        return self.frames_at(record, timestamps, camera_key=target)
-
-    def episode_clip_path(self, record: EpisodeRecord, cache_dir: Path) -> Path | None:
-        """Extract the episode's subclip to ``cache_dir/ep_{idx:06d}.mp4``.
-
-        Returns ``None`` if the dataset has no video tracks. Skips
-        re-extract when the cached clip already exists. Re-encodes to
-        H.264 (libx264) so the resulting mp4 is decodable by every
-        downstream video processor — stream-copy would inherit the
-        source codec (often AV1 in modern LeRobot datasets), which
-        vllm's libav build cannot decode.
-        """
-        import subprocess  # noqa: PLC0415
-
-        if self.camera_key is None:
-            return None
-        cache_dir.mkdir(parents=True, exist_ok=True)
-        out_path = cache_dir / f"ep_{record.episode_index:06d}.mp4"
-        if out_path.exists() and out_path.stat().st_size > 0:
-            return out_path
-        ep = self._meta.episodes[record.episode_index]
-        from_timestamp = float(ep[f"videos/{self.camera_key}/from_timestamp"])
-        to_timestamp = float(ep[f"videos/{self.camera_key}/to_timestamp"])
-        src = self.root / self._meta.get_video_file_path(record.episode_index, self.camera_key)
-        cmd = [
-            "ffmpeg",
-            "-y",
-            "-loglevel",
-            "error",
-            "-ss",
-            f"{from_timestamp:.3f}",
-            "-to",
-            f"{to_timestamp:.3f}",
-            "-i",
-            str(src),
-            "-c:v",
-            "libx264",
-            "-preset",
-            "ultrafast",
-            "-crf",
-            "23",
-            "-pix_fmt",
-            "yuv420p",
-            "-an",
-            str(out_path),
-        ]
-        try:
-            subprocess.run(cmd, check=True, timeout=300)
-        except (subprocess.CalledProcessError, subprocess.TimeoutExpired, FileNotFoundError):
-            return None
-        return out_path if out_path.exists() and out_path.stat().st_size > 0 else None
-
-    def _decode(self, episode_index: int, timestamps: list[float], camera_key: str) -> list[Any]:
-        """Decode ``timestamps`` from the episode's video as ``(C, H, W)`` tensors.
-
-        Uses the ffmpeg CLI unless ``video_backend`` pins ``pyav`` or a backend
-        of :func:`lerobot.datasets.video_utils.decode_video_frames`.
-        Returns one frame per requested timestamp, or ``[]`` if decoding
-        failed wholesale — callers treat ``[]`` as "no frames available".
-        """
-        ep = self._meta.episodes[episode_index]
-        from_timestamp = ep[f"videos/{camera_key}/from_timestamp"]
-        shifted = [from_timestamp + ts for ts in timestamps]
-        video_path = self.root / self._meta.get_video_file_path(episode_index, camera_key)
-
-        # Default to the ffmpeg CLI. The pipeline decodes under a 16-wide
-        # ThreadPoolExecutor and the in-process decoders are unsafe there:
-        # torchcodec is not thread-safe and SIGSEGVs under concurrent decode
-        # (a crash no try/except can catch), PyAV can likewise segfault on
-        # AV1, and lerobot's ``pyav`` backend routes through the removed
-        # ``torchvision.io.VideoReader``. ``_decode_frames_ffmpeg`` shells
-        # out per frame: each decode is an isolated child process, so it is
-        # both crash-safe and concurrency-safe. ``video_backend`` can pin
-        # ``torchcodec`` / ``pyav`` explicitly for callers that know their
-        # build is safe.
-        chain = [self.video_backend] if self.video_backend else ["ffmpeg"]
-
-        exc: Exception | None = None
-        for backend in chain:
-            try:
-                if backend == "ffmpeg":
-                    return _decode_frames_ffmpeg(video_path, shifted)
-                if backend in ("pyav", "av"):
-                    return _decode_frames_av(video_path, shifted)
-                # Stacked ``(N, C, H, W)`` uint8 tensor; one row per timestamp.
-                decoded = decode_video_frames(
-                    video_path, shifted, self.tolerance_s, backend=backend, return_uint8=True
-                )
-                return list(decoded)
-            except Exception as e:  # noqa: PERF203
-                exc = e
-
-        # Every backend raised. Log loudly the first time so a silent
-        # vqa-module no-op (every prompt skipped because frames_at returned
-        # []) is debuggable from the job log instead of post-hoc parquet
-        # inspection. Subsequent failures stay quiet.
-        with self._lock:
-            already_warned = getattr(self, "_warned_decode_fail", False)
-            if not already_warned:
-                self._warned_decode_fail = True
-        if not already_warned:
-            logger.warning(
-                "VideoFrameProvider._decode failed for episode=%s camera=%s "
-                "video_path=%s backends=%s: %s",
-                episode_index,
-                camera_key,
-                video_path,
-                chain,
-                exc,
-                exc_info=exc,
-            )
-        return []
-
-
-def make_frame_provider(
-    root: Path, camera_key: str | None = None, video_backend: str | None = None
-) -> FrameProvider:
-    """Build a :class:`VideoFrameProvider` if videos are present, else null."""
-    try:
-        provider = VideoFrameProvider(root=root, camera_key=camera_key, video_backend=video_backend)
-    except Exception:
-        return null_provider()
-    if provider.camera_key is None:
-        return null_provider()
-    return provider
-
-
-def _decode_frames_ffmpeg(video_path: Path, timestamps: list[float]) -> list[Any]:
-    """Decode the frames nearest to ``timestamps`` via the ffmpeg CLI.
-
-    Runs one ``ffmpeg`` process per timestamp, seeking with ``-ss`` and
-    piping a single PNG to stdout. Unlike the in-process decoders this
-    survives a hostile container: a full ffmpeg build decodes AV1 (the codec
-    modern LeRobot datasets use) where torchcodec raises and PyAV can
-    SIGSEGV, and a crash stays isolated to the child process — a non-zero
-    exit is a catchable error, not a segfault of the whole job. Returns one
-    ``(C, H, W)`` uint8 tensor per timestamp.
-    """
-    import io  # noqa: PLC0415
-    import subprocess  # noqa: PLC0415
-
-    import numpy as np  # noqa: PLC0415
-
-    frames: list[Any] = []
-    for ts in timestamps:
-        proc = subprocess.run(
-            [
-                "ffmpeg", "-nostdin", "-loglevel", "error",
-                "-ss", f"{max(ts, 0.0):.3f}",
-                "-i", str(video_path),
-                "-frames:v", "1",
-                "-f", "image2pipe", "-vcodec", "png", "pipe:1",
-            ],
-            capture_output=True,
-            check=True,
-            timeout=120,
-        )
-        if not proc.stdout:
-            raise RuntimeError(f"ffmpeg returned no frame for t={ts:.3f}s of {video_path}")
-        img = PIL.Image.open(io.BytesIO(proc.stdout)).convert("RGB")
-        frames.append(torch.from_numpy(np.asarray(img).copy()).permute(2, 0, 1).contiguous())
-    return frames
-
-
-def _decode_frames_av(video_path: Path, timestamps: list[float]) -> list[Any]:
-    """Decode the frames nearest to ``timestamps`` using PyAV directly.
-
-    lerobot's ``decode_video_frames(backend="pyav")`` routes through
-    ``torchvision.io.VideoReader``, removed in torchvision 0.23+. This helper
-    talks to the ``av`` package directly. Note PyAV can SIGSEGV on AV1
-    streams in some builds — prefer ``_decode_frames_ffmpeg`` as the default
-    fallback; this stays available behind ``video_backend="pyav"``. Returns
-    one ``(C, H, W)`` uint8 tensor per timestamp.
-    """
-    import av  # noqa: PLC0415
-
-    first_ts = min(timestamps)
-    last_ts = max(timestamps)
-    loaded_frames: list[torch.Tensor] = []
-    loaded_ts: list[float] = []
-    with av.open(str(video_path)) as container:
-        stream = container.streams.video[0]
-        # Seek to the keyframe at or before the first requested timestamp.
-        offset = max(int(first_ts / stream.time_base), 0) if stream.time_base else 0
-        container.seek(offset, stream=stream, backward=True, any_frame=False)
-        for idx, frame in enumerate(container.decode(stream)):
-            ts = frame.time
-            if ts is None:
-                ts = float(frame.pts * stream.time_base) if frame.pts is not None else float(idx)
-            loaded_ts.append(ts)
-            loaded_frames.append(
-                torch.from_numpy(frame.to_ndarray(format="rgb24")).permute(2, 0, 1).contiguous()
-            )
-            if ts >= last_ts:
-                break
-    if not loaded_frames:
-        raise RuntimeError(f"PyAV decoded no frames from {video_path}")
-    ts_tensor = torch.tensor(loaded_ts)
-    return [loaded_frames[int(torch.argmin((ts_tensor - q).abs()))] for q in timestamps]
-
-
-def _frame_to_pil(frame: Any) -> Any:
-    """Materialise a decoded frame as a ``PIL.Image`` for the VLM message.
-
-    Frames flow through the provider as ``torch.Tensor`` (``C, H, W`` uint8,
-    straight from :func:`decode_video_frames`); PIL is only created here, at
-    the VLM-message boundary, because the chat backends expect PIL images /
-    data URLs. Non-tensor inputs (e.g. test stubs) pass through untouched.
-    """
-    if not isinstance(frame, torch.Tensor):
-        return frame
-    array = frame.detach().cpu()
-    if array.ndim == 3 and array.shape[0] in (1, 3):
-        array = array.permute(1, 2, 0)  # (C, H, W) -> (H, W, C)
-    if array.shape[-1] == 1:
-        array = array.squeeze(-1)
-    return PIL.Image.fromarray(array.to(torch.uint8).numpy())
-
-
-def to_image_blocks(frames: list[Any]) -> list[dict[str, Any]]:
-    """Convert decoded frames to Qwen-VL-compatible image content blocks."""
-    return [{"type": "image", "image": _frame_to_pil(frame)} for frame in frames]
-
-
-def to_video_block(frames: list[Any]) -> list[dict[str, Any]]:
-    """Wrap a list of decoded frames as one Qwen-VL video block.
-
-    Returns ``[]`` when the list is empty, so the caller can splat the result
-    into a content array without a separate emptiness check.
-    """
-    if not frames:
-        return []
-    return [{"type": "video", "video": [_frame_to_pil(frame) for frame in frames]}]
-
-
-def to_video_url_block(url: str | None, fps: float = 2.0) -> list[dict[str, Any]]:
-    """Wrap a video file URL as one ``video_url`` block.
-
-    Used by the ``openai`` backend (transformers serve / vllm serve /
-    ktransformers serve), where the server handles frame sampling.
-    Returns ``[]`` when ``url`` is ``None`` or empty so the caller can splat.
-    """
-    if not url:
-        return []
-    return [{"type": "video_url", "video_url": {"url": url}, "fps": fps}]
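Review note: the empty-list return in the removed block helpers exists so call sites can splat an optional block into a content array. A minimal standalone sketch of that idiom follows. `build_content` is a hypothetical caller written for illustration, and the file URL in the usage line is made up.

```python
from __future__ import annotations

from typing import Any


def to_video_url_block(url: str | None, fps: float = 2.0) -> list[dict[str, Any]]:
    """Standalone copy of the removed helper: [] when the URL is missing."""
    if not url:
        return []
    return [{"type": "video_url", "video_url": {"url": url}, "fps": fps}]


def build_content(url: str | None, prompt: str) -> list[dict[str, Any]]:
    """Hypothetical caller: splat the optional video block before the text."""
    return [*to_video_url_block(url), {"type": "text", "text": prompt}]


# With no URL the content degrades to a text-only message.
print(build_content(None, "Describe the episode."))
print(build_content("file:///tmp/episode_0.mp4", "Describe the episode."))
```

Returning a list rather than `None` keeps the call site free of branching, at the cost of a slightly unusual return type for a single block.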
--- a/src/lerobot/annotations/steerable_pipeline/modules/__init__.py
+++ b/src/lerobot/annotations/steerable_pipeline/modules/__init__.py
@@ -1,25 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from .general_vqa import GeneralVqaModule
-from .interjections_and_speech import InterjectionsAndSpeechModule
-from .plan_subtasks_memory import PlanSubtasksMemoryModule
-
-__all__ = [
-    "GeneralVqaModule",
-    "InterjectionsAndSpeechModule",
-    "PlanSubtasksMemoryModule",
-]
--- a/src/lerobot/annotations/steerable_pipeline/modules/general_vqa.py
+++ b/src/lerobot/annotations/steerable_pipeline/modules/general_vqa.py
@@ -1,228 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""``vqa`` module: general VQA at a timed cadence.
-
-Every ``1/hz`` seconds an emission tick fires; each tick anchors ``K``
-consecutive frames, and every anchored frame gets its own VQA pair. Each
-pair is grounded on that single anchor frame — there is no per-pair frame
-window. For datasets with multiple cameras, every anchored frame produces
-one ``(vqa, user)`` + ``(vqa, assistant)`` pair *per camera*: each pair is
-generated against that camera's frame and stamped with the matching
-``camera`` field on the emitted rows. The resolver disambiguates via
-``camera=...``; recipes that consume VQA do so through one sub-recipe
-per camera (see ``recipes/subtasks_vqa.yaml``).
-
-Within a single (frame, camera) we still emit at most one ``(vqa, user)``
-and one ``(vqa, assistant)`` row, so the resolver contract stays scalar.
-
-Question types covered (per the plan's ``vqa`` table): bbox, keypoint,
-count, attribute, spatial. The assistant's ``content`` is a JSON string
-whose schema depends on the question type. Malformed JSON triggers one
-retry inside :meth:`VlmClient.generate_json`.
-"""
-
-from __future__ import annotations
-
-import json
-import logging
-import random
-from collections.abc import Sequence
-from dataclasses import dataclass, field
-from typing import Any
-
-from ..config import VqaConfig
-from ..frames import FrameProvider, null_provider, to_image_blocks
-from ..prompts import load as load_prompt
-from ..reader import EpisodeRecord
-from ..staging import EpisodeStaging
-from ..validator import classify_vqa_answer
-from ..vlm_client import VlmClient
-
-
-def _emission_anchor_indices(frame_timestamps: Sequence[float], hz: float, k: int) -> list[int]:
-    """Return the relative frame indices to anchor VQA emissions to.
-
-    For each emission tick (every ``1/hz`` seconds), we anchor ``k``
-    consecutive frames starting at the tick. Ticks fall on the nearest
-    available source frame timestamp.
-    """
-    if hz <= 0 or k <= 0 or not frame_timestamps:
-        return []
-    t0 = frame_timestamps[0]
-    t_last = frame_timestamps[-1]
-    period = 1.0 / hz
-    indices: list[int] = []
-    t = t0
-    while t <= t_last + 1e-9:
-        # find the index of the nearest frame to t
-        nearest_i = min(range(len(frame_timestamps)), key=lambda i: abs(frame_timestamps[i] - t))
-        for offset in range(k):
-            j = nearest_i + offset
-            if j >= len(frame_timestamps):
-                break
-            if not indices or indices[-1] != j:
-                indices.append(j)
-        t += period
-    # dedupe while preserving order
-    seen: set[int] = set()
-    deduped: list[int] = []
-    for i in indices:
-        if i in seen:
-            continue
-        seen.add(i)
-        deduped.append(i)
-    return deduped
-
-
-@dataclass
-class GeneralVqaModule:
-    """Emit grounded VQA pairs at a timed cadence."""
-
-    vlm: VlmClient
-    config: VqaConfig
-    seed: int = 1729
-    frame_provider: FrameProvider = field(default_factory=null_provider)
-
-    @property
-    def enabled(self) -> bool:
-        return self.config.enabled
-
-    def run_episode(self, record: EpisodeRecord, staging: EpisodeStaging) -> None:
-        if not record.frame_timestamps:
-            staging.write("vqa", [])
-            return
-        rng = random.Random(f"{self.seed}:{record.episode_index}:vqa")
-        anchor_idx = _emission_anchor_indices(
-            record.frame_timestamps, self.config.vqa_emission_hz, self.config.K
-        )
-        cameras = self._target_cameras()
-        if not cameras:
-            # No camera available — emit nothing rather than producing
-            # untagged rows that would fail validation. Surface a loud one-
-            # time warning so this is never silently a no-op.
-            if not getattr(self, "_warned_no_camera", False):
-                logging.getLogger(__name__).warning(
-                    "vqa module found no cameras on the frame provider — "
-                    "every episode will emit zero VQA rows. Check that the "
-                    "dataset declares observation.images.* features in "
-                    "meta/info.json; passing --vlm.camera_key=<key> at the "
-                    "CLI now also seeds the cameras list as a fallback."
-                )
-                self._warned_no_camera = True
-            staging.write("vqa", [])
-            return
-
-        # Build all messages first (one per (frame, camera)), then issue them
-        # as a single batched generate_json call so the client can fan them
-        # out concurrently.
-        per_call: list[tuple[float, str, str, list[dict[str, Any]]]] = []
-        for idx in anchor_idx:
-            ts = float(record.frame_timestamps[idx])
-            qtype = rng.choice(self.config.question_types)
-            for camera in cameras:
-                messages = self._build_messages(record, qtype, ts, camera)
-                # Skip cameras that decoded to zero frames at this ts: no point
-                # asking the VLM to ground a bbox without an image.
-                if not _has_image_block(messages):
-                    continue
-                per_call.append((ts, camera, qtype, messages))
-
-        if not per_call:
-            staging.write("vqa", [])
-            return
-
-        results = self.vlm.generate_json([m for _, _, _, m in per_call])
-
-        rows: list[dict[str, Any]] = []
-        for (ts, camera, _qtype, _messages), result in zip(per_call, results, strict=True):
-            qa = self._postprocess(result)
-            if qa is None:
-                continue
-            question, answer = qa
-            rows.append(
-                {
-                    "role": "user",
-                    "content": question,
-                    "style": "vqa",
-                    "timestamp": ts,
-                    "camera": camera,
-                    "tool_calls": None,
-                }
-            )
-            rows.append(
-                {
-                    "role": "assistant",
-                    "content": json.dumps(answer, sort_keys=True),
-                    "style": "vqa",
-                    "timestamp": ts,
-                    "camera": camera,
-                    "tool_calls": None,
-                }
-            )
-        staging.write("vqa", rows)
-
-    def _target_cameras(self) -> list[str]:
-        """Return the cameras the ``vqa`` module should iterate per anchored frame.
-
-        Defaults to every camera the provider exposes. Datasets with no
-        cameras (or test/null providers) yield an empty list, in which case
-        ``run_episode`` warns once and stages zero VQA rows.
-        """
-        return list(getattr(self.frame_provider, "camera_keys", []) or [])
-
-    def _build_messages(
-        self,
-        record: EpisodeRecord,
-        question_type: str,
-        frame_timestamp: float,
-        camera_key: str,
-    ) -> list[dict[str, Any]]:
-        prompt = load_prompt("module_3_vqa").format(
-            episode_task=record.episode_task,
-            question_type=question_type,
-        )
-        images = self.frame_provider.frames_at(
-            record, [frame_timestamp], camera_key=camera_key
-        )
-        content = [*to_image_blocks(images), {"type": "text", "text": prompt}]
-        return [{"role": "user", "content": content}]
-
-    def _postprocess(self, result: Any) -> tuple[str, dict[str, Any]] | None:
-        if not isinstance(result, dict):
-            return None
-        question = result.get("question")
-        answer = result.get("answer")
-        if not isinstance(question, str) or not question.strip():
-            return None
-        if not isinstance(answer, dict):
-            return None
-        # The validator will enforce shape; here we just sanity-check that the
-        # answer matches *some* known shape so we can drop garbage early.
-        if classify_vqa_answer(answer) is None:
-            return None
-        return question.strip(), answer
-
-
-def _has_image_block(messages: list[dict[str, Any]]) -> bool:
-    """Return True if any message's content list contains an image block."""
-    for msg in messages:
-        content = msg.get("content")
-        if not isinstance(content, list):
-            continue
-        for block in content:
-            if isinstance(block, dict) and block.get("type") == "image":
-                return True
-    return False
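Review note: the removed `_emission_anchor_indices` finds the nearest frame with a linear scan per tick and advances the tick time by repeated addition. The sketch below shows the same anchoring rule under two assumptions that are not in the removed code: timestamps are sorted ascending (so `bisect` applies), and tick times are computed as `t0 + tick * period` to avoid accumulating floating-point drift. The names `nearest_index` and `anchor_indices` are illustrative only.

```python
from bisect import bisect_left
from collections.abc import Sequence


def nearest_index(timestamps: Sequence[float], t: float) -> int:
    """Index of the timestamp closest to t (ties go to the earlier frame)."""
    pos = bisect_left(timestamps, t)
    if pos == 0:
        return 0
    if pos == len(timestamps):
        return len(timestamps) - 1
    before, after = timestamps[pos - 1], timestamps[pos]
    return pos - 1 if (t - before) <= (after - t) else pos


def anchor_indices(timestamps: Sequence[float], hz: float, k: int) -> list[int]:
    """Anchor k consecutive frames at every 1/hz tick, deduplicated in order."""
    if hz <= 0 or k <= 0 or not timestamps:
        return []
    period = 1.0 / hz
    seen: set[int] = set()
    out: list[int] = []
    tick = 0
    while timestamps[0] + tick * period <= timestamps[-1] + 1e-9:
        start = nearest_index(timestamps, timestamps[0] + tick * period)
        # Frames past the end of the episode are simply not anchored.
        for j in range(start, min(start + k, len(timestamps))):
            if j not in seen:
                seen.add(j)
                out.append(j)
        tick += 1
    return out


# 10 fps for 2 seconds gives 21 frames; one tick per second, 2 frames per tick.
frames = [i / 10 for i in range(21)]
print(anchor_indices(frames, hz=1.0, k=2))  # [0, 1, 10, 11, 20]
```

The last tick lands on the final frame, so only one frame is anchored there; the second would fall outside the episode.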
--- a/src/lerobot/annotations/steerable_pipeline/modules/interjections_and_speech.py
+++ b/src/lerobot/annotations/steerable_pipeline/modules/interjections_and_speech.py
@@ -1,210 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""``interjections`` module: interjections + paired speech (EVENT styles + speech atoms).
-
-Two sub-passes:
-
-1. At ``t=0``, emit ONLY a speech tool-call atom (acknowledgement of the
-   canonical task). No interjection row — the canonical task is already the
-   user utterance from ``meta/tasks.parquet``.
-
-2. For mid-episode interruptions, emit a co-timestamped pair:
-       {role:user, style:interjection, content:<text>}
-       speech atom (role:assistant, style:None, tool_calls=[say(...)])
-   Both rows go in ``language_events`` at the same timestamp.
-
-The ``plan`` module's :meth:`run_plan_updates` reuses this module's
-interjection timestamps to refresh the ``plan`` row at the same instant.
-"""
-
-from __future__ import annotations
-
-import random
-from collections.abc import Sequence
-from dataclasses import dataclass, field
-from typing import Any
-
-from ..config import InterjectionsConfig
-from ..frames import FrameProvider, null_provider, to_image_blocks
-from ..prompts import load as load_prompt
-from ..reader import EpisodeRecord, reconstruct_subtask_spans, snap_to_frame
-from ..staging import EpisodeStaging
-from ..vlm_client import VlmClient
-from ..writer import speech_atom
-
-
-@dataclass
-class InterjectionsAndSpeechModule:
-    """Generate task-start speech and mid-episode interjection/speech pairs."""
-
-    vlm: VlmClient
-    config: InterjectionsConfig
-    seed: int = 1729
-    frame_provider: FrameProvider = field(default_factory=null_provider)
-
-    @property
-    def enabled(self) -> bool:
-        return self.config.enabled
-
-    def run_episode(self, record: EpisodeRecord, staging: EpisodeStaging) -> None:
-        rows: list[dict[str, Any]] = []
-        if record.frame_timestamps:
-            t0 = float(record.frame_timestamps[0])
-            initial = self._initial_speech(record)
-            if initial:
-                rows.append(speech_atom(t0, initial))
-        # Pull the ``plan`` module's subtask spans for this episode so the
-        # interjection prompt can ground itself in the actual current
-        # subtask at each chosen timestamp. The ``plan`` module ran first.
-        episode_end_t = float(record.frame_timestamps[-1]) if record.frame_timestamps else None
-        subtask_spans = reconstruct_subtask_spans(staging.read("plan"), episode_end_t=episode_end_t)
-        rows.extend(self._mid_episode_interjections(record, subtask_spans))
-        staging.write("interjections", rows)
-
-    @staticmethod
-    def _subtask_at(spans: Sequence[dict[str, Any]], t: float) -> str | None:
-        current: str | None = None
-        for span in spans:
-            if float(span["start"]) <= t:
-                current = span.get("text")
-            else:
-                break
-        return current
-
-    def _initial_speech(self, record: EpisodeRecord) -> str | None:
-        prompt = load_prompt("module_2_initial_speech").format(
-            episode_task=record.episode_task,
-        )
-        messages = [{"role": "user", "content": [{"type": "text", "text": prompt}]}]
-        result = self.vlm.generate_json([messages])[0]
-        if isinstance(result, dict) and isinstance(result.get("text"), str):
-            text = result["text"].strip()
-            if text:
-                return text
-        return None
-
-    def _mid_episode_interjections(
-        self,
-        record: EpisodeRecord,
-        subtask_spans: Sequence[dict[str, Any]],
-    ) -> list[dict[str, Any]]:
-        """Generate interjections aligned with the actual demo trajectory.
-
-        Teleop data is frozen — the robot already executed every step in
-        the video. A *counterfactual* interjection like "actually skip
-        the wipe" contradicts what then happens in the video, which is
-        what qwen36moe-10/11 surfaced as low-quality interjections.
-
-        Instead, anchor every interjection at a subtask boundary and
-        write it as a natural user request for the *upcoming* subtask.
-        The robot's visible next behavior IS the interjection's effect,
-        so the training signal stays consistent: interjection text →
-        plan refresh → action stream all line up.
-        """
-        if self.config.max_interjections_per_episode <= 0:
-            return []
-        if len(subtask_spans) < 2:
-            # Need at least one transition (subtask 0 → subtask 1).
-            return []
-        # Deterministic per-episode RNG so reruns are stable across SLURM jobs.
-        rng = random.Random(f"{self.seed}:{record.episode_index}:interjection")
-
-        # Boundaries: the start time of every subtask except the first
-        # (which is just t0 and is covered by the initial-task speech atom).
-        boundaries: list[tuple[float, str, str]] = []
-        for i in range(1, len(subtask_spans)):
-            ts = float(subtask_spans[i]["start"])
-            if ts < self.config.interjection_min_t:
-                continue
-            prev_text = (subtask_spans[i - 1].get("text") or "").strip()
-            next_text = (subtask_spans[i].get("text") or "").strip()
-            if not next_text:
-                continue
-            boundaries.append((ts, prev_text, next_text))
-        if not boundaries:
-            return []
-
-        n = min(self.config.max_interjections_per_episode, len(boundaries))
-        chosen = sorted(rng.sample(boundaries, n), key=lambda b: b[0])
-
-        out: list[dict[str, Any]] = []
-        for t, prev_subtask, next_subtask in chosen:
-            t_snap = snap_to_frame(t, record.frame_timestamps)
-            # Window straddles the boundary so the VLM sees the end of the
-            # previous subtask and the start of the next one — same
-            # conditioning the policy will see at training time.
-            window_ts = self._window_timestamps(t_snap, record.frame_timestamps)
-            prompt = load_prompt("module_2_interjection").format(
-                episode_task=record.episode_task,
-                prev_subtask=prev_subtask or "(starting from initial state)",
-                next_subtask=next_subtask,
-                timestamp=t_snap,
-                window_seconds=self.config.interjection_window_seconds,
-            )
-            images = self.frame_provider.frames_at(record, window_ts)
-            content = [*to_image_blocks(images), {"type": "text", "text": prompt}]
-            messages = [{"role": "user", "content": content}]
-            result = self.vlm.generate_json([messages])[0]
-            if not isinstance(result, dict):
-                continue
-            interjection_text = result.get("interjection")
-            speech_text = result.get("speech")
-            if not isinstance(interjection_text, str) or not interjection_text.strip():
-                continue
-            if not isinstance(speech_text, str) or not speech_text.strip():
-                continue
-            out.append(
-                {
-                    "role": "user",
-                    "content": interjection_text.strip(),
-                    "style": "interjection",
-                    "timestamp": t_snap,
-                    "tool_calls": None,
-                }
-            )
-            out.append(speech_atom(t_snap, speech_text.strip()))
-        return out
-
-    def _window_timestamps(self, t_anchor: float, frame_timestamps: Sequence[float]) -> list[float]:
-        """Return a small set of frame timestamps centered on ``t_anchor``.
-
-        The window straddles the subtask boundary the interjection sits
-        on: roughly half the frames cover the end of the previous
-        subtask, half cover the start of the next one. The VLM therefore
-        sees BOTH what just finished AND what's about to start, which is
-        the conditioning we need to write a natural "now please do X"
-        request that matches the visible upcoming behavior.
-        """
-        if not frame_timestamps:
-            return [t_anchor]
-        n = max(1, int(self.config.interjection_window_frames))
-        if n == 1:
-            return [t_anchor]
-        window = float(self.config.interjection_window_seconds)
-        step = window / max(1, n - 1)
-        # Center the window on the anchor so half lands before, half after.
-        start_offset = -window / 2.0
-        targets = [t_anchor + start_offset + step * i for i in range(n)]
-        last_ts = float(frame_timestamps[-1])
-        snapped: list[float] = []
-        seen: set[float] = set()
-        for tgt in targets:
-            clamped = min(last_ts, max(0.0, tgt))
-            t = snap_to_frame(clamped, frame_timestamps)
-            if t not in seen:
-                seen.add(t)
-                snapped.append(t)
-        return snapped or [t_anchor]
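Review note: the centered-window logic in the removed `_window_timestamps` can be exercised in isolation. The sketch below restates it as a free function. `snap_to_frame` here is a hypothetical stand-in that returns the nearest frame timestamp; the real helper lives in the `reader` module, which is not shown in this part of the diff, so its exact behavior is an assumption.

```python
from collections.abc import Sequence


def snap_to_frame(t: float, frame_timestamps: Sequence[float]) -> float:
    """Hypothetical stand-in: return the frame timestamp nearest to t."""
    return min(frame_timestamps, key=lambda ft: abs(ft - t))


def window_timestamps(
    t_anchor: float,
    frame_timestamps: Sequence[float],
    n_frames: int,
    window_seconds: float,
) -> list[float]:
    """Up to n_frames distinct frame timestamps centered on t_anchor."""
    if not frame_timestamps or n_frames <= 1:
        return [t_anchor]
    step = window_seconds / (n_frames - 1)
    targets = [t_anchor - window_seconds / 2.0 + step * i for i in range(n_frames)]
    last_ts = float(frame_timestamps[-1])
    out: list[float] = []
    for tgt in targets:
        # Clamp into [0, last frame], then snap to a real frame timestamp.
        t = snap_to_frame(min(last_ts, max(0.0, tgt)), frame_timestamps)
        if t not in out:
            out.append(t)
    return out or [t_anchor]


frames = [i * 0.5 for i in range(9)]  # 0.0, 0.5, ..., 4.0
print(window_timestamps(2.0, frames, n_frames=5, window_seconds=2.0))
# [1.0, 1.5, 2.0, 2.5, 3.0]
print(window_timestamps(0.0, frames, n_frames=5, window_seconds=2.0))
# [0.0, 0.5, 1.0]
```

Near the start of an episode the clamped targets collapse onto the first frame, so the window returns fewer than `n_frames` timestamps rather than duplicates.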
--- a/src/lerobot/annotations/steerable_pipeline/modules/plan_subtasks_memory.py
+++ b/src/lerobot/annotations/steerable_pipeline/modules/plan_subtasks_memory.py
@@ -1,617 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""``plan`` module: subtask decomposition + plan + memory (PERSISTENT styles)."""
-
-from __future__ import annotations
-
-import logging
-from collections.abc import Sequence
-from dataclasses import dataclass, field
-from pathlib import Path
-from typing import Any
-
-from ..config import PlanConfig
-from ..frames import (
-    FrameProvider,
-    VideoFrameProvider,
-    null_provider,
-    to_video_block,
-    to_video_url_block,
-)
-from ..prompts import load as load_prompt
-from ..reader import EpisodeRecord, reconstruct_subtask_spans, snap_to_frame
-from ..staging import EpisodeStaging
-from ..vlm_client import VlmClient
-from ..vocabulary import Vocabulary
-
-logger = logging.getLogger(__name__)
-
-
-@dataclass
-class PlanSubtasksMemoryModule:
-    """Generate subtask spans, plan, and memory rows.
-
-    All output is persistent (lives in ``language_persistent``):
-
-    - ``subtask`` rows: one per span, stamped at the span's *start* timestamp
-      (snapped to an exact frame).
-    - ``plan`` rows: emitted at every subtask boundary, including ``t=0``;
-      refreshed at every interjection timestamp via :meth:`run_plan_updates`
-      (called by the executor after the ``interjections`` module completes).
-    - ``memory`` rows: emitted at each subtask boundary (= subtask start
-      timestamp from the second subtask onward).
-    """
-
-    vlm: VlmClient
-    config: PlanConfig
-    frame_provider: FrameProvider = field(default_factory=null_provider)
-    vocabulary: Vocabulary | None = None
-    """When set, the module constrains subtask + memory generation to the
-    canonical strings in ``vocabulary``. Phase 0 (vocabulary discovery)
-    populates this once per dataset; ``None`` falls back to free-form
-    generation (original behaviour)."""
-
-    @property
-    def enabled(self) -> bool:
-        return self.config.enabled
-
-    def run_episode(self, record: EpisodeRecord, staging: EpisodeStaging) -> None:
-        rows: list[dict[str, Any]] = []
-        # Resolve the task that drives every other ``plan``-module prompt.
-        # May be the canonical ``record.episode_task`` (default), or a fresh
-        # description derived from the video when the canonical task is
-        # empty / placeholder / forced-off (see PlanConfig.derive_task_*).
-        effective_task = self._resolve_effective_task(record)
-        # ``task_aug`` rows at t=0 (role=user), one per rephrasing — the
-        # message renderer rotates ``${task}`` deterministically through
-        # them so the policy sees diverse phrasings during training.
-        t0 = float(record.frame_timestamps[0]) if record.frame_timestamps else 0.0
-        if self.config.n_task_rephrasings > 0 and effective_task:
-            rephrasings = self._generate_task_rephrasings(effective_task, n=self.config.n_task_rephrasings)
-            # Always include the effective task itself as the first variant
-            # so the rotation is guaranteed to cover the source-of-truth
-            # phrasing, not just synthetic alternatives.
-            seen: set[str] = set()
-            ordered = [effective_task, *rephrasings]
-            for phrasing in ordered:
-                key = phrasing.strip()
-                if not key or key in seen:
-                    continue
-                seen.add(key)
-                rows.append(
-                    {
-                        "role": "user",
-                        "content": key,
-                        "style": "task_aug",
-                        "timestamp": t0,
-                        "tool_calls": None,
-                    }
-                )
-
-        subtask_spans = self._generate_subtasks(record, task=effective_task)
-        # subtask rows
-        for span in subtask_spans:
-            rows.append(
-                {
-                    "role": "assistant",
-                    "content": span["text"],
-                    "style": "subtask",
-                    "timestamp": snap_to_frame(span["start"], record.frame_timestamps),
-                    "tool_calls": None,
-                }
-            )
-        # Plan rows at every subtask boundary — including t=0 (start of
-        # the first subtask). Because the plan is just a numbered list
-        # of *still-todo* subtasks, re-emitting at each boundary makes
-        # the active plan shrink as work progresses: at frame t the
-        # rendered ``${plan}`` is the most recent emission, which
-        # contains exactly the subtasks that started at or after the
-        # current span. Saves the runtime from having to derive
-        # "what's still left" at inference time.
-        for span in subtask_spans:
-            boundary_t = snap_to_frame(span["start"], record.frame_timestamps)
-            plan_text = self._generate_plan(
-                record, subtask_spans, refresh_t=boundary_t, task=effective_task
-            )
-            if plan_text is not None:
-                rows.append(
-                    {
-                        "role": "assistant",
-                        "content": plan_text,
-                        "style": "plan",
-                        "timestamp": float(boundary_t),
-                        "tool_calls": None,
-                    }
-                )
-        # memory rows at every subtask boundary except the very first start
-        prior_memory = ""
-        for i, span in enumerate(subtask_spans[1:], start=1):
-            completed = subtask_spans[i - 1]["text"]
-            remaining = [s["text"] for s in subtask_spans[i:]]
-            mem_text = self._generate_memory(record, prior_memory, completed, remaining, task=effective_task)
-            if mem_text:
-                ts = snap_to_frame(span["start"], record.frame_timestamps)
-                rows.append(
-                    {
-                        "role": "assistant",
-                        "content": mem_text,
-                        "style": "memory",
-                        "timestamp": ts,
-                        "tool_calls": None,
-                    }
-                )
-                prior_memory = mem_text
-        staging.write("plan", rows)
-
-    # ------------------------------------------------------------------
-    # Task derivation + rephrasings
-    # ------------------------------------------------------------------
-
-    _PLACEHOLDER_TASKS: frozenset[str] = frozenset(
-        {
-            "debug",
-            "test",
-            "tbd",
-            "todo",
-            "n/a",
-            "na",
-            "untitled",
-            "unnamed",
-            "default",
-            "placeholder",
-        }
-    )
-
-    def _resolve_effective_task(self, record: EpisodeRecord) -> str:
-        """Decide which task string drives the ``plan`` module for this episode.
-
-        Returns the user-supplied ``record.episode_task`` unless
-        ``derive_task_from_video`` is ``"always"``, or is ``"if_short"`` and the
-        task looks empty, too short, or like a placeholder. Falls back to the
-        canonical task if video derivation fails.
-        """
-        canonical = (record.episode_task or "").strip()
-        mode = (self.config.derive_task_from_video or "off").strip().lower()
-        if mode == "always":
-            derived = self._derive_task_from_video(record)
-            return derived or canonical
-        if mode == "if_short" and self._task_seems_bad(canonical):
-            derived = self._derive_task_from_video(record)
-            if derived:
-                return derived
-        return canonical
-
-    def _task_seems_bad(self, task: str) -> bool:
-        if not task:
-            return True
-        if len(task.split()) < int(self.config.derive_task_min_words):
-            return True
-        return task.lower() in self._PLACEHOLDER_TASKS
-
-    # ------------------------------------------------------------------
-    # VLM call helpers (factored out: every ``plan``-module prompt below follows
-    # the same "build messages → single VLM call → pull a named field"
-    # shape, only differing in field name + post-processing).
-    # ------------------------------------------------------------------
-
-    def _vlm_field(self, messages: list[dict[str, Any]], field: str) -> Any:
-        """Run a single VLM call and return ``result[field]`` or ``None``.
-
-        Centralizes the ``vlm.generate_json([m])[0]`` + ``isinstance(dict)``
-        dance every prompt-call site needs.
-        """
-        result = self.vlm.generate_json([messages])[0]
-        if isinstance(result, dict):
-            return result.get(field)
-        return None
-
-    @staticmethod
-    def _text_message(text: str) -> list[dict[str, Any]]:
-        """One-shot text-only user message wrapped for ``generate_json``."""
-        return [{"role": "user", "content": [{"type": "text", "text": text}]}]
-
-    def _video_message(self, record: EpisodeRecord, prompt: str) -> list[dict[str, Any]]:
-        """User message combining the episode video block with ``prompt``."""
-        content = [*self._episode_video_block(record), {"type": "text", "text": prompt}]
-        return [{"role": "user", "content": content}]
-
-    def _derive_task_from_video(self, record: EpisodeRecord) -> str | None:
-        """Ask the VLM "what is this video about" with no task hint at all."""
-        text = self._vlm_field(self._video_message(record, load_prompt("module_1_video_task")), "task")
-        return text.strip() if isinstance(text, str) and text.strip() else None
-
-    def _generate_task_rephrasings(self, base_task: str, *, n: int) -> list[str]:
-        """Generate ``n`` text-only paraphrases of ``base_task``."""
-        if n <= 0 or not base_task:
-            return []
-        prompt = load_prompt("module_1_task_rephrasings").format(base_task=base_task, n=n)
-        raw = self._vlm_field(self._text_message(prompt), "rephrasings")
-        if not isinstance(raw, list):
-            return []
-        out = [item.strip().strip('"').strip("'") for item in raw if isinstance(item, str)]
-        return [s for s in out if s][:n]
-
-    def _episode_video_block(self, record: EpisodeRecord) -> list[dict[str, Any]]:
-        """Same video block ``_generate_subtasks`` builds — extracted helper."""
-        if not record.frame_timestamps:
-            return []
-        if self.config.use_video_url and isinstance(self.frame_provider, VideoFrameProvider):
-            cache_dir = Path(self.frame_provider.root) / ".annotate_staging" / ".video_clips"
-            clip = self.frame_provider.episode_clip_path(record, cache_dir)
-            return (
-                to_video_url_block(f"file://{clip}", fps=self.config.use_video_url_fps)
-                if clip is not None
-                else []
-            )
-        episode_duration = record.frame_timestamps[-1] - record.frame_timestamps[0]
-        target_count = max(1, int(round(episode_duration * self.config.frames_per_second)))
-        target_count = min(target_count, self.config.max_video_frames)
-        video_frames = self.frame_provider.video_for_episode(record, target_count)
-        return to_video_block(video_frames)
-
-    def run_plan_updates(
-        self,
-        record: EpisodeRecord,
-        staging: EpisodeStaging,
-        interjection_times: Sequence[float],
-        interjection_texts: Sequence[str] | None = None,
-    ) -> None:
-        """Append additional ``plan`` rows at every interjection timestamp.
-
-        This method refreshes the plan at user interjections; subtask
-        generation runs ~1 Hz at inference, but plan re-emission is
-        event-driven. The interjection text is forwarded to
-        ``_generate_plan``, which accepts it for signature stability but,
-        being deterministic, ignores it: the refreshed plan lists only
-        the subtasks that start at or after the snapped timestamp.
-        """
-        existing = staging.read("plan")
-        # Pass the episode's last frame timestamp so the final subtask
-        # span is closed (otherwise its ``end`` equals its ``start``,
-        # zero duration, and the "current subtask at refresh_t" lookup
-        # in ``_generate_plan`` misses any refresh that lands inside it).
-        episode_end_t = float(record.frame_timestamps[-1]) if record.frame_timestamps else None
-        spans = reconstruct_subtask_spans(existing, episode_end_t=episode_end_t)
-        already_planned: set[float] = {float(r["timestamp"]) for r in existing if r.get("style") == "plan"}
-        new_rows = list(existing)
-
-        texts: list[str | None] = (
-            [None] * len(interjection_times)
-            if interjection_texts is None
-            else [str(t) if t else None for t in interjection_texts]
-        )
-        for raw_t, inter_text in zip(interjection_times, texts, strict=True):
-            t = snap_to_frame(raw_t, record.frame_timestamps)
-            if t in already_planned:
-                continue
-            already_planned.add(t)
-            plan_text = self._generate_plan(record, spans, refresh_t=t, interjection=inter_text)
-            if plan_text is not None:
-                new_rows.append(
-                    {
-                        "role": "assistant",
-                        "content": plan_text,
-                        "style": "plan",
-                        "timestamp": t,
-                        "tool_calls": None,
-                    }
-                )
-        staging.write("plan", new_rows)
-
-    def _generate_subtasks(self, record: EpisodeRecord, *, task: str | None = None) -> list[dict[str, Any]]:
-        if record.row_count == 0 or not record.frame_timestamps:
-            return []
-        episode_duration = record.frame_timestamps[-1] - record.frame_timestamps[0]
-        prompt = load_prompt("module_1_subtasks").format(
-            episode_task=(task if task is not None else record.episode_task),
-            min_subtask_seconds=self.config.min_subtask_seconds,
-            max_steps=self.config.plan_max_steps,
-            episode_duration=f"{episode_duration:.3f}",
-            vocabulary_block=self._subtask_vocabulary_block(),
-        )
-        messages = self._video_message(record, prompt)
-        spans = self._vlm_field(messages, "subtasks")
-        # When a vocabulary is in force, do a single targeted retry if
-        # any returned subtask is off-vocab — strict exact-match only,
-        # no fuzzy snapping. The retry includes the offending strings
-        # and the full canonical list so the VLM can correct itself.
-        if self.vocabulary is not None and self.vocabulary.subtasks and spans:
-            invalid = self._invalid_subtasks(spans)
-            if invalid:
-                logger.info(
-                    "episode %d: VLM emitted %d off-vocab subtask(s) (%s); retrying once",
-                    record.episode_index,
-                    len(invalid),
-                    invalid,
-                )
-                retry_msg = self._build_subtask_retry_message(messages, invalid)
-                retried = self._vlm_field(retry_msg, "subtasks")
-                if retried:
-                    spans = retried
-
-        if not spans:
-            return []
-        # clamp to [t0, t_last] and sort
-        t0 = record.frame_timestamps[0]
-        t_last = record.frame_timestamps[-1]
-        cleaned: list[dict[str, Any]] = []
-        for span in spans:
-            try:
-                start = float(span["start"])
-                end = float(span["end"])
-                text = str(span["text"]).strip()
-            except (KeyError, ValueError, TypeError):
-                continue
-            start = max(t0, min(start, t_last))
-            end = max(t0, min(end, t_last))
-            if end < start:
-                start, end = end, start
-            if not text:
-                continue
-            text = self._canonicalize_subtask(text)
-            if not text:
-                continue
-            cleaned.append({"text": text, "start": start, "end": end})
-        cleaned.sort(key=lambda s: s["start"])
-        cleaned = self._dedupe_starts_to_distinct_frames(cleaned, record)
-        if self.vocabulary is not None and self.vocabulary.subtasks and not cleaned:
-            logger.warning(
-                "episode %d: every VLM subtask was off-vocab even after retry — "
-                "episode left empty (extend meta/canonical_vocabulary.json to "
-                "cover the missing phase)",
-                record.episode_index,
-            )
-        return cleaned
-
-    @staticmethod
-    def _dedupe_starts_to_distinct_frames(
-        spans: list[dict[str, Any]], record: EpisodeRecord
-    ) -> list[dict[str, Any]]:
-        """Bump same-frame subtask starts onto distinct frames.
-
-        Two consecutive VLM spans whose ``start`` rounds to the same
-        source frame (after :func:`snap_to_frame`) would otherwise emit
-        two ``style=subtask`` rows at the identical persistent
-        timestamp. The training-time renderer's ``active_at(t,
-        style=subtask)`` resolver can't disambiguate that and raises
-        ``Ambiguous resolver for style='subtask'``.
-
-        Walk the (sorted-by-start) spans, snap each to its frame, and
-        if the snapped frame is already taken push the span onto the
-        next unused frame so both subtasks survive on distinct
-        timestamps. If the episode ends before a free frame is found,
-        the trailing span is dropped with a warning — better than
-        poisoning the render.
-        """
-        if not spans:
-            return spans
-        frames = record.frame_timestamps
-        if not frames:
-            return spans
-        used: set[float] = set()
-        out: list[dict[str, Any]] = []
-        for span in spans:
-            ts = snap_to_frame(span["start"], frames)
-            if ts in used:
-                next_ts = next((f for f in frames if f > ts and f not in used), None)
-                if next_ts is None:
-                    logger.warning(
-                        "episode %d: subtask %r snapped to occupied frame "
-                        "%.3f and no free later frame exists — dropping",
-                        record.episode_index,
-                        span.get("text"),
-                        ts,
-                    )
-                    continue
-                ts = next_ts
-            used.add(ts)
-            new_span = {**span, "start": ts}
-            if float(new_span.get("end", ts)) < ts:
-                new_span["end"] = ts
-            out.append(new_span)
-        return out
-
-    # ------------------------------------------------------------------
-    # Canonical-vocabulary helpers
-    # ------------------------------------------------------------------
-
-    def _subtask_vocabulary_block(self) -> str:
-        """Bullet-list of canonical subtasks the VLM must pick from.
-
-        Returns an empty string when no vocabulary is configured —
-        ``module_1_subtasks.txt`` then falls back to its free-form
-        rules (original behaviour).
-        """
-        if self.vocabulary is None or not self.vocabulary.subtasks:
-            return ""
-        bullets = "\n".join(f"- {s}" for s in self.vocabulary.subtasks)
-        return (
-            "You MUST choose each subtask label verbatim from this canonical "
-            "vocabulary — pick the closest match for each phase of the demo, "
-            "and reuse the SAME string every time that phase recurs. The "
-            "low-level policy is conditioned on these exact strings; any "
-            "novel paraphrase you invent will make its conditioning OOD.\n"
-            "Canonical subtask labels:\n"
-            f"{bullets}\n\n"
-        )
-
-    def _memory_vocabulary_block(self) -> str:
-        """Bullet-list of canonical memory milestones the VLM must pick from."""
-        if self.vocabulary is None or not self.vocabulary.memory_milestones:
-            return ""
-        bullets = "\n".join(f"- {m}" for m in self.vocabulary.memory_milestones)
-        return (
-            "Compose the memory by picking ONLY from this canonical milestone "
-            "list — append a milestone (or rewrite the running memory to "
-            "compress past ones) using these exact phrases. Do not invent new "
-            "wording: every paraphrase weakens the downstream conditioning.\n"
-            "Canonical memory milestones:\n"
-            f"{bullets}\n\n"
-        )
-
-    _NORMALIZE_STRIP_TOKENS: frozenset[str] = frozenset({"the", "a", "an"})
-
-    def _canonicalize_subtask(self, text: str) -> str:
-        """Validate ``text`` against the canonical vocabulary; no fuzzy snap.
-
-        Without a vocabulary, the original text passes through. With a
-        vocabulary, accept the span only if its normalised form (lower-
-        cased, articles stripped, whitespace collapsed) matches a
-        canonical entry exactly — the canonical wording is returned so
-        the supervised string is byte-identical across episodes.
-
-        Off-vocab spans are dropped (empty string). Upstream
-        ``_generate_subtasks`` triggers a targeted retry before reaching
-        the drop path; this function never snaps or warps a span into
-        a different label.
-        """
-        if self.vocabulary is None or not self.vocabulary.subtasks:
-            return text.strip()
-        normalised = self._normalize(text)
-        if not normalised:
-            return ""
-        for candidate in self.vocabulary.subtasks:
-            if self._normalize(candidate) == normalised:
-                return candidate
-        return ""
-
-    @classmethod
-    def _normalize(cls, text: str) -> str:
-        """Lowercase, strip articles, collapse whitespace, drop punctuation."""
-        words = [
-            w.strip(".,:;\"'!?()")
-            for w in text.lower().replace(",", " ").split()
-        ]
-        return " ".join(w for w in words if w and w not in cls._NORMALIZE_STRIP_TOKENS)
-
-    def _invalid_subtasks(self, spans: list[dict[str, Any]]) -> list[str]:
-        """Return the unique off-vocab subtask strings the VLM produced."""
-        seen: list[str] = []
-        for span in spans:
-            text = str((span or {}).get("text") or "").strip()
-            if not text:
-                continue
-            if self._canonicalize_subtask(text):
-                continue
-            if text not in seen:
-                seen.append(text)
-        return seen
-
-    def _build_subtask_retry_message(
-        self, original_messages: list[dict[str, Any]], invalid: list[str]
-    ) -> list[dict[str, Any]]:
-        """Compose a one-shot correction prompt naming the off-vocab strings."""
-        assert self.vocabulary is not None
-        canonical = "\n".join(f"- {s}" for s in self.vocabulary.subtasks)
-        invalid_list = "\n".join(f"- {s!r}" for s in invalid)
-        correction = (
-            "Your previous response included subtask labels that are NOT in "
-            "the canonical vocabulary:\n"
-            f"{invalid_list}\n\n"
-            "Re-emit the same segmentation (same number of spans, same start/end "
-            "timestamps where they were valid) but replace every off-vocab "
-            "label with the EXACT canonical string for that phase, copied "
-            "verbatim from this list:\n"
-            f"{canonical}\n\n"
-            "Strict rules:\n"
-            "- Output strings must be byte-for-byte identical to entries above.\n"
-            "- No articles, no adverbs, no extra words.\n"
-            "- If a phase truly has no canonical match, omit that span entirely.\n"
-            "Return the same JSON shape as before."
-        )
-        # Append the correction as an additional user turn. The VLM client
-        # is stateless, so we re-send the original content; the model's
-        # prior output is not replayed, only the offending labels quoted
-        # in the correction text.
-        retry_messages = [
-            {
-                "role": m.get("role", "user"),
-                "content": (
-                    m.get("content")
-                    if isinstance(m.get("content"), str)
-                    else list(m.get("content") or [])
-                ),
-            }
-            for m in original_messages
-        ]
-        retry_messages.append({"role": "user", "content": correction})
-        return retry_messages
-
-    def _generate_plan(
-        self,
-        record: EpisodeRecord,  # noqa: ARG002  (kept for signature stability)
-        subtask_spans: Sequence[dict[str, Any]],
-        *,
-        refresh_t: float | None = None,
-        interjection: str | None = None,  # noqa: ARG002
-        task: str | None = None,  # noqa: ARG002
-    ) -> str | None:
-        """Deterministic plan = numbered list of *still-todo* subtasks.
-
-        Previously this called the VLM with a prompt that asked it to
-        compress the subtasks into a "compact hierarchical plan". That
-        produced longer-than-necessary plans, cost an extra VLM round-trip
-        per episode (plus one per interjection on refresh), and could
-        diverge from the actual subtask sequence the model is going to
-        execute. Replacing it with a plain summarisation keeps the plan
-        tightly aligned with the upcoming subtasks and removes the VLM
-        call entirely.
-
-        Layout — short imperative fragments prefixed by "N. ":
-
-            1. <subtask 1>
-            2. <subtask 2>
-            ...
-
-        On a refresh at ``refresh_t`` (called from ``run_plan_updates``
-        on interjection events, and from ``run_episode`` at every subtask
-        boundary), only subtasks whose start is at or after ``refresh_t``
-        are included — the plan shrinks as work progresses, so it always
-        describes what's left.
-        """
-        if not subtask_spans:
-            return None
-        remaining = [
-            s
-            for s in subtask_spans
-            if refresh_t is None or float(s.get("start", 0.0)) >= float(refresh_t)
-        ]
-        if not remaining:
-            # Past the last subtask boundary on a late refresh — nothing
-            # left to plan; emit None so the caller skips the row.
-            return None
-        return "\n".join(
-            f"{i}. {span.get('text', '').strip()}" for i, span in enumerate(remaining, start=1)
-        )
-
-    def _generate_memory(
-        self,
-        record: EpisodeRecord,
-        prior_memory: str,
-        completed: str,
-        remaining: Sequence[str],
-        *,
-        task: str | None = None,
-    ) -> str:
-        prompt = load_prompt("module_1_memory").format(
-            episode_task=(task if task is not None else record.episode_task),
-            prior_memory=prior_memory or "(none)",
-            completed_subtask=completed,
-            remaining_subtasks=", ".join(remaining) if remaining else "(none)",
-            vocabulary_block=self._memory_vocabulary_block(),
-        )
-        memory = self._vlm_field(self._text_message(prompt), "memory")
-        return memory.strip() if isinstance(memory, str) else ""
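Review note on the canonical-vocabulary helpers removed above: the strict matching in `_normalize` and `_canonicalize_subtask` can be reproduced outside the class. The block below is a minimal sketch under assumptions, not the removed implementation: `normalize`, `canonicalize`, and `ARTICLES` are illustrative free-function stand-ins, and a plain list of strings replaces `self.vocabulary.subtasks`.

```python
from __future__ import annotations

# Articles dropped before comparison (assumed to mirror _NORMALIZE_STRIP_TOKENS).
ARTICLES = frozenset({"the", "a", "an"})


def normalize(text: str) -> str:
    """Lowercase, strip edge punctuation per word, drop articles, collapse whitespace."""
    words = [w.strip(".,:;\"'!?()") for w in text.lower().replace(",", " ").split()]
    return " ".join(w for w in words if w and w not in ARTICLES)


def canonicalize(text: str, vocabulary: list[str]) -> str:
    """Return the canonical entry whose normalised form equals that of ``text``.

    Off-vocabulary text yields an empty string; nothing is fuzzily snapped.
    """
    if not vocabulary:
        # No vocabulary configured: the original text passes through.
        return text.strip()
    normalised = normalize(text)
    if not normalised:
        return ""
    for candidate in vocabulary:
        if normalize(candidate) == normalised:
            return candidate
    return ""


vocab = ["pick up blue cube", "put blue cube in box"]
matched = canonicalize("Pick up the blue cube.", vocab)  # article and period ignored
rejected = canonicalize("grasp blue cube", vocab)        # synonym is not accepted
print(matched)
print(repr(rejected))
```

Returning the canonical entry rather than the model's wording is what keeps the supervised string byte-identical across episodes, as the removed docstring describes.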
--- a/src/lerobot/annotations/steerable_pipeline/prompts/__init__.py
+++ b/src/lerobot/annotations/steerable_pipeline/prompts/__init__.py
@@ -1,33 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Prompt templates loaded as plain text.
-
-One file per use site. Templates use ``str.format(**vars)`` substitution; we
-intentionally avoid jinja2 here so the templates remain inspectable in
-plain editors and roundtrip cleanly through ``ruff format``.
-"""
-
-from __future__ import annotations
-
-from pathlib import Path
-
-_DIR = Path(__file__).parent
-
-
-def load(name: str) -> str:
-    """Read prompt template ``name.txt`` from the ``prompts/`` directory."""
-    path = _DIR / f"{name}.txt"
-    return path.read_text(encoding="utf-8")
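Review note on the loader removed above: because the templates go through `str.format`, literal JSON braces in a template must be doubled (`{{` and `}}`), while single braces mark placeholders. The block below is a minimal sketch of that round trip. It assumes a temporary directory and a hypothetical template name `demo_ack`; unlike the removed `load(name)`, this `load` takes the directory as an explicit argument so the example is self-contained.

```python
import tempfile
from pathlib import Path


def load(prompt_dir: Path, name: str) -> str:
    """Read ``<name>.txt`` from ``prompt_dir`` as UTF-8 text."""
    return (prompt_dir / f"{name}.txt").read_text(encoding="utf-8")


# Hypothetical template: one placeholder plus a literal JSON object.
TEMPLATE = (
    'The user just asked the robot: "{episode_task}".\n'
    "Output strictly valid JSON:\n"
    '  {{ "text": "<the spoken acknowledgement>" }}\n'
)

with tempfile.TemporaryDirectory() as tmp:
    prompt_dir = Path(tmp)
    (prompt_dir / "demo_ack.txt").write_text(TEMPLATE, encoding="utf-8")
    # str.format fills {episode_task} and collapses {{ }} to single braces.
    rendered = load(prompt_dir, "demo_ack").format(episode_task="put cube in box")

print(rendered)
```

A template that used single braces around the JSON object would make `str.format` raise a `KeyError` or `ValueError`, which is why every prompt file in this diff doubles them.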
--- a/src/lerobot/annotations/steerable_pipeline/prompts/module_0_vocabulary.txt
+++ b/src/lerobot/annotations/steerable_pipeline/prompts/module_0_vocabulary.txt
@@ -1,53 +0,0 @@
-You are inspecting {n_episodes} sample episode video(s) from a teleoperated
-robot dataset. Every episode in the dataset performs the SAME task; the
-user originally asked: "{episode_task}".
-
-Watch all the clips and produce a SHORT canonical vocabulary that every
-episode in this dataset will reuse. The downstream low-level policy is
-conditioned on these strings — duplicate phrasings (e.g. "grasp blue
-cube" vs "pick up the blue cube") would destroy the conditioning, so
-pick one wording per concept and reuse it everywhere.
-
-Decide how many entries each list needs YOURSELF based on what you see —
-the smallest set that still covers every recurring phase in the demos.
-A simple two-object pick-and-place might need ~6 subtask labels and 2
-memory milestones; a long multi-step recipe needs more. Err on the side
-of FEWER — extra entries that don't recur across episodes weaken the
-conditioning.
-
-You output two lists:
-
-1. `subtasks`: imperative, telegraphic commands the robot can execute.
-   - Verb-first. Drop articles, adverbs, qualifiers.
-   - Consistent object nouns (if the task says "cube", every subtask says
-     "cube" — never "block" / "object").
-   - Atomic — one skill per subtask (gripper-open events, contact, regrasps,
-     transitions all become cut points).
-   - Each label must recur across the demos. If you see a motion only
-     once across all sample clips, it probably isn't a canonical phase.
-   - Good: "move to blue cube", "grasp blue cube", "lift blue cube",
-     "place blue cube in box", "release blue cube", "retract arm".
-   - Bad: "the robot arm moves towards the blue cube" (third person,
-     too long), "carefully pick up the cube" (adverb, article),
-     "carrying the yellow cube over the green basket" (gerund — should
-     be imperative "transport yellow cube to green basket").
-
-2. `memory_milestones`: first-person past-tense sentences the running
-   memory composes from. Each subtask phase that produces a lasting
-   change should have a milestone; transient motions (move, retract)
-   should NOT.
-   - First person, past tense. Start with "I".
-   - One sentence. Functional outcome only — no grasp / motion detail.
-   - Good: "I picked up the blue cube.", "I placed the blue cube in
-     the green box.", "I wiped the counter."
-   - Bad: "The robot arm grasped the blue cube." (third person),
-     "I carefully grasped the blue cube with the parallel gripper."
-     (irrelevant detail), "I moved towards the blue cube." (transient
-     motion — should be omitted, not memorialised).
-
-Output strictly valid JSON of shape:
-
-  {{
-    "subtasks": ["<verb phrase>", ...],
-    "memory_milestones": ["I <past-tense sentence>.", ...]
-  }}
--- a/src/lerobot/annotations/steerable_pipeline/prompts/module_1_memory.txt
+++ b/src/lerobot/annotations/steerable_pipeline/prompts/module_1_memory.txt
@@ -1,36 +0,0 @@
-You are updating the robot's compressed semantic memory at the boundary of
-a completed subtask.
-
-Reference (verbatim from MEM, Torne 2026):
-"Remove or compress information in the language memory whenever
-appropriate. Keep ONLY the minimal set of relevant information for future
-task execution. Specific object attributes (colors, precise quantities of
-each item) get discarded when their details won't affect subsequent
-actions. Functional outcomes (where items went, how many) are preserved."
-
-Episode task: "{episode_task}"
-Previous memory: {prior_memory}
-Just-completed subtask: "{completed_subtask}"
-Remaining subtasks (for relevance judgement only): {remaining_subtasks}
-
-{vocabulary_block}Write the memory as a short FIRST-PERSON, PAST-TENSE narrative of what the
-robot has accomplished so far — the running story it would tell itself.
-
-Authoring rules:
-- First person, past tense. Every sentence starts with "I": "I picked
-  up...", "I opened...", "I moved to...".
-- One or two short sentences. Extend the previous memory with the
-  just-completed subtask; do not rewrite it from scratch.
-- Keep WHAT happened (functional outcomes — where items went, how many),
-  drop HOW (grasp details, motions).
-- Compress completed steps and drop object attributes (colors, exact
-  counts) once they no longer affect the remaining subtasks.
-
-Example (MEM, Torne 2026):
-  Before: "I prepared the pot and got the potatoes, milk, and butter. I
-           moved to the drawer."
-  After:  "I prepared the pot and got the ingredients. I opened the
-           drawer with the masher."
-
-Output strictly valid JSON:
-  {{ "memory": "<one or two short first-person past-tense sentences>" }}
--- a/src/lerobot/annotations/steerable_pipeline/prompts/module_1_subtasks.txt
+++ b/src/lerobot/annotations/steerable_pipeline/prompts/module_1_subtasks.txt
@@ -1,80 +0,0 @@
-You are labeling a teleoperated robot demonstration.
-
-The user originally asked: "{episode_task}"
-
-You are shown the entire demonstration as a single video. Watch the
-whole clip, then segment it into a list of consecutive atomic subtasks
-the robot performs.
-
-{vocabulary_block}Authoring rules — Hi Robot atom granularity, pi0.7-style short prompts:
-
-- Each subtask = one COMPOSITE atomic skill the low-level policy can
-  execute end-to-end. A "skill" bundles its own approach motion with
-  its terminal action — do NOT split the approach off as its own
-  subtask. The whole-arm policy already learns to reach as part of
-  every manipulation primitive.
-- Write each subtask as an IMPERATIVE COMMAND, starting with one of
-  these verbs (extend only when none fits):
-    pick up <obj>           — approach + grasp + lift in one subtask
-    put <obj> on/in <loc>   — transport + release in one subtask
-    place <obj> on/in <loc> — synonym of "put"; pick one and stay consistent
-    push <obj>              — contact + linear shove
-    pull <obj>              — contact + linear retract
-    turn <knob/dial/handle> — rotary actuation
-    press <button>          — single-press contact
-    open <drawer/door/lid>  — full open motion
-    close <drawer/door/lid> — full close motion
-    pour <src> into <dst>   — tilt + flow
-    insert <obj> into <slot>— alignment + push-fit
-    go to <loc>             — ONLY when no grasp / actuation follows
-                             (e.g. a pure relocation between phases).
-                             If the next subtask grasps something at
-                             that location, drop "go to ..." and just
-                             write "pick up ..." instead.
-- Forbidden ultra-fine splits — the VLM is NOT allowed to emit these
-  as standalone subtasks; fold them into the parent composite:
-    "move to X"   → fold into "pick up X" (or whatever follows)
-    "reach for X" → fold into "pick up X"
-    "grasp X"     → fold into "pick up X"
-    "lift X"      → fold into "pick up X" (or "put X on Y" if it's
-                    the transport phase of a place)
-    "release X"   → fold into "put X on Y" (or "place X in Y")
-- Keep it SHORT — a verb phrase, not a sentence. Drop articles
-  ("the", "a") and adverbs ("carefully", "slowly"). Add a "how"
-  detail (which hand, which grasp point) ONLY when it is needed to
-  disambiguate. Every subtask must begin with one of the verbs
-  above (no leading nouns, no "then", no "first").
-- NEVER use third person. Never write "the robot", "the arm", "the
-  gripper moves", "it picks up" — the robot is implied. Command it,
-  do not describe it.
-- Use the exact object nouns from the task above. If the task says
-  "cube", every subtask says "cube" — never switch to "block". If it
-  says "box", never switch to "bin"/"container". Keep vocabulary
-  consistent across the whole episode.
-- Good: "pick up blue cube", "put blue cube in box", "open drawer",
-  "turn red knob", "press start button", "go to sink".
-- Bad: "move to blue cube" (approach as its own subtask — forbidden,
-  must be folded into "pick up blue cube"); "the robot arm moves
-  towards the blue cube" (third person, too long); "carefully pick
-  up the cube" (adverb, article); "release the yellow block"
-  ("block" when the task said "cube", and "release" must be folded
-  into a "put"/"place" subtask).
-- Subtasks are non-overlapping and cover the full episode in order.
-  Choose the cut points yourself based on what you see in the video
-  (gripper open/close events, contact, regrasps, transitions).
-- Each subtask spans at least {min_subtask_seconds} seconds. If a
-  candidate span would be shorter, merge it into its neighbour
-  rather than emitting it.
-- Do not exceed {max_steps} subtasks total. Fewer, larger composites
-  are preferred over many micro-steps.
-- Every subtask's [start_time, end_time] must lie within
-  [0.0, {episode_duration}] seconds.
-
-Output strictly valid JSON of shape:
-
-  {{
-    "subtasks": [
-      {{"text": "<short imperative verb phrase>", "start": <float>, "end": <float>}},
-      ...
-    ]
-  }}
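Review note on the JSON shape requested above: the removed `_generate_subtasks` did not trust the returned timestamps and post-processed them. The block below is a minimal sketch of that clamp-and-sort step under assumptions: `clean_spans` is an illustrative name, the input is a raw JSON string rather than the parsed VLM result, and vocabulary checks and same-frame de-duplication are left out.

```python
from __future__ import annotations

import json
from typing import Any


def clean_spans(raw_json: str, t0: float, t_last: float) -> list[dict[str, Any]]:
    """Parse a ``{"subtasks": [...]}`` payload, clamp spans to [t0, t_last], sort by start."""
    payload = json.loads(raw_json)
    spans = payload.get("subtasks") if isinstance(payload, dict) else None
    cleaned: list[dict[str, Any]] = []
    for span in spans or []:
        try:
            start = float(span["start"])
            end = float(span["end"])
            text = str(span["text"]).strip()
        except (KeyError, ValueError, TypeError):
            continue  # malformed span: skip it rather than fail the episode
        start = max(t0, min(start, t_last))
        end = max(t0, min(end, t_last))
        if end < start:
            start, end = end, start
        if not text:
            continue
        cleaned.append({"text": text, "start": start, "end": end})
    cleaned.sort(key=lambda s: s["start"])
    return cleaned


raw = json.dumps(
    {
        "subtasks": [
            {"text": "put blue cube in box", "start": 4.0, "end": 12.5},
            {"text": "pick up blue cube", "start": -1.0, "end": 4.0},
            {"text": "open drawer"},  # missing timestamps, dropped
        ]
    }
)
result = clean_spans(raw, t0=0.0, t_last=10.0)
for span in result:
    print(span)
```

Clamping instead of rejecting keeps a span whose timestamps overshoot the episode slightly, which matters because the prompt only asks the model to stay within `[0.0, {episode_duration}]` and cannot enforce it.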
--- a/src/lerobot/annotations/steerable_pipeline/prompts/module_1_task_rephrasings.txt
+++ b/src/lerobot/annotations/steerable_pipeline/prompts/module_1_task_rephrasings.txt
@@ -1,32 +0,0 @@
-You are generating training data for a Hi Robot-style policy. We need
-{n} alternative phrasings of the same robot task so the policy sees
-diverse user prompts during training instead of the same canonical
-string repeated every frame.
-
-Original task:
-"{base_task}"
-
-Generate exactly {n} alternative phrasings of the same task. Vary:
-
-- formality (casual / polite / curt)
-- verbosity (mostly short imperative; occasional polite request)
-- word choice (synonyms, different verbs)
-- sentence structure (imperative / question / suggestion)
-
-Hard rules:
-- Each phrasing MUST preserve the exact meaning of the original task.
-  Do not change which object is involved, the destination, or the
-  action. Do not add extra steps. Do not invent new objects.
-- Each phrasing must be a short phrase or sentence, plain prose, no
-  markdown, no quotes, no list numbers.
-- Phrasings must be distinct — no near-duplicates.
-- Output exactly {n} entries.
-
-Output strictly valid JSON:
-  {{
-    "rephrasings": [
-      "<phrasing 1>",
-      "<phrasing 2>",
-      ...
-    ]
-  }}
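Review note on the `rephrasings` field requested above: the removed `_generate_task_rephrasings` cleaned the returned list defensively, since the model may ignore the "no quotes" and "exactly {n}" rules. The block below is a minimal sketch of that clean-up, assuming the JSON has already been parsed; `clean_rephrasings` is an illustrative name, not a function from the diff.

```python
from __future__ import annotations

from typing import Any


def clean_rephrasings(raw: Any, n: int) -> list[str]:
    """Keep at most ``n`` non-empty strings, with surrounding quotes stripped."""
    if n <= 0 or not isinstance(raw, list):
        return []
    out = [item.strip().strip('"').strip("'") for item in raw if isinstance(item, str)]
    return [s for s in out if s][:n]


raw = [
    '"Put the cube in the box"',   # quoted despite the rules
    "   ",                         # whitespace only, dropped
    42,                            # not a string, dropped
    "Drop the cube into the box",
    "Box the cube",                # beyond n, truncated
]
kept = clean_rephrasings(raw, n=2)
print(kept)
```

The function can return fewer than `n` entries when the model under-delivers; callers should treat `n` as an upper bound.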
--- a/src/lerobot/annotations/steerable_pipeline/prompts/module_1_video_task.txt
+++ b/src/lerobot/annotations/steerable_pipeline/prompts/module_1_video_task.txt
@@ -1,17 +0,0 @@
-The video above shows a robot manipulation episode in full. Look at
-the entire video and describe in ONE concise sentence what the robot
-is doing.
-
-Rules:
-- One sentence, in natural English, like a user instruction.
-- Capture the goal of the demonstration, not low-level motions.
-  Example: "place the yellow cube into the red bin" — not "move the
-  end-effector down 5cm and close the gripper".
-- 4 to 15 words. Plain prose, no markdown, no bullets, no quotes.
-- Do not invent objects or actions that aren't visible.
-- Do not output anything other than the JSON object below.
-
-Output strictly valid JSON:
-  {{
-    "task": "<single concise sentence describing what the robot does in this video>"
-  }}
--- a/src/lerobot/annotations/steerable_pipeline/prompts/module_2_initial_speech.txt
+++ b/src/lerobot/annotations/steerable_pipeline/prompts/module_2_initial_speech.txt
@@ -1,12 +0,0 @@
-The user just asked the robot: "{episode_task}".
-
-Generate a short verbal acknowledgement the robot would speak back before
-beginning the task. Style: compact, confident, friendly.
-
-Examples (Hi Robot, Shi 2025): "Sure, I won't put cheese on it.",
-"OK, starting with the sponge.", "Got it.".
-
-Prefer very short replies: "Got it.", "On it.", "OK."
-
-Output strictly valid JSON:
-  {{ "text": "<the spoken acknowledgement>" }}
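Note on the `{{ ... }}` escapes in these prompt files: the doubled braces and the `{name:.2f}` fields suggest the templates were rendered with Python's `str.format`, although the rendering code is not shown in this part of the diff. The sketch below assumes that; the template string is a shortened illustration stitched from two of the prompts above, not a copy of any single file.

```python
# Hypothetical, shortened template: doubled braces survive str.format as
# literal braces, while single-brace fields are substituted.
template = (
    'The user just asked the robot: "{episode_task}".\n'
    "Time into episode: {timestamp:.2f}s\n"
    "Output strictly valid JSON:\n"
    '  {{ "text": "<the spoken acknowledgement>" }}\n'
)

rendered = template.format(
    episode_task="place the yellow cube into the red bin",
    timestamp=3.14159,
)
print(rendered)
```

Under that assumption, a JSON skeleton written with single braces inside the template would raise `KeyError` or `ValueError` at format time, which is why every literal brace in the prompts is doubled.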
--- a/src/lerobot/annotations/steerable_pipeline/prompts/module_2_interjection.txt
+++ b/src/lerobot/annotations/steerable_pipeline/prompts/module_2_interjection.txt
@@ -1,46 +0,0 @@
-You are generating training data for a Hi Robot-style hierarchical
-robot policy. The robot in this demonstration has ALREADY executed
-every step shown in the video — we cannot retroactively change the
-action stream. To keep training data consistent with the video, the
-"interjection" must align with what the robot is *about to do next* in
-the demonstration, framed as a natural mid-task user request.
-
-The episode's overall task: "{episode_task}".
-
-The images above show roughly {window_seconds:.1f} seconds straddling a
-subtask boundary in the demonstration:
-
-- Subtask the robot just finished: "{prev_subtask}"
-- Subtask the robot is about to start: "{next_subtask}"
-- Time into episode: {timestamp:.2f}s
-
-Write ONE compact interjection the user would naturally say at this
-moment to prompt / confirm / encourage the robot to do "{next_subtask}".
-Keep it like a mid-task coaching cue, not a full instruction paragraph.
-Also write the robot's compact verbal acknowledgement.
-
-Hard rules:
-
-- The interjection MUST be consistent with the next subtask. The user
-  cannot ask for something different from what the robot then does in
-  the video. If you're tempted to say "actually skip X" or "do Y
-  instead", DO NOT — those would contradict the demonstration.
-- The interjection must reference an object, location, or action that
-  is plausible given the visible scene and the next subtask text.
-- One short phrase or sentence each. Conversational, not robotic.
-- Prefer direct cues: "{next_subtask}, please."; "Now {next_subtask}."
-- Keep robot speech very short: "OK.", "On it.", "Doing that."
-
-Style examples (vary the phrasing — don't reuse these verbatim):
-  - "Now go ahead and {next_subtask}."
-  - "Great, can you {next_subtask} next?"
-  - "{next_subtask}, please."
-  - "Before you continue, please {next_subtask}."
-  - "Looking good — {next_subtask} now."
-  - "Okay, {next_subtask}."
-
-Output strictly valid JSON:
-  {{
-    "interjection": "<short cue from the user, asking for the next subtask>",
-    "speech":       "<short robot acknowledgement>"
-  }}
--- a/src/lerobot/annotations/steerable_pipeline/prompts/module_3_vqa.txt
+++ b/src/lerobot/annotations/steerable_pipeline/prompts/module_3_vqa.txt
@@ -1,57 +0,0 @@
-You are generating a frame-grounded visual question/answer pair for
-chain-of-thought training. Reference: ECoT (Zawalski 2024) and Steerable
-Policies — both train policies on grounded features such as bounding box
-pixel coordinates, keypoints, counts, attributes, and spatial relations.
-
-The frame shows a robot working on: "{episode_task}".
-
-QUALITY BAR — read before answering:
-
-- Only label objects you are highly confident about. If you are not
-  sure what an object is, do NOT include it. A short, certain answer
-  beats a long, speculative one.
-- For coordinate-grounded answers (bbox, keypoint) only emit a label
-  when you can localize the object *tightly and precisely*. If the
-  object is occluded, ambiguous, off-frame, or you can't pin its
-  extent, return an empty detections list / pick a different object
-  rather than guessing.
-- Prefer task-relevant objects (the thing the robot is manipulating
-  or interacting with) over background clutter.
-
-Question types and the EXACT answer JSON shape required for each:
-
-  bbox       => {{"detections": [{{"label": "<obj>", "bbox_format": "xyxy",
-                                    "bbox": [x1, y1, x2, y2]}}, ...]}}
-                Pixel coordinates (x_min, y_min, x_max, y_max). Emit
-                AT MOST 3 detections, and *only* the highest-confidence
-                ones — 1 tight, certain detection is preferred over 3
-                loose ones. Each box must be tight (no >10% padding
-                around the object) and the label must be specific
-                ("red mug" not "object"). Return an empty list if no
-                object meets the bar.
-                ECoT example: "a white cup [124, 25, 176, 113]".
-
-  keypoint   => {{"label": "<point>", "point_format": "xy",
-                  "point": [x, y]}}
-                Pick ONE high-confidence, precisely-localizable point
-                (e.g. a graspable handle, a button center, the gripper
-                tip). The point must land within a few pixels of the
-                feature. Do not emit a coarse "somewhere on the object"
-                point — pick a different question type if no such
-                point exists in this frame.
-
-  count      => {{"label": "<obj>", "count": <int>,
-                  "note": "<optional short note>"}}
-
-  attribute  => {{"label": "<obj>", "attribute": "<color|shape|state|...>",
-                  "value": "<observed value>"}}
-
-  spatial    => {{"subject": "<obj>", "object": "<obj>",
-                  "relation": "<left_of|right_of|on|in|above|below|near>"}}
-
-Generate a question of type "{question_type}". Output strictly valid JSON:
-
-  {{
-    "question": "<short, frame-grounded question>",
-    "answer":   <object whose shape matches the schema above>
-  }}
--- a/src/lerobot/annotations/steerable_pipeline/reader.py
+++ b/src/lerobot/annotations/steerable_pipeline/reader.py
@@ -1,274 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Datatrove-shaped reader.
-
-The reader walks ``data/chunk-*/file-*.parquet`` and yields one record per
-episode containing:
-
-- ``episode_index``: int
-- ``frame_timestamps``: tuple[float, ...]
-- ``frame_indices``: tuple[int, ...]
-- ``episode_task``: str (canonical task from ``meta/tasks.parquet``)
-- ``data_path``: pathlib.Path of the source parquet shard
-- ``frames_df``: pandas.DataFrame slice for the episode (only loaded on demand)
-
-This shape lets each module operate per-episode without loading all parquet
-rows into memory at once.
-"""
-
-from __future__ import annotations
-
-from collections.abc import Iterator, Sequence
-from dataclasses import dataclass, field
-from pathlib import Path
-from typing import Any
-
-import pyarrow.parquet as pq
-
-from lerobot.datasets.io_utils import load_tasks
-from lerobot.datasets.utils import DEFAULT_TASKS_PATH
-
-
-@dataclass
-class EpisodeRecord:
-    """Per-episode record yielded by the reader."""
-
-    episode_index: int
-    episode_task: str
-    frame_timestamps: tuple[float, ...]
-    frame_indices: tuple[int, ...]
-    data_path: Path
-    row_offset: int  # row offset within the parquet file where this episode starts
-    row_count: int  # number of rows for this episode
-
-    # Memoized parquet slice — populated on first ``frames_df()`` call so
-    # repeat queries from different modules don't re-read the whole shard.
-    _frames_df_cache: Any = field(default=None, init=False, repr=False, compare=False)
-
-    def frames_df(self):  # type: ignore[no-untyped-def]
-        """Lazy-load the pandas slice for this episode (memoized)."""
-        if self._frames_df_cache is None:
-            import pandas as pd  # noqa: PLC0415  - deferred for optional dataset extra
-
-            table = pq.read_table(self.data_path)
-            df: pd.DataFrame = table.to_pandas()
-            self._frames_df_cache = df.iloc[self.row_offset : self.row_offset + self.row_count].reset_index(
-                drop=True
-            )
-        return self._frames_df_cache
-
-
-def reconstruct_subtask_spans(
-    rows: Sequence[dict[str, Any]],
-    *,
-    episode_end_t: float | None = None,
-) -> list[dict[str, Any]]:
-    """Turn ``style="subtask"`` rows into ``{text, start, end}`` spans.
-
-    Each span's ``end`` is the next span's ``start``. The final span's
-    ``end`` defaults to its own ``start`` (zero-duration) — pass
-    ``episode_end_t`` to extend it to the episode's last frame instead,
-    which is what downstream consumers (memory, interjection boundary
-    selection) expect.
-
-    Used by the ``plan`` module (plan-update pass) and the
-    ``interjections`` module (interjection anchoring), which both need the
-    same span shape.
-    """
-    sorted_rows = sorted(
-        (r for r in rows if r.get("style") == "subtask"),
-        key=lambda r: float(r["timestamp"]),
-    )
-    spans: list[dict[str, Any]] = []
-    for r in sorted_rows:
-        t = float(r["timestamp"])
-        if spans:
-            spans[-1]["end"] = t
-        spans.append({"text": r.get("content") or "", "start": t, "end": t})
-    if spans and episode_end_t is not None and float(episode_end_t) > spans[-1]["start"]:
-        spans[-1]["end"] = float(episode_end_t)
-    return spans
-
-
-def snap_to_frame(t: float, frame_timestamps: Sequence[float]) -> float:
-    """Snap an arbitrary float to the nearest exact source frame timestamp.
-
-    Modules use this when emitting event-style rows so the row's
-    timestamp matches a real parquet frame: event rows must land on an
-    exact frame, otherwise the per-frame event lookup the writer does
-    would never match them.
-    """
-    if not frame_timestamps:
-        return float(t)
-    nearest = min(frame_timestamps, key=lambda f: abs(f - t))
-    return float(nearest)
-
-
-def _load_tasks_lookup(root: Path) -> dict[int, str]:
-    """Map ``task_index -> task`` from ``meta/tasks.parquet``.
-
-    Returns an empty dict when the file is absent — the task description is
-    derived later from the video if needed. Reuses the library-level
-    :func:`lerobot.datasets.io_utils.load_tasks`, which returns the tasks
-    frame indexed by task string with a ``task_index`` column.
-    """
-    if not (root / DEFAULT_TASKS_PATH).exists():
-        return {}
-    tasks = load_tasks(root)
-    return {int(idx): str(task) for task, idx in zip(tasks.index, tasks["task_index"], strict=True)}
-
-
-def iter_episodes(root: Path, *, only_episodes: tuple[int, ...] | None = None) -> Iterator[EpisodeRecord]:
-    """Yield :class:`EpisodeRecord` for every episode under ``root/data/``.
-
-    Episodes are yielded in ascending ``episode_index`` order. The reader does
-    not assume a specific chunk/file layout: it scans every ``*.parquet``
-    under ``data/`` and groups by ``episode_index``.
-    """
-    tasks = _load_tasks_lookup(root)
-    data_dir = root / "data"
-    parquet_files = sorted(data_dir.rglob("*.parquet"))
-
-    only_set = set(only_episodes) if only_episodes is not None else None
-
-    for path in parquet_files:
-        yield from _iter_one_path(path, tasks, only_set)
-
-
-def _iter_one_path(path: Path, tasks: dict[int, str], only_set: set[int] | None) -> Iterator[EpisodeRecord]:
-    table = pq.read_table(path)
-    names = table.column_names
-    if "episode_index" not in names:
-        return
-    episode_col = table.column("episode_index").to_pylist()
-    timestamp_col = (
-        table.column("timestamp").to_pylist() if "timestamp" in names else [0.0] * len(episode_col)
-    )
-    frame_col = (
-        table.column("frame_index").to_pylist() if "frame_index" in names else list(range(len(episode_col)))
-    )
-    task_col = table.column("task_index").to_pylist() if "task_index" in names else None
-
-    def _build(
-        ep: int,
-        start: int,
-        end: int,
-        task_idx: int | None,
-        ts_buf: list[float],
-        fi_buf: list[int],
-    ) -> EpisodeRecord | None:
-        if only_set is not None and ep not in only_set:
-            return None
-        task = tasks.get(task_idx, "") if task_idx is not None else ""
-        return EpisodeRecord(
-            episode_index=ep,
-            episode_task=task,
-            frame_timestamps=tuple(ts_buf),
-            frame_indices=tuple(fi_buf),
-            data_path=path,
-            row_offset=start,
-            row_count=end - start,
-        )
-
-    cur_ep: int | None = None
-    start_offset = 0
-    ts_buf: list[float] = []
-    fi_buf: list[int] = []
-    cur_task_idx: int | None = None
-
-    for i, ep in enumerate(episode_col):
-        if cur_ep is None:
-            cur_ep = ep
-            start_offset = i
-            ts_buf = [timestamp_col[i]]
-            fi_buf = [frame_col[i]]
-            cur_task_idx = task_col[i] if task_col is not None else None
-            continue
-        if ep != cur_ep:
-            rec = _build(cur_ep, start_offset, i, cur_task_idx, ts_buf, fi_buf)
-            if rec is not None:
-                yield rec
-            cur_ep = ep
-            start_offset = i
-            ts_buf = [timestamp_col[i]]
-            fi_buf = [frame_col[i]]
-            cur_task_idx = task_col[i] if task_col is not None else None
-        else:
-            ts_buf.append(timestamp_col[i])
-            fi_buf.append(frame_col[i])
-
-    if cur_ep is not None:
-        rec = _build(cur_ep, start_offset, len(episode_col), cur_task_idx, ts_buf, fi_buf)
-        if rec is not None:
-            yield rec
-
-
-def gather_data_paths(root: Path) -> list[Path]:
-    """Return every ``*.parquet`` path under ``root/data``, sorted (any chunk/file layout)."""
-    return sorted((root / "data").rglob("*.parquet"))
-
-
-def episode_offsets_per_path(path: Path) -> dict[int, tuple[int, int]]:
-    """Return ``{episode_index: (row_offset, row_count)}`` for one parquet."""
-    table = pq.read_table(path, columns=["episode_index"])
-    episode_col = table.column("episode_index").to_pylist()
-    out: dict[int, tuple[int, int]] = {}
-    cur_ep: int | None = None
-    start = 0
-    for i, ep in enumerate(episode_col):
-        if cur_ep is None:
-            cur_ep = ep
-            start = i
-            continue
-        if ep != cur_ep:
-            out[cur_ep] = (start, i - start)
-            cur_ep = ep
-            start = i
-    if cur_ep is not None:
-        out[cur_ep] = (start, len(episode_col) - start)
-    return out
-
-
-def keyframe_indices(record: EpisodeRecord, k: int) -> list[int]:
-    """Return ``k`` evenly spaced row indices into the episode (relative)."""
-    n = record.row_count
-    if k <= 0 or n == 0:
-        return []
-    if k >= n:
-        return list(range(n))
-    step = (n - 1) / (k - 1) if k > 1 else 0.0
-    return [int(round(i * step)) for i in range(k)] if k > 1 else [n // 2]
-
-
-def lookup_data_path(root: Path, episode_index: int) -> tuple[Path, int, int] | None:
-    """Find the parquet file containing ``episode_index`` and its slice bounds."""
-    for path in gather_data_paths(root):
-        offsets = episode_offsets_per_path(path)
-        if episode_index in offsets:
-            start, count = offsets[episode_index]
-            return path, start, count
-    return None
-
-
-def episode_frame_timestamps(root: Path, episode_index: int) -> tuple[Path, list[float]]:
-    """Return the parquet path and per-frame timestamps for ``episode_index``."""
-    found = lookup_data_path(root, episode_index)
-    if found is None:
-        raise ValueError(f"Episode {episode_index} not found under {root}/data/")
-    path, start, count = found
-    table = pq.read_table(path, columns=["timestamp"])
-    timestamps = table.column("timestamp").to_pylist()[start : start + count]
-    return path, [float(t) for t in timestamps]
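Reviewer sketch of what the removed `reconstruct_subtask_spans` and `snap_to_frame` helpers did. The two functions are re-implemented standalone so the example runs without `lerobot` or parquet; the sample rows and frame timestamps are made up for illustration.

```python
from __future__ import annotations

from collections.abc import Sequence
from typing import Any


def reconstruct_subtask_spans(
    rows: Sequence[dict[str, Any]], *, episode_end_t: float | None = None
) -> list[dict[str, Any]]:
    """Each subtask row opens a span; the next subtask's timestamp closes it."""
    sorted_rows = sorted(
        (r for r in rows if r.get("style") == "subtask"),
        key=lambda r: float(r["timestamp"]),
    )
    spans: list[dict[str, Any]] = []
    for r in sorted_rows:
        t = float(r["timestamp"])
        if spans:
            spans[-1]["end"] = t
        spans.append({"text": r.get("content") or "", "start": t, "end": t})
    # Without episode_end_t the last span stays zero-duration.
    if spans and episode_end_t is not None and float(episode_end_t) > spans[-1]["start"]:
        spans[-1]["end"] = float(episode_end_t)
    return spans


def snap_to_frame(t: float, frame_timestamps: Sequence[float]) -> float:
    """Return the frame timestamp closest to t (t itself if there are no frames)."""
    if not frame_timestamps:
        return float(t)
    return float(min(frame_timestamps, key=lambda f: abs(f - t)))


# Made-up rows: unsorted on purpose, with one non-subtask row that is ignored.
rows = [
    {"style": "subtask", "timestamp": 1.0, "content": "lift the cube"},
    {"style": "plan", "timestamp": 0.0, "content": "ignored by the span builder"},
    {"style": "subtask", "timestamp": 0.0, "content": "reach for the cube"},
]
spans = reconstruct_subtask_spans(rows, episode_end_t=2.5)
snapped = snap_to_frame(0.17, [0.0, 0.1, 0.2, 0.3])
print(spans, snapped)
```

Snapping matters because the validator later in this diff compares event timestamps against frame timestamps with exact equality by default (`timestamp_atol = 0.0`).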
--- a/src/lerobot/annotations/steerable_pipeline/staging.py
+++ b/src/lerobot/annotations/steerable_pipeline/staging.py
@@ -1,104 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Per-episode staging.
-
-Each module writes its raw output as a JSONL file under
-``<staging_dir>/episode_{ep:06d}/<module>.jsonl``. The writer reads back this
-staging tree and partitions rows into the two language columns.
-
-JSONL is preferred over parquet here because the staging artifact is meant to
-be human-inspectable, easy to diff between prompt iterations, and trivially
-appended to. The final dataset format is parquet; staging is just an
-intermediate.
-"""
-
-from __future__ import annotations
-
-import json
-from collections.abc import Iterable, Iterator
-from dataclasses import dataclass
-from pathlib import Path
-from typing import Any
-
-ModuleName = str
-
-_MODULES: tuple[ModuleName, ...] = (
-    "plan",
-    "interjections",
-    "vqa",
-)
-
-
-@dataclass
-class EpisodeStaging:
-    """Filesystem layout for a single episode's staged module outputs."""
-
-    root: Path
-    episode_index: int
-
-    @property
-    def episode_dir(self) -> Path:
-        return self.root / f"episode_{self.episode_index:06d}"
-
-    def path_for(self, module: ModuleName) -> Path:
-        if module not in _MODULES:
-            raise ValueError(f"Unknown module {module!r}; expected one of {_MODULES}")
-        return self.episode_dir / f"{module}.jsonl"
-
-    def write(self, module: ModuleName, rows: Iterable[dict[str, Any]]) -> Path:
-        path = self.path_for(module)
-        path.parent.mkdir(parents=True, exist_ok=True)
-        # Atomic replace: a crash mid-write would otherwise leave a
-        # half-written JSONL file that ``read()`` would then fail to
-        # parse. Write to a sibling .tmp and rename so the target path
-        # only ever points at a complete file.
-        tmp_path = path.with_suffix(path.suffix + ".tmp")
-        with tmp_path.open("w", encoding="utf-8") as f:
-            for row in rows:
-                f.write(json.dumps(row, ensure_ascii=False, sort_keys=True))
-                f.write("\n")
-        tmp_path.replace(path)
-        return path
-
-    def read(self, module: ModuleName) -> list[dict[str, Any]]:
-        path = self.path_for(module)
-        if not path.exists():
-            return []
-        out: list[dict[str, Any]] = []
-        with path.open(encoding="utf-8") as f:
-            for line in f:
-                line = line.strip()
-                if line:
-                    out.append(json.loads(line))
-        return out
-
-    def read_all(self) -> dict[ModuleName, list[dict[str, Any]]]:
-        return {m: self.read(m) for m in _MODULES}
-
-    def has(self, module: ModuleName) -> bool:
-        return self.path_for(module).exists()
-
-
-def iter_staged_episodes(root: Path) -> Iterator[int]:
-    """Yield episode indices for which any staging artifact exists."""
-    if not root.exists():
-        return
-    for child in sorted(root.iterdir()):
-        if child.is_dir() and child.name.startswith("episode_"):
-            try:
-                yield int(child.name.removeprefix("episode_"))
-            except ValueError:
-                continue
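Reviewer sketch of the write-then-rename pattern used by the removed `EpisodeStaging.write`. It is a standalone approximation: the helper names `write_jsonl_atomic` and `read_jsonl` are hypothetical, and the episode directory lives in a temporary folder rather than a real staging tree.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl_atomic(path: Path, rows: list[dict]) -> Path:
    """Write rows to a sibling .tmp file, then rename it over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = path.with_suffix(path.suffix + ".tmp")
    with tmp_path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False, sort_keys=True) + "\n")
    # The target path only ever points at a fully written file.
    tmp_path.replace(path)
    return path


def read_jsonl(path: Path) -> list[dict]:
    """Read back one JSON object per non-blank line; a missing file yields []."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as d:
    target = Path(d) / "episode_000003" / "plan.jsonl"
    write_jsonl_atomic(target, [{"style": "subtask", "timestamp": 0.0, "content": "reach"}])
    round_trip = read_jsonl(target)
    leftover_tmp = target.with_suffix(".jsonl.tmp").exists()
    missing = read_jsonl(Path(d) / "episode_000004" / "plan.jsonl")

print(round_trip, leftover_tmp, missing)
```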
--- a/src/lerobot/annotations/steerable_pipeline/validator.py
+++ b/src/lerobot/annotations/steerable_pipeline/validator.py
@@ -1,334 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Pre-write validation against staged outputs.
-
-Runs after all three modules have written their per-episode artifacts but
-*before* the writer rewrites parquet shards. The validator never touches
-parquet; it only inspects the staging tree and the source frame timestamps
-exposed by :class:`EpisodeRecord`.
-
-Checks (per the plan's "Intermediate staging and validation" section):
-
-- exact timestamp alignment against source frame timestamps
-- no orphan speech / interjection pairs
-- plan / memory emission consistency (events have a paired persistent row)
-- VQA assistant ``content`` is valid JSON (one of bbox / keypoint / count /
-  attribute / spatial)
-- every row maps to its correct column under :func:`column_for_style`
-"""
-
-from __future__ import annotations
-
-import json
-import logging
-from collections.abc import Iterable, Sequence
-from dataclasses import dataclass, field
-from pathlib import Path
-from typing import Any
-
-from lerobot.datasets.language import (
-    LANGUAGE_EVENTS,
-    LANGUAGE_PERSISTENT,
-    column_for_style,
-    is_view_dependent_style,
-    validate_camera_field,
-)
-
-from .reader import EpisodeRecord
-from .staging import EpisodeStaging
-
-logger = logging.getLogger(__name__)
-
-
-@dataclass
-class ValidationReport:
-    """Outcome of one validation pass across all episodes."""
-
-    errors: list[str] = field(default_factory=list)
-    warnings: list[str] = field(default_factory=list)
-    episodes_checked: int = 0
-
-    @property
-    def ok(self) -> bool:
-        return not self.errors
-
-    def add_error(self, message: str) -> None:
-        self.errors.append(message)
-
-    def add_warning(self, message: str) -> None:
-        self.warnings.append(message)
-
-    def summary(self) -> str:
-        return f"checked={self.episodes_checked} errors={len(self.errors)} warnings={len(self.warnings)}"
-
-
-VQA_ANSWER_SHAPES: dict[str, set[str]] = {
-    "bbox": {"detections"},
-    "keypoint": {"label", "point_format", "point"},
-    "count": {"label", "count"},
-    "attribute": {"label", "attribute", "value"},
-    "spatial": {"subject", "relation", "object"},
-}
-
-
-def classify_vqa_answer(payload: Any) -> str | None:
-    """Best-effort classification of a VQA answer payload to a question type."""
-    if not isinstance(payload, dict):
-        return None
-    keys = set(payload.keys())
-    for kind, required in VQA_ANSWER_SHAPES.items():
-        if required.issubset(keys):
-            return kind
-    return None
-
-
-@dataclass
-class StagingValidator:
-    """Walks the staging tree and produces a :class:`ValidationReport`."""
-
-    timestamp_atol: float = 0.0  # exact-match by default
-    dataset_camera_keys: tuple[str, ...] | None = None
-    """Known ``observation.images.*`` keys on the dataset. When set, the
-    validator additionally enforces that every view-dependent row's
-    ``camera`` field references one of these keys. Pass ``None`` (default)
-    to skip that cross-check (e.g. in unit tests with no real dataset)."""
-
-    def validate(
-        self,
-        records: Sequence[EpisodeRecord],
-        staging_dir: Path,
-    ) -> ValidationReport:
-        report = ValidationReport()
-        for record in records:
-            self._validate_episode(record, staging_dir, report)
-            report.episodes_checked += 1
-        return report
-
-    def _validate_episode(
-        self,
-        record: EpisodeRecord,
-        staging_dir: Path,
-        report: ValidationReport,
-    ) -> None:
-        staging = EpisodeStaging(staging_dir, record.episode_index)
-        staged = staging.read_all()
-        all_rows: list[dict[str, Any]] = []
-        for module_name, rows in staged.items():
-            for row in rows:
-                row = {**row, "_module": module_name}
-                all_rows.append(row)
-
-        frame_ts = set(record.frame_timestamps)
-
-        events: list[dict[str, Any]] = []
-        persistent: list[dict[str, Any]] = []
-        for row in all_rows:
-            self._check_column_routing(row, report, record.episode_index)
-            self._check_camera_field(
-                row, report, record.episode_index, self.dataset_camera_keys
-            )
-            if column_for_style(row.get("style")) == LANGUAGE_PERSISTENT:
-                persistent.append(row)
-            else:
-                events.append(row)
-
-        for row in events:
-            self._check_event_timestamp_alignment(row, frame_ts, report, record.episode_index)
-
-        self._check_speech_interjection_pairs(events, report, record.episode_index)
-        self._check_plan_memory_consistency(persistent, events, report, record.episode_index)
-        self._check_vqa_json(events, report, record.episode_index)
-        self._check_vqa_uniqueness_per_frame_camera(events, report, record.episode_index)
-
-    def _check_camera_field(
-        self,
-        row: dict[str, Any],
-        report: ValidationReport,
-        episode_index: int,
-        dataset_camera_keys: Sequence[str] | None,
-    ) -> None:
-        """Enforce the camera invariant + that the key matches the dataset's cameras."""
-        style = row.get("style")
-        camera = row.get("camera")
-        try:
-            validate_camera_field(style, camera)
-        except ValueError as exc:
-            report.add_error(
-                f"ep={episode_index} module={row.get('_module')}: {exc}"
-            )
-            return
-        if (
-            is_view_dependent_style(style)
-            and dataset_camera_keys
-            and camera not in dataset_camera_keys
-        ):
-            report.add_error(
-                f"ep={episode_index} module={row.get('_module')}: camera {camera!r} on style "
-                f"{style!r} is not one of the dataset's video keys {sorted(dataset_camera_keys)!r}"
-            )
-
-    def _check_vqa_uniqueness_per_frame_camera(
-        self,
-        events: Iterable[dict[str, Any]],
-        report: ValidationReport,
-        episode_index: int,
-    ) -> None:
-        """Ensure at most one (vqa, user) and one (vqa, assistant) per (t, camera)."""
-        counts: dict[tuple[float, str, str], int] = {}
-        for row in events:
-            if row.get("style") != "vqa":
-                continue
-            ts = row.get("timestamp")
-            camera = row.get("camera")
-            role = row.get("role")
-            if ts is None or camera is None or role is None:
-                continue  # other validators flag these
-            key = (float(ts), str(camera), str(role))
-            counts[key] = counts.get(key, 0) + 1
-        for (ts, camera, role), n in counts.items():
-            if n > 1:
-                report.add_error(
-                    f"ep={episode_index}: {n} duplicate vqa rows at t={ts} "
-                    f"camera={camera!r} role={role!r}; expected at most one per (t, camera, role)"
-                )
-
-    def _check_column_routing(
-        self,
-        row: dict[str, Any],
-        report: ValidationReport,
-        episode_index: int,
-    ) -> None:
-        style = row.get("style")
-        module = row.get("_module")
-        try:
-            target_col = column_for_style(style)
-        except ValueError:
-            report.add_error(f"ep={episode_index} module={module}: unknown style {style!r}")
-            return
-        if module == "plan" and target_col != LANGUAGE_PERSISTENT:
-            report.add_error(
-                f"ep={episode_index} module=plan emitted style {style!r} that routes to {target_col} (must be persistent)"
-            )
-        if module in {"interjections", "vqa"} and target_col != LANGUAGE_EVENTS:
-            report.add_error(
-                f"ep={episode_index} module={module} emitted style {style!r} that routes to {target_col} (must be events)"
-            )
-
-    def _check_event_timestamp_alignment(
-        self,
-        row: dict[str, Any],
-        frame_ts: set[float],
-        report: ValidationReport,
-        episode_index: int,
-    ) -> None:
-        ts = row.get("timestamp")
-        if ts is None:
-            report.add_error(f"ep={episode_index}: event row missing timestamp: {row!r}")
-            return
-        if self.timestamp_atol == 0.0:
-            if float(ts) not in frame_ts:
-                report.add_error(
-                    f"ep={episode_index}: event row timestamp {ts!r} does not match any source frame timestamp"
-                )
-        else:
-            if not any(abs(float(ts) - f) <= self.timestamp_atol for f in frame_ts):
-                report.add_error(
-                    f"ep={episode_index}: event row timestamp {ts!r} not within {self.timestamp_atol}s of any frame"
-                )
-
-    def _check_speech_interjection_pairs(
-        self,
-        events: Iterable[dict[str, Any]],
-        report: ValidationReport,
-        episode_index: int,
-    ) -> None:
-        speech_ts: dict[float, int] = {}
-        interjection_ts: dict[float, int] = {}
-        for row in events:
-            ts = row.get("timestamp")
-            if ts is None:
-                continue
-            ts_f = float(ts)
-            if row.get("style") is None and row.get("role") == "assistant":
-                speech_ts[ts_f] = speech_ts.get(ts_f, 0) + 1
-            if row.get("style") == "interjection":
-                interjection_ts[ts_f] = interjection_ts.get(ts_f, 0) + 1
-
-        for ts in interjection_ts:
-            if ts not in speech_ts:
-                report.add_error(f"ep={episode_index}: interjection at t={ts} has no paired speech atom")
-
-    def _check_plan_memory_consistency(
-        self,
-        persistent: Sequence[dict[str, Any]],
-        events: Sequence[dict[str, Any]],
-        report: ValidationReport,
-        episode_index: int,
-    ) -> None:
-        plan_ts = sorted({float(r["timestamp"]) for r in persistent if r.get("style") == "plan"})
-        memory_ts = sorted({float(r["timestamp"]) for r in persistent if r.get("style") == "memory"})
-        subtask_ts = sorted({float(r["timestamp"]) for r in persistent if r.get("style") == "subtask"})
-        interjection_ts = sorted(
-            {
-                float(r["timestamp"])
-                for r in events
-                if r.get("style") == "interjection" and r.get("timestamp") is not None
-            }
-        )
-
-        if persistent and not plan_ts:
-            report.add_warning(f"ep={episode_index}: persistent rows present but no plan emitted")
-        # every interjection should have a same-timestamp plan refresh
-        for ts in interjection_ts:
-            if ts not in set(plan_ts):
-                report.add_error(
-                    f"ep={episode_index}: interjection at t={ts} has no co-timestamped plan update"
-                )
-        # memory should be emitted at subtask boundaries (subset relation)
-        if memory_ts and subtask_ts:
-            mem_set = set(memory_ts)
-            sub_set = set(subtask_ts)
-            stray = sorted(mem_set - sub_set)
-            if stray:
-                report.add_warning(f"ep={episode_index}: memory rows at {stray} not at any subtask boundary")
-
-    def _check_vqa_json(
-        self,
-        events: Iterable[dict[str, Any]],
-        report: ValidationReport,
-        episode_index: int,
-    ) -> None:
-        for row in events:
-            if row.get("style") != "vqa" or row.get("role") != "assistant":
-                continue
-            content = row.get("content")
-            if content is None:
-                report.add_error(
-                    f"ep={episode_index}: VQA assistant row at t={row.get('timestamp')} has null content"
-                )
-                continue
-            try:
-                payload = json.loads(content)
-            except (TypeError, ValueError) as exc:
-                report.add_error(
-                    f"ep={episode_index}: VQA assistant content not valid JSON at t={row.get('timestamp')}: {exc}"
-                )
-                continue
-            shape = classify_vqa_answer(payload)
-            if shape is None:
-                report.add_error(
-                    f"ep={episode_index}: VQA assistant payload at t={row.get('timestamp')} does not match any known shape: keys={list(payload) if isinstance(payload, dict) else type(payload).__name__}"
-                )
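The pairing rule that the removed `_check_speech_interjection_pairs` enforced can be sketched on its own, outside the validator class. This is a minimal illustration, not the deleted implementation: the function name `unpaired_interjections` and the sample rows (including the `"user"` role on interjection rows) are assumptions made for the example. Like the original, it matches timestamps by exact float equality.

```python
from collections.abc import Iterable
from typing import Any


def unpaired_interjections(events: Iterable[dict[str, Any]]) -> list[float]:
    """Return timestamps of interjection rows with no same-timestamp speech row.

    A "speech" row is one with ``style`` of None and ``role`` of "assistant",
    mirroring the condition used in the removed validator.
    """
    speech: set[float] = set()
    interjections: set[float] = set()
    for row in events:
        ts = row.get("timestamp")
        if ts is None:
            # Rows without a timestamp cannot be paired; skip them.
            continue
        if row.get("style") is None and row.get("role") == "assistant":
            speech.add(float(ts))
        if row.get("style") == "interjection":
            interjections.add(float(ts))
    return sorted(interjections - speech)


# Hypothetical event rows: the interjection at t=1.0 has a paired speech row,
# the one at t=2.5 does not.
rows = [
    {"timestamp": 1.0, "style": "interjection", "role": "user"},
    {"timestamp": 1.0, "style": None, "role": "assistant"},
    {"timestamp": 2.5, "style": "interjection", "role": "user"},
]
print(unpaired_interjections(rows))  # [2.5]
```

Returning the offending timestamps instead of writing to a report keeps the sketch easy to test; the removed code reported each one through `report.add_error`.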
--- a/src/lerobot/annotations/steerable_pipeline/vlm_client.py
+++ b/src/lerobot/annotations/steerable_pipeline/vlm_client.py
@@ -1,703 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Shared Qwen-VL client.
-
-The pipeline uses a single shared VLM across modules. vLLM is preferred when
-available (high throughput, JSON-guided decoding); transformers is the
-fallback. A ``stub`` backend is used for unit tests so fixtures never call
-into a real model.
-
-The client speaks one method, :meth:`VlmClient.generate_json`, which:
-
-- accepts a list of OpenAI/HF-style multimodal messages,
-- requests JSON output (``json_mode=True`` enables guided decoding when the
-  backend supports it),
-- batches requests transparently,
-- and reprompts once on a JSON parse failure with an inline correction
-  message before giving up and returning ``None`` for that request.
-"""
-
-from __future__ import annotations
-
-import atexit
-import base64
-import io
-import json
-import os
-import shlex
-import signal
-import subprocess
-import sys
-import threading
-import time
-import urllib.request
-from collections.abc import Callable, Sequence
-from concurrent.futures import ThreadPoolExecutor
-from dataclasses import dataclass
-from typing import Any, Protocol
-
-from .config import VlmConfig
-
-
-class VlmClient(Protocol):
-    """Protocol every backend must implement."""
-
-    def generate_json(
-        self,
-        messages_batch: Sequence[Sequence[dict[str, Any]]],
-        *,
-        max_new_tokens: int | None = None,
-        temperature: float | None = None,
-    ) -> list[Any]:
-        """Generate one JSON-decoded response per messages list."""
-
-
-@dataclass
-class StubVlmClient:
-    """Deterministic stub used in unit tests.
-
-    A test passes a callable that maps the *last user message text* (or, if
-    that is empty, the full message list) to a JSON-serializable response.
-    """
-
-    responder: Callable[[Sequence[dict[str, Any]]], Any]
-
-    def generate_json(
-        self,
-        messages_batch: Sequence[Sequence[dict[str, Any]]],
-        *,
-        max_new_tokens: int | None = None,
-        temperature: float | None = None,
-    ) -> list[Any]:
-        return [self.responder(list(messages)) for messages in messages_batch]
-
-
-def _strip_to_json(text: str) -> Any:
-    text = text.strip()
-    # Strip <think>...</think> blocks (Qwen3 Thinking style)
-    while "<think>" in text and "</think>" in text:
-        start = text.find("<think>")
-        close = text.find("</think>", start)
-        if close == -1:  # stray closer before the opener: nothing left to strip
-            break
-        text = (text[:start] + text[close + len("</think>") :]).strip()
-    # Strip ```json ... ``` fences from chat-tuned backbones
-    if text.startswith("```"):
-        first = text.find("\n")
-        last = text.rfind("```")
-        if first != -1 and last != -1 and last > first:
-            text = text[first + 1 : last].strip()
-    try:
-        return json.loads(text)
-    except (ValueError, json.JSONDecodeError):
-        pass
-    # Fall back to extracting the first balanced {...} block.
-    obj_text = _extract_first_json_object(text)
-    if obj_text is None:
-        raise json.JSONDecodeError("No JSON object found", text, 0)
-    return json.loads(obj_text)
-
-
-def _extract_first_json_object(text: str) -> str | None:
-    """Return the first balanced ``{...}`` substring, ignoring braces in
-    string literals. Returns ``None`` if no balanced block is found."""
-    start = text.find("{")
-    if start < 0:
-        return None
-    depth = 0
-    in_string = False
-    escape = False
-    for i in range(start, len(text)):
-        ch = text[i]
-        if escape:
-            escape = False
-            continue
-        if ch == "\\":
-            escape = True
-            continue
-        # Note: ``escape`` is always False here — the ``if escape`` branch
-        # above already handled and reset it.
-        if ch == '"':
-            in_string = not in_string
-            continue
-        if in_string:
-            continue
-        if ch == "{":
-            depth += 1
-        elif ch == "}":
-            depth -= 1
-            if depth == 0:
-                return text[start : i + 1]
-    return None
-
-
-@dataclass
-class _GenericTextClient:
-    """Wraps any text-generation callable in JSON-mode + one-retry semantics."""
-
-    generate_text: Callable[[Sequence[Sequence[dict[str, Any]]], int, float], list[str]]
-    config: VlmConfig
-
-    def generate_json(
-        self,
-        messages_batch: Sequence[Sequence[dict[str, Any]]],
-        *,
-        max_new_tokens: int | None = None,
-        temperature: float | None = None,
-    ) -> list[Any]:
-        max_tok = max_new_tokens if max_new_tokens is not None else self.config.max_new_tokens
-        temp = temperature if temperature is not None else self.config.temperature
-        raw = self.generate_text(messages_batch, max_tok, temp)
-        out: list[Any] = []
-        for messages, text in zip(messages_batch, raw, strict=True):
-            try:
-                out.append(_strip_to_json(text))
-                continue
-            except (ValueError, json.JSONDecodeError):
-                pass
-            retry = list(messages) + [
-                {"role": "assistant", "content": text},
-                {
-                    "role": "user",
-                    "content": (
-                        "Your previous reply was not valid JSON. "
-                        "Reply with strictly valid JSON, no prose, no fences."
-                    ),
-                },
-            ]
-            retry_text = self.generate_text([retry], max_tok, temp)[0]
-            try:
-                out.append(_strip_to_json(retry_text))
-            except (ValueError, json.JSONDecodeError):
-                # After retry: log preview and return None instead of crashing
-                # the whole pipeline. Modules treat None as "skip".
-                preview = retry_text.strip().replace("\n", " ")[:200]
-                print(
-                    f"[vlm] WARNING: failed to parse JSON after retry; preview: {preview!r}",
-                    flush=True,
-                )
-                out.append(None)
-        return out
-
-
-def make_vlm_client(config: VlmConfig) -> VlmClient:
-    """Build the shared VLM client per the configured backend.
-
-    For ``stub``, callers should construct :class:`StubVlmClient` directly with
-    a responder callable. ``stub`` here is rejected to make accidental misuse
-    obvious.
-    """
-    if config.backend == "stub":
-        raise ValueError(
-            "Use StubVlmClient(...) directly for the stub backend; make_vlm_client builds real clients."
-        )
-    if config.backend == "vllm":
-        return _make_vllm_client(config)
-    if config.backend == "transformers":
-        return _make_transformers_client(config)
-    if config.backend == "openai":
-        return _make_openai_client(config)
-    raise ValueError(f"Unknown VLM backend: {config.backend!r}")
-
-
-def _make_vllm_client(config: VlmConfig) -> VlmClient:
-    try:
-        from vllm import LLM, SamplingParams  # type: ignore[import-not-found]
-    except ImportError as exc:
-        raise ImportError(
-            "vllm is required for backend='vllm'. Install with `pip install lerobot[annotations]`."
-        ) from exc
-    # Workaround for cuDNN 9.x + torch 2.8 conv3d regression that surfaces
-    # as CUDNN_STATUS_NOT_INITIALIZED in Qwen-VL vision-tower patch
-    # embedders. Setting LEROBOT_DISABLE_CUDNN=1 forces native PyTorch
-    # convolution kernels — slower but functional.
-    if os.environ.get("LEROBOT_DISABLE_CUDNN", "").lower() in {"1", "true", "yes"}:
-        import torch as _torch  # noqa: PLC0415  - optional GPU dep, deferred
-
-        _torch.backends.cudnn.enabled = False
-    llm_kwargs: dict[str, Any] = {
-        "model": config.model_id,
-        "tensor_parallel_size": config.tensor_parallel_size,
-        "gpu_memory_utilization": config.gpu_memory_utilization,
-        "trust_remote_code": config.trust_remote_code,
-    }
-    if config.max_model_len is not None:
-        llm_kwargs["max_model_len"] = config.max_model_len
-    llm = LLM(**llm_kwargs)
-
-    def _gen(batch: Sequence[Sequence[dict[str, Any]]], max_tok: int, temp: float) -> list[str]:
-        # ``guided_decoding`` would speed up parsing but its API differs across
-        # vllm releases (dict vs GuidedDecodingParams). The _GenericTextClient
-        # wrapper already has a one-retry JSON-recovery path, so we skip it.
-        params = SamplingParams(max_tokens=max_tok, temperature=temp)
-        # ``llm.chat`` handles chat-template application + multimodal input
-        # extraction (image/video blocks) internally, which ``llm.generate``
-        # does not.
-        outputs = llm.chat([list(m) for m in batch], params)
-        return [o.outputs[0].text for o in outputs]
-
-    return _GenericTextClient(_gen, config)
-
-
-def _make_transformers_client(config: VlmConfig) -> VlmClient:
-    try:
-        import torch  # type: ignore[import-not-found]
-        import transformers  # type: ignore[import-not-found]
-        from transformers import AutoProcessor  # type: ignore[import-not-found]
-    except ImportError as exc:
-        raise ImportError("transformers + torch are required for backend='transformers'.") from exc
-    auto_cls = getattr(transformers, "AutoModelForImageTextToText", None) or getattr(
-        transformers, "AutoModelForVision2Seq", None
-    )
-    if auto_cls is None:
-        raise ImportError(
-            "Neither AutoModelForImageTextToText nor AutoModelForVision2Seq is available in this "
-            "transformers version. Install transformers>=4.45 (which has AutoModelForImageTextToText) "
-            "for VL models."
-        )
-    processor = AutoProcessor.from_pretrained(config.model_id, trust_remote_code=config.trust_remote_code)
-    use_accelerate = os.environ.get("LEROBOT_TRANSFORMERS_DEVICE_MAP", "manual") != "manual"
-    # ``device_map='auto'`` triggers a known std::bad_alloc on the Qwen3-VL
-    # post-load dispatch path (the alloc fails in accelerate's hook setup
-    # even with TBs of host RAM). Default to manual: load on CPU with
-    # ``low_cpu_mem_usage=True``, then ``.to("cuda")``. Set
-    # ``LEROBOT_TRANSFORMERS_DEVICE_MAP=auto`` to opt back into the old path.
-    if use_accelerate:
-        model = auto_cls.from_pretrained(
-            config.model_id,
-            torch_dtype="auto",
-            device_map="auto",
-            low_cpu_mem_usage=True,
-            trust_remote_code=config.trust_remote_code,
-        )
-    else:
-        import torch as _torch  # noqa: PLC0415  - optional GPU dep, deferred
-
-        model = auto_cls.from_pretrained(
-            config.model_id,
-            torch_dtype=_torch.bfloat16,
-            low_cpu_mem_usage=True,
-            trust_remote_code=config.trust_remote_code,
-        )
-        model = model.to("cuda")
-    model.eval()
-
-    def _gen(batch: Sequence[Sequence[dict[str, Any]]], max_tok: int, temp: float) -> list[str]:
-        outs: list[str] = []
-        for messages in batch:
-            text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
-            inputs = processor(text=[text], return_tensors="pt").to(model.device)
-            with torch.no_grad():
-                gen = model.generate(
-                    **inputs,
-                    max_new_tokens=max_tok,
-                    temperature=temp,
-                    do_sample=temp > 0.0,
-                )
-            decoded = processor.batch_decode(
-                gen[:, inputs["input_ids"].shape[-1] :], skip_special_tokens=True
-            )[0]
-            outs.append(decoded)
-        return outs
-
-    return _GenericTextClient(_gen, config)
-
-
-def _make_openai_client(config: VlmConfig) -> VlmClient:
-    """Backend that talks to any OpenAI-compatible server.
-
-    Compatible with ``vllm serve``, ``transformers serve``,
-    ``ktransformers serve``, and hosted endpoints. By default the server
-    is expected to be already running. Set ``auto_serve=True`` to have
-    this client spawn one (default: ``transformers serve``), wait until
-    it's ready, and tear it down on process exit.
-
-    Image blocks ``{"type":"image", "image":<PIL.Image>}`` are
-    auto-converted to ``image_url`` data-URLs. Video blocks
-    ``{"type":"video", "video":[<PIL>...]}`` are forwarded as
-    multi-frame ``video_url`` items where supported.
-    """
-    try:
-        from openai import OpenAI  # type: ignore[import-not-found]
-    except ImportError as exc:
-        raise ImportError(
-            "openai package is required for backend='openai'. Install with `pip install openai`."
-        ) from exc
-
-    api_base = config.api_base
-    api_key = config.api_key
-    auto_serve = config.auto_serve
-    api_bases: list[str] = [api_base]
-
-    print(
-        f"[lerobot-annotate] backend=openai model={config.model_id} "
-        f"api_base={api_base} auto_serve={auto_serve}",
-        flush=True,
-    )
-    if auto_serve:
-        if config.parallel_servers > 1:
-            print(
-                f"[lerobot-annotate] spawning {config.parallel_servers} parallel servers",
-                flush=True,
-            )
-            api_bases = _spawn_parallel_inference_servers(config)
-        elif _server_is_up(api_base):
-            print(f"[lerobot-annotate] reusing server already up at {api_base}", flush=True)
-        else:
-            print("[lerobot-annotate] no server reachable; spawning one", flush=True)
-            api_base = _spawn_inference_server(config)
-            api_bases = [api_base]
-            print(f"[lerobot-annotate] server ready at {api_base}", flush=True)
-
-    clients = [OpenAI(base_url=base, api_key=api_key) for base in api_bases]
-    # round-robin counter for parallel mode
-    rr_counter = {"i": 0}
-
-    # ``mm_processor_kwargs`` is a vllm-specific extra; transformers serve
-    # rejects it with HTTP 422. Send it only when explicitly opted in via
-    # an env var (e.g. ``LEROBOT_OPENAI_SEND_MM_KWARGS=1`` for vllm).
-    send_mm_kwargs = os.environ.get("LEROBOT_OPENAI_SEND_MM_KWARGS", "").lower() in {"1", "true", "yes"}
-
-    rr_lock = threading.Lock()
-
-    def _one_call(messages: Sequence[dict[str, Any]], max_tok: int, temp: float) -> str:
-        api_messages, mm_kwargs = _to_openai_messages(messages)
-        kwargs: dict[str, Any] = {
-            "model": config.model_id,
-            "messages": api_messages,
-            "max_tokens": max_tok,
-            "temperature": temp,
-        }
-        extra_body: dict[str, Any] = {}
-        if send_mm_kwargs and mm_kwargs:
-            extra_body["mm_processor_kwargs"] = {**mm_kwargs, "do_sample_frames": True}
-        if config.chat_template_kwargs:
-            extra_body["chat_template_kwargs"] = config.chat_template_kwargs
-        if extra_body:
-            kwargs["extra_body"] = extra_body
-        with rr_lock:
-            chosen = clients[rr_counter["i"] % len(clients)]
-            rr_counter["i"] += 1
-        response = chosen.chat.completions.create(**kwargs)
-        return response.choices[0].message.content or ""
-
-    def _gen(batch: Sequence[Sequence[dict[str, Any]]], max_tok: int, temp: float) -> list[str]:
-        if len(batch) <= 1 or config.client_concurrency <= 1:
-            return [_one_call(messages, max_tok, temp) for messages in batch]
-        # Parallel fan-out — vllm batches these on the server side.
-        max_workers = min(config.client_concurrency, len(batch))
-        with ThreadPoolExecutor(max_workers=max_workers) as pool:
-            futures = [pool.submit(_one_call, messages, max_tok, temp) for messages in batch]
-            return [f.result() for f in futures]
-
-    return _GenericTextClient(_gen, config)
-
-
-def _spawn_parallel_inference_servers(config: VlmConfig) -> list[str]:
-    """Spawn ``config.parallel_servers`` independent vllm replicas.
-
-    Each replica:
-    - is pinned to a single GPU via ``CUDA_VISIBLE_DEVICES``
-    - listens on ``serve_port + i``
-    - is shut down via an atexit hook, as in the single-server path
-
-    Returns the list of ``api_base`` URLs the client should round-robin
-    across.
-    """
-    n = config.parallel_servers
-    api_bases: list[str] = []
-    procs: list[subprocess.Popen] = []
-    ready_events: list[threading.Event] = []
-    # Multiple readiness signals — uvicorn's own banner is suppressed at
-    # ``--uvicorn-log-level warning``, so we also accept vllm's own
-    # "Starting vLLM API server" line and the route-listing line. The
-    # HTTP probe below is the ultimate fallback.
-    ready_markers = (
-        "Uvicorn running",
-        "Application startup complete",
-        "Starting vLLM API server",
-        "Available routes are",
-    )
-    # Single lock for all server-stream threads so multibyte chars from
-    # different servers don't interleave and tear UTF-8 sequences.
-    print_lock = threading.Lock()
-
-    base_cmd = config.serve_command or (
-        f"vllm serve {shlex.quote(config.model_id)} "
-        f"--tensor-parallel-size 1 "
-        f"--max-model-len {config.max_model_len or 32768} "
-        f"--uvicorn-log-level warning"
-    )
-
-    num_gpus = config.num_gpus if config.num_gpus > 0 else n
-    for i in range(n):
-        port = config.serve_port + i
-        gpu = i % num_gpus
-        env = os.environ.copy()
-        env["CUDA_VISIBLE_DEVICES"] = str(gpu)
-        cmd = base_cmd.replace("{port}", str(port)) if "{port}" in base_cmd else f"{base_cmd} --port {port}"
-        api_base = f"http://localhost:{port}/v1"
-        api_bases.append(api_base)
-        print(f"[server-{i}] launching on GPU {gpu} port {port}: {cmd}", flush=True)
-        proc = subprocess.Popen(
-            shlex.split(cmd),
-            stdout=subprocess.PIPE,
-            stderr=subprocess.STDOUT,
-            text=True,
-            bufsize=1,
-            env=env,
-        )
-        procs.append(proc)
-        ready = threading.Event()
-        ready_events.append(ready)
-
-        def _stream(idx: int, p: subprocess.Popen, ev: threading.Event) -> None:
-            # Read whole lines and emit each line atomically under the
-            # shared print_lock so output from N servers stays readable.
-            assert p.stdout is not None
-            for line in iter(p.stdout.readline, ""):
-                with print_lock:
-                    sys.stdout.write(f"[server-{idx}] {line}")
-                    if not line.endswith(("\n", "\r")):
-                        sys.stdout.write("\n")
-                    sys.stdout.flush()
-                if any(m in line for m in ready_markers):
-                    ev.set()
-
-        threading.Thread(target=_stream, args=(i, proc, ready), daemon=True).start()
-
-        def _probe(idx: int, base: str, ev: threading.Event, p: subprocess.Popen) -> None:
-            while not ev.is_set() and p.poll() is None:
-                if _server_is_up(base):
-                    print(f"[server-{idx}] ready (http probe)", flush=True)
-                    ev.set()
-                    return
-                time.sleep(2)
-
-        threading.Thread(target=_probe, args=(i, api_base, ready, proc), daemon=True).start()
-
-    def _shutdown() -> None:
-        for i, p in enumerate(procs):
-            if p.poll() is None:
-                print(f"[server-{i}] stopping pid={p.pid}", flush=True)
-                p.send_signal(signal.SIGINT)
-        for p in procs:
-            try:
-                p.wait(timeout=15)
-            except subprocess.TimeoutExpired:
-                p.kill()
-                p.wait(timeout=5)
-
-    atexit.register(_shutdown)
-
-    deadline = time.monotonic() + config.serve_ready_timeout_s
-    while any(not ev.is_set() for ev in ready_events) and time.monotonic() < deadline:
-        for i, p in enumerate(procs):
-            if p.poll() is not None:
-                raise RuntimeError(
-                    f"[server-{i}] inference server exited unexpectedly with rc={p.returncode}"
-                )
-        time.sleep(2)
-    if any(not ev.is_set() for ev in ready_events):
-        raise RuntimeError(f"[server] not all replicas became ready within {config.serve_ready_timeout_s}s")
-    print(f"[lerobot-annotate] all {n} servers ready: {api_bases}", flush=True)
-    return api_bases
-
-
-def _server_is_up(api_base: str) -> bool:
-    """Return True if ``api_base/models`` answers 200 within 2 seconds."""
-    url = api_base.rstrip("/") + "/models"
-    # ``api_base`` is the user-configured local-server URL we just spawned
-    # or the user passed in via ``--vlm.api_base``; the bandit B310 warning
-    # is for arbitrary user-controlled URLs with file:/ schemes which
-    # cannot reach this code path.
-    try:
-        with urllib.request.urlopen(url, timeout=2) as resp:  # noqa: S310  # nosec B310
-            return resp.status == 200
-    except Exception:  # noqa: BLE001
-        return False
-
-
-def _spawn_inference_server(config: VlmConfig) -> str:
-    """Spawn ``transformers serve`` (or ``serve_command``), wait until it
-    accepts ``/v1/models``, and register a shutdown hook.
-
-    Streams the server's stdout/stderr to the parent terminal in
-    real-time on a background thread so users can see model-load
-    progress and errors as they happen.
-
-    Returns the full ``api_base`` URL the OpenAI client should use.
-    """
-    cmd = config.serve_command
-    if not cmd:
-        cmd = (
-            f"transformers serve {shlex.quote(config.model_id)} "
-            f"--port {config.serve_port} --continuous-batching"
-        )
-    api_base = f"http://localhost:{config.serve_port}/v1"
-    print(f"[server] launching: {cmd}", flush=True)
-    proc = subprocess.Popen(
-        shlex.split(cmd),
-        stdout=subprocess.PIPE,
-        stderr=subprocess.STDOUT,
-        text=True,
-        bufsize=1,
-    )
-
-    # Watch the server output for the uvicorn readiness banner. This is
-    # more reliable than polling /v1/models because transformers serve
-    # rescans its cache on every model-list request, which can exceed
-    # the urllib timeout and trigger an infinite probe loop.
-    ready_event = threading.Event()
-    # See _spawn_parallel_inference_servers for why we accept these.
-    ready_markers = (
-        "Uvicorn running",
-        "Application startup complete",
-        "Starting vLLM API server",
-        "Available routes are",
-    )
-
-    def _probe() -> None:
-        while not ready_event.is_set() and proc.poll() is None:
-            if _server_is_up(api_base):
-                print("[server] ready (http probe)", flush=True)
-                ready_event.set()
-                return
-            time.sleep(2)
-
-    threading.Thread(target=_probe, daemon=True).start()
-
-    def _stream_output() -> None:
-        # Read one character at a time instead of iterating lines so tqdm
-        # progress bars (which overwrite using \r) flush in real time.
-        assert proc.stdout is not None
-        buf = ""
-        prefix_started = False
-        while True:
-            ch = proc.stdout.read(1)
-            if ch == "":
-                # process exited; flush any tail
-                if buf:
-                    sys.stdout.write(buf)
-                    sys.stdout.flush()
-                return
-            if not prefix_started:
-                sys.stdout.write("[server] ")
-                prefix_started = True
-            sys.stdout.write(ch)
-            sys.stdout.flush()
-            buf += ch
-            if ch in ("\n", "\r"):
-                if any(marker in buf for marker in ready_markers):
-                    ready_event.set()
-                buf = ""
-                prefix_started = False
-
-    threading.Thread(target=_stream_output, daemon=True).start()
-
-    def _shutdown() -> None:
-        if proc.poll() is None:
-            print(f"[server] stopping pid={proc.pid}", flush=True)
-            proc.send_signal(signal.SIGINT)
-            try:
-                proc.wait(timeout=15)
-            except subprocess.TimeoutExpired:
-                proc.kill()
-                proc.wait(timeout=5)
-
-    atexit.register(_shutdown)
-
-    deadline = time.monotonic() + config.serve_ready_timeout_s
-    while time.monotonic() < deadline:
-        if proc.poll() is not None:
-            raise RuntimeError(
-                f"[server] inference server exited unexpectedly with rc={proc.returncode}. "
-                f"See [server] log lines above for the cause."
-            )
-        if ready_event.wait(timeout=2):
-            return api_base
-    proc.terminate()
-    raise RuntimeError(f"[server] did not become ready within {config.serve_ready_timeout_s}s")
-
-
-def _to_openai_messages(
-    messages: Sequence[dict[str, Any]],
-) -> tuple[list[dict[str, Any]], dict[str, Any]]:
-    """Convert internal messages to OpenAI chat format.
-
-    Returns ``(api_messages, mm_kwargs)``. Multimodal-processor kwargs
-    (``fps`` from ``video_url`` blocks) are extracted out so the caller
-    can pass them via ``extra_body.mm_processor_kwargs`` rather than
-    inside the content blocks (which transformers serve rejects).
-
-    File-URL video blocks are inlined as base64 data URLs.
-    """
-    out_messages: list[dict[str, Any]] = []
-    mm_kwargs: dict[str, Any] = {}
-    for message in messages:
-        content = message.get("content")
-        if not isinstance(content, list):
-            out_messages.append({"role": message["role"], "content": content})
-            continue
-        out_blocks: list[dict[str, Any]] = []
-        for block in content:
-            block_type = block.get("type") if isinstance(block, dict) else None
-            if block_type == "text":
-                out_blocks.append({"type": "text", "text": block.get("text", "")})
-            elif block_type == "image":
-                out_blocks.append(
-                    {"type": "image_url", "image_url": {"url": _pil_to_data_url(block["image"])}}
-                )
-            elif block_type == "video":
-                frames = block.get("video", [])
-                for img in frames:
-                    out_blocks.append({"type": "image_url", "image_url": {"url": _pil_to_data_url(img)}})
-            elif block_type == "video_url":
-                video_url = dict(block["video_url"])
-                url = video_url.get("url", "")
-                if url.startswith("file://"):
-                    video_url["url"] = _file_to_data_url(url[len("file://") :])
-                out_blocks.append({"type": "video_url", "video_url": video_url})
-                fps = block.get("fps")
-                if fps is not None:
-                    mm_kwargs["fps"] = fps
-            else:
-                out_blocks.append(block)
-        out_messages.append({"role": message["role"], "content": out_blocks})
-    return out_messages, mm_kwargs
-
-
-def _file_to_data_url(path: str) -> str:
-    """Read a local video file and return a base64 ``data:video/mp4`` URL."""
-    with open(path, "rb") as f:
-        b64 = base64.b64encode(f.read()).decode("ascii")
-    return f"data:video/mp4;base64,{b64}"
-
-
-def _pil_to_data_url(image: Any) -> str:
-    """Encode a PIL.Image as a base64 data URL."""
-    buf = io.BytesIO()
-    image.save(buf, format="PNG")
-    b64 = base64.b64encode(buf.getvalue()).decode("ascii")
-    return f"data:image/png;base64,{b64}"
-
-
-def _messages_to_prompt(messages: Sequence[dict[str, Any]]) -> Any:
-    """Pass-through hook used by the vllm backend.
-
-    vllm exposes its own multimodal entry points that vary by version; for the
-    base flow we simply forward the raw message list and let the caller's
-    custom backend handle templating. Real deployments override this.
-    """
-    return list(messages)
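The parse-then-reprompt-once behaviour of the removed `_GenericTextClient.generate_json` can be sketched for a single request with a fake text generator standing in for a real backend. This is a simplified illustration under stated assumptions: `generate_json_with_retry` and `fake_model` are hypothetical names, the sketch calls `json.loads` directly instead of the fence and `<think>` stripping done by `_strip_to_json`, and it handles one message list, not a batch.

```python
import json
from collections.abc import Callable, Sequence
from typing import Any

Messages = Sequence[dict[str, Any]]

CORRECTION = (
    "Your previous reply was not valid JSON. "
    "Reply with strictly valid JSON, no prose, no fences."
)


def generate_json_with_retry(
    generate_text: Callable[[Messages], str],
    messages: Messages,
) -> Any:
    """Parse the model reply as JSON, reprompting once; return None on failure."""
    text = generate_text(messages)
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError is a ValueError subclass
        pass
    # Feed the bad reply back with an inline correction and try exactly once more.
    retry = [
        *messages,
        {"role": "assistant", "content": text},
        {"role": "user", "content": CORRECTION},
    ]
    try:
        return json.loads(generate_text(retry))
    except ValueError:
        # Callers are expected to treat None as "skip this item".
        return None


def fake_model(messages: Messages) -> str:
    # Hypothetical stand-in: answers with prose first, JSON after the correction.
    return '{"ok": true}' if len(messages) > 1 else "Sure! Here you go."


result = generate_json_with_retry(
    fake_model, [{"role": "user", "content": "Describe the scene as JSON."}]
)
print(result)  # {'ok': True}
```

Returning `None` instead of raising means one malformed reply does not abort the run, at the cost of requiring every caller to check for `None`.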
--- a/src/lerobot/annotations/steerable_pipeline/vocabulary.py
+++ b/src/lerobot/annotations/steerable_pipeline/vocabulary.py
@@ -1,222 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Dataset-level canonical vocabulary discovery (Phase 0).
-
-The downstream consumer of these annotations is a low-level action expert
-conditioned on the ``subtask`` string. Free-form per-episode LLM rephrasing
-gives near-unique strings per occurrence, which collapses the action
-expert's conditioning to noise and makes runtime subtask-paraphrase drift
-catastrophic. The Hi-Robot / π0.6-MEM recipe ships a small canonical
-vocabulary per environment (~10 strings) that every episode reuses; this
-module derives that vocabulary automatically from the first few episode
-videos and persists it next to the dataset.
-
-Pipeline-level flow:
-
-    Phase 0 (here): watch N sample episodes → produce vocabulary.json
-    Phase 1 (plan module): reuse vocabulary on every episode, both as
-                           prompt-side constraint *and* post-VLM validation
-
-The vocabulary is JSON, lives at ``<root>/meta/canonical_vocabulary.json``,
-and is human-inspectable / hand-editable — if the discovered set is wrong,
-operators edit the file and re-run the pipeline without phase 0.
-"""
-
-from __future__ import annotations
-
-import json
-import logging
-from collections.abc import Sequence
-from dataclasses import dataclass, field
-from pathlib import Path
-from typing import Any
-
-from .config import VocabularyConfig
-from .frames import FrameProvider, null_provider, to_video_block
-from .prompts import load as load_prompt
-from .reader import EpisodeRecord
-from .vlm_client import VlmClient
-
-logger = logging.getLogger(__name__)
-
-VOCABULARY_FILENAME = "canonical_vocabulary.json"
-
-
-@dataclass
-class Vocabulary:
-    """Canonical phrasings shared across every episode of one dataset.
-
-    Both lists are strict: per-episode subtask + memory generation pick
-    from these strings only; the downstream policy then has a small,
-    repeatable target distribution to learn instead of thousands of
-    LLM paraphrases.
-    """
-
-    subtasks: tuple[str, ...]
-    """Imperative subtask labels — what the low-level policy is conditioned
-    on. Verb-first, telegraphic, consistent object nouns. Example:
-    ``("move to blue cube", "grasp blue cube", "lift blue cube",
-       "place blue cube in box", "retract arm")``.
-    """
-
-    memory_milestones: tuple[str, ...]
-    """First-person past-tense milestone sentences — building blocks for
-    the running memory string. Example: ``("I picked up the blue cube.",
-    "I placed the blue cube in the green box.")``. Each milestone maps
-    1:1 onto a completed subtask phase; ``memory_at_step_k`` is the
-    concatenation of milestones for completed phases.
-    """
-
-    def to_json(self) -> dict[str, list[str]]:
-        return {
-            "subtasks": list(self.subtasks),
-            "memory_milestones": list(self.memory_milestones),
-        }
-
-    @classmethod
-    def from_json(cls, payload: dict[str, Any]) -> Vocabulary:
-        subtasks = tuple(
-            str(s).strip() for s in (payload.get("subtasks") or []) if str(s).strip()
-        )
-        memory_milestones = tuple(
-            str(s).strip() for s in (payload.get("memory_milestones") or []) if str(s).strip()
-        )
-        return cls(subtasks=subtasks, memory_milestones=memory_milestones)
-
-    def is_empty(self) -> bool:
-        return not self.subtasks and not self.memory_milestones
-
-
-def vocabulary_path(root: Path) -> Path:
-    """Return the canonical on-disk location for the vocabulary file."""
-    return root / "meta" / VOCABULARY_FILENAME
-
-
-def load_vocabulary(root: Path) -> Vocabulary | None:
-    """Read ``<root>/meta/canonical_vocabulary.json`` if present.
-
-    Returns ``None`` when the file does not exist — callers fall back to
-    free-form (unconstrained) subtask + memory generation, preserving the
-    pipeline's behaviour on datasets that never ran phase 0.
-    """
-    path = vocabulary_path(root)
-    if not path.exists():
-        return None
-    try:
-        payload = json.loads(path.read_text(encoding="utf-8"))
-    except (OSError, json.JSONDecodeError) as exc:
-        logger.warning("could not read %s: %s — proceeding without vocabulary", path, exc)
-        return None
-    if not isinstance(payload, dict):
-        logger.warning("%s is not a JSON object — ignoring", path)
-        return None
-    vocab = Vocabulary.from_json(payload)
-    if vocab.is_empty():
-        return None
-    return vocab
-
-
-def save_vocabulary(root: Path, vocab: Vocabulary) -> Path:
-    """Atomically persist ``vocab`` to ``<root>/meta/canonical_vocabulary.json``."""
-    path = vocabulary_path(root)
-    path.parent.mkdir(parents=True, exist_ok=True)
-    tmp = path.with_suffix(path.suffix + ".tmp")
-    tmp.write_text(
-        json.dumps(vocab.to_json(), indent=2, ensure_ascii=False) + "\n",
-        encoding="utf-8",
-    )
-    tmp.replace(path)
-    return path
-
-
-@dataclass
-class VocabularyDiscoveryModule:
-    """Derive a dataset-level canonical vocabulary from sample episodes.
-
-    Phase 0 of the executor: pulls ``config.sample_episodes`` episode
-    videos, packs them into one Qwen-VL multi-video prompt, and asks the
-    model to enumerate the small set of canonical subtask labels +
-    memory milestones that recur across them. The output is persisted
-    to ``meta/canonical_vocabulary.json`` and consumed by phase 1.
-    """
-
-    vlm: VlmClient
-    config: VocabularyConfig
-    frame_provider: FrameProvider = field(default_factory=null_provider)
-
-    @property
-    def enabled(self) -> bool:
-        return self.config.enabled
-
-    def discover(
-        self,
-        records: Sequence[EpisodeRecord],
-        *,
-        existing: Vocabulary | None = None,
-    ) -> Vocabulary | None:
-        """Run vocabulary discovery against the first N sample episodes.
-
-        ``existing`` short-circuits the VLM call when ``config.reuse_existing``
-        is True and an on-disk vocabulary is already present — keeps re-runs
-        cheap and lets operators hand-edit the file without it getting
-        overwritten.
-        """
-        if existing is not None and self.config.reuse_existing:
-            logger.info(
-                "vocabulary: reusing existing (%d subtasks, %d memory milestones)",
-                len(existing.subtasks),
-                len(existing.memory_milestones),
-            )
-            return existing
-
-        sample = list(records[: max(1, int(self.config.sample_episodes))])
-        if not sample:
-            return None
-
-        task_hint = next((r.episode_task for r in sample if r.episode_task), "")
-        prompt = load_prompt("module_0_vocabulary").format(
-            episode_task=task_hint or "(unspecified)",
-            n_episodes=len(sample),
-        )
-        # Pack one video block per sample episode so the VLM sees the
-        # variation across episodes (different starting poses, different
-        # object placements) rather than overfitting to one trajectory.
-        content: list[dict[str, Any]] = []
-        for record in sample:
-            video_frames = self.frame_provider.video_for_episode(
-                record, int(self.config.max_video_frames_per_episode)
-            )
-            if video_frames:
-                content.extend(to_video_block(video_frames))
-        content.append({"type": "text", "text": prompt})
-        messages = [{"role": "user", "content": content}]
-
-        result = self.vlm.generate_json([messages])[0]
-        if not isinstance(result, dict):
-            logger.warning("vocabulary: VLM did not return a JSON object — skipping")
-            return None
-
-        vocab = Vocabulary.from_json(result)
-        if vocab.is_empty():
-            logger.warning("vocabulary: VLM returned an empty vocabulary — skipping")
-            return None
-        logger.info(
-            "vocabulary: discovered %d subtask labels + %d memory milestones from %d episodes",
-            len(vocab.subtasks),
-            len(vocab.memory_milestones),
-            len(sample),
-        )
-        return vocab
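Note on the removed vocabulary.py: the persistence helpers above write to a sibling temporary file and rename it over the target, and the loader treats a missing, unreadable, or empty file as "no vocabulary". The sketch below reproduces that pattern with the standard library only. It is an illustration under stated assumptions, not the removed implementation: `normalize_labels`, `save_json_atomically`, `load_labels`, and `demo_round_trip` are hypothetical names, and a plain dict stands in for the `Vocabulary` dataclass.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def normalize_labels(raw) -> tuple:
    """Strip whitespace and drop empty entries, as Vocabulary.from_json does."""
    return tuple(str(s).strip() for s in (raw or []) if str(s).strip())


def save_json_atomically(path: Path, payload: dict) -> Path:
    """Write to a sibling .tmp file, then rename it over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(
        json.dumps(payload, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    tmp.replace(path)  # rename; readers never observe a half-written file
    return path


def load_labels(path: Path) -> Optional[dict]:
    """Return normalized labels, or None if the file is missing, broken, or empty."""
    if not path.exists():
        return None
    try:
        payload = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return None
    if not isinstance(payload, dict):
        return None
    labels = {
        "subtasks": normalize_labels(payload.get("subtasks")),
        "memory_milestones": normalize_labels(payload.get("memory_milestones")),
    }
    # Mirrors is_empty(): an all-empty vocabulary is treated as absent.
    return labels if any(labels.values()) else None


def demo_round_trip() -> Optional[dict]:
    """Save a payload with stray whitespace and an empty entry, then reload it."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        target = Path(tmp_dir) / "meta" / "canonical_vocabulary.json"
        save_json_atomically(
            target, {"subtasks": [" grasp blue cube ", ""], "memory_milestones": []}
        )
        return load_labels(target)
```

Returning `None` for every failure mode keeps the caller's fallback path (free-form generation) identical whether the file was never written or was hand-edited into an invalid state.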
--- a/src/lerobot/annotations/steerable_pipeline/writer.py
+++ b/src/lerobot/annotations/steerable_pipeline/writer.py
@@ -1,356 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Final parquet rewrite.
-
-For every episode the writer:
-
-1. reads the staged module outputs,
-2. partitions them into a persistent slice (PERSISTENT_STYLES) and an event
-   slice (EVENT_ONLY_STYLES + style=None tool-call atoms),
-3. sorts each slice deterministically,
-4. broadcasts the persistent slice across every frame in the episode,
-5. for each frame, materializes the sublist of event rows whose timestamp
-   exactly equals that frame's timestamp,
-6. drops the legacy ``subtask_index`` column,
-7. writes the parquet shard back in place.
-
-The writer does NOT add a dataset-level ``tools`` column. Tool *calls* are
-emitted per-row via the existing ``tool_calls`` field on the v3.1 row
-struct for every speech atom. The tool *schema* (the description
-of the ``say`` function and its parameters) is a fixed code constant —
-``SAY_TOOL_SCHEMA`` below — and downstream chat-template consumers import
-it directly rather than reading a redundant per-row column.
-
-Invariants enforced here (and re-checked by the validator):
-
-- per-episode persistent slice is byte-identical across every frame;
-- ``language_events`` rows on a frame all have ``timestamp == frame_ts``
-  (timestamps come straight from the source parquet — never recomputed);
-- every row passes ``column_for_style(style)``.
-"""
-
-from __future__ import annotations
-
-import logging
-from collections import defaultdict
-from collections.abc import Iterable, Sequence
-from dataclasses import dataclass
-from pathlib import Path
-from typing import Any
-
-import pyarrow as pa
-import pyarrow.parquet as pq
-
-from lerobot.datasets.language import (
-    EVENT_ONLY_STYLES,
-    LANGUAGE_EVENTS,
-    LANGUAGE_PERSISTENT,
-    PERSISTENT_STYLES,
-    column_for_style,
-    validate_camera_field,
-)
-
-from .reader import EpisodeRecord
-from .staging import EpisodeStaging
-
-logger = logging.getLogger(__name__)
-
-
-# Tool schema constants live in lerobot.datasets.language — single
-# source of truth. Re-exported here so existing imports
-# (``from lerobot.annotations.steerable_pipeline.writer import SAY_TOOL_SCHEMA``)
-# keep working.
-from lerobot.datasets.language import DEFAULT_TOOLS, SAY_TOOL_SCHEMA  # noqa: F401, E402
-
-
-def _row_persistent_sort_key(row: dict[str, Any]) -> tuple:
-    return (float(row["timestamp"]), row.get("style") or "", row.get("role") or "")
-
-
-def _row_event_sort_key(row: dict[str, Any]) -> tuple:
-    # events are bucketed per-frame, but within a frame we still want determinism
-    return (
-        row.get("style") or "",
-        row.get("role") or "",
-        row.get("camera") or "",
-    )
-
-
-def _normalize_persistent_row(row: dict[str, Any]) -> dict[str, Any]:
-    """Coerce a staged row into the persistent column's struct shape."""
-    style = row.get("style")
-    if style not in PERSISTENT_STYLES:
-        raise ValueError(
-            f"persistent slice contains row with non-persistent style {style!r}; "
-            "row would be misrouted under column_for_style()"
-        )
-    if "timestamp" not in row:
-        raise ValueError(f"persistent row missing timestamp: {row!r}")
-    if "role" not in row:
-        # Surface a friendly error from the writer rather than letting
-        # the raw KeyError bubble out of the dict access below — modules
-        # are expected to always emit ``role``, but the validator
-        # currently doesn't check this so a future bug would otherwise
-        # be hard to triage.
-        raise ValueError(f"persistent row missing role: {row!r}")
-    camera = row.get("camera")
-    validate_camera_field(style, camera)
-    return {
-        "role": str(row["role"]),
-        "content": None if row.get("content") is None else str(row["content"]),
-        "style": style,
-        "timestamp": float(row["timestamp"]),
-        "camera": None if camera is None else str(camera),
-        "tool_calls": _normalize_tool_calls(row.get("tool_calls")),
-    }
-
-
-def _normalize_event_row(row: dict[str, Any]) -> dict[str, Any]:
-    """Coerce a staged row into the event column's struct shape (no timestamp)."""
-    style = row.get("style")
-    if style is not None and style not in EVENT_ONLY_STYLES:
-        raise ValueError(
-            f"event slice contains row with style {style!r}; expected None or one of {EVENT_ONLY_STYLES}"
-        )
-    if column_for_style(style) != LANGUAGE_EVENTS:
-        raise ValueError(f"event row with style {style!r} would not route to language_events")
-    if "role" not in row:
-        raise ValueError(f"event row missing role: {row!r}")
-    camera = row.get("camera")
-    validate_camera_field(style, camera)
-    return {
-        "role": str(row["role"]),
-        "content": None if row.get("content") is None else str(row["content"]),
-        "style": style,
-        "camera": None if camera is None else str(camera),
-        "tool_calls": _normalize_tool_calls(row.get("tool_calls")),
-    }
-
-
-def _normalize_tool_calls(value: Any) -> list[Any] | None:
-    if value is None:
-        return None
-    if not isinstance(value, list):
-        raise ValueError(f"tool_calls must be a list or None, got {type(value).__name__}")
-    return list(value)
-
-
-def _validate_atom_invariants(row: dict[str, Any]) -> None:
-    """At-least-one of content/tool_calls; style=None implies tool_calls."""
-    has_content = row.get("content") is not None
-    has_tools = row.get("tool_calls") is not None
-    if not (has_content or has_tools):
-        raise ValueError(f"row has neither content nor tool_calls: {row!r}")
-    if row.get("style") is None and not has_tools:
-        raise ValueError(f"style=None requires tool_calls: {row!r}")
-
-
-def _validate_speech_atom(row: dict[str, Any]) -> None:
-    """Speech atoms: role=assistant, style=None, content=None, say tool call."""
-    if row.get("style") is not None:
-        return  # not a speech atom
-    if row.get("role") != "assistant":
-        raise ValueError(f"speech atom must have role=assistant: {row!r}")
-    if row.get("content") is not None:
-        raise ValueError(f"speech atom must have content=null: {row!r}")
-    tool_calls = row.get("tool_calls")
-    if not tool_calls or not isinstance(tool_calls, list):
-        raise ValueError(f"speech atom must have non-empty tool_calls list: {row!r}")
-    first = tool_calls[0]
-    if not isinstance(first, dict):
-        raise ValueError(f"speech atom tool_calls[0] must be a dict: {row!r}")
-    if first.get("type") != "function":
-        raise ValueError(f"speech atom tool_calls[0].type must be 'function': {row!r}")
-    fn = first.get("function") or {}
-    if fn.get("name") != "say":
-        raise ValueError(f"speech atom tool_calls[0].function.name must be 'say': {row!r}")
-    args = fn.get("arguments") or {}
-    if not isinstance(args, dict) or "text" not in args or not isinstance(args["text"], str):
-        raise ValueError(f"speech atom must carry 'text' string in arguments: {row!r}")
-
-
-@dataclass
-class LanguageColumnsWriter:
-    """Rewrite ``data/chunk-*/file-*.parquet`` with the two language columns."""
-
-    drop_existing_subtask_index: bool = True
-
-    def write_all(
-        self,
-        records: Sequence[EpisodeRecord],
-        staging_dir: Path,
-        root: Path,
-    ) -> list[Path]:
-        episodes_by_path: dict[Path, list[EpisodeRecord]] = defaultdict(list)
-        for record in records:
-            episodes_by_path[record.data_path].append(record)
-
-        written: list[Path] = []
-        for path, eps in episodes_by_path.items():
-            self._rewrite_one(path, eps, staging_dir, root)
-            written.append(path)
-        return written
-
-    def _rewrite_one(
-        self,
-        path: Path,
-        episodes: Sequence[EpisodeRecord],
-        staging_dir: Path,
-        root: Path,
-    ) -> None:
-        table = pq.read_table(path)
-        n_rows = table.num_rows
-
-        # Ensure we cover every episode in the file. Episodes that don't have
-        # staging artifacts are passed through with empty annotation lists —
-        # this keeps the writer idempotent and safe for partial reruns.
-        staged_per_ep: dict[int, dict[str, list[dict[str, Any]]]] = {}
-        for record in episodes:
-            staging = EpisodeStaging(staging_dir, record.episode_index)
-            staged_per_ep[record.episode_index] = staging.read_all()
-
-        persistent_by_ep: dict[int, list[dict[str, Any]]] = {}
-        events_by_ep_ts: dict[int, dict[float, list[dict[str, Any]]]] = {}
-
-        for ep_index, ep_staged in staged_per_ep.items():
-            persistent_rows: list[dict[str, Any]] = []
-            event_rows: list[dict[str, Any]] = []  # carry timestamp until bucketed
-            for _module_name, rows in ep_staged.items():
-                for row in rows:
-                    style = row.get("style")
-                    if column_for_style(style) == LANGUAGE_PERSISTENT:
-                        persistent_rows.append(row)
-                    else:
-                        event_rows.append(row)
-
-            persistent_rows.sort(key=_row_persistent_sort_key)
-            normalized_persistent = []
-            for r in persistent_rows:
-                _validate_atom_invariants(r)
-                _validate_speech_atom(r)
-                normalized_persistent.append(_normalize_persistent_row(r))
-            persistent_by_ep[ep_index] = normalized_persistent
-
-            buckets: dict[float, list[dict[str, Any]]] = defaultdict(list)
-            for r in event_rows:
-                _validate_atom_invariants(r)
-                _validate_speech_atom(r)
-                ts = float(r["timestamp"])
-                buckets[ts].append(_normalize_event_row(r))
-            for ts in list(buckets.keys()):
-                buckets[ts].sort(key=_row_event_sort_key)
-            events_by_ep_ts[ep_index] = buckets
-
-        episode_col = (
-            table.column("episode_index").to_pylist() if "episode_index" in table.column_names else None
-        )
-        ts_col = table.column("timestamp").to_pylist() if "timestamp" in table.column_names else None
-        if episode_col is None or ts_col is None:
-            raise ValueError(f"{path} is missing 'episode_index' or 'timestamp' — required by the writer.")
-
-        per_row_persistent: list[list[dict[str, Any]]] = []
-        per_row_events: list[list[dict[str, Any]]] = []
-        for i in range(n_rows):
-            ep = episode_col[i]
-            ts = float(ts_col[i])
-            per_row_persistent.append(persistent_by_ep.get(ep, []))
-            buckets = events_by_ep_ts.get(ep, {})
-            per_row_events.append(buckets.get(ts, []))
-
-        new_table = self._materialize_table(
-            table, per_row_persistent, per_row_events, drop_old=self.drop_existing_subtask_index
-        )
-        # Atomic replace: write to a sibling tmp path and rename so a crash
-        # mid-write can't leave a half-written shard that ``pq.read_table``
-        # would then fail to open. ``Path.replace`` is atomic on POSIX +
-        # Windows when source and target sit on the same filesystem.
-        tmp_path = path.with_suffix(path.suffix + ".tmp")
-        pq.write_table(new_table, tmp_path)
-        tmp_path.replace(path)
-
-    def _materialize_table(
-        self,
-        table: pa.Table,
-        persistent: list[list[dict[str, Any]]],
-        events: list[list[dict[str, Any]]],
-        *,
-        drop_old: bool,
-    ) -> pa.Table:
-        cols = []
-        names = []
-        for name in table.column_names:
-            if drop_old and name == "subtask_index":
-                continue
-            if name in (LANGUAGE_PERSISTENT, LANGUAGE_EVENTS):
-                continue  # we'll re-add canonical versions
-            # Strip any legacy ``tools`` column previously emitted by older
-            # writers — the schema no longer uses it (constant lives in
-            # SAY_TOOL_SCHEMA / DEFAULT_TOOLS).
-            if name == "tools":
-                continue
-            cols.append(table.column(name))
-            names.append(name)
-
-        # We let pyarrow infer struct/list schema rather than passing the
-        # canonical type from `lerobot.datasets.language` directly: that type
-        # uses `pa.json_()` for the `tool_calls` element type, which
-        # `pa.array(..., type=...)` cannot materialize from Python lists on
-        # current pyarrow versions. The inferred schema round-trips through
-        # parquet and `LeRobotDataset` correctly — `tests/datasets/test_language.py`
-        # exercises the same flow.
-        persistent_arr = pa.array(persistent)
-        events_arr = pa.array(events)
-
-        cols.extend([persistent_arr, events_arr])
-        names.extend([LANGUAGE_PERSISTENT, LANGUAGE_EVENTS])
-
-        return pa.Table.from_arrays(cols, names=names)
-
-
-def speech_atom(timestamp: float, text: str) -> dict[str, Any]:
-    """Build a canonical speech tool-call atom for the events column."""
-    return {
-        "role": "assistant",
-        "content": None,
-        "style": None,
-        "timestamp": float(timestamp),
-        "camera": None,
-        "tool_calls": [
-            {
-                "type": "function",
-                "function": {
-                    "name": "say",
-                    "arguments": {"text": text},
-                },
-            }
-        ],
-    }
-
-
-def normalize_rows_for_writer(
-    rows: Iterable[dict[str, Any]],
-) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
-    """Helper used by tests/validators to partition a flat row list into
-    (persistent_rows, event_rows) using ``column_for_style``.
-    """
-    persistent: list[dict[str, Any]] = []
-    events: list[dict[str, Any]] = []
-    for row in rows:
-        if column_for_style(row.get("style")) == LANGUAGE_PERSISTENT:
-            persistent.append(row)
-        else:
-            events.append(row)
-    return persistent, events
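Note on the removed writer.py: steps 2 to 5 of the module docstring (partition, sort, broadcast the persistent slice, bucket events by exact frame timestamp) can be sketched in plain Python without pyarrow. This is a simplified illustration, not the removed code: `build_frame_columns` is a hypothetical helper, and the `PERSISTENT_STYLES` set below holds placeholder values standing in for the constant imported from `lerobot.datasets.language`.

```python
from collections import defaultdict

# Placeholder values; the real set lives in lerobot.datasets.language.
PERSISTENT_STYLES = {"subtask", "memory"}


def build_frame_columns(staged_rows, frame_timestamps):
    """Return (persistent_per_frame, events_per_frame) for one episode."""
    persistent = [r for r in staged_rows if r.get("style") in PERSISTENT_STYLES]
    events = [r for r in staged_rows if r.get("style") not in PERSISTENT_STYLES]

    # Deterministic order for the persistent slice: timestamp, style, role.
    persistent.sort(
        key=lambda r: (float(r["timestamp"]), r.get("style") or "", r.get("role") or "")
    )

    # Bucket events by exact timestamp; the stored event struct has no timestamp.
    buckets = defaultdict(list)
    for row in events:
        stripped = {k: v for k, v in row.items() if k != "timestamp"}
        buckets[float(row["timestamp"])].append(stripped)
    for rows in buckets.values():
        rows.sort(key=lambda r: (r.get("style") or "", r.get("role") or ""))

    # Every frame gets the same persistent list; events only on matching frames.
    persistent_per_frame = [persistent for _ in frame_timestamps]
    events_per_frame = [buckets.get(float(ts), []) for ts in frame_timestamps]
    return persistent_per_frame, events_per_frame
```

Exact float equality is safe here only because event timestamps are copied from the frame timestamps in the source parquet rather than recomputed, which is the invariant the docstring calls out.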
--- a/src/lerobot/common/wandb_utils.py
+++ b/src/lerobot/common/wandb_utils.py
@@ -205,149 +205,3 @@ class WandBLogger:

        wandb_video = self._wandb.Video(video_path, fps=self.env_fps, format="mp4")
        self._wandb.log({f"{mode}/video": wandb_video}, step=step)
-
-    def log_training_examples(
-        self,
-        batch: dict,
-        step: int,
-        *,
-        camera_keys: list[str],
-        n_samples: int = 4,
-        policy=None,
-        predict_actions: bool = False,
-        mode: str = "train",
-    ) -> None:
-        """Push a ``wandb.Table`` of training-example rows for the current batch.
-
-        Each row is one batch element with:
-          * one ``wandb.Image`` column per camera in ``camera_keys`` (CHW or
-            HWC, uint8 or float in [0,1] — auto-detected),
-          * any text fields present in the batch (``task`` / ``subtask`` /
-            ``memory`` / ``instruction``),
-          * ground-truth action first/last frame (the action chunk's
-            endpoints — gives a quick sense of trajectory direction),
-          * if ``predict_actions=True`` and ``policy`` is supplied, the model's
-            ``predict_action_chunk`` first/last frame alongside.
-
-        This is controlled by ``--wandb.log_examples_freq=N`` on the CLI
-        (default 5000, 0 disables); the training loop calls it once every N
-        steps. Cheap to keep on: with ``n_samples=4`` and 3 cameras you upload
-        12 small PNGs per dump and (if enabled) run one extra forward pass.
-        """
-        import logging  # noqa: PLC0415
-        import numpy as np  # noqa: PLC0415
-        import torch  # noqa: PLC0415
-
-        if mode not in {"train", "eval"}:
-            raise ValueError(mode)
-
-        # Batch size — first tensor-like value wins.
-        bsz = next(
-            (int(v.shape[0]) for v in batch.values() if hasattr(v, "shape") and v.ndim > 0),
-            None,
-        )
-        if not bsz:
-            return
-        n = min(int(n_samples), bsz)
-
-        # Optional predicted-action forward pass on the first n samples.
-        pred_actions: np.ndarray | None = None
-        if predict_actions and policy is not None:
-            was_training = policy.training
-            try:
-                policy.eval()
-                sub_batch = {}
-                for k, v in batch.items():
-                    if isinstance(v, torch.Tensor):
-                        sub_batch[k] = v[:n]
-                    elif isinstance(v, (list, tuple)):
-                        sub_batch[k] = list(v[:n])
-                    else:
-                        sub_batch[k] = v
-                with torch.no_grad():
-                    pred = policy.predict_action_chunk(sub_batch)
-                pred_actions = pred.detach().cpu().float().numpy()
-            except Exception as exc:  # noqa: BLE001
-                logging.warning(
-                    "log_training_examples: predict_action_chunk failed (%s) — "
-                    "skipping predicted-action columns",
-                    exc,
-                )
-                pred_actions = None
-            finally:
-                if was_training:
-                    policy.train()
-
-        present_cameras = [c for c in camera_keys if c in batch]
-        text_keys = [k for k in ("task", "subtask", "memory", "instruction") if k in batch]
-
-        columns = ["sample"]
-        columns.extend(c.removeprefix("observation.images.") or c for c in present_cameras)
-        columns.extend(text_keys)
-        columns.append("gt_action_first")
-        columns.append("gt_action_last")
-        if pred_actions is not None:
-            columns.append("pred_action_first")
-            columns.append("pred_action_last")
-
-        table = self._wandb.Table(columns=columns)
-
-        def _to_uint8_hwc(t: torch.Tensor) -> np.ndarray:
-            # Strip an outer time dim if present: (T, C, H, W) -> first frame.
-            if t.ndim == 4:
-                t = t[0]
-            # CHW -> HWC.
-            if t.ndim == 3 and t.shape[0] in (1, 3, 4) and t.shape[-1] not in (1, 3, 4):
-                t = t.permute(1, 2, 0)
-            arr = t.detach().cpu().float().numpy()
-            if arr.size and float(arr.max()) <= 1.5:
-                arr = arr * 255.0
-            return np.clip(arr, 0, 255).astype(np.uint8)
-
-        def _action_endpoints(a: torch.Tensor) -> tuple[str, str]:
-            arr = a.detach().cpu().float().numpy()
-            if arr.ndim == 2:  # (T, D)
-                return (
-                    str(np.round(arr[0], 3).tolist()),
-                    str(np.round(arr[-1], 3).tolist()),
-                )
-            if arr.ndim == 1:
-                rounded = np.round(arr, 3).tolist()
-                return (str(rounded), str(rounded))
-            return (str(arr.tolist()), str(arr.tolist()))
-
-        for i in range(n):
-            row: list = [i]
-            for cam in present_cameras:
-                try:
-                    row.append(self._wandb.Image(_to_uint8_hwc(batch[cam][i])))
-                except Exception as exc:  # noqa: BLE001
-                    logging.warning(
-                        "log_training_examples: camera %s sample %d failed (%s)",
-                        cam,
-                        i,
-                        exc,
-                    )
-                    row.append(None)
-            for tk in text_keys:
-                v = batch[tk]
-                if isinstance(v, (list, tuple)):
-                    row.append(str(v[i]) if i < len(v) else "")
-                else:
-                    row.append(str(v))
-            action = batch.get("action")
-            if isinstance(action, torch.Tensor) and action.ndim >= 1:
-                first, last = _action_endpoints(action[i])
-                row.append(first)
-                row.append(last)
-            else:
-                row.append("")
-                row.append("")
-            if pred_actions is not None:
-                p = torch.from_numpy(pred_actions[i])
-                pfirst, plast = _action_endpoints(p)
-                row.append(pfirst)
-                row.append(plast)
-            table.add_data(*row)
-
-        self._wandb.log({f"{mode}/examples": table}, step=step)
--- a/src/lerobot/configs/default.py
+++ b/src/lerobot/configs/default.py
@@ -62,72 +62,6 @@ class WandBConfig:
    run_id: str | None = None
    mode: str | None = None  # Allowed values: 'online', 'offline', 'disabled'. Defaults to 'online'
    add_tags: bool = True  # If True, save configuration as tags in the WandB run.
-    # Periodic training-example dump (independent of ``log_freq``). When > 0,
-    # every ``log_examples_freq`` steps the trainer pushes a ``wandb.Table``
-    # with one row per sampled batch element containing each camera view
-    # (rendered as ``wandb.Image``), any text fields present in the batch
-    # (``task`` / ``subtask`` / ``memory`` / ``instruction``), and the
-    # ground-truth action chunk's first + last frames. Defaults to 5000 — set
-    # to 0 to disable. Only fires when ``enable=True``, so runs without wandb
-    # are unaffected.
-    log_examples_freq: int = 5000
-    # Number of batch elements to include in each example dump.
-    log_examples_n: int = 4
-    # If True (default), also run ``policy.predict_action_chunk`` on the logged
-    # samples (in eval mode, no_grad) and add predicted vs ground-truth action
-    # columns to the table. Costs one extra forward pass per dump — negligible
-    # at the 5k-step default cadence. Set to ``False`` if your policy doesn't
-    # implement ``predict_action_chunk`` or you want to skip the extra forward.
-    log_examples_predict_actions: bool = True
-
-
-@dataclass
-class EMAConfig:
-    """Exponential Moving Average of trainable policy parameters.
-
-    Diffusion / flow-matching policies (Diffusion Policy, π0/π0.5,
-    pi052) benefit substantially from averaging late-training
-    parameter oscillations — see Chi et al. 2023 §V.D. The official
-    JAX openpi trainer ships EMA with ``ema_decay=0.99`` (default) and
-    ``0.999`` for its pi05_libero config; the openpi PyTorch port
-    explicitly lists EMA as unsupported, and LeRobot main inherited
-    that gap. Enabling this flag plugs ema-pytorch
-    (https://github.com/lucidrains/ema-pytorch) into the LeRobot
-    training loop with a shadow ``nn.Module`` clone of the policy.
-
-    Cost: 1× model params in fp32 shadow (~13 GB for pi052's 3.3B
-    params) + one elementwise update per training step (~1% step time).
-
-    On by default — matches openpi (JAX) which ships EMA on for every
-    config, and closes the gap with the openpi PyTorch port which
-    explicitly lists EMA as unsupported. Set ``--ema.enable=false`` to
-    disable for short runs / memory-constrained training where the
-    extra fp32 shadow copy is the bottleneck.
-    """
-
-    enable: bool = True
-    # Target EMA decay β in θ_ema ← β·θ_ema + (1-β)·θ_live (passed to
-    # ema-pytorch as ``beta``).
-    #   0.999  — last ~1000 steps; pi05_libero default in openpi
-    #   0.99   — last ~100 steps; openpi top-level default
-    #   0.75   — very fast EMA (Diffusion Policy original setting)
-    #   0.9999 — very slow EMA (long classification runs)
-    decay: float = 0.999
-    # Skip the first N calls to ``ema.update()``; during this window
-    # the shadow is just a hard copy of the live weights (no averaging).
-    # Lets early-training rapid changes settle before averaging begins.
-    # Maps to ema-pytorch's ``update_after_step`` (NOT a smooth decay
-    # ramp like older lerobot EMA implementations).
-    warmup_steps: int = 0
-    # When True, the periodic eval block uses the EMA shadow model
-    # directly (``ema.ema_model``) instead of the live policy. Standard
-    # practice for diffusion-style policies — eval scores are usually
-    # 1–3% higher than the live policy at the same step.
-    use_for_eval: bool = True
-    # When True, the periodic wandb training-example dump uses the EMA
-    # shadow for the optional predicted-action columns (so what you see
-    # in W&B matches eval behavior).
-    use_for_wandb_examples: bool = True


@dataclass
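
Review note on the removed `EMAConfig`: the comments above give the update rule θ_ema ← β·θ_ema + (1-β)·θ_live, the "last ~1000 steps" reading of β = 0.999, and the ~13 GB fp32 shadow for 3.3B parameters. The sketch below only reproduces that arithmetic in plain Python. It is not the ema-pytorch implementation, and the helper names (`ema_update`, `effective_window`, `shadow_size_gb`) are hypothetical.

```python
def ema_update(shadow, live, beta):
    """Return the new shadow values: beta * shadow + (1 - beta) * live, elementwise."""
    return [beta * s + (1.0 - beta) * w for s, w in zip(shadow, live)]


def effective_window(beta):
    """Rough number of recent steps that dominate the average: 1 / (1 - beta)."""
    return 1.0 / (1.0 - beta)


def shadow_size_gb(num_params, bytes_per_param=4):
    """Memory for one extra fp32 copy of the parameters, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


# One update step on a toy two-parameter "model".
shadow = ema_update([1.0, 2.0], [0.0, 4.0], beta=0.5)
```

Under these assumptions, `effective_window(0.999)` is about 1000 and `effective_window(0.99)` about 100, which matches the comments next to the removed `decay` field, and `shadow_size_gb(3.3e9)` is about 13.2.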
--- a/src/lerobot/configs/recipe.py
+++ b/src/lerobot/configs/recipe.py
@@ -147,16 +147,7 @@ class TrainingRecipe:
        return cls.from_dict(data)

    def _validate_message_recipe(self) -> None:
-        """Ensure every templated binding is known and the recipe supervises something.
-
-        A recipe is valid if it has at least one of:
-
-        * a ``target: true`` assistant turn (drives text-CE supervision), or
-        * a ``stream: low_level`` turn (drives flow / action supervision via
-          ``predict_actions=True``, even when no assistant turn is targeted —
-          e.g. π0.5-style ``low_level_execution`` where the action expert
-          conditions on a user-only ``${subtask}`` prompt).
-        """
+        """Ensure every templated binding is known and at least one turn is a target."""
        assert self.messages is not None
        known_bindings = set(DEFAULT_BINDINGS) | set(self.bindings or {}) | {"task"}

@@ -165,14 +156,8 @@ class TrainingRecipe:
            if missing:
                raise ValueError(f"MessageTurn references unknown binding(s): {sorted(missing)}")

-        has_target = any(turn.target for turn in self.messages)
-        has_low_level = any(turn.stream == "low_level" for turn in self.messages)
-        if not (has_target or has_low_level):
-            raise ValueError(
-                "Message recipes must contain at least one supervised turn — "
-                "either ``target: true`` (text CE) or ``stream: low_level`` "
-                "(flow/action loss)."
-            )
+        if not any(turn.target for turn in self.messages):
+            raise ValueError("Message recipes must contain at least one target turn.")

    def _validate_blend_recipe(self) -> None:
        """Ensure each blend component is a non-empty, weighted message recipe."""
--- a/src/lerobot/configs/recipes/subtask_mem.yaml
+++ b/src/lerobot/configs/recipes/subtask_mem.yaml
@@ -1,68 +0,0 @@
-# subtask_mem_vqa_speech — Hi-Robot blend + memory + spoken responses.
-#
-# Superset of subtasks_vqa.yaml. Keeps the core subtask + action + VQA
-# training, and adds two text-supervised tasks:
-#
-#   high_level_subtask         — predict the subtask from the task.
-#   low_level_execution        — flow loss with [images, subtask, state].
-#   memory_update              — compress progress into a memory note.
-#   user_interjection_response — reply to a user interjection with a
-#                                spoken `say` tool call (no plan, no
-#                                subtask text — just the spoken reply).
-#   ask_vqa_{top,wrist}        — camera-grounded VQA.
-#
-# Plan is intentionally left out — memory is the only persistent
-# high-level state here, keeping the prompt short.
-#
-# Requires the dataset to carry `memory`, `interjection` and `say`-tool
-# annotations (the annotation pipeline's memory + interjection modules)
-# in addition to `subtask` and `vqa`. Sub-recipes whose `if_present`
-# bindings are missing simply don't render for that sample, so a
-# dataset without interjections still trains the rest of the blend.
-#
-# Tool-call note: the `say` tool call on the interjection-response turn
-# is flattened to a `<say>...</say>` text marker by the tokenizer step
-# (`_flatten_say_tool_calls`) so the LM head learns to emit exactly the
-# marker the runtime parses back (`_split_plan_and_say`).
-
-blend:
-
-  high_level_subtask:
-    weight: 0.30
-    messages:
-      - {role: user, content: "${task}", stream: high_level}
-      - {role: assistant, content: "${subtask}", stream: high_level, target: true, if_present: subtask}
-
-  low_level_execution:
-    weight: 0.55
-    messages:
-      # The action expert is conditioned on the SUBTASK — at inference
-      # `HighLevelSubtaskFwd` generates it via the LM head and feeds it
-      # here. `stream: low_level` flips `predict_actions=True` so the
-      # flow loss fires; no text-CE target (subtask prediction is owned
-      # by `high_level_subtask`).
-      - {role: user, content: "${subtask}", stream: low_level, if_present: subtask}
-
-  memory_update:
-    # At inference, `MemoryUpdateFwd` is triggered only on
-    # `subtask_change` events (sparse). Training densely with
-    # `active_at` — i.e. on every frame inside a subtask interval,
-    # not just the boundary frame — supervises the same
-    # (prior_memory, completed_subtask) → current_memory mapping
-    # against varied observations within the interval. The model
-    # learns a stateless transformation; the *when* to emit lives in
-    # the inference trigger, not the model. Annotations only exist
-    # for ~1% of frames as boundary events, so `emitted_at` would
-    # waste 99% of the blend draws (and silently leak them into a
-    # task-conditioned fallback); `active_at` lifts the renderable
-    # rate to ~87% on this dataset.
-    weight: 0.15
-    bindings:
-      prior_memory: "nth_prev(style=memory, offset=1)"
-      current_memory: "active_at(t, style=memory)"
-      completed_subtask: "nth_prev(style=subtask, offset=1)"
-    messages:
-      - {role: user, content: "${task}", stream: high_level}
-      - {role: assistant, content: "Previous memory: ${prior_memory}", stream: high_level, if_present: prior_memory}
-      - {role: user, content: "Completed subtask: ${completed_subtask}", stream: high_level, if_present: completed_subtask}
-      - {role: assistant, content: "${current_memory}", stream: high_level, target: true, if_present: current_memory}
--- a/src/lerobot/configs/recipes/subtask_mem_vqa_robocasa.yaml
+++ b/src/lerobot/configs/recipes/subtask_mem_vqa_robocasa.yaml
@@ -1,99 +0,0 @@
-# subtask_mem_vqa_robocasa — Hi-Robot blend tuned for RoboCasa cameras.
-#
-# Same supervision as ``subtask_mem.yaml`` (subtask + memory) plus
-# camera-grounded VQA across the three RoboCasa camera keys produced
-# by ``slurm_build_robocasa_composite_seen.py``:
-#
-#   observation.images.robot0_agentview_left   (left scene view)
-#   observation.images.robot0_agentview_right  (right scene view)
-#   observation.images.robot0_eye_in_hand      (wrist)
-#
-# The annotation pipeline (``examples/annotations/run_hf_job.py``) emits
-# VQA per camera, so each anchor frame produces three (user, assistant)
-# rows tagged with their source camera. Each VQA sub-recipe consumes
-# the rows for one camera via ``camera=...`` resolver bindings.
-#
-# Spatial VQA targets (bbox / point) are rewritten from JSON to
-# PaliGemma ``<locDDDD>`` tokens by ``_messages_vqa_to_loc`` —
-# ``register_paligemma_loc_tokens`` already collapses them to single
-# detection-vocab ids so the LM head learns the pretrained pointing /
-# detection prior, not a 7-piece BPE salad.
-#
-# Interjections / spoken responses are intentionally absent — the
-# annotation job runs with ``--interjections.enabled=false``.
-
-blend:
-
-  high_level_subtask:
-    weight: 0.25
-    messages:
-      - {role: user, content: "${task}", stream: high_level}
-      - {role: assistant, content: "${subtask}", stream: high_level, target: true, if_present: subtask}
-
-  low_level_execution:
-    weight: 0.45
-    messages:
-      # Action expert is conditioned on the SUBTASK; at inference the
-      # high-level loop generates it via the LM head and feeds it here.
-      # ``stream: low_level`` flips ``predict_actions=True`` so the flow
-      # loss fires; subtask CE is owned by ``high_level_subtask``.
-      - {role: user, content: "${subtask}", stream: low_level, if_present: subtask}
-
-  memory_update:
-    # Trained densely with ``active_at`` — every frame inside a subtask
-    # interval — so the (prior_memory, completed_subtask) → current_memory
-    # mapping is supervised against varied observations. The *when* to
-    # emit lives in the inference trigger (subtask_change), not the
-    # model. See ``subtask_mem.yaml`` for the long version of this note.
-    weight: 0.15
-    bindings:
-      prior_memory: "nth_prev(style=memory, offset=1)"
-      current_memory: "active_at(t, style=memory)"
-      completed_subtask: "nth_prev(style=subtask, offset=1)"
-    messages:
-      - {role: user, content: "${task}", stream: high_level}
-      - {role: assistant, content: "Previous memory: ${prior_memory}", stream: high_level, if_present: prior_memory}
-      - {role: user, content: "Completed subtask: ${completed_subtask}", stream: high_level, if_present: completed_subtask}
-      - {role: assistant, content: "${current_memory}", stream: high_level, target: true, if_present: current_memory}
-
-  ask_vqa_agentview_left:
-    weight: 0.05
-    bindings:
-      vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.robot0_agentview_left)"
-      vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.robot0_agentview_left)"
-    messages:
-      - role: user
-        stream: high_level
-        if_present: vqa_query
-        content:
-          - {type: image, feature: observation.images.robot0_agentview_left}
-          - {type: text, text: "${vqa_query}"}
-      - {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa}
-
-  ask_vqa_agentview_right:
-    weight: 0.05
-    bindings:
-      vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.robot0_agentview_right)"
-      vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.robot0_agentview_right)"
-    messages:
-      - role: user
-        stream: high_level
-        if_present: vqa_query
-        content:
-          - {type: image, feature: observation.images.robot0_agentview_right}
-          - {type: text, text: "${vqa_query}"}
-      - {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa}
-
-  ask_vqa_wrist:
-    weight: 0.05
-    bindings:
-      vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.robot0_eye_in_hand)"
-      vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.robot0_eye_in_hand)"
-    messages:
-      - role: user
-        stream: high_level
-        if_present: vqa_query
-        content:
-          - {type: image, feature: observation.images.robot0_eye_in_hand}
-          - {type: text, text: "${vqa_query}"}
-      - {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa}
--- a/src/lerobot/configs/recipes/subtask_mem_vqa_speech.yaml
+++ b/src/lerobot/configs/recipes/subtask_mem_vqa_speech.yaml
@@ -1,114 +0,0 @@
-# subtask_mem_vqa_speech — Hi-Robot blend + memory + spoken responses.
-#
-# Superset of subtasks_vqa.yaml. Keeps the core subtask + action + VQA
-# training, and adds two text-supervised tasks:
-#
-#   high_level_subtask         — predict the subtask from the task.
-#   low_level_execution        — flow loss with [images, subtask, state].
-#   memory_update              — compress progress into a memory note.
-#   user_interjection_response — reply to a user interjection with a
-#                                spoken `say` tool call (no plan, no
-#                                subtask text — just the spoken reply).
-#   ask_vqa_{top,wrist}        — camera-grounded VQA.
-#
-# Plan is intentionally left out — memory is the only persistent
-# high-level state here, keeping the prompt short.
-#
-# Requires the dataset to carry `memory`, `interjection` and `say`-tool
-# annotations (the annotation pipeline's memory + interjection modules)
-# in addition to `subtask` and `vqa`. Sub-recipes whose `if_present`
-# bindings are missing simply don't render for that sample, so a
-# dataset without interjections still trains the rest of the blend.
-#
-# Tool-call note: the `say` tool call on the interjection-response turn
-# is flattened to a `<say>...</say>` text marker by the tokenizer step
-# (`_flatten_say_tool_calls`) so the LM head learns to emit exactly the
-# marker the runtime parses back (`_split_plan_and_say`).
-
-blend:
-
-  high_level_subtask:
-    weight: 0.25
-    messages:
-      - {role: user, content: "${task}", stream: high_level}
-      - {role: assistant, content: "${subtask}", stream: high_level, target: true, if_present: subtask}
-
-  low_level_execution:
-    weight: 0.40
-    messages:
-      # The action expert is conditioned on the SUBTASK — at inference
-      # `HighLevelSubtaskFwd` generates it via the LM head and feeds it
-      # here. `stream: low_level` flips `predict_actions=True` so the
-      # flow loss fires; no text-CE target (subtask prediction is owned
-      # by `high_level_subtask`).
-      - {role: user, content: "${subtask}", stream: low_level, if_present: subtask}
-
-  memory_update:
-    # At inference, `MemoryUpdateFwd` is triggered only on
-    # `subtask_change` events (sparse). Training densely with
-    # `active_at` — i.e. on every frame inside a subtask interval,
-    # not just the boundary frame — supervises the same
-    # (prior_memory, completed_subtask) → current_memory mapping
-    # against varied observations within the interval. The model
-    # learns a stateless transformation; the *when* to emit lives in
-    # the inference trigger, not the model. Annotations only exist
-    # for ~1% of frames as boundary events, so `emitted_at` would
-    # waste 99% of the blend draws (and silently leak them into the
-    # task-conditioned fallback); `active_at` lifts the renderable
-    # rate to ~87% on Hi-Robot-style datasets.
-    weight: 0.10
-    bindings:
-      prior_memory: "nth_prev(style=memory, offset=1)"
-      current_memory: "active_at(t, style=memory)"
-      completed_subtask: "nth_prev(style=subtask, offset=1)"
-    messages:
-      - {role: user, content: "${task}", stream: high_level}
-      - {role: assistant, content: "Previous memory: ${prior_memory}", stream: high_level, if_present: prior_memory}
-      - {role: user, content: "Completed subtask: ${completed_subtask}", stream: high_level, if_present: completed_subtask}
-      - {role: assistant, content: "${current_memory}", stream: high_level, target: true, if_present: current_memory}
-
-  user_interjection_response:
-    weight: 0.10
-    bindings:
-      interjection: "emitted_at(t, style=interjection)"
-      speech: "emitted_at(t, role=assistant, tool_name=say)"
-    messages:
-      - {role: user, content: "${task}", stream: high_level}
-      - {role: user, content: "${interjection}", stream: high_level, if_present: interjection}
-      # Spoken reply only: the assistant turn carries no text content,
-      # just a `say` tool call (`tool_calls_from: speech`). The chat
-      # tokenizer flattens it to a `<say>...</say>` marker, so the
-      # supervised target trains the model to respond to an
-      # interjection with a spoken acknowledgement.
-      - {role: assistant, stream: high_level, target: true, if_present: speech, tool_calls_from: speech}
-
-  # VQA is view-dependent — each camera gets its own sub-recipe so the
-  # resolver disambiguates via `camera=...`. Camera keys match
-  # subtasks_vqa.yaml (`front` + `wrist`); adjust to your dataset.
-  ask_vqa_top:
-    weight: 0.075
-    bindings:
-      vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.front)"
-      vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.front)"
-    messages:
-      - role: user
-        stream: high_level
-        if_present: vqa_query
-        content:
-          - {type: image, feature: observation.images.front}
-          - {type: text, text: "${vqa_query}"}
-      - {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa}
-
-  ask_vqa_wrist:
-    weight: 0.075
-    bindings:
-      vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.wrist)"
-      vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.wrist)"
-    messages:
-      - role: user
-        stream: high_level
-        if_present: vqa_query
-        content:
-          - {type: image, feature: observation.images.wrist}
-          - {type: text, text: "${vqa_query}"}
-      - {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa}
--- a/src/lerobot/configs/recipes/subtasks_vqa.yaml
+++ b/src/lerobot/configs/recipes/subtasks_vqa.yaml
@@ -1,61 +0,0 @@
-# subtasks_vqa — Hi-Robot blend for PI052 (PaliGemma backbone).
-#
-#   Trains two things only: subtasks and VQA. Plan and memory are
-#   intentionally left out — keeps the prompt short and the training
-#   surface small. The fuller blend with memory + spoken replies is
-#   ``subtask_mem_vqa_speech.yaml``.
-#
-#     high_level_subtask  — predict the subtask from the task.
-#     low_level_execution — flow loss with [images, subtask, state].
-#     ask_vqa_{top,wrist} — camera-grounded VQA.
-#
-# PI052's text tokenizer renders these messages as plain
-# ``Role: content`` text (PaliGemma is not chat-pretrained).
-
-blend:
-
-  high_level_subtask:
-    weight: 0.40
-    messages:
-      - {role: user, content: "${task}", stream: high_level}
-      - {role: assistant, content: "${subtask}", stream: high_level, target: true, if_present: subtask}
-
-  low_level_execution:
-    weight: 0.40
-    messages:
-      # The action expert is conditioned on the SUBTASK — at inference
-      # the high-level loop (``HighLevelSubtaskFwd``) generates the
-      # subtask via the LM head and feeds it here. The action expert's
-      # prefix is [images, subtask, state]. ``stream: low_level`` flips
-      # ``predict_actions=True`` so the flow loss fires; no text-CE
-      # target here (subtask prediction is owned by
-      # ``high_level_subtask``).
-      - {role: user, content: "${subtask}", stream: low_level, if_present: subtask}
-
-  ask_vqa_top:
-    weight: 0.10
-    bindings:
-      vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.front)"
-      vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.front)"
-    messages:
-      - role: user
-        stream: high_level
-        if_present: vqa_query
-        content:
-          - {type: image, feature: observation.images.front}
-          - {type: text, text: "${vqa_query}"}
-      - {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa}
-
-  ask_vqa_wrist:
-    weight: 0.10
-    bindings:
-      vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.wrist)"
-      vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.wrist)"
-    messages:
-      - role: user
-        stream: high_level
-        if_present: vqa_query
-        content:
-          - {type: image, feature: observation.images.wrist}
-          - {type: text, text: "${vqa_query}"}
-      - {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa}
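
Review note on the removed blend recipes: each sub-recipe carries a `weight`, and the renderer is described elsewhere in this diff as picking a deterministic blend component per `sample_idx`. The body of `_select_recipe` is not shown here, so the following is only a sketch of one way to make such a draw reproducible, borrowing the blake2b hashing idea from the removed `_render_vqa_if_present`. The function name `deterministic_pick` and the `salt` parameter are hypothetical.

```python
import hashlib


def deterministic_pick(weights, sample_idx, salt="blend"):
    """Pick a key from {name: weight}; the same sample_idx always yields the same key."""
    names = list(weights)
    if not names:
        return None
    values = [float(weights[name]) for name in names]
    # Assumption: fall back to a uniform draw when every weight is zero.
    if sum(values) == 0.0:
        values = [1.0] * len(names)
    total = sum(values)
    # Hash the sample index to a 64-bit integer, then scale it into [0, total).
    digest = hashlib.blake2b(f"{salt}:{sample_idx}".encode(), digest_size=8).digest()
    draw = int.from_bytes(digest, "big") / 2**64 * total
    cumulative = 0.0
    for name, value in zip(names, values):
        cumulative += value
        if draw < cumulative:
            return name
    return names[-1]  # guard against floating-point rounding at the upper edge


# Weights copied from the removed subtasks_vqa.yaml blend.
blend = {
    "high_level_subtask": 0.40,
    "low_level_execution": 0.40,
    "ask_vqa_top": 0.10,
    "ask_vqa_wrist": 0.10,
}
choice = deterministic_pick(blend, sample_idx=7)
```

Because the draw depends only on the salt and `sample_idx`, repeated calls for the same sample return the same sub-recipe. Unlike the removed helper, this sketch never gives a zero-weight entry a share when other weights are positive.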
--- a/src/lerobot/configs/train.py
+++ b/src/lerobot/configs/train.py
@@ -30,7 +30,7 @@ from lerobot.utils.hub import HubMixin
 from lerobot.utils.sample_weighting import SampleWeightingConfig

 from . import parser
-from .default import DatasetConfig, EMAConfig, EvalConfig, PeftConfig, WandBConfig
+from .default import DatasetConfig, EvalConfig, PeftConfig, WandBConfig
 from .policies import PreTrainedConfig
 from .rewards import RewardModelConfig

@@ -111,20 +111,9 @@ class TrainPipelineConfig(HubMixin):
    scheduler: LRSchedulerConfig | None = None
    eval: EvalConfig = field(default_factory=EvalConfig)
    wandb: WandBConfig = field(default_factory=WandBConfig)
-    ema: EMAConfig = field(default_factory=EMAConfig)
    peft: PeftConfig | None = None

-    # VQA oversampling. When set (a fraction in (0, 1)), the training
-    # dataloader uses a WeightedEpisodeAwareSampler that draws frames
-    # carrying a `vqa` language annotation often enough that they make
-    # up roughly this fraction of the training stream. VQA annotations
-    # are typically sparse, so without this they are underrepresented.
-    # `None` (default) keeps uniform episode-aware sampling.
-    vqa_target_fraction: float | None = None
-
-    # Sample weighting configuration (e.g., for RA-BC training). Old
-    # inline ``use_rabc`` / ``rabc_*`` params are migrated to this
-    # field by ``_migrate_legacy_rabc_keys`` above.
+    # Sample weighting configuration (e.g., for RA-BC training)
    sample_weighting: SampleWeightingConfig | None = None

    # Rename map for the observation to override the image and state keys
--- a/src/lerobot/datasets/init.py
+++ b/src/lerobot/datasets/init.py
@@ -35,6 +35,7 @@ from .dataset_tools import (
    remove_feature,
    split_dataset,
 )
+from .factory import make_dataset, resolve_delta_timestamps
 from .image_writer import safe_stop_image_writer
 from .io_utils import load_episodes, write_stats
 from .language import (
@@ -49,24 +50,11 @@ from .lerobot_dataset import LeRobotDataset
 from .multi_dataset import MultiLeRobotDataset
 from .pipeline_features import aggregate_pipeline_dataset_features, create_initial_features
 from .pyav_utils import check_video_encoder_parameters_pyav, detect_available_encoders_pyav
-from .sampler import EpisodeAwareSampler, WeightedEpisodeAwareSampler
+from .sampler import EpisodeAwareSampler
 from .streaming_dataset import StreamingLeRobotDataset
 from .utils import DEFAULT_EPISODES_PATH, create_lerobot_dataset_card
 from .video_utils import VideoEncodingManager

-
-def make_dataset(*args, **kwargs):
-    from .factory import make_dataset as _make_dataset
-
-    return _make_dataset(*args, **kwargs)
-
-
-def resolve_delta_timestamps(*args, **kwargs):
-    from .factory import resolve_delta_timestamps as _resolve_delta_timestamps
-
-    return _resolve_delta_timestamps(*args, **kwargs)
-
-
 # NOTE: Low-level I/O functions (cast_stats_to_numpy, get_parquet_file_size_in_mb, etc.)
 # and legacy migration constants are intentionally NOT re-exported here.
 # Import directly: ``from lerobot.datasets.io_utils import ...``
@@ -77,7 +65,6 @@ __all__ = [
    "DEFAULT_QUANTILES",
    "EVENT_ONLY_STYLES",
    "EpisodeAwareSampler",
-    "WeightedEpisodeAwareSampler",
    "LANGUAGE_EVENTS",
    "LANGUAGE_PERSISTENT",
    "LeRobotDataset",
--- a/src/lerobot/datasets/dataset_reader.py
+++ b/src/lerobot/datasets/dataset_reader.py
@@ -126,53 +126,10 @@ class DatasetReader:
    def _load_hf_dataset(self) -> datasets.Dataset:
        """hf_dataset contains all the observations, states, actions, rewards, etc."""
        features = get_hf_features_from_features(self._meta.features)
-        # Datasets annotated with the PR1 language columns may have been
-        # written without registering those columns in ``meta/info.json``
-        # (e.g. they predate ``CODEBASE_VERSION="v3.1"`` and were
-        # back-filled by ``lerobot-annotate``). Probe a single parquet
-        # shard and graft the column features on so the strict
-        # ``Dataset.from_parquet`` cast doesn't fail with
-        # ``column names don't match``.
-        features = self._extend_features_with_language_columns(features)
        hf_dataset = load_nested_dataset(self.root / "data", features=features, episodes=self.episodes)
        hf_dataset.set_transform(hf_transform_to_torch)
        return hf_dataset

-    def _extend_features_with_language_columns(
-        self, features: datasets.Features
-    ) -> datasets.Features:
-        """Add ``language_persistent`` / ``language_events`` to ``features``
-        when the underlying parquet shards declare them but the metadata
-        doesn't. No-op when neither column is present or both are
-        already registered.
-        """
-        # Find any one parquet to peek at; bail if there are none yet
-        # (the dataset will fail later for an unrelated reason and we
-        # want that error to surface as-is).
-        try:
-            sample = next((self.root / "data").glob("*/*.parquet"))
-        except StopIteration:
-            return features
-
-        from pyarrow import parquet as _pq  # noqa: PLC0415
-
-        schema_names = set(_pq.read_schema(sample).names)
-        from .language import (  # noqa: PLC0415
-            LANGUAGE_EVENTS,
-            LANGUAGE_PERSISTENT,
-            language_events_column_feature,
-            language_persistent_column_feature,
-        )
-
-        extra: dict[str, object] = {}
-        if LANGUAGE_PERSISTENT in schema_names and LANGUAGE_PERSISTENT not in features:
-            extra[LANGUAGE_PERSISTENT] = language_persistent_column_feature()
-        if LANGUAGE_EVENTS in schema_names and LANGUAGE_EVENTS not in features:
-            extra[LANGUAGE_EVENTS] = language_events_column_feature()
-        if not extra:
-            return features
-        return datasets.Features({**features, **extra})
-
    def _check_cached_episodes_sufficient(self) -> bool:
        """Check if the cached dataset contains all requested episodes and their video files."""
        if self.hf_dataset is None or len(self.hf_dataset) == 0:
--- a/src/lerobot/datasets/language.py
+++ b/src/lerobot/datasets/language.py
@@ -70,22 +70,8 @@ def _json_arrow_type() -> pa.DataType:


 def _json_feature() -> object:
-    """Return the HF feature used for tool-call payloads.
-
-    Older ``datasets`` versions do not expose ``datasets.Json``. The
-    annotation pipeline currently emits the canonical ``say`` tool call
-    shape, so use that explicit struct instead of falling back to a string
-    that cannot cast structured parquet values.
-    """
-    if hasattr(datasets, "Json"):
-        return datasets.Json()
-    return {
-        "type": datasets.Value("string"),
-        "function": {
-            "name": datasets.Value("string"),
-            "arguments": {"text": datasets.Value("string")},
-        },
-    }
+    """Return the HF ``datasets`` JSON feature, falling back to a string value."""
+    return datasets.Json() if hasattr(datasets, "Json") else datasets.Value("string")


 def language_persistent_row_arrow_type() -> pa.StructType:
--- a/src/lerobot/datasets/language_render.py
+++ b/src/lerobot/datasets/language_render.py
@@ -170,29 +170,6 @@ def render_sample(
    """
    persistent_rows = _normalize_rows(persistent or [])
    event_rows = _normalize_rows(events or [])
-
-    # VQA-priority routing. A ``vqa`` annotation is sparse and
-    # view-dependent; the plain weighted blend would (a) waste a draw
-    # whenever it picks an ``ask_vqa*`` sub-recipe for a frame that has
-    # no VQA, and (b) silently drop a VQA-annotated frame whenever it
-    # picks a non-VQA sub-recipe. So: if the blend has ``ask_vqa*``
-    # sub-recipes and *this* frame carries one of their VQA bindings,
-    # render VQA here regardless of the weighted draw. That makes VQA's
-    # recipe-side training share equal the VQA-annotation density (the
-    # maximum reachable without a dataset-level oversampling sampler).
-    if recipe.blend is not None:
-        vqa_rendered = _render_vqa_if_present(
-            recipe,
-            persistent=persistent_rows,
-            events=event_rows,
-            t=t,
-            sample_idx=sample_idx,
-            task=task,
-            dataset_ctx=dataset_ctx,
-        )
-        if vqa_rendered is not None:
-            return vqa_rendered
-
    selected_recipe = _select_recipe(recipe, sample_idx)
    bindings = _resolve_bindings(
        selected_recipe,
@@ -206,59 +183,6 @@ def render_sample(
    return _render_message_recipe(selected_recipe, bindings)


-def _render_vqa_if_present(
-    recipe: TrainingRecipe,
-    *,
-    persistent: Sequence[LanguageRow],
-    events: Sequence[LanguageRow],
-    t: float,
-    sample_idx: int,
-    task: str | None,
-    dataset_ctx: Any | None,
-) -> RenderedMessages | None:
-    """Render an ``ask_vqa*`` sub-recipe iff this frame carries a VQA
-    annotation; otherwise return ``None`` so the caller falls back to the
-    normal weighted blend.
-
-    When several VQA sub-recipes resolve (e.g. a frame annotated for more
-    than one camera), one is chosen deterministically by relative weight.
-    """
-    assert recipe.blend is not None
-    renderable: list[tuple[float, RenderedMessages]] = []
-    for name, component in recipe.blend.items():
-        if not name.startswith("ask_vqa"):
-            continue
-        bindings = _resolve_bindings(
-            component,
-            persistent=persistent,
-            events=events,
-            t=t,
-            sample_idx=sample_idx,
-            task=task,
-            dataset_ctx=dataset_ctx,
-        )
-        rendered = _render_message_recipe(component, bindings)
-        if rendered is not None:
-            renderable.append((float(component.weight or 0.0), rendered))
-
-    if not renderable:
-        return None
-    if len(renderable) == 1:
-        return renderable[0][1]
-
-    # Multiple cameras have a VQA for this frame — deterministic pick by
-    # relative weight (fall back to a uniform draw if all weights are 0).
-    total = sum(w for w, _ in renderable) or float(len(renderable))
-    digest = hashlib.blake2b(f"vqa:{sample_idx}".encode(), digest_size=8).digest()
-    draw = int.from_bytes(digest, "big") / 2**64 * total
-    cumulative = 0.0
-    for w, rendered in renderable:
-        cumulative += w or (total / len(renderable))
-        if draw < cumulative:
-            return rendered
-    return renderable[-1][1]
-
-
 def _select_recipe(recipe: TrainingRecipe, sample_idx: int) -> TrainingRecipe:
    """Pick a deterministic blend component for ``sample_idx`` (or return ``recipe``)."""
    if recipe.blend is None:
@@ -422,15 +346,7 @@ def _render_message_recipe(
        if turn.target:
            target_indices.append(message_idx)

-    # A render is meaningful if it supervises *something*: either a
-    # text-CE target turn, or a ``low_level`` stream turn (flow / action
-    # supervision — e.g. the flow-only ``low_level_execution`` recipe,
-    # ``user(${subtask})`` with ``stream: low_level`` and no target).
-    # Without this, a flow-only recipe renders to ``None`` every time
-    # the blend draws it → ``predict_actions`` is never True → the
-    # action expert never receives a flow loss.
-    has_low_level = any(stream == "low_level" for stream in streams)
-    if not target_indices and not has_low_level:
+    if not target_indices:
        return None

    rendered = {
@@ -487,10 +403,8 @@ def _validate_rendered(rendered: RenderedMessages) -> None:

    if len(streams) != len(messages):
        raise ValueError("message_streams must be aligned with messages.")
-    # Valid iff it supervises something: a text-CE target turn OR a
-    # ``low_level`` stream turn (flow / action supervision).
-    if not target_indices and not any(s == "low_level" for s in streams):
-        raise ValueError("Rendered samples must contain a target message or a low_level-stream message.")
+    if not target_indices:
+        raise ValueError("Rendered samples must contain at least one target message.")
    for idx in target_indices:
        if idx < 0 or idx >= len(messages):
            raise ValueError(f"Target message index {idx} is out of bounds.")
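
Review note on the supervision rule: with this change `_validate_message_recipe`, `_render_message_recipe` and `_validate_rendered` all require a `target` turn, so a flow-only turn list such as the removed `low_level_execution` recipe (one user turn with `stream: low_level`, no target) no longer counts as supervised. The sketch below contrasts the two rules. `Turn` is a hypothetical stand-in with only the fields needed here, not the project's `MessageTurn`.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """Hypothetical stand-in for a message turn; only the fields used below."""

    role: str
    stream: str = "high_level"
    target: bool = False


def validate_old(turns):
    """Removed rule: a target turn OR a low_level-stream turn is enough."""
    has_target = any(t.target for t in turns)
    has_low_level = any(t.stream == "low_level" for t in turns)
    if not (has_target or has_low_level):
        raise ValueError("Message recipes must contain at least one supervised turn.")


def validate_new(turns):
    """Rule after this diff: at least one target turn is required."""
    if not any(t.target for t in turns):
        raise ValueError("Message recipes must contain at least one target turn.")


# Shape of the removed low_level_execution recipe: one user turn, no target.
flow_only = [Turn(role="user", stream="low_level")]
validate_old(flow_only)  # accepted under the removed rule

try:
    validate_new(flow_only)
    flow_only_accepted = True
except ValueError:
    flow_only_accepted = False
```

Here `flow_only_accepted` ends up `False`: recipes that relied on a `low_level` stream alone need a target turn to pass the stricter check.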
--- a/src/lerobot/datasets/sampler.py
+++ b/src/lerobot/datasets/sampler.py
@@ -84,66 +84,3 @@ class EpisodeAwareSampler:

    def __len__(self) -> int:
        return len(self.indices)
-
-
-class WeightedEpisodeAwareSampler(EpisodeAwareSampler):
-    """``EpisodeAwareSampler`` that draws frames *with replacement* in
-    proportion to per-frame weights.
-
-    Used to oversample frames carrying a sparse annotation (e.g. a VQA
-    question) so the policy sees them more often than their natural
-    dataset density. One epoch still yields ``len(self.indices)``
-    samples — the weights only change the *composition* of the stream,
-    not its length. Each epoch re-draws, so the oversampled subset
-    varies run to run.
-    """
-
-    def __init__(
-        self,
-        dataset_from_indices: list[int],
-        dataset_to_indices: list[int],
-        frame_weights,
-        *,
-        episode_indices_to_use: list | None = None,
-        drop_n_first_frames: int = 0,
-        drop_n_last_frames: int = 0,
-    ):
-        """
-        Args:
-            dataset_from_indices: Episode start indices (see ``EpisodeAwareSampler``).
-            dataset_to_indices: Episode end indices.
-            frame_weights: 1-D sequence/tensor of non-negative weights, one per
-                dataset frame (length == total dataset frames). Higher weight ⇒
-                that frame is sampled more often.
-            episode_indices_to_use / drop_n_first_frames / drop_n_last_frames:
-                Same meaning as ``EpisodeAwareSampler`` — the episode-boundary
-                frame filtering is applied first, then weighting is restricted
-                to the surviving frames.
-        """
-        super().__init__(
-            dataset_from_indices,
-            dataset_to_indices,
-            episode_indices_to_use=episode_indices_to_use,
-            drop_n_first_frames=drop_n_first_frames,
-            drop_n_last_frames=drop_n_last_frames,
-            shuffle=False,
-        )
-        weights = torch.as_tensor(frame_weights, dtype=torch.double).flatten()
-        idx = torch.tensor(self.indices, dtype=torch.long)
-        if weights.numel() <= int(idx.max()):
-            raise ValueError(
-                f"frame_weights has {weights.numel()} entries but the sampler "
-                f"references frame index {int(idx.max())}."
-            )
-        selected = weights[idx]
-        if not torch.isfinite(selected).all() or bool((selected < 0).any()):
-            raise ValueError("frame_weights must be finite and non-negative.")
-        if float(selected.sum()) <= 0.0:
-            # All surviving frames have zero weight — fall back to uniform.
-            selected = torch.ones_like(selected)
-        self._weights = selected
-
-    def __iter__(self) -> Iterator[int]:
-        picks = torch.multinomial(self._weights, num_samples=len(self.indices), replacement=True)
-        for i in picks.tolist():
-            yield self.indices[i]
--- a/src/lerobot/datasets/utils.py
+++ b/src/lerobot/datasets/utils.py
@@ -366,24 +366,17 @@ def get_safe_version(repo_id: str, version: str | packaging.version.Version) ->
    hub_versions = get_repo_versions(repo_id)

    if not hub_versions:
-        msg = (
-            f"Repo {repo_id!r} has no codebase-version tags. The dataset "
-            f"either doesn't exist on the Hub yet, or it was uploaded "
-            f"without a ``v3.x``-style tag. To tag an existing dataset run:\n"
-            f"  from huggingface_hub import HfApi\n"
-            f"  HfApi().create_tag({repo_id!r}, tag='v3.0', repo_type='dataset', exist_ok=True)"
+        raise RevisionNotFoundError(
+            f"""Your dataset must be tagged with a codebase version.
+            Assuming _version_ is the codebase_version value in the info.json, you can run this:
+            ```python
+            from huggingface_hub import HfApi
+
+            hub_api = HfApi()
+            hub_api.create_tag("{repo_id}", tag="_version_", repo_type="dataset")
+            ```
+            """
        )
-        # ``RevisionNotFoundError`` extends ``HfHubHTTPError`` whose
-        # ``__init__`` indexes ``response.headers`` unconditionally on
-        # current ``huggingface_hub`` versions. Constructing it without
-        # a real ``Response`` object crashes with either
-        # ``TypeError: missing 1 required keyword-only argument`` (old
-        # builds) or ``AttributeError: 'NoneType' object has no attribute
-        # 'headers'`` (new builds). Skip that path entirely — this isn't
-        # really an HTTP error, it's a configuration issue — and raise a
-        # plain ``RuntimeError`` so the message actually reaches the
-        # caller.
-        raise RuntimeError(msg)

    if target_version in hub_versions:
        return f"v{target_version}"
--- a/src/lerobot/optim/optimizers.py
+++ b/src/lerobot/optim/optimizers.py
@@ -104,8 +104,6 @@ class AdamWConfig(OptimizerConfig):
    eps: float = 1e-8
    weight_decay: float = 1e-2
    grad_clip_norm: float = 10.0
-    foreach: bool | None = None
-    fused: bool | None = None

    def build(self, params: OptimizerParams) -> torch.optim.Optimizer:
        kwargs = asdict(self)
--- a/src/lerobot/policies/__init__.py
+++ b/src/lerobot/policies/__init__.py
@@ -24,7 +24,6 @@ from .multi_task_dit.configuration_multi_task_dit import MultiTaskDiTConfig as M
 from .pi0.configuration_pi0 import PI0Config as PI0Config
 from .pi0_fast.configuration_pi0_fast import PI0FastConfig as PI0FastConfig
 from .pi05.configuration_pi05 import PI05Config as PI05Config
-from .pi052.configuration_pi052 import PI052Config as PI052Config
 from .pretrained import PreTrainedPolicy as PreTrainedPolicy
 from .smolvla.configuration_smolvla import SmolVLAConfig as SmolVLAConfig
 from .tdmpc.configuration_tdmpc import TDMPCConfig as TDMPCConfig
@@ -48,7 +47,6 @@ __all__ = [
    "PI0Config",
    "PI0FastConfig",
    "PI05Config",
-    "PI052Config",
    "SmolVLAConfig",
    "TDMPCConfig",
    "VQBeTConfig",
--- a/src/lerobot/policies/factory.py
+++ b/src/lerobot/policies/factory.py
@@ -127,10 +127,6 @@ def get_policy_class(name: str) -> type[PreTrainedPolicy]:
        from .pi05.modeling_pi05 import PI05Policy

        return PI05Policy
-    elif name == "pi052":
-        from .pi052.modeling_pi052 import PI052Policy
-
-        return PI052Policy
    elif name == "gaussian_actor":
        from .gaussian_actor.modeling_gaussian_actor import GaussianActorPolicy

@@ -171,8 +167,8 @@ def make_policy_config(policy_type: str, **kwargs) -> PreTrainedConfig:

    Args:
        policy_type: The type of the policy. Supported types include "tdmpc",
-                     "multi_task_dit", "diffusion", "act", "vqbet", "pi0", "pi05",
-                     "pi052", "gaussian_actor", "smolvla", "wall_x".
+                     "multi_task_dit", "diffusion", "act", "vqbet", "pi0", "pi05", "gaussian_actor",
+                     "smolvla", "wall_x".
        **kwargs: Keyword arguments to be passed to the configuration class constructor.

    Returns:
@@ -195,10 +191,6 @@ def make_policy_config(policy_type: str, **kwargs) -> PreTrainedConfig:
        return PI0Config(**kwargs)
    elif policy_type == "pi05":
        return PI05Config(**kwargs)
-    elif policy_type == "pi052":
-        from .pi052.configuration_pi052 import PI052Config
-
-        return PI052Config(**kwargs)
    elif policy_type == "gaussian_actor":
        return GaussianActorConfig(**kwargs)
    elif policy_type == "smolvla":
@@ -239,12 +231,6 @@ class ProcessorConfigKwargs(TypedDict, total=False):
    preprocessor_overrides: dict[str, Any] | None
    postprocessor_overrides: dict[str, Any] | None
    dataset_stats: dict[str, dict[str, torch.Tensor]] | None
-    # Optional: HF Hub repo id of the dataset the policy is being
-    # trained on. Used by policies that auto-fit pieces of their
-    # preprocessing (e.g. pi052's FAST action tokenizer per
-    # Pertsch et al. 2025 [64], π0.5 §III.C). When omitted, those
-    # policies fall back to their universal pre-fitted tokenizers.
-    dataset_repo_id: str | None


 def make_pre_post_processors(
@@ -371,22 +357,6 @@ def make_pre_post_processors(
            dataset_stats=kwargs.get("dataset_stats"),
        )

-    elif policy_cfg.type == "pi052":
-        # NOTE: PI052Config subclasses PI05Config, so this branch MUST
-        # come before the PI05Config isinstance check below (otherwise
-        # pi052 would silently pick up π0.5's processor).
-        from .pi052.processor_pi052 import make_pi052_pre_post_processors
-
-        processors = make_pi052_pre_post_processors(
-            config=policy_cfg,
-            dataset_stats=kwargs.get("dataset_stats"),
-            # ``dataset_repo_id`` flows in via kwargs when FAST CE is
-            # enabled — the train loop sets it from ``--dataset.repo_id``.
-            # When ``None``, ``make_pi052_pre_post_processors`` skips
-            # the auto-fit and uses the universal tokenizer.
-            dataset_repo_id=kwargs.get("dataset_repo_id"),
-        )
-
    elif isinstance(policy_cfg, PI05Config):
        from .pi05.processor_pi05 import make_pi05_pre_post_processors

--- a/src/lerobot/policies/groot/groot_n1.py
+++ b/src/lerobot/policies/groot/groot_n1.py
@@ -178,6 +178,7 @@ N_COLOR_CHANNELS = 3


 # config
+@strict
 class GR00TN15Config(PretrainedConfig):
    model_type = "gr00t_n1_5"

--- a/src/lerobot/policies/pi05/configuration_pi05.py
+++ b/src/lerobot/policies/pi05/configuration_pi05.py
@@ -93,21 +93,6 @@ class PI05Config(PreTrainedConfig):
    optimizer_eps: float = 1e-8
    optimizer_weight_decay: float = 0.01
    optimizer_grad_clip_norm: float = 1.0
-    optimizer_foreach: bool | None = False
-    optimizer_fused: bool | None = True
-
-    # LM-head LR multiplier. The PaliGemma `lm_head` projection (and its
-    # tied `embed_tokens`) is the surface the LM head's first-token
-    # distribution depends on. With ``knowledge_insulation`` blocking
-    # action→VLM gradients, the LM head only sees gradients on text-CE
-    # samples — which can be a small fraction of the mix (e.g. ~45% in
-    # ``subtask_mem.yaml``). Under aggressive cosine LR decay the head's
-    # first-token distribution can drift back toward PaliGemma's
-    # pretrained ``<loc>`` detection prior, despite teacher-forced CE
-    # staying near zero. Boosting just the LM-head LR (e.g. 5x) keeps
-    # the head pinned to fine-tuning targets without perturbing the
-    # backbone / vision tower / action expert. Default 1.0 = no change.
-    lm_head_lr_scale: float = 1.0

    # Scheduler settings: see openpi `CosineDecaySchedule`
    # Note: These will auto-scale if --steps < scheduler_decay_steps
@@ -167,8 +152,6 @@ class PI05Config(PreTrainedConfig):
            eps=self.optimizer_eps,
            weight_decay=self.optimizer_weight_decay,
            grad_clip_norm=self.optimizer_grad_clip_norm,
-            foreach=self.optimizer_foreach,
-            fused=self.optimizer_fused,
        )

    def get_scheduler_preset(self):
--- a/src/lerobot/policies/pi05/modeling_pi05.py
+++ b/src/lerobot/policies/pi05/modeling_pi05.py
@@ -15,7 +15,6 @@
 # limitations under the License.

 import builtins
-import copy
 import logging
 import math
 from collections import deque
@@ -30,6 +29,7 @@ from lerobot.utils.import_utils import _transformers_available, require_package

 # Conditional import for type checking and lazy loading
 if TYPE_CHECKING or _transformers_available:
+    from transformers.cache_utils import DynamicCache
    from transformers.models.auto import CONFIG_MAPPING
    from transformers.models.gemma import modeling_gemma

@@ -41,6 +41,7 @@ if TYPE_CHECKING or _transformers_available:
    )
 else:
    CONFIG_MAPPING = None
+    DynamicCache = None
    modeling_gemma = None
    PiGemmaForCausalLM = None
    _gated_residual = None
@@ -138,6 +139,15 @@ def make_att_2d_masks(pad_masks, att_masks):  # see openpi `make_att_2d_masks` (
    return att_2d_masks & pad_2d_masks


+def clone_past_key_values(past_key_values):
+    """Clone the DynamicCache returned by prefix prefill for compiled denoising."""
+    return DynamicCache(
+        tuple(
+            (keys.clone(), values.clone(), sliding_window) for keys, values, sliding_window in past_key_values
+        )
+    )
+
+
 def pad_vector(vector, new_dim):
    """Pad the last dimension of a vector to new_dim with zeros.

@@ -223,53 +233,14 @@ def resize_with_pad_torch(  # see openpi `resize_with_pad_torch` (exact copy)
    return padded_images


-def sdpa_attention_forward(
-    module,
-    query: torch.Tensor,
-    key: torch.Tensor,
-    value: torch.Tensor,
-    attention_mask: torch.Tensor | None,
-    scaling: float,
-    dropout: float = 0.0,
-):
-    """Drop-in for ``modeling_gemma.eager_attention_forward`` using
-    ``torch.nn.functional.scaled_dot_product_attention``.
-
-    PyTorch SDPA picks the memory-efficient kernel for arbitrary additive
-    bias masks (the FA backend only accepts causal/sliding-window). On
-    H100 that is ~1.3-1.7x faster and uses ~30-40% less attention memory
-    than the eager softmax(QK^T)+matmul path. Mirrors eager's signature
-    and output shape (``(B, Lq, H, D)``) so call sites are unchanged.
-    """
-    n_rep = module.num_key_value_groups
-    if n_rep > 1:
-        key = key.repeat_interleave(n_rep, dim=1)
-        value = value.repeat_interleave(n_rep, dim=1)
-    if attention_mask is not None and attention_mask.dtype != query.dtype:
-        attention_mask = attention_mask.to(dtype=query.dtype)
-    attn_output = F.scaled_dot_product_attention(
-        query,
-        key,
-        value,
-        attn_mask=attention_mask,
-        dropout_p=dropout if module.training else 0.0,
-        is_causal=False,
-        scale=scaling,
-    )
-    return attn_output.transpose(1, 2).contiguous(), None
-
-
 # Define the complete layer computation function for gradient checkpointing
-def compute_layer_complete(
-    layer_idx, inputs_embeds, attention_mask, position_ids, adarms_cond, paligemma, gemma_expert
-):
-    models = [paligemma.model.language_model, gemma_expert.model]
+def compute_layer_complete(inputs_embeds, attention_mask, position_ids, adarms_cond, layers, rotary_emb):
    query_states = []
    key_states = []
    value_states = []
    gates = []
    for i, hidden_states in enumerate(inputs_embeds):
-        layer = models[i].layers[layer_idx]
+        layer = layers[i]
        hidden_states, gate = layernorm_forward(layer.input_layernorm, hidden_states, adarms_cond[i])
        gates.append(gate)
        input_shape = hidden_states.shape[:-1]
@@ -291,14 +262,16 @@ def compute_layer_complete(
        device=query_states.device,
        dtype=query_states.dtype,
    )
-    cos, sin = paligemma.model.language_model.rotary_emb(dummy_tensor, position_ids)
+    cos, sin = rotary_emb(dummy_tensor, position_ids)
    query_states, key_states = modeling_gemma.apply_rotary_pos_emb(
        query_states, key_states, cos, sin, unsqueeze_dim=1
    )
    batch_size = query_states.shape[0]
-    scaling = paligemma.model.language_model.layers[layer_idx].self_attn.scaling
-    att_output, _ = sdpa_attention_forward(
-        paligemma.model.language_model.layers[layer_idx].self_attn,
+    paligemma_layer = layers[0]
+    scaling = paligemma_layer.self_attn.scaling
+    # Attention computation
+    att_output, _ = modeling_gemma.eager_attention_forward(
+        paligemma_layer.self_attn,
        query_states,
        key_states,
        value_states,
@@ -306,13 +279,13 @@ def compute_layer_complete(
        scaling,
    )
    # Get head_dim from the current layer, not from the model
-    head_dim = paligemma.model.language_model.layers[layer_idx].self_attn.head_dim
+    head_dim = paligemma_layer.self_attn.head_dim
    att_output = att_output.reshape(batch_size, -1, 1 * 8 * head_dim)
    # Process layer outputs
    outputs_embeds = []
    start_pos = 0
    for i, hidden_states in enumerate(inputs_embeds):
-        layer = models[i].layers[layer_idx]
+        layer = layers[i]
        end_pos = start_pos + hidden_states.shape[1]
        if att_output.dtype != layer.self_attn.o_proj.weight.dtype:
            att_output = att_output.to(layer.self_attn.o_proj.weight.dtype)
@@ -444,7 +417,6 @@ class PaliGemmaWithExpertModel(
        params_to_keep_float32 = [
            "vision_tower",
            "multi_modal_projector",
-            "lm_head",
            "input_layernorm",
            "post_attention_layernorm",
            "model.norm",
@@ -477,13 +449,13 @@ class PaliGemmaWithExpertModel(
        if image.dtype != torch.float32:
            image = image.to(torch.float32)
        image_outputs = self.paligemma.model.get_image_features(image)
-        features = image_outputs.pooler_output * self.paligemma.config.text_config.hidden_size**0.5
+        features = image_outputs.pooler_output
        if features.dtype != out_dtype:
            features = features.to(out_dtype)
        return features

    def embed_language_tokens(self, tokens: torch.Tensor):
-        return self.paligemma.model.language_model.embed_tokens(tokens)
+        return self.paligemma.model.language_model.get_input_embeddings()(tokens)

    def forward(
        self,
@@ -521,8 +493,9 @@ class PaliGemmaWithExpertModel(
            prefix_output = None
            prefix_past_key_values = None
        else:
-            models = [self.paligemma.model.language_model, self.gemma_expert.model]
-            num_layers = self.paligemma.config.text_config.num_hidden_layers
+            paligemma_layers = self.paligemma.model.language_model.layers
+            gemma_expert_layers = self.gemma_expert.model.layers
+            rotary_emb = self.paligemma.model.language_model.rotary_emb

            # Check if gradient checkpointing is enabled for any of the models
            use_gradient_checkpointing = (
@@ -532,36 +505,39 @@ class PaliGemmaWithExpertModel(
            ) or (hasattr(self, "gradient_checkpointing") and self.gradient_checkpointing and self.training)

            # Process all layers with gradient checkpointing if enabled
-            for layer_idx in range(num_layers):
+            for layers in zip(paligemma_layers, gemma_expert_layers, strict=True):
                if use_gradient_checkpointing:
                    inputs_embeds = torch.utils.checkpoint.checkpoint(
                        compute_layer_complete,
-                        layer_idx,
                        inputs_embeds,
                        attention_mask,
                        position_ids,
                        adarms_cond,
                        use_reentrant=False,
                        preserve_rng_state=False,
-                        paligemma=self.paligemma,
-                        gemma_expert=self.gemma_expert,
+                        layers=layers,
+                        rotary_emb=rotary_emb,
                    )
                else:
                    inputs_embeds = compute_layer_complete(
-                        layer_idx,
                        inputs_embeds,
                        attention_mask,
                        position_ids,
                        adarms_cond,
-                        paligemma=self.paligemma,
-                        gemma_expert=self.gemma_expert,
+                        layers=layers,
+                        rotary_emb=rotary_emb,
                    )

            # final norm
+            final_norms = (
+                self.paligemma.model.language_model.norm,
+                self.gemma_expert.model.norm,
+            )
+
            def compute_final_norms(inputs_embeds, adarms_cond):
                outputs_embeds = []
                for i, hidden_states in enumerate(inputs_embeds):
-                    out_emb, _ = layernorm_forward(models[i].norm, hidden_states, adarms_cond[i])
+                    out_emb, _ = layernorm_forward(final_norms[i], hidden_states, adarms_cond[i])
                    outputs_embeds.append(out_emb)
                return outputs_embeds

@@ -653,13 +629,10 @@ class PI05Pytorch(nn.Module):  # see openpi `PI0Pytorch`
            )
        return func(*args, **kwargs)

-    def _prepare_attention_masks_4d(self, att_2d_masks, dtype=None):
+    def _prepare_attention_masks_4d(self, att_2d_masks):
        """Helper method to prepare 4D attention masks for transformer."""
        att_2d_masks_4d = att_2d_masks[:, None, :, :]
-        result = torch.where(att_2d_masks_4d, 0.0, OPENPI_ATTENTION_MASK_VALUE)
-        if dtype is not None:
-            result = result.to(dtype=dtype)
-        return result
+        return torch.where(att_2d_masks_4d, 0.0, OPENPI_ATTENTION_MASK_VALUE)

    def sample_noise(self, shape, device):
        return torch.normal(
@@ -701,8 +674,7 @@ class PI05Pytorch(nn.Module):  # see openpi `PI0Pytorch`
        # Process language tokens
        def lang_embed_func(tokens):
            lang_emb = self.paligemma_with_expert.embed_language_tokens(tokens)
-            lang_emb_dim = lang_emb.shape[-1]
-            return lang_emb * math.sqrt(lang_emb_dim)
+            return lang_emb

        lang_emb = self._apply_checkpoint(lang_embed_func, tokens)
        embs.append(lang_emb)
@@ -789,22 +761,21 @@ class PI05Pytorch(nn.Module):  # see openpi `PI0Pytorch`
        att_2d_masks = make_att_2d_masks(pad_masks, att_masks)
        position_ids = torch.cumsum(pad_masks, dim=1) - 1

-        att_2d_masks_4d = self._prepare_attention_masks_4d(att_2d_masks, dtype=prefix_embs.dtype)
+        att_2d_masks_4d = self._prepare_attention_masks_4d(att_2d_masks)

-        # Selective AC: rely on the per-layer checkpoint inside
-        # ``PaliGemmaWithExpertModel.forward`` (which wraps each
-        # transformer block individually). The previous outer
-        # ``_apply_checkpoint(forward_func, ...)`` doubled up — it
-        # re-ran the full backbone forward during backward *and* each
-        # block's own checkpoint re-ran during that recompute. Pure
-        # waste with SDPA, which already streams attention activations.
-        (_, suffix_out), _ = self.paligemma_with_expert.forward(
-            attention_mask=att_2d_masks_4d,
-            position_ids=position_ids,
-            past_key_values=None,
-            inputs_embeds=[prefix_embs, suffix_embs],
-            use_cache=False,
-            adarms_cond=[None, adarms_cond],
+        def forward_func(prefix_embs, suffix_embs, att_2d_masks_4d, position_ids, adarms_cond):
+            (_, suffix_out), _ = self.paligemma_with_expert.forward(
+                attention_mask=att_2d_masks_4d,
+                position_ids=position_ids,
+                past_key_values=None,
+                inputs_embeds=[prefix_embs, suffix_embs],
+                use_cache=False,
+                adarms_cond=[None, adarms_cond],
+            )
+            return suffix_out
+
+        suffix_out = self._apply_checkpoint(
+            forward_func, prefix_embs, suffix_embs, att_2d_masks_4d, position_ids, adarms_cond
        )

        suffix_out = suffix_out[:, -self.config.chunk_size :]
@@ -848,9 +819,7 @@ class PI05Pytorch(nn.Module):  # see openpi `PI0Pytorch`
        prefix_att_2d_masks = make_att_2d_masks(prefix_pad_masks, prefix_att_masks)
        prefix_position_ids = torch.cumsum(prefix_pad_masks, dim=1) - 1

-        prefix_att_2d_masks_4d = self._prepare_attention_masks_4d(
-            prefix_att_2d_masks, dtype=prefix_embs.dtype
-        )
+        prefix_att_2d_masks_4d = self._prepare_attention_masks_4d(prefix_att_2d_masks)
        self.paligemma_with_expert.paligemma.model.language_model.config._attn_implementation = "eager"  # noqa: SLF001

        _, past_key_values = self.paligemma_with_expert.forward(
@@ -920,12 +889,10 @@ class PI05Pytorch(nn.Module):  # see openpi `PI0Pytorch`
        prefix_offsets = torch.sum(prefix_pad_masks, dim=-1)[:, None]
        position_ids = prefix_offsets + torch.cumsum(suffix_pad_masks, dim=1) - 1

-        full_att_2d_masks_4d = self._prepare_attention_masks_4d(
-            full_att_2d_masks, dtype=suffix_embs.dtype
-        )
+        full_att_2d_masks_4d = self._prepare_attention_masks_4d(full_att_2d_masks)
        self.paligemma_with_expert.gemma_expert.model.config._attn_implementation = "eager"  # noqa: SLF001

-        past_key_values = copy.deepcopy(past_key_values)
+        past_key_values = clone_past_key_values(past_key_values)
        outputs_embeds, _ = self.paligemma_with_expert.forward(
            attention_mask=full_att_2d_masks_4d,
            position_ids=position_ids,
@@ -1060,16 +1027,6 @@ class PI05Policy(PreTrainedPolicy):
            if remap_count > 0:
                print(f"Remapped {remap_count} state dict keys")

-            lm_head_key = "model.paligemma_with_expert.paligemma.lm_head.weight"
-            embed_tokens_key = (
-                "model.paligemma_with_expert.paligemma.model.language_model.embed_tokens.weight"
-            )
-            if lm_head_key not in remapped_state_dict and embed_tokens_key in remapped_state_dict:
-                remapped_state_dict[lm_head_key] = remapped_state_dict[embed_tokens_key].clone().float()
-                print("Initialized PaliGemma lm_head from language token embeddings")
-            elif lm_head_key in remapped_state_dict:
-                remapped_state_dict[lm_head_key] = remapped_state_dict[lm_head_key].float()
-
            # Load the remapped state dict into the model
            missing_keys, unexpected_keys = model.load_state_dict(remapped_state_dict, strict=strict)

@@ -1163,62 +1120,8 @@ class PI05Policy(PreTrainedPolicy):

        return fixed_state_dict

-    def get_optim_params(self):
-        """Return policy parameters, optionally split into LR-scaled groups.
-
-        When ``config.lm_head_lr_scale != 1.0``, the PaliGemma ``lm_head``
-        and its tied ``embed_tokens`` are placed in their own param
-        group with ``lr = base_lr * lm_head_lr_scale``. The cosine
-        scheduler multiplies both groups by the same lambda each step,
-        so the ratio is preserved across decay. Default ``1.0`` =
-        return ``self.parameters()`` (back-compat with existing checkpoints
-        and configs).
-        """
-        scale = float(getattr(self.config, "lm_head_lr_scale", 1.0))
-        if scale == 1.0:
-            return self.parameters()
-        head_params: list[torch.nn.Parameter] = []
-        other_params: list[torch.nn.Parameter] = []
-        # Both ``lm_head.weight`` and the tied ``embed_tokens.weight`` —
-        # boosting only the projection without the embedding pulls them
-        # apart and breaks the tie that PaliGemma was pre-trained with.
-        head_substrings = (
-            "paligemma_with_expert.paligemma.lm_head.",
-            "paligemma_with_expert.paligemma.model.language_model.embed_tokens.",
-        )
-        for name, p in self.named_parameters():
-            if not p.requires_grad:
-                continue
-            if any(s in name for s in head_substrings):
-                head_params.append(p)
-            else:
-                other_params.append(p)
-        base_lr = float(self.config.optimizer_lr)
-        groups: list[dict[str, object]] = []
-        if other_params:
-            groups.append({"params": other_params, "lr": base_lr, "name": "policy"})
-        if head_params:
-            groups.append(
-                {"params": head_params, "lr": base_lr * scale, "name": "lm_head"}
-            )
-        # Sanity: head_substrings must match at least one parameter, otherwise
-        # the scale silently does nothing — surface that fast.
-        if not head_params:
-            raise RuntimeError(
-                "lm_head_lr_scale != 1.0 but no parameters matched the LM-head "
-                "name patterns: "
-                f"{head_substrings!r}. Did the underlying PaliGemma module rename?"
-            )
-        logging.info(
-            "PI05Policy: LM-head LR scale = %.3g (base=%.3g, head=%.3g) over "
-            "%d head params + %d other params",
-            scale,
-            base_lr,
-            base_lr * scale,
-            len(head_params),
-            len(other_params),
-        )
-        return groups
+    def get_optim_params(self) -> dict:
+        return self.parameters()

    def reset(self):
        """Reset internal state - called when environment resets."""
--- a/src/lerobot/policies/pi052/__init__.py
+++ b/src/lerobot/policies/pi052/__init__.py
@@ -1,42 +0,0 @@
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-"""π0.5 v2 — full reproduction of the π0.5 paper's hierarchical
-inference recipe on lerobot.
-
-Extends :class:`lerobot.policies.pi05.PI05Policy` with:
-
-* recipe-driven training (PR 1's :class:`RenderMessagesStep`),
-* PaliGemma ``lm_head`` cross-entropy on supervised subtask spans
-  (the "high-level subtask prediction" of the paper, §IV.D),
-* AR text generation at inference (:meth:`PI052Policy.select_message`),
-* per-component prompt dropout (Pi 0.7 §V.E) for regularising the
-  text head against missing context at inference.
-
-See ``src/lerobot/configs/recipes/subtasks_vqa.yaml`` for the
-canonical training recipe and
-``examples/training/pi052_hirobot.slurm`` for the launcher.
-"""
-
-from .configuration_pi052 import PI052Config
-from .modeling_pi052 import PI052Policy
-from .processor_pi052 import make_pi052_pre_post_processors
-from .text_processor_pi052 import PI052TextTokenizerStep
-
-__all__ = [
-    "PI052Config",
-    "PI052Policy",
-    "PI052TextTokenizerStep",
-    "make_pi052_pre_post_processors",
-]
--- a/src/lerobot/policies/pi052/configuration_pi052.py
+++ b/src/lerobot/policies/pi052/configuration_pi052.py
@@ -1,208 +0,0 @@
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-"""π0.5 v2 (with text head) — reproduction of the π0.5 paper's
-hierarchical inference recipe.
-
-Same architecture as the existing ``PI05Policy`` (PaliGemma 2B VLM +
-~300M Gemma action expert, joint training with FAST tokens during
-pre-train and flow matching during post-train), but with the
-PaliGemma ``lm_head`` re-enabled so the same model can be supervised
-to predict both:
-
-  * **subtask strings** at the high level (cross-entropy on the LM
-    head), and
-  * **action chunks** at the low level (flow matching on the
-    action-expert tokens).
-
-This is the dual-head co-training pattern from the paper:
-
-    L = H(x, f_θ_text) + α * ‖ω - a - f_θ_action(a_τ, o, ℓ)‖²
-
-with α = 10.0 per § IV.D of arxiv:2504.16054. The π0.5 model splits
-inference into a text-prediction step followed by an action-prediction
-step, which the multi-rate ``PI052Runtime`` (in
-``lerobot.policies.pi052.inference``) drives at separate rates.
-"""
-
-from dataclasses import dataclass
-
-from lerobot.configs import PreTrainedConfig
-
-from ..pi05.configuration_pi05 import PI05Config
-
-
-@PreTrainedConfig.register_subclass("pi052")
-@dataclass
-class PI052Config(PI05Config):
-    """π0.5 with the PaliGemma LM head re-enabled for subtask prediction.
-
-    Recipe-driven dual-head training: the flow head supervises actions,
-    the LM head supervises subtask / plan / memory / VQA text. The
-    flow:text loss split is the milder 5:1 (see ``flow_loss_weight``).
-    """
-
-    # Recipe / language stack ---------------------------------------------
-    recipe_path: str | None = "recipes/subtasks_vqa.yaml"
-    """Path (absolute or relative to ``src/lerobot/configs/``) to a
-    ``TrainingRecipe`` YAML. Defaults to the canonical Hi-Robot blend
-    shipped alongside this policy. Set to ``None`` to disable recipe
-    rendering and fall back to π0.5's single-task ``Task: ... Action:``
-    prompt path (unannotated datasets keep working that way)."""
-
-    apply_chat_template: bool = False
-    """PaliGemma is *not* chat-pretrained — its tokenizer doesn't ship a
-    chat template, so we don't apply one. The recipe renderer's output
-    is concatenated as a plain prefix + assistant suffix instead,
-    mirroring how the π0.5 paper's high-level inference samples text
-    auto-regressively after the prefix."""
-
-    # Loss weights --------------------------------------------------------
-    # Paper §IV.D uses α=10 between the flow and text terms, assuming
-    # text is a rare auxiliary task. With the recipe stack the flow-only
-    # `low_level` branch fires on a large share of samples, so α=10
-    # swamps the LM head and collapses generation into degenerate
-    # repetition. We use the milder 5:1 split here.
-    text_loss_weight: float = 1.0
-    """Weight on the LM-head cross-entropy term. Set to ``0`` to disable
-    text training entirely (reverts to flow-only / π0.5 behaviour)."""
-
-    flow_loss_weight: float = 5.0
-    """Weight on the action-expert flow-matching term. ``5.0`` — a milder
-    flow:text split than the paper's α=10, since the flow-only
-    ``low_level`` recipe already gives the action expert frequent
-    gradient. Lower it further if the LM head still underfits."""
-
-    # Backbone training ---------------------------------------------------
-    unfreeze_lm_head: bool = True
-    """Whether to keep the PaliGemma ``lm_head`` unfrozen for fine-tuning.
-    The existing ``PI05Policy`` zeroes / freezes the head on load
-    because it never reads from it. Must be ``True`` for π0.5-style
-    hierarchical inference."""
-
-    # Per-component prompt dropout (Pi0.7 §V.E) ---------------------------
-    # Randomly drop non-target context messages so the LM head learns
-    # to handle a missing or stale plan / memory at inference. Each
-    # probability defaults to 0.0, so behaviour is unchanged until
-    # dropout is explicitly enabled.
-    plan_dropout_prob: float = 0.0
-    memory_dropout_prob: float = 0.0
-    subtask_dropout_prob: float = 0.0
-
-    # FAST discrete-action supervision — paper §III.B-C ------------------
-    # When enabled, actions are *also* tokenised via the FAST tokenizer
-    # ("physical-intelligence/fast") and supervised with cross-entropy
-    # on the PaliGemma LM head — exactly as in the paper's pre-training
-    # objective (Eq. 1 mixes FAST CE + flow MSE + subtask CE). The
-    # ActionTokenizerProcessorStep is wired into the preprocessor
-    # pipeline when this flag is set; the loss is computed in
-    # PI052Policy.forward.
-    enable_fast_action_loss: bool = True
-    """If True, tokenise actions with the FAST tokenizer and add a
-    cross-entropy loss on the LM head. On by default to match the
-    π0.5 paper's three-loss objective (text CE + FAST CE + flow MSE,
-    §III.B-C Eq. 1). Set to False if you only want the
-    post-training-style flow + text recipe."""
-
-    action_tokenizer_name: str = "physical-intelligence/fast"
-    """HF identifier for the FAST action tokenizer."""
-
-    max_action_tokens: int = 256
-    """Maximum number of FAST tokens per action chunk."""
-
-    fast_skip_tokens: int = 128
-    """Number of low-vocab tokens the FAST tokenizer skips to avoid
-    collisions with PaliGemma's text vocabulary."""
-
-    fast_action_loss_weight: float = 1.0
-    """Weight on the FAST-action-token CE loss. Paper §III.C uses 1.0."""
-
-    auto_fit_fast_tokenizer: bool = False
-    """If True, the processor factory checks ``fast_tokenizer_cache_dir``
-    for a previously fitted tokenizer keyed on ``(dataset_repo_id,
-    base_tokenizer_name, fit_samples, chunk_size)``. On cache miss, it loads
-    ``action_tokenizer_name`` as a base, samples
-    ``fast_tokenizer_fit_samples`` action chunks from the dataset, runs
-    ``.fit()``, saves the result, and uses *that* fitted path as the
-    actual tokenizer. Pertsch et al. 2025 (FAST paper [64], π0.5 §III.C)
-    explicitly recommend per-dataset fitting for best compression.
-
-    Off by default because the fit requires a separate pass over the
-    dataset before training (~1-2 min on a medium dataset) and depends
-    on the FAST tokenizer snapshot having a ``.fit()`` method. Opt in
-    when you want paper-faithful compression; leave off to fall back
-    on the universal ``physical-intelligence/fast`` codebook."""
-
-    fast_tokenizer_cache_dir: str = "~/.cache/lerobot/fast_tokenizers"
-    """Where fitted FAST tokenizers are stored. ``~`` expands."""
-
-    fast_tokenizer_fit_samples: int = 1024
-    """Number of action chunks to sample for the fit. The FAST paper uses
-    a few thousand; 1024 is a reasonable default for medium datasets."""
-
-    # Knowledge insulation — paper §III.B --------------------------------
-    # When enabled, gradients from the action expert's flow loss are
-    # blocked from flowing back into the VLM's K/V projections. This
-    # prevents the action loss from over-fitting the language backbone
-    # to robot-specific features. Implemented in ``modeling_pi052`` as
-    # a per-instance monkey-patch on ``paligemma_with_expert.forward``
-    # that splits queries into VLM and action halves and ``.detach()``-s
-    # the VLM K/V tensors used in the action-half's attention.
-    knowledge_insulation: bool = False
-    """If True, route every transformer layer through the KI
-    attention path that blocks action→VLM gradient flow on K/V."""
-
-    # Learning-rate defaults --------------------------------------------
-    # pi052 inherits π0.5's openpi-validated optimizer config (peak LR
-    # 2.5e-5, cosine→2.5e-6, 1k warmup, AdamW (0.9, 0.95), wd=0.01,
-    # grad_clip=1.0). The only place pi052 needs to diverge from pi05
-    # is the LM-head LR multiplier: pi05 has no text supervision so the
-    # head doesn't get gradients; pi052 always has text supervision
-    # (subtask / memory / VQA) via the recipe, and under KI the LM head
-    # only sees gradients on ~30–45% of the batch (the text-CE mask
-    # share of the recipe). Under aggressive cosine decay this is too
-    # weak to keep the head pinned, so it drifts back toward PaliGemma's
-    # pretrained ``<loc>`` first-token bias. 5x is the documented fix
-    # (see ``PI05Config.lm_head_lr_scale`` docstring); the wiring is
-    # already in ``PI05Policy.get_optim_params`` — it splits the LM head
-    # + tied ``embed_tokens`` into their own param group while sharing
-    # the same cosine lambda, so the 5x ratio is preserved across decay.
-    lm_head_lr_scale: float = 5.0
-
-    # PaLM-style z-loss on text CE. Penalises the log-partition function
-    # ``z = log Σ exp(logits)`` drifting away from zero — without it, large-
-    # vocab models (PaliGemma is 257k) can let ``logsumexp`` grow unbounded
-    # while CE stays low, because a uniform additive logit bias cancels in
-    # softmax. PaLM appendix B / Chinchilla report z-loss is essential for
-    # stable large-vocab CE; it especially helps under ``lm_head_lr_scale=
-    # 5.0`` which amplifies drift risk on the LM head. ``1e-4`` is the
-    # commonly cited weight; set 0 to disable entirely.
-    text_ce_z_loss_weight: float = 1e-4
-
-    # Liger Triton kernels (rope + geglu + layer_norm) are now patched
-    # unconditionally at model build time — see ``_enable_hf_kernels``
-    # in ``modeling_pi052``. The patch is process-global, idempotent
-    # and degrades gracefully if ``liger-kernel`` is missing. Measured
-    # at -4.5% step time on H100 (bench job 22161421); peak memory
-    # unchanged. ``fused_linear_cross_entropy`` ships separately via
-    # ``_shifted_lin_ce`` / ``_fast_lin_ce``.
-
-    def __post_init__(self) -> None:
-        super().__post_init__()
-        # Backbone needs gradients flowing through the text head when
-        # we're training it. Override the π0.5 default
-        # (``train_expert_only=True``) unless the user opts out of text
-        # training via ``text_loss_weight=0`` or ``unfreeze_lm_head=False``.
-        if self.text_loss_weight > 0 and self.unfreeze_lm_head:
-            self.train_expert_only = False
--- a/src/lerobot/policies/pi052/fit_fast_tokenizer.py
+++ b/src/lerobot/policies/pi052/fit_fast_tokenizer.py
@@ -1,263 +0,0 @@
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-"""Dataset-specific FAST action tokenizer fitting.
-
-The published ``physical-intelligence/fast`` tokenizer is a *universal*
-codebook fitted on a heterogeneous mix of robot datasets. Per Pertsch
-et al. 2025 (the FAST paper, [64] in the π0.5 paper) and §III.C of
-π0.5 itself, the recommended practice is to **finetune the tokenizer on
-your specific dataset's action distribution** before training the
-policy — same way one would adapt a language tokenizer to a domain
-corpus. Without this finetune step, action sequences from your robot
-may require more tokens per chunk than necessary, lowering effective
-compression and slowing convergence of the action-CE loss.
-
-This module provides a single utility, :func:`fit_fast_tokenizer`,
-that does the finetune. The training entry point invokes it
-automatically when the policy's ``enable_fast_action_loss`` and
-``auto_fit_fast_tokenizer`` flags are both ``True`` and no cached
-fitted tokenizer is found at ``fast_tokenizer_cache_dir``.
-
-The fitted tokenizer is saved to ``{cache_dir}/{signature}/``, where
-``signature`` hashes the dataset id, base tokenizer, sample count and
-chunk size, so successive training runs with the same settings re-use it.
-"""
-
-from __future__ import annotations
-
-import hashlib
-import logging
-import os
-import time
-from pathlib import Path
-
-import numpy as np
-
-logger = logging.getLogger(__name__)
-
-# Marker file the cache-hit check looks for. ``ProcessorMixin.save_pretrained``
-# writes ``processor_config.json`` (NOT ``preprocessor_config.json`` —
-# that's the image / feature-extractor convention). Centralised here so
-# the cache-hit check and the rank-N readiness wait agree on the same
-# sentinel.
-_CACHE_SENTINEL = "processor_config.json"
-
-
-def _dataset_signature(
-    dataset_repo_id: str,
-    base_tokenizer_name: str,
-    n_samples: int,
-    chunk_size: int,
-) -> str:
-    """Deterministic short hash for naming the cache directory.
-
-    Keys on (dataset, base tokenizer, sample count, chunk size) so any
-    of those changing re-runs the fit. ``chunk_size`` matters because
-    the tokenizer is fit on chunks of that length.
-    """
-    h = hashlib.sha256()
-    h.update(dataset_repo_id.encode("utf-8"))
-    h.update(b"\0")
-    h.update(base_tokenizer_name.encode("utf-8"))
-    h.update(b"\0")
-    h.update(str(n_samples).encode("utf-8"))
-    h.update(b"\0")
-    h.update(str(chunk_size).encode("utf-8"))
-    return h.hexdigest()[:16]
-
-
-def fit_fast_tokenizer(
-    *,
-    dataset_repo_id: str,
-    cache_dir: str | Path,
-    base_tokenizer_name: str = "physical-intelligence/fast",
-    n_samples: int = 1024,
-    chunk_size: int = 50,
-    seed: int = 42,
-) -> str:
-    """Fit a FAST tokenizer on a LeRobot dataset's action distribution.
-
-    Args:
-        dataset_repo_id: HF Hub repo id of the LeRobotDataset to fit on.
-        cache_dir: Directory under which to save (and look up) fitted
-            tokenizers. The actual save path is
-            ``{cache_dir}/{signature}``.
-        base_tokenizer_name: HF identifier for the base FAST tokenizer
-            to finetune from. ``physical-intelligence/fast`` is the
-            universal one.
-        n_samples: Number of action chunks to sample for the fit. The
-            FAST paper uses a few thousand; ``1024`` is a good default
-            for medium datasets.
-        chunk_size: Length of each action chunk (matches
-            ``policy.chunk_size``). The FAST tokenizer is fit on
-            sequences of this length.
-        seed: RNG seed for sample selection.
-
-    Returns:
-        The local path to the fitted tokenizer, which can be passed
-        directly to ``--policy.action_tokenizer_name`` for the training run.
-
-    Raises:
-        ImportError: If the ``transformers`` library doesn't expose
-            ``AutoProcessor`` or the FAST tokenizer doesn't have a
-            ``.fit()`` method (then you're on an older FAST snapshot —
-            update to the current published model).
-        FileNotFoundError: If the dataset can't be loaded.
-    """
-    cache_dir = Path(cache_dir)
-    sig = _dataset_signature(dataset_repo_id, base_tokenizer_name, n_samples, chunk_size)
-    out_dir = cache_dir / sig
-
-    if out_dir.exists() and (out_dir / _CACHE_SENTINEL).exists():
-        logger.info(
-            "FAST tokenizer cache hit: %s — re-using fitted tokenizer for "
-            "dataset=%s base=%s n_samples=%d",
-            out_dir, dataset_repo_id, base_tokenizer_name, n_samples,
-        )
-        return str(out_dir)
-
-    # DDP-safe fit: only the (local) main process actually fits + saves;
-    # other ranks poll the cache sentinel until the leader is done.
-    # Without this guard, all N ranks fit concurrently and race on
-    # ``save_pretrained`` + ``AutoProcessor.from_pretrained`` (the latter
-    # copies ``processing_action_tokenizer.py`` into ``HF_MODULES_CACHE``
-    # and compiles a ``.pyc``; concurrent writers occasionally produce
-    # a stale / partial ``.pyc`` and the subsequent ``from .. import
-    # UniversalActionProcessor`` raises ``AttributeError``).
-    is_leader = (
-        int(os.environ.get("RANK", "0")) == 0
-        and int(os.environ.get("LOCAL_RANK", "0")) == 0
-    )
-    if not is_leader:
-        timeout_s = 1800.0  # 30 min — covers ~1024-sample fits on cold caches
-        start = time.monotonic()
-        while not (out_dir / _CACHE_SENTINEL).exists():
-            if time.monotonic() - start > timeout_s:
-                raise RuntimeError(
-                    f"FAST tokenizer fit: non-leader rank timed out after "
-                    f"{timeout_s:.0f}s waiting for {out_dir / _CACHE_SENTINEL}. "
-                    "Leader rank likely crashed during the fit."
-                )
-            time.sleep(2.0)
-        logger.info("FAST tokenizer ready (leader populated cache): %s", out_dir)
-        return str(out_dir)
-
-    logger.info(
-        "FAST tokenizer cache miss — fitting on dataset=%s "
-        "base=%s n_samples=%d chunk_size=%d → %s",
-        dataset_repo_id, base_tokenizer_name, n_samples, chunk_size, out_dir,
-    )
-
-    from transformers import AutoProcessor  # noqa: PLC0415
-
-    from lerobot.datasets.lerobot_dataset import LeRobotDataset  # noqa: PLC0415
-
-    # Stream a single episode's worth of action chunks at a time so
-    # we don't blow memory on huge datasets. Random episode +
-    # random start offset gives a reasonable spread.
-    #
-    # Actions are read straight from the underlying HF dataset's
-    # ``action`` *column* — never via ``ds[i]``. ``ds[i]`` builds a full
-    # training item (delta-timestamp expansion + video decode + image
-    # transforms); a single bad video frame would then throw and, since
-    # the failure was swallowed at debug level, silently starve the fit
-    # of every chunk. The action column carries no video, so reading it
-    # directly is both faster and immune to decode errors.
-    rng = np.random.default_rng(seed)
-    actions_buf: list[np.ndarray] = []
-
-    # Load just the metadata first to know episode boundaries.
-    ds_meta_only = LeRobotDataset(dataset_repo_id, episodes=[0])
-    num_episodes = ds_meta_only.meta.total_episodes
-    if "action" not in ds_meta_only.features:
-        available = ", ".join(sorted(ds_meta_only.features)) or "<none>"
-        raise RuntimeError(
-            f"FAST fit: dataset {dataset_repo_id!r} has no ``action`` feature. "
-            f"Available features: {available}."
-        )
-    del ds_meta_only
-
-    samples_per_episode = max(1, n_samples // max(num_episodes, 1))
-    collected = 0
-    eps_visited = 0
-    short_episodes = 0
-    for ep_idx in rng.permutation(num_episodes):
-        if collected >= n_samples:
-            break
-        ep_idx = int(ep_idx)
-        try:
-            ds = LeRobotDataset(dataset_repo_id, episodes=[ep_idx])
-            ep_actions = np.asarray(ds.hf_dataset["action"], dtype=np.float32)
-        except Exception as exc:  # noqa: BLE001
-            logger.warning("FAST fit: skipping episode %d: %s", ep_idx, exc)
-            continue
-        if ep_actions.ndim != 2 or ep_actions.shape[0] < chunk_size:
-            short_episodes += 1
-            continue
-        # Sample ``samples_per_episode`` contiguous chunks uniformly.
-        starts = rng.integers(0, ep_actions.shape[0] - chunk_size + 1, size=samples_per_episode)
-        for s in starts:
-            actions_buf.append(ep_actions[int(s) : int(s) + chunk_size])
-            collected += 1
-            if collected >= n_samples:
-                break
-        eps_visited += 1
-
-    if not actions_buf:
-        raise RuntimeError(
-            f"FAST fit collected zero action chunks from {dataset_repo_id!r}: "
-            f"all {num_episodes} episodes were shorter than chunk_size="
-            f"{chunk_size} ({short_episodes} too short) or had an unreadable "
-            "``action`` column. Lower ``chunk_size`` to match your episode "
-            "lengths."
-        )
-
-    actions = np.stack(actions_buf, axis=0).astype(np.float32)  # (N, H, D)
-    logger.info(
-        "FAST fit: collected %d chunks of shape %s from %d episodes",
-        actions.shape[0], actions.shape[1:], eps_visited,
-    )
-
-    # Quantile-normalise per dimension before fitting.
-    #
-    # The FAST tokenizer DCT-transforms actions, scales by ``scale`` and
-    # rounds to integer tokens; the integer *range* must fit the
-    # codebook (vocab_size, default 1024). Raw motor units (e.g. encoder
-    # ticks) blow that range up — hence "Vocab size 1024 is too small".
-    # More importantly, at training time ``ActionTokenizerProcessorStep``
-    # runs *after* the QUANTILES ``NormalizerProcessorStep``, so it
-    # encodes normalised actions. Fitting on raw actions would mismatch
-    # that space. We replicate QUANTILES normalisation here (per-dim
-    # [q01, q99] → [-1, 1], clipped) so the fit and the training-time
-    # encode see the same distribution.
-    flat = actions.reshape(-1, actions.shape[-1])
-    q01 = np.quantile(flat, 0.01, axis=0)
-    q99 = np.quantile(flat, 0.99, axis=0)
-    span = np.where((q99 - q01) > 1e-6, q99 - q01, 1.0)
-    actions = np.clip((actions - q01) / span * 2.0 - 1.0, -1.0, 1.0).astype(np.float32)
-
-    base = AutoProcessor.from_pretrained(base_tokenizer_name, trust_remote_code=True)
-    if not hasattr(base, "fit"):
-        raise ImportError(
-            f"Base FAST tokenizer {base_tokenizer_name!r} has no ``.fit()`` "
-            "method — your transformers / model snapshot is too old. Update "
-            "to the current ``physical-intelligence/fast`` revision."
-        )
-
-    fitted = base.fit(actions)
-    out_dir.mkdir(parents=True, exist_ok=True)
-    fitted.save_pretrained(str(out_dir))
-    logger.info("FAST fit: saved fitted tokenizer to %s", out_dir)
-    return str(out_dir)
--- a/src/lerobot/policies/pi052/inference/__init__.py
+++ b/src/lerobot/policies/pi052/inference/__init__.py
@@ -1,73 +0,0 @@
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""PI052 inference / runtime orchestration.
-
-Multi-rate runtime that mirrors the recipe-time training shape:
-
-  low_level_execution        → LowLevelForward + DispatchAction (high Hz)
-  high_level_subtask         → HighLevelSubtaskFwd (~1 Hz)
-  memory_update              → MemoryUpdateFwd (event: subtask_change)
-  user_interjection_response → UserInterjectionFwd (event: stdin)
-  ask_vqa_*                  → AskVQAFwd (event: stdin question)
-  speech tool calls          → DispatchToolCalls (event: tool_call_pending)
-
-The CLI ``lerobot-pi052-runtime`` builds a ``PI052Runtime`` and calls
-``run()``.
-"""
-
-from .repl import StdinReader
-from .runtime import PI052Runtime
-from .runtime_state import initial_runtime_state, push_log, set_if_changed, take_event
-from .steps import (
-    AskVQAFwd,
-    DispatchAction,
-    DispatchToolCalls,
-    HighLevelSubtaskFwd,
-    InferenceStep,
-    LowLevelForward,
-    MemoryUpdateFwd,
-    UserInterjectionFwd,
-)
-from .triggers import EventTrigger, HzTrigger, Tick, TickClock, Trigger
-from .ui import make_state_panel, print_robot_lines, print_user_line
-
-__all__ = [
-    # runtime
-    "PI052Runtime",
-    "StdinReader",
-    # state helpers
-    "initial_runtime_state",
-    "push_log",
-    "set_if_changed",
-    "take_event",
-    # triggers
-    "Trigger",
-    "Tick",
-    "TickClock",
-    "HzTrigger",
-    "EventTrigger",
-    # steps
-    "InferenceStep",
-    "LowLevelForward",
-    "DispatchAction",
-    "HighLevelSubtaskFwd",
-    "MemoryUpdateFwd",
-    "UserInterjectionFwd",
-    "AskVQAFwd",
-    "DispatchToolCalls",
-    # UI
-    "make_state_panel",
-    "print_robot_lines",
-    "print_user_line",
-]
--- a/src/lerobot/policies/pi052/inference/repl.py
+++ b/src/lerobot/policies/pi052/inference/repl.py
@@ -1,105 +0,0 @@
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Stdin REPL event collector for the PI052 runtime.
-
-Reads stdin lines without blocking and classifies each one heuristically, in this order of precedence:
-
-  "stop" / "quit" / "exit"               → state["stop"] = True
-  "/action" / "/pause" (+ aliases)       → set state["mode"]
-  first line while no task is active     → set runtime task ("task:" prefix stripped)
-  ends with "?"                          → user_vqa_query event
-  anything else                          → user_interjection event
-
-Plugged into the runtime via ``event_collector=StdinReader().poll``.
-
-Note: the shipped CLI (``lerobot-pi052-runtime``) drives stdin
-directly in its REPL / autonomous loops and does *not* wire this
-collector; it's kept as the documented embedding hook and for tests.
-"""
-
-from __future__ import annotations
-
-import select
-import sys
-from dataclasses import dataclass, field
-from typing import Any
-
-
-@dataclass
-class StdinReader:
-    """Non-blocking stdin line collector for the runtime loop."""
-
-    prompt: str = "> "
-    _seen_first_line: bool = field(default=False, init=False)
-    _prompted: bool = field(default=False, init=False)
-
-    def poll(self, state: dict[str, Any]) -> None:
-        """Drain pending stdin lines into runtime events."""
-        # Print the input prompt once, then re-print it only after a
-        # line has been consumed; matches the expected REPL feel.
-        if not self._prompted:
-            print(self.prompt, end="", flush=True)
-            self._prompted = True
-
-        # ``select`` with timeout=0 makes this non-blocking. Only works
-        # for actual TTY / pipe stdins; CI / scripted runs hit EOF.
-        try:
-            ready, _, _ = select.select([sys.stdin], [], [], 0)
-        except (ValueError, OSError):
-            return
-        if not ready:
-            return
-
-        line = sys.stdin.readline()
-        if not line:  # EOF
-            state["stop"] = True
-            return
-        line = line.strip()
-        self._prompted = False  # we'll re-prompt next tick
-        if not line:
-            return
-
-        lower = line.lower()
-        if lower in {"stop", "quit", "exit"}:
-            state["stop"] = True
-            return
-
-        # Slash commands flip the run mode. ``/pause`` stops the action
-        # loop (the action steps gate on ``state["mode"]``); ``/action``
-        # resumes it.
-        if lower.split(" ", 1)[0] in {"/action", "/act", "/run"}:
-            state["mode"] = "action"
-            return
-        if lower in {"/pause", "/p"}:
-            state["mode"] = "paused"
-            queue = state.get("action_queue")
-            if hasattr(queue, "clear"):
-                queue.clear()
-            return
-
-        # First non-control line sets the task if no task is active.
-        if not state.get("task"):
-            task = line[5:].strip() if lower.startswith("task:") else line
-            state["task"] = task
-            print(f"[pi052] Task: {task}", flush=True)
-            self._seen_first_line = True
-            return
-
-        # Question → VQA; statement → interjection.
-        if lower.endswith("?"):
-            state["recent_vqa_query"] = line
-            state.setdefault("events_this_tick", []).append("user_vqa_query")
-        else:
-            state["recent_interjection"] = line
-            state.setdefault("events_this_tick", []).append("user_interjection")
--- a/src/lerobot/policies/pi052/inference/runtime.py
+++ b/src/lerobot/policies/pi052/inference/runtime.py
@@ -1,205 +0,0 @@
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""PI052 runtime loop.
-
-Threads the multi-rate inference pipeline together with a stdin REPL
-event collector, drives ticks through :class:`TickClock`, and prints
-state-change updates to the user.
-"""
-
-from __future__ import annotations
-
-import logging
-from collections import deque
-from dataclasses import dataclass, field
-from typing import Any, Callable
-
-from .runtime_state import initial_runtime_state, push_log
-from .steps import (
-    AskVQAFwd,
-    DispatchAction,
-    DispatchToolCalls,
-    HighLevelSubtaskFwd,
-    InferenceStep,
-    LowLevelForward,
-    MemoryUpdateFwd,
-)
-from .triggers import EventTrigger, HzTrigger, TickClock
-
-logger = logging.getLogger(__name__)
-
-
-@dataclass
-class PI052Runtime:
-    """Compose the inference pipeline and drive it tick-by-tick."""
-
-    policy: Any
-    tools: dict[str, Any] = field(default_factory=dict)
-    """Name → tool-instance dict, e.g. ``{"say": SayTool(...)}``. Read
-    from :func:`lerobot.tools.get_tools(meta)` when wiring the
-    runtime."""
-    observation_provider: Callable[[], dict | None] | None = None
-    """Closure returning the current preprocessed observation batch.
-    ``None`` for dry-run / language-only sessions."""
-    robot_executor: Callable[[Any], None] | None = None
-    """Closure that takes one action chunk and forwards it to the
-    robot. ``None`` for dry-run."""
-    event_collector: Callable[[dict], None] | None = None
-    """Per-tick hook that polls external sources (stdin, network) and
-    appends event names to ``state["events_this_tick"]``."""
-    chunk_hz: float = 4.0
-    ctrl_hz: float = 50.0
-    high_level_hz: float = 1.0
-    max_rate_hz: float = 50.0
-
-    pipeline: list[InferenceStep] = field(init=False)
-    state: dict[str, Any] = field(init=False)
-    _stop: bool = field(default=False, init=False)
-
-    def __post_init__(self) -> None:
-        # Subtask + memory + VQA configuration. Pipeline:
-        #
-        #   HighLevelSubtaskFwd → generate the next subtask via the LM
-        #                         head at ~``high_level_hz``; writes
-        #                         ``current_subtask`` and emits
-        #                         ``subtask_change`` on a transition.
-        #   MemoryUpdateFwd     → on ``subtask_change``, refresh
-        #                         ``current_memory`` from the
-        #                         ``memory_update`` head.
-        #   AskVQAFwd           → answer camera-grounded stdin questions.
-        #   LowLevelForward     → action chunk conditioned on the
-        #                         generated ``current_subtask``.
-        #   DispatchAction      → drain the chunk to the robot.
-        #   DispatchToolCalls   → fire any pending tool calls.
-        #
-        # Order matters: ``HighLevelSubtaskFwd`` must run before
-        # ``MemoryUpdateFwd`` so the event is visible the same tick, and
-        # both must run before ``LowLevelForward`` (which is gated on
-        # "action queue empty") so the chunk consumes the freshest
-        # subtask. ``UserInterjectionFwd`` is still importable but
-        # disabled until plan generation is wired in.
-        self.pipeline = [
-            HighLevelSubtaskFwd(
-                trigger=HzTrigger(self.high_level_hz),
-                policy=self.policy,
-                observation_provider=self.observation_provider,
-            ),
-            # Listens for the ``subtask_change`` event raised by
-            # ``HighLevelSubtaskFwd`` and refreshes ``current_memory``.
-            MemoryUpdateFwd(
-                trigger=EventTrigger("subtask_change"),
-                policy=self.policy,
-                observation_provider=self.observation_provider,
-            ),
-            AskVQAFwd(
-                policy=self.policy,
-                observation_provider=self.observation_provider,
-            ),
-            LowLevelForward(
-                trigger=HzTrigger(self.chunk_hz),
-                policy=self.policy,
-                observation_provider=self.observation_provider,
-            ),
-            DispatchAction(
-                trigger=HzTrigger(self.ctrl_hz),
-                robot_executor=self.robot_executor,
-            ),
-            DispatchToolCalls(tools=self.tools),
-        ]
-        self.state = initial_runtime_state()
-
-    # ------------------------------------------------------------------
-    # Lifecycle
-    # ------------------------------------------------------------------
-
-    def set_task(self, task: str) -> None:
-        """Set or replace the active task. Logged for the REPL."""
-        self.state["task"] = task
-        push_log(self.state, f"Task: {task}")
-
-    def stop(self) -> None:
-        self._stop = True
-
-    def run(self, *, max_ticks: int | None = None) -> None:
-        """Main loop. Returns when ``stop()`` is called or after
-        ``max_ticks`` ticks (useful for tests / dry-run)."""
-        clock = TickClock(max_rate_hz=self.max_rate_hz)
-        while not self._stop:
-            tick = clock.advance()
-            self.state["_tick"] = tick
-            self.state["events_this_tick"] = []
-            self.state["log_lines"] = []
-
-            if self.event_collector is not None:
-                self.event_collector(self.state)
-            if self.state.get("stop"):
-                self._stop = True
-                break
-
-            for step in self.pipeline:
-                self.state = step(self.state)
-
-            self._flush_logs()
-            if max_ticks is not None and tick.index >= max_ticks:
-                break
-
-        self._on_shutdown()
-
-    # ------------------------------------------------------------------
-    # REPL helper: drive one full pipeline pass and return its logs
-    # ------------------------------------------------------------------
-
-    def step_once(self) -> list[str]:
-        """Run one tick of the pipeline and return the log lines.
-
-        Used by the interactive REPL: instead of a background thread,
-        the CLI drives ticks synchronously after each user input. Logs
-        are returned (not printed) so the caller can route them into
-        the rich-Live chat scrollback.
-        """
-        from .triggers import Tick  # noqa: PLC0415
-
-        # Synthesize a tick. We don't need the real wall-clock pacing
-        # here (the REPL drives the runtime, not vice versa), but
-        # ``HzTrigger`` uses ``tick.monotonic_seconds`` to gate, so we
-        # stamp the tick with the current monotonic time instead of
-        # leaving it unset.
-        import time as _time  # noqa: PLC0415
-
-        prev_index = self.state.get("_tick").index if isinstance(self.state.get("_tick"), Tick) else 0
-        self.state["_tick"] = Tick(index=prev_index + 1, monotonic_seconds=_time.monotonic())
-        self.state["log_lines"] = []
-        # ``events_this_tick`` is set up by the caller before
-        # ``step_once`` (the REPL pushes user-driven events first).
-        self.state.setdefault("events_this_tick", [])
-
-        for step in self.pipeline:
-            self.state = step(self.state)
-
-        return list(self.state.get("log_lines") or [])
-
-    # ------------------------------------------------------------------
-    # I/O
-    # ------------------------------------------------------------------
-
-    def _flush_logs(self) -> None:
-        for line in self.state.get("log_lines") or []:
-            print(f"[pi052] {line}", flush=True)
-
-    def _on_shutdown(self) -> None:
-        # Drain any queued action chunks safely.
-        queue = self.state.get("action_queue")
-        if isinstance(queue, deque):
-            queue.clear()
-        print("[pi052] runtime stopped", flush=True)
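The removed `run` loop above threads one state dict through an ordered list of callables, resetting the per-tick buffers before each pass. A minimal sketch of that pattern follows; `count_step`, `log_step`, and `run_pipeline` are hypothetical names for illustration only, and the tick is a plain integer here rather than the `Tick` object the deleted code used.

```python
from typing import Any, Callable

State = dict[str, Any]
Step = Callable[[State], State]


def count_step(state: State) -> State:
    # Hypothetical step: counts how many ticks it has seen.
    state["count"] = state.get("count", 0) + 1
    return state


def log_step(state: State) -> State:
    # Hypothetical step: writes one status line into the per-tick log buffer.
    state.setdefault("log_lines", []).append(
        f"tick {state['_tick']}: count={state['count']}"
    )
    return state


def run_pipeline(pipeline: list[Step], state: State, max_ticks: int) -> list[str]:
    """Run ``pipeline`` for ``max_ticks`` ticks and return every log line."""
    printed: list[str] = []
    for tick in range(1, max_ticks + 1):
        state["_tick"] = tick
        # Per-tick buffers are reset before the steps run, as in ``run``.
        state["events_this_tick"] = []
        state["log_lines"] = []
        for step in pipeline:
            state = step(state)
        # Stand-in for ``_flush_logs``: collect instead of printing.
        printed.extend(state["log_lines"])
    return printed


lines = run_pipeline([count_step, log_step], {}, max_ticks=3)
```

Because each step both receives and returns the state, ordering in the list is the only scheduling mechanism, which is why the deleted comment stresses that order matters.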
--- a/src/lerobot/policies/pi052/inference/runtime_state.py
+++ b/src/lerobot/policies/pi052/inference/runtime_state.py
@@ -1,95 +0,0 @@
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Runtime state passed between inference steps each tick.
-
-The runtime threads a single dict through the pipeline; this module
-documents the shape and provides factories. We use a plain ``dict``
-rather than a frozen dataclass because steps freely add and remove
-keys (``events_this_tick``, ``messages_pending``, ``tool_calls_pending``,
-…) and dataclass field churn would just get in the way.
-
-Stable keys (read by multiple steps):
-
-  task          str             the current top-level task
-  current_plan  str | None      latest plan emitted by the planner
-  current_subtask str | None    latest subtask the policy is executing
-  current_memory str | None     latest compressed memory
-  recent_interjection str | None  most recent user interjection text (consumed)
-
-  action_queue  collections.deque[Tensor]  pending per-step actions (one entry per chunk step)
-  tool_calls_pending list[dict]  parsed but not-yet-dispatched tool calls
-
-  events_this_tick list[str]    event names raised this tick (popped by ``take_event``)
-  _tick         Tick            current tick (set by the loop)
-
-  mode          str             "action" (run the robot) | "paused"
-                                 (action loop stopped — robot holds)
-
-  log_lines     list[str]       human-readable status lines printed each tick
-"""
-
-from __future__ import annotations
-
-from collections import deque
-from typing import Any
-
-
-def initial_runtime_state(task: str | None = None) -> dict[str, Any]:
-    """Build a fresh runtime state dict with sensible defaults."""
-    return {
-        "task": task,
-        "current_plan": None,
-        "current_subtask": None,
-        "current_memory": None,
-        "recent_interjection": None,
-        "action_queue": deque(),
-        "tool_calls_pending": [],
-        "events_this_tick": [],
-        "log_lines": [],
-        "mode": "action",
-        "stop": False,
-    }
-
-
-def take_event(state: dict[str, Any], event_name: str) -> bool:
-    """Pop ``event_name`` from ``events_this_tick`` if present.
-
-    Steps that consume an event call this so the same event doesn't
-    re-fire on a sibling step within the same tick.
-    """
-    events: list[str] = state.get("events_this_tick") or []
-    if event_name in events:
-        events.remove(event_name)
-        return True
-    return False
-
-
-def push_log(state: dict[str, Any], line: str) -> None:
-    """Append ``line`` to the per-tick log buffer; the runtime prints
-    it at the end of the tick."""
-    state.setdefault("log_lines", []).append(line)
-
-
-def set_if_changed(state: dict[str, Any], key: str, value: Any, label: str | None = None) -> bool:
-    """Update ``state[key]`` and log a diff line if the value changed.
-
-    Returns ``True`` if the value actually changed.
-    """
-    prev = state.get(key)
-    if prev == value:
-        return False
-    state[key] = value
-    if label is not None:
-        push_log(state, f"  {label}: {value}")
-    return True
--- a/src/lerobot/policies/pi052/inference/steps.py
+++ b/src/lerobot/policies/pi052/inference/steps.py
@@ -1,936 +0,0 @@
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Inference steps for the PI052 multi-rate runtime.
-
-Each step is a tiny class with a ``trigger`` and an ``__call__(state)``;
-the runtime applies them in order each tick. When a step's trigger
-doesn't fire, the step is a no-op and the runtime moves on.
-
-Stream-to-step mapping mirrors the ``subtasks_vqa.yaml`` recipe:
-
-* ``LowLevelForward``        — calls ``policy.predict_action_chunk``
-                                for the action chunk; trained by
-                                ``low_level_execution``
-* ``EnqueueChunk``           — pushes the chunk to ``action_queue``
-* ``DispatchAction``         — pops one action per control tick and
-                                forwards to the robot
-* ``HighLevelSubtaskFwd``    — calls ``policy.select_message`` for the
-                                next subtask; trained by
-                                ``high_level_subtask``
-* ``MemoryUpdateFwd``        — fires on subtask boundary; trained by
-                                ``memory_update``
-* ``UserInterjectionFwd``    — fires on stdin interjection; trained by
-                                ``user_interjection_response``
-* ``AskVQAFwd``              — fires on stdin question; trained by
-                                ``ask_vqa_*``
-* ``DispatchToolCalls``      — pops ``tool_calls_pending`` and calls
-                                the matching ``Tool`` instance
-"""
-
-from __future__ import annotations
-
-import logging
-import re
-from dataclasses import dataclass, field
-from typing import Any
-
-from .runtime_state import push_log, set_if_changed, take_event
-from .triggers import EventTrigger, HzTrigger, Trigger
-
-logger = logging.getLogger(__name__)
-
-
-# ---------------------------------------------------------------------------
-# Step base + runner
-# ---------------------------------------------------------------------------
-
-
-@dataclass
-class InferenceStep:
-    """A trigger-gated callable. Subclasses override :meth:`run`."""
-
-    trigger: Trigger
-
-    def __call__(self, state: dict[str, Any]) -> dict[str, Any]:
-        if not self.trigger.should_fire(state["_tick"], state):
-            return state
-        return self.run(state) or state
-
-    def run(self, state: dict[str, Any]) -> dict[str, Any] | None:  # pragma: no cover
-        raise NotImplementedError
-
-
-# ---------------------------------------------------------------------------
-# Low-level (action) path
-# ---------------------------------------------------------------------------
-
-
-@dataclass
-class LowLevelForward(InferenceStep):
-    """Run the policy's action head and produce one action chunk."""
-
-    policy: Any = None
-    observation_provider: Any = None
-    """Callable ``() -> dict``: returns the current observation batch
-    (already preprocessed). Typically wraps the robot's camera /
-    proprio reads. ``None`` in dry-run mode → step skips."""
-
-    trigger: Trigger = field(default_factory=lambda: HzTrigger(hz=4.0))
-
-    def run(self, state: dict[str, Any]) -> dict[str, Any] | None:
-        if self.policy is None or self.observation_provider is None:
-            return None
-        # ``/vlm`` mode pauses the whole action loop so the robot holds
-        # position while the operator probes the VLM with VQA.
-        if state.get("mode", "action") != "action":
-            return None
-        if not state.get("task"):
-            return None
-
-        # PI052 produces *action chunks* (typically 50 steps via
-        # flow-matching). Every step gets dispatched to the robot;
-        # popping one per dispatch tick is essentially free. Only
-        # generate a new chunk once the previous one has fully
-        # drained — this is the canonical "sense → think → act"
-        # loop. Refreshing while a chunk is still queued causes the
-        # new chunk to "telescope" past the old one (planned from an
-        # observation that's already 25+ steps stale by the time it
-        # starts dispatching).
-        queue = state.setdefault("action_queue", [])
-        if len(queue) > 0:
-            return None
-
-        observation = self.observation_provider()
-        if observation is None:
-            return None
-
-        # The action expert is conditioned on the SUBTASK generated by
-        # the high-level loop (``HighLevelSubtaskFwd`` runs earlier in
-        # the pipeline and writes ``current_subtask``). Matches the
-        # training-time ``low_level_execution`` recipe — ``user(${subtask})``.
-        # Falls back to the task string only on the very first frame,
-        # before the high-level loop has produced a subtask.
-        subtask = state.get("current_subtask") or state.get("task") or ""
-        ctx = [{"role": "user", "content": subtask}]
-        # ``add_generation_prompt=False`` to match the training-time
-        # prefix shape: at training the action expert sees the rendered
-        # user turn with no trailing ``Assistant: `` header. Passing
-        # True here would make ``_build_text_batch`` append header
-        # tokens the action expert never saw during training.
-        text_batch = _build_text_batch(self.policy, ctx, add_generation_prompt=False)
-        from lerobot.utils.constants import (  # noqa: PLC0415
-            OBS_LANGUAGE_ATTENTION_MASK,
-            OBS_LANGUAGE_TOKENS,
-        )
-
-        observation = dict(observation)
-        observation[OBS_LANGUAGE_TOKENS] = text_batch["lang_tokens"]
-        observation[OBS_LANGUAGE_ATTENTION_MASK] = text_batch["lang_masks"]
-
-        try:
-            # ``predict_action_chunk`` returns the *full* chunk shape
-            # ``(batch, n_action_steps, action_dim)``. Enqueue every
-            # step so DispatchAction at ctrl_hz can drain them
-            # smoothly until the next refresh.
-            chunk = self.policy.predict_action_chunk(observation)
-        except Exception as exc:  # noqa: BLE001
-            logger.warning(
-                "predict_action_chunk failed: %s",
-                exc,
-                exc_info=logger.isEnabledFor(logging.DEBUG),
-            )
-            push_log(
-                state,
-                f"  [warn] predict_action_chunk failed: "
-                f"{type(exc).__name__}: {exc}",
-            )
-            return None
-
-        # ``chunk`` shape: ``(batch, n_action_steps, action_dim)``. Push
-        # each step as a ``(1, action_dim)`` tensor so the existing
-        # action executor's batch-squeeze logic works unchanged.
-        if chunk.ndim == 3:
-            chunk_iter = chunk[0]  # ``(n_action_steps, action_dim)``
-        elif chunk.ndim == 2:
-            chunk_iter = chunk
-        else:
-            chunk_iter = chunk.unsqueeze(0)
-
-        for step in chunk_iter:
-            queue.append(step.unsqueeze(0))
-        state["last_chunk_size"] = int(chunk_iter.shape[0])
-        return None
-
-
-@dataclass
-class DispatchAction(InferenceStep):
-    """Pop one action per tick and hand it to the robot.
-
-    In dry-run mode (``robot_executor=None``) the step still pops the
-    queue so it doesn't grow unbounded — the popped tensor is logged
-    instead of executed.
-
-    Wall-clock catch-up: the action queue represents an open-loop
-    trajectory at a fixed step rate (``trigger.hz`` ≈ ``ctrl_hz``).
-    When the main loop stalls — e.g. an LLM call for the high-level
-    subtask blocks for ~2 s on MPS — the dispatch trigger fires only
-    once over that whole interval. Naively popping a single entry per
-    fire makes the robot lag further and further behind the planned
-    timeline, and a 50-step chunk would take ~125 s to drain instead
-    of ~1.7 s. Track real elapsed time between dispatches and pop
-    ``round(elapsed * hz)`` entries, sending the most recent one. The
-    skipped intermediate joint targets are stale anyway — the dynamixel
-    will smooth toward the latest goal position.
-    """
-
-    robot_executor: Any = None
-    trigger: Trigger = field(default_factory=lambda: HzTrigger(hz=50.0))
-    _last_dispatch_t: float | None = field(default=None, init=False)
-
-    def run(self, state: dict[str, Any]) -> dict[str, Any] | None:
-        import time as _time  # noqa: PLC0415
-
-        # ``/vlm`` mode pauses dispatch — the robot holds its last
-        # commanded position while the operator runs VQA.
-        if state.get("mode", "action") != "action":
-            self._last_dispatch_t = None
-            return None
-
-        queue = state.get("action_queue")
-        if not queue:
-            # Reset wall-clock anchor when the queue is empty so the
-            # next chunk doesn't see a huge fake "elapsed" window.
-            self._last_dispatch_t = None
-            return None
-
-        now = _time.monotonic()
-        hz = getattr(self.trigger, "hz", 30.0)
-        if self._last_dispatch_t is None or hz <= 0:
-            n_to_pop = 1
-        else:
-            elapsed = now - self._last_dispatch_t
-            # ``max(1, ...)`` so we always pop at least one when the
-            # trigger fires; ``min(len(queue), ...)`` so we don't run
-            # off the end of the chunk.
-            n_to_pop = max(1, min(len(queue), int(round(elapsed * hz))))
-        self._last_dispatch_t = now
-
-        # Drain ``n_to_pop`` stale entries, keep only the latest as the
-        # action actually sent. The intermediate joint targets would
-        # all be ~10–30 ms apart in chunk time — the robot can't track
-        # them individually anyway when the host loop is slow.
-        latest = None
-        for _ in range(n_to_pop):
-            if not queue:
-                break
-            latest = queue.popleft() if hasattr(queue, "popleft") else queue.pop(0)
-            state["actions_dispatched"] = state.get("actions_dispatched", 0) + 1
-
-        if latest is not None and self.robot_executor is not None:
-            self.robot_executor(latest)
-        return None
-
-
-# ---------------------------------------------------------------------------
-# High-level (text) paths — all use policy.select_message
-# ---------------------------------------------------------------------------
-
-
-def _build_text_batch(
-    policy: Any,
-    prompt_messages: list[dict[str, Any]],
-    *,
-    add_generation_prompt: bool = True,
-) -> dict[str, Any]:
-    """Tokenize chat messages into the batch ``select_message`` expects.
-
-    PI052's backbone (PaliGemma) ships no chat template, so we train on
-    a plain role-prefixed concatenation built by
-    ``PI052TextTokenizerStep``. We reuse that exact formatter so the
-    inference prefix matches training; ``add_generation_prompt`` appends
-    the bare ``Assistant: `` header the LM head continues from.
-    """
-    import torch  # noqa: PLC0415
-    from transformers import AutoTokenizer  # noqa: PLC0415
-
-    from lerobot.policies.pi052.text_processor_pi052 import (  # noqa: PLC0415
-        _flatten_say_tool_calls,
-        _format_messages,
-        _strip_blocks,
-        register_paligemma_loc_tokens,
-    )
-
-    tok_name = (
-        getattr(policy.config, "tokenizer_name", None) or "google/paligemma-3b-pt-224"
-    )
-    # Register PaliGemma's <locDDDD> tokens so inference encoding /
-    # decoding sees them as single vocab ids — must match training.
-    tokenizer = register_paligemma_loc_tokens(AutoTokenizer.from_pretrained(tok_name))
-
-    messages = [_strip_blocks(_flatten_say_tool_calls(m)) for m in prompt_messages]
-    prompt, _spans = _format_messages(messages)
-    if add_generation_prompt:
-        prompt = prompt + "Assistant: "
-
-    encoded = tokenizer(prompt, return_tensors="pt")
-    ids = encoded["input_ids"]
-    attn = encoded.get("attention_mask")
-    if attn is None and tokenizer.pad_token_id is not None:
-        attn = ids != tokenizer.pad_token_id
-    if attn is not None and hasattr(attn, "dtype") and attn.dtype != torch.bool:
-        attn = attn.bool()
-
-    # Move tokens onto the policy's device — otherwise prefix embedding
-    # raises a device-mismatch on every forward (CPU tensor vs MPS / CUDA
-    # model), which the caller's broad except would swallow silently.
-    device = getattr(getattr(policy, "config", None), "device", None)
-    if device is not None:
-        try:
-            ids = ids.to(device)
-            if attn is not None and hasattr(attn, "to"):
-                attn = attn.to(device)
-        except Exception as exc:  # noqa: BLE001
-            logger.debug("could not move pi052 lang tokens to %s: %s", device, exc)
-    return {"lang_tokens": ids, "lang_masks": attn, "tokenizer": tokenizer}
-
-
-def _strip_recipe_keys(m: dict[str, Any]) -> dict[str, Any]:
-    new = dict(m)
-    new.pop("stream", None)
-    new.pop("target", None)
-    return new
-
-
-@dataclass
-class HighLevelSubtaskFwd(InferenceStep):
-    """At ~1 Hz, ask the policy for the next subtask.
-
-    Mirrors the ``high_level_subtask`` recipe layout exactly:
-
-        user:   "${task}\\nPlan: ${plan}\\nMemory: ${memory}"
-        user:   "Current subtask: ${subtask}"        (if subtask present)
-        ↓ generate ↓
-        assistant: <next subtask>
-    """
-
-    policy: Any = None
-    observation_provider: Any = None
-    """Same shape as ``LowLevelForward.observation_provider``. When
-    set, the resulting observation is merged into ``select_message``'s
-    batch so text generation runs against real video + state."""
-
-    trigger: Trigger = field(default_factory=lambda: HzTrigger(hz=1.0))
-
-    def run(self, state: dict[str, Any]) -> dict[str, Any] | None:
-        if self.policy is None or not state.get("task"):
-            return None
-        # ``/vlm`` mode pauses subtask generation along with the rest of
-        # the action loop.
-        if state.get("mode", "action") != "action":
-            return None
-        # Gate to chunk boundaries: only generate a fresh subtask when
-        # the action queue is empty (i.e. right before LowLevelForward
-        # refreshes the chunk). ``select_message`` takes ~2 s on MPS,
-        # and running it every loop iteration starves DispatchAction
-        # at ctrl_hz=30 — the queue drains at ~0.4 actions/sec instead
-        # of 30/sec and the robot barely moves. Tying it to the same
-        # "queue empty" condition as the chunk refresh produces a
-        # clean sense → think → act cycle.
-        #
-        # Rearm the trigger when skipping so a low-hz schedule
-        # (e.g. ``--high_level_hz=0.2`` = once per 5 s) doesn't lose
-        # the slot: the trigger fires once on the timer but the brief
-        # queue-empty window almost never coincides, so without rearm
-        # HL would effectively never run.
-        queue = state.get("action_queue") or []
-        if len(queue) > 0:
-            if hasattr(self.trigger, "rearm"):
-                self.trigger.rearm()
-            return None
-        # Per-chunk-boundary throttle: a countdown is checked at each
-        # "queue empty" moment, and subtask gen fires only when it has
-        # reached 0, i.e. once every ``subtask_chunks_per_gen`` chunk
-        # boundaries. Lets the operator run e.g. 5 action chunks per
-        # subtask-gen so the LM head doesn't churn every 1.7 s (a fresh
-        # subtask while the previous one is still being executed is
-        # wasted compute *and* causes the action expert's flow
-        # trajectory to be re-planned mid-grasp).
-        chunks_per_gen = max(1, int(state.get("subtask_chunks_per_gen", 1) or 1))
-        # Initialise the countdown to 0 so the first chunk boundary
-        # fires immediately. Each skipped boundary decrements it; a
-        # generation resets it to ``chunks_per_gen - 1``.
-        if "_hl_chunks_until_gen" not in state:
-            state["_hl_chunks_until_gen"] = 0
-        if state["_hl_chunks_until_gen"] > 0:
-            state["_hl_chunks_until_gen"] -= 1
-            if hasattr(self.trigger, "rearm"):
-                self.trigger.rearm()
-            return None
-        state["_hl_chunks_until_gen"] = chunks_per_gen - 1
-        ctx = _msgs_for_subtask(state)
-        observation = _maybe_observation(self.observation_provider)
-        # Default: greedy argmax, no min_new_tokens, no special-token
-        # suppression — matches training. Operator can override via
-        # ``--text_min_new_tokens=N --text_temperature=T --text_top_p=P``
-        # on the CLI; useful for under-trained checkpoints whose LM
-        # head still favours EOS at position 0 (pre-trained chat
-        # backbone's short-turn prior hasn't been fully overridden
-        # by the fine-tuning supervision yet).
-        msg = _generate_with_policy(
-            self.policy,
-            ctx,
-            observation=observation,
-            state=state,
-            label="subtask gen",
-            min_new_tokens=int(state.get("text_gen_min_new_tokens") or 0),
-            temperature=float(state.get("text_gen_temperature") or 0.0),
-            top_p=float(state.get("text_gen_top_p") or 1.0),
-            # Subtasks never legitimately contain PaliGemma ``<loc>``
-            # tokens — suppress them so a checkpoint whose LM head
-            # has drifted toward the pretrained loc-prior falls back
-            # to its (still-correct) text mass.
-            suppress_loc_tokens=True,
-        )
-        # Diagnostics: surface what the model is *actually* producing
-        # at chunk boundaries, even when the output gets rejected or
-        # repeats. Memorisation collapse looks like "same accepted
-        # subtask N times in a row" or "gibberish_count rising while
-        # current_subtask is stuck". The state panel renders these.
-        state["last_subtask_raw"] = msg or ""
-        # Persistent empty completion is its own failure mode (model
-        # immediately EOS-es from the chat-template generation
-        # prompt) — surface it once every N occurrences so the
-        # operator can distinguish "generation failing silently"
-        # from "generating fine but filter rejecting".
-        if not msg:
-            empties = state.get("subtask_empty_count", 0) + 1
-            state["subtask_empty_count"] = empties
-            if empties == 1 or empties % 5 == 0:
-                debug = getattr(self.policy, "_last_select_message_debug", "") or ""
-                if debug:
-                    push_log(
-                        state,
-                        f"  [info] subtask gen empty (×{empties}); {debug}",
-                    )
-                else:
-                    push_log(
-                        state,
-                        f"  [info] subtask gen returned empty (×{empties}) — "
-                        "no tokens generated (head EOS-ing before any "
-                        "non-special token).",
-                    )
-        if msg and _looks_like_gibberish(msg):
-            # Bump a counter so the operator can see the model is
-            # struggling without spamming the log every tick. A first
-            # rejection still logs once so the failure is visible.
-            count = state.get("subtask_gibberish_count", 0) + 1
-            state["subtask_gibberish_count"] = count
-            if count == 1 or count % 30 == 0:
-                push_log(
-                    state,
-                    f"  [info] subtask gen rejected (gibberish ×{count}): {msg[:60]!r}",
-                )
-            return None
-        if msg:
-            prev_subtask = state.get("current_subtask")
-            changed = set_if_changed(state, "current_subtask", msg, label="subtask")
-            if changed:
-                # Stash the just-completed subtask so ``MemoryUpdateFwd``
-                # can drop it into its prompt as ``Completed subtask:``
-                # — the recipe binds ``completed_subtask`` to
-                # ``nth_prev(style=subtask, offset=1)``, i.e. the subtask
-                # that was active *before* the change.
-                if prev_subtask:
-                    state["prior_subtask"] = prev_subtask
-                # Subtask change is a downstream trigger.
-                state.setdefault("events_this_tick", []).append("subtask_change")
-                state["subtask_repeat_count"] = 0
-            else:
-                # Same accepted string regenerated — memorisation tell.
-                # Once this counter climbs past a few, you're seeing
-                # the model unable to move past the current subtask
-                # despite the chunk having drained (visual scene may
-                # have changed but the LM is replaying training
-                # tokens).
-                state["subtask_repeat_count"] = (
-                    state.get("subtask_repeat_count", 0) + 1
-                )
-        # Empty completions fall through without touching
-        # ``current_subtask``. They are common when the model warms up
-        # or generates only EOS, and the rate-limited ``[info]`` line
-        # above already reports them.
-        return None
-
-
-@dataclass
-class MemoryUpdateFwd(InferenceStep):
-    """On subtask boundary, refresh the compressed memory.
-
-    Mirrors the ``memory_update`` recipe layout exactly:
-
-        user:      "${task}"
-        assistant: "Previous memory: ${prior_memory}"   (if prior memory)
-        user:      "Completed subtask: ${completed_subtask}"  (if subtask)
-        ↓ generate ↓
-        assistant: <new memory>
-    """
-
-    policy: Any = None
-    observation_provider: Any = None
-    trigger: Trigger = field(default_factory=lambda: EventTrigger("subtask_change"))
-
-    def run(self, state: dict[str, Any]) -> dict[str, Any] | None:
-        # Don't consume the event — multiple steps may want to react.
-        if self.policy is None:
-            return None
-        ctx = _msgs_for_memory(state)
-        observation = _maybe_observation(self.observation_provider)
-        new_memory = _generate_with_policy(
-            self.policy,
-            ctx,
-            observation=observation,
-            state=state,
-            label="memory gen",
-            suppress_loc_tokens=True,
-        )
-        state["last_memory_raw"] = new_memory or ""
-        if new_memory and _looks_like_gibberish(new_memory):
-            count = state.get("memory_gibberish_count", 0) + 1
-            state["memory_gibberish_count"] = count
-            push_log(
-                state,
-                f"  [info] memory gen rejected (gibberish ×{count}): {new_memory[:60]!r}",
-            )
-            return None
-        if new_memory:
-            set_if_changed(state, "current_memory", new_memory, label="memory")
-        return None
-
-
-@dataclass
-class UserInterjectionFwd(InferenceStep):
-    """On stdin interjection, refresh the plan + emit a paired ``say``.
-
-    Mirrors the ``user_interjection_response`` recipe layout exactly:
-
-        user:      "${task}"
-        assistant: "Previous plan:\\n${prior_plan}"   (if prior plan)
-        user:      "${interjection}"                  (the new utterance)
-        ↓ generate ↓
-        assistant: <plan + <say>...</say>>
-    """
-
-    policy: Any = None
-    observation_provider: Any = None
-    trigger: Trigger = field(default_factory=lambda: EventTrigger("user_interjection"))
-
-    def run(self, state: dict[str, Any]) -> dict[str, Any] | None:
-        if self.policy is None or not take_event(state, "user_interjection"):
-            return None
-        ctx = _msgs_for_interjection(state)
-        observation = _maybe_observation(self.observation_provider)
-        out = _generate_with_policy(
-            self.policy,
-            ctx,
-            observation=observation,
-            state=state,
-            label="plan/say gen",
-            suppress_loc_tokens=True,
-        )
-        if not out:
-            # Don't log every empty completion — happens repeatedly on
-            # MPS during warm-up and floods the panel. The user can
-            # re-trigger by typing again.
-            return None
-        if _looks_like_gibberish(out):
-            count = state.get("plan_gibberish_count", 0) + 1
-            state["plan_gibberish_count"] = count
-            push_log(
-                state,
-                f"  [info] plan/say gen rejected (gibberish ×{count}): {out[:60]!r}",
-            )
-            return None
-        # Heuristic split: model is trained to emit one assistant turn
-        # carrying both plan text AND a `say` tool call. Look for a
-        # "<say>...</say>" marker (the only form ``_SAY_RE`` matches);
-        # fall back to whole text → plan, no speech.
-        plan_text, speech_text = _split_plan_and_say(out)
-        if plan_text and _looks_like_gibberish(plan_text):
-            plan_text = ""
-        if plan_text:
-            set_if_changed(state, "current_plan", plan_text, label="plan")
-        if speech_text:
-            push_log(state, f"  speech: {speech_text}")
-            state.setdefault("tool_calls_pending", []).append(
-                {
-                    "type": "function",
-                    "function": {"name": "say", "arguments": {"text": speech_text}},
-                }
-            )
-            state.setdefault("events_this_tick", []).append("tool_call_pending")
-        # Mark interjection consumed.
-        state["recent_interjection"] = None
-        return None
-
-
-@dataclass
-class AskVQAFwd(InferenceStep):
-    """On stdin question, answer a frame-grounded VQA.
-
-    Mirrors the ``ask_vqa_*`` recipe layout exactly: a single user
-    turn carrying just the VQA question, plus the camera image block
-    in training (we drop the image at inference because the dataset's
-    image preprocessing doesn't match SmolVLM's vision tower input).
-
-        user:   <question>
-        ↓ generate ↓
-        assistant: <vqa answer>
-    """
-
-    policy: Any = None
-    observation_provider: Any = None
-    trigger: Trigger = field(default_factory=lambda: EventTrigger("user_vqa_query"))
-
-    def run(self, state: dict[str, Any]) -> dict[str, Any] | None:
-        if self.policy is None or not take_event(state, "user_vqa_query"):
-            return None
-        question = state.get("recent_vqa_query")
-        if not question:
-            return None
-        ctx = _msgs_for_vqa(question)
-        observation = _maybe_observation(self.observation_provider)
-        answer = _generate_with_policy(
-            self.policy,
-            ctx,
-            observation=observation,
-            state=state,
-            label="vqa gen",
-        )
-        # VQA answers are intentionally JSON-like during training, so
-        # ``_looks_like_gibberish`` would false-positive on them. Keep
-        # the answer as-is — the VQA panel line lets the user judge.
-        if answer:
-            push_log(state, f"  vqa: {answer}")
-        state["recent_vqa_query"] = None
-        return None
-
-
-# ---------------------------------------------------------------------------
-# Tool dispatch
-# ---------------------------------------------------------------------------
-
-
-@dataclass
-class DispatchToolCalls(InferenceStep):
-    """Pop ``tool_calls_pending`` and execute them via :data:`TOOL_REGISTRY`."""
-
-    tools: dict[str, Any] = field(default_factory=dict)
-    trigger: Trigger = field(default_factory=lambda: EventTrigger("tool_call_pending"))
-
-    def run(self, state: dict[str, Any]) -> dict[str, Any] | None:
-        take_event(state, "tool_call_pending")
-        pending = state.get("tool_calls_pending") or []
-        for call in pending:
-            try:
-                fn = (call or {}).get("function") or {}
-                name = fn.get("name")
-                args = fn.get("arguments") or {}
-                tool = self.tools.get(name)
-                if tool is None:
-                    push_log(state, f"  [warn] tool {name!r} not registered — skipping call")
-                    continue
-                tool.call(args)
-            except Exception as exc:  # noqa: BLE001
-                push_log(state, f"  [error] tool dispatch failed: {exc}")
-        state["tool_calls_pending"] = []
-        return None
-
-
-# ---------------------------------------------------------------------------
-# Helpers
-# ---------------------------------------------------------------------------
-
-
-def _looks_like_gibberish(text: str) -> bool:
-    """Heuristically detect generation that's clearly off the rails.
-
-    Memorised models can collapse to dominant-mode outputs when the
-    prompt drifts even slightly from training distribution. Reject:
-
-    * empty / whitespace-only
-    * too few alphabetic characters (mostly punctuation)
-    * at most two distinct characters in a string longer than four
-    * starts with ``":`` and has more ``"`` characters than spaces
-    * too few unique tokens — e.g. ``"the"``, ``"the the the"``,
-      ``"Ass\\n::\\nthe"`` (the collapse seen on real-robot frames
-      where the model emits one or two memorised tokens repeatedly)
-    * chat-template fragment leakage (``Assistant:``, ``User:``,
-      ``Ass\\n``)
-
-    Real subtasks look like ``"close the gripper to grasp the blue
-    cube"`` — multiple unique alphabetic tokens, no role-marker
-    fragments. Anything materially shorter than that is rejected.
-    """
-    if not text or not text.strip():
-        return True
-    stripped = text.strip()
-    alpha = sum(1 for c in stripped if c.isalpha())
-    if alpha < max(3, len(stripped) // 8):
-        return True
-    if stripped.startswith('":') and stripped.count('"') > stripped.count(" "):
-        return True
-    # At most two distinct characters, e.g. ``aaaaaa`` or ``ababab``.
-    if len(set(stripped)) <= 2 and len(stripped) > 4:
-        return True
-    # Chat-template fragment leakage — the model emits ``Ass``,
-    # ``Assistant:``, ``User:``, often with extra newlines/colons.
-    # Reject if the cleaned text is mostly role-marker shards.
-    cleaned = stripped.replace("\n", " ").replace(":", " ")
-    for marker in ("Assistant", "User", "Ass "):
-        if marker in cleaned and len(cleaned.split()) < 4:
-            return True
-    tokens = [t for t in cleaned.split() if any(c.isalpha() for c in t)]
-    unique_alpha = {t.lower() for t in tokens}
-    # Short degenerate output — model stuck on ``the`` or a couple of
-    # memorised single-token continuations.
-    if len(unique_alpha) < 3 and len(stripped) < 80:
-        return True
-    # Long repetition collapse — the LM head loops an n-gram for the
-    # whole generation budget ("the arm the arm … the the the the").
-    # Length-independent: many tokens but a tiny unique ratio. The
-    # earlier ``< 80`` check missed these because the looped string
-    # blows well past 80 chars.
-    if len(tokens) >= 8 and len(unique_alpha) <= max(3, len(tokens) // 10):
-        return True
-    return False
-
-
-def _control_context_messages(
-    state: dict[str, Any],
-    *,
-    include_completed: bool = False,
-    extra_user: str | None = None,
-) -> list[dict[str, Any]]:
-    """Build a chat-template-ready prompt from current runtime state.
-
-    Mirrors what ``subtasks_vqa.yaml`` renders as ``${task}\\nPlan:
-    ${plan}\\nMemory: ${memory}`` for the high-level branches.
-    """
-    # Always emit ``Plan: `` / ``Memory: `` labels — even with empty
-    # values — to mirror the training-time recipe substitution.
-    task = state.get("task") or ""
-    plan = state.get("current_plan") or ""
-    memory = state.get("current_memory") or ""
-    parts = [task, f"Plan: {plan}", f"Memory: {memory}"]
-    if include_completed and state.get("current_subtask"):
-        parts.append(f"Completed subtask: {state['current_subtask']}")
-    head = "\n".join(parts)
-    msgs: list[dict[str, Any]] = [{"role": "user", "content": head}]
-    if extra_user:
-        msgs.append({"role": "user", "content": extra_user})
-    return msgs
-
-
-# ---------------------------------------------------------------------------
-# Per-recipe prompt builders. Each one mirrors a single sub-recipe's
-# message layout in ``subtasks_vqa.yaml`` so the chat-templated
-# prompt at inference matches what the model saw during training.
-# Generic ``_control_context_messages`` is kept around as a fallback
-# for ad-hoc callers but the four high-level steps now use these.
-# ---------------------------------------------------------------------------
-
-
-def _hirobot_user_head(state: dict[str, Any]) -> str:
-    """Build the ``task\\nPlan: …\\nMemory: …`` user content string.
-
-    Mirrors what the recipe renders at training time, where
-    ``language_render._substitute`` substitutes empty strings for
-    missing ``${plan}`` / ``${memory}`` bindings — i.e. the
-    ``Plan: `` / ``Memory: `` prefix labels are *always* in the
-    user turn, even when their values aren't set yet. Skipping them
-    here (the previous behaviour) produced a different prompt shape
-    on early frames before plan / memory are populated and on
-    samples where the dataset has no plan / memory annotation.
-    """
-    task = state.get("task") or ""
-    plan = state.get("current_plan") or ""
-    memory = state.get("current_memory") or ""
-    return f"{task}\nPlan: {plan}\nMemory: {memory}"
-
-
-def _msgs_for_subtask(state: dict[str, Any]) -> list[dict[str, Any]]:
-    """``high_level_subtask`` recipe layout — predict the subtask from the
-    task. The current recipe's user turn is just ``${task}`` (plan and
-    memory are not trained), so the inference prompt is the bare task —
-    no ``Plan: `` / ``Memory: `` lines.
-    """
-    return [{"role": "user", "content": state.get("task") or ""}]
-
-
-def _msgs_for_memory(state: dict[str, Any]) -> list[dict[str, Any]]:
-    """Memory-update prompt — mirrors ``memory_update`` recipe layout.
-
-    Recipe layout (``subtask_mem.yaml``):
-
-        user:      "${task}"
-        assistant: "Previous memory: ${prior_memory}"     (if_present prior)
-        user:      "Completed subtask: ${completed}"       (if_present completed)
-        assistant: → predicts new memory
-
-    Fired by ``MemoryUpdateFwd`` on a ``subtask_change`` event:
-    ``state['current_memory']`` is the memory the policy last emitted
-    (= the ``prior_memory`` binding at training), and
-    ``state['prior_subtask']`` is the subtask that just got replaced
-    (= the ``completed_subtask`` binding at training).
-    """
-    msgs: list[dict[str, Any]] = [
-        {"role": "user", "content": state.get("task") or ""},
-    ]
-    prior_memory = state.get("current_memory")
-    if prior_memory:
-        msgs.append(
-            {"role": "assistant", "content": f"Previous memory: {prior_memory}"}
-        )
-    completed_subtask = state.get("prior_subtask")
-    if completed_subtask:
-        msgs.append(
-            {"role": "user", "content": f"Completed subtask: {completed_subtask}"}
-        )
-    return msgs
-
-
-def _msgs_for_interjection(state: dict[str, Any]) -> list[dict[str, Any]]:
-    """``user_interjection_response`` recipe layout."""
-    msgs: list[dict[str, Any]] = [
-        {"role": "user", "content": state.get("task") or ""}
-    ]
-    if state.get("current_plan"):
-        msgs.append(
-            {"role": "assistant", "content": f"Previous plan:\n{state['current_plan']}"}
-        )
-    interjection = state.get("recent_interjection")
-    if interjection:
-        msgs.append({"role": "user", "content": interjection})
-    return msgs
-
-
-def _msgs_for_plan(state: dict[str, Any]) -> list[dict[str, Any]]:
-    """``plan_generation`` recipe layout — bare task → plan.
-
-    The assistant turn is the generation target, so we only render
-    the user turn at inference; the runtime appends the predicted
-    plan after sampling.
-    """
-    return [{"role": "user", "content": state.get("task") or ""}]
-
-
-def _msgs_for_vqa(question: str) -> list[dict[str, Any]]:
-    """``ask_vqa_*`` recipe layout (text-only at inference)."""
-    return [{"role": "user", "content": question}]
-
-
-def _maybe_observation(provider: Any) -> dict | None:
-    """Pull one observation from ``provider`` if it's set, else ``None``.
-
-    Errors from the provider are logged at debug level and swallowed —
-    text generation still runs (in text-only mode) so a flaky frame
-    source doesn't kill the REPL.
-    """
-    if provider is None:
-        return None
-    try:
-        return provider()
-    except Exception as exc:  # noqa: BLE001
-        logger.debug("observation_provider raised %s — falling back to text-only", exc)
-        return None
-
-
-def _generate_with_policy(
-    policy: Any,
-    messages: list[dict[str, Any]],
-    *,
-    observation: dict | None = None,
-    state: dict[str, Any] | None = None,
-    label: str = "select_message",
-    min_new_tokens: int = 0,
-    temperature: float = 0.0,
-    top_p: float = 1.0,
-    suppress_loc_tokens: bool = False,
-) -> str:
-    """Drive ``policy.select_message`` with a chat batch (and optional obs).
-
-    When ``observation`` carries ``observation.images.*`` and
-    ``observation.state``, those are merged into the batch so
-    ``select_message`` runs the same VLM prefix the policy was trained
-    on. Without an observation the runtime falls back to a text-only
-    prompt — the text head still runs, but generations may drift from
-    the training distribution.
-
-    Failures are surfaced both to the module logger (``warning``) and,
-    when ``state`` is given, to the runtime's user-visible log via
-    :func:`push_log`, so the REPL no longer "looks dead" when
-    something goes wrong inside generation.
-    """
-    if not hasattr(policy, "select_message"):
-        if state is not None:
-            push_log(state, f"  [warn] policy has no select_message — skipping {label}")
-        return ""
-    text_batch = _build_text_batch(policy, messages)
-    try:
-        from lerobot.utils.constants import (  # noqa: PLC0415
-            OBS_LANGUAGE_ATTENTION_MASK,
-            OBS_LANGUAGE_TOKENS,
-        )
-
-        batch: dict[str, Any] = {
-            OBS_LANGUAGE_TOKENS: text_batch["lang_tokens"],
-            OBS_LANGUAGE_ATTENTION_MASK: text_batch["lang_masks"],
-        }
-        if observation:
-            for k, v in observation.items():
-                if isinstance(k, str) and k.startswith("observation.") and k not in batch:
-                    batch[k] = v
-        kwargs: dict[str, Any] = {
-            "tokenizer": text_batch["tokenizer"],
-            "min_new_tokens": min_new_tokens,
-            "temperature": temperature,
-            "top_p": top_p,
-        }
-        kwargs["suppress_loc_tokens"] = suppress_loc_tokens
-        return policy.select_message(batch, **kwargs)
-    except Exception as exc:  # noqa: BLE001
-        logger.warning("%s failed: %s", label, exc, exc_info=logger.isEnabledFor(logging.DEBUG))
-        if state is not None:
-            push_log(state, f"  [warn] {label} failed: {type(exc).__name__}: {exc}")
-        return ""
-
-
-_SAY_RE = re.compile(r"<\s*say\s*>(.*?)<\s*/\s*say\s*>", re.IGNORECASE | re.DOTALL)
-
-
-def _split_plan_and_say(text: str) -> tuple[str, str]:
-    """Pull a ``<say>...</say>`` snippet out of ``text``; remainder is plan.
-
-    The training-time tool-call serializer wraps ``say(text="…")`` in a
-    deterministic textual marker so prefix-LM-style training learns to
-    emit it. The runtime parses it back here. If no marker is present,
-    the entire text is treated as plan with no speech.
-    """
-    if not text:
-        return "", ""
-    match = _SAY_RE.search(text)
-    if not match:
-        return text.strip(), ""
-    speech = match.group(1).strip().strip('"').strip("'")
-    plan = (text[: match.start()] + text[match.end() :]).strip()
-    return plan, speech
--- a/src/lerobot/policies/pi052/inference/triggers.py
+++ b/src/lerobot/policies/pi052/inference/triggers.py
@@ -1,134 +0,0 @@
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Trigger primitives for PI052's multi-rate inference runtime.
-
-Mirrors the plan's Section "Runtime orchestration": each
-``InferenceStep`` is gated by a :class:`Trigger` that decides per tick
-whether the step fires. Two trigger flavours cover all the cadences
-the canonical recipe needs:
-
-* :class:`HzTrigger` for periodic beats (action chunks at ~3-5 Hz,
-  high-level subtask generation at ~1 Hz, action dispatch at ~50 Hz)
-* :class:`EventTrigger` for one-shot reactions (subtask boundary →
-  memory update; user interjection → plan refresh; user VQA query →
-  vqa answer; pending tool call → dispatcher)
-
-Triggers are stateless except for ``HzTrigger``'s last-fire timestamp.
-The runtime stores the :class:`Tick` clock as ``state["_tick"]`` so
-every step shares a single time source.
-"""
-
-from __future__ import annotations
-
-import time
-from dataclasses import dataclass, field
-from typing import Any, Protocol
-
-
-@dataclass
-class Tick:
-    """Single tick from :class:`TickClock`. Carries time references the
-    runtime steps consume to gate themselves."""
-
-    index: int
-    """Monotonic counter — increments by one per tick."""
-
-    monotonic_seconds: float
-    """``time.monotonic()`` at the start of this tick."""
-
-
-@dataclass
-class TickClock:
-    """Drives the runtime loop at up to ``max_rate_hz``.
-
-    Sleeps just enough between :meth:`advance` calls to enforce the
-    rate. With ``max_rate_hz=50`` the loop wakes ~every 20ms; the
-    higher-level ``HzTrigger`` slices that timeline into sub-cadences.
-    """
-
-    max_rate_hz: float = 50.0
-    _index: int = field(default=0, init=False)
-    _last_seconds: float | None = field(default=None, init=False)
-
-    def advance(self) -> Tick:
-        period = 1.0 / max(self.max_rate_hz, 0.1)
-        now = time.monotonic()
-        if self._last_seconds is not None:
-            sleep_for = (self._last_seconds + period) - now
-            if sleep_for > 0:
-                time.sleep(sleep_for)
-                now = time.monotonic()
-        self._last_seconds = now
-        self._index += 1
-        return Tick(index=self._index, monotonic_seconds=now)
-
-
-class Trigger(Protocol):
-    """Decide whether the next ``InferenceStep`` should fire."""
-
-    def should_fire(self, tick: Tick, state: dict[str, Any]) -> bool: ...
-
-
-@dataclass
-class HzTrigger:
-    """Fire at most ``hz`` times per second.
-
-    A step that gates further (e.g. ``HighLevelSubtaskFwd`` skipping
-    when the action queue is non-empty) and wants the trigger to
-    retry next tick instead of waiting a full period can call
-    :meth:`rearm` from inside ``run``. Without this, a low-hz trigger
-    (e.g. ``hz=0.2`` = once per 5 s) almost never coincides with the
-    brief queue-empty window and the step never fires at all.
-    """
-
-    hz: float
-    _last_seconds: float | None = field(default=None, init=False)
-
-    def should_fire(self, tick: Tick, state: dict[str, Any]) -> bool:
-        period = 1.0 / max(self.hz, 1e-6)
-        if self._last_seconds is None or (tick.monotonic_seconds - self._last_seconds) >= period:
-            self._last_seconds = tick.monotonic_seconds
-            return True
-        return False
-
-    def rearm(self) -> None:
-        """Mark the trigger as not having fired, so the next tick re-evaluates.
-
-        Used by a step that decided to skip after ``should_fire`` already
-        committed the firing — keeps the cadence honest without losing
-        the slot.
-        """
-        self._last_seconds = None
-
-
-@dataclass
-class EventTrigger:
-    """Fire when ``event_name`` is in ``state["events_this_tick"]``.
-
-    The runtime fills ``events_this_tick`` once per tick from:
-
-    * stdin / network input (``user_interjection``, ``user_vqa_query``,
-      ``stop``)
-    * internal state transitions (``subtask_change``,
-      ``tool_call_pending``)
-
-    The list is consumed (cleared at the end of the tick) so events
-    fire at most once.
-    """
-
-    event_name: str
-
-    def should_fire(self, tick: Tick, state: dict[str, Any]) -> bool:
-        events: list[str] = state.get("events_this_tick") or []
-        return self.event_name in events
--- a/src/lerobot/policies/pi052/inference/ui.py
+++ b/src/lerobot/policies/pi052/inference/ui.py
@@ -1,127 +0,0 @@
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Rich-based REPL layout for the PI052 runtime.
-
-Two-zone terminal layout:
-
-    [chat scrollback — user messages / robot responses, scrolls naturally]
-
-    ┌── State ──────────────────────────────────────────┐
-    │ task     please clean up the kitchen              │
-    │ subtask  grasp the handle of the sponge           │
-    │ plan     1. grasp sponge  2. wipe  3. tidy        │
-    │ memory   sponge picked up; counter still dirty    │
-    └───────────────────────────────────────────────────┘
-    > _
-
-The state panel re-renders on every state change. Chat lines are
-``console.print``'d above the live region so they accumulate naturally
-in scrollback. Implemented with :class:`rich.live.Live` plus
-:meth:`rich.console.Console.input` for the prompt; when an input is
-pending, ``rich.Live`` auto-suspends so the input doesn't fight the
-panel for cursor position.
-"""
-
-from __future__ import annotations
-
-from typing import Any
-
-try:  # rich is optional; only required for the interactive REPL.
-    from rich.console import Console
-    from rich.panel import Panel
-    from rich.table import Table
-    from rich.text import Text
-
-    _HAS_RICH = True
-except ImportError:  # pragma: no cover
-    _HAS_RICH = False
-    Console = Any  # type: ignore[assignment]
-    Panel = Any  # type: ignore[assignment]
-    Table = Any  # type: ignore[assignment]
-    Text = Any  # type: ignore[assignment]
-
-
-_STATE_KEYS = (
-    ("task", "task"),
-    ("current_subtask", "subtask"),
-    ("current_plan", "plan"),
-    ("current_memory", "memory"),
-)
-
-
-def make_state_panel(state: dict[str, Any]) -> Any:
-    """Render the persistent state panel for the live region.
-
-    Returns a :class:`rich.panel.Panel`. Caller passes it to
-    ``Live.update(panel)`` whenever the state changes.
-    """
-    if not _HAS_RICH:
-        raise RuntimeError(
-            "rich is required for the interactive REPL. "
-            "`pip install rich` (it's a transitive dep of lerobot)."
-        )
-    table = Table.grid(padding=(0, 2), expand=True)
-    table.add_column(justify="right", style="dim", no_wrap=True, width=10)
-    table.add_column(justify="left")
-    for key, label in _STATE_KEYS:
-        value = state.get(key)
-        if value is None:
-            rendered = Text("(not set)", style="dim italic")
-        else:
-            rendered = Text(str(value), style="bold")
-        table.add_row(label, rendered)
-    queue = state.get("action_queue")
-    queue_len = len(queue) if hasattr(queue, "__len__") else 0
-    pending = state.get("tool_calls_pending") or []
-    footer = Text.assemble(
-        ("queued actions: ", "dim"),
-        (str(queue_len), "bold cyan"),
-        ("    pending tool calls: ", "dim"),
-        (str(len(pending)), "bold magenta"),
-    )
-    table.add_row("", footer)
-    run_mode = state.get("mode", "action")
-    mode_tag = (
-        "[green]action[/]" if run_mode == "action" else "[yellow]paused[/]"
-    )
-    return Panel(
-        table,
-        title=f"[bold]PI052 state[/] · mode: {mode_tag}",
-        border_style="cyan",
-    )
-
-
-def print_user_line(console: Any, line: str) -> None:
-    """Append a user-typed line to the chat scrollback."""
-    if not _HAS_RICH:
-        print(f"you: {line}", flush=True)
-        return
-    console.print(f"[bold cyan]you:[/] {line}")
-
-
-def print_robot_lines(console: Any, lines: list[str]) -> None:
-    """Append robot/runtime log lines to the chat scrollback."""
-    if not _HAS_RICH:
-        for line in lines:
-            print(f"robot: {line.lstrip()}", flush=True)
-        return
-    for line in lines:
-        # The runtime uses leading whitespace + "label: text"; render
-        # ``robot`` in green, the label dimmed, and the value in default.
-        stripped = line.lstrip()
-        if ":" in stripped:
-            label, _, value = stripped.partition(":")
-            console.print(f"[bold green]robot[/] [dim]({label.strip()})[/] {value.strip()}")
-        else:
-            console.print(f"[bold green]robot:[/] {stripped}")
--- a/src/lerobot/policies/pi052/inference/vqa.py
+++ b/src/lerobot/policies/pi052/inference/vqa.py
@@ -1,423 +0,0 @@
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Interactive VQA for the PI052 runtime.
-
-In ``/vlm`` mode a typed line is treated as a VQA question. This module
-runs the full interactive flow:
-
-  1. pull the current observation and list available cameras,
-  2. ask the operator which camera to ground the question on,
-  3. generate the answer with the VLM conditioned on that one camera,
-  4. parse the answer (``<loc>`` tokens or JSON); if it carries a bounding
-     box (``bbox``) or a point (``keypoint``), draw the overlay on the
-     camera frame, save a PNG to ``./vqa_overlays/`` and auto-open it.
-
-VQA answer schemas mirror the annotation pipeline's ``VQA_ANSWER_SHAPES``
-(see ``lerobot.annotations.steerable_pipeline.validator``):
-
-  * ``bbox``     — ``{"detections": [{"label", "bbox_format": "xyxy",
-                    "bbox": [x1, y1, x2, y2]}, ...]}``
-  * ``keypoint`` — ``{"label", "point_format": "xy", "point": [x, y]}``
-  * ``count`` / ``attribute`` / ``spatial`` — text-only, no overlay.
-"""
-
-from __future__ import annotations
-
-import json
-import logging
-import os
-import re
-import subprocess
-import sys
-import time
-import webbrowser
-from pathlib import Path
-from typing import Any
-
-from .runtime_state import push_log
-
-logger = logging.getLogger(__name__)
-
-_IMAGE_PREFIX = "observation.images."
-
-# PaliGemma detection / pointing vocabulary. PI052 trains spatial VQA
-# answers in this native ``<locNNNN>`` format (index in [0, 1023],
-# normalized to the image axis) instead of pixel-coordinate JSON, so the
-# answer string the runtime parses can be e.g.
-# ``<loc0512><loc0301> blue cube`` (point) or
-# ``<loc0100><loc0080><loc0400><loc0360> blue cube`` (box).
-_LOC_RE = re.compile(r"<loc(\d{1,4})>")
-
-# Iteration order for shape matching — most specific keys first so an
-# answer is classified deterministically.
-_SHAPE_ORDER = ("bbox", "keypoint", "count", "attribute", "spatial")
-
-_BBOX_COLOR = (255, 64, 64)
-_POINT_COLOR = (64, 220, 64)
-
-
-# ---------------------------------------------------------------------------
-# Camera selection
-# ---------------------------------------------------------------------------
-
-
-def available_cameras(observation: dict | None) -> list[str]:
-    """Return the sorted ``observation.images.*`` keys present in ``observation``."""
-    if not observation:
-        return []
-    return sorted(k for k in observation if isinstance(k, str) and k.startswith(_IMAGE_PREFIX))
-
-
-def camera_short_name(camera_key: str) -> str:
-    """Strip the ``observation.images.`` prefix for display."""
-    return camera_key[len(_IMAGE_PREFIX) :] if camera_key.startswith(_IMAGE_PREFIX) else camera_key
-
-
-def prompt_camera_choice(
-    cameras: list[str],
-    *,
-    input_fn: Any = input,
-    print_fn: Any = print,
-) -> str | None:
-    """Ask the operator which camera frame to draw a VQA overlay on.
-
-    Accepts either the menu number or the (short or full) camera name.
-    A single-camera setup auto-selects without prompting, and an empty
-    answer picks the first camera. Returns the chosen
-    ``observation.images.*`` key, or ``None`` on cancel / invalid input.
-    """
-    if not cameras:
-        return None
-    if len(cameras) == 1:
-        return cameras[0]
-    print_fn("Draw the result on which camera?")
-    for i, cam in enumerate(cameras, 1):
-        print_fn(f"  [{i}] {camera_short_name(cam)}")
-    try:
-        raw = str(input_fn("camera> ")).strip()
-    except (EOFError, KeyboardInterrupt):
-        return None
-    if not raw:
-        return cameras[0]
-    if raw.isdigit():
-        idx = int(raw) - 1
-        return cameras[idx] if 0 <= idx < len(cameras) else None
-    for cam in cameras:
-        if raw == cam or raw == camera_short_name(cam):
-            return cam
-    return None
-
-
-# ---------------------------------------------------------------------------
-# Answer parsing
-# ---------------------------------------------------------------------------
-
-
-def _loc_to_norm(idx: int) -> float:
-    """PaliGemma ``<locNNNN>`` index → normalized [0, 1] axis coordinate."""
-    return max(0.0, min(1023.0, float(idx))) / 1023.0
-
-
-def parse_loc_answer(answer: str) -> dict | None:
-    """Parse a PaliGemma ``<loc>``-format spatial VQA answer.
-
-    PI052 trains spatial answers in PaliGemma's native detection
-    vocabulary, label-first: a point is ``<label> <locY><locX>``, a box
-    is ``<label> <locY0><locX0><locY1><locX1>``, and multiple boxes are
-    joined by `` ; `` (e.g. ``cube <loc..><loc..><loc..><loc..> ; box
-    <loc..><loc..><loc..><loc..>``). Loc-first formats are also accepted
-    — this parser strips loc tokens and treats the remainder as the
-    label, so order is irrelevant. Coordinates come back *normalized*
-    ([0, 1]); the overlay denormalizes them against the chosen camera
-    frame's pixel size.
-
-    Returns ``{"kind", "payload", "normalized": True}`` on success
-    (``payload`` mirrors the JSON shapes so the overlay code is shared),
-    or ``None`` when no segment has two or at least four ``<loc>`` tokens.
-    """
-    if not answer or "<loc" not in answer:
-        return None
-    segments = [seg for seg in answer.split(";") if "<loc" in seg]
-    points: list[tuple[float, float, str]] = []
-    boxes: list[tuple[float, float, float, float, str]] = []
-    for seg in segments:
-        locs = [int(m) for m in _LOC_RE.findall(seg)]
-        label = _LOC_RE.sub("", seg).strip()
-        if len(locs) == 2:
-            y, x = (_loc_to_norm(v) for v in locs[:2])
-            points.append((x, y, label))
-        elif len(locs) >= 4:
-            y1, x1, y2, x2 = (_loc_to_norm(v) for v in locs[:4])
-            boxes.append((x1, y1, x2, y2, label))
-    if boxes:
-        detections = [
-            {"label": lbl, "bbox_format": "xyxy", "bbox": [x1, y1, x2, y2]}
-            for (x1, y1, x2, y2, lbl) in boxes
-        ]
-        return {"kind": "bbox", "payload": {"detections": detections}, "normalized": True}
-    if len(points) == 1:
-        x, y, lbl = points[0]
-        return {
-            "kind": "keypoint",
-            "payload": {"label": lbl, "point_format": "xy", "point": [x, y]},
-            "normalized": True,
-        }
-    if points:  # several bare points → treat as detections-as-points
-        detections = [
-            {"label": lbl, "bbox_format": "xyxy", "bbox": [x, y, x, y]} for (x, y, lbl) in points
-        ]
-        return {"kind": "bbox", "payload": {"detections": detections}, "normalized": True}
-    return None
-
-
-def parse_vqa_answer(answer: str) -> dict | None:
-    """Parse a VQA answer string into ``{"kind", "payload"}``.
-
-    ``kind`` is one of the ``VQA_ANSWER_SHAPES`` names (``bbox``,
-    ``keypoint``, ``count``, ``attribute``, ``spatial``) or ``"unknown"``
-    when the JSON doesn't match any known shape. PaliGemma ``<loc>``
-    spatial answers are detected first (PI052 trains them in that native
-    format). Returns ``None`` when the answer is neither ``<loc>`` text
-    nor a parseable JSON object.
-    """
-    if not answer or not answer.strip():
-        return None
-    loc_parsed = parse_loc_answer(answer)
-    if loc_parsed is not None:
-        return loc_parsed
-    try:
-        payload = json.loads(answer)
-    except (ValueError, TypeError):
-        return None
-    if not isinstance(payload, dict):
-        return None
-
-    try:
-        from lerobot.annotations.steerable_pipeline.validator import (  # noqa: PLC0415
-            VQA_ANSWER_SHAPES,
-        )
-
-        shapes = VQA_ANSWER_SHAPES
-    except ImportError:  # pragma: no cover - annotation extra not installed
-        shapes = {
-            "bbox": {"detections"},
-            "keypoint": {"label", "point_format", "point"},
-            "count": {"label", "count"},
-            "attribute": {"label", "attribute", "value"},
-            "spatial": {"subject", "relation", "object"},
-        }
-
-    keys = set(payload)
-    for kind in _SHAPE_ORDER:
-        required = shapes.get(kind)
-        if required and required <= keys:
-            return {"kind": kind, "payload": payload}
-    return {"kind": "unknown", "payload": payload}
-
-
-def answer_has_overlay(parsed: dict | None) -> bool:
-    """True iff ``parsed`` carries drawable spatial coordinates."""
-    return bool(parsed) and parsed.get("kind") in ("bbox", "keypoint")
-
-
-# ---------------------------------------------------------------------------
-# Overlay drawing
-# ---------------------------------------------------------------------------
-
-
-def observation_image_to_pil(image_tensor: Any) -> Any:
-    """Convert an ``observation.images.*`` tensor to a PIL RGB image.
-
-    The runtime observation stores images as ``(1, C, H, W)`` (or
-    ``(C, H, W)``) float tensors in ``[0, 1]``. Reuses
-    ``image_array_to_pil_image`` which handles the CHW→HWC transpose and
-    the float→uint8 scaling.
-    """
-    from lerobot.datasets.image_writer import image_array_to_pil_image  # noqa: PLC0415
-
-    arr = image_tensor
-    if hasattr(arr, "detach"):
-        arr = arr.detach().cpu()
-    if hasattr(arr, "numpy"):
-        arr = arr.numpy()
-    while arr.ndim > 3:  # drop leading batch dim(s)
-        arr = arr[0]
-    return image_array_to_pil_image(arr).convert("RGB")
-
-
-def draw_vqa_overlay(image: Any, parsed: dict) -> Any:
-    """Draw ``bbox`` / ``keypoint`` answers onto a copy of ``image``.
-
-    Non-spatial answers (``count`` / ``attribute`` / ``spatial`` /
-    ``unknown``) are returned as an unmodified copy. When ``parsed`` has
-    ``normalized=True`` (PaliGemma ``<loc>`` answers) the [0, 1]
-    coordinates are scaled to the image's pixel size.
-    """
-    from PIL import ImageDraw  # noqa: PLC0415
-
-    img = image.convert("RGB").copy()
-    kind = parsed.get("kind")
-    payload = parsed.get("payload") or {}
-    draw = ImageDraw.Draw(img)
-    w, h = img.size
-    sx, sy = (w, h) if parsed.get("normalized") else (1, 1)
-
-    if kind == "bbox":
-        for det in payload.get("detections") or []:
-            if not isinstance(det, dict):
-                continue
-            box = det.get("bbox")
-            if not (isinstance(box, list | tuple) and len(box) == 4):
-                continue
-            try:
-                x1, y1, x2, y2 = (float(v) for v in box)
-            except (TypeError, ValueError):
-                continue
-            x1, x2 = x1 * sx, x2 * sx
-            y1, y2 = y1 * sy, y2 * sy
-            draw.rectangle([x1, y1, x2, y2], outline=_BBOX_COLOR, width=3)
-            label = str(det.get("label", "")).strip()
-            if label:
-                draw.text((x1 + 3, max(0.0, y1 - 12)), label, fill=_BBOX_COLOR)
-    elif kind == "keypoint":
-        point = payload.get("point")
-        if isinstance(point, list | tuple) and len(point) == 2:
-            try:
-                x, y = float(point[0]) * sx, float(point[1]) * sy
-            except (TypeError, ValueError):
-                return img
-            r = 6
-            draw.ellipse([x - r, y - r, x + r, y + r], outline=_POINT_COLOR, width=3)
-            draw.line([x - 2 * r, y, x + 2 * r, y], fill=_POINT_COLOR, width=2)
-            draw.line([x, y - 2 * r, x, y + 2 * r], fill=_POINT_COLOR, width=2)
-            label = str(payload.get("label", "")).strip()
-            if label:
-                draw.text((x + r + 3, y - r), label, fill=_POINT_COLOR)
-    return img
-
-
-def _open_file(path: Path) -> None:
-    """Best-effort open ``path`` in the OS default viewer."""
-    try:
-        if sys.platform == "darwin":
-            subprocess.run(["open", str(path)], check=False)
-        elif sys.platform.startswith("linux"):
-            subprocess.run(["xdg-open", str(path)], check=False)
-        elif os.name == "nt":
-            os.startfile(str(path))  # type: ignore[attr-defined]  # noqa: S606
-        else:  # pragma: no cover - exotic platform
-            webbrowser.open(path.resolve().as_uri())
-    except Exception as exc:  # noqa: BLE001
-        logger.debug("could not auto-open %s: %s", path, exc)
-
-
-def save_and_open_overlay(image: Any, out_dir: str | Path = "./vqa_overlays") -> Path:
-    """Save ``image`` as a timestamped PNG under ``out_dir`` and auto-open it."""
-    out = Path(out_dir)
-    out.mkdir(parents=True, exist_ok=True)
-    path = out / f"vqa_{int(time.time() * 1000)}.png"
-    image.save(path)
-    _open_file(path)
-    return path
-
-
-# ---------------------------------------------------------------------------
-# Orchestrator
-# ---------------------------------------------------------------------------
-
-
-def handle_vqa_query(
-    *,
-    policy: Any,
-    observation_provider: Any,
-    question: str,
-    state: dict[str, Any],
-    input_fn: Any = input,
-    print_fn: Any = print,
-) -> None:
-    """Run one interactive VQA question end to end.
-
-    Called synchronously from the input layer while the runtime is in
-    ``/question`` mode (the action loop is gated off, so the policy is
-    not in concurrent use). Progress is reported via both
-    :func:`push_log` (REPL panel scrollback) and ``print_fn`` (direct
-    stdout) — in autonomous question mode the panel redraw is suspended,
-    so the direct print is what the operator actually sees.
-    """
-    from .steps import _generate_with_policy, _msgs_for_vqa  # noqa: PLC0415
-
-    def report(line: str) -> None:
-        """Surface a line both to the panel scrollback and to stdout."""
-        push_log(state, line)
-        try:
-            print_fn(line)
-        except Exception:  # noqa: BLE001
-            pass
-
-    if policy is None or not hasattr(policy, "select_message"):
-        report("  [warn] vqa: policy has no select_message — skipping")
-        return
-
-    observation: dict | None = None
-    if observation_provider is not None:
-        try:
-            observation = observation_provider()
-        except Exception as exc:  # noqa: BLE001
-            logger.debug("observation_provider raised %s", exc)
-
-    # Feed the FULL observation (every camera + state) to the VLM. The
-    # ``ask_vqa_*`` recipes look single-camera, but the image *block* is
-    # stripped before tokenization — the actual frames reach the model
-    # via PI052's ``OBS_IMAGES_*`` channels, and ``embed_prefix``
-    # consumes *all* ``config.image_features`` regardless of which
-    # camera the sub-recipe was tagged for. So the model always sees
-    # every camera; the operator never has to name one to ask.
-    answer = _generate_with_policy(
-        policy,
-        _msgs_for_vqa(question),
-        observation=observation,
-        state=state,
-        label="vqa gen",
-    )
-    if not answer:
-        report("  [info] vqa gen returned empty")
-        return
-    report(f"  vqa: {answer}")
-
-    parsed = parse_vqa_answer(answer)
-    if not answer_has_overlay(parsed):
-        if parsed is None:
-            report("  [info] vqa answer is not JSON — no overlay")
-        return
-
-    # The answer carries a bounding box / point. Its pixel coordinates
-    # are camera-specific and the text answer doesn't say which camera,
-    # so ask the operator *now* — only when there is actually something
-    # to draw — which camera frame to render the overlay on.
-    cameras = available_cameras(observation)
-    if observation is None or not cameras:
-        report("  [info] no camera image — cannot draw overlay")
-        return
-    chosen = prompt_camera_choice(cameras, input_fn=input_fn, print_fn=print_fn)
-    if chosen is None:
-        report("  [info] overlay skipped — no camera selected")
-        return
-    try:
-        pil = observation_image_to_pil(observation[chosen])
-        overlay = draw_vqa_overlay(pil, parsed)
-        path = save_and_open_overlay(overlay)
-        report(f"  vqa overlay ({camera_short_name(chosen)}) saved: {path}")
-    except Exception as exc:  # noqa: BLE001
-        logger.warning("vqa overlay failed: %s", exc, exc_info=logger.isEnabledFor(logging.DEBUG))
-        report(f"  [warn] vqa overlay failed: {type(exc).__name__}: {exc}")
--- a/src/lerobot/policies/pi052/modeling_pi052.py
+++ b/src/lerobot/policies/pi052/modeling_pi052.py
--- a/src/lerobot/policies/pi052/processor_pi052.py
+++ b/src/lerobot/policies/pi052/processor_pi052.py
@@ -1,198 +0,0 @@
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-"""π0.5 v2 pre/post-processor factory.
-
-When ``config.recipe_path`` is set, the pre-processor pipeline becomes:
-
-    rename observations
-    add batch dim
-    relative-action prep      (inherited from π0.5)
-    NormalizerProcessorStep
-    RenderMessagesStep        — recipe → messages, target_message_indices,
-                                message_streams (PR 1 of the steerable
-                                stack)
-    PI052TextTokenizerStep    — messages → input_ids + label mask +
-                                predict_actions
-    DeviceProcessorStep
-
-When ``recipe_path`` is ``None`` we delegate to the plain π0.5 pipeline
-so unannotated datasets keep working.
-
-Post-processor is unchanged from π0.5.
-"""
-
-from __future__ import annotations
-
-from pathlib import Path
-from typing import Any
-
-import torch
-
-from lerobot.configs.recipe import TrainingRecipe
-from lerobot.processor import (
-    AbsoluteActionsProcessorStep,
-    ActionTokenizerProcessorStep,
-    AddBatchDimensionProcessorStep,
-    DeviceProcessorStep,
-    NormalizerProcessorStep,
-    PolicyAction,
-    PolicyProcessorPipeline,
-    RelativeActionsProcessorStep,
-    RenameObservationsProcessorStep,
-    UnnormalizerProcessorStep,
-    policy_action_to_transition,
-    transition_to_policy_action,
-)
-# RenderMessagesStep is intentionally not re-exported from
-# ``lerobot.processor`` because it pulls in optional language-stack deps;
-# import it directly.
-from lerobot.processor.render_messages_processor import RenderMessagesStep
-from lerobot.utils.constants import POLICY_POSTPROCESSOR_DEFAULT_NAME, POLICY_PREPROCESSOR_DEFAULT_NAME
-
-from ..pi05.processor_pi05 import make_pi05_pre_post_processors
-from .configuration_pi052 import PI052Config
-from .text_processor_pi052 import PI052TextTokenizerStep
-
-
-def make_pi052_pre_post_processors(
-    config: PI052Config,
-    dataset_stats: dict[str, dict[str, torch.Tensor]] | None = None,
-    dataset_repo_id: str | None = None,
-) -> tuple[
-    PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
-    PolicyProcessorPipeline[PolicyAction, PolicyAction],
-]:
-    """Build the π0.5 v2 pre/post-processor pipelines.
-
-    Falls through to π0.5's stock pipeline when ``recipe_path`` is unset.
-    """
-    if not config.recipe_path:
-        return make_pi05_pre_post_processors(config, dataset_stats=dataset_stats)
-
-    recipe = _load_recipe(config.recipe_path)
-
-    relative_step = RelativeActionsProcessorStep(
-        enabled=config.use_relative_actions,
-        exclude_joints=getattr(config, "relative_exclude_joints", []),
-        action_names=getattr(config, "action_feature_names", None),
-    )
-
-    input_steps = [
-        RenameObservationsProcessorStep(rename_map={}),
-        AddBatchDimensionProcessorStep(),
-        relative_step,
-        NormalizerProcessorStep(
-            features={**config.input_features, **config.output_features},
-            norm_map=config.normalization_mapping,
-            stats=dataset_stats,
-        ),
-        RenderMessagesStep(recipe=recipe),
-        PI052TextTokenizerStep(
-            tokenizer_name="google/paligemma-3b-pt-224",
-            max_length=config.tokenizer_max_length,
-            plan_dropout_prob=getattr(config, "plan_dropout_prob", 0.0),
-            memory_dropout_prob=getattr(config, "memory_dropout_prob", 0.0),
-            subtask_dropout_prob=getattr(config, "subtask_dropout_prob", 0.0),
-        ),
-    ]
-
-    # FAST tokenizer for discrete-action CE supervision (paper §III.C).
-    # Only inserted when explicitly enabled — keeps the post-training-
-    # style recipe (flow + text) as the default. When on, the step
-    # writes ACTION_TOKENS / ACTION_TOKEN_MASK into
-    # ``COMPLEMENTARY_DATA`` and the modeling forward picks them up.
-    if getattr(config, "enable_fast_action_loss", False):
-        # Per Pertsch et al. 2025 (FAST [64], π0.5 §III.C): fit the
-        # tokenizer on this dataset's action distribution rather than
-        # using the universal codebook off the shelf. We do this once
-        # and cache to disk, keyed on (dataset, base, n_samples).
-        action_tokenizer_path = config.action_tokenizer_name
-        if (
-            getattr(config, "auto_fit_fast_tokenizer", False)
-            and dataset_repo_id is not None
-        ):
-            from .fit_fast_tokenizer import fit_fast_tokenizer  # noqa: PLC0415
-
-            cache_dir = Path(config.fast_tokenizer_cache_dir).expanduser()
-            try:
-                action_tokenizer_path = fit_fast_tokenizer(
-                    dataset_repo_id=dataset_repo_id,
-                    cache_dir=cache_dir,
-                    base_tokenizer_name=config.action_tokenizer_name,
-                    n_samples=config.fast_tokenizer_fit_samples,
-                    chunk_size=config.chunk_size,
-                )
-            except Exception as exc:  # noqa: BLE001
-                import logging  # noqa: PLC0415
-
-                logging.getLogger(__name__).warning(
-                    "FAST tokenizer fit failed (%s) — falling back to "
-                    "the universal base tokenizer %r. Training will still "
-                    "work but compression will be suboptimal.",
-                    exc, config.action_tokenizer_name,
-                )
-
-        input_steps.append(
-            ActionTokenizerProcessorStep(
-                action_tokenizer_name=action_tokenizer_path,
-                max_action_tokens=config.max_action_tokens,
-                fast_skip_tokens=config.fast_skip_tokens,
-                paligemma_tokenizer_name="google/paligemma-3b-pt-224",
-            )
-        )
-
-    input_steps.append(DeviceProcessorStep(device=config.device))
-
-    output_steps = [
-        UnnormalizerProcessorStep(
-            features=config.output_features,
-            norm_map=config.normalization_mapping,
-            stats=dataset_stats,
-        ),
-        AbsoluteActionsProcessorStep(
-            enabled=config.use_relative_actions,
-            relative_step=relative_step,
-        ),
-        DeviceProcessorStep(device="cpu"),
-    ]
-    return (
-        PolicyProcessorPipeline[dict[str, Any], dict[str, Any]](
-            steps=input_steps,
-            name=POLICY_PREPROCESSOR_DEFAULT_NAME,
-        ),
-        PolicyProcessorPipeline[PolicyAction, PolicyAction](
-            steps=output_steps,
-            name=POLICY_POSTPROCESSOR_DEFAULT_NAME,
-            to_transition=policy_action_to_transition,
-            to_output=transition_to_policy_action,
-        ),
-    )
-
-
-def _load_recipe(path_str: str) -> TrainingRecipe:
-    """Resolve ``path_str`` to a ``TrainingRecipe``.
-
-    Accepts an absolute path, a path that exists relative to the current
-    working directory, or a path relative to ``src/lerobot/configs/``.
-    """
-    p = Path(path_str)
-    if not p.is_absolute() and not p.exists():
-        from lerobot.configs import recipe as _recipe_module  # noqa: PLC0415
-
-        configs_dir = Path(_recipe_module.__file__).resolve().parent
-        candidate = configs_dir / path_str
-        if candidate.exists():
-            p = candidate
-    return TrainingRecipe.from_yaml(p)
--- a/src/lerobot/policies/pi052/text_processor_pi052.py
+++ b/src/lerobot/policies/pi052/text_processor_pi052.py
@@ -1,598 +0,0 @@
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-"""π0.5 v2 text-tokenisation step.
-
-PaliGemma is *not* chat-pretrained, so we can't lean on
-``tokenizer.apply_chat_template``. Instead we concatenate the rendered
-messages as plain text with simple ``User: ... Assistant: ...`` role
-delimiters — matching the prompt format π0.5 uses in the paper
-(``Task: ... State: ... Action: ...``).
-
-Outputs:
-
-* ``OBS_LANGUAGE_TOKENS`` / ``OBS_LANGUAGE_ATTENTION_MASK`` — the
-  concatenated prompt tokenised by the PaliGemma tokenizer (the same
-  one ``processor_pi05`` already uses).
-* ``text_labels`` — same shape as token ids, ``-100`` everywhere except
-  positions belonging to messages whose index is in
-  ``target_message_indices``. ``modeling_pi052`` runs cross-entropy on
-  those positions via the PaliGemma ``lm_head``.
-* ``predict_actions`` — bool tensor, ``True`` iff any of the rendered
-  target messages has ``message_streams[i] == "low_level"``.
-"""
-
-from __future__ import annotations
-
-import json
-import logging
-from dataclasses import dataclass
-from typing import Any
-
-import torch
-from torch import Tensor
-
-from lerobot.configs import PipelineFeatureType, PolicyFeature
-from lerobot.processor.pipeline import ProcessorStep, ProcessorStepRegistry
-from lerobot.types import EnvTransition, TransitionKey
-from lerobot.utils.constants import OBS_LANGUAGE_ATTENTION_MASK, OBS_LANGUAGE_TOKENS
-
-logger = logging.getLogger(__name__)
-
-
-def _content_to_text(content: Any) -> str:
-    """Collapse a message's ``content`` (string or multimodal blocks) to text."""
-    if isinstance(content, str):
-        return content
-    if isinstance(content, list):
-        parts = [
-            b["text"]
-            for b in content
-            if isinstance(b, dict) and b.get("type") == "text" and isinstance(b.get("text"), str)
-        ]
-        return "\n".join(parts)
-    return ""
-
-
-def _flatten_say_tool_calls(message: dict[str, Any]) -> dict[str, Any]:
-    """Serialize assistant ``say`` tool calls into a ``<say>...</say>`` marker.
-
-    PaliGemma's flat text prompt has no notion of structured tool calls,
-    and ``_format_messages`` only reads ``role`` / ``content`` — so
-    without this a ``say`` tool call is dropped entirely and never
-    supervised. Rewriting it into the content text as a ``<say>...</say>``
-    marker lets the LM head learn to emit it; the runtime parses it back
-    via ``_split_plan_and_say``. Messages without ``say`` tool calls are
-    returned unchanged (the structured calls, if any, are still dropped).
-    """
-    tool_calls = message.get("tool_calls")
-    if not tool_calls:
-        return message
-    say_texts: list[str] = []
-    for call in tool_calls:
-        if not isinstance(call, dict):
-            continue
-        fn = call.get("function") or {}
-        if fn.get("name") != "say":
-            continue
-        args = fn.get("arguments")
-        if isinstance(args, str):
-            try:
-                # Arguments may arrive JSON-encoded; ``json`` is already
-                # imported at module level, so no local import is needed.
-                args = json.loads(args)
-            except (ValueError, TypeError):
-                args = {}
-        text = args.get("text", "") if isinstance(args, dict) else ""
-        if text:
-            say_texts.append(str(text))
-    new = dict(message)
-    new.pop("tool_calls", None)
-    if not say_texts:
-        return new
-    base = _content_to_text(new.get("content")).strip()
-    marker = "".join(f"<say>{t}</say>" for t in say_texts)
-    new["content"] = f"{base}\n{marker}" if base else marker
-    return new
-
-
-def _strip_blocks(message: dict[str, Any]) -> dict[str, Any]:
-    """Normalise a message's content to a plain string.
-
-    The recipe renderer can emit ``content`` as a string OR as a list
-    of HF-style multimodal blocks (``{type: text, text: ...}``,
-    ``{type: image, feature: ...}``). PaliGemma's text tokenizer can
-    only consume strings, so we flatten: drop image blocks (cameras
-    flow through ``observation.images.*`` separately) and join text
-    block texts.
-    """
-    new = dict(message)
-    new.pop("stream", None)
-    new.pop("target", None)
-    content = new.get("content")
-    if content is None:
-        new["content"] = ""
-    elif isinstance(content, str):
-        pass
-    elif isinstance(content, list):
-        parts: list[str] = []
-        for block in content:
-            if not isinstance(block, dict):
-                continue
-            if block.get("type") == "text":
-                t = block.get("text", "")
-                if isinstance(t, str):
-                    parts.append(t)
-        new["content"] = "\n".join(parts)
-    else:
-        new["content"] = str(content)
-    return new
-
-
-def _is_batched_messages(messages: Any) -> bool:
-    return isinstance(messages, list) and bool(messages) and isinstance(messages[0], list)
-
-
-def _sample_indices(value: Any, batch_size: int) -> list[int | None]:
-    if value is None:
-        return [None] * batch_size
-    if isinstance(value, torch.Tensor):
-        if value.numel() == 1:
-            return [int(value.item())] * batch_size
-        values = value.reshape(-1).tolist()
-        return [int(v) for v in values[:batch_size]]
-    if isinstance(value, (list, tuple)):
-        if len(value) == 1:
-            return _sample_indices(value[0], batch_size)
-        return [int(v.item() if hasattr(v, "item") else v) for v in value[:batch_size]]
-    return [int(value)] * batch_size
-
-
-# ---------------------------------------------------------------------------
-# VQA spatial answers → PaliGemma <loc> format (PI052 only)
-#
-# PaliGemma is pre-trained on detection / pointing with a ``<locNNNN>``
-# vocabulary (indices in [0, 1023]). The recipe's bbox / keypoint VQA
-# answers are stored as JSON in Qwen2.5-VL's grounding convention:
-# **0–1000 normalized coordinates**, NOT pixels. (Verified empirically
-# on the published datasets: x and y both span 0..1000 with ~30% of
-# values exceeding the camera's pixel dimensions — they're not pixels.)
-# Converting to ``<loc>`` is therefore camera-resolution-independent:
-# ``loc_idx = round(coord / 1000 * 1023)``. We do the conversion here —
-# not in the dataset — so the dataset keeps the raw JSON and stays
-# backbone-agnostic.
-# ---------------------------------------------------------------------------
-
-# The 0–1000 scale Qwen2.5-VL emits for grounding coordinates.
-_VQA_COORD_SCALE = 1000.0
-
-
-def register_paligemma_loc_tokens(tokenizer: Any) -> Any:
-    """Make raw-text ``<locDDDD>`` strings encode to their single reserved ids.
-
-    PaliGemma reserves vocab ids [256000, 257023] for ``<locDDDD>``
-    (detection / pointing) tokens, but the *stock* tokenizer does NOT
-    match them when encoding raw text — it BPE-splits ``<loc0162>`` into
-    7 pieces (``<``, ``loc``, ``0``, ``1``, ``6``, ``2``, ``>``). Training
-    the LM head on a ``<loc>`` target then supervises those 7 generic
-    BPE pieces instead of one detection-vocab id, the LM head learns to
-    emit the *character sequence*, and those pieces' logits dominate
-    other turns (the ``<loc>``-salad on subtasks). Registering the loc
-    tokens once makes them tokenize as their single ids (256000+idx),
-    leveraging PaliGemma's detection prior properly. Idempotent.
-    """
-    if "<loc0000>" in getattr(tokenizer, "added_tokens_encoder", {}):
-        return tokenizer
-    tokenizer.add_tokens([f"<loc{i:04d}>" for i in range(1024)])
-    return tokenizer
-
-
-def _loc_token(coord: float, scale: float = _VQA_COORD_SCALE) -> str:
-    """PaliGemma ``<locNNNN>`` for a coord on a ``[0, scale]`` axis."""
-    idx = round(float(coord) / scale * 1023) if scale > 0 else 0
-    return f"<loc{max(0, min(1023, idx)):04d}>"
-
-
-def _vqa_answer_to_loc(answer: dict[str, Any]) -> str | None:
-    """Convert a bbox / keypoint VQA answer dict to PaliGemma ``<loc>`` text.
-
-    Input coordinates are in Qwen2.5-VL's 0–1000 normalized space (see
-    module-level note). y is emitted before x for each coordinate pair
-    (PaliGemma convention), with the integer indices in [0, 1023].
-
-    **Format: label first, locs after.** PaliGemma's pretraining puts
-    locs first (``<loc><loc> label``), but for our small-dataset VQA
-    blend that turns the LM head into a loc-emission attractor at every
-    ``Assistant:`` position — VQA targets share their first supervised
-    token with ~25% of all text samples, and the head collapses to
-    emitting ``<loc>`` regardless of the prompt. Putting the label
-    first (``label <locY><locX>``) means every text sample (subtask,
-    memory, VQA, …) starts the supervised target with a real word,
-    breaking the attractor. The model still learns the loc vocabulary
-    for the *spatial* portion of the answer; it just can't fire it as
-    the first generation step from a clean prompt.
-
-    Returns ``None`` for non-spatial answers (count / attribute /
-    spatial-relation) — those keep their JSON form.
-    """
-    point = answer.get("point")
-    if isinstance(point, list | tuple) and len(point) == 2 and "point_format" in answer:
-        try:
-            x, y = float(point[0]), float(point[1])
-        except (TypeError, ValueError):
-            return None
-        label = str(answer.get("label", "")).strip()
-        if not label:
-            return None
-        return f"{label} {_loc_token(y)}{_loc_token(x)}"
-
-    detections = answer.get("detections")
-    if isinstance(detections, list) and detections:
-        parts: list[str] = []
-        for det in detections:
-            if not isinstance(det, dict):
-                continue
-            box = det.get("bbox")
-            if not (isinstance(box, list | tuple) and len(box) == 4):
-                continue
-            try:
-                x1, y1, x2, y2 = (float(v) for v in box)
-            except (TypeError, ValueError):
-                continue
-            label = str(det.get("label", "")).strip()
-            if not label:
-                continue
-            toks = (
-                f"{_loc_token(y1)}{_loc_token(x1)}"
-                f"{_loc_token(y2)}{_loc_token(x2)}"
-            )
-            parts.append(f"{label} {toks}")
-        return " ; ".join(parts) if parts else None
-    return None
-
-
-def _messages_vqa_to_loc(
-    messages: list[dict[str, Any]],
-    target_indices: list[int],
-) -> list[dict[str, Any]]:
-    """Rewrite bbox / keypoint VQA *target* answers from JSON to ``<loc>`` text.
-
-    Each target turn whose content parses as a spatial VQA answer is
-    converted. Non-spatial answers and subtask / memory targets (plain
-    text → not JSON) are left untouched. Camera-independent: VQA coords
-    are 0–1000 normalized, so no observation lookup is needed.
-    """
-    if not target_indices:
-        return messages
-    out = list(messages)
-    for idx in target_indices:
-        if not (0 <= idx < len(out)):
-            continue
-        content = out[idx].get("content")
-        if not isinstance(content, str) or not content.strip():
-            continue
-        try:
-            answer = json.loads(content)
-        except (ValueError, TypeError):
-            continue  # subtask / memory targets are plain text — skip
-        if not isinstance(answer, dict):
-            continue
-        loc_text = _vqa_answer_to_loc(answer)
-        if loc_text is not None:
-            out[idx] = {**out[idx], "content": loc_text}
-    return out
-
-
-def _format_messages(
-    messages: list[dict[str, Any]],
-    target_indices: list[int] | None = None,
-    eos_token: str | None = None,
-) -> tuple[str, list[tuple[int, int]]]:
-    """Concatenate messages into the π0.5-style flat prompt.
-
-    When both ``target_indices`` and ``eos_token`` are given, the EOS
-    string is appended to each supervised target turn's content and the
-    returned span covers it — so the label builder marks the EOS token
-    as a supervised label. That teaches the LM head where the answer
-    *ends*: without an EOS in the target span the model is never given a
-    stop signal and rambles to ``max_length`` at inference. Inference
-    callers omit both args (no EOS baked into the prompt — the model
-    generates it and ``select_message`` stops on it).
-
-    Returns:
-        prompt:       the full text the tokenizer will consume.
-        msg_spans:    list of ``(char_start, char_end)`` covering each
-                      message's supervised payload (content, plus the
-                      appended EOS for target turns) within ``prompt``.
-    """
-    targets = set(target_indices or [])
-    parts: list[str] = []
-    spans: list[tuple[int, int]] = []
-    cursor = 0
-    for i, m in enumerate(messages):
-        role = m.get("role", "user")
-        content = m.get("content", "") or ""
-        # Role tag (e.g. ``User: ``); the trailing newline is appended
-        # after the body below. The model has to learn to emit the same
-        # role tags at generation time, which is fine for greedy decoding
-        # because the chat template is implicit in the supervised target span.
-        header = f"{role.capitalize()}: "
-        # A supervised target turn ends with EOS so the model learns to
-        # terminate; the span below covers content + EOS. Non-target
-        # turns (and inference) carry no EOS.
-        body = content + eos_token if (eos_token and i in targets) else content
-        # span covers the content (+ EOS) portion only — never the role
-        # tag — so labels are computed over the supervised payload.
-        full = header + body + "\n"
-        start = cursor + len(header)
-        end = start + len(body)
-        parts.append(full)
-        spans.append((start, end))
-        cursor += len(full)
-    return "".join(parts), spans
-
-
-@dataclass
-@ProcessorStepRegistry.register(name="pi052_text_tokenizer")
-class PI052TextTokenizerStep(ProcessorStep):
-    """Render messages → token ids + label mask + predict_actions flag.
-
-    No chat template; concatenates messages as
-    ``User: ... \\nAssistant: ...`` text.
-    """
-
-    tokenizer_name: str = "google/paligemma-3b-pt-224"
-    max_length: int = 200
-    padding: str = "max_length"
-    padding_side: str = "right"
-    plan_dropout_prob: float = 0.0
-    memory_dropout_prob: float = 0.0
-    subtask_dropout_prob: float = 0.0
-    interjection_dropout_prob: float = 0.0
-    dropout_seed: int | None = None
-
-    def __post_init__(self) -> None:
-        self._tokenizer: Any = None
-
-    def _ensure_tokenizer(self) -> Any:
-        if self._tokenizer is not None:
-            return self._tokenizer
-        from transformers import AutoTokenizer  # noqa: PLC0415
-
-        self._tokenizer = register_paligemma_loc_tokens(
-            AutoTokenizer.from_pretrained(self.tokenizer_name)
-        )
-        return self._tokenizer
-
-    # ------------------------------------------------------------------
-    # Pipeline step
-    # ------------------------------------------------------------------
-
-    def __call__(self, transition: EnvTransition) -> EnvTransition | None:
-        transition = transition.copy()
-        complementary = transition.get(TransitionKey.COMPLEMENTARY_DATA, {}) or {}
-        messages = complementary.get("messages") or []
-
-        if not messages:
-            # No recipe was rendered — caller will fall back to the
-            # plain Pi0.5 prompt path. We pass the transition through
-            # unmodified.
-            return transition
-
-        tokenizer = self._ensure_tokenizer()
-        # VQA coords are 0–1000 normalized (Qwen2.5-VL convention) — the
-        # <loc> conversion is camera-resolution-independent and needs no
-        # observation lookup here.
-        if _is_batched_messages(messages):
-            indices_iter = _sample_indices(complementary.get("index"), len(messages))
-            encoded = [
-                self._encode_messages(
-                    tokenizer,
-                    msg,
-                    list(streams),
-                    list(tgt_indices),
-                    complementary,
-                    sample_idx=int(s_idx) if s_idx is not None else None,
-                )
-                for msg, streams, tgt_indices, s_idx in zip(
-                    messages,
-                    complementary.get("message_streams") or [[] for _ in messages],
-                    complementary.get("target_message_indices") or [[] for _ in messages],
-                    indices_iter,
-                    strict=False,
-                )
-            ]
-        else:
-            sample_idx = _sample_indices(complementary.get("index"), 1)[0]
-            encoded = [
-                self._encode_messages(
-                    tokenizer,
-                    messages,
-                    list(complementary.get("message_streams") or []),
-                    list(complementary.get("target_message_indices") or []),
-                    complementary,
-                    sample_idx=sample_idx,
-                )
-            ]
-
-        obs = dict(transition.get(TransitionKey.OBSERVATION) or {})
-        obs[OBS_LANGUAGE_TOKENS] = torch.stack([ids for ids, _, _, _, _ in encoded])
-        obs[OBS_LANGUAGE_ATTENTION_MASK] = torch.stack([attn for _, attn, _, _, _ in encoded])
-        transition[TransitionKey.OBSERVATION] = obs
-
-        transition[TransitionKey.COMPLEMENTARY_DATA] = {
-            **complementary,
-            "text_labels": torch.stack([labels for _, _, labels, _, _ in encoded]),
-            "predict_actions": torch.stack([pred for _, _, _, pred, _ in encoded]),
-        }
-        return transition
-
-    def _encode_messages(
-        self,
-        tokenizer: Any,
-        messages: list[dict[str, Any]],
-        message_streams: list[str | None],
-        target_indices: list[int],
-        complementary: dict[str, Any],
-        sample_idx: int | None = None,
-    ) -> tuple[Tensor, Tensor, Tensor, Tensor, str]:
-        # Optional: drop non-target messages per the dropout config.
-        # Keeps the supervised-target indices stable by re-mapping
-        # after removal.
-        if (
-            self.plan_dropout_prob
-            or self.memory_dropout_prob
-            or self.subtask_dropout_prob
-            or self.interjection_dropout_prob
-        ):
-            messages, target_indices = self._apply_prompt_dropout(
-                messages,
-                target_indices,
-                complementary,
-                sample_idx=sample_idx,
-            )
-
-        # Rewrite bbox / keypoint VQA target answers from JSON to
-        # PaliGemma <loc> text. Coords are 0–1000 normalized so this is
-        # camera-independent.
-        messages = _messages_vqa_to_loc(messages, target_indices)
-
-        # Flatten ``say`` tool calls into ``<say>...</say>`` text before
-        # stripping, so the spoken reply is actually tokenized and
-        # supervised (PaliGemma's flat prompt has no structured calls).
-        messages = [_strip_blocks(_flatten_say_tool_calls(m)) for m in messages]
-        # Append EOS to supervised target turns so the LM head learns to
-        # stop (the span covers it → it becomes a supervised label).
-        prompt, spans = _format_messages(
-            messages, target_indices, getattr(tokenizer, "eos_token", None)
-        )
-
-        encoded = tokenizer(
-            prompt,
-            max_length=self.max_length,
-            padding=self.padding,
-            truncation=True,
-            return_tensors="pt",
-            return_offsets_mapping=True,
-            padding_side=self.padding_side,
-        )
-
-        input_ids = encoded["input_ids"][0]
-        attention_mask = encoded["attention_mask"][0].bool()
-        offsets = encoded["offset_mapping"][0]  # (seq, 2), char (start,end)
-
-        # Build label mask: -100 everywhere except over supervised
-        # target message char ranges.
-        labels = torch.full_like(input_ids, fill_value=-100)
-        for idx in target_indices:
-            if idx >= len(spans):
-                continue
-            char_start, char_end = spans[idx]
-            for token_pos in range(input_ids.shape[0]):
-                if not attention_mask[token_pos]:
-                    continue
-                tok_start, tok_end = int(offsets[token_pos, 0]), int(offsets[token_pos, 1])
-                if tok_end <= char_start or tok_start >= char_end:
-                    continue
-                labels[token_pos] = input_ids[token_pos]
-
-        # Scan ALL message streams (not just targets): the
-        # ``low_level_execution`` recipe drops ``target: true`` on
-        # the assistant to avoid trivial copy-from-user text-CE; the
-        # flow loss still needs to fire, gated by ``stream: low_level``.
-        predict_actions = torch.tensor(
-            bool(any(s == "low_level" for s in message_streams)),
-            dtype=torch.bool,
-        )
-        return input_ids, attention_mask, labels, predict_actions, prompt
-
-    # ------------------------------------------------------------------
-    # Per-component prompt dropout (Pi0.7 §V.E)
-    # ------------------------------------------------------------------
-
-    def _apply_prompt_dropout(
-        self,
-        messages: list[dict[str, Any]],
-        target_indices: list[int],
-        complementary: dict[str, Any],
-        sample_idx: int | None = None,
-    ) -> tuple[list[dict[str, Any]], list[int]]:
-        """Drop messages classified as plan/memory/subtask context.
-
-        Targets are *never* dropped (they're the supervised payload).
-        Re-maps target_indices to the new positions after drops.
-        """
-        import random  # noqa: PLC0415
-
-        seed = self.dropout_seed
-        if seed is None:
-            # Canonical row-index key set by ``BatchProcessor`` /
-            # ``render_messages_processor``. Falling back to other
-            # keys silently gave every sample seed=0 → identical
-            # dropout pattern across the whole epoch.
-            seed_src = sample_idx if sample_idx is not None else complementary.get("index", 0)
-            try:
-                if hasattr(seed_src, "item"):
-                    seed_src = seed_src.item()
-                seed = int(seed_src)
-            except (TypeError, ValueError):
-                seed = 0
-        rng = random.Random(seed)
-
-        keep_indices: list[int] = []
-        for idx, msg in enumerate(messages):
-            if idx in target_indices:
-                keep_indices.append(idx)
-                continue
-            kind = _classify_for_dropout(msg)
-            prob = {
-                "plan": self.plan_dropout_prob,
-                "memory": self.memory_dropout_prob,
-                "subtask": self.subtask_dropout_prob,
-                "interjection": self.interjection_dropout_prob,
-            }.get(kind, 0.0)
-            if prob > 0.0 and rng.random() < prob:
-                continue
-            keep_indices.append(idx)
-
-        # Build remap and apply
-        new_messages = [messages[i] for i in keep_indices]
-        old_to_new = {old: new for new, old in enumerate(keep_indices)}
-        new_targets = [old_to_new[t] for t in target_indices if t in old_to_new]
-        return new_messages, new_targets
-
-    def transform_features(
-        self, features: dict[PipelineFeatureType, dict[str, PolicyFeature]]
-    ) -> dict[PipelineFeatureType, dict[str, PolicyFeature]]:
-        return features
-
-
-def _classify_for_dropout(message: dict[str, Any]) -> str | None:
-    """Heuristic content-prefix classifier (plan / memory / subtask)."""
-    content = message.get("content")
-    if isinstance(content, list):
-        text_parts = [b.get("text", "") for b in content if isinstance(b, dict) and b.get("type") == "text"]
-        content = " ".join(text_parts)
-    elif content is None:
-        return None
-    elif not isinstance(content, str):
-        return None
-    s = content.strip()
-    if s.startswith("Plan:") or s.startswith("Previous plan"):
-        return "plan"
-    if s.startswith("Memory:") or s.startswith("Previous memory"):
-        return "memory"
-    if s.startswith("Current subtask") or s.startswith("Completed subtask"):
-        return "subtask"
-    return None
--- a/src/lerobot/policies/pi_gemma.py
+++ b/src/lerobot/policies/pi_gemma.py
@@ -275,8 +275,6 @@ class PiGemmaModel(GemmaModel):  # type: ignore[misc]
        # Convert to bfloat16 if the first layer uses bfloat16
        if len(self.layers) > 0 and self.layers[0].self_attn.q_proj.weight.dtype == torch.bfloat16:
            hidden_states = hidden_states.to(torch.bfloat16)
-        if causal_mask is not None and torch.is_floating_point(causal_mask):
-            causal_mask = causal_mask.to(dtype=hidden_states.dtype)

        # create position embeddings to be shared across the decoder layers
        position_embeddings = self.rotary_emb(hidden_states, position_ids)
--- a/src/lerobot/processor/batch_processor.py
+++ b/src/lerobot/processor/batch_processor.py
@@ -175,6 +175,9 @@ class AddBatchDimensionComplementaryDataStep(ComplementaryDataProcessorStep):
            if isinstance(task_index_value, Tensor) and task_index_value.dim() == 0:
                complementary_data["task_index"] = task_index_value.unsqueeze(0)

+        complementary_data.pop("language_persistent", None)
+        complementary_data.pop("language_events", None)
+
        if "messages" in complementary_data:
            messages = complementary_data["messages"]
            if isinstance(messages, list) and (not messages or isinstance(messages[0], dict)):
--- a/src/lerobot/processor/render_messages_processor.py
+++ b/src/lerobot/processor/render_messages_processor.py
@@ -52,9 +52,6 @@ class RenderMessagesStep(ProcessorStep):
        if not persistent and not events:
            return transition

-        if _is_batched_language(persistent) or _is_batched_language(events):
-            return self._call_batch(transition, complementary_data, persistent, events)
-
        timestamp = complementary_data.get("timestamp")
        if timestamp is None:
            raise KeyError("RenderMessagesStep requires sample timestamp in complementary data.")
@@ -70,131 +67,18 @@ class RenderMessagesStep(ProcessorStep):
            dataset_ctx=self.dataset_ctx,
        )
        if rendered is None:
-            rendered = _fallback_low_level_render(complementary_data.get("task"))
-            if rendered is None:
-                return None
+            return None

        new_transition = transition.copy()
-        new_complementary_data = dict(new_transition.get(TransitionKey.COMPLEMENTARY_DATA) or {})
+        new_complementary_data = dict(complementary_data)
        new_complementary_data.pop(LANGUAGE_PERSISTENT, None)
        new_complementary_data.pop(LANGUAGE_EVENTS, None)
        new_complementary_data.update(rendered)
        new_transition[TransitionKey.COMPLEMENTARY_DATA] = new_complementary_data
        return new_transition

-    def _call_batch(
-        self,
-        transition: EnvTransition,
-        complementary_data: dict[str, Any],
-        persistent_batch: list,
-        events_batch: list,
-    ) -> EnvTransition | None:
-        timestamp = complementary_data.get("timestamp")
-        if timestamp is None:
-            raise KeyError("RenderMessagesStep requires sample timestamp in complementary data.")
-
-        batch_size = max(len(persistent_batch), len(events_batch))
-        messages: list[list[dict[str, Any]]] = []
-        message_streams: list[list[str | None]] = []
-        target_message_indices: list[list[int]] = []
-        keep_indices: list[int] = []
-
-        for i in range(batch_size):
-            rendered = render_sample(
-                recipe=self.recipe,
-                persistent=persistent_batch[i] if i < len(persistent_batch) else [],
-                events=events_batch[i] if i < len(events_batch) else [],
-                t=_batch_value(timestamp, i),
-                sample_idx=int(_batch_value(complementary_data.get("index", 0), i)),
-                task=_batch_value(complementary_data.get("task"), i),
-                dataset_ctx=self.dataset_ctx,
-            )
-            if rendered is None:
-                rendered = _fallback_low_level_render(_batch_value(complementary_data.get("task"), i))
-                if rendered is None:
-                    continue
-            keep_indices.append(i)
-            messages.append(rendered["messages"])
-            message_streams.append(rendered["message_streams"])
-            target_message_indices.append(rendered["target_message_indices"])
-
-        if not messages:
-            return None
-
-        new_transition = (
-            _select_batch_indices(transition, keep_indices)
-            if len(keep_indices) != batch_size
-            else transition.copy()
-        )
-        new_complementary_data = dict(new_transition.get(TransitionKey.COMPLEMENTARY_DATA) or {})
-        new_complementary_data.pop(LANGUAGE_PERSISTENT, None)
-        new_complementary_data.pop(LANGUAGE_EVENTS, None)
-        new_complementary_data["messages"] = messages
-        new_complementary_data["message_streams"] = message_streams
-        new_complementary_data["target_message_indices"] = target_message_indices
-        new_transition[TransitionKey.COMPLEMENTARY_DATA] = new_complementary_data
-        return new_transition
-
    def transform_features(
        self, features: dict[PipelineFeatureType, dict[str, PolicyFeature]]
    ) -> dict[PipelineFeatureType, dict[str, PolicyFeature]]:
        """Pass features through unchanged; rendering only touches complementary data."""
        return features
-
-
-def _scalar(value: Any) -> float | int:
-    """Unwrap a tensor/array/single-element list into a Python scalar."""
-    if hasattr(value, "item"):
-        return value.item()
-    if isinstance(value, list):
-        if len(value) != 1:
-            raise ValueError(f"Expected a scalar, got list of length {len(value)}: {value!r}")
-        return _scalar(value[0])
-    return value
-
-
-def _is_batched_language(value: Any) -> bool:
-    return isinstance(value, list) and bool(value) and isinstance(value[0], list)
-
-
-def _batch_value(value: Any, index: int) -> Any:
-    if value is None:
-        return None
-    if isinstance(value, list):
-        return value[index]
-    if hasattr(value, "ndim") and getattr(value, "ndim") > 0:
-        return _scalar(value[index])
-    return _scalar(value)
-
-
-def _select_batch_indices(transition: EnvTransition, indices: list[int]) -> EnvTransition:
-    selected = transition.copy()
-    for key in (TransitionKey.OBSERVATION, TransitionKey.COMPLEMENTARY_DATA):
-        data = selected.get(key)
-        if isinstance(data, dict):
-            selected[key] = {k: _select_value(v, indices) for k, v in data.items()}
-    action = selected.get(TransitionKey.ACTION)
-    if action is not None:
-        selected[TransitionKey.ACTION] = _select_value(action, indices)
-    return selected
-
-
-def _select_value(value: Any, indices: list[int]) -> Any:
-    if isinstance(value, list) and len(value) >= len(indices):
-        return [value[i] for i in indices]
-    if hasattr(value, "index_select") and hasattr(value, "new_tensor") and getattr(value, "ndim", 0) > 0:
-        return value.index_select(0, value.new_tensor(indices).long())
-    return value
-
-
-def _fallback_low_level_render(task: Any) -> dict[str, Any] | None:
-    """Keep action-only samples trainable when no recipe branch matches."""
-    if hasattr(task, "item"):
-        task = task.item()
-    if not isinstance(task, str) or not task:
-        return None
-    return {
-        "messages": [{"role": "user", "content": task}],
-        "message_streams": ["low_level"],
-        "target_message_indices": [],
-    }
--- a/src/lerobot/processor/tokenizer_processor.py
+++ b/src/lerobot/processor/tokenizer_processor.py
@@ -32,7 +32,6 @@ import torch
 from lerobot.configs import FeatureType, PipelineFeatureType, PolicyFeature
 from lerobot.types import EnvTransition, RobotObservation, TransitionKey
 from lerobot.utils.constants import (
-    ACTION_CODE_TOKEN_MASK,
    ACTION_TOKEN_MASK,
    ACTION_TOKENS,
    OBS_LANGUAGE_ATTENTION_MASK,
@@ -413,15 +412,14 @@ class ActionTokenizerProcessorStep(ActionProcessorStep):
            # During inference, no action is available, skip tokenization
            return new_transition

-        # Tokenize and get masks for the full formatted sequence and the discrete action codes.
-        tokens, mask, code_mask = self._tokenize_action(action)
+        # Tokenize the action; returns the token ids and the matching padding mask
+        tokens, mask = self._tokenize_action(action)

        # Store mask in complementary data
        complementary_data = new_transition.get(TransitionKey.COMPLEMENTARY_DATA, {})
        if complementary_data is None:
            complementary_data = {}
        complementary_data[ACTION_TOKEN_MASK] = mask
-        complementary_data[ACTION_CODE_TOKEN_MASK] = code_mask
        complementary_data[ACTION_TOKENS] = tokens
        new_transition[TransitionKey.COMPLEMENTARY_DATA] = complementary_data
        return new_transition
@@ -432,7 +430,7 @@ class ActionTokenizerProcessorStep(ActionProcessorStep):
        """
        return self._paligemma_tokenizer.vocab_size - 1 - self.fast_skip_tokens - tokens

-    def _tokenize_action(self, action: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
+    def _tokenize_action(self, action: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Tokenizes the action tensor and creates a mask.

@@ -461,7 +459,6 @@ class ActionTokenizerProcessorStep(ActionProcessorStep):
        # The fast tokenizer expects action data and returns token IDs
        tokens_list = []
        masks_list = []
-        code_masks_list = []

        for i in range(batch_size):
            # Tokenize single action (move to CPU first as tokenizer uses scipy which requires numpy)
@@ -479,26 +476,19 @@ class ActionTokenizerProcessorStep(ActionProcessorStep):
            if tokens.dim() > 1:
                tokens = tokens.flatten()

-            action_code_tokens = self._act_tokens_to_paligemma_tokens(tokens)
            bos_id = self._paligemma_tokenizer.bos_token_id
-            prompt_tokens = torch.tensor(
-                self._paligemma_tokenizer.encode("Action: ", add_special_tokens=False),
-                device=action.device,
-            )
-            end_tokens = torch.tensor(self._paligemma_tokenizer.encode("|"), device=action.device)
-
-            code_start = 1 + len(prompt_tokens)
-            code_end = code_start + len(action_code_tokens)
+            # Prepend BOS and the "Action: " prompt, then append the action tokens and "|"
            tokens = torch.cat(
                [
                    torch.tensor([bos_id], device=action.device),
-                    prompt_tokens,
-                    action_code_tokens,
-                    end_tokens,
+                    torch.tensor(
+                        self._paligemma_tokenizer.encode("Action: ", add_special_tokens=False),
+                        device=action.device,
+                    ),
+                    self._act_tokens_to_paligemma_tokens(tokens),
+                    torch.tensor(self._paligemma_tokenizer.encode("|"), device=action.device),
                ]
            )
-            code_mask = torch.zeros(len(tokens), dtype=torch.bool, device=action.device)
-            code_mask[code_start:code_end] = True

            # Truncate or pad to max_action_tokens
            if len(tokens) > self.max_action_tokens:
@@ -507,49 +497,44 @@ class ActionTokenizerProcessorStep(ActionProcessorStep):
                    "Consider increasing the `max_action_tokens` in your model config if this happens frequently."
                )
                tokens = tokens[: self.max_action_tokens]
-                code_mask = code_mask[: self.max_action_tokens]
                mask = torch.ones(self.max_action_tokens, dtype=torch.bool, device=action.device)
            else:
-                pad_len = self.max_action_tokens - len(tokens)
                mask = torch.cat(
                    [
                        torch.ones(len(tokens), dtype=torch.bool, device=action.device),
-                        torch.zeros(pad_len, dtype=torch.bool, device=action.device),
+                        torch.zeros(
+                            self.max_action_tokens - len(tokens), dtype=torch.bool, device=action.device
+                        ),
                    ]
                )
-                code_mask = torch.nn.functional.pad(code_mask, (0, pad_len), value=False)
                # Pad tokens with zeros
-                tokens = torch.nn.functional.pad(tokens, (0, pad_len), value=0)
+                tokens = torch.nn.functional.pad(tokens, (0, self.max_action_tokens - len(tokens)), value=0)

            tokens_list.append(tokens)
            masks_list.append(mask)
-            code_masks_list.append(code_mask)

        # Stack into batched tensors
        tokens_batch = torch.stack(tokens_list, dim=0)  # (B, max_action_tokens)
        masks_batch = torch.stack(masks_list, dim=0)  # (B, max_action_tokens)
-        code_masks_batch = torch.stack(code_masks_list, dim=0)  # (B, max_action_tokens)

        # Remove batch dimension if input was single sample
        if single_sample:
            tokens_batch = tokens_batch.squeeze(0)
            masks_batch = masks_batch.squeeze(0)
-            code_masks_batch = code_masks_batch.squeeze(0)

        # Move to the same device as the input
        if device is not None:
            tokens_batch = tokens_batch.to(device)
            masks_batch = masks_batch.to(device)
-            code_masks_batch = code_masks_batch.to(device)

-        return tokens_batch, masks_batch, code_masks_batch
+        return tokens_batch, masks_batch

    def action(self, action: torch.Tensor) -> torch.Tensor:
        """
        This method is not used since we override __call__.
        Required by ActionProcessorStep ABC.
        """
-        tokens, _, _ = self._tokenize_action(action)
+        tokens, _ = self._tokenize_action(action)
        return tokens

    def get_config(self) -> dict[str, Any]:
--- a/src/lerobot/robots/utils.py
+++ b/src/lerobot/robots/utils.py
@@ -21,8 +21,6 @@ from lerobot.utils.import_utils import make_device_from_device_class
 from .config import RobotConfig
 from .robot import Robot

-logger = logging.getLogger(__name__)
-

 def make_robot_from_config(config: RobotConfig) -> Robot:
    # TODO(Steven): Consider just using the make_device_from_device_class for all types
@@ -120,7 +118,7 @@ def ensure_safe_goal_position(
            }

    if warnings_dict:
-        logger.warning(
+        logging.warning(
            "Relative goal position magnitude had to be clamped to be safe.\n"
            f"{pformat(warnings_dict, indent=4)}"
        )
--- a/src/lerobot/scripts/build_robocasa_composite_seen.py
+++ b/src/lerobot/scripts/build_robocasa_composite_seen.py
@@ -1,345 +0,0 @@
-#!/usr/bin/env python3
-"""Build a single combined LeRobotDataset from RoboCasa's 16 composite_seen tasks.
-
-RoboCasa 1.0 already ships in LeRobot format (parquet + mp4), distributed as
-``lerobot.tar`` archives from Box. This script:
-
-1. Downloads each composite_seen task's ``target/human`` archive via RoboCasa's
-   official ``download_datasets`` helper (idempotent — skipped if already on
-   disk).
-2. Opens each extracted directory as a ``LeRobotDataset``.
-3. Merges all 16 into one unified dataset via ``merge_datasets`` (a thin wrapper
-   over ``aggregate_datasets`` that revalidates fps / robot_type / features,
-   unifies task indices, concatenates videos and parquet, and recomputes stats).
-4. Optionally pushes the merged dataset to the Hub.
-
-The result is one ~8,000-trajectory dataset where each episode carries its
-source task as the ``task`` field — ready for downstream annotation
-(subtasks / memory / VQA / tool calls) without per-task bookkeeping.
-
-Usage::
-
-    uv run python -m lerobot.scripts.build_robocasa_composite_seen \\
-        --output-dir=/data/lerobot/robocasa_composite_seen \\
-        --hub-repo-id=${HF_USER}/robocasa_composite_seen \\
-        --push-to-hub
-
-Prereqs: ``robocasa`` and ``robosuite`` installed (see
-``docs/source/benchmarks/robocasa.mdx`` for the editable-install dance — they
-are not on PyPI and RoboCasa's own ``setup.py`` pins an old LeRobot version).
-
-The 16 composite_seen tasks are the multi-step subset of the official
-RoboCasa365 target benchmark — exactly the slice used to compute the
-``Composite-Seen`` column of the leaderboard.
-"""
-
-from __future__ import annotations
-
-import argparse
-import logging
-import sys
-from pathlib import Path
-
-from lerobot.datasets.dataset_tools import merge_datasets
-from lerobot.datasets.lerobot_dataset import LeRobotDataset
-
-logger = logging.getLogger(__name__)
-
-# Canonical 16 composite_seen tasks (RoboCasa365 target benchmark).
-# Order matches the leaderboard docs.
-COMPOSITE_SEEN_TASKS: list[str] = [
-    "DeliverStraw",
-    "GetToastedBread",
-    "KettleBoiling",
-    "LoadDishwasher",
-    "PackIdenticalLunches",
-    "PreSoakPan",
-    "PrepareCoffee",
-    "RinseSinkBasin",
-    "ScrubCuttingBoard",
-    "SearingMeat",
-    "SetUpCuttingStation",
-    "StackBowlsCabinet",
-    "SteamInMicrowave",
-    "StirVegetables",
-    "StoreLeftoversInBowl",
-    "WashLettuce",
-]
-
-
-def _require_robocasa() -> None:
-    """Fail fast with an actionable message if robocasa is missing.
-
-    RoboCasa is not on PyPI and is not a LeRobot extra — see the installation
-    notes in ``docs/source/benchmarks/robocasa.mdx``.
-    """
-    try:
-        import robocasa  # noqa: F401, PLC0415
-        from robocasa.scripts import download_datasets as _dl  # noqa: F401, PLC0415
-        from robocasa.utils import dataset_registry as _reg  # noqa: F401, PLC0415
-    except ImportError as exc:
-        sys.exit(
-            "[build_robocasa_composite_seen] robocasa is not importable.\n"
-            "Install it (and robosuite) per the LeRobot RoboCasa docs:\n"
-            "    git clone https://github.com/robocasa/robocasa.git ~/robocasa\n"
-            "    git clone https://github.com/ARISE-Initiative/robosuite.git ~/robosuite\n"
-            "    pip install -e ~/robocasa --no-deps\n"
-            "    pip install -e ~/robosuite\n"
-            f"(original error: {exc})"
-        )
-
-
-def _resolve_task_root(task: str) -> Path:
-    """Resolve the local extracted ``LeRobotDataset`` root for a target/human task.
-
-    Uses RoboCasa's own ``dataset_registry`` so we follow whatever directory
-    layout RoboCasa picks (currently ``v1.0/target/composite/<task>/<date>/``
-    under ``robocasa.macros.DATASET_BASE_DIR``). Falls back to discovering the
-    extracted directory if the helper's signature drifted between releases.
-    """
-    from robocasa.utils import dataset_registry  # noqa: PLC0415
-
-    # ``get_ds_path`` is the canonical helper. RoboCasa 1.0 signature is
-    # ``get_ds_path(task, ds_type, return_info=False)`` with ``ds_type`` like
-    # ``"human_im"`` (image-observation human demos). We try the common
-    # ``split=`` kwarg first (newer registry); if it's rejected, fall back.
-    try:
-        ds_path = dataset_registry.get_ds_path(
-            task=task,
-            ds_type="human_im",
-            return_info=False,
-            split="target",
-        )
-    except TypeError:
-        # Older registry — ds_type alone disambiguates target/human.
-        ds_path = dataset_registry.get_ds_path(
-            task=task,
-            ds_type="human_im",
-            return_info=False,
-        )
-
-    root = Path(ds_path)
-    # ``get_ds_path`` may return either the extracted dir or the .tar; normalize.
-    if root.suffix == ".tar":
-        root = root.parent
-    return root
-
-
-def _download_task(task: str, *, overwrite: bool = False) -> Path:
-    """Download (or locate) a single target/human task and return its extracted root."""
-    from robocasa.scripts import download_datasets as dl  # noqa: PLC0415
-
-    # Try the documented programmatic API. The CLI is
-    #   python -m robocasa.scripts.download_datasets --tasks <T> --source human --split target
-    # which is a thin wrapper over a function of the same name.
-    if hasattr(dl, "download_datasets"):
-        try:
-            dl.download_datasets(
-                tasks=[task],
-                source="human",
-                split="target",
-                overwrite=overwrite,
-            )
-        except TypeError:
-            # Older signature — drop the kwargs RoboCasa didn't have yet.
-            dl.download_datasets(tasks=[task])
-    else:
-        # No public function — shell out to the CLI as a last resort. This
-        # guarantees we use whatever entrypoint RoboCasa's authors maintain.
-        import subprocess  # noqa: PLC0415
-
-        cmd = [
-            sys.executable,
-            "-m",
-            "robocasa.scripts.download_datasets",
-            "--tasks",
-            task,
-            "--source",
-            "human",
-            "--split",
-            "target",
-        ]
-        if overwrite:
-            cmd.append("--overwrite")
-        subprocess.run(cmd, check=True)
-
-    root = _resolve_task_root(task)
-    if not root.exists():
-        raise RuntimeError(
-            f"Expected {root} after download, but it doesn't exist. "
-            "RoboCasa may have changed its data layout — verify with "
-            "`robocasa.utils.dataset_registry.get_ds_path()`."
-        )
-    return root
-
-
-def _open_as_lerobot_dataset(task: str, root: Path) -> LeRobotDataset:
-    """Open an extracted RoboCasa target/human task as a ``LeRobotDataset``.
-
-    The placeholder ``repo_id`` (``robocasa/<task>_target_human``) is only used
-    by the aggregator for logging and for the unified task table — the actual
-    data is loaded from ``root``.
-    """
-    repo_id = f"robocasa/{task}_target_human"
-    return LeRobotDataset(repo_id=repo_id, root=root)
-
-
-def parse_args() -> argparse.Namespace:
-    parser = argparse.ArgumentParser(
-        description="Aggregate the 16 RoboCasa composite_seen target tasks into one LeRobotDataset.",
-        formatter_class=argparse.RawDescriptionHelpFormatter,
-        epilog=__doc__,
-    )
-    parser.add_argument(
-        "--output-dir",
-        type=Path,
-        required=True,
-        help="Local directory for the merged dataset (will be created).",
-    )
-    parser.add_argument(
-        "--hub-repo-id",
-        type=str,
-        default=None,
-        help=(
-            "Hub repo_id for the merged dataset (e.g. ``yourname/"
-            "robocasa_composite_seen``). Required for ``--push-to-hub``; also "
-            "becomes the merged dataset's canonical ``repo_id``."
-        ),
-    )
-    parser.add_argument(
-        "--push-to-hub",
-        action="store_true",
-        help="Push the merged dataset to the Hub after building. Requires "
-        "``--hub-repo-id`` and a prior ``huggingface-cli login``.",
-    )
-    parser.add_argument(
-        "--private",
-        action="store_true",
-        help="When pushing, create the Hub repo as private.",
-    )
-    parser.add_argument(
-        "--tasks",
-        type=str,
-        default=None,
-        help="Comma-separated task names to override the default 16 "
-        "composite_seen list (useful for smoke-testing with 1–2 tasks).",
-    )
-    parser.add_argument(
-        "--skip-download",
-        action="store_true",
-        help="Skip the download step entirely; assume each task is already "
-        "extracted on disk at the path ``dataset_registry.get_ds_path`` "
-        "returns.",
-    )
-    parser.add_argument(
-        "--overwrite-download",
-        action="store_true",
-        help="Force re-download even when a complete local extraction exists.",
-    )
-    parser.add_argument(
-        "--log-level",
-        type=str,
-        default="INFO",
-        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
-    )
-    return parser.parse_args()
-
-
-def main() -> int:
-    args = parse_args()
-    logging.basicConfig(
-        level=getattr(logging, args.log_level),
-        format="[%(levelname)s] %(message)s",
-    )
-
-    tasks = (
-        [t.strip() for t in args.tasks.split(",") if t.strip()]
-        if args.tasks
-        else list(COMPOSITE_SEEN_TASKS)
-    )
-    if not tasks:
-        sys.exit("No tasks selected.")
-
-    if args.push_to_hub and not args.hub_repo_id:
-        sys.exit("--push-to-hub requires --hub-repo-id.")
-
-    output_repo_id = args.hub_repo_id or "local/robocasa_composite_seen"
-    logger.info(
-        "Building merged RoboCasa dataset: %d tasks → %s (output dir: %s)",
-        len(tasks),
-        output_repo_id,
-        args.output_dir,
-    )
-
-    _require_robocasa()
-
-    # 1. Download (or locate) each task's extracted directory.
-    task_roots: list[tuple[str, Path]] = []
-    for i, task in enumerate(tasks, 1):
-        logger.info("[%d/%d] %s", i, len(tasks), task)
-        if args.skip_download:
-            root = _resolve_task_root(task)
-            if not root.exists():
-                sys.exit(
-                    f"--skip-download set but extracted directory does not "
-                    f"exist for {task}: {root}"
-                )
-        else:
-            root = _download_task(task, overwrite=args.overwrite_download)
-        logger.info("  extracted at: %s", root)
-        task_roots.append((task, root))
-
-    # 2. Open each as a LeRobotDataset (validation happens inside aggregator).
-    datasets: list[LeRobotDataset] = []
-    for task, root in task_roots:
-        logger.info("Opening %s", task)
-        ds = _open_as_lerobot_dataset(task, root)
-        logger.info(
-            "  %s: %d episodes, %d frames, %d FPS",
-            task,
-            ds.num_episodes,
-            ds.num_frames,
-            ds.fps,
-        )
-        datasets.append(ds)
-
-    # 3. Merge — re-validates features/fps/robot_type, unifies tasks, concats
-    #    videos + parquet, recomputes stats.
-    logger.info("Merging %d datasets into %s", len(datasets), output_repo_id)
-    merged = merge_datasets(
-        datasets=datasets,
-        output_repo_id=output_repo_id,
-        output_dir=args.output_dir,
-    )
-    logger.info(
-        "Merged: %d episodes, %d frames across %d unique task strings",
-        merged.num_episodes,
-        merged.num_frames,
-        len(merged.meta.tasks) if merged.meta.tasks is not None else 0,
-    )
-
-    # 4. Push to Hub.
-    if args.push_to_hub:
-        logger.info("Pushing %s to the Hub (private=%s)", args.hub_repo_id, args.private)
-        # ``upload_large_folder=True`` is the right mode for tens-of-GB
-        # datasets — uses multipart uploads + resumable transfers.
-        merged.push_to_hub(
-            private=args.private,
-            upload_large_folder=True,
-            tags=["lerobot", "robocasa", "composite_seen", "manipulation"],
-        )
-        logger.info(
-            "Push complete: https://huggingface.co/datasets/%s",
-            args.hub_repo_id,
-        )
-    else:
-        logger.info(
-            "Skipping Hub push (no --push-to-hub). Merged dataset is at %s.",
-            args.output_dir,
-        )
-
-    return 0
-
-
-if __name__ == "__main__":
-    raise SystemExit(main())
--- a/src/lerobot/scripts/lerobot_annotate.py
+++ b/src/lerobot/scripts/lerobot_annotate.py
@@ -1,205 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""``lerobot-annotate`` — populate ``language_persistent`` and
-``language_events`` columns on a LeRobot dataset.
-
-Annotations live directly in ``data/chunk-*/file-*.parquet``.
-
-Example:
-
-  uv run lerobot-annotate \\
-      --root=/path/to/dataset \\
-      --vlm.model_id=Qwen/Qwen2.5-VL-7B-Instruct
-
-For distributed runs, see ``examples/annotations/run_hf_job.py``.
-"""
-
-import logging
-from pathlib import Path
-
-from lerobot.annotations.steerable_pipeline.config import AnnotationPipelineConfig
-from lerobot.annotations.steerable_pipeline.executor import Executor
-from lerobot.annotations.steerable_pipeline.frames import make_frame_provider
-from lerobot.annotations.steerable_pipeline.modules import (
-    GeneralVqaModule,
-    InterjectionsAndSpeechModule,
-    PlanSubtasksMemoryModule,
-)
-from lerobot.annotations.steerable_pipeline.validator import StagingValidator
-from lerobot.annotations.steerable_pipeline.vlm_client import make_vlm_client
-from lerobot.annotations.steerable_pipeline.vocabulary import VocabularyDiscoveryModule
-from lerobot.annotations.steerable_pipeline.writer import LanguageColumnsWriter
-from lerobot.configs import parser
-
-logger = logging.getLogger(__name__)
-
-
-def _resolve_root(cfg: AnnotationPipelineConfig) -> Path:
-    if cfg.root is not None:
-        return Path(cfg.root)
-    if cfg.repo_id is not None:
-        from huggingface_hub import snapshot_download
-
-        return Path(snapshot_download(repo_id=cfg.repo_id, repo_type="dataset"))
-    raise ValueError("Either --root or --repo_id must be provided.")
-
-
-@parser.wrap()
-def annotate(cfg: AnnotationPipelineConfig) -> None:
-    """Run the steerable annotation pipeline against a dataset."""
-    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
-    root = _resolve_root(cfg)
-    logger.info("annotate: root=%s", root)
-
-    vlm = make_vlm_client(cfg.vlm)
-    frame_provider = make_frame_provider(
-        root, camera_key=cfg.vlm.camera_key, video_backend=cfg.video_backend
-    )
-    # Surface the resolved cameras up front so a silent vqa-module no-op
-    # is obvious in job output rather than discovered post-hoc by counting
-    # parquet rows.
-    cam_keys = list(getattr(frame_provider, "camera_keys", []) or [])
-    logger.info(
-        "annotate: frame_provider default camera=%r, all cameras=%s",
-        getattr(frame_provider, "camera_key", None),
-        cam_keys,
-    )
-    if cfg.vqa.enabled and not cam_keys:
-        logger.warning(
-            "annotate: the vqa module is enabled but no cameras were "
-            "resolved — it will produce zero VQA rows. Check "
-            "meta/info.json for observation.images.* features, or pass "
-            "--vlm.camera_key=<key> to seed the cameras list."
-        )
-    plan = PlanSubtasksMemoryModule(vlm=vlm, config=cfg.plan, frame_provider=frame_provider)
-    interjections = InterjectionsAndSpeechModule(
-        vlm=vlm, config=cfg.interjections, seed=cfg.seed, frame_provider=frame_provider
-    )
-    vqa = GeneralVqaModule(vlm=vlm, config=cfg.vqa, seed=cfg.seed, frame_provider=frame_provider)
-    vocabulary = VocabularyDiscoveryModule(
-        vlm=vlm, config=cfg.vocabulary, frame_provider=frame_provider
-    )
-    writer = LanguageColumnsWriter()
-    validator = StagingValidator(
-        dataset_camera_keys=tuple(getattr(frame_provider, "camera_keys", []) or []) or None,
-    )
-
-    executor = Executor(
-        config=cfg,
-        plan=plan,
-        interjections=interjections,
-        vqa=vqa,
-        vocabulary=vocabulary,
-        writer=writer,
-        validator=validator,
-    )
-    summary = executor.run(root)
-    logger.info("annotate: wrote %d shard(s)", len(summary.written_paths))
-    for phase in summary.phases:
-        logger.info(
-            "annotate: phase=%s processed=%d skipped=%d",
-            phase.name,
-            phase.episodes_processed,
-            phase.episodes_skipped,
-        )
-    if summary.validation_report.warnings:
-        for w in summary.validation_report.warnings:
-            logger.warning(w)
-
-    if cfg.push_to_hub:
-        if cfg.repo_id is None and cfg.dest_repo_id is None:
-            raise ValueError(
-                "--push_to_hub requires --repo_id or --dest_repo_id (the dataset repo to push to)."
-            )
-        _push_to_hub(root, cfg)
-
-
-def _push_to_hub(root: Path, cfg: AnnotationPipelineConfig) -> None:
-    """Upload the annotated dataset directory to the Hub.
-
-    Pushes to ``cfg.dest_repo_id`` when set, otherwise back to ``cfg.repo_id``.
-    """
-    from huggingface_hub import HfApi  # noqa: PLC0415
-
-    repo_id = cfg.dest_repo_id or cfg.repo_id
-    commit_message = cfg.push_commit_message or "Add steerable annotations (lerobot-annotate)"
-    api = HfApi()
-    print(f"[lerobot-annotate] creating/locating dataset repo {repo_id}...", flush=True)
-    api.create_repo(
-        repo_id=repo_id,
-        repo_type="dataset",
-        private=cfg.push_private,
-        exist_ok=True,
-    )
-    print(f"[lerobot-annotate] uploading {root} -> {repo_id}...", flush=True)
-    commit_info = api.upload_folder(
-        folder_path=str(root),
-        repo_id=repo_id,
-        repo_type="dataset",
-        commit_message=commit_message,
-        ignore_patterns=[".annotate_staging/**", "**/.DS_Store"],
-    )
-    print(f"[lerobot-annotate] uploaded to https://huggingface.co/datasets/{repo_id}", flush=True)
-
-    # Tag the upload with the codebase version. ``LeRobotDatasetMetadata``
-    # resolves the dataset revision via ``get_safe_version`` which scans
-    # for tags like ``v3.0``; without a tag it raises
-    # ``RevisionNotFoundError``. Read the version straight from the
-    # dataset's own ``meta/info.json`` so we tag whatever the writer
-    # actually wrote (no accidental drift if the codebase floor moves).
-    from lerobot.datasets.dataset_metadata import CODEBASE_VERSION  # noqa: PLC0415
-
-    info_path = root / "meta" / "info.json"
-    version_tag = CODEBASE_VERSION
-    if info_path.exists():
-        try:
-            from lerobot.utils.io_utils import load_json  # noqa: PLC0415
-
-            info = load_json(info_path)
-            ds_version = info.get("codebase_version")
-            if isinstance(ds_version, str) and ds_version.startswith("v"):
-                version_tag = ds_version
-        except Exception as exc:  # noqa: BLE001
-            print(f"[lerobot-annotate] could not read codebase_version from info.json ({exc}); falling back to {version_tag}", flush=True)
-    revision = getattr(commit_info, "oid", None)
-    tag_kwargs = {
-        "repo_id": repo_id,
-        "tag": version_tag,
-        "repo_type": "dataset",
-        "exist_ok": True,
-    }
-    if revision is not None:
-        tag_kwargs["revision"] = revision
-
-    try:
-        api.create_tag(**tag_kwargs)
-        print(f"[lerobot-annotate] tagged {repo_id} as {version_tag}", flush=True)
-    except Exception as exc:  # noqa: BLE001
-        print(
-            f"[lerobot-annotate] WARNING: could not create tag {version_tag!r} on {repo_id}: {exc}. "
-            "Dataset is uploaded but ``LeRobotDataset`` won't be able to load it until it's tagged. "
-            "Run: from huggingface_hub import HfApi; "
-            f"HfApi().create_tag({repo_id!r}, tag={version_tag!r}, repo_type='dataset', exist_ok=True)",
-            flush=True,
-        )
-
-
-def main() -> None:
-    annotate()
-
-
-if __name__ == "__main__":
-    main()
--- a/src/lerobot/scripts/lerobot_pi052_runtime.py
+++ b/src/lerobot/scripts/lerobot_pi052_runtime.py
--- a/src/lerobot/scripts/lerobot_train.py
+++ b/src/lerobot/scripts/lerobot_train.py
@@ -20,7 +20,6 @@ Requires: pip install 'lerobot[training]'  (includes dataset + accelerate + wand

 import dataclasses
 import logging
-import os
 import time
 from contextlib import nullcontext
 from pprint import pformat
@@ -44,7 +43,7 @@ from lerobot.common.train_utils import (
 from lerobot.common.wandb_utils import WandBLogger
 from lerobot.configs import parser
 from lerobot.configs.train import TrainPipelineConfig
-from lerobot.datasets import EpisodeAwareSampler, WeightedEpisodeAwareSampler, make_dataset
+from lerobot.datasets import EpisodeAwareSampler, make_dataset
 from lerobot.envs import close_envs, make_env, make_env_pre_post_processors
 from lerobot.optim.factory import make_optimizer_and_scheduler
 from lerobot.policies import PreTrainedPolicy, make_policy, make_pre_post_processors
@@ -162,196 +161,6 @@ def update_policy(
    return train_metrics, output_dict


-def _print_debug_text_predictions(
-    policy: Any, batch: dict[str, Any], step: int, n_samples: int = 5
-) -> None:
-    """Forward the current batch and print head-argmax vs label per supervised position.
-
-    Opt-in via ``LEROBOT_DEBUG_PREDS_EVERY=<step_interval>``. Only the
-    policy types that expose ``debug_text_predictions`` participate
-    (currently PI052); others are silently skipped. Pretty-prints up to
-    ``n_samples`` samples from the current batch, showing the prompt,
-    every supervised position's (label, prediction, ✓/✗), and a
-    per-sample token-accuracy summary — the cheapest "is text training
-    actually learning anything" signal.
-    """
-    # Accelerator/DDP wraps the policy in a ``module`` attribute and
-    # doesn't proxy custom methods through, so a naive
-    # ``hasattr(policy, "debug_text_predictions")`` returns False on the
-    # wrapper — and the helper would silently no-op. Walk through any
-    # ``.module`` indirection (DDP, FSDP, ``accelerator.prepare`` wrappers)
-    # to reach the raw policy that actually defines the method.
-    inner = policy
-    while hasattr(inner, "module") and not hasattr(inner, "debug_text_predictions"):
-        inner = inner.module
-    if not hasattr(inner, "debug_text_predictions"):
-        logging.warning(
-            "LEROBOT_DEBUG_PREDS_EVERY set but policy %s has no "
-            "debug_text_predictions method — skipping dump.",
-            type(inner).__name__,
-        )
-        return
-    try:
-        debug = inner.debug_text_predictions(batch, max_samples=n_samples)
-    except Exception as exc:  # noqa: BLE001
-        logging.warning("debug_text_predictions failed: %s", exc, exc_info=True)
-        return
-    if not debug:
-        logging.warning(
-            "debug_text_predictions returned no supervised samples — "
-            "current batch has no text labels."
-        )
-        return
-    policy = inner  # used below for select_message-style decoding parity
-
-    # Build a tokenizer for decoding — match training side exactly.
-    try:
-        from transformers import AutoTokenizer  # noqa: PLC0415
-
-        from lerobot.policies.pi052.text_processor_pi052 import (  # noqa: PLC0415
-            register_paligemma_loc_tokens,
-        )
-
-        tok_name = (
-            getattr(policy.config, "tokenizer_name", None) or "google/paligemma-3b-pt-224"
-        )
-        tokenizer = register_paligemma_loc_tokens(AutoTokenizer.from_pretrained(tok_name))
-    except Exception as exc:  # noqa: BLE001
-        logging.warning("debug preds: tokenizer load failed: %s", exc)
-        return
-
-    ids = debug["input_ids"]
-    labels = debug["labels"]
-    preds = debug["predictions"]
-    attn = debug["attention_mask"]
-    inference = debug.get("inference") or []
-
-    n = ids.shape[0]
-    print(
-        f"\n========== STEP {step} DEBUG PREDICTIONS ({n} samples) ==========",
-        flush=True,
-    )
-    for s in range(n):
-        a = attn[s].tolist()
-        real = sum(a)
-        sid = ids[s].tolist()
-        sl = labels[s].tolist()
-        sp = preds[s].tolist()
-        prompt = tokenizer.decode(sid[:real], skip_special_tokens=False)
-        print(f"\n  --- sample {s + 1}/{n} ---", flush=True)
-        print(f"  prompt: {prompt!r}", flush=True)
-
-        # Ground-truth target (the contiguous supervised label span).
-        sup_ids = [int(sid[i]) for i in range(real) if sl[i] != -100]
-        if sup_ids:
-            print(
-                f"  target  (ground truth)        : {tokenizer.decode(sup_ids, skip_special_tokens=False)!r}",
-                flush=True,
-            )
-
-        # Training-side teacher-forced argmax on the same prompt+target.
-        n_sup = n_ok = 0
-        first_sup_pred: int | None = None
-        teacher_chars: list[int] = []
-        for i in range(1, real):
-            label = sl[i]
-            if label == -100:
-                continue
-            n_sup += 1
-            pred = int(sp[i - 1])
-            if first_sup_pred is None:
-                first_sup_pred = pred
-            teacher_chars.append(pred)
-            if label == pred:
-                n_ok += 1
-        teacher_text = (
-            tokenizer.decode(teacher_chars, skip_special_tokens=False) if teacher_chars else ""
-        )
-        acc = n_ok / max(n_sup, 1)
-        print(
-            f"  training argmax (teacher-fed) : {teacher_text!r}   acc={n_ok}/{n_sup}={acc:.1%}",
-            flush=True,
-        )
-
-        # Inference-side autoregressive output from the same prompt prefix.
-        inf_entry = inference[s] if s < len(inference) else None
-        if inf_entry:
-            inf_decoded = inf_entry.get("decoded", "")
-            print(f"  inference (autoregressive)    : {inf_decoded!r}", flush=True)
-            # First-token parity: training-side argmax at the prompt-end
-            # position MUST equal inference's first generated token —
-            # both compute argmax(lm_head(h_last_prompt)) on identical
-            # context. Any divergence signals a training↔inference bug.
-            if first_sup_pred is not None and inf_decoded and not inf_decoded.startswith("<inference"):
-                inf_ids = tokenizer(inf_decoded, add_special_tokens=False)["input_ids"]
-                if inf_ids:
-                    inf_first = int(inf_ids[0])
-                    match = inf_first == first_sup_pred
-                    print(
-                        f"  first-token parity            : "
-                        f"train={first_sup_pred} ({tokenizer.decode([first_sup_pred])!r}) "
-                        f"vs infer={inf_first} ({tokenizer.decode([inf_first])!r})  "
-                        f"{'✓ MATCH' if match else '✗ DIVERGED — training/inference mismatch'}",
-                        flush=True,
-                    )
-    print("=" * 60 + "\n", flush=True)
-
-
-def _build_vqa_oversample_weights(dataset: Any, target_fraction: float) -> "torch.Tensor | None":
-    """Build per-frame sampling weights that oversample VQA-annotated frames.
-
-    Scans the dataset's ``language_events`` column for frames carrying a
-    ``vqa``-style annotation and returns a weight tensor (length == total
-    dataset frames) such that, under multinomial sampling, VQA frames make up
-    roughly ``target_fraction`` of the training stream.
-
-    Returns ``None`` (⇒ fall back to uniform episode-aware sampling) when VQA
-    frames cannot be detected or there are none.
-    """
-    if not 0.0 < target_fraction < 1.0:
-        logging.warning(
-            "vqa_target_fraction must be in (0, 1); got %s — VQA oversampling disabled.",
-            target_fraction,
-        )
-        return None
-    hf = getattr(dataset, "hf_dataset", None)
-    if hf is None or "language_events" not in getattr(hf, "column_names", []):
-        logging.warning(
-            "Dataset has no `language_events` column — VQA oversampling disabled."
-        )
-        return None
-
-    events_col = hf["language_events"]
-    n_frames = len(events_col)
-    is_vqa = torch.zeros(n_frames, dtype=torch.bool)
-    for i, rows in enumerate(events_col):
-        if rows and any((row or {}).get("style") == "vqa" for row in rows):
-            is_vqa[i] = True
-
-    n_vqa = int(is_vqa.sum())
-    if n_vqa == 0:
-        logging.warning("No `vqa` annotations found in the dataset — VQA oversampling disabled.")
-        return None
-    n_other = n_frames - n_vqa
-
-    # Solve target = (n_vqa·w) / (n_vqa·w + n_other) for the VQA weight w.
-    # Clamp to ≥ 1 so VQA frames are never *down*-weighted below uniform.
-    weight = (target_fraction * n_other) / ((1.0 - target_fraction) * max(n_vqa, 1))
-    weight = max(weight, 1.0)
-    weights = torch.ones(n_frames, dtype=torch.double)
-    weights[is_vqa] = weight
-    logging.info(
-        "VQA oversampling: %d/%d frames carry a `vqa` annotation (%.2f%%); "
-        "weighting them x%.2f to target ~%.0f%% of the training stream.",
-        n_vqa,
-        n_frames,
-        100.0 * n_vqa / n_frames,
-        weight,
-        100.0 * target_fraction,
-    )
-    return weights
-
-
@parser.wrap()
 def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
    """
@@ -483,17 +292,6 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):

    active_cfg = cfg.trainable_config
    processor_pretrained_path = active_cfg.pretrained_path
-    # pi052: even when loading pretrained weights, build the processors
-    # from the current pi052 config so the recipe text-label and FAST
-    # action-label steps are generated and not silently swapped for the
-    # checkpoint's older processor stack.
-    if cfg.policy.type == "pi052" and processor_pretrained_path is not None and not cfg.resume:
-        logging.warning(
-            "pi052 is loading pretrained weights from %s, but building processors from the current "
-            "pi052 config so recipe text labels and FAST action labels are generated.",
-            processor_pretrained_path,
-        )
-        processor_pretrained_path = None
    if (
        getattr(active_cfg, "use_relative_actions", False)
        and processor_pretrained_path is not None
@@ -513,14 +311,6 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
    if cfg.is_reward_model_training:
        processor_kwargs["dataset_meta"] = dataset.meta

-    # For pi052 (and any future policy that auto-fits part of its
-    # preprocessing per-dataset), pass the dataset repo id so the
-    # processor factory can locate/refresh dataset-specific artifacts
-    # (e.g. fitted FAST tokenizers per Pertsch et al. 2025 [64],
-    # π0.5 §III.C).
-    if cfg.policy.type == "pi052":
-        processor_kwargs["dataset_repo_id"] = cfg.dataset.repo_id
-
    if not cfg.is_reward_model_training and processor_pretrained_path is not None:
        processor_kwargs["preprocessor_overrides"] = {
            "device_processor": {"device": device.type},
@@ -601,29 +391,13 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
    # create dataloader for offline training
    if hasattr(active_cfg, "drop_n_last_frames"):
        shuffle = False
-        from_indices = dataset.meta.episodes["dataset_from_index"]
-        to_indices = dataset.meta.episodes["dataset_to_index"]
-        # When `vqa_target_fraction` is set, oversample VQA-annotated
-        # frames via a weighted sampler; otherwise plain episode-aware.
-        vqa_weights = None
-        if cfg.vqa_target_fraction is not None and not cfg.dataset.streaming:
-            vqa_weights = _build_vqa_oversample_weights(dataset, cfg.vqa_target_fraction)
-        if vqa_weights is not None:
-            sampler = WeightedEpisodeAwareSampler(
-                from_indices,
-                to_indices,
-                vqa_weights,
-                episode_indices_to_use=dataset.episodes,
-                drop_n_last_frames=active_cfg.drop_n_last_frames,
-            )
-        else:
-            sampler = EpisodeAwareSampler(
-                from_indices,
-                to_indices,
-                episode_indices_to_use=dataset.episodes,
-                drop_n_last_frames=active_cfg.drop_n_last_frames,
-                shuffle=True,
-            )
+        sampler = EpisodeAwareSampler(
+            dataset.meta.episodes["dataset_from_index"],
+            dataset.meta.episodes["dataset_to_index"],
+            episode_indices_to_use=dataset.episodes,
+            drop_n_last_frames=active_cfg.drop_n_last_frames,
+            shuffle=True,
+        )
    else:
        shuffle = True
        sampler = None
@@ -654,54 +428,6 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):

    policy.train()

-    # ------------------------------------------------------------------
-    # EMA setup
-    # ------------------------------------------------------------------
-    # Shadow copy of the trainable params for late-training averaging
-    # (Chi et al. 2023 Diffusion Policy §V.D; openpi JAX trainer ships
-    # this with decay=0.999 for pi05_libero; openpi PyTorch port and
-    # LeRobot main both skip it). Off by default; opt in with
-    # ``--ema.enable=true``. Implemented via ema-pytorch
-    # (https://github.com/lucidrains/ema-pytorch) — the standard PyTorch
-    # EMA library, also used by lucidrains' diffusion repos.
-    ema = None
-    if cfg.ema.enable and is_main_process:
-        from ema_pytorch import EMA  # noqa: PLC0415
-
-        ema = EMA(
-            accelerator.unwrap_model(policy),
-            beta=cfg.ema.decay,
-            update_after_step=cfg.ema.warmup_steps,
-            update_every=1,  # update on every ema.update() call
-            # Don't register the live model as an ema submodule — accelerator
-            # already owns its lifecycle, and double-registration would
-            # double-count its params in ``ema.state_dict()``.
-            include_online_model=False,
-        )
-        ema.to(accelerator.device)
-        logging.info(
-            "EMA enabled (ema-pytorch): beta=%g, update_after_step=%d, "
-            "use_for_eval=%s, use_for_wandb_examples=%s",
-            cfg.ema.decay,
-            cfg.ema.warmup_steps,
-            cfg.ema.use_for_eval,
-            cfg.ema.use_for_wandb_examples,
-        )
-
-        # Resume the EMA shadow if a previous run wrote one.
-        if cfg.checkpoint_path is not None:
-            ema_path = cfg.checkpoint_path / "training_state" / "ema_state.pt"
-            if ema_path.exists():
-                logging.info("Resuming EMA shadow from %s", ema_path)
-                try:
-                    ema.load_state_dict(torch.load(ema_path, map_location=accelerator.device))
-                except Exception as exc:  # noqa: BLE001
-                    logging.warning(
-                        "Failed to load EMA shadow (%s) — restarting EMA from "
-                        "current live weights",
-                        exc,
-                    )
-
    train_metrics = {
        "loss": AverageMeter("loss", ":.3f"),
        "grad_norm": AverageMeter("grdn", ":.3f"),
@@ -754,14 +480,6 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
            sample_weighter=sample_weighter,
        )

-        # EMA update: pull one step of the live weights into the shadow.
-        # Runs only on the main process (the shadow lives there); other
-        # ranks rely on the live model staying in sync via accelerator.
-        # ``ema-pytorch`` holds an internal reference to the online model
-        # (set at construction), so ``ema.update()`` takes no args.
-        if ema is not None:
-            ema.update()
-
        # Note: eval and checkpoint happens *after* the `step`th training update has completed, so we
        # increment `step` here.
        step += 1
@@ -772,27 +490,6 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
        is_saving_step = step % cfg.save_freq == 0 or step == cfg.steps
        is_eval_step = cfg.eval_freq > 0 and step % cfg.eval_freq == 0

-        # Optional periodic head-prediction dump for the LM head:
-        # ``LEROBOT_DEBUG_PREDS_EVERY=1000`` prints 5 samples + per-token
-        # (label, argmax, ✓/✗) every 1000 steps. Cheap diagnostic to see
-        # whether the text head is actually learning what we expect, vs
-        # collapsing to a fixed token. Refilling the recipe-sample dump
-        # budget at the same cadence also redumps the raw input shapes.
-        _debug_preds_every = int(os.environ.get("LEROBOT_DEBUG_PREDS_EVERY", "0"))
-        if (
-            _debug_preds_every > 0
-            and step % _debug_preds_every == 0
-            and is_main_process
-        ):
-            try:
-                from lerobot.policies.pi052 import text_processor_pi052 as _tp  # noqa: PLC0415
-
-                _tp._DUMPED_SO_FAR = 0
-                _tp._DUMP_BUDGET = max(_tp._DUMP_BUDGET, 5)
-            except Exception:  # noqa: BLE001
-                pass
-            _print_debug_text_predictions(policy, batch, step, n_samples=5)
-
        if is_log_step:
            logging.info(train_tracker)
            if wandb_logger:
@@ -803,49 +500,9 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
                if sample_weighter is not None:
                    weighter_stats = sample_weighter.get_stats()
                    wandb_log_dict.update({f"sample_weighting/{k}": v for k, v in weighter_stats.items()})
-                # EMA observability: ``ema.step`` is the count of
-                # ``ema.update()`` calls (= optimizer steps once EMA is
-                # enabled); ``ema.initted`` flips to True once we've
-                # crossed ``update_after_step``.
-                if ema is not None:
-                    wandb_log_dict["ema/step"] = int(ema.step.item())
-                    wandb_log_dict["ema/initted"] = float(ema.initted.item())
-                    wandb_log_dict["ema/beta"] = float(cfg.ema.decay)
                wandb_logger.log_dict(wandb_log_dict, step)
            train_tracker.reset_averages()

-        # Periodic training-example dump to wandb (camera images + text
-        # fields + action endpoints). Opt-in via ``--wandb.log_examples_freq``;
-        # independent of ``--log_freq`` so you can keep scalar logs frequent
-        # and the heavier visual dump rare (e.g. every 5000 steps).
-        if (
-            wandb_logger is not None
-            and cfg.wandb.log_examples_freq > 0
-            and step % cfg.wandb.log_examples_freq == 0
-            and is_main_process
-        ):
-            try:
-                # Optionally use the EMA shadow model directly for the
-                # predicted-action columns (matches what eval / deployment
-                # would see). ``ema-pytorch`` exposes the shadow as a
-                # full ``nn.Module`` at ``ema.ema_model``, so we just
-                # pass that instead of swap-and-restore.
-                target_policy = (
-                    ema.ema_model
-                    if (ema is not None and cfg.ema.use_for_wandb_examples)
-                    else accelerator.unwrap_model(policy)
-                )
-                wandb_logger.log_training_examples(
-                    batch=batch,
-                    step=step,
-                    camera_keys=list(dataset.meta.camera_keys),
-                    n_samples=cfg.wandb.log_examples_n,
-                    policy=target_policy,
-                    predict_actions=cfg.wandb.log_examples_predict_actions,
-                )
-            except Exception as exc:  # noqa: BLE001
-                logging.warning("wandb log_training_examples failed: %s", exc)
-
        if cfg.save_checkpoint and is_saving_step:
            if is_main_process:
                logging.info(f"Checkpoint policy after step {step}")
@@ -861,18 +518,6 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
                    postprocessor=postprocessor,
                )
                update_last_checkpoint(checkpoint_dir)
-                # Save the EMA shadow alongside the training state so a
-                # resumed run picks up exactly where the live EMA left off.
-                # ``ema-pytorch.state_dict()`` returns the full shadow
-                # nn.Module's state dict + step/initted buffers; saved as
-                # .pt (the rest of training_state mixes formats already).
-                if ema is not None:
-                    try:
-                        ema_path = checkpoint_dir / "training_state" / "ema_state.pt"
-                        ema_path.parent.mkdir(parents=True, exist_ok=True)
-                        torch.save(ema.state_dict(), ema_path)
-                    except Exception as exc:  # noqa: BLE001
-                        logging.warning("Failed to save EMA shadow: %s", exc)
                if wandb_logger:
                    wandb_logger.log_policy(checkpoint_dir)

@@ -882,20 +527,10 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
            if is_main_process:
                step_id = get_step_identifier(step, cfg.steps)
                logging.info(f"Eval policy at step {step}")
-                # Use the EMA shadow model for eval when enabled —
-                # standard practice for diffusion-style policies (~1–3%
-                # lift on closed-loop success). ``ema.ema_model`` is a
-                # full nn.Module clone, so we just pass it through; no
-                # swap/restore on the live policy needed.
-                eval_target_policy = (
-                    ema.ema_model
-                    if (ema is not None and cfg.ema.use_for_eval)
-                    else accelerator.unwrap_model(policy)
-                )
                with torch.no_grad(), accelerator.autocast():
                    eval_info = eval_policy_all(
                        envs=eval_env,  # dict[suite][task_id] -> vec_env
-                        policy=eval_target_policy,
+                        policy=accelerator.unwrap_model(policy),
                        env_preprocessor=env_preprocessor,
                        env_postprocessor=env_postprocessor,
                        preprocessor=preprocessor,
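Reviewer note on the EMA hooks removed in this file: the deleted code kept a shadow copy of the policy weights, nudged it toward the live weights after each optimizer step, and passed that shadow to eval. The sketch below shows only that update rule in pure Python. `ShadowEma`, its `decay` value, and the list-of-floats weights are illustrative assumptions; this is not the `ema-pytorch` API the removed code called.

```python
from dataclasses import dataclass, field


@dataclass
class ShadowEma:
    """Hypothetical stand-in for an EMA wrapper; weights are plain floats."""

    decay: float = 0.99
    shadow: list[float] = field(default_factory=list)
    step: int = 0

    def update(self, live: list[float]) -> None:
        if not self.shadow:
            # First call: copy the live weights so the shadow starts in sync.
            self.shadow = list(live)
        else:
            # shadow <- decay * shadow + (1 - decay) * live
            self.shadow = [
                self.decay * s + (1.0 - self.decay) * w
                for s, w in zip(self.shadow, live)
            ]
        self.step += 1


ema = ShadowEma(decay=0.9)
ema.update([1.0, 2.0])  # shadow is initialised to the live weights
ema.update([2.0, 2.0])  # shadow moves 10% of the way toward the new weights
```

With the hooks gone, eval runs on `accelerator.unwrap_model(policy)` directly, so no shadow state needs to be saved or restored with checkpoints.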
--- a/src/lerobot/tools/__init__.py
+++ b/src/lerobot/tools/__init__.py
@@ -1,29 +0,0 @@
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""LeRobot tool implementations.
-
-Storage of the tool catalog (``meta/info.json["tools"]``) and the
-``SAY_TOOL_SCHEMA`` constant live in PR 1
-(``lerobot.datasets.language``). This package holds the *runnable*
-implementations one file per tool, plus the registry that maps tool
-names to classes.
-
-See ``docs/source/tools.mdx`` for the authoring guide.
-"""
-
-from .base import Tool
-from .registry import TOOL_REGISTRY, get_tools
-from .say import SayTool
-
-__all__ = ["Tool", "TOOL_REGISTRY", "get_tools", "SayTool"]
--- a/src/lerobot/tools/base.py
+++ b/src/lerobot/tools/base.py
@@ -1,58 +0,0 @@
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Tool protocol — the contract every runnable tool implementation honors.
-
-Tools are the executable side of the OpenAI-style function-calling
-abstraction the v3.1 language schema (PR 1) carries on assistant
-messages: the schema describes *what can be called*, the tool
-implementation describes *how to call it*.
-
-Implementations live one-per-file under :mod:`lerobot.tools` (e.g.
-``say.py`` for ``SayTool``) and are registered in
-:mod:`lerobot.tools.registry`. The runtime instantiates them lazily so
-heavy dependencies (torch models, audio backends, network clients,
-hardware drivers) only load when the dataset actually declares the tool.
-"""
-
-from __future__ import annotations
-
-from typing import Any, Protocol, runtime_checkable
-
-
-@runtime_checkable
-class Tool(Protocol):
-    """Minimum surface every tool must expose."""
-
-    #: Name matching ``schema["function"]["name"]``. The runtime dispatcher
-    #: routes incoming ``tool_calls`` to the implementation by this key.
-    name: str
-
-    #: OpenAI-style function-call schema. Same dict the dataset stores in
-    #: ``meta/info.json["tools"]`` and the chat template renders into the
-    #: prompt.
-    schema: dict[str, Any]
-
-    def call(self, arguments: dict[str, Any]) -> Any:
-        """Execute the tool with the model-provided arguments.
-
-        ``arguments`` is the parsed dict from
-        ``tool_calls[i]["function"]["arguments"]`` (already JSON-decoded
-        when the model emits a JSON-string by the chat-template
-        convention). Implementations validate the dict against their own
-        schema; the runtime only routes by name.
-
-        Return value is implementation-defined — typically a tensor
-        (TTS audio), a Path (saved file), a dict (structured result), or
-        ``None`` (side-effect-only call).
-        """
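Reviewer note on the removed `Tool` protocol: because it is a `runtime_checkable` `Protocol`, any object exposing `name`, `schema`, and `call(arguments)` satisfies it without inheriting from it. A minimal sketch of that structural check follows; `EchoTool` is a hypothetical tool invented for illustration and does not exist in the repository.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class Tool(Protocol):
    name: str
    schema: dict[str, Any]

    def call(self, arguments: dict[str, Any]) -> Any: ...


class EchoTool:
    """Hypothetical tool: returns its ``text`` argument unchanged."""

    def __init__(self) -> None:
        self.name = "echo"
        self.schema = {"type": "function", "function": {"name": "echo"}}

    def call(self, arguments: dict[str, Any]) -> Any:
        return arguments["text"]


tool = EchoTool()
is_tool = isinstance(tool, Tool)  # structural check, no inheritance needed
result = tool.call({"text": "hello"})
```

Note that `isinstance` only verifies the members are present; it does not validate their types or the `call` signature.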
--- a/src/lerobot/tools/registry.py
+++ b/src/lerobot/tools/registry.py
@@ -1,70 +0,0 @@
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Tool registry — name → implementation class.
-
-Adding a new tool:
-
-1. Drop a file under ``src/lerobot/tools/`` that defines a class
-   conforming to :class:`lerobot.tools.base.Tool` (must expose ``name``,
-   ``schema``, ``call(arguments)``).
-2. Register the class here under :data:`TOOL_REGISTRY`.
-3. (Optional) Pre-populate ``meta/info.json["tools"]`` on your dataset
-   to advertise the schema to the chat-template + policy. The PR 2
-   annotation pipeline preserves anything you put there.
-
-See ``docs/source/tools.mdx`` for the full authoring guide.
-"""
-
-from __future__ import annotations
-
-from typing import Any
-
-from .base import Tool
-from .say import SayTool
-
-#: Map from ``function.name`` to a class implementing :class:`Tool`.
-#: The runtime instantiates entries lazily — registering a tool here is
-#: essentially free (no model load happens until ``call`` runs).
-TOOL_REGISTRY: dict[str, type] = {
-    "say": SayTool,
-}
-
-
-def get_tools(meta: Any, **kwargs: Any) -> dict[str, Tool]:
-    """Build name → tool-instance dict from a dataset's declared catalog.
-
-    ``meta`` is anything with a ``.tools`` attribute returning the
-    OpenAI-style schema list — typically a
-    :class:`lerobot.datasets.dataset_metadata.LeRobotDatasetMetadata`.
-    Each entry whose ``function.name`` is registered here is
-    instantiated with the schema dict; tools whose name is unknown to
-    the registry are skipped (the schema still rides through the chat
-    template, the model just can't actually invoke that tool at
-    inference).
-
-    Extra keyword arguments are forwarded to every constructor — useful
-    for runtime defaults like ``output_dir=Path("./tts_log")``.
-    """
-    declared = list(meta.tools)
-    instances: dict[str, Tool] = {}
-    for schema in declared:
-        try:
-            name = schema["function"]["name"]
-        except (KeyError, TypeError):
-            continue
-        cls = TOOL_REGISTRY.get(name)
-        if cls is None:
-            continue
-        instances[name] = cls(schema=schema, **kwargs)
-    return instances
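Reviewer note on the removed `get_tools`: it instantiates only the declared schemas whose `function.name` is registered, and silently skips malformed or unknown entries. The sketch below reproduces that filtering under stated assumptions: `EchoTool`, `REGISTRY`, and `build_tools` are hypothetical names, and `SimpleNamespace` stands in for the dataset metadata object.

```python
from types import SimpleNamespace
from typing import Any


class EchoTool:
    """Hypothetical tool used only to exercise the registry lookup."""

    def __init__(self, schema: dict[str, Any], **kwargs: Any) -> None:
        self.schema = schema
        self.name = schema["function"]["name"]
        self.options = kwargs  # runtime defaults forwarded by the caller

    def call(self, arguments: dict[str, Any]) -> Any:
        return arguments.get("text")


REGISTRY: dict[str, type] = {"echo": EchoTool}


def build_tools(meta: Any, **kwargs: Any) -> dict[str, Any]:
    instances: dict[str, Any] = {}
    for schema in list(meta.tools):
        try:
            name = schema["function"]["name"]
        except (KeyError, TypeError):
            continue  # malformed entry: not a dict, or missing keys
        cls = REGISTRY.get(name)
        if cls is None:
            continue  # declared in the dataset but not runnable here
        instances[name] = cls(schema=schema, **kwargs)
    return instances


meta = SimpleNamespace(
    tools=[
        {"function": {"name": "echo"}},
        {"function": {"name": "unknown"}},
        {"bad": 1},
        None,
    ]
)
tools = build_tools(meta, output_dir="./tts_log")
```

Only the `echo` entry survives; the other three are dropped without raising.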
--- a/src/lerobot/tools/say.py
+++ b/src/lerobot/tools/say.py
@@ -1,169 +0,0 @@
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""``SayTool`` — text-to-speech tool wrapping Kyutai's pocket-tts.
-
-The first concrete tool implementation. PI052 and downstream runtime
-dispatchers consume this when the model emits an assistant message
-with ``tool_calls=[{function: {name: "say", arguments: {text: ...}}}]``.
-
-Why pocket-tts:
-
-- runs on CPU (no GPU dependency); ~6× real-time on a MacBook Air M4
-- ~100M parameters, ~200ms first-chunk latency
-- streamable, voice-cloneable
-- pip-installable, MIT-style permissive license
-
-The pocket-tts model is loaded **lazily** the first time ``call(...)``
-runs (or eagerly via ``preload()``). Loading takes a few seconds and
-several hundred MB of RAM, so we don't pay the cost when the tool is
-merely *registered* — only when it's *invoked*.
-
-Optional dependency. Install with::
-
-    pip install lerobot[tools]
-    # or directly:
-    pip install pocket-tts
-"""
-
-from __future__ import annotations
-
-import logging
-from dataclasses import dataclass, field
-from pathlib import Path
-from typing import Any
-
-from lerobot.datasets.language import SAY_TOOL_SCHEMA
-
-logger = logging.getLogger(__name__)
-
-
-@dataclass
-class SayTool:
-    """Speak a short utterance via Kyutai's pocket-tts.
-
-    Parameters
-    ----------
-    schema:
-        Optional schema override; defaults to the canonical
-        ``SAY_TOOL_SCHEMA`` from PR 1. Custom voices or extended
-        argument shapes can pass in a modified schema, but the
-        implementation only reads ``arguments["text"]``.
-    voice:
-        One of the pocket-tts catalog voices (``alba``, ``marius``,
-        ``javert``, ``jean``, ``fantine``, ``cosette``, ``eponine``,
-        ``azelma``) or a path to a ``.wav`` / ``.safetensors`` voice
-        file for cloning. See the pocket-tts model card for licensing.
-    output_dir:
-        If set, every ``call(...)`` writes a ``<timestamp>.wav`` audio
-        file there in addition to returning the PCM tensor.
-        ``None`` (default) skips disk writes — useful for live
-        playback paths that hand the tensor directly to a sounddevice
-        / WebAudio sink.
-    """
-
-    schema: dict[str, Any] = field(default_factory=lambda: dict(SAY_TOOL_SCHEMA))
-    voice: str = "alba"
-    output_dir: Path | None = None
-
-    name: str = field(init=False, default="say")
-    _model: Any = field(init=False, default=None, repr=False)
-    _voice_state: Any = field(init=False, default=None, repr=False)
-    _sample_rate: int = field(init=False, default=24000, repr=False)
-
-    # ------------------------------------------------------------------
-    # Lazy model load
-    # ------------------------------------------------------------------
-
-    def preload(self) -> None:
-        """Load the pocket-tts model + voice state into memory.
-
-        Optional — ``call(...)`` triggers this automatically on first
-        invocation. Useful when you want the multi-second load to
-        happen at startup rather than on the first ``say`` the policy
-        emits.
-        """
-        if self._model is not None and self._voice_state is not None:
-            return
-        try:
-            from pocket_tts import TTSModel  # noqa: PLC0415  (optional dep)
-        except ImportError as exc:  # pragma: no cover (env-dependent)
-            raise ImportError(
-                "SayTool requires pocket-tts. Install with `pip install "
-                "lerobot[tools]` or `pip install pocket-tts`."
-            ) from exc
-        logger.info("SayTool: loading pocket-tts model + voice=%r", self.voice)
-        self._model = TTSModel.load_model()
-        self._voice_state = self._model.get_state_for_audio_prompt(self.voice)
-        self._sample_rate = int(getattr(self._model, "sample_rate", 24000))
-
-    # ------------------------------------------------------------------
-    # Tool protocol
-    # ------------------------------------------------------------------
-
-    def call(self, arguments: dict[str, Any]) -> Any:
-        """Speak ``arguments["text"]`` and return the PCM tensor.
-
-        Optionally also writes ``<output_dir>/<timestamp>.wav`` when
-        ``self.output_dir`` is set. The returned tensor is a 1-D
-        ``torch.Tensor`` of float32 PCM samples at
-        ``self.sample_rate`` Hz — directly playable by
-        ``sounddevice.play(audio.numpy(), self.sample_rate)`` or
-        encodable by ``scipy.io.wavfile.write``.
-        """
-        text = arguments.get("text")
-        if not isinstance(text, str) or not text.strip():
-            raise ValueError(
-                f"SayTool.call expects arguments={{'text': str}}, got {arguments!r}"
-            )
-        self.preload()
-
-        audio = self._model.generate_audio(self._voice_state, text)
-
-        if self.output_dir is not None:
-            self._write_wav(audio, text)
-
-        return audio
-
-    @property
-    def sample_rate(self) -> int:
-        """PCM sample rate of the returned tensor (Hz)."""
-        return self._sample_rate
-
-    # ------------------------------------------------------------------
-    # Helpers
-    # ------------------------------------------------------------------
-
-    def _write_wav(self, audio: Any, text: str) -> Path:
-        """Write a ``.wav`` next to ``output_dir`` for offline inspection."""
-        import time as _time  # noqa: PLC0415
-
-        try:
-            import scipy.io.wavfile  # noqa: PLC0415
-        except ImportError as exc:  # pragma: no cover
-            raise ImportError(
-                "SayTool.output_dir requires scipy. `pip install scipy`."
-            ) from exc
-
-        out_dir = Path(self.output_dir)
-        out_dir.mkdir(parents=True, exist_ok=True)
-        # One file per call; suffix with a millisecond timestamp + a
-        # short text snippet so a directory listing is informative.
-        snippet = "".join(c if c.isalnum() else "_" for c in text[:32]).strip("_")
-        ts_ms = int(_time.time() * 1000)
-        path = out_dir / f"say_{ts_ms}_{snippet}.wav"
-
-        # ``audio`` is a torch tensor; pocket-tts uses CPU, so a plain
-        # ``.numpy()`` is safe.
-        scipy.io.wavfile.write(path, self.sample_rate, audio.numpy())
-        return path
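Reviewer note on the lazy-load pattern in the removed `SayTool`: the model is loaded on the first `call(...)`, or earlier via `preload()`, and reused afterwards. A minimal sketch of that pattern follows. `LazySpeaker` and `_expensive_load` are hypothetical placeholders and do not touch pocket-tts; the `load_count` field exists only to make the single load observable.

```python
from dataclasses import dataclass, field
from typing import Any


def _expensive_load() -> dict[str, Any]:
    """Hypothetical placeholder for a slow model load."""
    return {"sample_rate": 24000}


@dataclass
class LazySpeaker:
    voice: str = "alba"
    load_count: int = field(init=False, default=0)
    _model: Any = field(init=False, default=None, repr=False)

    def preload(self) -> None:
        # Idempotent: a second call returns immediately.
        if self._model is not None:
            return
        self._model = _expensive_load()
        self.load_count += 1

    def call(self, arguments: dict[str, Any]) -> str:
        text = arguments.get("text")
        if not isinstance(text, str) or not text.strip():
            raise ValueError(f"expected arguments={{'text': str}}, got {arguments!r}")
        self.preload()  # pays the load cost only on first use
        return f"[{self.voice}] {text}"


speaker = LazySpeaker()
before = speaker.load_count
speaker.call({"text": "Sure."})
speaker.call({"text": "Using less water."})
after = speaker.load_count
```

Validating the arguments before `preload()` means a malformed call fails fast without triggering the load.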
--- a/src/lerobot/utils/collate.py
+++ b/src/lerobot/utils/collate.py
@@ -22,7 +22,7 @@ from torch.utils.data._utils.collate import default_collate

 from lerobot.datasets.language import LANGUAGE_COLUMNS

-_PYTHON_LIST_KEYS = {"messages", "message_streams", "target_message_indices", *LANGUAGE_COLUMNS}
+_PYTHON_LIST_KEYS = {"messages", "message_streams", "target_message_indices"}


 def lerobot_collate_fn(batch: list[dict[str, Any] | None]) -> dict[str, Any] | None:
--- a/src/lerobot/utils/constants.py
+++ b/src/lerobot/utils/constants.py
@@ -34,7 +34,6 @@ ACTION = "action"
 ACTION_PREFIX = ACTION + "."
 ACTION_TOKENS = ACTION + ".tokens"
 ACTION_TOKEN_MASK = ACTION + ".token_mask"
-ACTION_CODE_TOKEN_MASK = ACTION + ".code_token_mask"
 REWARD = "next.reward"
 TRUNCATED = "next.truncated"
 DONE = "next.done"
--- a/tests/annotations/init.py
+++ b/tests/annotations/init.py
--- a/tests/annotations/_helpers.py
+++ b/tests/annotations/_helpers.py
@@ -1,58 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Helpers shared across annotation-pipeline tests."""
-
-from __future__ import annotations
-
-import json
-from typing import Any
-
-from lerobot.annotations.steerable_pipeline.vlm_client import StubVlmClient
-
-
-def make_canned_responder(
-    responses_by_marker: dict[str, Any],
-    default: Any = None,
-) -> StubVlmClient:
-    """Return a stub that picks a response by inspecting the user prompt.
-
-    For each call the responder examines the last user-message text and
-    returns the response keyed by the first marker substring it contains.
-    Falls back to ``default`` if no marker matches.
-    """
-
-    def responder(messages: list[dict[str, Any]]) -> Any:
-        last_user_text = ""
-        for message in messages:
-            if message.get("role") != "user":
-                continue
-            content = message.get("content")
-            if isinstance(content, str):
-                last_user_text = content
-            elif isinstance(content, list):
-                for block in content:
-                    if isinstance(block, dict) and block.get("type") == "text":
-                        last_user_text = block.get("text", "")
-        for marker, response in responses_by_marker.items():
-            if marker in last_user_text:
-                return response
-        return default
-
-    return StubVlmClient(responder=responder)
-
-
-def encode_vqa_answer(payload: dict[str, Any]) -> str:
-    return json.dumps(payload, sort_keys=True)
--- a/tests/annotations/conftest.py
+++ b/tests/annotations/conftest.py
@@ -1,51 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Shared fixtures for annotation-pipeline tests.
-
-The on-disk dataset builder lives with the other dataset factories in
-``tests/fixtures/dataset_factories.py`` (:func:`build_annotation_dataset`);
-these fixtures only wire it into pytest.
-"""
-
-from __future__ import annotations
-
-from pathlib import Path
-
-import pytest
-
-from tests.fixtures.dataset_factories import build_annotation_dataset
-
-
-@pytest.fixture
-def fixture_dataset_root(tmp_path: Path) -> Path:
-    """A tiny dataset with two episodes, 12 frames each at 10 fps."""
-    return build_annotation_dataset(
-        tmp_path / "ds",
-        episode_specs=[
-            (0, 12, "Could you tidy the kitchen please?"),
-            (1, 12, "Please clean up the kitchen"),
-        ],
-        fps=10,
-    )
-
-
-@pytest.fixture
-def single_episode_root(tmp_path: Path) -> Path:
-    return build_annotation_dataset(
-        tmp_path / "ds_one",
-        episode_specs=[(0, 30, "Pour water from the bottle into the cup.")],
-        fps=10,
-    )
--- a/tests/annotations/run_e2e_smoke.py
+++ b/tests/annotations/run_e2e_smoke.py
@@ -1,101 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Opt-in E2E smoke run for ``make annotation-e2e``.
-
-Builds the shared annotation fixture (:func:`build_annotation_dataset`),
-runs the full annotation pipeline against it with a stub VLM, and prints a
-short report. This is intentionally not a pytest test — it exercises the
-CLI plumbing — but it reuses the same on-disk dataset builder as the pytest
-fixtures so there is no duplicated fixture code.
-"""
-
-from __future__ import annotations
-
-import sys
-import tempfile
-from pathlib import Path
-
-from lerobot.annotations.steerable_pipeline.config import AnnotationPipelineConfig
-from lerobot.annotations.steerable_pipeline.executor import Executor
-from lerobot.annotations.steerable_pipeline.modules import (
-    GeneralVqaModule,
-    InterjectionsAndSpeechModule,
-    PlanSubtasksMemoryModule,
-)
-from lerobot.annotations.steerable_pipeline.validator import StagingValidator
-from lerobot.annotations.steerable_pipeline.vlm_client import StubVlmClient
-from lerobot.annotations.steerable_pipeline.writer import LanguageColumnsWriter
-from tests.fixtures.dataset_factories import build_annotation_dataset
-
-
-def _stub_responder(messages):
-    text = ""
-    for m in messages:
-        if m.get("role") == "user":
-            content = m.get("content")
-            if isinstance(content, list):
-                for block in content:
-                    if isinstance(block, dict) and block.get("type") == "text":
-                        text = block.get("text", "")
-            elif isinstance(content, str):
-                text = content
-    if "atomic subtasks" in text:
-        return {
-            "subtasks": [
-                {"text": "grasp the bottle", "start": 0.0, "end": 1.0},
-                {"text": "pour into the cup", "start": 1.0, "end": 2.0},
-                {"text": "place the bottle down", "start": 2.0, "end": 3.0},
-            ]
-        }
-    if "concise hierarchical PLAN" in text:
-        return {"plan": "1. grasp\n2. pour\n3. place"}
-    if "Update the memory" in text:
-        return {"memory": "poured once"}
-    if "acknowledgement the robot" in text:
-        return {"text": "Sure."}
-    if "ONE realistic interruption" in text:
-        return {"interjection": "use less water", "speech": "Using less water."}
-    if "frame-grounded visual question" in text:
-        return {"question": "How many cups?", "answer": {"label": "cup", "count": 1}}
-    return None
-
-
-def main() -> int:
-    with tempfile.TemporaryDirectory() as tmp:
-        root = build_annotation_dataset(
-            Path(tmp) / "ds",
-            episode_specs=[(0, 30, "Pour water into the cup.")],
-            fps=10,
-        )
-        vlm = StubVlmClient(responder=_stub_responder)
-        cfg = AnnotationPipelineConfig()
-        executor = Executor(
-            config=cfg,
-            plan=PlanSubtasksMemoryModule(vlm=vlm, config=cfg.plan),
-            interjections=InterjectionsAndSpeechModule(vlm=vlm, config=cfg.interjections, seed=cfg.seed),
-            vqa=GeneralVqaModule(vlm=vlm, config=cfg.vqa, seed=cfg.seed),
-            writer=LanguageColumnsWriter(),
-            validator=StagingValidator(),
-        )
-        summary = executor.run(root)
-        print(f"phases={[(p.name, p.episodes_processed) for p in summary.phases]}")
-        print(f"validation: {summary.validation_report.summary()}")
-        print(f"shards rewritten: {len(summary.written_paths)}")
-    return 0
-
-
-if __name__ == "__main__":
-    sys.exit(main())
--- a/tests/annotations/test_frames.py
+++ b/tests/annotations/test_frames.py
@@ -1,146 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Unit tests for :class:`VideoFrameProvider` method bindings.
-
-These were prompted by a real regression: ``video_for_episode`` was once
-indented one level too deep so it ended up nested *inside* a module-level
-helper (after that function's ``return`` statement) — silently dead code
-that meant production runs with ``use_video_url=False`` would
-``AttributeError`` on ``self.frame_provider.video_for_episode(...)``. The
-existing module tests didn't catch it because they exercise stub providers.
-
-The tests below assert on the class itself (not on an instance), so a
-future reindent regression flips them to red without needing a real
-LeRobot dataset on disk.
-"""
-
-from __future__ import annotations
-
-import shutil
-import subprocess
-from pathlib import Path
-
-import pytest
-import torch
-
-pytest.importorskip("datasets", reason="datasets is required (install lerobot[dataset])")
-
-from lerobot.annotations.steerable_pipeline.frames import (  # noqa: E402
-    VideoFrameProvider,
-    _decode_frames_av,
-    _decode_frames_ffmpeg,
-)
-
-
-def test_video_for_episode_is_a_method_of_videoframeprovider():
-    """``video_for_episode`` must be a bound method, not nested dead code."""
-    assert callable(getattr(VideoFrameProvider, "video_for_episode", None))
-
-
-def test_episode_clip_path_is_a_method_of_videoframeprovider():
-    """``episode_clip_path`` is now a method (was a free function reaching
-    into ``provider._meta`` from outside the class)."""
-    assert callable(getattr(VideoFrameProvider, "episode_clip_path", None))
-
-
-def test_videoframeprovider_has_a_lock_for_concurrent_use():
-    """A ``ThreadPoolExecutor`` runs the plan / interjections / vqa phases
-    concurrently; the cache + warn-flag accesses must be guarded.
-    """
-    import threading
-
-    # Fresh-instance check via a minimal fake to avoid touching the hub.
-    # The lock is declared with ``init=False`` and has a default factory,
-    # so a constructed instance must own a real ``threading.Lock``.
-    lock_field = next(
-        (f for f in VideoFrameProvider.__dataclass_fields__.values() if f.name == "_lock"),
-        None,
-    )
-    assert lock_field is not None
-    assert lock_field.default_factory is threading.Lock
-
-
-@pytest.fixture
-def sample_video(tmp_path: Path) -> Path:
-    """A 3 s 10 fps test-pattern mp4, written with ffmpeg."""
-    if shutil.which("ffmpeg") is None:
-        pytest.skip("ffmpeg not available")
-    out = tmp_path / "sample.mp4"
-    subprocess.run(
-        [
-            "ffmpeg", "-y", "-f", "lavfi",
-            "-i", "testsrc=duration=3:size=160x120:rate=10",
-            "-pix_fmt", "yuv420p", str(out),
-        ],
-        check=True,
-        capture_output=True,
-    )
-    return out
-
-
-def test_decode_frames_av_returns_one_uint8_frame_per_timestamp(sample_video: Path) -> None:
-    """``_decode_frames_av`` decodes via PyAV directly — no torchcodec/torchvision.
-
-    This is the always-available fallback: torchcodec is unusable in some
-    containers and lerobot's ``pyav`` backend routes through the removed
-    ``torchvision.io.VideoReader``.
-    """
-    timestamps = [0.0, 1.0, 2.5]
-    frames = _decode_frames_av(sample_video, timestamps)
-
-    assert len(frames) == len(timestamps)
-    for frame in frames:
-        assert isinstance(frame, torch.Tensor)
-        assert frame.dtype == torch.uint8
-        assert frame.shape == (3, 120, 160)
-
-
-def test_decode_frames_av_picks_nearest_frame(sample_video: Path) -> None:
-    """Repeated and out-of-order timestamps each resolve to the nearest frame."""
-    frames = _decode_frames_av(sample_video, [2.0, 0.0, 2.0])
-
-    assert len(frames) == 3
-    assert torch.equal(frames[0], frames[2])
-    assert not torch.equal(frames[0], frames[1])
-
-
-def test_decode_frames_av_raises_on_missing_file(tmp_path: Path) -> None:
-    """A missing video surfaces as an exception the caller can fall back on."""
-    with pytest.raises(Exception):  # noqa: B017, PT011
-        _decode_frames_av(tmp_path / "does_not_exist.mp4", [0.0])
-
-
-def test_decode_frames_ffmpeg_returns_one_uint8_frame_per_timestamp(sample_video: Path) -> None:
-    """``_decode_frames_ffmpeg`` shells out to the ffmpeg CLI — the always-
-    available fallback that decodes AV1 and isolates crashes to a child
-    process.
-    """
-    timestamps = [0.0, 1.0, 2.5]
-    frames = _decode_frames_ffmpeg(sample_video, timestamps)
-
-    assert len(frames) == len(timestamps)
-    for frame in frames:
-        assert isinstance(frame, torch.Tensor)
-        assert frame.dtype == torch.uint8
-        assert frame.shape == (3, 120, 160)
-
-
-def test_decode_frames_ffmpeg_raises_on_missing_file(tmp_path: Path) -> None:
-    """A missing video raises (non-zero ffmpeg exit), never crashes the job."""
-    if shutil.which("ffmpeg") is None:
-        pytest.skip("ffmpeg not available")
-    with pytest.raises(Exception):  # noqa: B017, PT011
-        _decode_frames_ffmpeg(tmp_path / "does_not_exist.mp4", [0.0])
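
Reviewer note: the removed `test_decode_frames_av_picks_nearest_frame` test relied on each requested timestamp resolving to its nearest frame, including repeated and out-of-order requests. A minimal sketch of that mapping follows. It assumes evenly spaced frames at a fixed frame rate, and `nearest_frame_indices` is a hypothetical helper written for illustration, not the removed `_decode_frames_av`.

```python
def nearest_frame_indices(timestamps, fps, n_frames):
    """Map each timestamp (seconds) to the index of the nearest frame.

    Hypothetical helper: assumes frame i sits at i / fps and clamps
    out-of-range requests to the first or last frame.
    """
    indices = []
    for ts in timestamps:
        # Round half up to the closest frame index.
        idx = int(ts * fps + 0.5)
        # Clamp so timestamps before 0 or past the end stay valid.
        indices.append(min(max(idx, 0), n_frames - 1))
    return indices


# Same request pattern as the removed test: repeated and out of order.
print(nearest_frame_indices([2.0, 0.0, 2.0], fps=10.0, n_frames=30))
```

Under these assumptions the first and third requests resolve to the same index and the second to a different one, which is the property the removed test asserted on the decoded tensors.
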
--- a/tests/annotations/test_modules.py
+++ b/tests/annotations/test_modules.py
@@ -1,355 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Module 1/2/3 unit tests with stubbed VLMs."""
-
-from __future__ import annotations
-
-import json
-from dataclasses import dataclass, field
-from pathlib import Path
-from typing import Any
-
-from lerobot.annotations.steerable_pipeline.config import (
-    InterjectionsConfig,
-    PlanConfig,
-    VqaConfig,
-)
-from lerobot.annotations.steerable_pipeline.modules import (
-    GeneralVqaModule,
-    InterjectionsAndSpeechModule,
-    PlanSubtasksMemoryModule,
-)
-from lerobot.annotations.steerable_pipeline.reader import iter_episodes
-from lerobot.annotations.steerable_pipeline.staging import EpisodeStaging
-from lerobot.annotations.steerable_pipeline.vlm_client import StubVlmClient
-
-from ._helpers import make_canned_responder
-
-
-@dataclass
-class _StubFrameProvider:
-    """Returns one sentinel object per requested timestamp."""
-
-    sentinel: Any = field(default_factory=lambda: object())
-    cameras: tuple[str, ...] = ("observation.images.top",)
-    calls: list[tuple[int, tuple[float, ...], str | None]] = field(default_factory=list)
-    video_calls: list[tuple[int, int, str | None]] = field(default_factory=list)
-
-    @property
-    def camera_keys(self) -> list[str]:
-        return list(self.cameras)
-
-    def frames_at(self, record, timestamps, camera_key=None):
-        self.calls.append((record.episode_index, tuple(timestamps), camera_key))
-        return [self.sentinel] * len(timestamps)
-
-    def video_for_episode(self, record, max_frames, camera_key=None):
-        self.video_calls.append((record.episode_index, max_frames, camera_key))
-        n = min(max_frames, len(record.frame_timestamps))
-        return [self.sentinel] * n
-
-
-def _spy_responder(captured: list[list[dict[str, Any]]], reply: Any):
-    def responder(messages):
-        captured.append(list(messages))
-        return reply
-
-    return StubVlmClient(responder=responder)
-
-
-def test_module1_plan_memory_subtask_smoke(fixture_dataset_root: Path, tmp_path: Path) -> None:
-    vlm = make_canned_responder(
-        {
-            "atomic subtasks": {
-                "subtasks": [
-                    {"text": "grasp the handle of the sponge", "start": 0.0, "end": 0.4},
-                    {"text": "wipe the counter from left to right", "start": 0.4, "end": 0.8},
-                    {"text": "place the sponge into the sink", "start": 0.8, "end": 1.1},
-                ]
-            },
-            "Update the memory": {"memory": "wiped the counter once"},
-        },
-    )
-    module = PlanSubtasksMemoryModule(vlm=vlm, config=PlanConfig())
-    record = next(iter_episodes(fixture_dataset_root))
-    staging = EpisodeStaging(tmp_path / "stage", record.episode_index)
-    module.run_episode(record, staging)
-    rows = staging.read("plan")
-
-    styles = {r["style"] for r in rows}
-    assert {"subtask", "plan", "memory"}.issubset(styles)
-    # subtask timestamps must be exact frame timestamps
-    frame_set = set(record.frame_timestamps)
-    for row in rows:
-        assert row["timestamp"] in frame_set
-    # one plan row per subtask boundary; the first lands at t0 and each
-    # plan is the deterministic numbered list of still-todo subtasks
-    plan_rows = sorted((r for r in rows if r["style"] == "plan"), key=lambda r: r["timestamp"])
-    subtask_rows = [r for r in rows if r["style"] == "subtask"]
-    assert len(plan_rows) == len(subtask_rows)
-    assert plan_rows[0]["timestamp"] == record.frame_timestamps[0]
-    # the t0 plan enumerates all subtasks; later plans shrink
-    assert plan_rows[0]["content"].startswith("1. ")
-    assert len(plan_rows[0]["content"].splitlines()) == len(subtask_rows)
-    assert len(plan_rows[-1]["content"].splitlines()) == 1
-
-
-def test_module2_at_t0_emits_speech_only_no_interjection(fixture_dataset_root: Path, tmp_path: Path) -> None:
-    vlm = make_canned_responder(
-        {"acknowledgement the robot": {"text": "Sure, on it."}},
-    )
-    module = InterjectionsAndSpeechModule(
-        vlm=vlm,
-        config=InterjectionsConfig(max_interjections_per_episode=0),
-    )
-    record = next(iter_episodes(fixture_dataset_root))
-    staging = EpisodeStaging(tmp_path / "stage", record.episode_index)
-    module.run_episode(record, staging)
-    rows = staging.read("interjections")
-    assert len(rows) == 1
-    only = rows[0]
-    assert only["role"] == "assistant"
-    assert only["style"] is None
-    assert only["content"] is None
-    assert only["timestamp"] == record.frame_timestamps[0]
-    assert only["tool_calls"][0]["function"]["name"] == "say"
-
-
-def test_module2_mid_episode_emits_paired_interjection_and_speech(
-    fixture_dataset_root: Path, tmp_path: Path
-) -> None:
-    """Module 2 anchors interjections on Module 1's subtask boundaries.
-
-    The executor runs Module 1 first, then Module 2 reads the subtask
-    rows back from the same staging tree (see
-    ``_mid_episode_interjections``). Reproduce that contract here by
-    seeding the staging with two subtask rows so a single ``0 → 1``
-    boundary exists for Module 2 to anchor on.
-    """
-    vlm = make_canned_responder(
-        {
-            "acknowledgement the robot": {"text": "OK."},
-            # Marker matches the distinctive line of
-            # ``module_2_interjection.txt``. The old marker
-            # ("ONE realistic interruption") came from a previous prompt
-            # version that asked for counterfactual interjections; the
-            # current design anchors on subtask boundaries instead, so
-            # the prompt and its marker changed.
-            "Write ONE interjection": {
-                "interjection": "now wipe the counter please",
-                "speech": "On it.",
-            },
-        },
-    )
-    module = InterjectionsAndSpeechModule(
-        vlm=vlm,
-        config=InterjectionsConfig(max_interjections_per_episode=1, interjection_min_t=0.2),
-        seed=7,
-    )
-    record = next(iter_episodes(fixture_dataset_root))
-    staging = EpisodeStaging(tmp_path / "stage", record.episode_index)
-    # Seed Module 1's subtask staging so Module 2 has a boundary to
-    # anchor on (it bails with zero rows when no spans exist — the
-    # production executor guarantees Module 1 ran first).
-    boundary_ts = float(record.frame_timestamps[len(record.frame_timestamps) // 2])
-    staging.write(
-        "plan",
-        [
-            {
-                "role": "assistant",
-                "content": "grasp the sponge",
-                "style": "subtask",
-                "timestamp": float(record.frame_timestamps[0]),
-                "tool_calls": None,
-            },
-            {
-                "role": "assistant",
-                "content": "wipe the counter",
-                "style": "subtask",
-                "timestamp": boundary_ts,
-                "tool_calls": None,
-            },
-        ],
-    )
-    module.run_episode(record, staging)
-    rows = staging.read("interjections")
-
-    interjections = [r for r in rows if r["style"] == "interjection"]
-    speeches = [r for r in rows if r["style"] is None and r["role"] == "assistant"]
-    assert len(interjections) == 1
-    assert len(speeches) >= 2  # initial t=0 + one paired with the interjection
-    inter_t = interjections[0]["timestamp"]
-    assert any(abs(s["timestamp"] - inter_t) < 1e-9 for s in speeches)
-
-
-def test_module3_vqa_unique_per_frame_and_camera(single_episode_root: Path, tmp_path: Path) -> None:
-    payload = {
-        "question": "How many cups?",
-        "answer": {"label": "cup", "count": 2, "note": "white & blue"},
-    }
-    vlm = make_canned_responder({"frame-grounded visual question": payload})
-    module = GeneralVqaModule(
-        vlm=vlm,
-        config=VqaConfig(vqa_emission_hz=1.0, K=3),
-        seed=1,
-        frame_provider=_StubFrameProvider(cameras=("observation.images.top", "observation.images.wrist")),
-    )
-    record = next(iter_episodes(single_episode_root))
-    staging = EpisodeStaging(tmp_path / "stage", record.episode_index)
-    module.run_episode(record, staging)
-    rows = staging.read("vqa")
-    # every vqa row must carry a camera tag and one of the configured cameras
-    for r in rows:
-        assert r["style"] == "vqa"
-        assert r.get("camera") in {"observation.images.top", "observation.images.wrist"}
-    # at most one (vqa, user) and one (vqa, assistant) per (timestamp, camera)
-    user_keys = [(r["timestamp"], r["camera"]) for r in rows if r["role"] == "user" and r["style"] == "vqa"]
-    assistant_keys = [
-        (r["timestamp"], r["camera"]) for r in rows if r["role"] == "assistant" and r["style"] == "vqa"
-    ]
-    assert len(user_keys) == len(set(user_keys))
-    assert len(assistant_keys) == len(set(assistant_keys))
-    # both cameras must be represented
-    assert {c for _, c in user_keys} == {"observation.images.top", "observation.images.wrist"}
-    # every emitted timestamp must be an exact source frame timestamp
-    frame_set = set(record.frame_timestamps)
-    for ts, _ in user_keys + assistant_keys:
-        assert ts in frame_set
-
-
-def test_module1_attaches_video_block_to_subtask_prompt(fixture_dataset_root: Path, tmp_path: Path) -> None:
-    """Module 1 sends one ``type=video`` block covering the whole episode."""
-    captured: list[list[dict[str, Any]]] = []
-    payload = {
-        "subtasks": [
-            {"text": "grasp the handle of the sponge", "start": 0.0, "end": 0.5},
-            {"text": "wipe the counter", "start": 0.5, "end": 1.1},
-        ]
-    }
-    plan_payload = {"plan": "1. grasp\n2. wipe"}
-    memory_payload = {"memory": "wiped once"}
-
-    def responder(messages):
-        captured.append(list(messages))
-        text = ""
-        for m in messages:
-            for block in m.get("content", []):
-                if isinstance(block, dict) and block.get("type") == "text":
-                    text = block.get("text", "")
-        if "concise hierarchical PLAN" in text:
-            return plan_payload
-        if "Update the memory" in text:
-            return memory_payload
-        return payload
-
-    provider = _StubFrameProvider()
-    module = PlanSubtasksMemoryModule(
-        vlm=StubVlmClient(responder=responder),
-        # Disable the rephrasings sub-prompt so the test's only video-bearing
-        # call is the subtask one — keeps the assertions below focused on
-        # ``_generate_subtasks`` rather than fighting the order of unrelated
-        # text-only Module-1 sub-prompts.
-        config=PlanConfig(max_video_frames=5, frames_per_second=10.0, n_task_rephrasings=0),
-        frame_provider=provider,
-    )
-    record = next(iter_episodes(fixture_dataset_root))
-    staging = EpisodeStaging(tmp_path / "stage", record.episode_index)
-    module.run_episode(record, staging)
-
-    # Find the call carrying the subtask prompt rather than blindly taking
-    # captured[0] — Module 1 issues several sub-prompts and their order is
-    # not part of the contract.
-    assert captured, "no VLM calls made"
-
-    def _prompt_text(messages):
-        for m in messages:
-            for block in m.get("content", []):
-                if isinstance(block, dict) and block.get("type") == "text":
-                    return block.get("text", "")
-        return ""
-
-    subtask_calls = [m for m in captured if "atomic subtasks" in _prompt_text(m)]
-    assert len(subtask_calls) == 1, "expected exactly one subtask-prompt VLM call"
-    content = subtask_calls[0][0]["content"]
-    video_blocks = [b for b in content if isinstance(b, dict) and b.get("type") == "video"]
-    image_blocks = [b for b in content if isinstance(b, dict) and b.get("type") == "image"]
-    text_blocks = [b for b in content if isinstance(b, dict) and b.get("type") == "text"]
-    assert len(video_blocks) == 1, f"expected exactly 1 video block, got {content}"
-    assert image_blocks == [], "subtask prompt must not mix image blocks with the video block"
-    assert len(text_blocks) == 1
-    # video block must wrap a list of frames covering the episode
-    assert isinstance(video_blocks[0]["video"], list)
-    assert len(video_blocks[0]["video"]) <= 5
-    # provider is called with target_count = min(duration * fps, max). With
-    # fps=10 on a ~1s episode that requests >max, so max=5 wins.
-    assert provider.video_calls and provider.video_calls[0][0] == record.episode_index
-    assert provider.video_calls[0][1] <= 5
-
-
-def test_module3_attaches_frame_image_block_to_prompt(single_episode_root: Path, tmp_path: Path) -> None:
-    """Each VQA prompt must carry a single image block at the emission frame."""
-    captured: list[list[dict[str, Any]]] = []
-    payload = {
-        "question": "How many cups?",
-        "answer": {"label": "cup", "count": 1},
-    }
-    provider = _StubFrameProvider()
-    module = GeneralVqaModule(
-        vlm=_spy_responder(captured, payload),
-        config=VqaConfig(vqa_emission_hz=1.0, K=1),
-        seed=0,
-        frame_provider=provider,
-    )
-    record = next(iter_episodes(single_episode_root))
-    staging = EpisodeStaging(tmp_path / "stage", record.episode_index)
-    module.run_episode(record, staging)
-
-    assert captured, "no VLM calls made"
-    for messages in captured:
-        content = messages[0]["content"]
-        image_blocks = [b for b in content if isinstance(b, dict) and b.get("type") == "image"]
-        text_blocks = [b for b in content if isinstance(b, dict) and b.get("type") == "text"]
-        assert len(image_blocks) == 1, f"expected 1 image block per VQA prompt, got {content}"
-        assert image_blocks[0]["image"] is provider.sentinel
-        assert len(text_blocks) == 1
-    # provider was called once per emission per camera with the exact emission timestamp
-    for ep_idx, ts_tuple, camera in provider.calls:
-        assert ep_idx == record.episode_index
-        assert len(ts_tuple) == 1
-        assert ts_tuple[0] in record.frame_timestamps
-        assert camera in provider.cameras
-
-
-def test_module3_assistant_content_is_valid_json(single_episode_root: Path, tmp_path: Path) -> None:
-    payload = {
-        "question": "Where is the cup?",
-        "answer": {"detections": [{"label": "cup", "bbox_format": "xyxy", "bbox": [10, 20, 50, 80]}]},
-    }
-    vlm = make_canned_responder({"frame-grounded visual question": payload})
-    module = GeneralVqaModule(
-        vlm=vlm,
-        config=VqaConfig(vqa_emission_hz=1.0, K=2),
-        seed=2,
-        frame_provider=_StubFrameProvider(),
-    )
-    record = next(iter_episodes(single_episode_root))
-    staging = EpisodeStaging(tmp_path / "stage", record.episode_index)
-    module.run_episode(record, staging)
-    rows = staging.read("vqa")
-    for row in rows:
-        if row["role"] == "assistant" and row["style"] == "vqa":
-            decoded = json.loads(row["content"])
-            assert "detections" in decoded
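
Reviewer note: `test_module1_plan_memory_subtask_smoke` above expected one plan row per subtask boundary, with the first plan listing every subtask and the last listing a single one. The sketch below shows one way such "still-todo" plans could be derived. `remaining_plans` is a hypothetical name, and restarting the numbering at 1 for each plan is an assumption; the removed test only checked that the first plan starts with `1. ` and that the line counts shrink.

```python
def remaining_plans(subtasks: list[str]) -> list[str]:
    """Build one numbered plan per subtask boundary.

    Hypothetical illustration: plan i lists the subtasks from i onward,
    renumbered from 1 (an assumption, not confirmed by the diff).
    """
    plans = []
    for i in range(len(subtasks)):
        todo = subtasks[i:]
        plans.append("\n".join(f"{n}. {text}" for n, text in enumerate(todo, start=1)))
    return plans


plans = remaining_plans(["grasp the sponge", "wipe the counter", "place the sponge"])
for plan in plans:
    print(plan)
    print("---")
```
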
--- a/tests/annotations/test_pipeline_recipe_render.py
+++ b/tests/annotations/test_pipeline_recipe_render.py
@@ -1,181 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""End-to-end smoke: pipeline output → PR 1 canonical recipe rendering."""
-
-from __future__ import annotations
-
-from pathlib import Path
-
-import pyarrow.parquet as pq
-
-from lerobot.annotations.steerable_pipeline.config import (
-    AnnotationPipelineConfig,
-    InterjectionsConfig,
-    PlanConfig,
-    VqaConfig,
-)
-from lerobot.annotations.steerable_pipeline.executor import Executor
-from lerobot.annotations.steerable_pipeline.modules import (
-    GeneralVqaModule,
-    InterjectionsAndSpeechModule,
-    PlanSubtasksMemoryModule,
-)
-from lerobot.annotations.steerable_pipeline.validator import StagingValidator
-from lerobot.annotations.steerable_pipeline.writer import LanguageColumnsWriter
-from lerobot.configs.recipe import MessageTurn, TrainingRecipe
-from lerobot.datasets.language_render import render_sample
-
-from ._helpers import make_canned_responder
-
-def _build_pr1_style_blend_recipe() -> TrainingRecipe:
-    """Inline blend recipe that consumes every style this pipeline produces.
-
-    PR 1 used to ship ``src/lerobot/configs/recipes/pi05_hirobot.yaml`` as
-    a canonical example, but that file was dropped during PR 1 review. The
-    cross-PR contract this test guards is "the recipe DSL can render
-    non-empty messages from pipeline output", which doesn't require a
-    specific YAML — so we build the equivalent blend in code.
-    """
-    return TrainingRecipe(
-        blend={
-            "low_level_execution": TrainingRecipe(
-                weight=0.35,
-                messages=[
-                    MessageTurn(
-                        role="user",
-                        content="${task}\nPlan: ${plan}\nMemory: ${memory}",
-                        stream="high_level",
-                    ),
-                    MessageTurn(role="assistant", content="${subtask}", stream="low_level", target=True),
-                ],
-            ),
-            "user_interjection_response": TrainingRecipe(
-                weight=0.16,
-                bindings={
-                    "speech": "emitted_at(t, role=assistant, tool_name=say)",
-                    "interjection": "emitted_at(t, style=interjection)",
-                },
-                messages=[
-                    MessageTurn(role="user", content="${task}", stream="high_level"),
-                    MessageTurn(
-                        role="user",
-                        content="${interjection}",
-                        stream="high_level",
-                        if_present="interjection",
-                    ),
-                    MessageTurn(
-                        role="assistant",
-                        content="${plan}",
-                        stream="high_level",
-                        target=True,
-                        if_present="plan",
-                        tool_calls_from="speech",
-                    ),
-                ],
-            ),
-        }
-    )
-
-
-def _build_executor() -> Executor:
-    vlm = make_canned_responder(
-        {
-            "atomic subtasks": {
-                "subtasks": [
-                    {"text": "grasp the bottle", "start": 0.0, "end": 0.5},
-                    {"text": "pour into the cup", "start": 0.5, "end": 1.0},
-                    {"text": "place the bottle down", "start": 1.0, "end": 1.5},
-                ]
-            },
-            "concise hierarchical PLAN": {"plan": "1. grasp\n2. pour\n3. place"},
-            "Update the memory": {"memory": "poured once"},
-            "acknowledgement the robot": {"text": "Sure."},
-            "ONE realistic interruption": {
-                "interjection": "use less water",
-                "speech": "Using less water.",
-            },
-            "frame-grounded visual question": {
-                "question": "How many cups?",
-                "answer": {"label": "cup", "count": 1},
-            },
-        },
-    )
-    config = AnnotationPipelineConfig(
-        plan=PlanConfig(),
-        interjections=InterjectionsConfig(max_interjections_per_episode=1, interjection_min_t=0.5),
-        vqa=VqaConfig(vqa_emission_hz=1.0, K=2),
-    )
-    return Executor(
-        config=config,
-        plan=PlanSubtasksMemoryModule(vlm=vlm, config=config.plan),
-        interjections=InterjectionsAndSpeechModule(vlm=vlm, config=config.interjections, seed=config.seed),
-        vqa=GeneralVqaModule(vlm=vlm, config=config.vqa, seed=config.seed),
-        writer=LanguageColumnsWriter(),
-        validator=StagingValidator(),
-    )
-
-
-def test_pr1_canonical_recipe_renders_nonempty_from_pipeline_output(
-    single_episode_root: Path,
-) -> None:
-    executor = _build_executor()
-    summary = executor.run(single_episode_root)
-    # validator may emit warnings but no errors for the synthetic fixture
-    assert summary.validation_report.ok, summary.validation_report.summary()
-
-    table = pq.read_table(single_episode_root / "data" / "chunk-000" / "file-000.parquet")
-    persistent_lists = table.column("language_persistent").to_pylist()
-    events_lists = table.column("language_events").to_pylist()
-    timestamps = table.column("timestamp").to_pylist()
-
-    recipe = _build_pr1_style_blend_recipe()
-
-    rendered_any = False
-    for sample_idx, (ts, persistent, events) in enumerate(
-        zip(timestamps, persistent_lists, events_lists, strict=True)
-    ):
-        result = render_sample(
-            recipe=recipe,
-            persistent=persistent,
-            events=events,
-            t=float(ts),
-            sample_idx=sample_idx,
-            dataset_ctx={"task": "Pour water from the bottle into the cup."},
-        )
-        if result is None:
-            continue
-        if result["messages"]:
-            rendered_any = True
-            # A valid render supervises something: a text-CE target turn
-            # OR a flow-only ``low_level``-stream turn (action loss).
-            assert (
-                result["target_message_indices"]
-                or "low_level" in result["message_streams"]
-            )
-            break
-    assert rendered_any, "recipe rendered no messages from pipeline output"
-
-    # Sanity: speech atom appears in events column intact
-    flat_events = [r for ev in events_lists for r in ev]
-    speech_rows = [r for r in flat_events if r.get("style") is None and r.get("role") == "assistant"]
-    assert speech_rows
-    say = speech_rows[0]["tool_calls"][0]
-    assert say["function"]["name"] == "say"
-    assert isinstance(say["function"]["arguments"]["text"], str)
-    # PR 2 no longer writes a ``tools`` column — the say schema lives as a
-    # constant (``SAY_TOOL_SCHEMA``) so PR 1's row struct is the single
-    # source of truth for the v3.1 schema.
-    assert "tools" not in table.column_names
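
Reviewer note: both removed test files stub the VLM with `make_canned_responder`, keyed by distinctive prompt substrings such as `"atomic subtasks"`. That helper lives in `tests/annotations/_helpers.py`, which is not shown here, so the sketch below is only a guess at the idea. `make_marker_responder` is a hypothetical name, and returning `None` on no match is an assumption.

```python
from typing import Any, Callable


def make_marker_responder(replies: dict[str, Any]) -> Callable[[list[dict[str, Any]]], Any]:
    """Return a responder that picks a canned reply by prompt substring.

    Hypothetical stand-in for the real helper, which may behave differently.
    """

    def responder(messages: list[dict[str, Any]]) -> Any:
        # Concatenate every text block across all messages.
        text = " ".join(
            block.get("text", "")
            for m in messages
            for block in m.get("content", [])
            if isinstance(block, dict) and block.get("type") == "text"
        )
        # First marker found in the prompt wins (dict insertion order).
        for marker, reply in replies.items():
            if marker in text:
                return reply
        return None  # assumption: no marker matched

    return responder


responder = make_marker_responder({"Update the memory": {"memory": "poured once"}})
prompt = [{"role": "user", "content": [{"type": "text", "text": "Update the memory now."}]}]
print(responder(prompt))
```

A responder keyed this way returns nothing useful when a prompt's wording changes and its marker no longer matches, which is the situation the comment about the old `"ONE realistic interruption"` marker in `test_modules.py` describes.
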