docs/source/annotation_pipeline.mdx

# Annotation Pipeline

`lerobot-annotate` populates the two language columns introduced by the
[Language Columns and Recipes](./language_and_recipes) page —
`language_persistent` and `language_events` — directly into
`data/chunk-*/file-*.parquet`. There is no flavor namespace and no sidecar
file tree: multiple revisions of a dataset mean multiple dataset copies.

## What the pipeline produces

Three modules write into a per-episode staging tree, then a single writer
rewrites the data shards in place:

| Style / atom                                | Column                | Module   |
| ------------------------------------------- | --------------------- | -------- |
| `subtask` (Pi0.7-style "how, not what")     | `language_persistent` | Module 1 |
| `plan` (initial + refresh on interjection)  | `language_persistent` | Module 1 |
| `memory` (MEM-style compression)            | `language_persistent` | Module 1 |
| `interjection`                              | `language_events`     | Module 2 |
| speech tool-call atom (`style=null`, `say`) | `language_events`     | Module 2 |
| `vqa` (user / assistant pair)               | `language_events`     | Module 3 |

The writer drops the legacy `subtask_index` column. It does **not** add a
`tools` column to the parquet — the tool catalog lives at
`meta/info.json["tools"]` instead (see [Tools](./tools)). After every
annotation run the pipeline ensures the canonical `say` schema is
present in that list, preserving any tools the user pre-declared.
Chat-template consumers read the catalog through
`LeRobotDatasetMetadata.tools` and pass it to
`apply_chat_template(messages, tools=meta.tools, ...)`.

If you want to declare additional tools for a dataset before annotation
runs, edit `meta/info.json["tools"]` directly — the pipeline preserves
anything already there. Implementations of those tools live under
`src/lerobot/tools/`; one file per tool, registered via
`TOOL_REGISTRY`. See the [Tools](./tools) doc for the authoring guide.
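
The "ensure `say` is present, keep everything else" behaviour can be sketched as follows. This is a minimal illustration, not the pipeline's own code: the function name `ensure_say_tool`, the `wave` tool, and the exact shape of the `say` schema (an OpenAI-style function tool) are assumptions made for the example.

```python
from copy import deepcopy

# Assumed shape of the canonical `say` schema (OpenAI-style function tool).
# The real constant ships with lerobot; this copy is illustrative only.
SAY_TOOL_SCHEMA = {
    "type": "function",
    "function": {
        "name": "say",
        "description": "Speak a short utterance to the user.",
        "parameters": {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
    },
}


def ensure_say_tool(info: dict) -> bool:
    """Add the `say` schema to info["tools"] if missing; return True if it was added."""
    tools = info.setdefault("tools", [])
    names = {tool.get("function", {}).get("name") for tool in tools}
    if "say" in names:
        return False  # already declared, leave the list untouched
    tools.append(deepcopy(SAY_TOOL_SCHEMA))
    return True


# A dataset whose info.json already declares a hypothetical custom tool.
info = {"tools": [{"type": "function", "function": {"name": "wave", "parameters": {}}}]}
changed_first = ensure_say_tool(info)   # adds `say` after the pre-declared tool
changed_second = ensure_say_tool(info)  # second call is a no-op
tool_names = [tool["function"]["name"] for tool in info["tools"]]
print(tool_names, changed_first, changed_second)
```

Keying the check on the function name makes a rerun harmless: tools that were declared by hand stay in place and `say` is never duplicated.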

## How to run it locally or on SLURM

Install the extra and invoke the console script:

```bash
uv sync --extra annotations
uv run lerobot-annotate \
  --repo_id=imstevenpmwork/super_poulain_draft \
  --vlm.backend=vllm \
  --vlm.model_id=Qwen/Qwen3.6-27B-FP8 \
  --vlm.tensor_parallel_size=2
```

The pipeline attaches actual camera footage to every Module 1/2/3 prompt
by default, decoded from the dataset's first `observation.images.*`
stream. Override with `--vlm.camera_key=observation.images.<name>` to
pin a specific viewpoint. Datasets with no video tracks fall back to
text-only prompts automatically.

**Module 1 sees the whole episode as one video block.** Subtask
decomposition gets a `{"type":"video", "video":[<frames>]}` block
covering the entire demonstration; Qwen-VL pools temporally on its own
and decides where to cut. There is no keyframe stride or count knob —
`--module_1.max_video_frames` (default 32) only caps the frames packed
into the video block as a model-capacity bound. Module 2 attaches a
single still frame at the interjection timestamp; Module 3 attaches the
exact emission frame to each VQA pair.
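
To make the frame cap concrete, here is a rough sketch of how a whole episode might be reduced to at most `max_video_frames` frames and packed into a single video block. The helper names, the evenly spaced sampling rule, and the instruction text are assumptions for illustration; only the `{"type":"video", "video":[...]}` block shape and the default of 32 come from this page.

```python
def sample_frame_indices(num_frames: int, max_frames: int = 32) -> list[int]:
    """Pick at most max_frames indices spread evenly over [0, num_frames - 1]."""
    if num_frames <= 0 or max_frames <= 0:
        return []
    if num_frames <= max_frames:
        return list(range(num_frames))  # short episode: keep every frame
    if max_frames == 1:
        return [0]
    step = (num_frames - 1) / (max_frames - 1)
    return [round(i * step) for i in range(max_frames)]


def build_subtask_prompt(frames: list, instruction: str) -> list[dict]:
    """Return the message content: one video block (if any frames) plus the text."""
    content = []
    if frames:  # datasets without video tracks fall back to text-only prompts
        content.append({"type": "video", "video": frames})
    content.append({"type": "text", "text": instruction})
    return content


# A 30 s episode at 30 fps has 900 frames; the cap keeps 32 of them.
indices = sample_frame_indices(900)
frames = [f"frame_{i}" for i in indices]  # placeholders standing in for decoded images
content = build_subtask_prompt(frames, "Segment this demonstration into subtasks.")
print(len(indices), indices[0], indices[-1], [block["type"] for block in content])
```

The cap bounds the prompt size only; where the subtask boundaries fall is still left to the model.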

The executor picks `LocalPipelineExecutor` for small datasets and
`SlurmPipelineExecutor` for large ones based on
`--executor.auto_threshold` (default 32 episodes). Force local with
`--executor.force_local=true`. SLURM jobs honour `--executor.slurm_partition`,
`--executor.slurm_gpus`, and `--executor.slurm_time`.
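
The selection rule amounts to a single comparison. The sketch below is illustrative only: the function name is hypothetical, and whether a dataset with exactly `auto_threshold` episodes runs locally or on SLURM is an assumption (treated here as local).

```python
def pick_executor(num_episodes: int, auto_threshold: int = 32, force_local: bool = False) -> str:
    """Return the name of the executor class that would handle the run."""
    # Assumption: the threshold is inclusive, so exactly `auto_threshold`
    # episodes still counts as a small dataset.
    if force_local or num_episodes <= auto_threshold:
        return "LocalPipelineExecutor"
    return "SlurmPipelineExecutor"


small = pick_executor(10)
large = pick_executor(500)
forced = pick_executor(500, force_local=True)
print(small, large, forced)
```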

## Style-to-recipe consumer mapping

The pipeline produces exactly the styles consumed by
`src/lerobot/configs/recipes/subtask_mem_vqa_speech.yaml`:

- `low_level_execution`, `high_level_subtask`, `memory_update` consume
  `subtask`/`plan`/`memory` from `language_persistent`.
- `user_interjection_response` consumes `interjection` events plus the
  paired speech atom (merged into one assistant target turn via
  `tool_calls_from`) and the same-timestamp `plan` refresh.
- `ask_vqa` consumes the `(vqa, user)` and `(vqa, assistant)` pairs from
  `language_events`.

## Why the design is scoped to the canonical recipe

Two things drive the scope:

1. **Persistent state vs exact-event split.** Persistent rows (`subtask`,
   `plan`, `memory`) broadcast per episode and answer "what state is in
   force at this frame?". Event rows (`interjection`, `vqa`, speech) only
   appear on the exact frame whose timestamp matches the emission. The
   pipeline writes timestamps taken straight from the source parquet — no
   floating-point recomputation.
2. **One Qwen-VL pass.** All three modules share a single VLM client
   (vLLM if available, transformers fallback) so the cost is one model
   load per dataset, not three.
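
The persistent/event split in point 1 can be illustrated with a small sketch. The row fields (`style`, `content`, `timestamp`) and the function name are assumptions for the example; the point is that persistent rows are repeated on every frame of the episode, while event rows land only on the frame whose timestamp is exactly equal to theirs.

```python
def materialize_episode(frame_timestamps, persistent_rows, event_rows):
    """Return per-frame (language_persistent, language_events) lists for one episode."""
    by_timestamp = {}
    for row in event_rows:
        by_timestamp.setdefault(row["timestamp"], []).append(row)

    # Exact equality is safe only because event timestamps are copied from
    # the source frames rather than recomputed.
    unmatched = set(by_timestamp) - set(frame_timestamps)
    if unmatched:
        raise ValueError(f"events with no matching frame: {sorted(unmatched)}")

    persistent = [persistent_rows for _ in frame_timestamps]  # broadcast per episode
    events = [by_timestamp.get(ts, []) for ts in frame_timestamps]
    return persistent, events


frame_timestamps = [0.0, 0.1, 0.2]
persistent_rows = [{"style": "subtask", "content": "grasp the cup", "timestamp": 0.0}]
event_rows = [{"style": "interjection", "content": "use the left hand", "timestamp": 0.1}]
persistent, events = materialize_episode(frame_timestamps, persistent_rows, event_rows)
print([len(e) for e in events])
```

A timestamp recomputed as `frame_index / fps` could differ from the stored value in the last bit and silently miss its frame, which is why the sketch raises on any unmatched event.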

## Module independence and staged reruns

Each module writes its raw output to
`<root>/.annotate_staging/episode_{N:06d}/<module>.jsonl`. That makes
prompt iteration cheap — re-running one module overwrites only its own
JSONL file before the writer composes the final parquet. Modules can be
disabled via `--module_1.enabled=false` (and similarly for 2 and 3) to
test them in isolation.
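
A small sketch of the staging layout, assuming a module file name such as `module_1.jsonl` (the page only specifies `<module>.jsonl`) and hypothetical helper names. It shows that rerunning one module replaces that module's file and nothing else.

```python
import json
import tempfile
from pathlib import Path


def staging_path(root: Path, episode_index: int, module: str) -> Path:
    """Build <root>/.annotate_staging/episode_{N:06d}/<module>.jsonl."""
    return root / ".annotate_staging" / f"episode_{episode_index:06d}" / f"{module}.jsonl"


def write_staged_rows(path: Path, rows: list[dict]) -> None:
    """Write one JSON object per line, replacing any previous output."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    path = staging_path(root, 7, "module_1")  # "module_1" is an assumed file stem
    write_staged_rows(path, [{"style": "subtask", "content": "first draft"}])
    write_staged_rows(path, [{"style": "subtask", "content": "second draft"}])  # rerun
    relative = path.relative_to(root).as_posix()
    lines = path.read_text(encoding="utf-8").splitlines()

print(relative, lines)
```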

## Validation/report checks before final write

Before the writer runs, `StagingValidator` checks:

- exact frame-timestamp alignment for every event row;
- no orphan speech / interjection pairs;
- `plan` is refreshed at every interjection timestamp;
- `memory` rows fall on subtask boundaries (warning, not error);
- VQA assistant `content` parses as JSON in one of the
  bbox / keypoint / count / attribute / spatial shapes;
- every row routes to the column dictated by `column_for_style(style)`.

Errors abort the writer; pass `--skip_validation=true` to override this while debugging.
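
Three of these checks (column routing, exact frame alignment, and plan refresh at interjections) can be sketched in a few lines. This is not the `StagingValidator` implementation: the row fields, the function names, and the stand-in routing helper are assumptions, with the style-to-column mapping copied from the table at the top of this page.

```python
PERSISTENT_STYLES = {"subtask", "plan", "memory"}
EVENT_STYLES = {"interjection", "vqa", None}  # None is the speech tool-call atom


def column_for_style_sketch(style):
    """Stand-in for the real routing helper, mirroring the table on this page."""
    if style in PERSISTENT_STYLES:
        return "language_persistent"
    if style in EVENT_STYLES:
        return "language_events"
    raise ValueError(f"unknown style: {style!r}")


def validate_staged(rows, frame_timestamps):
    """Return a list of error strings; an empty list means the checks passed."""
    errors = []
    frames = set(frame_timestamps)
    for row in rows:
        expected = column_for_style_sketch(row["style"])
        if row["column"] != expected:
            errors.append(f"{row['style']} routed to {row['column']}, expected {expected}")
        if expected == "language_events" and row["timestamp"] not in frames:
            errors.append(f"event at t={row['timestamp']} matches no frame")
    interjections = {r["timestamp"] for r in rows if r["style"] == "interjection"}
    plans = {r["timestamp"] for r in rows if r["style"] == "plan"}
    for ts in sorted(interjections - plans):
        errors.append(f"interjection at t={ts} has no plan refresh")
    return errors


frames = [0.0, 0.5, 1.0]
good = [
    {"style": "plan", "column": "language_persistent", "timestamp": 0.5},
    {"style": "interjection", "column": "language_events", "timestamp": 0.5},
    {"style": None, "column": "language_events", "timestamp": 0.5},
]
bad = [{"style": "interjection", "column": "language_persistent", "timestamp": 0.7}]
good_errors = validate_staged(good, frames)
bad_errors = validate_staged(bad, frames)
print(good_errors, bad_errors)
```

Collecting every error before aborting, rather than stopping at the first, gives a single report covering the whole staging tree.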

## Paper inspirations per module

- **Module 1 — subtasks.** Hi Robot ([Shi 2025](https://arxiv.org/abs/2502.19417))
  atom granularity ("pick up one piece of lettuce", "place bowl to box");
  Pi0.7 ([Physical Intelligence 2025](https://pi.website/pi07)) "how, not
  what" detail.
- **Module 1 — memory.** MEM ([Torne 2026](https://arxiv.org/abs/2603.03596))
  compression directive: keep only minimal relevant information; functional
  outcomes preserved, specific attributes dropped.
- **Module 2 — interjections.** Hi Robot scenario taxonomy: negative task,
  situated correction, specific constraint, preference. Speech is a
  tool-call-only atom
  (`tool_calls=[{type:function, function:{name:"say", arguments:{text:...}}}]`).
- **Module 3 — VQA.** ECoT ([Zawalski 2024](https://arxiv.org/abs/2407.08693))
  grounded features (bounding boxes in pixel `[x_min, y_min, x_max, y_max]`,
  keypoints) and Steerable Policies' multi-abstraction grounding.

Future maintainers should adjust the prompt templates in
`src/lerobot/annotations/steerable_pipeline/prompts/` against these
references rather than rewriting from scratch.
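
As an illustration of the pixel-space bounding-box convention for Module 3, the sketch below parses a VQA assistant answer and checks that the box is a well-ordered `[x_min, y_min, x_max, y_max]` inside the image. The `bbox` key name, the function name, and the image size are assumptions; the real answer schema may differ.

```python
import json


def parse_bbox_answer(content: str, width: int, height: int) -> list[float]:
    """Parse a JSON answer and return its bounding box, or raise ValueError."""
    payload = json.loads(content)  # json.JSONDecodeError is a ValueError subclass
    box = payload.get("bbox") if isinstance(payload, dict) else None  # assumed key name
    if not isinstance(box, list) or len(box) != 4:
        raise ValueError("bbox must be a list of four numbers")
    if not all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in box):
        raise ValueError("bbox entries must be numeric")
    x_min, y_min, x_max, y_max = box
    if not (0 <= x_min < x_max <= width and 0 <= y_min < y_max <= height):
        raise ValueError("bbox is not ordered or lies outside the image")
    return [float(v) for v in box]


box = parse_bbox_answer('{"bbox": [10, 20, 110, 220]}', width=640, height=480)
print(box)
```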

## Compute and list-size estimates

Per episode, the pipeline issues O(`max_steps`) Module 1 calls,
O(`max_interjections_per_episode`) Module 2 calls, and
O(`vqa_emission_hz × episode_seconds`) Module 3 calls. With defaults
(8 subtasks, 1 interjection, 1 Hz × 3 pairs) and 30-second episodes, that
is ~50 VLM calls per episode. `language_persistent` takes tens of KB per
episode at most (parquet dictionary-encodes one entry per episode);
`language_events` is empty on most frames and is bounded by the number of
emissions, not `num_frames × num_emissions`.

## Reproducibility via seed and prompt hashes

`--seed` (default 1729) feeds the per-episode RNGs that select interjection
timestamps and VQA question types. Combined with the deterministic prompt
templates checked into `prompts/`, two runs at the same seed against the
same dataset and the same model checkpoint produce byte-identical staging
artifacts. Prompt edits are recorded by file hash; future tooling can pin
expected `(seed, prompt_hash)` pairs into the dataset card.
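
One possible way to obtain per-episode determinism and a prompt hash is sketched below. How the pipeline actually derives its RNGs and hashes is not specified on this page, so the derivation (SHA-256 over `seed:episode_index`), the helper names, and the use of SHA-256 for prompt hashing are all assumptions.

```python
import hashlib
import random

QUESTION_TYPES = ["bbox", "keypoint", "count", "attribute", "spatial"]


def episode_rng(seed: int, episode_index: int) -> random.Random:
    """Derive an independent, repeatable RNG for one episode."""
    digest = hashlib.sha256(f"{seed}:{episode_index}".encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def pick_vqa_types(seed: int, episode_index: int, count: int) -> list[str]:
    """Choose question types for an episode; same inputs give the same choices."""
    rng = episode_rng(seed, episode_index)
    return [rng.choice(QUESTION_TYPES) for _ in range(count)]


def prompt_hash(template_text: str) -> str:
    """Hash a prompt template so that edits to it can be detected later."""
    return hashlib.sha256(template_text.encode("utf-8")).hexdigest()


first_run = pick_vqa_types(1729, episode_index=3, count=5)
second_run = pick_vqa_types(1729, episode_index=3, count=5)
hash_before = prompt_hash("Watch the whole clip, then segment it.")
hash_after = prompt_hash("Watch the whole clip, then segment it carefully.")
print(first_run == second_run, hash_before != hash_after)
```

Seeding each episode separately means the choices for one episode do not depend on how many episodes were processed before it, so results stay stable when a run is split across workers.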