From 7bec991cdf0ff19e00cf9c370eab126c43cfa651 Mon Sep 17 00:00:00 2001
From: Pepijn <pepijn@huggingface.co>
Date: Thu, 4 Jun 2026 11:48:59 +0200
Subject: [PATCH] docs(annotate): friendlier rewrite + architecture diagram;
 drop reproducibility section

Rewrite annotation_pipeline.mdx in plainer, easier-to-read language
(shorter sentences, active voice, a plain-text intro), add an ASCII
'How it fits together' architecture diagram, and remove the
'Reproducibility via seed and prompt hashes' section. All other content
and links are preserved; elsewhere only wording and structure change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/source/annotation_pipeline.mdx | 267 +++++++++++++++-------------
 1 file changed, 148 insertions(+), 119 deletions(-)
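
Reviewer note: the rewritten page keeps the per-episode staging layout
`<root>/.annotate_staging/episode_{N:06d}/<module>.jsonl`. As a quick way to
sanity-check that format string, here is a minimal sketch. The helper name
`staging_path` and the example root are hypothetical, and it assumes the
`<module>` placeholder is filled with the documented module names (`plan`,
`interjections`, `vqa`). It is not the pipeline's own code.

```python
from pathlib import Path

# Module names as listed in the page; assumed to double as file stems.
MODULES = ("plan", "interjections", "vqa")


def staging_path(root: str, episode_index: int, module: str) -> Path:
    """Build the staging JSONL path for one module of one episode.

    Mirrors the documented layout:
    <root>/.annotate_staging/episode_{N:06d}/<module>.jsonl
    """
    episode_dir = f"episode_{episode_index:06d}"  # zero-padded to 6 digits
    return Path(root) / ".annotate_staging" / episode_dir / f"{module}.jsonl"


if __name__ == "__main__":
    # Hypothetical dataset root, used only for illustration.
    for name in MODULES:
        print(staging_path("/data/my_dataset", 42, name).as_posix())
```

Because each module owns exactly one file per episode under this layout,
re-running a single module only needs to touch its own path.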

diff --git a/docs/source/annotation_pipeline.mdx b/docs/source/annotation_pipeline.mdx
index a2d38e417..3fae61627 100644
--- a/docs/source/annotation_pipeline.mdx
+++ b/docs/source/annotation_pipeline.mdx
@@ -1,177 +1,206 @@
 # Annotation Pipeline
 
-`lerobot-annotate` populates the two language columns introduced by the
+`lerobot-annotate` watches each episode's video with a vision-language
+model (VLM) and writes natural-language annotations back into your
+dataset. It fills the two language columns from the
 [Language Columns and Recipes](./language_and_recipes) page —
-`language_persistent` and `language_events` — directly into
+`language_persistent` and `language_events` — straight into
 `data/chunk-*/file-*.parquet`.
 
+In short: point it at a LeRobot dataset, and it adds subtasks, plans,
+memory, interjections, speech, and visual Q&A that a policy can be
+trained on.
+
+## How it fits together
+
+```text
+  your dataset                lerobot-annotate
+  (LeRobot v3.1)        ┌──────────────────────────────────┐
+        │              │   read episodes                    │
+        └─────────────▶│        │                           │
+                       │        ▼                           │
+   one shared          │   ┌──────┐ ┌─────────────┐ ┌─────┐ │  each module writes
+   Qwen-VL server ────▶│   │ plan │ │interjections│ │ vqa │ │  raw JSONL into
+   (vLLM, OpenAI API)  │   └──┬───┘ └──────┬──────┘ └──┬──┘ │  .annotate_staging/
+                       │      └────────────┼───────────┘    │
+                       │                   ▼                 │
+                       │               validator             │  checks everything
+                       │                   │                 │
+                       │                   ▼                 │
+                       │                writer ──────────────┼─▶ data/chunk-*/file-*.parquet
+                       └──────────────────────────────────┘     (+ meta/info.json tools)
+```
+
+Three modules (`plan`, `interjections`, `vqa`) all talk to **one** shared
+VLM. Each module stages its output to disk, a validator checks it, and a
+single writer rewrites the dataset shards in place.
+
 ## What the pipeline produces
 
-Three modules write into a per-episode staging tree, then a single writer
-rewrites the data shards in place:
+Each module emits a few kinds of annotation ("styles"), routed to one of
+the two language columns:
 
 | Style / atom                                | Column                | Module          |
 | ------------------------------------------- | --------------------- | --------------- |
 | `subtask` (Pi0.7-style "how, not what")     | `language_persistent` | `plan`          |
 | `plan` (initial + refresh on interjection)  | `language_persistent` | `plan`          |
 | `memory` (MEM-style compression)            | `language_persistent` | `plan`          |
-| `task_aug` (rephrasings of canonical task)  | `language_persistent` | `plan`          |
+| `task_aug` (rephrasings of the task)        | `language_persistent` | `plan`          |
 | `interjection`                              | `language_events`     | `interjections` |
 | speech tool-call atom (`style=null`, `say`) | `language_events`     | `interjections` |
 | `vqa` (user / assistant pair)               | `language_events`     | `vqa`           |
 
-The `plan` module generates subtasks per episode with a **describe → segment**
-grounding flow: a first pass narrates only what is visible in the chosen
-camera, and its description is fed into a second pass that segments the
-episode into consecutive atomic subtasks. The resulting spans are then
-deterministically stitched into a contiguous full-episode cover so every
-frame has exactly one active subtask. See
+### How subtasks are generated
+
+The `plan` module doesn't ask the VLM for subtasks in one shot. Instead
+it uses a two-step **describe → segment** flow:
+
+1. **Describe** — the VLM narrates only what it actually sees in the
+   chosen camera (no guessing about the task).
+2. **Segment** — that description is fed back in, and the VLM splits the
+   episode into consecutive atomic subtasks.
+
+The resulting spans are then stitched into a gap-free, full-episode
+cover, so **every frame has exactly one active subtask**. See
 [`run_hf_job.py`](https://github.com/huggingface/lerobot/blob/main/examples/annotations/run_hf_job.py)
-for the production flag set (single camera, embedded frames, windowed
+for the production settings (single camera, embedded frames, windowed
 subtask generation).
 
-The writer does **not** add a `tools` column to the parquet — the tool
-catalog lives at `meta/info.json["tools"]` instead (see
-[Tools](./tools)). After every annotation run the pipeline ensures the
-canonical `say` schema is present in that list, preserving any tools the
-user pre-declared.
+### Tools
 
-If you want to declare additional tools for a dataset before annotation
-runs, edit `meta/info.json["tools"]` directly — the pipeline preserves
-anything already there. That makes the tool visible to the chat template
-so the model can learn to _generate_ the call. The runtime layer that
-_executes_ a generated call (the `Tool` protocol / `TOOL_REGISTRY` under
-`src/lerobot/tools/`) is not part of this PR — see the
-[Tools](./tools) doc, which marks those pieces as not-yet-implemented.
+The writer does **not** add a `tools` column to the parquet. The tool
+catalog lives in `meta/info.json["tools"]` instead (see [Tools](./tools)).
+After every run, the pipeline makes sure the canonical `say` schema is in
+that list, keeping any tools you declared beforehand.
+
+Want to add your own tool? Edit `meta/info.json["tools"]` directly — the
+pipeline preserves whatever is already there. That makes the tool visible
+to the chat template, so the model can learn to _generate_ the call. The
+runtime layer that actually _executes_ a generated call (the `Tool`
+protocol / `TOOL_REGISTRY` under `src/lerobot/tools/`) is not included
+yet; the [Tools](./tools) doc marks those pieces as
+not-yet-implemented.
 
 ## Running on Hugging Face Jobs
 
-Distributed annotation is delegated to
-[Hugging Face Jobs](https://huggingface.co/docs/hub/en/jobs). The repo
-ships a launcher script you copy and edit for your dataset:
+Annotation runs on [Hugging Face Jobs](https://huggingface.co/docs/hub/en/jobs).
+The repo ships a launcher script you copy and tweak for your dataset:
 
 ```bash
 HF_TOKEN=hf_... uv run python examples/annotations/run_hf_job.py
 ```
 
-[`examples/annotations/run_hf_job.py`](https://github.com/huggingface/lerobot/blob/main/examples/annotations/run_hf_job.py)
-spawns a single-GPU `h200` job (scale up to `h200x4` for larger datasets) that:
+[`run_hf_job.py`](https://github.com/huggingface/lerobot/blob/main/examples/annotations/run_hf_job.py)
+starts a single-GPU `h200` job (bump it to `h200x4` for big datasets)
+that:
 
 1. installs `lerobot` (from `main`) plus the annotation extras,
-2. boots one vLLM server per GPU (in the `vllm/vllm-openai` image) for the
-   chosen model, which the pipeline drives over the OpenAI-compatible API,
+2. boots one vLLM server per GPU (using the `vllm/vllm-openai` image) and
+   drives it over the OpenAI-compatible API,
 3. runs the `plan` / `interjections` / `vqa` modules across the dataset
-   via `lerobot-annotate`,
-4. with `--push_to_hub=true`, uploads the annotated dataset to
-   `--new_repo_id` (or back to `--repo_id` in place when that is unset).
+   with `lerobot-annotate`,
+4. with `--push_to_hub=true`, uploads the result to `--new_repo_id` (or
+   back to `--repo_id` in place if `--new_repo_id` is unset).
 
-To target a different dataset, model, or hub repo, edit the `CMD` block
-inside the script — every flag in there maps directly onto a CLI flag of
-`lerobot-annotate` (see `lerobot-annotate --help` for the full list).
+To use a different dataset, model, or hub repo, edit the `CMD` block in
+the script. Every flag there maps directly to a `lerobot-annotate` flag
+(run `lerobot-annotate --help` for the full list).
 
 ## Contributing new modules
 
-The pipeline is built to be extended, and **contributions are very
-welcome** — whether that's a brand-new annotation module (e.g. a
-trajectory-trace or affordance module), a new prompt template, a better
-grounding flow, or quality improvements to the existing `plan` /
-`interjections` / `vqa` modules. Each module lives under
+The pipeline is built to grow, and **contributions are very welcome** —
+a brand-new module (say, trajectory traces or affordances), a new prompt
+template, a smarter grounding flow, or quality fixes to the existing
+`plan` / `interjections` / `vqa` modules.
+
+Every module lives under
 `src/lerobot/annotations/steerable_pipeline/modules/`, shares the VLM
-client and keyframe cache, writes its raw output to the per-episode
-staging tree, and is wired into the executor as an independent phase.
-If you have an idea for a module or an improvement, open an issue or PR
-on [the repo](https://github.com/huggingface/lerobot).
+client and the keyframe cache, writes its raw output to the staging
+tree, and plugs into the executor as its own phase. Got an idea? Open an
+issue or PR on [the repo](https://github.com/huggingface/lerobot).
 
-## Style-to-recipe consumer mapping
+## How recipes consume the output
 
-The pipeline's outputs are designed to be consumed by recipes (see
-[Language Columns and Recipes](./language_and_recipes)) — typically:
+The annotations are meant to be read by recipes (see
+[Language Columns and Recipes](./language_and_recipes)). Typically:
 
-- low-level / high-level / memory-update branches consume
-  `subtask`/`plan`/`memory` from `language_persistent`.
-- An interjection-response branch consumes `interjection` events plus
-  the paired speech atom (merged into one assistant target turn via
-  `tool_calls_from`) and the same-timestamp `plan` refresh.
-- A VQA branch consumes the `(vqa, user)` and `(vqa, assistant)` pairs
-  from `language_events`.
+- low-level / high-level / memory-update branches read
+  `subtask` / `plan` / `memory` from `language_persistent`.
+- an interjection-response branch reads `interjection` events plus the
+  paired speech atom (merged into one assistant turn via `tool_calls_from`)
+  and the matching `plan` refresh at the same timestamp.
+- a VQA branch reads the `(vqa, user)` and `(vqa, assistant)` pairs from
+  `language_events`.
 
-## Why the design splits state from events
+## Why state and events are split
 
-Two things drive the scope:
+Two ideas shape the design:
 
-1. **Persistent state vs exact-event split.** Persistent rows
-   (`subtask`, `plan`, `memory`) broadcast per episode and answer "what
-   state is in force at this frame?". Event rows (`interjection`, `vqa`,
-   speech) only appear on the exact frame whose timestamp matches the
-   emission. The pipeline writes timestamps taken straight from the
-   source parquet — no floating-point recomputation.
-2. **One Qwen-VL pass.** All three modules share a single VLM client (the
-   OpenAI-compatible client talking to the job's vLLM server) so the cost
-   is one model load per dataset, not three.
+1. **Persistent state vs. exact events.** Persistent rows (`subtask`,
+   `plan`, `memory`) are broadcast per episode and answer "what state is
+   in force at this frame?". Event rows (`interjection`, `vqa`, speech)
+   appear only on the frame whose timestamp matches the emission. Timestamps
+   are copied from the source parquet, never recomputed in floating point.
+2. **One VLM pass.** All three modules share a single VLM client (the
+   OpenAI-compatible client talking to the job's vLLM server), so you pay
+   for one model load per dataset, not three.
 
-## Module independence and staged reruns
+## Re-running a single module
 
-Each module writes its raw output to
-`<root>/.annotate_staging/episode_{N:06d}/<module>.jsonl`. That makes
-prompt iteration cheap — re-running one module overwrites only its own
-JSONL file before the writer composes the final parquet. Modules can be
-disabled via `--plan.enabled=false` (and likewise `--interjections.enabled`
-/ `--vqa.enabled`) to
-test them in isolation.
+Each module stages its raw output to
+`<root>/.annotate_staging/episode_{N:06d}/<module>.jsonl`. This makes
+prompt iteration cheap: re-running one module overwrites only its own
+JSONL, then the writer recomposes the final parquet. Disable modules you
+don't want with `--plan.enabled=false` (and likewise
+`--interjections.enabled` / `--vqa.enabled`) to test one at a time.
 
-## Validation/report checks before final write
+## What the validator checks
 
-Before the writer runs, `StagingValidator` checks:
+Before the writer runs, `StagingValidator` confirms:
 
-- exact frame-timestamp alignment for every event row;
-- no orphan speech / interjection pairs;
+- every event row lands exactly on a real frame timestamp;
+- no speech / interjection pairs are left orphaned;
 - `plan` is refreshed at every interjection timestamp;
-- `memory` rows fall on subtask boundaries (warning, not error);
-- VQA assistant `content` parses as JSON in one of the
+- `memory` rows fall on subtask boundaries (a warning, not an error);
+- each VQA assistant `content` is valid JSON in one of the
   bbox / keypoint / count / attribute / spatial shapes;
-- every row routes to the column dictated by `column_for_style(style)`.
+- every row goes to the column chosen by `column_for_style(style)`.
 
-Errors abort the writer (`--skip_validation=true` overrides for debugging).
+Any error aborts the writer. Pass `--skip_validation=true` to override
+while debugging.
 
-## Paper inspirations per module
+## Where each module's ideas come from
 
-- **`plan` module — subtasks.** Hi Robot ([Shi 2025](https://arxiv.org/abs/2502.19417))
-  atom granularity ("pick up one piece of lettuce", "place bowl to box");
-  Pi0.7 ([Physical Intelligence 2025](https://pi.website/pi07)) "how, not
-  what" detail.
-- **`plan` module — memory.** MEM ([Torne 2026](https://arxiv.org/abs/2603.03596))
-  compression directive: keep only minimal relevant information; functional
-  outcomes preserved, specific attributes dropped.
-- **`interjections` module.** Hi Robot scenario taxonomy: negative task,
+- **`plan` — subtasks.** Hi Robot ([Shi 2025](https://arxiv.org/abs/2502.19417))
+  for atom granularity ("pick up one piece of lettuce", "place bowl to
+  box"); Pi0.7 ([Physical Intelligence 2025](https://pi.website/pi07))
+  for "how, not what" detail.
+- **`plan` — memory.** MEM ([Torne 2026](https://arxiv.org/abs/2603.03596)):
+  keep only the minimal relevant information — preserve outcomes, drop
+  specific attributes.
+- **`interjections`.** Hi Robot's scenario taxonomy: negative task,
   situated correction, specific constraint, preference. Speech is a
-  tool-call-only atom (`tool_calls=[{type:function, function:{name:"say",
-arguments:{text:...}}}]`).
-- **`vqa` module.** ECoT ([Zawalski 2024](https://arxiv.org/abs/2407.08693))
-  grounded features (bounding boxes in pixel `[x_min, y_min, x_max, y_max]`,
-  keypoints) and Steerable VLA Policies ([Zhao 2025](https://arxiv.org/abs/2509.07626))
-  multi-abstraction grounding. Pi0.7 also grounds answers across
-  multiple abstraction levels.
+  tool-call-only atom
+  (`tool_calls=[{type:function, function:{name:"say", arguments:{text:...}}}]`).
+- **`vqa`.** ECoT ([Zawalski 2024](https://arxiv.org/abs/2407.08693)) for
+  grounded features (pixel bounding boxes `[x_min, y_min, x_max, y_max]`,
+  keypoints) and Steerable VLA Policies
+  ([Zhao 2025](https://arxiv.org/abs/2509.07626)) for multi-abstraction
+  grounding. Pi0.7 also grounds answers across abstraction levels.
 
-Future maintainers should adjust the prompt templates in
-`src/lerobot/annotations/steerable_pipeline/prompts/` against these
-references rather than rewriting from scratch.
+When improving a module, adjust its prompt template in
+`src/lerobot/annotations/steerable_pipeline/prompts/` against these
+references rather than rewriting from scratch.
 
-## Compute and list-size estimates
+## Roughly how much it costs
 
-Per episode, the pipeline issues O(`max_steps`) `plan`-module calls,
-O(`max_interjections_per_episode`) `interjections`-module calls, and
-O(`vqa_emission_hz × episode_seconds`) `vqa`-module calls. With defaults
-(8 subtasks, 1 interjection, 1 Hz × 3 pairs) and 30-second episodes, that
-is ~50 VLM calls per episode. `language_persistent` per episode is ~10s of
-KB at most (parquet dictionary-encodes one entry per episode);
-`language_events` is empty on most frames and is bounded by the number of
-emissions, not `num_frames × num_emissions`.
+Per episode, the pipeline makes about `max_steps` plan calls,
+`max_interjections_per_episode` interjection calls, and
+`vqa_emission_hz × episode_seconds` VQA calls. With the defaults (8
+subtasks, 1 interjection, 1 Hz × 3 pairs) on a 30-second episode, that's
+~50 VLM calls.
 
-## Reproducibility via seed and prompt hashes
-
-`--seed` (default 1729) feeds the per-episode RNGs that select interjection
-timestamps and VQA question types. Combined with the deterministic prompt
-templates checked into `prompts/`, two runs at the same seed against the
-same dataset and the same model checkpoint produce byte-identical staging
-artifacts. Prompt edits are recorded by file hash; future tooling can pin
-expected `(seed, prompt_hash)` pairs into the dataset card.
+Storage stays small: `language_persistent` is at most tens of KB per
+episode (parquet dictionary-encodes the one entry that repeats across
+frames), and `language_events` is empty on most frames — its size scales
+with the number of emissions, not `num_frames × num_emissions`.
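
Note on the "How subtasks are generated" section above: the page says the
segmented spans are stitched into a gap-free, full-episode cover so that every
frame has exactly one active subtask, but it does not show how. The sketch
below is one plausible way to do that stitching, not the pipeline's actual
algorithm. The `Span` type, the `stitch` function, and the half-open interval
convention are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Span:
    start: float  # seconds, inclusive
    end: float    # seconds, exclusive
    text: str     # subtask description


def stitch(spans, episode_start, episode_end):
    """Turn possibly gapped or overlapping spans into a contiguous cover.

    Assumed rule: each span runs from the end of the previous stitched span
    to the start of the next raw span; the first span is pulled back to the
    episode start and the last is extended to the episode end.
    """
    ordered = sorted(spans, key=lambda s: s.start)
    stitched = []
    for i, span in enumerate(ordered):
        start = stitched[-1].end if stitched else episode_start
        is_last = i == len(ordered) - 1
        end = episode_end if is_last else ordered[i + 1].start
        end = min(max(end, start), episode_end)  # clamp into the episode
        if end > start:  # drop spans that collapse to zero length
            stitched.append(Span(start, end, span.text))
    return stitched


def active_subtask(stitched, t):
    """Return the single subtask active at time t, or None outside the episode."""
    for span in stitched:
        if span.start <= t < span.end:
            return span.text
    return None


if __name__ == "__main__":
    raw = [Span(1.0, 4.0, "reach"), Span(5.0, 9.0, "grasp"), Span(8.5, 12.0, "place")]
    for s in stitch(raw, 0.0, 15.0):
        print(s)
```

With half-open intervals, a timestamp that sits exactly on a boundary belongs
to the later subtask, which is what keeps the "exactly one active subtask"
property unambiguous in this sketch.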