
# Annotation Pipeline

`lerobot-annotate` populates the two language columns introduced by the
[Language Columns and Recipes](./language_and_recipes) page,
`language_persistent` and `language_events`, and writes them directly into
`data/chunk-*/file-*.parquet`.

## What the pipeline produces

Three modules write into a per-episode staging tree, then a single writer
rewrites the data shards in place:

| Style / atom                                | Column                | Module          |
| ------------------------------------------- | --------------------- | --------------- |
| `subtask` (Pi0.7-style "how, not what")     | `language_persistent` | `plan`          |
| `plan` (initial + refresh on interjection)  | `language_persistent` | `plan`          |
| `memory` (MEM-style compression)            | `language_persistent` | `plan`          |
| `task_aug` (rephrasings of canonical task)  | `language_persistent` | `plan`          |
| `interjection`                              | `language_events`     | `interjections` |
| speech tool-call atom (`style=null`, `say`) | `language_events`     | `interjections` |
| `vqa` (user / assistant pair)               | `language_events`     | `vqa`           |

The `plan` module generates subtasks per episode with a **describe → segment**
grounding flow: a first pass narrates only what is visible in the chosen
camera, and its description is fed into a second pass that segments the
episode into consecutive atomic subtasks. The resulting spans are then
deterministically stitched into a contiguous full-episode cover so every
frame has exactly one active subtask. See
[`run_hf_job.py`](https://github.com/huggingface/lerobot/blob/main/examples/annotations/run_hf_job.py)
for the production flag set (single camera, embedded frames, windowed
subtask generation).
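
The stitching step can be illustrated with a minimal sketch. The `Span` type, the `stitch_spans` helper, and the rule used here (each subtask runs until the next one starts, and the first and last are stretched to the episode bounds) are assumptions made for illustration; the actual stitching rule in the `plan` module may differ.

```python
from dataclasses import dataclass


@dataclass
class Span:
    start: float  # seconds from episode start
    end: float
    text: str


def stitch_spans(spans, episode_start, episode_end):
    """Turn gappy or overlapping spans into a contiguous full-episode cover."""
    ordered = sorted(spans, key=lambda s: s.start)
    if not ordered:
        return []
    # Clamp every start into the episode, then pin the first one to the episode start.
    starts = [min(max(s.start, episode_start), episode_end) for s in ordered]
    starts[0] = episode_start
    bounds = starts + [episode_end]
    stitched = []
    for span, start, end in zip(ordered, bounds, bounds[1:]):
        if end > start:  # drop spans that collapse to zero length
            stitched.append(Span(start, end, span.text))
    return stitched


raw = [
    Span(0.4, 2.0, "reach for the cup"),    # leaves a gap before 2.5
    Span(2.5, 5.0, "grasp the cup"),        # overlaps the next span
    Span(4.8, 9.0, "place the cup on the tray"),
]
stitched = stitch_spans(raw, episode_start=0.0, episode_end=10.0)
for span in stitched:
    print(f"{span.start:5.1f} -> {span.end:5.1f}  {span.text}")
```

Because every span ends exactly where the next one begins, each timestamp in the episode falls into one subtask only.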

The writer does **not** add a `tools` column to the parquet — the tool
catalog lives at `meta/info.json["tools"]` instead (see
[Tools](./tools)). After every annotation run the pipeline ensures the
canonical `say` schema is present in that list, preserving any tools the
user pre-declared.

If you want to declare additional tools for a dataset before annotation
runs, edit `meta/info.json["tools"]` directly; the pipeline preserves
anything already there. Declaring a tool makes it visible to the chat
template so the model can learn to _generate_ the call. The runtime layer
that _executes_ a generated call (the `Tool` protocol / `TOOL_REGISTRY`
under `src/lerobot/tools/`) is outside the scope of the annotation
pipeline; see the [Tools](./tools) doc, which marks those pieces as
not-yet-implemented.
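
As a rough sketch, pre-declaring a tool could look like the snippet below. The `declare_tool` helper and the `open_gripper` tool are hypothetical, and the OpenAI function-calling layout of the entry is an assumption; check the [Tools](./tools) doc for the schema the dataset actually expects.

```python
import json
import tempfile
from pathlib import Path


def declare_tool(info_path: Path, tool: dict) -> list:
    """Append `tool` to info.json["tools"] unless a tool with that name already exists."""
    info = json.loads(info_path.read_text())
    tools = info.setdefault("tools", [])
    declared = {t.get("function", {}).get("name") for t in tools}
    if tool["function"]["name"] not in declared:
        tools.append(tool)
    info_path.write_text(json.dumps(info, indent=2))
    return tools


# Hypothetical tool entry; the layout is assumed, not taken from the repo.
open_gripper = {
    "type": "function",
    "function": {
        "name": "open_gripper",
        "description": "Open the gripper fully.",
        "parameters": {"type": "object", "properties": {}, "required": []},
    },
}

# Demo against a throwaway file standing in for <root>/meta/info.json.
with tempfile.TemporaryDirectory() as tmp:
    info_path = Path(tmp) / "info.json"
    info_path.write_text(json.dumps({"fps": 30}))  # stand-in metadata
    declare_tool(info_path, open_gripper)
    tools = declare_tool(info_path, open_gripper)  # second call is a no-op
    tool_names = [t["function"]["name"] for t in tools]

print(tool_names)
```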

## Running on Hugging Face Jobs

Distributed annotation is delegated to
[Hugging Face Jobs](https://huggingface.co/docs/hub/en/jobs). The repo
ships a launcher script you copy and edit for your dataset:

```bash
HF_TOKEN=hf_... uv run python examples/annotations/run_hf_job.py
```

[`examples/annotations/run_hf_job.py`](https://github.com/huggingface/lerobot/blob/main/examples/annotations/run_hf_job.py)
spawns a single-GPU `h200` job (scale up to `h200x4` for larger datasets) that:

1. installs the branch under test plus the annotation extras,
2. boots one vLLM server per GPU (in the `vllm/vllm-openai` image) for the
   chosen model, which the pipeline drives over the OpenAI-compatible API,
3. runs the `plan` / `interjections` / `vqa` modules across the dataset
   via `lerobot-annotate`,
4. with `--push_to_hub=true`, uploads the annotated dataset to
   `--new_repo_id` (or back to `--repo_id` in place when that is unset).

To target a different dataset, model, or hub repo, edit the `CMD` block
inside the script — every flag in there maps directly onto a CLI flag of
`lerobot-annotate` (see `lerobot-annotate --help` for the full list).

## Contributing new modules

The pipeline is built to be extended, and **contributions are very
welcome** — whether that's a brand-new annotation module (e.g. a
trajectory-trace or affordance module), a new prompt template, a better
grounding flow, or quality improvements to the existing `plan` /
`interjections` / `vqa` modules. Each module lives under
`src/lerobot/annotations/steerable_pipeline/modules/`, shares the VLM
client and keyframe cache, writes its raw output to the per-episode
staging tree, and is wired into the executor as an independent phase.
If you have an idea for a module or an improvement, open an issue or PR
on [the repo](https://github.com/huggingface/lerobot).

## Style-to-recipe consumer mapping

The pipeline's outputs are designed to be consumed by recipes (see
[Language Columns and Recipes](./language_and_recipes)) — typically:

- Low-level / high-level / memory-update branches consume
  `subtask`/`plan`/`memory` from `language_persistent`.
- An interjection-response branch consumes `interjection` events plus
  the paired speech atom (merged into one assistant target turn via
  `tool_calls_from`) and the same-timestamp `plan` refresh.
- A VQA branch consumes the `(vqa, user)` and `(vqa, assistant)` pairs
  from `language_events`.

## Why the design splits state from events

Two things drive the design:

1. **Persistent state vs exact-event split.** Persistent rows
   (`subtask`, `plan`, `memory`) broadcast per episode and answer "what
   state is in force at this frame?". Event rows (`interjection`, `vqa`,
   speech) only appear on the exact frame whose timestamp matches the
   emission. The pipeline writes timestamps taken straight from the
   source parquet — no floating-point recomputation.
2. **One Qwen-VL pass.** All three modules share a single VLM client (the
   OpenAI-compatible client talking to the job's vLLM server) so the cost
   is one model load per dataset, not three.
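
The difference between the two lookups can be sketched as follows. The row shape (`style`, `timestamp`, `content`) and both helper names are assumptions for illustration, not the pipeline's API.

```python
def active_state(persistent_rows, style, t):
    """Latest persistent row of `style` at or before time t, or None."""
    candidates = [
        r for r in persistent_rows if r["style"] == style and r["timestamp"] <= t
    ]
    return max(candidates, key=lambda r: r["timestamp"], default=None)


def events_at(event_rows, t):
    """Event rows whose timestamp matches the frame timestamp exactly."""
    return [r for r in event_rows if r["timestamp"] == t]


persistent = [
    {"style": "subtask", "timestamp": 0.0, "content": "reach for the cup"},
    {"style": "subtask", "timestamp": 2.0, "content": "lift the cup"},
]
events = [
    {"style": "interjection", "timestamp": 2.0, "content": "use the red cup instead"},
]

print(active_state(persistent, "subtask", 1.5)["content"])
print(events_at(events, 1.5))
print(events_at(events, 2.0))
```

The exact `==` comparison on floats is only safe because the timestamps are copied from the source parquet rather than recomputed.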

## Module independence and staged reruns

Each module writes its raw output to
`<root>/.annotate_staging/episode_{N:06d}/<module>.jsonl`. That makes
prompt iteration cheap — re-running one module overwrites only its own
JSONL file before the writer composes the final parquet. To test a module
in isolation, disable the others with `--plan.enabled=false` (and likewise
`--interjections.enabled` / `--vqa.enabled`).
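
For orientation, the staging path pattern above expands as in this small sketch. The `staging_file` helper is hypothetical; only the path layout comes from the text.

```python
from pathlib import Path


def staging_file(root, episode_index, module):
    """Build <root>/.annotate_staging/episode_{N:06d}/<module>.jsonl."""
    return (
        Path(root)
        / ".annotate_staging"
        / f"episode_{episode_index:06d}"
        / f"{module}.jsonl"
    )


vqa_path = staging_file("my_dataset", 42, "vqa")
print(vqa_path.as_posix())
```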

## Validation/report checks before final write

Before the writer runs, `StagingValidator` checks:

- exact frame-timestamp alignment for every event row;
- no orphan speech / interjection pairs;
- `plan` is refreshed at every interjection timestamp;
- `memory` rows fall on subtask boundaries (warning, not error);
- VQA assistant `content` parses as JSON in one of the
  bbox / keypoint / count / attribute / spatial shapes;
- every row routes to the column dictated by `column_for_style(style)`.

Errors abort the writer; pass `--skip_validation=true` to override this
when debugging.
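
Two of these checks are simple enough to sketch standalone. The functions below are hypothetical and are not `StagingValidator`'s API; the VQA check only tests that the content parses as JSON, since the per-shape schemas are not spelled out here.

```python
import json


def misaligned_events(event_rows, frame_timestamps):
    """Return event rows whose timestamp matches no frame timestamp exactly."""
    valid = set(frame_timestamps)
    return [r for r in event_rows if r["timestamp"] not in valid]


def vqa_content_parses(content):
    """True if the assistant content is valid JSON (shape check omitted)."""
    try:
        json.loads(content)
    except json.JSONDecodeError:
        return False
    return True


frames = [0.0, 0.1, 0.2]
events = [
    {"style": "vqa", "timestamp": 0.1},
    {"style": "vqa", "timestamp": 0.15},  # falls between two frames
]
bad_rows = misaligned_events(events, frames)
print(bad_rows)
print(vqa_content_parses('{"count": 3}'), vqa_content_parses("three cups"))
```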

## Paper inspirations per module

- **`plan` module — subtasks.** Hi Robot ([Shi 2025](https://arxiv.org/abs/2502.19417))
  atom granularity ("pick up one piece of lettuce", "place bowl to box");
  Pi0.7 ([Physical Intelligence 2025](https://pi.website/pi07)) "how, not
  what" detail.
- **`plan` module — memory.** MEM ([Torne 2026](https://arxiv.org/abs/2603.03596))
  compression directive: keep only minimal relevant information; functional
  outcomes preserved, specific attributes dropped.
- **`interjections` module.** Hi Robot scenario taxonomy: negative task,
  situated correction, specific constraint, preference. Speech is a
  tool-call-only atom (`tool_calls=[{type:function, function:{name:"say",
arguments:{text:...}}}]`).
- **`vqa` module.** ECoT ([Zawalski 2024](https://arxiv.org/abs/2407.08693))
  grounded features (bounding boxes in pixel `[x_min, y_min, x_max, y_max]`,
  keypoints) and Steerable VLA Policies ([Zhao 2025](https://arxiv.org/abs/2509.07626))
  multi-abstraction grounding. Pi0.7 also grounds answers across
  multiple abstraction levels.

Future maintainers should adjust the prompt templates in
`src/lerobot/annotations/steerable_pipeline/prompts/` against these
references rather than rewriting from scratch.

## Compute and list-size estimates

Per episode, the pipeline issues O(`max_steps`) `plan`-module calls,
O(`max_interjections_per_episode`) `interjections`-module calls, and
O(`vqa_emission_hz × episode_seconds`) `vqa`-module calls. With defaults
(8 subtasks, 1 interjection, 1 Hz × 3 pairs) and 30-second episodes, that
is ~50 VLM calls per episode. `language_persistent` per episode is tens of
KB at most (parquet dictionary-encodes one entry per episode);
`language_events` is empty on most frames and is bounded by the number of
emissions, not `num_frames × num_emissions`.
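
A back-of-the-envelope version of the call count, assuming every O() term contributes exactly one call per unit (an assumption; the real constants are not given here), looks like this:

```python
def estimate_vlm_calls(max_steps, max_interjections, vqa_hz, episode_seconds):
    """Sum the leading terms of the per-episode call count, all constants set to 1."""
    plan_calls = max_steps
    interjection_calls = max_interjections
    vqa_calls = round(vqa_hz * episode_seconds)
    return plan_calls + interjection_calls + vqa_calls


calls = estimate_vlm_calls(
    max_steps=8, max_interjections=1, vqa_hz=1.0, episode_seconds=30.0
)
print(calls)
```

With the defaults this gives 39, the same order of magnitude as the ~50 quoted above; the simplified count ignores any constant factors hidden by the O() notation.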

## Reproducibility via seed and prompt hashes

`--seed` (default 1729) feeds the per-episode RNGs that select interjection
timestamps and VQA question types. Combined with the deterministic prompt
templates checked into `prompts/`, two runs at the same seed against the
same dataset and the same model checkpoint produce byte-identical staging
artifacts. Prompt edits are recorded by file hash; future tooling can pin
expected `(seed, prompt_hash)` pairs into the dataset card.
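
A minimal sketch of the two ingredients, under stated assumptions: SHA-256 as the hash and the seed-mixing rule in `episode_rng` are both illustrative choices, not the pipeline's confirmed scheme.

```python
import hashlib
import random


def prompt_hash(prompt_bytes: bytes) -> str:
    """Hex digest of a prompt template's bytes (SHA-256 is an assumed choice)."""
    return hashlib.sha256(prompt_bytes).hexdigest()


def episode_rng(seed: int, episode_index: int) -> random.Random:
    """Per-episode RNG derived from the global seed (hypothetical mixing rule)."""
    return random.Random(seed * 1_000_003 + episode_index)


template = b"Describe only what is visible in the camera view."
pin = (1729, prompt_hash(template))

# The same seed and episode always yield the same draws.
first = episode_rng(1729, 3).random()
second = episode_rng(1729, 3).random()
print(first == second, pin[1][:12])
```

Any edit to the template bytes changes the digest, so a pinned `(seed, prompt_hash)` pair identifies one exact prompt at one seed.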