Files
lerobot-clone/docs/source/annotation_pipeline.mdx

218 lines
10 KiB
Plaintext
Raw Normal View History

# Annotation Pipeline
`lerobot-annotate` watches each episode's video with a vision-language
model (VLM) and writes natural-language annotations back into your
dataset. It fills the two language columns from the
[Language Columns and Recipes](./language_and_recipes) page —
`language_persistent` and `language_events` — straight into
refactor(annotate): delegate distribution to HF Jobs; drop SLURM/local switch The executor previously claimed it would "optionally hand off" to datatrove's LocalPipelineExecutor or SlurmPipelineExecutor — but it already runs phases inline in every code path, and HF Jobs (see ``examples/annotation/run_hf_job.py``) is the actual distribution strategy. Stop pretending we have an executor selector. * `executor.py`: drop `select_executor_class`, the "kind" log line, and the references to LocalPipelineExecutor / SlurmPipelineExecutor. Module docstring now says distribution is delegated to HF Jobs. * `config.py`: drop `auto_threshold`, `force_local`, `slurm_partition`, `slurm_gpus`, `slurm_time`, `workers`. `ExecutorConfig` keeps only `episode_parallelism`. While here, prune the longer "why" docstrings on every field down to the load-bearing bits — full story moves to `docs/source/annotation_pipeline.mdx`. * `pyproject.toml`: drop `datatrove>=0.4.0,<2.0.0` from the `[annotations]` extra; the dep was only there for the (never used) cluster executors. Comment block notes the new HF-Jobs delegation. * `reader.py`, `lerobot_annotate.py`: drop their own datatrove / flavor-namespace mentions. * `docs/source/annotation_pipeline.mdx`: - remove the flavor-namespace / sidecar paragraph (out of scope — "multiple revisions = multiple copies" is dataset-level policy); - remove the "writer drops the legacy `subtask_index` column" note (already covered by PR 1's intentional-break call-out); - remove the chat-template + `apply_chat_template(messages, tools=...)` line (covered by Tools doc); - replace the "executor picks Local vs Slurm" paragraph with `--executor.episode_parallelism` and a pointer to HF Jobs; - rewrite the style→recipe section to talk about "recipes" generically instead of pinning a specific YAML; - add a "Running on Hugging Face Jobs" section pointing at `examples/annotation/run_hf_job.py`; - add a "Running locally" example matching the CLI's docstring (`uv run lerobot-annotate --root=... --vlm.model_id=...`); - extend the paper-inspirations list with Pi0.7 and Steerable VLA Policies (Zhao 2025) for Module 3. Tests: same 3 pre-existing failures as before this commit (2 module assertions still in flight; 1 carryover from PR 1). 41/44 pass. Pre-commit clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 11:09:22 +02:00
`data/chunk-*/file-*.parquet`.
In short: point it at a LeRobot dataset, and it adds subtasks, plans,
memory, interjections, speech, and visual Q&A that a policy can be
trained on.
## How it fits together
```text
your dataset lerobot-annotate
(LeRobot v3.1)
┌─────────────────────────────────────────────────────┐
│ read episodes │
└──────────────────────────┬──────────────────────────┘
┌────────────────────┼────────────────────┐
▼ ▼ ▼
┌──────────┐ ┌───────────────┐ ┌──────────┐ one shared Qwen-VL
│ plan │ │ interjections │ │ vqa │ ◀── server (vLLM, OpenAI
└────┬─────┘ └───────┬───────┘ └────┬─────┘ API) drives all three
└────────────────────┼─────────────────────┘
│ each module stages raw JSONL
▼ into .annotate_staging/
┌─────────────────┐
│ validator │ ◀── checks everything
└────────┬────────┘
┌─────────────────┐
│ writer │
└────────┬────────┘
data/chunk-*/file-*.parquet
(+ meta/info.json tools)
```
Three modules (`plan`, `interjections`, `vqa`) all talk to **one** shared
VLM. Each module stages its output to disk, a validator checks it, and a
single writer rewrites the dataset shards in place.
## What the pipeline produces
Each module emits a few kinds of annotation ("styles"), routed to one of
the two language columns:
| Style / atom | Column | Module |
| ------------------------------------------- | --------------------- | --------------- |
| `subtask` (Pi0.7-style "how, not what") | `language_persistent` | `plan` |
| `plan` (initial + refresh on interjection) | `language_persistent` | `plan` |
| `memory` (MEM-style compression) | `language_persistent` | `plan` |
| `task_aug` (rephrasings of the task) | `language_persistent` | `plan` |
| `interjection` | `language_events` | `interjections` |
| speech tool-call atom (`style=null`, `say`) | `language_events` | `interjections` |
| `vqa` (user / assistant pair) | `language_events` | `vqa` |
### How subtasks are generated
The `plan` module doesn't ask the VLM for subtasks in one shot. Instead
it uses a two-step **describe → segment** flow:
1. **Describe** — the VLM narrates only what it actually sees in the
chosen camera (no guessing about the task).
2. **Segment** — that description is fed back in, and the VLM splits the
episode into consecutive atomic subtasks.
The resulting spans are then stitched into a gap-free, full-episode
cover, so **every frame has exactly one active subtask**. See
annotate: address review feedback — bug fixes, docs/code drift, naming, cleanup Bugs * validator: don't re-raise on unknown style. The second column_for_style lookup (used to route persistent vs event) now sits in try/except so an unknown style is recorded by _check_column_routing and skipped instead of crashing the whole validation pass. * general_vqa._target_cameras: when restrict_to_default_camera is set but the configured camera_key isn't one the provider exposes, warn and fall back to all cameras instead of returning a phantom key that KeyErrors deep in frame decode. * interjections: clamp interjection timestamps to frame_timestamps[0] rather than a hardcoded 0.0 (datasets can start at non-zero t). Docs / code drift * annotation_pipeline.mdx: drop the phantom 'vocabulary discovery / phase 0 / --vocabulary.* / canonical_vocabulary.json' section (none of it exists); describe the real describe->segment + coverage-stitch flow. Soften the src/lerobot/tools/ + TOOL_REGISTRY reference to 'not part of this PR' (matches tools.mdx, which already marks the runtime layer as not-yet-implemented). Fix the --push_to_hub/--new_repo_id wording. Note the default is now a single h200. Add a 'Contributing new modules' section inviting module / prompt / quality contributions. * executor docstring: six phases, no phantom phase 0. run_hf_job.py * add the Apache 2.0 license header (was flagged repeatedly). * default to a single GPU: flavor=h200, parallel_servers=1, num_gpus=1 (scale to h200x4 noted in the docstring). * pin the install to @main instead of the feature branch (won't break after merge). Naming / cleanup * rename dest_repo_id -> new_repo_id across config / script / example / test to match the LeRobot dataset edit tools. * rename prompt templates module_N_*.txt -> descriptive (plan_*, interjections_*, vqa.txt) and update every load_prompt() call. * remove dead _messages_to_prompt (used only by the removed in-process backends). * declare _warned_decode_fail (frames) and _warned_no_camera (vqa) as real init=False dataclass fields instead of getattr monkey-patches. * scope bandit B607 to the two ffmpeg subprocess.run sites via '# nosec B607' and drop it from the global skip list. Tests * fix stale canned-VLM markers ('ONE realistic interruption' -> 'compact interjection', 'Update the memory' -> 'compressed semantic memory') and drop the dead 'concise hierarchical PLAN' plan responders (plan generation is deterministic now) in run_e2e_smoke, test_pipeline_recipe_render, test_modules. * run_e2e_smoke now asserts interjection + speech rows are produced so a stale marker can't silently pass again. * drop remaining 'PR 1' / 'PR 2' references from test comments / names. Verified: tests/annotations + tests/datasets/test_language + tests/scripts/test_lerobot_annotate (31 passed); make-style E2E smoke (interjections=1 speech_atoms=2); pre-commit (ruff, mypy, bandit, prettier) clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-03 18:30:46 +02:00
[`run_hf_job.py`](https://github.com/huggingface/lerobot/blob/main/examples/annotations/run_hf_job.py)
for the production settings (single camera, embedded frames, windowed
annotate: address review feedback — bug fixes, docs/code drift, naming, cleanup Bugs * validator: don't re-raise on unknown style. The second column_for_style lookup (used to route persistent vs event) now sits in try/except so an unknown style is recorded by _check_column_routing and skipped instead of crashing the whole validation pass. * general_vqa._target_cameras: when restrict_to_default_camera is set but the configured camera_key isn't one the provider exposes, warn and fall back to all cameras instead of returning a phantom key that KeyErrors deep in frame decode. * interjections: clamp interjection timestamps to frame_timestamps[0] rather than a hardcoded 0.0 (datasets can start at non-zero t). Docs / code drift * annotation_pipeline.mdx: drop the phantom 'vocabulary discovery / phase 0 / --vocabulary.* / canonical_vocabulary.json' section (none of it exists); describe the real describe->segment + coverage-stitch flow. Soften the src/lerobot/tools/ + TOOL_REGISTRY reference to 'not part of this PR' (matches tools.mdx, which already marks the runtime layer as not-yet-implemented). Fix the --push_to_hub/--new_repo_id wording. Note the default is now a single h200. Add a 'Contributing new modules' section inviting module / prompt / quality contributions. * executor docstring: six phases, no phantom phase 0. run_hf_job.py * add the Apache 2.0 license header (was flagged repeatedly). * default to a single GPU: flavor=h200, parallel_servers=1, num_gpus=1 (scale to h200x4 noted in the docstring). * pin the install to @main instead of the feature branch (won't break after merge). Naming / cleanup * rename dest_repo_id -> new_repo_id across config / script / example / test to match the LeRobot dataset edit tools. * rename prompt templates module_N_*.txt -> descriptive (plan_*, interjections_*, vqa.txt) and update every load_prompt() call. * remove dead _messages_to_prompt (used only by the removed in-process backends). * declare _warned_decode_fail (frames) and _warned_no_camera (vqa) as real init=False dataclass fields instead of getattr monkey-patches. * scope bandit B607 to the two ffmpeg subprocess.run sites via '# nosec B607' and drop it from the global skip list. Tests * fix stale canned-VLM markers ('ONE realistic interruption' -> 'compact interjection', 'Update the memory' -> 'compressed semantic memory') and drop the dead 'concise hierarchical PLAN' plan responders (plan generation is deterministic now) in run_e2e_smoke, test_pipeline_recipe_render, test_modules. * run_e2e_smoke now asserts interjection + speech rows are produced so a stale marker can't silently pass again. * drop remaining 'PR 1' / 'PR 2' references from test comments / names. Verified: tests/annotations + tests/datasets/test_language + tests/scripts/test_lerobot_annotate (31 passed); make-style E2E smoke (interjections=1 speech_atoms=2); pre-commit (ruff, mypy, bandit, prettier) clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-03 18:30:46 +02:00
subtask generation).
feat(annotate): phase 0 — derive canonical vocabulary from sample episodes The pipeline previously emitted near-unique subtask + memory phrasings per episode (free-form LLM rephrasing). On the downstream low-level policy that collapses the action expert's conditioning to noise: every episode pairs a different paraphrase with similar motions, so the expert learns a flat scene-prior that ignores the subtask string — then at inference the high-level head invents *yet another* paraphrase and the expert produces tiny "uncertain hover" chunks. Add a vocabulary-discovery phase (phase 0) that runs once per dataset: - watches the first ``vocabulary.sample_episodes`` (default 3) episode videos as one Qwen-VL prompt, - asks the VLM to derive ~``n_subtask_target`` canonical imperative subtask labels and ~``n_memory_target`` first-person past-tense memory milestones that recur across the demos, - persists them to ``meta/canonical_vocabulary.json`` (human- inspectable, hand-editable), and - wires the resulting ``Vocabulary`` into the ``plan`` module so every per-episode subtask + memory call is constrained to those exact strings (both as prompt-side instructions *and* post-VLM validation: paraphrases snap to the closest canonical entry via token-set overlap; below a 0.5 Jaccard floor the subtask is dropped rather than warped into something semantically wrong). Operator workflow: - first run discovers the vocabulary, writes the JSON, and runs the ``plan`` module against it, - subsequent runs reuse the on-disk file (``reuse_existing=True`` default) so hand-edits stick, - set ``--vocabulary.enabled=False`` to fall back to free-form generation (the original behaviour). The discovery prompt forbids gerunds / third-person / adverbs and caps the lists to the requested counts, matching the Hi-Robot / π0.6-MEM convention of small per-environment vocabularies. The ``plan`` module's subtask + memory prompts grow a conditional ``{vocabulary_block}`` slot rendered only when a vocabulary is present; without one the templates collapse to their previous free-form form. Tests: 11 new unit tests under tests/annotations/test_vocabulary.py cover the on-disk round-trip, discovery against the fixture dataset, ``reuse_existing`` short-circuit, paraphrase canonicalisation, off- vocab subtask dropping, and the no-vocabulary pass-through path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-22 11:40:05 +00:00
### Tools
The writer does **not** add a `tools` column to the parquet. The tool
catalog lives in `meta/info.json["tools"]` instead (see [Tools](./tools)).
After every run, the pipeline makes sure the canonical `say` schema is in
that list, keeping any tools you declared beforehand.
Want to add your own tool? Edit `meta/info.json["tools"]` directly — the
pipeline preserves whatever is already there. That makes the tool visible
to the chat template, so the model can learn to _generate_ the call. The
runtime layer that actually _executes_ a generated call (the `Tool`
protocol / `TOOL_REGISTRY` under `src/lerobot/tools/`) is not part of
this PR — the [Tools](./tools) doc marks those pieces as
not-yet-implemented.
refactor(annotate): delegate distribution to HF Jobs; drop SLURM/local switch The executor previously claimed it would "optionally hand off" to datatrove's LocalPipelineExecutor or SlurmPipelineExecutor — but it already runs phases inline in every code path, and HF Jobs (see ``examples/annotation/run_hf_job.py``) is the actual distribution strategy. Stop pretending we have an executor selector. * `executor.py`: drop `select_executor_class`, the "kind" log line, and the references to LocalPipelineExecutor / SlurmPipelineExecutor. Module docstring now says distribution is delegated to HF Jobs. * `config.py`: drop `auto_threshold`, `force_local`, `slurm_partition`, `slurm_gpus`, `slurm_time`, `workers`. `ExecutorConfig` keeps only `episode_parallelism`. While here, prune the longer "why" docstrings on every field down to the load-bearing bits — full story moves to `docs/source/annotation_pipeline.mdx`. * `pyproject.toml`: drop `datatrove>=0.4.0,<2.0.0` from the `[annotations]` extra; the dep was only there for the (never used) cluster executors. Comment block notes the new HF-Jobs delegation. * `reader.py`, `lerobot_annotate.py`: drop their own datatrove / flavor-namespace mentions. * `docs/source/annotation_pipeline.mdx`: - remove the flavor-namespace / sidecar paragraph (out of scope — "multiple revisions = multiple copies" is dataset-level policy); - remove the "writer drops the legacy `subtask_index` column" note (already covered by PR 1's intentional-break call-out); - remove the chat-template + `apply_chat_template(messages, tools=...)` line (covered by Tools doc); - replace the "executor picks Local vs Slurm" paragraph with `--executor.episode_parallelism` and a pointer to HF Jobs; - rewrite the style→recipe section to talk about "recipes" generically instead of pinning a specific YAML; - add a "Running on Hugging Face Jobs" section pointing at `examples/annotation/run_hf_job.py`; - add a "Running locally" example matching the CLI's docstring (`uv run lerobot-annotate --root=... --vlm.model_id=...`); - extend the paper-inspirations list with Pi0.7 and Steerable VLA Policies (Zhao 2025) for Module 3. Tests: same 3 pre-existing failures as before this commit (2 module assertions still in flight; 1 carryover from PR 1). 41/44 pass. Pre-commit clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 11:09:22 +02:00
## Running on Hugging Face Jobs
Annotation runs on [Hugging Face Jobs](https://huggingface.co/docs/hub/en/jobs).
The repo ships a launcher script you copy and tweak for your dataset:
refactor(annotate): delegate distribution to HF Jobs; drop SLURM/local switch The executor previously claimed it would "optionally hand off" to datatrove's LocalPipelineExecutor or SlurmPipelineExecutor — but it already runs phases inline in every code path, and HF Jobs (see ``examples/annotation/run_hf_job.py``) is the actual distribution strategy. Stop pretending we have an executor selector. * `executor.py`: drop `select_executor_class`, the "kind" log line, and the references to LocalPipelineExecutor / SlurmPipelineExecutor. Module docstring now says distribution is delegated to HF Jobs. * `config.py`: drop `auto_threshold`, `force_local`, `slurm_partition`, `slurm_gpus`, `slurm_time`, `workers`. `ExecutorConfig` keeps only `episode_parallelism`. While here, prune the longer "why" docstrings on every field down to the load-bearing bits — full story moves to `docs/source/annotation_pipeline.mdx`. * `pyproject.toml`: drop `datatrove>=0.4.0,<2.0.0` from the `[annotations]` extra; the dep was only there for the (never used) cluster executors. Comment block notes the new HF-Jobs delegation. * `reader.py`, `lerobot_annotate.py`: drop their own datatrove / flavor-namespace mentions. * `docs/source/annotation_pipeline.mdx`: - remove the flavor-namespace / sidecar paragraph (out of scope — "multiple revisions = multiple copies" is dataset-level policy); - remove the "writer drops the legacy `subtask_index` column" note (already covered by PR 1's intentional-break call-out); - remove the chat-template + `apply_chat_template(messages, tools=...)` line (covered by Tools doc); - replace the "executor picks Local vs Slurm" paragraph with `--executor.episode_parallelism` and a pointer to HF Jobs; - rewrite the style→recipe section to talk about "recipes" generically instead of pinning a specific YAML; - add a "Running on Hugging Face Jobs" section pointing at `examples/annotation/run_hf_job.py`; - add a "Running locally" example matching the CLI's docstring (`uv run lerobot-annotate --root=... --vlm.model_id=...`); - extend the paper-inspirations list with Pi0.7 and Steerable VLA Policies (Zhao 2025) for Module 3. Tests: same 3 pre-existing failures as before this commit (2 module assertions still in flight; 1 carryover from PR 1). 41/44 pass. Pre-commit clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 11:09:22 +02:00
```bash
HF_TOKEN=hf_... uv run python examples/annotations/run_hf_job.py
refactor(annotate): delegate distribution to HF Jobs; drop SLURM/local switch The executor previously claimed it would "optionally hand off" to datatrove's LocalPipelineExecutor or SlurmPipelineExecutor — but it already runs phases inline in every code path, and HF Jobs (see ``examples/annotation/run_hf_job.py``) is the actual distribution strategy. Stop pretending we have an executor selector. * `executor.py`: drop `select_executor_class`, the "kind" log line, and the references to LocalPipelineExecutor / SlurmPipelineExecutor. Module docstring now says distribution is delegated to HF Jobs. * `config.py`: drop `auto_threshold`, `force_local`, `slurm_partition`, `slurm_gpus`, `slurm_time`, `workers`. `ExecutorConfig` keeps only `episode_parallelism`. While here, prune the longer "why" docstrings on every field down to the load-bearing bits — full story moves to `docs/source/annotation_pipeline.mdx`. * `pyproject.toml`: drop `datatrove>=0.4.0,<2.0.0` from the `[annotations]` extra; the dep was only there for the (never used) cluster executors. Comment block notes the new HF-Jobs delegation. * `reader.py`, `lerobot_annotate.py`: drop their own datatrove / flavor-namespace mentions. * `docs/source/annotation_pipeline.mdx`: - remove the flavor-namespace / sidecar paragraph (out of scope — "multiple revisions = multiple copies" is dataset-level policy); - remove the "writer drops the legacy `subtask_index` column" note (already covered by PR 1's intentional-break call-out); - remove the chat-template + `apply_chat_template(messages, tools=...)` line (covered by Tools doc); - replace the "executor picks Local vs Slurm" paragraph with `--executor.episode_parallelism` and a pointer to HF Jobs; - rewrite the style→recipe section to talk about "recipes" generically instead of pinning a specific YAML; - add a "Running on Hugging Face Jobs" section pointing at `examples/annotation/run_hf_job.py`; - add a "Running locally" example matching the CLI's docstring (`uv run lerobot-annotate --root=... --vlm.model_id=...`); - extend the paper-inspirations list with Pi0.7 and Steerable VLA Policies (Zhao 2025) for Module 3. Tests: same 3 pre-existing failures as before this commit (2 module assertions still in flight; 1 carryover from PR 1). 41/44 pass. Pre-commit clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 11:09:22 +02:00
```
[`run_hf_job.py`](https://github.com/huggingface/lerobot/blob/main/examples/annotations/run_hf_job.py)
starts a single-GPU `h200` job (bump it to `h200x4` for big datasets)
that:
refactor(annotate): delegate distribution to HF Jobs; drop SLURM/local switch The executor previously claimed it would "optionally hand off" to datatrove's LocalPipelineExecutor or SlurmPipelineExecutor — but it already runs phases inline in every code path, and HF Jobs (see ``examples/annotation/run_hf_job.py``) is the actual distribution strategy. Stop pretending we have an executor selector. * `executor.py`: drop `select_executor_class`, the "kind" log line, and the references to LocalPipelineExecutor / SlurmPipelineExecutor. Module docstring now says distribution is delegated to HF Jobs. * `config.py`: drop `auto_threshold`, `force_local`, `slurm_partition`, `slurm_gpus`, `slurm_time`, `workers`. `ExecutorConfig` keeps only `episode_parallelism`. While here, prune the longer "why" docstrings on every field down to the load-bearing bits — full story moves to `docs/source/annotation_pipeline.mdx`. * `pyproject.toml`: drop `datatrove>=0.4.0,<2.0.0` from the `[annotations]` extra; the dep was only there for the (never used) cluster executors. Comment block notes the new HF-Jobs delegation. * `reader.py`, `lerobot_annotate.py`: drop their own datatrove / flavor-namespace mentions. * `docs/source/annotation_pipeline.mdx`: - remove the flavor-namespace / sidecar paragraph (out of scope — "multiple revisions = multiple copies" is dataset-level policy); - remove the "writer drops the legacy `subtask_index` column" note (already covered by PR 1's intentional-break call-out); - remove the chat-template + `apply_chat_template(messages, tools=...)` line (covered by Tools doc); - replace the "executor picks Local vs Slurm" paragraph with `--executor.episode_parallelism` and a pointer to HF Jobs; - rewrite the style→recipe section to talk about "recipes" generically instead of pinning a specific YAML; - add a "Running on Hugging Face Jobs" section pointing at `examples/annotation/run_hf_job.py`; - add a "Running locally" example matching the CLI's docstring (`uv run lerobot-annotate --root=... --vlm.model_id=...`); - extend the paper-inspirations list with Pi0.7 and Steerable VLA Policies (Zhao 2025) for Module 3. Tests: same 3 pre-existing failures as before this commit (2 module assertions still in flight; 1 carryover from PR 1). 41/44 pass. Pre-commit clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 11:09:22 +02:00
1. installs `lerobot` (from `main`) plus the annotation extras,
2. boots one vLLM server per GPU (using the `vllm/vllm-openai` image) and
drives it over the OpenAI-compatible API,
3. runs the `plan` / `interjections` / `vqa` modules across the dataset
with `lerobot-annotate`,
4. with `--push_to_hub=true`, uploads the result to `--new_repo_id` (or
back to `--repo_id` in place if you leave that unset).
refactor(annotate): delegate distribution to HF Jobs; drop SLURM/local switch The executor previously claimed it would "optionally hand off" to datatrove's LocalPipelineExecutor or SlurmPipelineExecutor — but it already runs phases inline in every code path, and HF Jobs (see ``examples/annotation/run_hf_job.py``) is the actual distribution strategy. Stop pretending we have an executor selector. * `executor.py`: drop `select_executor_class`, the "kind" log line, and the references to LocalPipelineExecutor / SlurmPipelineExecutor. Module docstring now says distribution is delegated to HF Jobs. * `config.py`: drop `auto_threshold`, `force_local`, `slurm_partition`, `slurm_gpus`, `slurm_time`, `workers`. `ExecutorConfig` keeps only `episode_parallelism`. While here, prune the longer "why" docstrings on every field down to the load-bearing bits — full story moves to `docs/source/annotation_pipeline.mdx`. * `pyproject.toml`: drop `datatrove>=0.4.0,<2.0.0` from the `[annotations]` extra; the dep was only there for the (never used) cluster executors. Comment block notes the new HF-Jobs delegation. * `reader.py`, `lerobot_annotate.py`: drop their own datatrove / flavor-namespace mentions. * `docs/source/annotation_pipeline.mdx`: - remove the flavor-namespace / sidecar paragraph (out of scope — "multiple revisions = multiple copies" is dataset-level policy); - remove the "writer drops the legacy `subtask_index` column" note (already covered by PR 1's intentional-break call-out); - remove the chat-template + `apply_chat_template(messages, tools=...)` line (covered by Tools doc); - replace the "executor picks Local vs Slurm" paragraph with `--executor.episode_parallelism` and a pointer to HF Jobs; - rewrite the style→recipe section to talk about "recipes" generically instead of pinning a specific YAML; - add a "Running on Hugging Face Jobs" section pointing at `examples/annotation/run_hf_job.py`; - add a "Running locally" example matching the CLI's docstring (`uv run lerobot-annotate --root=... --vlm.model_id=...`); - extend the paper-inspirations list with Pi0.7 and Steerable VLA Policies (Zhao 2025) for Module 3. Tests: same 3 pre-existing failures as before this commit (2 module assertions still in flight; 1 carryover from PR 1). 41/44 pass. Pre-commit clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 11:09:22 +02:00
To use a different dataset, model, or hub repo, edit the `CMD` block in
the script. Every flag there maps directly to a `lerobot-annotate` flag
(run `lerobot-annotate --help` for the full list).
annotate: address review feedback — bug fixes, docs/code drift, naming, cleanup Bugs * validator: don't re-raise on unknown style. The second column_for_style lookup (used to route persistent vs event) now sits in try/except so an unknown style is recorded by _check_column_routing and skipped instead of crashing the whole validation pass. * general_vqa._target_cameras: when restrict_to_default_camera is set but the configured camera_key isn't one the provider exposes, warn and fall back to all cameras instead of returning a phantom key that KeyErrors deep in frame decode. * interjections: clamp interjection timestamps to frame_timestamps[0] rather than a hardcoded 0.0 (datasets can start at non-zero t). Docs / code drift * annotation_pipeline.mdx: drop the phantom 'vocabulary discovery / phase 0 / --vocabulary.* / canonical_vocabulary.json' section (none of it exists); describe the real describe->segment + coverage-stitch flow. Soften the src/lerobot/tools/ + TOOL_REGISTRY reference to 'not part of this PR' (matches tools.mdx, which already marks the runtime layer as not-yet-implemented). Fix the --push_to_hub/--new_repo_id wording. Note the default is now a single h200. Add a 'Contributing new modules' section inviting module / prompt / quality contributions. * executor docstring: six phases, no phantom phase 0. run_hf_job.py * add the Apache 2.0 license header (was flagged repeatedly). * default to a single GPU: flavor=h200, parallel_servers=1, num_gpus=1 (scale to h200x4 noted in the docstring). * pin the install to @main instead of the feature branch (won't break after merge). Naming / cleanup * rename dest_repo_id -> new_repo_id across config / script / example / test to match the LeRobot dataset edit tools. * rename prompt templates module_N_*.txt -> descriptive (plan_*, interjections_*, vqa.txt) and update every load_prompt() call. * remove dead _messages_to_prompt (used only by the removed in-process backends). * declare _warned_decode_fail (frames) and _warned_no_camera (vqa) as real init=False dataclass fields instead of getattr monkey-patches. * scope bandit B607 to the two ffmpeg subprocess.run sites via '# nosec B607' and drop it from the global skip list. Tests * fix stale canned-VLM markers ('ONE realistic interruption' -> 'compact interjection', 'Update the memory' -> 'compressed semantic memory') and drop the dead 'concise hierarchical PLAN' plan responders (plan generation is deterministic now) in run_e2e_smoke, test_pipeline_recipe_render, test_modules. * run_e2e_smoke now asserts interjection + speech rows are produced so a stale marker can't silently pass again. * drop remaining 'PR 1' / 'PR 2' references from test comments / names. Verified: tests/annotations + tests/datasets/test_language + tests/scripts/test_lerobot_annotate (31 passed); make-style E2E smoke (interjections=1 speech_atoms=2); pre-commit (ruff, mypy, bandit, prettier) clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-03 18:30:46 +02:00
## Contributing new modules
The pipeline is built to grow, and **contributions are very welcome** —
a brand-new module (say, trajectory traces or affordances), a new prompt
template, a smarter grounding flow, or quality fixes to the existing
`plan` / `interjections` / `vqa` modules.
Every module lives under
annotate: address review feedback — bug fixes, docs/code drift, naming, cleanup Bugs * validator: don't re-raise on unknown style. The second column_for_style lookup (used to route persistent vs event) now sits in try/except so an unknown style is recorded by _check_column_routing and skipped instead of crashing the whole validation pass. * general_vqa._target_cameras: when restrict_to_default_camera is set but the configured camera_key isn't one the provider exposes, warn and fall back to all cameras instead of returning a phantom key that KeyErrors deep in frame decode. * interjections: clamp interjection timestamps to frame_timestamps[0] rather than a hardcoded 0.0 (datasets can start at non-zero t). Docs / code drift * annotation_pipeline.mdx: drop the phantom 'vocabulary discovery / phase 0 / --vocabulary.* / canonical_vocabulary.json' section (none of it exists); describe the real describe->segment + coverage-stitch flow. Soften the src/lerobot/tools/ + TOOL_REGISTRY reference to 'not part of this PR' (matches tools.mdx, which already marks the runtime layer as not-yet-implemented). Fix the --push_to_hub/--new_repo_id wording. Note the default is now a single h200. Add a 'Contributing new modules' section inviting module / prompt / quality contributions. * executor docstring: six phases, no phantom phase 0. run_hf_job.py * add the Apache 2.0 license header (was flagged repeatedly). * default to a single GPU: flavor=h200, parallel_servers=1, num_gpus=1 (scale to h200x4 noted in the docstring). * pin the install to @main instead of the feature branch (won't break after merge). Naming / cleanup * rename dest_repo_id -> new_repo_id across config / script / example / test to match the LeRobot dataset edit tools. * rename prompt templates module_N_*.txt -> descriptive (plan_*, interjections_*, vqa.txt) and update every load_prompt() call. * remove dead _messages_to_prompt (used only by the removed in-process backends). * declare _warned_decode_fail (frames) and _warned_no_camera (vqa) as real init=False dataclass fields instead of getattr monkey-patches. * scope bandit B607 to the two ffmpeg subprocess.run sites via '# nosec B607' and drop it from the global skip list. Tests * fix stale canned-VLM markers ('ONE realistic interruption' -> 'compact interjection', 'Update the memory' -> 'compressed semantic memory') and drop the dead 'concise hierarchical PLAN' plan responders (plan generation is deterministic now) in run_e2e_smoke, test_pipeline_recipe_render, test_modules. * run_e2e_smoke now asserts interjection + speech rows are produced so a stale marker can't silently pass again. * drop remaining 'PR 1' / 'PR 2' references from test comments / names. Verified: tests/annotations + tests/datasets/test_language + tests/scripts/test_lerobot_annotate (31 passed); make-style E2E smoke (interjections=1 speech_atoms=2); pre-commit (ruff, mypy, bandit, prettier) clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-03 18:30:46 +02:00
`src/lerobot/annotations/steerable_pipeline/modules/`, shares the VLM
client and the keyframe cache, writes its raw output to the staging
tree, and plugs into the executor as its own phase. Got an idea? Open an
issue or PR on [the repo](https://github.com/huggingface/lerobot).
## How recipes consume the output
The annotations are meant to be read by recipes (see
[Language Columns and Recipes](./language_and_recipes)). Typically:
- low-level / high-level / memory-update branches read
`subtask` / `plan` / `memory` from `language_persistent`.
- an interjection-response branch reads `interjection` events plus the
paired speech atom (merged into one assistant turn via `tool_calls_from`)
and the matching `plan` refresh at the same timestamp.
- a VQA branch reads the `(vqa, user)` and `(vqa, assistant)` pairs from
`language_events`.
## Why state and events are split
Two ideas shape the design:
1. **Persistent state vs. exact events.** Persistent rows (`subtask`,
`plan`, `memory`) apply to the whole episode and answer "what's true
right now?". Event rows (`interjection`, `vqa`, speech) appear only on
the one frame whose timestamp matches. Timestamps are copied straight
from the source parquet — never recomputed in floating point.
2. **One VLM pass.** All three modules share a single VLM client (the
OpenAI-compatible client talking to the job's vLLM server), so you pay
for one model load per dataset, not three.
## Re-running a single module
Each module stages its raw output to
`<root>/.annotate_staging/episode_{N:06d}/<module>.jsonl`. This makes
prompt iteration cheap: re-running one module overwrites only its own
JSONL, then the writer recomposes the final parquet. Disable modules you
don't want with `--plan.enabled=false` (and likewise
`--interjections.enabled` / `--vqa.enabled`) to test one at a time.
## What the validator checks
Before the writer runs, `StagingValidator` confirms:
- every event row lands exactly on a real frame timestamp;
- no speech / interjection pairs are left orphaned;
- `plan` is refreshed at every interjection timestamp;
- `memory` rows fall on subtask boundaries (a warning, not an error);
- each VQA assistant `content` is valid JSON in one of the
bbox / keypoint / count / attribute / spatial shapes;
- every row goes to the column chosen by `column_for_style(style)`.
Any error aborts the writer. Pass `--skip_validation=true` to override
while debugging.
## Where each module's ideas come from
- **`plan` — subtasks.** Hi Robot ([Shi 2025](https://arxiv.org/abs/2502.19417))
for atom granularity ("pick up one piece of lettuce", "place bowl to
box"); Pi0.7 ([Physical Intelligence 2025](https://pi.website/pi07))
for "how, not what" detail.
- **`plan` — memory.** MEM ([Torne 2026](https://arxiv.org/abs/2603.03596)):
keep only the minimal relevant information — preserve outcomes, drop
specific attributes.
- **`interjections`.** Hi Robot's scenario taxonomy: negative task,
situated correction, specific constraint, preference. Speech is a
tool-call-only atom
(`tool_calls=[{type:function, function:{name:"say", arguments:{text:...}}}]`).
- **`vqa`.** ECoT ([Zawalski 2024](https://arxiv.org/abs/2407.08693)) for
grounded features (pixel bounding boxes `[x_min, y_min, x_max, y_max]`,
keypoints) and Steerable VLA Policies
([Zhao 2025](https://arxiv.org/abs/2509.07626)) for multi-abstraction
grounding. Pi0.7 also grounds answers across abstraction levels.
When improving a module, tweak its prompt template in
`src/lerobot/annotations/steerable_pipeline/prompts/` rather than
rewriting from scratch.
## Roughly how much it costs
Per episode, the pipeline makes about `max_steps` plan calls,
`max_interjections_per_episode` interjection calls, and
`vqa_emission_hz × episode_seconds` VQA calls. With the defaults (8
subtasks, 1 interjection, 1 Hz × 3 pairs) on a 30-second episode, that's
~50 VLM calls.
Storage stays small: `language_persistent` is at most tens of KB per
episode (parquet dictionary-encodes the one entry that repeats across
frames), and `language_events` is empty on most frames — its size scales
with the number of emissions, not `num_frames × num_emissions`.