mirror of
https://github.com/huggingface/lerobot.git
synced 2026-06-03 20:31:25 +00:00
Bugs
* validator: don't re-raise on unknown style. The second column_for_style
lookup (used to route persistent vs event) now sits in try/except so an
unknown style is recorded by _check_column_routing and skipped instead
of crashing the whole validation pass.
* general_vqa._target_cameras: when restrict_to_default_camera is set but
the configured camera_key isn't one the provider exposes, warn and fall
back to all cameras instead of returning a phantom key that KeyErrors
deep in frame decode.
* interjections: clamp interjection timestamps to frame_timestamps[0]
rather than a hardcoded 0.0 (datasets can start at non-zero t).
Docs / code drift
* annotation_pipeline.mdx: drop the phantom 'vocabulary discovery / phase
0 / --vocabulary.* / canonical_vocabulary.json' section (none of it
exists); describe the real describe->segment + coverage-stitch flow.
Soften the src/lerobot/tools/ + TOOL_REGISTRY reference to 'not part of
this PR' (matches tools.mdx, which already marks the runtime layer as
not-yet-implemented). Fix the --push_to_hub/--new_repo_id wording. Note
the default is now a single h200. Add a 'Contributing new modules'
section inviting module / prompt / quality contributions.
* executor docstring: six phases, no phantom phase 0.
run_hf_job.py
* add the Apache 2.0 license header (was flagged repeatedly).
* default to a single GPU: flavor=h200, parallel_servers=1, num_gpus=1
(scale to h200x4 noted in the docstring).
* pin the install to @main instead of the feature branch (won't break
after merge).
Naming / cleanup
* rename dest_repo_id -> new_repo_id across config / script / example /
test to match the LeRobot dataset edit tools.
* rename prompt templates module_N_*.txt -> descriptive (plan_*,
interjections_*, vqa.txt) and update every load_prompt() call.
* remove dead _messages_to_prompt (used only by the removed in-process
backends).
* declare _warned_decode_fail (frames) and _warned_no_camera (vqa) as
real init=False dataclass fields instead of getattr monkey-patches.
* scope bandit B607 to the two ffmpeg subprocess.run sites via
'# nosec B607' and drop it from the global skip list.
Tests
* fix stale canned-VLM markers ('ONE realistic interruption' ->
'compact interjection', 'Update the memory' -> 'compressed semantic
memory') and drop the dead 'concise hierarchical PLAN' plan responders
(plan generation is deterministic now) in run_e2e_smoke,
test_pipeline_recipe_render, test_modules.
* run_e2e_smoke now asserts interjection + speech rows are produced so a
stale marker can't silently pass again.
* drop remaining 'PR 1' / 'PR 2' references from test comments / names.
Verified: tests/annotations + tests/datasets/test_language +
tests/scripts/test_lerobot_annotate (31 passed); make-style E2E smoke
(interjections=1 speech_atoms=2); pre-commit (ruff, mypy, bandit,
prettier) clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
178 lines
8.8 KiB
Plaintext
# Annotation Pipeline

`lerobot-annotate` populates the two language columns introduced by the
[Language Columns and Recipes](./language_and_recipes) page,
`language_persistent` and `language_events`, directly into
`data/chunk-*/file-*.parquet`.

## What the pipeline produces

Three modules write into a per-episode staging tree, then a single writer
rewrites the data shards in place:

| Style / atom                                | Column                | Module          |
| ------------------------------------------- | --------------------- | --------------- |
| `subtask` (Pi0.7-style "how, not what")     | `language_persistent` | `plan`          |
| `plan` (initial + refresh on interjection)  | `language_persistent` | `plan`          |
| `memory` (MEM-style compression)            | `language_persistent` | `plan`          |
| `task_aug` (rephrasings of canonical task)  | `language_persistent` | `plan`          |
| `interjection`                              | `language_events`     | `interjections` |
| speech tool-call atom (`style=null`, `say`) | `language_events`     | `interjections` |
| `vqa` (user / assistant pair)               | `language_events`     | `vqa`           |

The `plan` module generates subtasks per episode with a **describe → segment**
grounding flow: a first pass narrates only what is visible in the chosen
camera, and its description is fed into a second pass that segments the
episode into consecutive atomic subtasks. The resulting spans are then
deterministically stitched into a contiguous full-episode cover so every
frame has exactly one active subtask. See
[`run_hf_job.py`](https://github.com/huggingface/lerobot/blob/main/examples/annotations/run_hf_job.py)
for the production flag set (single camera, embedded frames, windowed
subtask generation).

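The coverage stitch described above can be pictured with a minimal sketch.
The `Span` type, the `stitch_cover` name, and the gap policy (a gap is
absorbed by the earlier span, an overlap is cut at the later span's start)
are assumptions made for illustration; the pipeline's actual stitching rule
may differ.

```python
from dataclasses import dataclass


@dataclass
class Span:
    start: float  # seconds from the episode's first frame
    end: float
    text: str


def stitch_cover(spans: list[Span], episode_start: float, episode_end: float) -> list[Span]:
    """Turn gapped or overlapping spans into one contiguous cover.

    Hypothetical policy: the first span is pulled back to the episode start,
    each span runs until the next one begins, and the last span is stretched
    to the episode end.
    """
    cover: list[Span] = []
    for span in sorted(spans, key=lambda s: s.start):
        if not cover:
            cover.append(Span(episode_start, episode_end, span.text))
            continue
        start = max(span.start, cover[-1].start)
        if start <= cover[-1].start or start >= episode_end:
            continue  # no room left for this span, so it is dropped
        cover[-1].end = start  # the previous span ends where this one begins
        cover.append(Span(start, episode_end, span.text))
    return cover
```

Because the function only sorts and clamps, the same input spans always give
the same cover, which is the property the "deterministically stitched"
wording relies on.
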
The writer does **not** add a `tools` column to the parquet; the tool
catalog lives at `meta/info.json["tools"]` instead (see
[Tools](./tools)). After every annotation run the pipeline ensures the
canonical `say` schema is present in that list, preserving any tools the
user pre-declared.

If you want to declare additional tools for a dataset before annotation
runs, edit `meta/info.json["tools"]` directly; the pipeline preserves
anything already there. That makes the tool visible to the chat template
so the model can learn to _generate_ the call. The runtime layer that
_executes_ a generated call (the `Tool` protocol / `TOOL_REGISTRY` under
`src/lerobot/tools/`) is not part of this PR; see the
[Tools](./tools) doc, which marks those pieces as not-yet-implemented.

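As a rough illustration of the "ensure `say` is present, keep everything
else" behaviour, the sketch below merges a `say` entry into an
`info.json`-like dict. The `SAY_TOOL` declaration is an assumption (an
OpenAI-style function-tool shape whose `text` argument mirrors the speech
atom); the canonical schema shipped with LeRobot may look different, and
`ensure_say_tool` is a hypothetical helper, not the pipeline's function.

```python
import copy

# Assumed shape only: the real canonical `say` schema lives in the LeRobot source.
SAY_TOOL = {
    "type": "function",
    "function": {
        "name": "say",
        "description": "Speak a short utterance to the user.",
        "parameters": {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
    },
}


def ensure_say_tool(info: dict) -> dict:
    """Return a copy of `info` whose "tools" list contains a `say` entry."""
    updated = copy.deepcopy(info)
    tools = updated.setdefault("tools", [])
    declared = {tool.get("function", {}).get("name") for tool in tools}
    if "say" not in declared:
        tools.append(copy.deepcopy(SAY_TOOL))
    return updated
```

The helper is idempotent and never removes or reorders pre-declared tools,
so running it after every annotation pass is safe.
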
## Running on Hugging Face Jobs

Distributed annotation is delegated to
[Hugging Face Jobs](https://huggingface.co/docs/hub/en/jobs). The repo
ships a launcher script you copy and edit for your dataset:

```bash
HF_TOKEN=hf_... uv run python examples/annotations/run_hf_job.py
```

[`examples/annotations/run_hf_job.py`](https://github.com/huggingface/lerobot/blob/main/examples/annotations/run_hf_job.py)
spawns a single-GPU `h200` job (scale up to `h200x4` for larger datasets) that:

1. installs LeRobot from `main` plus the annotation extras,
2. boots one vLLM server per GPU (in the `vllm/vllm-openai` image) for the
   chosen model, which the pipeline drives over the OpenAI-compatible API,
3. runs the `plan` / `interjections` / `vqa` modules across the dataset
   via `lerobot-annotate`,
4. with `--push_to_hub=true`, uploads the annotated dataset to
   `--new_repo_id` (or back to `--repo_id` in place when that is unset).

To target a different dataset, model, or hub repo, edit the `CMD` block
inside the script; every flag in there maps directly onto a CLI flag of
`lerobot-annotate` (see `lerobot-annotate --help` for the full list).

## Contributing new modules

The pipeline is built to be extended, and **contributions are very
welcome**, whether that's a brand-new annotation module (e.g. a
trajectory-trace or affordance module), a new prompt template, a better
grounding flow, or quality improvements to the existing `plan` /
`interjections` / `vqa` modules. Each module lives under
`src/lerobot/annotations/steerable_pipeline/modules/`, shares the VLM
client and keyframe cache, writes its raw output to the per-episode
staging tree, and is wired into the executor as an independent phase.
If you have an idea for a module or an improvement, open an issue or PR
on [the repo](https://github.com/huggingface/lerobot).

## Style-to-recipe consumer mapping

The pipeline's outputs are designed to be consumed by recipes (see
[Language Columns and Recipes](./language_and_recipes)), typically:

- Low-level / high-level / memory-update branches consume
  `subtask`/`plan`/`memory` from `language_persistent`.
- An interjection-response branch consumes `interjection` events plus
  the paired speech atom (merged into one assistant target turn via
  `tool_calls_from`) and the same-timestamp `plan` refresh.
- A VQA branch consumes the `(vqa, user)` and `(vqa, assistant)` pairs
  from `language_events`.

## Why the design splits state from events

Two things drive the scope:

1. **Persistent state vs exact-event split.** Persistent rows
   (`subtask`, `plan`, `memory`) are broadcast per episode and answer "what
   state is in force at this frame?". Event rows (`interjection`, `vqa`,
   speech) only appear on the exact frame whose timestamp matches the
   emission. The pipeline writes timestamps taken straight from the
   source parquet, with no floating-point recomputation.
2. **One Qwen-VL pass.** All three modules share a single VLM client (the
   OpenAI-compatible client talking to the job's vLLM server) so the cost
   is one model load per dataset, not three.

## Module independence and staged reruns

Each module writes its raw output to
`<root>/.annotate_staging/episode_{N:06d}/<module>.jsonl`. That makes
prompt iteration cheap: re-running one module overwrites only its own
JSONL file before the writer composes the final parquet. Modules can be
disabled via `--plan.enabled=false` (and likewise `--interjections.enabled`
/ `--vqa.enabled`) to test the remaining modules in isolation.

## Validation/report checks before final write

Before the writer runs, `StagingValidator` checks:

- exact frame-timestamp alignment for every event row;
- no orphan speech / interjection pairs;
- `plan` is refreshed at every interjection timestamp;
- `memory` rows fall on subtask boundaries (warning, not error);
- VQA assistant `content` parses as JSON in one of the
  bbox / keypoint / count / attribute / spatial shapes;
- every row routes to the column dictated by `column_for_style(style)`.

Errors abort the writer (`--skip_validation=true` overrides for debugging).

## Paper inspirations per module

- **`plan` module (subtasks).** Hi Robot ([Shi 2025](https://arxiv.org/abs/2502.19417))
  atom granularity ("pick up one piece of lettuce", "place bowl to box");
  Pi0.7 ([Physical Intelligence 2025](https://pi.website/pi07)) "how, not
  what" detail.
- **`plan` module (memory).** MEM ([Torne 2026](https://arxiv.org/abs/2603.03596))
  compression directive: keep only minimal relevant information; functional
  outcomes preserved, specific attributes dropped.
- **`interjections` module.** Hi Robot scenario taxonomy: negative task,
  situated correction, specific constraint, preference. Speech is a
  tool-call-only atom (`tool_calls=[{type:function, function:{name:"say",
  arguments:{text:...}}}]`).
- **`vqa` module.** ECoT ([Zawalski 2024](https://arxiv.org/abs/2407.08693))
  grounded features (bounding boxes in pixel `[x_min, y_min, x_max, y_max]`,
  keypoints) and Steerable VLA Policies ([Zhao 2025](https://arxiv.org/abs/2509.07626))
  multi-abstraction grounding. Pi0.7 also grounds answers across
  multiple abstraction levels.

Future maintainers should adjust the prompt templates in
`src/lerobot/annotations/steerable_pipeline/prompts/` against these
references rather than rewriting from scratch.

## Compute and list-size estimates

Per episode, the pipeline issues O(`max_steps`) `plan`-module calls,
O(`max_interjections_per_episode`) `interjections`-module calls, and
O(`vqa_emission_hz × episode_seconds`) `vqa`-module calls. With defaults
(8 subtasks, 1 interjection, 1 Hz × 3 pairs) and 30-second episodes, that
is ~50 VLM calls per episode. `language_persistent` per episode is tens of
KB at most (parquet dictionary-encodes one entry per episode);
`language_events` is empty on most frames and is bounded by the number of
emissions, not `num_frames × num_emissions`.

## Reproducibility via seed and prompt hashes

`--seed` (default 1729) feeds the per-episode RNGs that select interjection
timestamps and VQA question types. Combined with the deterministic prompt
templates checked into `prompts/`, two runs at the same seed against the
same dataset and the same model checkpoint produce byte-identical staging
artifacts. Prompt edits are recorded by file hash; future tooling can pin
expected `(seed, prompt_hash)` pairs into the dataset card.

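One way to picture the `(seed, prompt_hash)` pairing is the sketch below.
How the pipeline actually derives its per-episode RNGs and which hash it
records are not stated on this page, so the SHA-256 choice and both helper
names are assumptions.

```python
import hashlib
import random


def prompt_hash(template_text: str) -> str:
    """Content hash of a prompt template (SHA-256 is an assumed choice)."""
    return hashlib.sha256(template_text.encode("utf-8")).hexdigest()


def episode_rng(seed: int, episode_index: int) -> random.Random:
    """Derive a stable per-episode RNG from the global seed.

    Hashing the pair avoids Python's salted built-in hash(), so the stream
    is identical across processes and machines.
    """
    digest = hashlib.sha256(f"{seed}:{episode_index}".encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))
```

With a scheme like this, episodes can be annotated in any order or on
different workers and still draw the same interjection timestamps, because
each episode's stream depends only on the seed and its own index.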