Compare commits

...

2 Commits

Author SHA1 Message Date
pepijn
1e7c0d6aa1 annotate(plan): force composite-action subtasks; ban ultra-fine splits
Tighten ``module_1_subtasks.txt`` so the VLM emits one composite
atomic action per subtask instead of decomposing every pick into
``move to X`` / ``grasp X`` / ``lift X``:

- Lock the verb vocabulary to the composite set the low-level
  policy actually learns end-to-end: ``pick up`` (approach + grasp +
  lift), ``put``/``place`` (transport + release), ``push``, ``pull``,
  ``turn``, ``press``, ``open``, ``close``, ``pour``, ``insert``.
  ``go to`` is allowed only as a pure relocation between phases.
- Add an explicit ``Forbidden ultra-fine splits`` block enumerating
  the patterns the VLM was tempted to emit (``move to X``,
  ``reach for X``, ``grasp X``, ``lift X``, ``release X``) and
  instructing it to fold each into its parent composite.
- Rewrite the Good/Bad examples to match the composite contract;
  the previous ``"move to blue cube" / "grasp blue cube" / "lift
  blue cube"`` Good list was actively encouraging the over-
  segmentation pattern this prompt is supposed to prevent.
- Tighten the duration rule: candidates shorter than
  ``min_subtask_seconds`` must be merged into a neighbour rather
  than emitted. Pairs with bumping the runtime floor to 3 s so
  composites have room to land.

Pure prompt change — no code or schema change. Existing canonical-
vocabulary retry path is unaffected (the new verb whitelist lives
in prose, not in the validator).

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-26 05:14:30 +00:00
pepijn
920c6ef5a2 docs(annotate): disable phase-0 vocabulary discovery by default in run_hf_job
Heterogeneous datasets (different tasks/scenes across episodes) don't
share a single small subtask + memory vocabulary, so the canonical
vocabulary phase narrowed every episode to the wrong target distribution.
Flip the example to free-form generation by default and document the
``--vocabulary.enabled=true`` switch for homogeneous datasets where the
canonical vocabulary still helps the downstream policy.

No pipeline-code changes: ``VocabularyConfig.enabled`` already gates
phase 0 (see ``executor.py:_run_vocabulary_phase`` and
``VocabularyConfig`` docstring) and falls back to free-form generation.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-26 04:42:10 +00:00
2 changed files with 62 additions and 23 deletions

View File

@@ -5,13 +5,16 @@ Spawns one ``h200x2`` job that:
1. installs this branch of ``lerobot`` plus the annotation extras,
2. boots two vllm servers (one per GPU) with Qwen3.6-35B-A3B-FP8,
3. discovers the dataset's canonical subtask + memory vocabulary
from the first 3 sample episodes (phase 0),
4. runs the plan / interjections / vqa modules across the dataset
(subtasks + memory are constrained to the canonical vocabulary),
5. uploads the annotated dataset to ``--dest_repo_id`` (when set)
3. runs the plan / interjections / vqa modules across the dataset
in free-form mode (phase 0 canonical-vocabulary discovery is
disabled — each episode generates its own subtasks + memory),
4. uploads the annotated dataset to ``--dest_repo_id`` (when set)
or back to ``--repo_id``.
Re-enable phase 0 with ``--vocabulary.enabled=true`` (optionally
``--vocabulary.sample_episodes=N``) when the dataset is homogeneous
enough to share one subtask + memory vocabulary across all episodes.
Usage:
HF_TOKEN=hf_... uv run python examples/annotations/run_hf_job.py
@@ -54,12 +57,14 @@ CMD = (
"--executor.episode_parallelism=16 "
"--vlm.chat_template_kwargs='{\"enable_thinking\": false}' "
"--vlm.camera_key=observation.images.wrist "
# Phase 0 — canonical vocabulary discovery from the first N sample
# episodes. The VLM picks the right number of subtask + memory
# entries itself from what it sees; the resulting
# meta/canonical_vocabulary.json constrains every subtask + memory
# string to a small repeatable target distribution.
"--vocabulary.sample_episodes=3 "
# Phase 0 — canonical vocabulary discovery DISABLED by default.
# Heterogeneous datasets (different tasks/scenes across episodes)
# don't share a single small subtask + memory vocabulary, so each
# episode generates its subtasks + memory free-form. Flip to
# ``--vocabulary.enabled=true`` (optionally ``--vocabulary.sample_episodes=N``)
# for homogeneous datasets where a shared canonical vocabulary
# helps the downstream policy.
"--vocabulary.enabled=false "
# Phase 1 — plan module (subtasks + plan + memory + task_aug).
"--plan.frames_per_second=1.0 "
"--plan.use_video_url=true "

View File

@@ -8,14 +8,42 @@ the robot performs.
{vocabulary_block}Authoring rules — Hi Robot atom granularity, pi0.7-style short prompts:
- Each subtask = one atomic skill the low-level policy can execute.
- Write each subtask as an IMPERATIVE COMMAND, starting with a verb:
move, reach, pick up, grasp, place, put, push, pull, open, close,
turn, press, lift, insert, pour...
- Each subtask = one COMPOSITE atomic skill the low-level policy can
execute end-to-end. A "skill" bundles its own approach motion with
its terminal action — do NOT split the approach off as its own
subtask. The whole-arm policy already learns to reach as part of
every manipulation primitive.
- Write each subtask as an IMPERATIVE COMMAND, starting with one of
these verbs (extend only when none fits):
pick up <obj> — approach + grasp + lift in one subtask
put <obj> on/in <loc> — transport + release in one subtask
place <obj> on/in <loc> — synonym of "put"; pick one and stay consistent
push <obj> — contact + linear shove
pull <obj> — contact + linear retract
turn <knob/dial/handle> — rotary actuation
press <button> — single-press contact
open <drawer/door/lid> — full open motion
close <drawer/door/lid> — full close motion
pour <src> into <dst> — tilt + flow
insert <obj> into <slot>— alignment + push-fit
go to <loc> — ONLY when no grasp / actuation follows
(e.g. a pure relocation between phases).
If the next subtask grasps something at
that location, drop "go to ..." and just
write "pick up ..." instead.
- Forbidden ultra-fine splits — the VLM is NOT allowed to emit these
as standalone subtasks; fold them into the parent composite:
"move to X" → fold into "pick up X" (or whatever follows)
"reach for X" → fold into "pick up X"
"grasp X" → fold into "pick up X"
"lift X" → fold into "pick up X" (or "put X on Y" if it's
the transport phase of a place)
"release X" → fold into "put X on Y" (or "place X in Y")
- Keep it SHORT — a verb phrase, not a sentence. Drop articles
("the", "a") and adverbs ("carefully", "slowly"). Add a "how"
detail (which hand, which grasp point) ONLY when it is needed to
disambiguate.
disambiguate. Every subtask must begin with one of the verbs
above (no leading nouns, no "then", no "first").
- NEVER use third person. Never write "the robot", "the arm", "the
gripper moves", "it picks up" — the robot is implied. Command it,
do not describe it.
@@ -23,16 +51,22 @@ the robot performs.
"cube", every subtask says "cube" — never switch to "block". If it
says "box", never switch to "bin"/"container". Keep vocabulary
consistent across the whole episode.
- Good: "move to blue cube", "grasp blue cube", "lift blue cube",
"place blue cube in box", "open drawer", "release yellow cube".
- Bad: "the robot arm moves towards the blue cube" (third person,
too long), "carefully pick up the cube" (adverb, article),
"release the yellow block" ("block" when the task said "cube").
- Good: "pick up blue cube", "put blue cube in box", "open drawer",
"turn red knob", "press start button", "go to sink".
- Bad: "move to blue cube" (approach as its own subtask — forbidden,
must be folded into "pick up blue cube"); "the robot arm moves
towards the blue cube" (third person, too long); "carefully pick
up the cube" (adverb, article); "release the yellow block"
("block" when the task said "cube", and "release" must be folded
into a "put"/"place" subtask).
- Subtasks are non-overlapping and cover the full episode in order.
Choose the cut points yourself based on what you see in the video
(gripper open/close events, contact, regrasps, transitions).
- Each subtask spans at least {min_subtask_seconds} seconds.
- Do not exceed {max_steps} subtasks total.
- Each subtask spans at least {min_subtask_seconds} seconds. If a
candidate span would be shorter, merge it into its neighbour
rather than emitting it.
- Do not exceed {max_steps} subtasks total. Fewer, larger composites
are preferred over many micro-steps.
- Every subtask's [start_time, end_time] must lie within
[0.0, {episode_duration}] seconds.