annotate(plan): force composite-action subtasks; ban ultra-fine splits

Tighten ``module_1_subtasks.txt`` so the VLM emits one composite
atomic action per subtask instead of decomposing every pick into
``move to X`` / ``grasp X`` / ``lift X``:

- Lock the verb vocabulary to the composite set the low-level
  policy actually learns end-to-end: ``pick up`` (approach + grasp +
  lift), ``put``/``place`` (transport + release), ``push``, ``pull``,
  ``turn``, ``press``, ``open``, ``close``, ``pour``, ``insert``.
  ``go to`` is allowed only as a pure relocation between phases.
- Add an explicit ``Forbidden ultra-fine splits`` block enumerating
  the patterns the VLM was tempted to emit (``move to X``,
  ``reach for X``, ``grasp X``, ``lift X``, ``release X``) and
  instructing it to fold each into its parent composite.
- Rewrite the Good/Bad examples to match the composite contract;
  the previous ``"move to blue cube" / "grasp blue cube" / "lift
  blue cube"`` Good list was actively encouraging the over-
  segmentation pattern this prompt is supposed to prevent.
- Tighten the duration rule: candidates shorter than
  ``min_subtask_seconds`` must be merged into a neighbour rather
  than emitted. Pairs with bumping the runtime floor to 3 s so
  composites have room to land.

Pure prompt change — no code or schema change. Existing canonical-
vocabulary retry path is unaffected (the new verb whitelist lives
in prose, not in the validator).

Co-authored-by: Cursor <cursoragent@cursor.com>
This commit is contained in:
pepijn
2026-05-26 05:14:30 +00:00
parent 920c6ef5a2
commit 1e7c0d6aa1

View File

@@ -8,14 +8,42 @@ the robot performs.
{vocabulary_block}Authoring rules — Hi Robot atom granularity, pi0.7-style short prompts:
- Each subtask = one atomic skill the low-level policy can execute.
- Write each subtask as an IMPERATIVE COMMAND, starting with a verb:
move, reach, pick up, grasp, place, put, push, pull, open, close,
turn, press, lift, insert, pour...
- Each subtask = one COMPOSITE atomic skill the low-level policy can
execute end-to-end. A "skill" bundles its own approach motion with
its terminal action — do NOT split the approach off as its own
subtask. The whole-arm policy already learns to reach as part of
every manipulation primitive.
- Write each subtask as an IMPERATIVE COMMAND, starting with one of
these verbs (extend only when none fits):
pick up <obj> — approach + grasp + lift in one subtask
put <obj> on/in <loc> — transport + release in one subtask
place <obj> on/in <loc> — synonym of "put"; pick one and stay consistent
push <obj> — contact + linear shove
pull <obj> — contact + linear retract
turn <knob/dial/handle> — rotary actuation
press <button> — single-press contact
open <drawer/door/lid> — full open motion
close <drawer/door/lid> — full close motion
pour <src> into <dst> — tilt + flow
insert <obj> into <slot>— alignment + push-fit
go to <loc> — ONLY when no grasp / actuation follows
(e.g. a pure relocation between phases).
If the next subtask grasps something at
that location, drop "go to ..." and just
write "pick up ..." instead.
- Forbidden ultra-fine splits — the VLM is NOT allowed to emit these
as standalone subtasks; fold them into the parent composite:
"move to X" → fold into "pick up X" (or whatever follows)
"reach for X" → fold into "pick up X"
"grasp X" → fold into "pick up X"
"lift X" → fold into "pick up X" (or "put X on Y" if it's
the transport phase of a place)
"release X" → fold into "put X on Y" (or "place X in Y")
- Keep it SHORT — a verb phrase, not a sentence. Drop articles
("the", "a") and adverbs ("carefully", "slowly"). Add a "how"
detail (which hand, which grasp point) ONLY when it is needed to
disambiguate.
disambiguate. Every subtask must begin with one of the verbs
above (no leading nouns, no "then", no "first").
- NEVER use third person. Never write "the robot", "the arm", "the
gripper moves", "it picks up" — the robot is implied. Command it,
do not describe it.
@@ -23,16 +51,22 @@ the robot performs.
"cube", every subtask says "cube" — never switch to "block". If it
says "box", never switch to "bin"/"container". Keep vocabulary
consistent across the whole episode.
- Good: "move to blue cube", "grasp blue cube", "lift blue cube",
"place blue cube in box", "open drawer", "release yellow cube".
- Bad: "the robot arm moves towards the blue cube" (third person,
too long), "carefully pick up the cube" (adverb, article),
"release the yellow block" ("block" when the task said "cube").
- Good: "pick up blue cube", "put blue cube in box", "open drawer",
"turn red knob", "press start button", "go to sink".
- Bad: "move to blue cube" (approach as its own subtask — forbidden,
must be folded into "pick up blue cube"); "the robot arm moves
towards the blue cube" (third person, too long); "carefully pick
up the cube" (adverb, article); "release the yellow block"
("block" when the task said "cube", and "release" must be folded
into a "put"/"place" subtask).
- Subtasks are non-overlapping and cover the full episode in order.
Choose the cut points yourself based on what you see in the video
(gripper open/close events, contact, regrasps, transitions).
- Each subtask spans at least {min_subtask_seconds} seconds.
- Do not exceed {max_steps} subtasks total.
- Each subtask spans at least {min_subtask_seconds} seconds. If a
candidate span would be shorter, merge it into its neighbour
rather than emitting it.
- Do not exceed {max_steps} subtasks total. Fewer, larger composites
are preferred over many micro-steps.
- Every subtask's [start_time, end_time] must lie within
[0.0, {episode_duration}] seconds.