From cd128cbbd5d811ba625f1c86c9dbe8f21ed5b734 Mon Sep 17 00:00:00 2001
From: Pepijn
Date: Tue, 2 Jun 2026 16:10:49 +0200
Subject: [PATCH] annotate: add verb-scoped disambiguation rules to subtask prompt
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adopt the one prompt technique Scale's dense-captioning study found
reliably positive: targeted, verb-scoped, visually-grounded
disambiguation rules. Their lesson was that such a rule must fire ONLY
on the spatial situation it names (their narrow 'Stack vs Put' rule
helped; an over-broad directional 'Scoop' rule bled into other verbs and
hurt), so each rule here is phrased visually and scoped to one
confusable pair:

* stack-vs-put (on top of an object vs on a surface)
* insert-vs-put (fitted slot vs surface)
* pick-up/retrieve-vs-put (decide by which way the OBJECT moves:
  gripper closes + object moves with hand = pick up; gripper opens +
  object stays = put; directly targets Scale's dominant direction-flip
  failure)
* pour-vs-put (tilt + flow vs untilted move)

This is the highest-confidence, lowest-risk change from the Scale
findings; our pipeline already aligns with their 'avoid' list (no
temporal tokens, no overlays, no fancy sampling, no sequential context
injection, uniform sampling, describe-don't-predict framing).

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../prompts/module_1_subtasks.txt | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/src/lerobot/annotations/steerable_pipeline/prompts/module_1_subtasks.txt b/src/lerobot/annotations/steerable_pipeline/prompts/module_1_subtasks.txt
index e1c8f822e..e6a5260a7 100644
--- a/src/lerobot/annotations/steerable_pipeline/prompts/module_1_subtasks.txt
+++ b/src/lerobot/annotations/steerable_pipeline/prompts/module_1_subtasks.txt
@@ -85,6 +85,23 @@ Authoring rules — Hi Robot atom granularity, pi0.7-style short prompts:
 - Every subtask's [start_time, end_time] must lie within [0.0,
   {episode_duration}] seconds.
 
+SPECIAL CASES — verb disambiguation (each rule is narrowly visual and
+fires ONLY on the spatial situation it names; it must not change how you
+label any other situation):
+- STACK vs PUT: if an object is placed ON TOP OF another specific object
+  (not on a flat table / shelf / counter), use "stack ... on ...", not
+  "put". "stack blue book on green book", NOT "put blue book on table".
+- INSERT vs PUT: if an object goes INTO a fitted slot / hole / socket /
+  receptacle (push-fit), use "insert ... into ...", not "put".
+- RETRIEVE/PICK-UP vs PUT (direction): watch the gripper. If it CLOSES
+  on the object and the object moves WITH the hand, it is "pick up" /
+  "retrieve" (object leaves its location). If the gripper OPENS and the
+  object stays where the hand left it, it is "put" / "place" (object
+  arrives at a location). Decide by which way the object moves, not by
+  where the hand ends up.
+- POUR vs PUT: only use "pour" when the source is tilted and contents
+  flow out; moving a full container without tilting is "put"/"place".
+
 Output strictly valid JSON of shape:
 {{
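
Reviewer note (illustration only, not part of the patch): the four rules in the hunk above can be read as a small decision procedure. The sketch below is a hypothetical Python rendering of that reading. `SegmentCues`, its field names, and `choose_verb` are invented for illustration and do not exist in the pipeline, which applies these rules through the prompt text rather than through code. The order of the checks is also an assumption; the prompt does not rank the rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SegmentCues:
    """Hypothetical visual cues for one subtask segment (illustrative only)."""
    gripper_closes: bool = False
    gripper_opens: bool = False
    object_moves_with_hand: bool = False
    released_on_object: bool = False       # on top of another specific object
    released_in_fitted_slot: bool = False  # slot / hole / socket / receptacle
    source_tilted: bool = False
    contents_flow: bool = False


def choose_verb(cues: SegmentCues) -> Optional[str]:
    """Return the verb a rule selects, or None when no rule fires."""
    # POUR vs PUT: needs both a tilted source and flowing contents.
    if cues.source_tilted and cues.contents_flow:
        return "pour"
    # PICK-UP vs PUT: decided by which way the object moves.
    if cues.gripper_closes and cues.object_moves_with_hand:
        return "pick up"
    if cues.gripper_opens:
        # INSERT vs PUT: a fitted slot wins over a plain surface.
        if cues.released_in_fitted_slot:
            return "insert"
        # STACK vs PUT: released on another object, not a flat surface.
        if cues.released_on_object:
            return "stack"
        return "put"
    # Outside the named situations the rules stay silent.
    return None
```

Returning `None` when no cue matches mirrors the requirement that each rule fire only on the situation it names and leave every other label untouched.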