From cd128cbbd5d811ba625f1c86c9dbe8f21ed5b734 Mon Sep 17 00:00:00 2001
From: Pepijn <pepijn@huggingface.co>
Date: Tue, 2 Jun 2026 16:10:49 +0200
Subject: [PATCH] annotate: add verb-scoped disambiguation rules to subtask
 prompt
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adopt the one prompt technique Scale's dense-captioning study found
reliably positive: targeted, verb-scoped, visually-grounded
disambiguation rules. Their lesson was that such a rule must fire ONLY
on the spatial situation it names (their narrow 'Stack vs Put' rule
helped; an over-broad directional 'Scoop' rule bled into other verbs
and hurt), so each rule here is phrased visually and scoped to one
confusable pair:
  * stack-vs-put (on top of an object vs on a surface)
  * insert-vs-put (fitted slot vs surface)
  * pick-up/retrieve-vs-put (decide by which way the OBJECT moves:
    gripper closes + object moves with hand = pick up; gripper opens +
    object stays = put). This directly targets Scale's dominant
    direction-flip failure.
  * pour-vs-put (tilt + flow vs untilted move)

This is the highest-confidence, lowest-risk change from the Scale
findings; our pipeline already matches their other recommendations (no
temporal tokens, no overlays, no fancy sampling, no sequential context
injection; uniform sampling and describe-don't-predict framing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .../prompts/module_1_subtasks.txt               | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/src/lerobot/annotations/steerable_pipeline/prompts/module_1_subtasks.txt b/src/lerobot/annotations/steerable_pipeline/prompts/module_1_subtasks.txt
index e1c8f822e..e6a5260a7 100644
--- a/src/lerobot/annotations/steerable_pipeline/prompts/module_1_subtasks.txt
+++ b/src/lerobot/annotations/steerable_pipeline/prompts/module_1_subtasks.txt
@@ -85,6 +85,23 @@ Authoring rules — Hi Robot atom granularity, pi0.7-style short prompts:
 - Every subtask's [start_time, end_time] must lie within
   [0.0, {episode_duration}] seconds.
 
+SPECIAL CASES — verb disambiguation (each rule is narrowly visual and
+fires ONLY on the spatial situation it names; it must not change how you
+label any other situation):
+- STACK vs PUT: if an object is placed ON TOP OF another specific object
+  (not on a flat table / shelf / counter), use "stack ... on ...", not
+  "put": "stack blue book on green book", NOT "put blue book on green book".
+- INSERT vs PUT: if an object goes INTO a fitted slot / hole / socket /
+  receptacle (push-fit), use "insert ... into ...", not "put".
+- RETRIEVE/PICK-UP vs PUT (direction): watch the gripper. If it CLOSES
+  on the object and the object moves WITH the hand, it is "pick up" /
+  "retrieve" (object leaves its location). If the gripper OPENS and the
+  object stays where the hand left it, it is "put" / "place" (object
+  arrives at a location). Decide by which way the object moves, not by
+  where the hand ends up.
+- POUR vs PUT: only use "pour" when the source is tilted and contents
+  flow out; moving a full container without tilting is "put" / "place".
+
 Output strictly valid JSON of shape:
 
   {{