annotate: kill subtask hallucination + single-camera grounding

Two fixes for 'subtasks describe actions not in the video' plus a way to focus the whole pipeline on one camera.

ANTI-HALLUCINATION

1. _episode_video_block: when use_video_url is set but clip extraction fails, FALL BACK to embedded frames instead of returning an empty block. An empty block left the VLM with zero visual grounding, so it invented subtasks from the task text alone (the likely root cause of hallucinated steps). Now logs a warning and embeds frames.
2. module_1_subtasks.txt gains a GROUNDING preamble (overrides all other rules): label only motion visible in specific frames; never invent/anticipate/pad; max_steps is a CEILING not a target; atomic demos may be exactly ONE subtask; the VIDEO is ground truth, not the instruction text.

SINGLE-CAMERA GROUNDING

* New VqaConfig.restrict_to_default_camera (default False). When True, the VQA module grounds on only the --vlm.camera_key stream instead of iterating every camera, matching the plan / interjection modules, which already use that single camera. Now the whole pipeline can focus on one view (e.g. observation.images.base).

run_hf_job.py updated:

* use_video_url=false + frames_per_second=2.0: embed frames directly (most reliable; no silent text-only failure mode) with dense grounding.
* vqa.restrict_to_default_camera=true: VQA on the single camera too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-03 04:11:24 +00:00 · 2026-06-02 15:08:25 +02:00
parent 7454b4c993
commit ba5d4c5cd8
5 changed files with 65 additions and 9 deletions
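
The fallback described for _episode_video_block in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the function name episode_video_block, the extract_clip / embed_frames callables, and the shape of the returned content blocks are all assumptions made for the example.

```python
from __future__ import annotations

import logging
from typing import Callable, Optional, Sequence

logger = logging.getLogger(__name__)


def episode_video_block(
    use_video_url: bool,
    extract_clip: Callable[[], Optional[str]],
    embed_frames: Callable[[], Sequence[dict]],
) -> list[dict]:
    """Build the visual content blocks for one episode.

    Hypothetical helper: `extract_clip` returns a local clip path (or None
    on failure) and `embed_frames` returns already-encoded frame blocks.
    """
    if use_video_url:
        try:
            clip_path = extract_clip()
        except Exception:
            # Treat an exception the same as "no clip produced".
            clip_path = None
        if clip_path:
            # Assumed block shape for a server that accepts file:// clips.
            return [{"type": "video_url", "video_url": {"url": f"file://{clip_path}"}}]
        # Key point of the fix: never return an empty block here, because
        # that would leave the VLM with only the task text to go on.
        logger.warning("clip extraction failed; falling back to embedded frames")
    return list(embed_frames())
```

The design choice the sketch illustrates is that the failure path degrades to a slower but still visually grounded request instead of a text-only one.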
--- a/examples/annotations/run_hf_job.py
+++ b/examples/annotations/run_hf_job.py
@@ -54,9 +54,14 @@ CMD = (
    "--vlm.chat_template_kwargs='{\"enable_thinking\": false}' "
    "--vlm.camera_key=observation.images.robot0_agentview_right "
    # Phase 1 — plan module (subtasks + plan + memory).
-    "--plan.frames_per_second=1.0 "
-    "--plan.use_video_url=true "
-    "--plan.use_video_url_fps=1.0 "
+    # Embed decoded frames directly (use_video_url=false) rather than
+    # handing the server a file:// clip. The embedded path is more
+    # reliable: if clip extraction ever fails, the video_url path would
+    # silently send NO video and the VLM would hallucinate subtasks from
+    # the task text alone. 2 fps gives dense visual grounding so the VLM
+    # labels what actually happens.
+    "--plan.frames_per_second=2.0 "
+    "--plan.use_video_url=false "
    # IMPORTANT for RoboCasa: the dataset's task string ("Navigate to the
    # stove", "Pick the mug...") is authoritative and is what eval uses.
    # ``derive_task_from_video=off`` keeps that canonical task driving
@@ -80,6 +85,10 @@ CMD = (
    # Phase 2 — interjections + speech.
    "--interjections.max_interjections_per_episode=6 "
    # Phase 4 — general VQA.
+    # Ground VQA on the SAME single camera as plan/interjections
+    # (--vlm.camera_key) instead of iterating every camera. The whole
+    # pipeline then focuses on one view, e.g. observation.images.base.
+    "--vqa.restrict_to_default_camera=true "
    "--vqa.K=1 "
    "--vqa.vqa_emission_hz=1.0"
 )
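
The effect of --vqa.restrict_to_default_camera can be pictured with a small sketch of the camera selection. Only the field name restrict_to_default_camera and its default of False come from the commit message; the VqaConfigSketch class and the cameras_for_vqa helper are hypothetical stand-ins for whatever the VQA module does internally.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Sequence


@dataclass
class VqaConfigSketch:
    # Mirrors the documented flag; default False keeps the old behaviour.
    restrict_to_default_camera: bool = False


def cameras_for_vqa(
    cfg: VqaConfigSketch,
    all_camera_keys: Sequence[str],
    default_camera_key: str,
) -> list[str]:
    """Pick which camera streams VQA should ground on (hypothetical helper)."""
    if cfg.restrict_to_default_camera:
        # Single view, the same one the plan / interjection modules use
        # (the value passed as --vlm.camera_key).
        return [default_camera_key]
    # Previous behaviour: iterate every camera in the dataset.
    return list(all_camera_keys)


if __name__ == "__main__":
    keys = [
        "observation.images.robot0_agentview_left",
        "observation.images.robot0_agentview_right",
    ]
    print(cameras_for_vqa(VqaConfigSketch(True), keys, keys[1]))
```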