mirror of
https://github.com/huggingface/lerobot.git
synced 2026-06-04 12:51:27 +00:00
annotate: kill subtask hallucination + single-camera grounding

Two fixes for 'subtasks describe actions not in the video', plus a way
to focus the whole pipeline on one camera.

ANTI-HALLUCINATION

1. _episode_video_block: when use_video_url is set but clip extraction
   fails, FALL BACK to embedded frames instead of returning an empty
   block. An empty block left the VLM with zero visual grounding, so
   it invented subtasks from the task text alone; this is the likely
   root cause of hallucinated steps. The fallback now logs a warning
   and embeds frames.
2. module_1_subtasks.txt gains a GROUNDING preamble (overrides all
   other rules): label only motion visible in specific frames; never
   invent/anticipate/pad; max_steps is a CEILING, not a target; atomic
   demos may be exactly ONE subtask; the VIDEO is ground truth, not
   the instruction text.
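
The fallback described in item 1 can be pictured roughly as follows. This is a minimal sketch, not the code from this commit: build_video_block, extract_clip_url, and embed_frames are hypothetical names, and the dict shapes of the content blocks are assumed purely for illustration.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def build_video_block(
    episode_index: int,
    use_video_url: bool,
    extract_clip_url: Callable[[int], Optional[str]],  # hypothetical helper
    embed_frames: Callable[[int], list],               # hypothetical helper
) -> list:
    """Return content blocks that give the VLM visual grounding for one episode."""
    if use_video_url:
        try:
            url = extract_clip_url(episode_index)
        except Exception as exc:
            # Treat an exception the same as "no clip produced".
            logger.warning("clip extraction raised for episode %d: %s", episode_index, exc)
            url = None
        if url:
            # Assumed block shape; the real message format may differ.
            return [{"type": "video_url", "video_url": {"url": url}}]
        logger.warning(
            "clip extraction failed for episode %d; falling back to embedded frames",
            episode_index,
        )
    # Either URLs are disabled or extraction failed: never return an empty block.
    return embed_frames(episode_index)
```

The point of the design is that every code path ends in some visual content, so the model is never prompted with the task text alone.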

SINGLE-CAMERA GROUNDING

* New VqaConfig.restrict_to_default_camera (default False). When True,
  the VQA module grounds on only the --vlm.camera_key stream instead
  of iterating over every camera, matching the plan / interjection
  modules, which already use that single camera. The whole pipeline
  can now focus on one view (e.g. observation.images.base).

run_hf_job.py updated:

* use_video_url=false + frames_per_second=2.0: embed frames directly
  (most reliable; no silent text-only failure mode) with dense
  grounding.
* vqa.restrict_to_default_camera=true: VQA on the single camera too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -180,8 +180,20 @@ class GeneralVqaModule:
         Defaults to every camera the provider exposes. Datasets with no
         cameras (or test/null providers) yield an empty list, which makes
         ``run_episode`` a no-op.
+
+        When ``config.restrict_to_default_camera`` is set, VQA grounds on
+        only the provider's default camera (the single ``--vlm.camera_key``
+        stream), matching the plan / interjection modules so the whole
+        pipeline focuses on one view.
         """
-        return list(getattr(self.frame_provider, "camera_keys", []) or [])
+        all_cameras = list(getattr(self.frame_provider, "camera_keys", []) or [])
+        if getattr(self.config, "restrict_to_default_camera", False):
+            default = getattr(self.frame_provider, "camera_key", None)
+            if default and default in all_cameras:
+                return [default]
+            if default:
+                return [default]
+        return all_cameras
 
     def _build_messages(
         self,
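
To see how the new branch behaves, the selection logic can be exercised in isolation. The sketch below is an illustration only: select_cameras is a hypothetical free function, SimpleNamespace objects stand in for the real frame provider and config, and observation.images.wrist is a made-up second camera key. The two `if default` checks in the hunk both return `[default]`, so they are collapsed into one here; as a consequence, the default camera is returned even when the provider does not list it in camera_keys.

```python
from types import SimpleNamespace


def select_cameras(frame_provider, config):
    """Standalone copy of the camera-selection logic (hypothetical name)."""
    all_cameras = list(getattr(frame_provider, "camera_keys", []) or [])
    if getattr(config, "restrict_to_default_camera", False):
        default = getattr(frame_provider, "camera_key", None)
        if default:
            # Behaviourally equivalent to the two checks in the hunk:
            # the default wins whether or not it appears in all_cameras.
            return [default]
    # Not restricted, or no default camera configured: use every camera.
    return all_cameras


provider = SimpleNamespace(
    camera_keys=["observation.images.base", "observation.images.wrist"],
    camera_key="observation.images.base",
)
restricted = SimpleNamespace(restrict_to_default_camera=True)
unrestricted = SimpleNamespace(restrict_to_default_camera=False)

print(select_cameras(provider, restricted))    # ['observation.images.base']
print(select_cameras(provider, unrestricted))  # both camera keys
```

A provider with no cameras and no default still yields an empty list, which preserves the documented no-op behaviour of ``run_episode``.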