Quality-gate fixes after the main merge:
* UP037: drop redundant quotes from PlanConfig forward-ref annotations
(action_records / task_aug_axes) — safe under 'from __future__ import
annotations'.
* ruff format applied to config.py, executor.py, general_vqa.py,
plan_subtasks_memory.py, validator.py, lerobot_annotate.py.
No behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes for 'subtasks describe actions not in the video' plus a way
to focus the whole pipeline on one camera.
ANTI-HALLUCINATION
1. _episode_video_block: when use_video_url is set but clip extraction
fails, FALL BACK to embedded frames instead of returning an empty
block. An empty block left the VLM with zero visual grounding, so
it invented subtasks from the task text alone — the likely root
cause of hallucinated steps. Now logs a warning and embeds frames.
2. module_1_subtasks.txt gains a GROUNDING preamble (overrides all
other rules): label only motion visible in specific frames; never
invent/anticipate/pad; max_steps is a CEILING not a target; atomic
demos may be exactly ONE subtask; the VIDEO is ground truth, not
the instruction text.
SINGLE-CAMERA GROUNDING
* New VqaConfig.restrict_to_default_camera (default False). When True,
the VQA module grounds on only the --vlm.camera_key stream instead
of iterating every camera — matching the plan / interjection
modules, which already use that single camera. Now the whole
pipeline can focus on one view (e.g. observation.images.base).
run_hf_job.py updated:
* use_video_url=false + frames_per_second=2.0 — embed frames directly
(most reliable; no silent text-only failure mode) with dense
grounding.
* vqa.restrict_to_default_camera=true — VQA on the single camera too.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- name the three modules everywhere (plan / interjections / vqa) instead
of module_1/2/3 — config classes, config fields, executor params,
staging keys and phase names now carry the module name
- rename examples/annotation -> examples/annotations; add the Apache
header to run_hf_job.py
- drop the unused GeneralVqaModule._generate_one
- remove "PR 1" references from comments/docstrings
- frames.py: rely on the always-defined LeRobotDatasetMetadata.camera_keys
- executor.py: read/write meta/info.json via load_info / write_info
- reader.py: load meta/tasks.parquet via io_utils.load_tasks
- make --push_to_hub a bool; push the annotated dataset back to --repo_id
- move the on-disk test dataset builder into tests/fixtures
(build_annotation_dataset); run_e2e_smoke reuses it
- clarify in the docs that the vqa module grounds each pair on a single
frame (K = per-tick anchor count)
- hoist stdlib dynamic imports to module scope
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Module 3 fast-pathed out (50 episodes in 0.6s) when
``frame_provider.camera_keys`` came back empty even though Module 1/2
worked, because they use ``frame_provider.camera_key`` (singular) and
were happy with the explicit ``--vlm.camera_key=...`` override.
Two fixes:
- ``frames.py``: read ``meta.camera_keys`` (covers both video- and
image-stored cameras) instead of ``meta.video_keys`` (video-only),
matching :class:`LeRobotDatasetMetadata`'s canonical accessor. If
metadata still surfaces nothing but the caller explicitly passed
``--vlm.camera_key=<key>``, fall back to ``[<key>]`` — the key is by
definition known to exist on the dataset.
- ``general_vqa.py``: emit a one-time WARNING log when Module 3 sees
zero cameras so this never silently produces zero VQA again.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Module 3 now produces one (vqa, user) + (vqa, assistant) pair per
emission tick *per camera* rather than only against the dataset's first
camera. Each emitted row carries the `camera` field added in PR 1
(language-columns), so the resolver can disambiguate per-camera VQA via
`emitted_at(t, style=vqa, role=assistant, camera=...)` without ambiguity.
- `frames.py`: `FrameProvider` Protocol gains a `camera_keys` property
and a `camera_key=` argument on `frames_at` / `video_for_episode`.
`VideoFrameProvider` exposes every `observation.images.*` key the
dataset declares (not just the first) and keys its decode cache on
`(episode, camera, timestamp)` so per-camera reads don't collide.
Module 1 / 2 keep their old single-camera behaviour by leaving
`camera_key=None` (falls back to the default camera).
- `modules/general_vqa.py`: `run_episode` iterates `frame_provider
.camera_keys` for each emission tick, builds one prompt per camera,
batches all of them through the VLM, and stamps the resulting rows
with `camera=<that key>`. Empty `camera_keys` (null provider) makes
the module a no-op rather than silently emitting untagged rows.
- `writer.py`: `_normalize_persistent_row` / `_normalize_event_row`
carry `camera` through and call `validate_camera_field` so the
invariant is enforced at the writer boundary. Event sort key now
includes `camera` for deterministic ordering when several cameras
share `(timestamp, style, role)`. `speech_atom` sets `camera=None`.
- `validator.py`: `StagingValidator` gains a `dataset_camera_keys`
field; `_check_camera_field` enforces the invariant and cross-checks
every view-dependent row's `camera` against the dataset's known video
keys. New `_check_vqa_uniqueness_per_frame_camera` flags duplicate
`(vqa, role)` pairs at the same `(t, camera)`.
- `lerobot_annotate.py`: passes the live frame provider's
`camera_keys` into the validator so the cross-check uses the actual
dataset camera set.
- Tests: `_StubFrameProvider` exposes `camera_keys` and accepts the new
`camera_key=` kwarg. `test_module3_vqa_unique_per_frame_and_camera`
configures two cameras and asserts both are represented, that every
emitted row has a `camera` tag, and that uniqueness holds per
`(timestamp, camera, role)`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds vlm.client_concurrency (default 16) which uses a ThreadPoolExecutor
to fan out batched chat.completions calls. vllm batches them internally
on the server side, giving big throughput wins on a single TP=1 server
without needing DP/TP and the NCCL setup it requires.
Module 3 now batches all per-episode VQA calls into a single
generate_json invocation so they fire in parallel.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the visual-grounding gap flagged after the initial PR review:
modules now decode actual camera frames at the relevant timestamps and
attach them as `{"type":"image", "image":<PIL>}` content blocks to the
VLM prompts.
- New `frames.py`:
- `FrameProvider` Protocol; `VideoFrameProvider` decodes from the
dataset's first `observation.images.*` stream via
`LeRobotDatasetMetadata.get_video_file_path` and
`decode_video_frames`, with the same `from_timestamp` shift the main
dataset uses.
- Per-process LRU cache so co-timestamped Module 1 plan-update + Module
2 calls share decode work.
- `make_frame_provider` falls back to a null provider when the dataset
has no video tracks → text-only prompts (graceful absence).
- Modules 1/2/3 take an optional `frame_provider` (default null) and
prepend image blocks before the text block.
- Module 1 attaches `keyframes_per_episode` keyframes to the subtask
decomposition prompt.
- Module 2 attaches the frame at the interjection timestamp.
- Module 3 attaches the exact emission frame to each VQA pair.
- VlmConfig: backend now defaults to `vllm`; default model is
`Qwen/Qwen3.6-27B-FP8`. New knobs: `--vlm.tensor_parallel_size`,
`--vlm.camera_key` (override the keyframe stream).
- `_make_vllm_client` honours `tensor_parallel_size` so 27B-FP8 sharded
on 2× GPUs works out of the box.
- `test_module3_attaches_frame_image_block_to_prompt` asserts modules
emit one image block per VQA prompt at the exact emission timestamp.
- Docs: example switched to `imstevenpmwork/super_poulain_draft` +
Qwen3.6-27B-FP8 + tensor_parallel_size=2; documents the keyframe
attachment behaviour and the no-video fallback.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>