lerobot-clone

mirror of https://github.com/huggingface/lerobot.git synced 2026-06-02 11:51:25 +00:00

Author	SHA1	Message	Date
Pepijn	f72b28738a	fix(annotate): default keyframe decode to ffmpeg CLI (thread-safe) The decoder chain tried torchcodec first, then ffmpeg. torchcodec is not thread-safe: under the executor's 16-wide concurrent decode in the interjections phase it SIGSEGVs (exit 139) before the ffmpeg fallback is ever reached — uncatchable, so it kills the whole job. Default the auto chain to ffmpeg only. Per-frame ffmpeg decode runs in an isolated child process: crash-safe and concurrency-safe (the plan phase already proved 16 parallel ffmpeg subprocesses are fine). torchcodec / pyav remain available via an explicit video_backend. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 16:40:29 +02:00
Pepijn	1bd53cc7da	fix(annotate): decode keyframes via ffmpeg CLI fallback PyAV segfaulted (exit 139) decoding the AV1 streams modern LeRobot datasets use — a SIGSEGV that the per-episode try/except cannot catch, killing the whole job when the interjections phase started. Replace the PyAV fallback with _decode_frames_ffmpeg, which shells out to the ffmpeg CLI: a full ffmpeg build decodes AV1, and a child-process crash is a catchable non-zero exit rather than a segfault. Decoder chain is now torchcodec -> ffmpeg. _decode_frames_av stays available behind video_backend="pyav" for callers that want it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 16:08:31 +02:00
Pepijn	7128bb1769	fix(annotate): decode keyframes via PyAV directly The pyav fallback routed through lerobot's decode_video_frames(backend= "pyav"), which uses torchvision.io.VideoReader — removed in torchvision 0.23+. On modern torch stacks (e.g. vllm-openai with torchvision 0.26) both torchcodec and that path fail, leaving interjection/vqa prompts without visual context. Add _decode_frames_av: a self-contained PyAV decoder that picks the nearest frame per timestamp. It is the always-available tail of the decoder chain (torchcodec -> pyav) and the target of --video_backend=pyav. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 15:45:04 +02:00
Pepijn	31e0c15e55	fix(annotate): pyav fallback when torchcodec keyframe decode fails VideoFrameProvider decoded keyframes via torchcodec only. Some containers (e.g. vllm-openai) ship a torchcodec that cannot push packets to the decoder ("Operation not permitted"), silently degrading interjection/vqa prompts to no visual context. _decode now retries with pyav when the default backend raises, and a new `video_backend` config field lets callers pin the backend explicitly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 15:23:53 +02:00
Pepijn Kooijmans	9dfc9084e1	review: decode keyframes via video_utils.decode_video_frames Addresses three of CarolinePascal's frames.py comments (the fourth, the subprocess re-encode, waits on #3611): - replace the bespoke _decode_pyav_direct PyAV decoder with lerobot.datasets.video_utils.decode_video_frames (torchcodec backend, PyAV fallback) — torchvision's VideoReader removal no longer applies - frames flow through the provider as torch.Tensor (C, H, W uint8); PIL is materialised only at the VLM-message boundary in to_image_blocks / to_video_block, where the chat backends need it - _decode now returns exactly one frame per timestamp (or [] on failure), so frames_at pairs them with strict=True Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:00:38 +02:00
Pepijn Kooijmans	fd18beb3a1	review: address CarolinePascal feedback - name the three modules everywhere (plan / interjections / vqa) instead of module_1/2/3 — config classes, config fields, executor params, staging keys and phase names now carry the module name - rename examples/annotation -> examples/annotations; add the Apache header to run_hf_job.py - drop the unused GeneralVqaModule._generate_one - remove "PR 1" references from comments/docstrings - frames.py: rely on the always-defined LeRobotDatasetMetadata.camera_keys - executor.py: read/write meta/info.json via load_info / write_info - reader.py: load meta/tasks.parquet via io_utils.load_tasks - make --push_to_hub a bool; push the annotated dataset back to --repo_id - move the on-disk test dataset builder into tests/fixtures (build_annotation_dataset); run_e2e_smoke reuses it - clarify in the docs that the vqa module grounds each pair on a single frame (K = per-tick anchor count) - hoist stdlib dynamic imports to module scope Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 12:03:25 +02:00
Pepijn	53c7641885	review: fix dead-code bug, add thread safety, atomic writes, smaller cleanups Critical: video_for_episode was unreachable dead code. ``video_for_episode`` was indented inside ``_decode_pyav_direct``, after its ``return`` statement — Python parsed it as a nested function that never executed. Module 1's ``_episode_video_block`` calls ``self.frame_provider.video_for_episode(record, target_count)`` on the ``use_video_url=False`` path, which would have AttributeError'd on any real dataset. Tests passed only because they used ``_StubFrameProvider`` / ``_NullProvider`` which have the method. Moved it to be a proper method of ``VideoFrameProvider`` (right after ``frames_at``). Thread safety on VideoFrameProvider. The executor runs Module 1/2/3 phases under a ``ThreadPoolExecutor``, so the per-instance ``_cache`` dict and the one-shot ``_warned_decode_fail`` flag were exposed to concurrent reads/writes. Added a ``threading.Lock`` field, wrapped cache reads/writes and the warn-flag check-and-set in ``with self._lock:``. Stub fixtures unaffected. episode_clip_path is now a method of VideoFrameProvider. Used to be a free function reaching into ``provider._meta.episodes`` and ``provider._meta.get_video_file_path`` from outside the class. As a method it just uses ``self._meta``. The only caller (Module 1) updated; no external callers. Atomic write in LanguageColumnsWriter. ``pq.write_table(new_table, path)`` was overwriting the parquet shard in place — a crash mid-write would corrupt the file. Now writes to a sibling ``.tmp`` and ``Path.replace`` atomically. Smaller items: * ``executor.py`` docstring opened with "four phases" but listed six. Now says "six phases" to match. * ``[annotations]`` extra in ``pyproject.toml`` now includes ``openai>=1.40,<2.0``. Default ``VlmConfig.backend`` is ``"openai"``, so without it ``_make_openai_client`` would ImportError on a fresh ``uv sync --extra annotations``. 
* ``_snap_to_frame`` was duplicated identically in ``plan_subtasks_memory.py`` and ``interjections_and_speech.py``. Promoted to ``snap_to_frame`` in ``reader.py`` (next to ``EpisodeRecord``); both modules now import it. Backwards-compat alias not needed — no external callers. * ``EpisodeRecord.frames_df()`` was re-reading the full parquet on every call. Now memoizes via a private dataclass field so repeat calls from different modules pay the cost once. Method signature unchanged. * ``_extract_first_json_object`` had a redundant ``and not escape`` guard that was dead because the prior block already handled and reset ``escape``. Replaced with a comment explaining the invariant. Pre-existing lint cleanups surfaced once these files entered pre-commit's scope: * dead local ``client = clients[0]`` in ``_make_openai_client`` (the real round-robin uses ``clients[rr_counter[...]]``). * ``cmd = ... if "{port}" in cmd else f"...{port}"`` ternary collapse in ``_spawn_parallel_inference_servers``. * ``seek_pts = 0 if stream.time_base is None else int(...)`` ternary collapse in ``_decode_pyav_direct``. * ``# nosec B310`` on the localhost ``urllib.request.urlopen`` probe in ``_server_is_up`` — the URL is the user-configured local-server endpoint the CLI itself spawned, not arbitrary user input. Test added. ``tests/annotations/test_frames.py`` pins the regression on ``VideoFrameProvider``: asserts ``video_for_episode`` and ``episode_clip_path`` are callable methods (not nested dead code or free functions), and that the ``_lock`` field is a real ``threading.Lock``. Sweep: 64 passed, 2 failed (same pre-existing module-impl bugs as before this commit). Pre-commit clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 11:53:43 +02:00
Pepijn	0f6e3230df	fix(annotate): decode video frames with PyAV directly ``lerobot.datasets.video_utils.decode_video_frames`` routes ``backend="pyav"`` through ``decode_video_frames_torchvision`` → ``torchvision.io.VideoReader``, but ``VideoReader`` was removed in torchvision >= 0.22 (the vllm/vllm-openai:latest container ships with torchvision 0.25). That made every Module 3 frame decode raise ``AttributeError: module 'torchvision.io' has no attribute 'VideoReader'``, which the previous catch-all silently turned into an empty image list, which then made every Module 3 prompt skip via the ``not _has_image_block(messages)`` branch and produce zero VQA rows. Bypass ``video_utils`` entirely. The annotation pipeline only needs a handful of PIL frames per (episode, ts), so a direct PyAV decode is both simpler and insulated from torchvision API churn. ``av`` is already in the install set, no new dependency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:48:36 +02:00
Pepijn	2f2e42c4aa	log(annotate): warn loudly on first video decode failure VideoFrameProvider._decode used to swallow every exception silently and return []. That made Module 3 (VQA) produce zero rows whenever local video decoding broke (codec, backend, missing file, ...) because every prompt got skipped via the ``not _has_image_block(messages)`` branch in general_vqa.py — without any signal in the job log. Log the first failure with full exception info (subsequent failures stay quiet to avoid log spam) so this fast-path is debuggable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:48:36 +02:00
Pepijn	e064cfcb04	fix(annotate): seed Module 3 cameras from camera_keys + camera_key fallback Module 3 fast-pathed out (50 episodes in 0.6s) when ``frame_provider.camera_keys`` came back empty even though Module 1/2 worked, because they use ``frame_provider.camera_key`` (singular) and were happy with the explicit ``--vlm.camera_key=...`` override. Two fixes: - ``frames.py``: read ``meta.camera_keys`` (covers both video- and image-stored cameras) instead of ``meta.video_keys`` (video-only), matching :class:`LeRobotDatasetMetadata`'s canonical accessor. If metadata still surfaces nothing but the caller explicitly passed ``--vlm.camera_key=<key>``, fall back to ``[<key>]`` — the key is by definition known to exist on the dataset. - ``general_vqa.py``: emit a one-time WARNING log when Module 3 sees zero cameras so this never silently produces zero VQA again. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:48:36 +02:00
Pepijn	1217fdb6f0	feat(annotate): emit VQA per-camera and propagate camera field Module 3 now produces one (vqa, user) + (vqa, assistant) pair per emission tick per camera rather than only against the dataset's first camera. Each emitted row carries the `camera` field added in PR 1 (language-columns), so the resolver can disambiguate per-camera VQA via `emitted_at(t, style=vqa, role=assistant, camera=...)` without ambiguity. - `frames.py`: `FrameProvider` Protocol gains a `camera_keys` property and a `camera_key=` argument on `frames_at` / `video_for_episode`. `VideoFrameProvider` exposes every `observation.images.*` key the dataset declares (not just the first) and keys its decode cache on `(episode, camera, timestamp)` so per-camera reads don't collide. Module 1 / 2 keep their old single-camera behaviour by leaving `camera_key=None` (falls back to the default camera). - `modules/general_vqa.py`: `run_episode` iterates `frame_provider .camera_keys` for each emission tick, builds one prompt per camera, batches all of them through the VLM, and stamps the resulting rows with `camera=<that key>`. Empty `camera_keys` (null provider) makes the module a no-op rather than silently emitting untagged rows. - `writer.py`: `_normalize_persistent_row` / `_normalize_event_row` carry `camera` through and call `validate_camera_field` so the invariant is enforced at the writer boundary. Event sort key now includes `camera` for deterministic ordering when several cameras share `(timestamp, style, role)`. `speech_atom` sets `camera=None`. - `validator.py`: `StagingValidator` gains a `dataset_camera_keys` field; `_check_camera_field` enforces the invariant and cross-checks every view-dependent row's `camera` against the dataset's known video keys. New `_check_vqa_uniqueness_per_frame_camera` flags duplicate `(vqa, role)` pairs at the same `(t, camera)`. 
- `lerobot_annotate.py`: passes the live frame provider's `camera_keys` into the validator so the cross-check uses the actual dataset camera set. - Tests: `_StubFrameProvider` exposes `camera_keys` and accepts the new `camera_key=` kwarg. `test_module3_vqa_unique_per_frame_and_camera` configures two cameras and asserts both are represented, that every emitted row has a `camera` tag, and that uniqueness holds per `(timestamp, camera, role)`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:48:36 +02:00
Pepijn	d0388e1142	fix(annotate): transcode subclips to H.264 instead of stream-copy Modern LeRobot datasets store videos in AV1, which vllm's libav build cannot decode (the video processor returns 0 frames and downstream chokes with ZeroDivisionError). Re-encode each per-episode subclip with libx264 (preset ultrafast, crf 23) so the resulting mp4 is universally decodable. Strip audio with -an for a smaller payload. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:48:36 +02:00
Pepijn	b325475b38	feat(annotate): video_url block for openai backend Module 1 can now send the episode's actual mp4 file as a video_url content block instead of pre-decoded frames. The server (transformers serve / vllm serve / ktransformers serve) handles frame sampling at the configured fps. Default fps=1 (one frame per second is enough for subtask-boundary detection on manipulation episodes). A per-episode subclip is extracted to <root>/.annotate_staging/.video_clips/ via ffmpeg stream-copy (no re-encode) so the model sees only this episode's frames, not the whole shard. Enable with --module_1.use_video_url=true (and --vlm.backend=openai). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:48:34 +02:00
Pepijn	712d63abbd	fix(annotate): tolerate decoder returning fewer frames than requested pyav (and sometimes torchcodec) decode can return fewer frames than requested timestamps when some timestamps fall outside the video file's content range. Drop the strict=True on the zip and rely on the None-filter to discard missing frames. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:48:34 +02:00
Pepijn	6653999983	fix(annotate): default video decode backend to pyav torchcodec's __init__ bad-allocs on the cu128/torch-2.8 stack in some environments (Lustre/conda combos). The annotation pipeline calls decode_video_frames many times per episode, so this is a hard blocker. Default to pyav (always available via the av package) and let users opt back into torchcodec via LEROBOT_VIDEO_BACKEND=torchcodec. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:48:34 +02:00
Pepijn	663fff0ae2	feat(annotate): Module 1 sees the whole episode as one video block Replaces keyframe sampling with a single Qwen-VL video block covering the whole demonstration. The model pools temporally itself and chooses where to cut subtasks — no stride, no count, no keyframe count knob to tune. - frames.py: ``FrameProvider`` gains ``video_for_episode(record, max_frames)``; ``VideoFrameProvider`` samples up to ``max_frames`` uniformly across the episode duration; ``_NullProvider`` returns [] for the no-video fallback. New ``to_video_block`` helper. - Module 1: drops keyframe sampling. The subtask prompt now goes out as ``[{"type":"video", "video":[<frames>]}, {"type":"text", ...}]`` and the prompt template asks the model to "watch the whole clip, then segment it" with cut points decided from gripper/contact/regrasp events the model sees. - Module1Config: ``keyframes_per_episode`` removed; replaced with ``max_video_frames: int = 32`` (model-capacity bound, not annotation logic). - Test: ``test_module1_attaches_video_block_to_subtask_prompt`` locks in the single-video-block invariant. - Stub-VLM markers updated: tests now key on "atomic subtasks" instead of the old "Decompose the demonstration" phrase that no longer appears in the prompt. - Docs: updated to describe the whole-episode video-block behavior and the no-video fallback. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:48:33 +02:00
Pepijn	9d6af804bf	feat(annotate): attach camera keyframes to module prompts; default to Qwen3.6-27B-FP8 Closes the visual-grounding gap flagged after the initial PR review: modules now decode actual camera frames at the relevant timestamps and attach them as `{"type":"image", "image":<PIL>}` content blocks to the VLM prompts. - New `frames.py`: - `FrameProvider` Protocol; `VideoFrameProvider` decodes from the dataset's first `observation.images.*` stream via `LeRobotDatasetMetadata.get_video_file_path` and `decode_video_frames`, with the same `from_timestamp` shift the main dataset uses. - Per-process LRU cache so co-timestamped Module 1 plan-update + Module 2 calls share decode work. - `make_frame_provider` falls back to a null provider when the dataset has no video tracks → text-only prompts (graceful absence). - Modules 1/2/3 take an optional `frame_provider` (default null) and prepend image blocks before the text block. - Module 1 attaches `keyframes_per_episode` keyframes to the subtask decomposition prompt. - Module 2 attaches the frame at the interjection timestamp. - Module 3 attaches the exact emission frame to each VQA pair. - VlmConfig: backend now defaults to `vllm`; default model is `Qwen/Qwen3.6-27B-FP8`. New knobs: `--vlm.tensor_parallel_size`, `--vlm.camera_key` (override the keyframe stream). - `_make_vllm_client` honours `tensor_parallel_size` so 27B-FP8 sharded on 2× GPUs works out of the box. - `test_module3_attaches_frame_image_block_to_prompt` asserts modules emit one image block per VQA prompt at the exact emission timestamp. - Docs: example switched to `imstevenpmwork/super_poulain_draft` + Qwen3.6-27B-FP8 + tensor_parallel_size=2; documents the keyframe attachment behaviour and the no-video fallback. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:48:33 +02:00
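
Commits f72b28738a and 1bd53cc7da above describe decoding each keyframe in an isolated ffmpeg child process, so that a decoder crash becomes a catchable non-zero exit instead of a SIGSEGV in the parent. A minimal sketch of that idea follows. It is not the repository's `_decode_frames_ffmpeg`: the function names `ffmpeg_frame_cmd` and `decode_frame_png`, the PNG-over-pipe output format, and the timeout are assumptions made for illustration.

```python
import subprocess
from typing import Optional


def ffmpeg_frame_cmd(video_path: str, timestamp_s: float, ffmpeg_bin: str = "ffmpeg") -> list[str]:
    """Build an ffmpeg CLI call that writes one PNG frame near timestamp_s to stdout."""
    return [
        ffmpeg_bin, "-nostdin", "-loglevel", "error",
        "-ss", f"{timestamp_s:.3f}",    # seek before opening the input (fast seek)
        "-i", video_path,
        "-frames:v", "1",               # emit a single video frame
        "-f", "image2pipe", "-c:v", "png",
        "pipe:1",                       # write the encoded frame to stdout
    ]


def decode_frame_png(
    video_path: str,
    timestamp_s: float,
    ffmpeg_bin: str = "ffmpeg",
    timeout_s: float = 30.0,
) -> Optional[bytes]:
    """Return PNG bytes for one frame, or None if the child process failed.

    The decode happens in a separate process, so a crash there (for example a
    negative return code after a fatal signal on POSIX) is observed here as an
    ordinary failure rather than taking down the caller.
    """
    cmd = ffmpeg_frame_cmd(video_path, timestamp_s, ffmpeg_bin)
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout_s, check=False)
    except (OSError, subprocess.TimeoutExpired):
        # Missing binary, permission problem, or a hung decoder.
        return None
    if proc.returncode != 0 or not proc.stdout:
        return None
    return proc.stdout
```

Because each call owns its own child process, such a helper can be invoked from several worker threads without sharing decoder state. The caller still has to turn the PNG bytes into whatever image type it needs.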

17 Commits
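
Commit 53c7641885 mentions replacing an in-place parquet overwrite with a write to a sibling `.tmp` file followed by `Path.replace`. The pattern can be sketched with the standard library alone. This is an illustration under assumptions, not the code in `LanguageColumnsWriter`: the helper name `atomic_write_bytes` is hypothetical, and raw bytes stand in for the `pq.write_table` call.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(path: Path, payload: bytes) -> None:
    """Write payload to a sibling .tmp file, then swap it into place.

    A crash during the write leaves only the .tmp file damaged; the file at
    `path` stays either the old version or the complete new one.
    """
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "wb") as fh:
        fh.write(payload)
        fh.flush()
        os.fsync(fh.fileno())  # push the bytes to disk before the rename
    # Atomic on POSIX when both paths live on the same filesystem.
    tmp.replace(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = Path(d) / "shard.bin"
        atomic_write_bytes(target, b"v1")
        atomic_write_bytes(target, b"v2")  # overwrite without an in-place write
        print(target.read_bytes())
```

Keeping the temporary file next to the target, rather than in a system temp directory, is what keeps the rename on one filesystem.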