review: fix dead-code bug, add thread safety, atomic writes, smaller cleanups

**Critical: video_for_episode was unreachable dead code.** ``video_for_episode`` was indented inside ``_decode_pyav_direct``, after its ``return`` statement — Python parsed it as a nested function that never executed. Module 1's ``_episode_video_block`` calls ``self.frame_provider.video_for_episode(record, target_count)`` on the ``use_video_url=False`` path, which would have AttributeError'd on any real dataset. Tests passed only because they used ``_StubFrameProvider`` / ``_NullProvider`` which have the method. Moved it to be a proper method of ``VideoFrameProvider`` (right after ``frames_at``). **Thread safety on VideoFrameProvider.** The executor runs Module 1/2/3 phases under a ``ThreadPoolExecutor``, so the per-instance ``_cache`` dict and the one-shot ``_warned_decode_fail`` flag were exposed to concurrent reads/writes. Added a ``threading.Lock`` field, wrapped cache reads/writes and the warn-flag check-and-set in ``with self._lock:``. Stub fixtures unaffected. **episode_clip_path is now a method of VideoFrameProvider.** Used to be a free function reaching into ``provider._meta.episodes`` and ``provider._meta.get_video_file_path`` from outside the class. As a method it just uses ``self._meta``. The only caller (Module 1) updated; no external callers. **Atomic write in LanguageColumnsWriter.** ``pq.write_table(new_table, path)`` was overwriting the parquet shard in place — a crash mid-write would corrupt the file. Now writes to a sibling ``.tmp`` and ``Path.replace`` atomically. **Smaller items:** * ``executor.py`` docstring opened with "four phases" but listed six. Now says "six phases" to match. * ``[annotations]`` extra in ``pyproject.toml`` now includes ``openai>=1.40,<2.0``. Default ``VlmConfig.backend`` is ``"openai"``, so without it ``_make_openai_client`` would ImportError on a fresh ``uv sync --extra annotations``. * ``_snap_to_frame`` was duplicated identically in ``plan_subtasks_memory.py`` and ``interjections_and_speech.py``. Promoted to ``snap_to_frame`` in ``reader.py`` (next to ``EpisodeRecord``); both modules now import it. Backwards-compat alias not needed — no external callers. * ``EpisodeRecord.frames_df()`` was re-reading the full parquet on every call. Now memoizes via a private dataclass field so repeat calls from different modules pay the cost once. Method signature unchanged. * ``_extract_first_json_object`` had a redundant ``and not escape`` guard that was dead because the prior block already handled and reset ``escape``. Replaced with a comment explaining the invariant. **Pre-existing lint cleanups surfaced once these files entered pre-commit's scope:** * dead local ``client = clients[0]`` in ``_make_openai_client`` (the real round-robin uses ``clients[rr_counter[...]]``). * ``cmd = ... if "{port}" in cmd else f"...{port}"`` ternary collapse in ``_spawn_parallel_inference_servers``. * ``seek_pts = 0 if stream.time_base is None else int(...)`` ternary collapse in ``_decode_pyav_direct``. * ``# nosec B310`` on the localhost ``urllib.request.urlopen`` probe in ``_server_is_up`` — the URL is the user-configured local-server endpoint the CLI itself spawned, not arbitrary user input. **Test added.** ``tests/annotations/test_frames.py`` pins the regression on ``VideoFrameProvider``: asserts ``video_for_episode`` and ``episode_clip_path`` are callable methods (not nested dead code or free functions), and that the ``_lock`` field is a real ``threading.Lock``. Sweep: 64 passed, 2 failed (same pre-existing module-impl bugs as before this commit). Pre-commit clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-03 04:11:24 +00:00 · 2026-05-08 11:53:43 +02:00
parent 088c8371df
commit 53c7641885
10 changed files with 284 additions and 204 deletions
--- a/src/lerobot/annotations/steerable_pipeline/reader.py
+++ b/src/lerobot/annotations/steerable_pipeline/reader.py
@@ -31,8 +31,8 @@ rows into memory at once.

 from __future__ import annotations

-from collections.abc import Iterator
-from dataclasses import dataclass
+from collections.abc import Iterator, Sequence
+from dataclasses import dataclass, field
 from pathlib import Path
 from typing import Any

@@ -53,14 +53,34 @@ class EpisodeRecord:
    row_offset: int  # row offset within the parquet file where this episode starts
    row_count: int  # number of rows for this episode

-    def frames_df(self):  # type: ignore[no-untyped-def]
-        """Lazy-load the pandas slice for this episode."""
-        import pandas as pd  # noqa: PLC0415  - deferred for optional dataset extra
+    # Memoized parquet slice — populated on first ``frames_df()`` call so
+    # repeat queries from different modules don't re-read the whole shard.
+    _frames_df_cache: Any = field(default=None, init=False, repr=False, compare=False)

-        table = pq.read_table(self.data_path)
-        df: pd.DataFrame = table.to_pandas()
-        slice_ = df.iloc[self.row_offset : self.row_offset + self.row_count].reset_index(drop=True)
-        return slice_
+    def frames_df(self):  # type: ignore[no-untyped-def]
+        """Lazy-load the pandas slice for this episode (memoized)."""
+        if self._frames_df_cache is None:
+            import pandas as pd  # noqa: PLC0415  - deferred for optional dataset extra
+
+            table = pq.read_table(self.data_path)
+            df: pd.DataFrame = table.to_pandas()
+            self._frames_df_cache = df.iloc[self.row_offset : self.row_offset + self.row_count].reset_index(
+                drop=True
+            )
+        return self._frames_df_cache
+
+
+def snap_to_frame(t: float, frame_timestamps: Sequence[float]) -> float:
+    """Snap an arbitrary float to the nearest exact source frame timestamp.
+
+    Modules use this when emitting event-style rows so the row's
+    timestamp matches a real parquet frame (event rows must land on an
+    exact frame, see PR 1's "exact event matching" rule).
+    """
+    if not frame_timestamps:
+        return float(t)
+    nearest = min(frame_timestamps, key=lambda f: abs(f - t))
+    return float(nearest)


 def _load_tasks_lookup(root: Path) -> dict[int, str]: