feat: language annotation pipeline (PR 2/3)
Adds the steerable annotation pipeline (`lerobot-annotate`) that populates
the `language_persistent` and `language_events` columns introduced in
PR 1 directly into `data/chunk-*/file-*.parquet`. No flavor namespace,
no sidecar tree.
Modules produced:
- Module 1 (plan_subtasks_memory): Pi0.7-style subtasks, plan (init +
refresh on interjection), MEM-style memory at subtask boundaries.
- Module 2 (interjections_and_speech): t=0 speech-only acknowledgement,
mid-episode paired interjection + speech tool-call atom.
- Module 3 (general_vqa): bbox/keypoint/count/attribute/spatial pairs at
configurable cadence with one-retry JSON validation.
Writer enforces: per-episode persistent identity, exact-frame event
timestamps, column routing per `column_for_style`, dataset-level `tools`
column with the `say` schema, drops legacy `subtask_index`. Validator
runs against staged JSONL artifacts before the writer rewrites parquet.
Adds `lerobot-annotate` console script, `annotations` extra (datatrove +
optional vllm), `make annotation-e2e` opt-in smoke target, and
`docs/source/annotation_pipeline.mdx`.
Branched from PR 1 (`feat/language-columns`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 16:22:51 +02:00
|
|
|
#!/usr/bin/env python
|
|
|
|
|
|
|
|
|
|
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
|
|
|
|
|
#
|
|
|
|
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
|
|
|
|
# you may not use this file except in compliance with the License.
|
|
|
|
|
# You may obtain a copy of the License at
|
|
|
|
|
#
|
|
|
|
|
# http://www.apache.org/licenses/LICENSE-2.0
|
|
|
|
|
#
|
|
|
|
|
# Unless required by applicable law or agreed to in writing, software
|
|
|
|
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
|
|
|
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
|
|
|
# See the License for the specific language governing permissions and
|
|
|
|
|
# limitations under the License.
|
|
|
|
|
"""Datatrove-shaped reader.
|
|
|
|
|
|
|
|
|
|
The reader walks ``data/chunk-*/file-*.parquet`` and yields one record per
|
|
|
|
|
episode containing:
|
|
|
|
|
|
|
|
|
|
- ``episode_index``: int
|
|
|
|
|
- ``frame_timestamps``: tuple[float, ...]
|
|
|
|
|
- ``frame_indices``: tuple[int, ...]
|
|
|
|
|
- ``episode_task``: str (canonical task from ``meta/tasks.parquet``)
|
|
|
|
|
- ``data_path``: pathlib.Path of the source parquet shard
|
|
|
|
|
- ``frames_df``: pandas.DataFrame slice for the episode (only loaded on demand)
|
|
|
|
|
|
|
|
|
|
This shape lets each module operate per-episode without loading all parquet
|
refactor(annotate): delegate distribution to HF Jobs; drop SLURM/local switch
The executor previously claimed it would "optionally hand off" to
datatrove's LocalPipelineExecutor or SlurmPipelineExecutor — but it
already runs phases inline in every code path, and HF Jobs (see
``examples/annotation/run_hf_job.py``) is the actual distribution
strategy. Stop pretending we have an executor selector.
* `executor.py`: drop `select_executor_class`, the "kind" log line, and
the references to LocalPipelineExecutor / SlurmPipelineExecutor.
Module docstring now says distribution is delegated to HF Jobs.
* `config.py`: drop `auto_threshold`, `force_local`, `slurm_partition`,
`slurm_gpus`, `slurm_time`, `workers`. `ExecutorConfig` keeps only
`episode_parallelism`. While here, prune the longer "why" docstrings
on every field down to the load-bearing bits — full story moves to
`docs/source/annotation_pipeline.mdx`.
* `pyproject.toml`: drop `datatrove>=0.4.0,<2.0.0` from the
`[annotations]` extra; the dep was only there for the (never used)
cluster executors. Comment block notes the new HF-Jobs delegation.
* `reader.py`, `lerobot_annotate.py`: drop their own datatrove /
flavor-namespace mentions.
* `docs/source/annotation_pipeline.mdx`:
- remove the flavor-namespace / sidecar paragraph (out of scope —
"multiple revisions = multiple copies" is dataset-level policy);
- remove the "writer drops the legacy `subtask_index` column" note
(already covered by PR 1's intentional-break call-out);
- remove the chat-template + `apply_chat_template(messages, tools=...)`
line (covered by Tools doc);
- replace the "executor picks Local vs Slurm" paragraph with
`--executor.episode_parallelism` and a pointer to HF Jobs;
- rewrite the style→recipe section to talk about "recipes" generically
instead of pinning a specific YAML;
- add a "Running on Hugging Face Jobs" section pointing at
`examples/annotation/run_hf_job.py`;
- add a "Running locally" example matching the CLI's docstring
(`uv run lerobot-annotate --root=... --vlm.model_id=...`);
- extend the paper-inspirations list with Pi0.7 and Steerable VLA
Policies (Zhao 2025) for Module 3.
Tests: same 3 pre-existing failures as before this commit (2 module
assertions still in flight; 1 carryover from PR 1). 41/44 pass.
Pre-commit clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 11:09:22 +02:00
|
|
|
rows into memory at once.
|
feat: language annotation pipeline (PR 2/3)
Adds the steerable annotation pipeline (`lerobot-annotate`) that populates
the `language_persistent` and `language_events` columns introduced in
PR 1 directly into `data/chunk-*/file-*.parquet`. No flavor namespace,
no sidecar tree.
Modules produced:
- Module 1 (plan_subtasks_memory): Pi0.7-style subtasks, plan (init +
refresh on interjection), MEM-style memory at subtask boundaries.
- Module 2 (interjections_and_speech): t=0 speech-only acknowledgement,
mid-episode paired interjection + speech tool-call atom.
- Module 3 (general_vqa): bbox/keypoint/count/attribute/spatial pairs at
configurable cadence with one-retry JSON validation.
Writer enforces: per-episode persistent identity, exact-frame event
timestamps, column routing per `column_for_style`, dataset-level `tools`
column with the `say` schema, drops legacy `subtask_index`. Validator
runs against staged JSONL artifacts before the writer rewrites parquet.
Adds `lerobot-annotate` console script, `annotations` extra (datatrove +
optional vllm), `make annotation-e2e` opt-in smoke target, and
`docs/source/annotation_pipeline.mdx`.
Branched from PR 1 (`feat/language-columns`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 16:22:51 +02:00
|
|
|
"""
|
|
|
|
|
|
|
|
|
|
from __future__ import annotations
|
|
|
|
|
|
review: fix dead-code bug, add thread safety, atomic writes, smaller cleanups
**Critical: video_for_episode was unreachable dead code.**
``video_for_episode`` was indented inside ``_decode_pyav_direct``, after
its ``return`` statement — Python parsed it as a nested function that
never executed. Module 1's ``_episode_video_block`` calls
``self.frame_provider.video_for_episode(record, target_count)`` on the
``use_video_url=False`` path, which would have AttributeError'd on any
real dataset. Tests passed only because they used ``_StubFrameProvider``
/ ``_NullProvider`` which have the method. Moved it to be a proper
method of ``VideoFrameProvider`` (right after ``frames_at``).
**Thread safety on VideoFrameProvider.**
The executor runs Module 1/2/3 phases under a ``ThreadPoolExecutor``, so
the per-instance ``_cache`` dict and the one-shot ``_warned_decode_fail``
flag were exposed to concurrent reads/writes. Added a ``threading.Lock``
field, wrapped cache reads/writes and the warn-flag check-and-set in
``with self._lock:``. Stub fixtures unaffected.
**episode_clip_path is now a method of VideoFrameProvider.**
Used to be a free function reaching into ``provider._meta.episodes`` and
``provider._meta.get_video_file_path`` from outside the class. As a
method it just uses ``self._meta``. The only caller (Module 1) updated;
no external callers.
**Atomic write in LanguageColumnsWriter.**
``pq.write_table(new_table, path)`` was overwriting the parquet shard
in place — a crash mid-write would corrupt the file. Now writes to a
sibling ``.tmp`` and ``Path.replace`` atomically.
**Smaller items:**
* ``executor.py`` docstring opened with "four phases" but listed six.
Now says "six phases" to match.
* ``[annotations]`` extra in ``pyproject.toml`` now includes
``openai>=1.40,<2.0``. Default ``VlmConfig.backend`` is ``"openai"``,
so without it ``_make_openai_client`` would ImportError on a fresh
``uv sync --extra annotations``.
* ``_snap_to_frame`` was duplicated identically in
``plan_subtasks_memory.py`` and ``interjections_and_speech.py``.
Promoted to ``snap_to_frame`` in ``reader.py`` (next to
``EpisodeRecord``); both modules now import it. Backwards-compat alias
not needed — no external callers.
* ``EpisodeRecord.frames_df()`` was re-reading the full parquet on every
call. Now memoizes via a private dataclass field so repeat calls from
different modules pay the cost once. Method signature unchanged.
* ``_extract_first_json_object`` had a redundant ``and not escape`` guard
that was dead because the prior block already handled and reset
``escape``. Replaced with a comment explaining the invariant.
**Pre-existing lint cleanups surfaced once these files entered
pre-commit's scope:**
* dead local ``client = clients[0]`` in ``_make_openai_client`` (the
real round-robin uses ``clients[rr_counter[...]]``).
* ``cmd = ... if "{port}" in cmd else f"...{port}"`` ternary collapse in
``_spawn_parallel_inference_servers``.
* ``seek_pts = 0 if stream.time_base is None else int(...)`` ternary
collapse in ``_decode_pyav_direct``.
* ``# nosec B310`` on the localhost ``urllib.request.urlopen`` probe in
``_server_is_up`` — the URL is the user-configured local-server endpoint
the CLI itself spawned, not arbitrary user input.
**Test added.**
``tests/annotations/test_frames.py`` pins the regression on
``VideoFrameProvider``: asserts ``video_for_episode`` and
``episode_clip_path`` are callable methods (not nested dead code or
free functions), and that the ``_lock`` field is a real
``threading.Lock``.
Sweep: 64 passed, 2 failed (same pre-existing module-impl bugs as
before this commit). Pre-commit clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 11:53:43 +02:00
|
|
|
from collections.abc import Iterator, Sequence
|
|
|
|
|
from dataclasses import dataclass, field
|
feat: language annotation pipeline (PR 2/3)
Adds the steerable annotation pipeline (`lerobot-annotate`) that populates
the `language_persistent` and `language_events` columns introduced in
PR 1 directly into `data/chunk-*/file-*.parquet`. No flavor namespace,
no sidecar tree.
Modules produced:
- Module 1 (plan_subtasks_memory): Pi0.7-style subtasks, plan (init +
refresh on interjection), MEM-style memory at subtask boundaries.
- Module 2 (interjections_and_speech): t=0 speech-only acknowledgement,
mid-episode paired interjection + speech tool-call atom.
- Module 3 (general_vqa): bbox/keypoint/count/attribute/spatial pairs at
configurable cadence with one-retry JSON validation.
Writer enforces: per-episode persistent identity, exact-frame event
timestamps, column routing per `column_for_style`, dataset-level `tools`
column with the `say` schema, drops legacy `subtask_index`. Validator
runs against staged JSONL artifacts before the writer rewrites parquet.
Adds `lerobot-annotate` console script, `annotations` extra (datatrove +
optional vllm), `make annotation-e2e` opt-in smoke target, and
`docs/source/annotation_pipeline.mdx`.
Branched from PR 1 (`feat/language-columns`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 16:22:51 +02:00
|
|
|
from pathlib import Path
|
|
|
|
|
from typing import Any
|
|
|
|
|
|
|
|
|
|
import pyarrow.parquet as pq
|
|
|
|
|
|
2026-05-18 12:03:25 +02:00
|
|
|
from lerobot.datasets.io_utils import load_tasks
|
feat: language annotation pipeline (PR 2/3)
Adds the steerable annotation pipeline (`lerobot-annotate`) that populates
the `language_persistent` and `language_events` columns introduced in
PR 1 directly into `data/chunk-*/file-*.parquet`. No flavor namespace,
no sidecar tree.
Modules produced:
- Module 1 (plan_subtasks_memory): Pi0.7-style subtasks, plan (init +
refresh on interjection), MEM-style memory at subtask boundaries.
- Module 2 (interjections_and_speech): t=0 speech-only acknowledgement,
mid-episode paired interjection + speech tool-call atom.
- Module 3 (general_vqa): bbox/keypoint/count/attribute/spatial pairs at
configurable cadence with one-retry JSON validation.
Writer enforces: per-episode persistent identity, exact-frame event
timestamps, column routing per `column_for_style`, dataset-level `tools`
column with the `say` schema, drops legacy `subtask_index`. Validator
runs against staged JSONL artifacts before the writer rewrites parquet.
Adds `lerobot-annotate` console script, `annotations` extra (datatrove +
optional vllm), `make annotation-e2e` opt-in smoke target, and
`docs/source/annotation_pipeline.mdx`.
Branched from PR 1 (`feat/language-columns`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 16:22:51 +02:00
|
|
|
from lerobot.datasets.utils import DEFAULT_TASKS_PATH
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
@dataclass
|
|
|
|
|
class EpisodeRecord:
|
|
|
|
|
"""Per-episode record yielded by the reader."""
|
|
|
|
|
|
|
|
|
|
episode_index: int
|
|
|
|
|
episode_task: str
|
|
|
|
|
frame_timestamps: tuple[float, ...]
|
|
|
|
|
frame_indices: tuple[int, ...]
|
|
|
|
|
data_path: Path
|
|
|
|
|
row_offset: int # row offset within the parquet file where this episode starts
|
|
|
|
|
row_count: int # number of rows for this episode
|
|
|
|
|
|
review: fix dead-code bug, add thread safety, atomic writes, smaller cleanups
**Critical: video_for_episode was unreachable dead code.**
``video_for_episode`` was indented inside ``_decode_pyav_direct``, after
its ``return`` statement — Python parsed it as a nested function that
never executed. Module 1's ``_episode_video_block`` calls
``self.frame_provider.video_for_episode(record, target_count)`` on the
``use_video_url=False`` path, which would have AttributeError'd on any
real dataset. Tests passed only because they used ``_StubFrameProvider``
/ ``_NullProvider`` which have the method. Moved it to be a proper
method of ``VideoFrameProvider`` (right after ``frames_at``).
**Thread safety on VideoFrameProvider.**
The executor runs Module 1/2/3 phases under a ``ThreadPoolExecutor``, so
the per-instance ``_cache`` dict and the one-shot ``_warned_decode_fail``
flag were exposed to concurrent reads/writes. Added a ``threading.Lock``
field, wrapped cache reads/writes and the warn-flag check-and-set in
``with self._lock:``. Stub fixtures unaffected.
**episode_clip_path is now a method of VideoFrameProvider.**
Used to be a free function reaching into ``provider._meta.episodes`` and
``provider._meta.get_video_file_path`` from outside the class. As a
method it just uses ``self._meta``. The only caller (Module 1) updated;
no external callers.
**Atomic write in LanguageColumnsWriter.**
``pq.write_table(new_table, path)`` was overwriting the parquet shard
in place — a crash mid-write would corrupt the file. Now writes to a
sibling ``.tmp`` and ``Path.replace`` atomically.
**Smaller items:**
* ``executor.py`` docstring opened with "four phases" but listed six.
Now says "six phases" to match.
* ``[annotations]`` extra in ``pyproject.toml`` now includes
``openai>=1.40,<2.0``. Default ``VlmConfig.backend`` is ``"openai"``,
so without it ``_make_openai_client`` would ImportError on a fresh
``uv sync --extra annotations``.
* ``_snap_to_frame`` was duplicated identically in
``plan_subtasks_memory.py`` and ``interjections_and_speech.py``.
Promoted to ``snap_to_frame`` in ``reader.py`` (next to
``EpisodeRecord``); both modules now import it. Backwards-compat alias
not needed — no external callers.
* ``EpisodeRecord.frames_df()`` was re-reading the full parquet on every
call. Now memoizes via a private dataclass field so repeat calls from
different modules pay the cost once. Method signature unchanged.
* ``_extract_first_json_object`` had a redundant ``and not escape`` guard
that was dead because the prior block already handled and reset
``escape``. Replaced with a comment explaining the invariant.
**Pre-existing lint cleanups surfaced once these files entered
pre-commit's scope:**
* dead local ``client = clients[0]`` in ``_make_openai_client`` (the
real round-robin uses ``clients[rr_counter[...]]``).
* ``cmd = ... if "{port}" in cmd else f"...{port}"`` ternary collapse in
``_spawn_parallel_inference_servers``.
* ``seek_pts = 0 if stream.time_base is None else int(...)`` ternary
collapse in ``_decode_pyav_direct``.
* ``# nosec B310`` on the localhost ``urllib.request.urlopen`` probe in
``_server_is_up`` — the URL is the user-configured local-server endpoint
the CLI itself spawned, not arbitrary user input.
**Test added.**
``tests/annotations/test_frames.py`` pins the regression on
``VideoFrameProvider``: asserts ``video_for_episode`` and
``episode_clip_path`` are callable methods (not nested dead code or
free functions), and that the ``_lock`` field is a real
``threading.Lock``.
Sweep: 64 passed, 2 failed (same pre-existing module-impl bugs as
before this commit). Pre-commit clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 11:53:43 +02:00
|
|
|
# Memoized parquet slice — populated on first ``frames_df()`` call so
|
|
|
|
|
# repeat queries from different modules don't re-read the whole shard.
|
|
|
|
|
_frames_df_cache: Any = field(default=None, init=False, repr=False, compare=False)
|
feat: language annotation pipeline (PR 2/3)
Adds the steerable annotation pipeline (`lerobot-annotate`) that populates
the `language_persistent` and `language_events` columns introduced in
PR 1 directly into `data/chunk-*/file-*.parquet`. No flavor namespace,
no sidecar tree.
Modules produced:
- Module 1 (plan_subtasks_memory): Pi0.7-style subtasks, plan (init +
refresh on interjection), MEM-style memory at subtask boundaries.
- Module 2 (interjections_and_speech): t=0 speech-only acknowledgement,
mid-episode paired interjection + speech tool-call atom.
- Module 3 (general_vqa): bbox/keypoint/count/attribute/spatial pairs at
configurable cadence with one-retry JSON validation.
Writer enforces: per-episode persistent identity, exact-frame event
timestamps, column routing per `column_for_style`, dataset-level `tools`
column with the `say` schema, drops legacy `subtask_index`. Validator
runs against staged JSONL artifacts before the writer rewrites parquet.
Adds `lerobot-annotate` console script, `annotations` extra (datatrove +
optional vllm), `make annotation-e2e` opt-in smoke target, and
`docs/source/annotation_pipeline.mdx`.
Branched from PR 1 (`feat/language-columns`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 16:22:51 +02:00
|
|
|
|
review: fix dead-code bug, add thread safety, atomic writes, smaller cleanups
**Critical: video_for_episode was unreachable dead code.**
``video_for_episode`` was indented inside ``_decode_pyav_direct``, after
its ``return`` statement — Python parsed it as a nested function that
never executed. Module 1's ``_episode_video_block`` calls
``self.frame_provider.video_for_episode(record, target_count)`` on the
``use_video_url=False`` path, which would have AttributeError'd on any
real dataset. Tests passed only because they used ``_StubFrameProvider``
/ ``_NullProvider`` which have the method. Moved it to be a proper
method of ``VideoFrameProvider`` (right after ``frames_at``).
**Thread safety on VideoFrameProvider.**
The executor runs Module 1/2/3 phases under a ``ThreadPoolExecutor``, so
the per-instance ``_cache`` dict and the one-shot ``_warned_decode_fail``
flag were exposed to concurrent reads/writes. Added a ``threading.Lock``
field, wrapped cache reads/writes and the warn-flag check-and-set in
``with self._lock:``. Stub fixtures unaffected.
**episode_clip_path is now a method of VideoFrameProvider.**
Used to be a free function reaching into ``provider._meta.episodes`` and
``provider._meta.get_video_file_path`` from outside the class. As a
method it just uses ``self._meta``. The only caller (Module 1) updated;
no external callers.
**Atomic write in LanguageColumnsWriter.**
``pq.write_table(new_table, path)`` was overwriting the parquet shard
in place — a crash mid-write would corrupt the file. Now writes to a
sibling ``.tmp`` and ``Path.replace`` atomically.
**Smaller items:**
* ``executor.py`` docstring opened with "four phases" but listed six.
Now says "six phases" to match.
* ``[annotations]`` extra in ``pyproject.toml`` now includes
``openai>=1.40,<2.0``. Default ``VlmConfig.backend`` is ``"openai"``,
so without it ``_make_openai_client`` would ImportError on a fresh
``uv sync --extra annotations``.
* ``_snap_to_frame`` was duplicated identically in
``plan_subtasks_memory.py`` and ``interjections_and_speech.py``.
Promoted to ``snap_to_frame`` in ``reader.py`` (next to
``EpisodeRecord``); both modules now import it. Backwards-compat alias
not needed — no external callers.
* ``EpisodeRecord.frames_df()`` was re-reading the full parquet on every
call. Now memoizes via a private dataclass field so repeat calls from
different modules pay the cost once. Method signature unchanged.
* ``_extract_first_json_object`` had a redundant ``and not escape`` guard
that was dead because the prior block already handled and reset
``escape``. Replaced with a comment explaining the invariant.
**Pre-existing lint cleanups surfaced once these files entered
pre-commit's scope:**
* dead local ``client = clients[0]`` in ``_make_openai_client`` (the
real round-robin uses ``clients[rr_counter[...]]``).
* ``cmd = ... if "{port}" in cmd else f"...{port}"`` ternary collapse in
``_spawn_parallel_inference_servers``.
* ``seek_pts = 0 if stream.time_base is None else int(...)`` ternary
collapse in ``_decode_pyav_direct``.
* ``# nosec B310`` on the localhost ``urllib.request.urlopen`` probe in
``_server_is_up`` — the URL is the user-configured local-server endpoint
the CLI itself spawned, not arbitrary user input.
**Test added.**
``tests/annotations/test_frames.py`` pins the regression on
``VideoFrameProvider``: asserts ``video_for_episode`` and
``episode_clip_path`` are callable methods (not nested dead code or
free functions), and that the ``_lock`` field is a real
``threading.Lock``.
Sweep: 64 passed, 2 failed (same pre-existing module-impl bugs as
before this commit). Pre-commit clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 11:53:43 +02:00
|
|
|
def frames_df(self): # type: ignore[no-untyped-def]
|
|
|
|
|
"""Lazy-load the pandas slice for this episode (memoized)."""
|
|
|
|
|
if self._frames_df_cache is None:
|
|
|
|
|
import pandas as pd # noqa: PLC0415 - deferred for optional dataset extra
|
|
|
|
|
|
|
|
|
|
table = pq.read_table(self.data_path)
|
|
|
|
|
df: pd.DataFrame = table.to_pandas()
|
|
|
|
|
self._frames_df_cache = df.iloc[self.row_offset : self.row_offset + self.row_count].reset_index(
|
|
|
|
|
drop=True
|
|
|
|
|
)
|
|
|
|
|
return self._frames_df_cache
|
|
|
|
|
|
|
|
|
|
|
2026-05-08 12:18:09 +02:00
|
|
|
def reconstruct_subtask_spans(
|
|
|
|
|
rows: Sequence[dict[str, Any]],
|
|
|
|
|
*,
|
|
|
|
|
episode_end_t: float | None = None,
|
|
|
|
|
) -> list[dict[str, Any]]:
|
|
|
|
|
"""Turn ``style="subtask"`` rows into ``{text, start, end}`` spans.
|
|
|
|
|
|
|
|
|
|
Each span's ``end`` is the next span's ``start``. The final span's
|
|
|
|
|
``end`` defaults to its own ``start`` (zero-duration) — pass
|
|
|
|
|
``episode_end_t`` to extend it to the episode's last frame instead,
|
|
|
|
|
which is what downstream consumers (memory, interjection boundary
|
|
|
|
|
selection) expect.
|
|
|
|
|
|
2026-05-18 12:03:25 +02:00
|
|
|
Used by the ``plan`` module (plan-update pass) and the
|
|
|
|
|
``interjections`` module (interjection anchoring), which both need the
|
|
|
|
|
same span shape.
|
2026-05-08 12:18:09 +02:00
|
|
|
"""
|
|
|
|
|
sorted_rows = sorted(
|
|
|
|
|
(r for r in rows if r.get("style") == "subtask"),
|
|
|
|
|
key=lambda r: float(r["timestamp"]),
|
|
|
|
|
)
|
|
|
|
|
spans: list[dict[str, Any]] = []
|
|
|
|
|
for r in sorted_rows:
|
|
|
|
|
t = float(r["timestamp"])
|
|
|
|
|
if spans:
|
|
|
|
|
spans[-1]["end"] = t
|
|
|
|
|
spans.append({"text": r.get("content") or "", "start": t, "end": t})
|
|
|
|
|
if spans and episode_end_t is not None and float(episode_end_t) > spans[-1]["start"]:
|
|
|
|
|
spans[-1]["end"] = float(episode_end_t)
|
|
|
|
|
return spans
|
|
|
|
|
|
|
|
|
|
|
review: fix dead-code bug, add thread safety, atomic writes, smaller cleanups
**Critical: video_for_episode was unreachable dead code.**
``video_for_episode`` was indented inside ``_decode_pyav_direct``, after
its ``return`` statement — Python parsed it as a nested function that
never executed. Module 1's ``_episode_video_block`` calls
``self.frame_provider.video_for_episode(record, target_count)`` on the
``use_video_url=False`` path, which would have AttributeError'd on any
real dataset. Tests passed only because they used ``_StubFrameProvider``
/ ``_NullProvider`` which have the method. Moved it to be a proper
method of ``VideoFrameProvider`` (right after ``frames_at``).
**Thread safety on VideoFrameProvider.**
The executor runs Module 1/2/3 phases under a ``ThreadPoolExecutor``, so
the per-instance ``_cache`` dict and the one-shot ``_warned_decode_fail``
flag were exposed to concurrent reads/writes. Added a ``threading.Lock``
field, wrapped cache reads/writes and the warn-flag check-and-set in
``with self._lock:``. Stub fixtures unaffected.
**episode_clip_path is now a method of VideoFrameProvider.**
Used to be a free function reaching into ``provider._meta.episodes`` and
``provider._meta.get_video_file_path`` from outside the class. As a
method it just uses ``self._meta``. The only caller (Module 1) updated;
no external callers.
**Atomic write in LanguageColumnsWriter.**
``pq.write_table(new_table, path)`` was overwriting the parquet shard
in place — a crash mid-write would corrupt the file. Now writes to a
sibling ``.tmp`` and ``Path.replace`` atomically.
**Smaller items:**
* ``executor.py`` docstring opened with "four phases" but listed six.
Now says "six phases" to match.
* ``[annotations]`` extra in ``pyproject.toml`` now includes
``openai>=1.40,<2.0``. Default ``VlmConfig.backend`` is ``"openai"``,
so without it ``_make_openai_client`` would ImportError on a fresh
``uv sync --extra annotations``.
* ``_snap_to_frame`` was duplicated identically in
``plan_subtasks_memory.py`` and ``interjections_and_speech.py``.
Promoted to ``snap_to_frame`` in ``reader.py`` (next to
``EpisodeRecord``); both modules now import it. Backwards-compat alias
not needed — no external callers.
* ``EpisodeRecord.frames_df()`` was re-reading the full parquet on every
call. Now memoizes via a private dataclass field so repeat calls from
different modules pay the cost once. Method signature unchanged.
* ``_extract_first_json_object`` had a redundant ``and not escape`` guard
that was dead because the prior block already handled and reset
``escape``. Replaced with a comment explaining the invariant.
**Pre-existing lint cleanups surfaced once these files entered
pre-commit's scope:**
* dead local ``client = clients[0]`` in ``_make_openai_client`` (the
real round-robin uses ``clients[rr_counter[...]]``).
* ``cmd = ... if "{port}" in cmd else f"...{port}"`` ternary collapse in
``_spawn_parallel_inference_servers``.
* ``seek_pts = 0 if stream.time_base is None else int(...)`` ternary
collapse in ``_decode_pyav_direct``.
* ``# nosec B310`` on the localhost ``urllib.request.urlopen`` probe in
``_server_is_up`` — the URL is the user-configured local-server endpoint
the CLI itself spawned, not arbitrary user input.
**Test added.**
``tests/annotations/test_frames.py`` pins the regression on
``VideoFrameProvider``: asserts ``video_for_episode`` and
``episode_clip_path`` are callable methods (not nested dead code or
free functions), and that the ``_lock`` field is a real
``threading.Lock``.
Sweep: 64 passed, 2 failed (same pre-existing module-impl bugs as
before this commit). Pre-commit clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 11:53:43 +02:00
|
|
|
def snap_to_frame(t: float, frame_timestamps: Sequence[float]) -> float:
|
|
|
|
|
"""Snap an arbitrary float to the nearest exact source frame timestamp.
|
|
|
|
|
|
|
|
|
|
Modules use this when emitting event-style rows so the row's
|
2026-05-18 12:03:25 +02:00
|
|
|
timestamp matches a real parquet frame: event rows must land on an
|
|
|
|
|
exact frame, otherwise the per-frame event lookup the writer does
|
|
|
|
|
would never match them.
|
review: fix dead-code bug, add thread safety, atomic writes, smaller cleanups
**Critical: video_for_episode was unreachable dead code.**
``video_for_episode`` was indented inside ``_decode_pyav_direct``, after
its ``return`` statement — Python parsed it as a nested function that
never executed. Module 1's ``_episode_video_block`` calls
``self.frame_provider.video_for_episode(record, target_count)`` on the
``use_video_url=False`` path, which would have AttributeError'd on any
real dataset. Tests passed only because they used ``_StubFrameProvider``
/ ``_NullProvider`` which have the method. Moved it to be a proper
method of ``VideoFrameProvider`` (right after ``frames_at``).
**Thread safety on VideoFrameProvider.**
The executor runs Module 1/2/3 phases under a ``ThreadPoolExecutor``, so
the per-instance ``_cache`` dict and the one-shot ``_warned_decode_fail``
flag were exposed to concurrent reads/writes. Added a ``threading.Lock``
field, wrapped cache reads/writes and the warn-flag check-and-set in
``with self._lock:``. Stub fixtures unaffected.
**episode_clip_path is now a method of VideoFrameProvider.**
Used to be a free function reaching into ``provider._meta.episodes`` and
``provider._meta.get_video_file_path`` from outside the class. As a
method it just uses ``self._meta``. The only caller (Module 1) updated;
no external callers.
**Atomic write in LanguageColumnsWriter.**
``pq.write_table(new_table, path)`` was overwriting the parquet shard
in place — a crash mid-write would corrupt the file. Now writes to a
sibling ``.tmp`` and ``Path.replace`` atomically.
**Smaller items:**
* ``executor.py`` docstring opened with "four phases" but listed six.
Now says "six phases" to match.
* ``[annotations]`` extra in ``pyproject.toml`` now includes
``openai>=1.40,<2.0``. Default ``VlmConfig.backend`` is ``"openai"``,
so without it ``_make_openai_client`` would ImportError on a fresh
``uv sync --extra annotations``.
* ``_snap_to_frame`` was duplicated identically in
``plan_subtasks_memory.py`` and ``interjections_and_speech.py``.
Promoted to ``snap_to_frame`` in ``reader.py`` (next to
``EpisodeRecord``); both modules now import it. Backwards-compat alias
not needed — no external callers.
* ``EpisodeRecord.frames_df()`` was re-reading the full parquet on every
call. Now memoizes via a private dataclass field so repeat calls from
different modules pay the cost once. Method signature unchanged.
* ``_extract_first_json_object`` had a redundant ``and not escape`` guard
that was dead because the prior block already handled and reset
``escape``. Replaced with a comment explaining the invariant.
**Pre-existing lint cleanups surfaced once these files entered
pre-commit's scope:**
* dead local ``client = clients[0]`` in ``_make_openai_client`` (the
real round-robin uses ``clients[rr_counter[...]]``).
* ``cmd = ... if "{port}" in cmd else f"...{port}"`` ternary collapse in
``_spawn_parallel_inference_servers``.
* ``seek_pts = 0 if stream.time_base is None else int(...)`` ternary
collapse in ``_decode_pyav_direct``.
* ``# nosec B310`` on the localhost ``urllib.request.urlopen`` probe in
``_server_is_up`` — the URL is the user-configured local-server endpoint
the CLI itself spawned, not arbitrary user input.
**Test added.**
``tests/annotations/test_frames.py`` pins the regression on
``VideoFrameProvider``: asserts ``video_for_episode`` and
``episode_clip_path`` are callable methods (not nested dead code or
free functions), and that the ``_lock`` field is a real
``threading.Lock``.
Sweep: 64 passed, 2 failed (same pre-existing module-impl bugs as
before this commit). Pre-commit clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 11:53:43 +02:00
|
|
|
"""
|
|
|
|
|
if not frame_timestamps:
|
|
|
|
|
return float(t)
|
|
|
|
|
nearest = min(frame_timestamps, key=lambda f: abs(f - t))
|
|
|
|
|
return float(nearest)
|
feat: language annotation pipeline (PR 2/3)
Adds the steerable annotation pipeline (`lerobot-annotate`) that populates
the `language_persistent` and `language_events` columns introduced in
PR 1 directly into `data/chunk-*/file-*.parquet`. No flavor namespace,
no sidecar tree.
Modules produced:
- Module 1 (plan_subtasks_memory): Pi0.7-style subtasks, plan (init +
refresh on interjection), MEM-style memory at subtask boundaries.
- Module 2 (interjections_and_speech): t=0 speech-only acknowledgement,
mid-episode paired interjection + speech tool-call atom.
- Module 3 (general_vqa): bbox/keypoint/count/attribute/spatial pairs at
configurable cadence with one-retry JSON validation.
Writer enforces: per-episode persistent identity, exact-frame event
timestamps, column routing per `column_for_style`, dataset-level `tools`
column with the `say` schema, drops legacy `subtask_index`. Validator
runs against staged JSONL artifacts before the writer rewrites parquet.
Adds `lerobot-annotate` console script, `annotations` extra (datatrove +
optional vllm), `make annotation-e2e` opt-in smoke target, and
`docs/source/annotation_pipeline.mdx`.
Branched from PR 1 (`feat/language-columns`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 16:22:51 +02:00
|
|
|
|
|
|
|
|
|
|
|
|
|
def _load_tasks_lookup(root: Path) -> dict[int, str]:
|
2026-05-18 12:03:25 +02:00
|
|
|
"""Map ``task_index -> task`` from ``meta/tasks.parquet``.
|
|
|
|
|
|
|
|
|
|
Returns an empty dict when the file is absent — the task description is
|
|
|
|
|
derived later from the video if needed. Reuses the library-level
|
|
|
|
|
:func:`lerobot.datasets.io_utils.load_tasks`, which returns the tasks
|
|
|
|
|
frame indexed by task string with a ``task_index`` column.
|
|
|
|
|
"""
|
|
|
|
|
if not (root / DEFAULT_TASKS_PATH).exists():
|
feat: language annotation pipeline (PR 2/3)
Adds the steerable annotation pipeline (`lerobot-annotate`) that populates
the `language_persistent` and `language_events` columns introduced in
PR 1 directly into `data/chunk-*/file-*.parquet`. No flavor namespace,
no sidecar tree.
Modules produced:
- Module 1 (plan_subtasks_memory): Pi0.7-style subtasks, plan (init +
refresh on interjection), MEM-style memory at subtask boundaries.
- Module 2 (interjections_and_speech): t=0 speech-only acknowledgement,
mid-episode paired interjection + speech tool-call atom.
- Module 3 (general_vqa): bbox/keypoint/count/attribute/spatial pairs at
configurable cadence with one-retry JSON validation.
Writer enforces: per-episode persistent identity, exact-frame event
timestamps, column routing per `column_for_style`, dataset-level `tools`
column with the `say` schema, drops legacy `subtask_index`. Validator
runs against staged JSONL artifacts before the writer rewrites parquet.
Adds `lerobot-annotate` console script, `annotations` extra (datatrove +
optional vllm), `make annotation-e2e` opt-in smoke target, and
`docs/source/annotation_pipeline.mdx`.
Branched from PR 1 (`feat/language-columns`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 16:22:51 +02:00
|
|
|
return {}
|
2026-05-18 12:03:25 +02:00
|
|
|
tasks = load_tasks(root)
|
|
|
|
|
return {int(idx): str(task) for task, idx in zip(tasks.index, tasks["task_index"], strict=True)}
|
feat: language annotation pipeline (PR 2/3)
Adds the steerable annotation pipeline (`lerobot-annotate`) that populates
the `language_persistent` and `language_events` columns introduced in
PR 1 directly into `data/chunk-*/file-*.parquet`. No flavor namespace,
no sidecar tree.
Modules produced:
- Module 1 (plan_subtasks_memory): Pi0.7-style subtasks, plan (init +
refresh on interjection), MEM-style memory at subtask boundaries.
- Module 2 (interjections_and_speech): t=0 speech-only acknowledgement,
mid-episode paired interjection + speech tool-call atom.
- Module 3 (general_vqa): bbox/keypoint/count/attribute/spatial pairs at
configurable cadence with one-retry JSON validation.
Writer enforces: per-episode persistent identity, exact-frame event
timestamps, column routing per `column_for_style`, dataset-level `tools`
column with the `say` schema, drops legacy `subtask_index`. Validator
runs against staged JSONL artifacts before the writer rewrites parquet.
Adds `lerobot-annotate` console script, `annotations` extra (datatrove +
optional vllm), `make annotation-e2e` opt-in smoke target, and
`docs/source/annotation_pipeline.mdx`.
Branched from PR 1 (`feat/language-columns`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 16:22:51 +02:00
|
|
|
|
|
|
|
|
|
|
|
|
|
def iter_episodes(root: Path, *, only_episodes: tuple[int, ...] | None = None) -> Iterator[EpisodeRecord]:
|
|
|
|
|
"""Yield :class:`EpisodeRecord` for every episode under ``root/data/``.
|
|
|
|
|
|
|
|
|
|
Episodes are yielded in ascending ``episode_index`` order. The reader does
|
|
|
|
|
not assume a specific chunk/file layout: it scans every ``*.parquet``
|
|
|
|
|
under ``data/`` and groups by ``episode_index``.
|
|
|
|
|
"""
|
|
|
|
|
tasks = _load_tasks_lookup(root)
|
|
|
|
|
data_dir = root / "data"
|
|
|
|
|
parquet_files = sorted(data_dir.rglob("*.parquet"))
|
|
|
|
|
|
|
|
|
|
only_set = set(only_episodes) if only_episodes is not None else None
|
|
|
|
|
|
|
|
|
|
for path in parquet_files:
|
|
|
|
|
yield from _iter_one_path(path, tasks, only_set)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
def _iter_one_path(path: Path, tasks: dict[int, str], only_set: set[int] | None) -> Iterator[EpisodeRecord]:
|
|
|
|
|
table = pq.read_table(path)
|
|
|
|
|
names = table.column_names
|
|
|
|
|
if "episode_index" not in names:
|
|
|
|
|
return
|
|
|
|
|
episode_col = table.column("episode_index").to_pylist()
|
|
|
|
|
timestamp_col = (
|
|
|
|
|
table.column("timestamp").to_pylist() if "timestamp" in names else [0.0] * len(episode_col)
|
|
|
|
|
)
|
|
|
|
|
frame_col = (
|
|
|
|
|
table.column("frame_index").to_pylist() if "frame_index" in names else list(range(len(episode_col)))
|
|
|
|
|
)
|
|
|
|
|
task_col = table.column("task_index").to_pylist() if "task_index" in names else None
|
|
|
|
|
|
|
|
|
|
def _build(
|
|
|
|
|
ep: int,
|
|
|
|
|
start: int,
|
|
|
|
|
end: int,
|
|
|
|
|
task_idx: int | None,
|
|
|
|
|
ts_buf: list[float],
|
|
|
|
|
fi_buf: list[int],
|
|
|
|
|
) -> EpisodeRecord | None:
|
|
|
|
|
if only_set is not None and ep not in only_set:
|
|
|
|
|
return None
|
|
|
|
|
task = tasks.get(task_idx, "") if task_idx is not None else ""
|
|
|
|
|
return EpisodeRecord(
|
|
|
|
|
episode_index=ep,
|
|
|
|
|
episode_task=task,
|
|
|
|
|
frame_timestamps=tuple(ts_buf),
|
|
|
|
|
frame_indices=tuple(fi_buf),
|
|
|
|
|
data_path=path,
|
|
|
|
|
row_offset=start,
|
|
|
|
|
row_count=end - start,
|
|
|
|
|
)
|
|
|
|
|
|
|
|
|
|
cur_ep: int | None = None
|
|
|
|
|
start_offset = 0
|
|
|
|
|
ts_buf: list[float] = []
|
|
|
|
|
fi_buf: list[int] = []
|
|
|
|
|
cur_task_idx: int | None = None
|
|
|
|
|
|
|
|
|
|
for i, ep in enumerate(episode_col):
|
|
|
|
|
if cur_ep is None:
|
|
|
|
|
cur_ep = ep
|
|
|
|
|
start_offset = i
|
|
|
|
|
ts_buf = [timestamp_col[i]]
|
|
|
|
|
fi_buf = [frame_col[i]]
|
|
|
|
|
cur_task_idx = task_col[i] if task_col is not None else None
|
|
|
|
|
continue
|
|
|
|
|
if ep != cur_ep:
|
|
|
|
|
rec = _build(cur_ep, start_offset, i, cur_task_idx, ts_buf, fi_buf)
|
|
|
|
|
if rec is not None:
|
|
|
|
|
yield rec
|
|
|
|
|
cur_ep = ep
|
|
|
|
|
start_offset = i
|
|
|
|
|
ts_buf = [timestamp_col[i]]
|
|
|
|
|
fi_buf = [frame_col[i]]
|
|
|
|
|
cur_task_idx = task_col[i] if task_col is not None else None
|
|
|
|
|
else:
|
|
|
|
|
ts_buf.append(timestamp_col[i])
|
|
|
|
|
fi_buf.append(frame_col[i])
|
|
|
|
|
|
|
|
|
|
if cur_ep is not None:
|
|
|
|
|
rec = _build(cur_ep, start_offset, len(episode_col), cur_task_idx, ts_buf, fi_buf)
|
|
|
|
|
if rec is not None:
|
|
|
|
|
yield rec
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
def gather_data_paths(root: Path) -> list[Path]:
|
|
|
|
|
"""Return every ``data/chunk-*/file-*.parquet`` path under ``root``."""
|
|
|
|
|
return sorted((root / "data").rglob("*.parquet"))
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
def episode_offsets_per_path(path: Path) -> dict[int, tuple[int, int]]:
|
|
|
|
|
"""Return ``{episode_index: (row_offset, row_count)}`` for one parquet."""
|
|
|
|
|
table = pq.read_table(path, columns=["episode_index"])
|
|
|
|
|
episode_col = table.column("episode_index").to_pylist()
|
|
|
|
|
out: dict[int, tuple[int, int]] = {}
|
|
|
|
|
cur_ep: int | None = None
|
|
|
|
|
start = 0
|
|
|
|
|
for i, ep in enumerate(episode_col):
|
|
|
|
|
if cur_ep is None:
|
|
|
|
|
cur_ep = ep
|
|
|
|
|
start = i
|
|
|
|
|
continue
|
|
|
|
|
if ep != cur_ep:
|
|
|
|
|
out[cur_ep] = (start, i - start)
|
|
|
|
|
cur_ep = ep
|
|
|
|
|
start = i
|
|
|
|
|
if cur_ep is not None:
|
|
|
|
|
out[cur_ep] = (start, len(episode_col) - start)
|
|
|
|
|
return out
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
def keyframe_indices(record: EpisodeRecord, k: int) -> list[int]:
|
|
|
|
|
"""Return ``k`` evenly spaced row indices into the episode (relative)."""
|
|
|
|
|
n = record.row_count
|
|
|
|
|
if k <= 0 or n == 0:
|
|
|
|
|
return []
|
|
|
|
|
if k >= n:
|
|
|
|
|
return list(range(n))
|
|
|
|
|
step = (n - 1) / (k - 1) if k > 1 else 0.0
|
|
|
|
|
return [int(round(i * step)) for i in range(k)] if k > 1 else [n // 2]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
def lookup_data_path(root: Path, episode_index: int) -> tuple[Path, int, int] | None:
|
|
|
|
|
"""Find the parquet file containing ``episode_index`` and its slice bounds."""
|
|
|
|
|
for path in gather_data_paths(root):
|
|
|
|
|
offsets = episode_offsets_per_path(path)
|
|
|
|
|
if episode_index in offsets:
|
|
|
|
|
start, count = offsets[episode_index]
|
|
|
|
|
return path, start, count
|
|
|
|
|
return None
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
def episode_frame_timestamps(root: Path, episode_index: int) -> tuple[Any, list[float]]:
|
|
|
|
|
"""Return the parquet path and per-frame timestamps for ``episode_index``."""
|
|
|
|
|
found = lookup_data_path(root, episode_index)
|
|
|
|
|
if found is None:
|
|
|
|
|
raise ValueError(f"Episode {episode_index} not found under {root}/data/")
|
|
|
|
|
path, start, count = found
|
|
|
|
|
table = pq.read_table(path, columns=["timestamp"])
|
|
|
|
|
timestamps = table.column("timestamp").to_pylist()[start : start + count]
|
|
|
|
|
return path, [float(t) for t in timestamps]
|