Tighten the remaining multi-line comment blocks in config.py (derive_task,
frames/window, describe_first, action-record/vqa/vlm fields, video_backend,
repo ids, executor) to 1-3 lines each. Also fix a stale path typo
('examples/annotation' -> the docstring now just says HF Jobs). Comments
only — no field or behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two behavior-preserving simplifications:
* plan_subtasks_memory.run_episode: the task_aug 'axes' and free-form
branches built identical deduped rows via copy-pasted seen/append
loops. Collapse to one branch that picks the variant source, then a
shared _task_aug_rows() helper does the dedup + row build (-~25 LOC).
* writer: _normalize_persistent_row / _normalize_event_row shared the
same camera-validate + struct construction. Extract _normalize_row(),
keeping the exact key order (the parquet struct schema is inferred
from insertion order, so timestamp must stay between style and camera).
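A minimal sketch of the shared dedup + row-build step described in the first bullet. The function name, the row keys, and the case-insensitive comparison are assumptions for illustration; the real _task_aug_rows() helper may differ.

```python
from typing import Iterable


def task_aug_rows(variants: Iterable[str]) -> list[dict]:
    """Deduplicate task variants and build one task_aug row per unique variant.

    Hypothetical stand-in for the shared helper: both the 'axes' and the
    free-form branch would hand their variant list to a function like this.
    """
    seen: set[str] = set()
    rows: list[dict] = []
    for text in variants:
        cleaned = text.strip()
        key = cleaned.lower()  # assumption: dedup ignores case and outer whitespace
        if not cleaned or key in seen:
            continue
        seen.add(key)
        # Row keys are illustrative; task_aug rows are emitted at t=0.
        rows.append({"style": "task_aug", "timestamp": 0.0, "content": cleaned})
    return rows
```

The caller only decides where the variants come from; the loop that was previously copy-pasted lives in one place.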
docs: 'Which modules run' is now a table giving each module's on/off flag
(--plan.enabled / --interjections.enabled / --vqa.enabled) and what it
turns off.
Verified: 40 tests pass (incl. test_writer struct round-trip); pre-commit clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pyproject annotations-extra comment still described the removed
vllm/transformers in-process backends ('vllm preferred ... transformers
fallback', '_make_vllm_client'); rewrite it for the openai-only reality
and trim it. Also condense the conftest lazy-import NOTE. Comments only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Dead code (defined but never referenced anywhere in src/tests/examples):
* reader.py: keyframe_indices, episode_frame_timestamps, lookup_data_path,
and the now-orphaned gather_data_paths + episode_offsets_per_path
(lookup_data_path was their only caller).
* staging.py: iter_staged_episodes.
* writer.py: normalize_rows_for_writer.
* config.py VlmConfig: json_mode, batch_size, tensor_parallel_size,
gpu_memory_utilization, trust_remote_code — consumed only by the
in-process vllm/transformers backends that were removed; the openai
auto-serve path carries those vLLM flags via serve_command instead.
Kept max_model_len (still used as the serve-command default).
* config.py TaskAugAxesConfig.total property.
Docs: new 'Key options' section in annotation_pipeline.mdx — grouped
tables (dataset in/out, module toggles, --vlm.*, --plan.*, interjections
+ vqa) describing the flags users actually reach for, with defaults.
config.py: compact the verbose field comments + ActionRecordsConfig /
TaskAugAxesConfig docstrings; fix two stale 'verify' references (the
verify pass was removed — it's describe -> segment now) and the stale
'renders record back to subtask text' note (that path was removed).
vlm_client docstring no longer mentions the removed json_mode field.
Verified: tests/annotations + tests/datasets/test_language +
tests/scripts/test_lerobot_annotate (40 passed); pre-commit clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trim the long inline comment blocks (effective_task / task_aug, action
records, plan-boundary rows, plan-update span closing, windowed +
coverage-stitch sections) and the _generate_plan / run_plan_updates
docstrings to a few lines each. No behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same flags and rationale, condensed — each plan-module flag now has a
short one/two-line comment instead of a paragraph.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Top-down flow (read episodes → 3 modules fan out → validator → writer →
parquet) with aligned boxes, instead of the cramped bordered version.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rewrite annotation_pipeline.mdx in plainer, easier-to-read language
(shorter sentences, active voice, a plain-text intro), add an ASCII
'How it fits together' architecture diagram, and remove the
'Reproducibility via seed and prompt hashes' section. Content/links are
preserved; only wording and structure change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The example already pins '@main'; update the doc step and the script
docstring from 'the branch under test' to 'lerobot (from main)' now that
the pipeline is merging to main.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bugs
* validator: don't re-raise on unknown style. The second column_for_style
lookup (used to route persistent vs event) now sits in try/except so an
unknown style is recorded by _check_column_routing and skipped instead
of crashing the whole validation pass.
* general_vqa._target_cameras: when restrict_to_default_camera is set but
the configured camera_key isn't one the provider exposes, warn and fall
back to all cameras instead of returning a phantom key that KeyErrors
deep in frame decode.
* interjections: clamp interjection timestamps to frame_timestamps[0]
rather than a hardcoded 0.0 (datasets can start at non-zero t).
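The validator fix above can be sketched as follows. This is an illustration under assumptions, not the actual validator: the style sets, the column names, and route_rows() are hypothetical, and only the try/except shape mirrors the described change.

```python
# Hypothetical style sets; the real registry lives in lerobot.datasets.language.
PERSISTENT_STYLES = {"subtask", "action_record"}
EVENT_ONLY_STYLES = {"interjection"}


def column_for_style(style: str) -> str:
    """Map a style to its column; raise on anything unknown."""
    if style in PERSISTENT_STYLES:
        return "persistent"
    if style in EVENT_ONLY_STYLES:
        return "events"
    raise ValueError(f"unknown style: {style!r}")


def route_rows(rows: list[dict]) -> tuple[dict, list[str]]:
    """Route rows by style, recording unknown styles instead of crashing."""
    routed: dict[str, list[dict]] = {"persistent": [], "events": []}
    errors: list[str] = []
    for index, row in enumerate(rows):
        try:
            column = column_for_style(row["style"])
        except ValueError as exc:
            # Record the problem and skip the row; the pass keeps going.
            errors.append(f"row {index}: {exc}")
            continue
        routed[column].append(row)
    return routed, errors
```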
Docs / code drift
* annotation_pipeline.mdx: drop the phantom 'vocabulary discovery / phase
0 / --vocabulary.* / canonical_vocabulary.json' section (none of it
exists); describe the real describe->segment + coverage-stitch flow.
Soften the src/lerobot/tools/ + TOOL_REGISTRY reference to 'not part of
this PR' (matches tools.mdx, which already marks the runtime layer as
not-yet-implemented). Fix the --push_to_hub/--new_repo_id wording. Note
the default is now a single h200. Add a 'Contributing new modules'
section inviting module / prompt / quality contributions.
* executor docstring: six phases, no phantom phase 0.
run_hf_job.py
* add the Apache 2.0 license header (was flagged repeatedly).
* default to a single GPU: flavor=h200, parallel_servers=1, num_gpus=1
(scale to h200x4 noted in the docstring).
* pin the install to @main instead of the feature branch (won't break
after merge).
Naming / cleanup
* rename dest_repo_id -> new_repo_id across config / script / example /
test to match the LeRobot dataset edit tools.
* rename prompt templates module_N_*.txt -> descriptive (plan_*,
interjections_*, vqa.txt) and update every load_prompt() call.
* remove dead _messages_to_prompt (used only by the removed in-process
backends).
* declare _warned_decode_fail (frames) and _warned_no_camera (vqa) as
real init=False dataclass fields instead of getattr monkey-patches.
* scope bandit B607 to the two ffmpeg subprocess.run sites via
'# nosec B607' and drop it from the global skip list.
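For the _warned_* change above, a short sketch of the dataclass pattern. The class and method names are hypothetical; only the init=False field declaration reflects what the bullet describes.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)


@dataclass
class FrameSource:
    """Hypothetical provider with a warn-once flag declared as a real field."""

    camera_key: str
    # init=False keeps the flag out of __init__, so callers cannot set it,
    # while type checkers and readers can still see that it exists.
    _warned_decode_fail: bool = field(default=False, init=False, repr=False)

    def report_decode_failure(self, reason: str) -> bool:
        """Log the first failure only; return True if a warning was emitted."""
        if self._warned_decode_fail:
            return False
        logger.warning("frame decode failed for %s: %s", self.camera_key, reason)
        self._warned_decode_fail = True
        return True
```

Compared with getattr(self, "_warned_decode_fail", False), the flag has a declared type and default, and it does not leak into the constructor signature.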
Tests
* fix stale canned-VLM markers ('ONE realistic interruption' ->
'compact interjection', 'Update the memory' -> 'compressed semantic
memory') and drop the dead 'concise hierarchical PLAN' plan responders
(plan generation is deterministic now) in run_e2e_smoke,
test_pipeline_recipe_render, test_modules.
* run_e2e_smoke now asserts interjection + speech rows are produced so a
stale marker can't silently pass again.
* drop remaining 'PR 1' / 'PR 2' references from test comments / names.
Verified: tests/annotations + tests/datasets/test_language +
tests/scripts/test_lerobot_annotate (31 passed); make-style E2E smoke
(interjections=1 speech_atoms=2); pre-commit (ruff, mypy, bandit,
prettier) clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
action_record is in PERSISTENT_STYLES but was missing from CORE_STYLES,
so STYLE_REGISTRY (= CORE_STYLES | EXTENDED_STYLES) didn't contain it and
the PERSISTENT_STYLES | EVENT_ONLY_STYLES <= STYLE_REGISTRY invariant in
test_style_registry_routes_columns failed. Add it to CORE_STYLES so the
registry, the persistent-set, and column_for_style() stay consistent.
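The invariant can be illustrated with plain sets. Apart from 'action_record', the style names and their grouping below are assumptions chosen for the example, not the actual contents of the registry.

```python
def registry_is_consistent(
    core: frozenset, extended: frozenset, persistent: frozenset, event_only: frozenset
) -> bool:
    """Every routable style must be registered: P | E <= CORE | EXTENDED."""
    return (persistent | event_only) <= (core | extended)


# Hypothetical memberships, for illustration only.
EXTENDED = frozenset({"vqa"})
PERSISTENT = frozenset({"subtask", "plan", "action_record"})
EVENT_ONLY = frozenset({"interjection"})

core_before = frozenset({"subtask", "plan", "interjection"})  # action_record missing
core_after = core_before | {"action_record"}                  # the fix
```

With core_before the check fails because 'action_record' is routable but unregistered; with core_after it holds.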
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The shipped workflow is Hugging Face Jobs
(examples/annotations/run_hf_job.py): it serves the model with vLLM in the
vllm/vllm-openai image and the pipeline talks to it over the
OpenAI-compatible API. The in-process vllm / transformers local backends
added maintenance surface (and the vllm one pinned an old torch) without
being part of that path, so they're removed for now.
* vlm_client.make_vlm_client: keep only backend='openai' (+ 'stub'
rejected with the usual guidance). Requesting 'vllm'/'transformers'
now raises a clear 'not supported for now — use the HF Jobs flow'
error. Removed _make_vllm_client and _make_transformers_client.
* config: backend docstring updated (openai-only); default model_id
bumped to Qwen/Qwen3.6-27B to match run_hf_job.
* docs/annotation_pipeline.mdx: remove the '## Running locally'
section; the launcher description now says one vLLM server per GPU
over the OpenAI API, and the 'One Qwen-VL pass' note drops the
'vLLM/transformers fallback' wording.
Tests are unaffected (they construct StubVlmClient directly; nothing
referenced the removed backends).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The annotation tests had never actually run in CI (collection failed on
the missing 'datasets' extra); now that they do, three stale assertions
surfaced against the evolved pipeline:
* test_module1_plan_memory_subtask_smoke: the memory canned-responder
marker 'Update the memory' no longer appears in module_1_memory.txt
(now 'compressed semantic memory'), so the stub returned no memory
row and the {subtask,plan,memory} subset check failed. Marker
updated to match the current prompt.
* test_module2_mid_episode_emits_paired_interjection_and_speech: the
interjection marker 'Write ONE interjection' is now 'Write ONE
compact interjection' in module_2_interjection.txt, so 0 interjections
were emitted. Marker updated.
* tests/datasets/test_language.py::test_style_registry_routes_columns:
PERSISTENT_STYLES gained 'action_record' in this PR; add it to the
expected set.
These are test/prompt-marker syncs — no production behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fast Pytest 'dataset' tier failed collecting
tests/datasets/test_video_decoder_cache.py with 'Could not load
libtorchcodec ... undefined symbol: torch_dtype_float4_e2m1fn_x2', a
torch/torchcodec ABI mismatch.
Root cause: the annotations extra's vllm hard-pins an older torch
(via xformers/xgrammar -> torch 2.8). uv resolves a SINGLE unified lock
across all extras, so vllm capped torch to 2.8 for every tier —
including dataset, whose torchcodec 0.11.1 needs torch 2.11. The
result was torch 2.8 + torchcodec 0.11.1 installed together -> ABI break.
(main has no vllm, so it resolves torch 2.11 + torchcodec 0.11.1 cleanly.)
Fix: remove vllm from the annotations extra. It is not needed by
the shipped workflow — examples/annotations/run_hf_job.py gets vllm from
the vllm/vllm-openai image and talks to it over the OpenAI-compatible
API (--vlm.backend=openai), and vlm_client._make_vllm_client imports vllm
lazily. For the in-process --vlm.backend=vllm path, install vllm
separately (the ImportError now says so).
After the fix uv resolves torch 2.11.0 + torchcodec 0.11.1 (matching
main); uv lock --check is clean. The annotations extra still provides
datasets / transformers / openai.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fast Pytest Tests failed at COLLECTION in the base '--extra test' tier
with 'ModuleNotFoundError: No module named datasets':
tests/annotations/conftest.py imported the fixture dataset builder (->
lerobot.datasets -> the HF 'datasets' lib + pandas/pyarrow), which only
ship under the 'dataset' extra, so the whole annotations package crashed.
Fix uses the repo's proven module-level guard pattern (see
tests/datasets/test_language.py), NOT a conftest-level importorskip —
verified empirically that pytest.importorskip raised during conftest
*import* is treated as a collection ERROR (exit 1), while module-level
importorskip is a clean SKIP.
* conftest.py: import build_annotation_dataset LAZILY inside the
fixtures so the conftest itself imports cleanly in every tier.
* test_modules / test_validator / test_writer /
test_pipeline_recipe_render: add module-level
pytest.importorskip('datasets') and pytest.importorskip('pandas')
before the pyarrow / lerobot.* imports (# noqa: E402 to match the
existing convention).
* tests/scripts/test_lerobot_annotate.py: same guard (its _push_to_hub
path imports lerobot.datasets).
Result:
- base / hardware / viz tiers (no dataset extra): annotation tests
skip cleanly; the rest of the suite runs -> exit 0.
- dataset tier: datasets present -> guards pass through -> annotation
tests run with the stub VLM. The pipeline modules import only
stdlib + relative + lerobot.datasets (no module-level datatrove /
vllm / openai), so they import fine there.
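The lazy-import half of the fix can be sketched with the standard library alone. The helper name and the missing module name are hypothetical; the point is only that the import happens when the fixture body runs, not when the file is imported.

```python
import importlib
from types import ModuleType


def load_builder(module_name: str) -> ModuleType:
    """Import the heavy dependency only when a fixture actually needs it.

    Defining this function never touches module_name, so a conftest that
    contains it imports cleanly even in tiers where the extra is absent.
    """
    return importlib.import_module(module_name)


# Importing this file succeeds in every tier. The cost (or the failure)
# is deferred to the first call:
#   load_builder("json")                          -> works everywhere
#   load_builder("hypothetical_missing_extra")    -> ModuleNotFoundError
```

The module-level pytest.importorskip guards in the test files then turn that deferred failure into a clean skip for the tiers that lack the extra.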
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The docs pointed at src/lerobot/datasets/v30/, which does not exist.
Both scripts actually live in src/lerobot/scripts/:
- convert_dataset_v21_to_v30.py
- augment_dataset_quantile_stats.py
Updated the four references (one python -m module path and three
file-path invocations) to the correct location, matching each
script's own usage docstring.
* fix(train): enable relative action overrides for pretrained processors
Keep pretrained processor pipelines when use_relative_actions is enabled and
apply relative/absolute action processor settings through overrides. Rename the
relative action processor registry key to relative_actions_processor.
* fix(config): reject rename_map without pretrained checkpoint
Fail fast when rename_map is set during fresh initialization, since fresh
configs derive feature names from the current dataset and no rename is applied.
---------
Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com>
Quality-gate fix: ruff-format/markdown prettier hook reflow of the
annotation pipeline doc. No content change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Quality-gate fixes after the main merge:
* UP037: drop redundant quotes from PlanConfig forward-ref annotations
(action_records / task_aug_axes) — safe under 'from __future__ import
annotations'.
* ruff format applied to config.py, executor.py, general_vqa.py,
plan_subtasks_memory.py, validator.py, lerobot_annotate.py.
No behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Long episodes no longer get sparse subtasks. Previously a long episode
was subsampled to max_video_frames=32 across its whole duration (~1
frame/4s for a 2-min clip). New opt-in windowing keeps a CONSTANT
frames_per_second density by splitting the episode into fixed-length
windows and running the subtask chain per window.
New PlanConfig.subtask_window_seconds (default 0.0 = off). When > 0 and
the episode is longer than one window:
* episode is split into consecutive [w0, w1] windows of this length
* each window's frames are sampled at frames_per_second (so a 32s
window at 1 fps = 32 frames, filling but not exceeding the per-call
context budget)
* the full describe -> segment -> verify chain runs PER window, in
window-relative time [0, L]; spans are offset back to absolute
* all windows' spans are merged, frame-snap-deduped, and stitched into
one contiguous whole-episode cover
Implementation:
* _episode_video_block / _video_message / _describe_episode /
_verify_subtasks gain an optional window=(w0,w1); when set they
embed frames sampled in that absolute range at frames_per_second
(video_url path skipped — it's whole-episode).
* _clean_spans gains bounds= (override clamp range, for window-relative
spans) and dedupe= (skip frame-snap until the merged absolute set).
* new _generate_subtasks_windowed + _subtasks_for_window orchestrate
the loop; _generate_subtasks branches to them when window_s > 0.
run_hf_job.py: --plan.subtask_window_seconds=32 (32s windows at 1 fps).
Cost scales with episode length (chain calls × ceil(duration/window)).
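A rough sketch of the windowing arithmetic, assuming spans are (start, end, text) tuples. The helper names are illustrative and do not match _generate_subtasks_windowed or _subtasks_for_window.

```python
import math


def split_windows(duration_s: float, window_s: float) -> list[tuple[float, float]]:
    """Consecutive [w0, w1] windows; a single window when windowing is off."""
    if window_s <= 0 or duration_s <= window_s:
        return [(0.0, float(duration_s))]
    count = math.ceil(duration_s / window_s)
    return [
        (i * window_s, min((i + 1) * window_s, float(duration_s)))
        for i in range(count)
    ]


def sample_times(w0: float, w1: float, fps: float) -> list[float]:
    """Absolute timestamps at a constant density inside one window."""
    count = max(1, int((w1 - w0) * fps))
    return [w0 + i / fps for i in range(count)]


def to_absolute(spans: list[tuple[float, float, str]], w0: float):
    """Shift window-relative spans back onto the episode timeline."""
    return [(start + w0, end + w0, text) for start, end, text in spans]
```

For a 70 s episode with 32 s windows this yields three windows, so the chain runs three times; the last window is shorter and embeds fewer frames.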
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Swap the annotation VLM from Qwen3.6-35B-A3B (sparse MoE, ~3B active)
to Qwen3.6-27B (dense, 27B all-active). Per Scale's dense-captioning
study, model capacity is the #1 lever and the dominant failure is
visual grounding — both helped by ~9x more active params. Qwen3.6-27B
is a vision-language model (vision encoder, image + video), same family
so the chat template / video handling / enable_thinking=false flag are
unchanged, and at 27B dense it still fits one H200 per server, so the
two-parallel-server layout (TP=1, one per GPU) is preserved — no
throughput-layout change, just a much stronger model.
Kept: parallel_servers=2, num_gpus=2, max-model-len 32768 (the 32-frame
embedded budget is ~10k tokens, well under), gpu-mem 0.8.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adopt the one prompt technique Scale's dense-captioning study found
reliably positive: targeted, verb-scoped, visually-grounded
disambiguation rules. Their lesson was that such a rule must fire ONLY
on the spatial situation it names (their narrow 'Stack vs Put' rule
helped; an over-broad directional 'Scoop' rule bled into other verbs
and hurt), so each rule here is phrased visually and scoped to one
confusable pair:
* stack-vs-put (on top of an object vs on a surface)
* insert-vs-put (fitted slot vs surface)
* pick-up/retrieve-vs-put (decide by which way the OBJECT moves:
gripper closes + object moves with hand = pick up; gripper opens +
object stays = put — directly targets Scale's dominant
direction-flip failure)
* pour-vs-put (tilt + flow vs untilted move)
This is the highest-confidence, lowest-risk change from the Scale
findings; our pipeline already aligns with their 'avoid' list (no
temporal tokens, no overlays, no fancy sampling, no sequential context
injection, uniform sampling, describe-don't-predict framing).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switching the plan module to embedded frames (use_video_url=false)
exposed a context overflow: at frames_per_second=2.0 with the old
max_video_frames=128 default, a 480x640 episode embeds ~128 frames ≈
33-39k vision tokens, over the model's 32768 context — every plan call
died with 'Input length exceeds maximum context length' (HTTP 400),
crashing the whole annotation job.
The video_url path never hit this because the server downsampled; the
embedded path sends every sampled frame, so the frame count is a hard
token budget.
Fix:
* config default max_video_frames 128 -> 32 (~8-10k vision tokens,
comfortable headroom for the prompt + describe/verify passes).
Frames are still sampled UNIFORMLY across the whole episode, so
longer episodes are subsampled, not truncated — full temporal
coverage preserved, just coarser density.
* run_hf_job.py: frames_per_second 2.0 -> 1.0, explicit
--plan.max_video_frames=32, with a comment explaining the token
budget and the 'do not raise toward 128 with embedded frames' rule.
Only the plan module embeds the full episode; VQA (1 frame/tick) and
interjections (4-frame window) were never at risk.
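The budget reasoning can be sketched as below. The per-frame token cost is an assumption back-derived from the figures above (about 33-39k tokens for 128 frames at 480x640); the real cost depends on the model and resolution, and the function names are illustrative.

```python
ASSUMED_TOKENS_PER_FRAME = 300  # assumption: roughly 38k tokens / 128 frames


def uniform_frame_indices(n_frames: int, max_frames: int) -> list[int]:
    """Pick at most max_frames indices spread evenly over the whole episode."""
    if max_frames < 1:
        raise ValueError("max_frames must be >= 1")
    if n_frames <= max_frames:
        return list(range(n_frames))
    if max_frames == 1:
        return [0]
    step = (n_frames - 1) / (max_frames - 1)
    return [round(i * step) for i in range(max_frames)]


def fits_context(
    n_frames: int,
    context_len: int = 32768,
    tokens_per_frame: int = ASSUMED_TOKENS_PER_FRAME,
) -> bool:
    """Crude check that the embedded frames alone stay under the context."""
    return n_frames * tokens_per_frame < context_len
```

Under this estimate 128 frames overflow a 32768-token context before the prompt is even counted, while 32 frames leave most of it free.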
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Remove the subtask_full_coverage config flag. Stitching subtask spans
into a contiguous full-episode cover is now always applied in
_generate_subtasks — a sparse / gap-ridden subtask timeline is never
desirable for conditioning, so there's no reason to make it optional.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The verify pass prunes subtasks, which could leave the first subtask
starting after t0 or leave gaps between spans — so the subtask timeline
no longer tiled the episode and frames fell through with no active
subtask label.
New deterministic post-step (no VLM call), default on via
PlanConfig.subtask_full_coverage:
* first subtask start pulled back to the episode's first frame t0
(idle / approach before the first labelled action folds into it)
* each subtask end snapped to the next subtask start (gaps closed)
* last subtask end extended to the last frame t_last
Runs after segment + verify in _generate_subtasks. Starts other than
the first are left as the VLM/verify produced them (already frame-
snapped + distinct), so the cover is contiguous and non-overlapping.
Disable with --plan.subtask_full_coverage=false if a consumer wants
sparse subtasks.
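The three rules above amount to a few lines. This is a sketch assuming spans are (start, end, text) tuples with distinct starts; the function name is hypothetical.

```python
def stitch_full_coverage(
    spans: list[tuple[float, float, str]], t0: float, t_last: float
) -> list[tuple[float, float, str]]:
    """Turn sparse spans into a contiguous, non-overlapping episode cover."""
    if not spans:
        return []
    ordered = sorted(spans, key=lambda span: span[0])
    stitched = []
    for i, (start, _end, text) in enumerate(ordered):
        # Rule 1: the first subtask starts at the first frame.
        new_start = t0 if i == 0 else start
        # Rule 2: each end snaps to the next start; rule 3: last end is t_last.
        new_end = ordered[i + 1][0] if i + 1 < len(ordered) else t_last
        stitched.append((new_start, new_end, text))
    return stitched
```

Original end times are discarded on purpose: only the starts (after the first) carry information, so gaps left by pruning close automatically.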
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Flip PlanConfig.subtask_describe_first and subtask_verify defaults
False -> True. Every subtask annotation now runs the 3-call grounding
+ pruning chain by default, since the single-call path reliably
hallucinates steps from the task text. Costs 2 extra VLM calls/episode;
disable with --plan.subtask_describe_first=false /
--plan.subtask_verify=false on easy datasets where fewer calls matter
more than label fidelity.
run_hf_job.py: drop the now-redundant explicit flags, leave a note that
the chain is default-on and how to opt out.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The single-call 'watch video -> emit subtask JSON' pattern makes the
VLM commit to structured output before reasoning about what it saw, so
it pattern-matches the task text and hallucinates steps. Split it into
an opt-in multi-call chain that grounds first and prunes last.
New PlanConfig flags (both default False -> single-call unchanged):
* subtask_describe_first: a grounding pass narrates ONLY what is
visible in the video (no subtask JSON yet). That description is
injected into the segmentation prompt via a new {observation_block}
placeholder, so the model segments its own grounded observations
instead of the instruction text. +1 VLM call/episode.
* subtask_verify: after segmentation, an adversarial pass re-watches
the video and drops any candidate subtask it cannot see. Can only
PRUNE (never add/rewrite/move) and fails open (keeps un-verified
spans if the call returns nothing). +1 VLM call/episode.
Implementation:
* _generate_subtasks now orchestrates describe -> segment -> verify.
* Factored span cleaning into _clean_spans (shared by segment + verify
outputs); added _describe_episode and _verify_subtasks helpers.
* New prompts module_1_subtask_describe.txt (returns {description})
and module_1_subtask_verify.txt (returns pruned {subtasks}).
* module_1_subtasks.txt gains a {observation_block} slot at the top.
run_hf_job.py enables both for the RoboCasa run (3 VLM calls/episode
for subtasks). Combined with single-camera grounding + the embedded-
frame path, this is the high-quality configuration.
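The control flow of the chain, with the three VLM calls replaced by plain callables, might look like this. It is a sketch of the prune-only / fail-open contract, not the real _generate_subtasks; all names are illustrative.

```python
from typing import Callable

Span = dict  # e.g. {"start": 0.0, "end": 4.0, "text": "pick up cup"}


def run_chain(
    describe: Callable[[], str],
    segment: Callable[[str], list[Span]],
    verify: Callable[[list[Span]], list[Span]],
) -> list[Span]:
    """describe -> segment -> verify, where verify may only drop candidates."""
    description = describe()            # grounding pass, no JSON yet
    candidates = segment(description)   # segmentation sees the description
    verdict = verify(candidates)        # adversarial re-watch
    if not verdict:
        return candidates               # fail open: keep unverified spans
    allowed = {(s["start"], s["end"], s["text"]) for s in candidates}
    # Anything verify added, rewrote, or moved is not in `allowed` and is dropped.
    return [s for s in verdict if (s["start"], s["end"], s["text"]) in allowed]
```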
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes for 'subtasks describe actions not in the video' plus a way
to focus the whole pipeline on one camera.
ANTI-HALLUCINATION
1. _episode_video_block: when use_video_url is set but clip extraction
fails, FALL BACK to embedded frames instead of returning an empty
block. An empty block left the VLM with zero visual grounding, so
it invented subtasks from the task text alone — the likely root
cause of hallucinated steps. Now logs a warning and embeds frames.
2. module_1_subtasks.txt gains a GROUNDING preamble (overrides all
other rules): label only motion visible in specific frames; never
invent/anticipate/pad; max_steps is a CEILING not a target; atomic
demos may be exactly ONE subtask; the VIDEO is ground truth, not
the instruction text.
SINGLE-CAMERA GROUNDING
* New VqaConfig.restrict_to_default_camera (default False). When True,
the VQA module grounds on only the --vlm.camera_key stream instead
of iterating every camera — matching the plan / interjection
modules, which already use that single camera. Now the whole
pipeline can focus on one view (e.g. observation.images.base).
run_hf_job.py updated:
* use_video_url=false + frames_per_second=2.0 — embed frames directly
(most reliable; no silent text-only failure mode) with dense
grounding.
* vqa.restrict_to_default_camera=true — VQA on the single camera too.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops the replace_subtask_text option and the
_render_action_record_to_subtask_text renderer. Action records are now
strictly additive: when action_records.enabled=True the module emits
style='action_record' rows (the typed {verb,object,arm,grasp,dest,
mistake} schema) and NEVER rewrites the subtask text the policy
conditions on.
The render-back-to-text path was the source of corrupted subtasks
(navigation tasks produced 'move stove to stove', manipulation tasks
got spurious 'with left arm using pinch grip' suffixes). Reconstructing
natural-language subtasks from hallucinated structured fields is
inherently fragile, so the capability is removed rather than guarded.
Removed:
* ActionRecordsConfig.replace_subtask_text field
* PlanSubtasksMemoryModule._render_action_record_to_subtask_text
* the span['text'] = canonical_text overwrite in run_episode
Updated docstrings + run_hf_job.py comment accordingly. emit_record_row
(default True) is now the feature's only output.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three compounding bugs made RoboCasa annotation produce off-task
subtasks ('move stove to stove with left arm') and drifting
augmentations ('wander around the kitchen' for 'Navigate to the stove').
1. action_records.replace_subtask_text now defaults False.
Overwriting the VLM's subtask text with a reconstruction of
hallucinated {verb,object,arm,grasp,dest} fields is high-risk:
navigation / non-manipulation tasks don't fit the schema and render
to nonsense. Records are now additive by default (emit_record_row),
never silently replacing subtask text. Flip replace_subtask_text on
only for manipulation datasets verified to render cleanly.
2. _render_action_record_to_subtask_text drops a degenerate
destination that just echoes the object (verb=move object=stove
destination=stove -> 'move stove' instead of 'move stove to stove').
Also routes 'navigate' through the 'to <dest>' preposition family.
3. module_1_task_aug_axes.txt hardened: variants MUST preserve the
goal/destination. Explicitly forbids 'Navigate to the stove' ->
'wander around the kitchen'. Only wording / arm / orientation /
grasp may vary; verb meaning, object, and destination are fixed.
examples/annotations/run_hf_job.py — corrected for RoboCasa:
* derive_task_from_video=off (was =always). The dataset task string
is authoritative and is what eval conditions on; =always threw it
away, re-derived a hallucinated task from the video, and poisoned
every downstream subtask/plan row. THIS was the dominant cause.
* n_task_rephrasings=0 + task_aug_axes left off — RoboCasa eval uses
exact task strings, so augmentation is unused/harmful.
* action_records left off — manipulation schema doesn't fit atomic /
navigation tasks.
* plan_max_steps=6 to keep atomic-task decomposition tight.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
VideoFrameProvider derived its default camera and camera list from
meta.camera_keys, which mixes image- and video-stored cameras. The
clip/decode paths read videos/<key>/from_timestamp, which only exists
for video keys, so an image-stored camera sorted first (e.g.
observation.images.wrist) crashed the plan phase with a KeyError.
Restrict the list and default to meta.video_keys. Add a regression test
and point the example job at the dataset's actual video camera. Skip
bandit B607 (ffmpeg/git are intentionally resolved via PATH).
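A small sketch of the selection change. FakeMeta is a hypothetical stand-in for the dataset metadata, and raising on an empty list is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class FakeMeta:
    """Hypothetical metadata: camera_keys mixes image- and video-stored cameras."""

    camera_keys: list[str]
    video_keys: list[str]


def video_cameras(meta: FakeMeta) -> tuple[str, list[str]]:
    """Return (default camera, camera list) drawn from video keys only."""
    cameras = list(meta.video_keys)
    if not cameras:
        raise ValueError("dataset has no video-stored cameras")
    return cameras[0], cameras
```

With camera_keys the image-stored wrist camera could sort first and become the default; with video_keys it is never considered.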
Co-authored-by: Cursor <cursoragent@cursor.com>
EgoMimic-inspired additions to the plan module, both opt-in for back-compat.
1. PHASE 1a + 1b: per-subtask structured action records
* cfg.action_records.enabled=True triggers, after Phase 1 subtask-span
generation, one extra VLM call per subtask to extract a typed record:
{verb, object, arm, grasp_type, destination, mistake}
* A deterministic Python template (_render_action_record_to_subtask_text)
renders the record back to canonical subtask text. When
replace_subtask_text=True (default), this REPLACES the VLM's free-form
text, eliminating cross-episode phrasing drift.
* When emit_record_row=True (default), the structured record is also
emitted as a row with style='action_record' (added to PERSISTENT_STYLES)
so downstream training can consume the typed schema directly.
* Verb + grasp vocabularies are configurable. Out-of-vocab values are
rejected at extraction time.
2. STRUCTURED 5-AXIS TASK AUGMENTATION
* cfg.task_aug_axes.enabled=True replaces the free-form n_task_rephrasings
path with a structured prompt producing variants along 5 named axes:
synonym_paraphrase (3)
omit_arm (3)
omit_orientation (2)
omit_grasp_method (2)
combined_omissions (2)
Total ~12 variants. Axes with nothing to omit emit fewer entries.
* Each variant is emitted as a task_aug row at t=0 (existing style).
Inspired by https://github.com/GaTech-RL2/EgoVerse/tree/main/egomimic/scripts/language_process
— they pay Scale AI annotators to fill a structured form and then generate
language via a deterministic prompt. We get the same hallucination-reducing
structure via one extra VLM call per subtask.
Files:
src/lerobot/datasets/language.py
src/lerobot/annotations/steerable_pipeline/config.py
src/lerobot/annotations/steerable_pipeline/modules/plan_subtasks_memory.py
src/lerobot/annotations/steerable_pipeline/prompts/module_1_action_record.txt
src/lerobot/annotations/steerable_pipeline/prompts/module_1_task_aug_axes.txt
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(rewards): add TOPReward reward model
* refactor(rewards): clean up TOPReward processor/model
* fix(rewards/topreward): add missing input keys mm_token_type_ids
* fix(rewards/topreward): fix pyproject extra typo and simplify processor (#3653)
Add the lerobot[topreward] extra to the 'all' extra in
pyproject.toml, drop the redundant labels arg in scoring, and
collapse the dead-branch shape check in the encoder processor.
* optimize topreward input processing (#3660)
---------
Co-authored-by: Cole <91766445+jcoleharrison@users.noreply.github.com>
Co-authored-by: Haoming Song <haomingsong24@gmail.com>
PR #3145 added YAML support for policy.path but left two bugs:
1. extract_path_fields_from_config only deleted config_data[field] when
no sibling overrides existed. With siblings, the dict stayed in place
and draccus crashed decoding it as PreTrainedConfig (no 'type' key).
Sibling overrides go into _config_yaml_overrides and are applied later
by from_pretrained(), so the field can always be removed.
2. wrap() updated config_path_cli to the cleaned temp file path but
never propagated it to the draccus.parse fallback branch. cli_args
still contained --config_path=<original>, so draccus read the
original YAML with path: still present.
Tests passed because they (a) called extract_path_fields_from_config
directly and (b) included type: alongside path: in the YAML, sidestepping
both bugs.
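Bug 1 can be illustrated with a simplified, hypothetical version of the extraction step. The signature and return shape below are assumptions and do not match the real extract_path_fields_from_config; only the "always remove the field" behaviour is taken from the description.

```python
def extract_path_fields(config_data: dict, fields: list[str]) -> tuple[dict, dict]:
    """Pull `path:` entries out of a parsed YAML dict.

    Returns (paths, sibling_overrides) and ALWAYS removes the field from
    config_data, so the later decoder never sees a dict without a 'type' key.
    """
    paths: dict[str, str] = {}
    overrides: dict[str, dict] = {}
    for name in fields:
        value = config_data.get(name)
        if not isinstance(value, dict) or "path" not in value:
            continue
        paths[name] = value["path"]
        siblings = {k: v for k, v in value.items() if k != "path"}
        if siblings:
            overrides[name] = siblings  # applied later, after loading from path
        del config_data[name]           # the fix: delete even when siblings exist
    return paths, overrides
```

A regression test for this shape should include sibling keys and omit type:, since those are exactly the conditions the earlier tests sidestepped.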
Co-authored-by: Steven Palma <imstevenpmwork@ieee.org>
Tighten ``module_1_subtasks.txt`` so the VLM emits one composite
atomic action per subtask instead of decomposing every pick into
``move to X`` / ``grasp X`` / ``lift X``:
- Lock the verb vocabulary to the composite set the low-level
policy actually learns end-to-end: ``pick up`` (approach + grasp +
lift), ``put``/``place`` (transport + release), ``push``, ``pull``,
``turn``, ``press``, ``open``, ``close``, ``pour``, ``insert``.
``go to`` is allowed only as a pure relocation between phases.
- Add an explicit ``Forbidden ultra-fine splits`` block enumerating
the patterns the VLM was tempted to emit (``move to X``,
``reach for X``, ``grasp X``, ``lift X``, ``release X``) and
instructing it to fold each into its parent composite.
- Rewrite the Good/Bad examples to match the composite contract;
the previous ``"move to blue cube" / "grasp blue cube" / "lift
blue cube"`` Good list was actively encouraging the over-
segmentation pattern this prompt is supposed to prevent.
- Tighten the duration rule: candidates shorter than
``min_subtask_seconds`` must be merged into a neighbour rather
than emitted. Pairs with bumping the runtime floor to 3 s so
composites have room to land.
Pure prompt change — no code or schema change. Existing canonical-
vocabulary retry path is unaffected (the new verb whitelist lives
in prose, not in the validator).
Co-authored-by: Cursor <cursoragent@cursor.com>
Heterogeneous datasets (different tasks/scenes across episodes) don't
share a single small subtask + memory vocabulary, so the canonical
vocabulary phase narrowed every episode to the wrong target distribution.
Flip the example to free-form generation by default and document the
``--vocabulary.enabled=true`` switch for homogeneous datasets where the
canonical vocabulary still helps the downstream policy.
No pipeline-code changes: ``VocabularyConfig.enabled`` already gates
phase 0 (see ``executor.py:_run_vocabulary_phase`` and
``VocabularyConfig`` docstring) and falls back to free-form generation.
Co-authored-by: Cursor <cursoragent@cursor.com>
Resolves conflicts from 32 commits on main:
* docs/source/_toctree.yml — keep both new toc entries
(annotation_pipeline + video_encoding_parameters).
* docs/source/language_and_recipes.mdx — adopt main's section
ordering (Layer 2 before "Temporal semantics") and float32
timestamp dtype to match the codebase.
* src/lerobot/configs/__init__.py — keep both export sets
(recipe + video encoder).
* src/lerobot/datasets/dataset_metadata.py — drop redundant lazy
imports (top-level imports cover both LANGUAGE_COLUMNS and
DEFAULT_TOOLS); adopt main's @tools.setter for info.json
write-back.
* src/lerobot/datasets/feature_utils.py — call the real
validate_feature_language() instead of returning "".
* src/lerobot/datasets/language.py — float32 timestamps to match
pa.float32() used in video_utils.py and the rest of the codebase.
* src/lerobot/datasets/language_render.py — adopt main's
unwrap_scalar() helper (drops two hand-rolled .item()/list
unwrappers); float32 in docstring.
* src/lerobot/processor/render_messages_processor.py — drop
PR-local _scalar() helper, use shared unwrap_scalar().
* tests/datasets/test_language.py — adopt main's new float32 dtype
+ validate_feature_language warning tests.
* tests/datasets/test_dataset_metadata.py — adopt main's new
tools.setter persist/clear tests.
* uv.lock — regenerated cleanly from main's resolver.
90 of 92 touched tests pass. Two pre-existing test failures
(test_module1_plan_memory_subtask_smoke,
test_module2_mid_episode_emits_paired_interjection_and_speech in
tests/annotations/test_modules.py) are unrelated to this merge —
that test file doesn't exist on main, so the failures originate on
the branch and are addressed by the 8 newer fix(annotate) commits
already on origin that will land in a follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
If two consecutive VLM-emitted subtask spans have ``start`` timestamps
that round to the same source frame after ``snap_to_frame`` (e.g. on
short episodes the VLM sometimes nominates two ~adjacent action
boundaries within one 30 Hz step), the writer emits two
``style=subtask`` rows at the identical persistent timestamp. The
training-time renderer's default binding
``subtask: active_at(t, style=subtask)`` then raises:
ValueError: Ambiguous resolver for style='subtask';
add role=..., tool_name=..., or camera=... to disambiguate.
… and the whole training run dies on the first batch.
Observed concretely on ``pepijn223/super_poulain_vocab2`` (job
22159979): episodes 3 and 30 each had two subtask rows at the same
timestamp (``release yellow cube`` + ``retract arm`` snapping to the
same frame).
Add ``_dedupe_starts_to_distinct_frames`` to walk the cleaned span list
and, whenever a snapped start collides with one already used, push the
later span onto the next free frame timestamp. Both subtasks survive
on distinct timestamps, so the resolver is no longer ambiguous. If the
episode genuinely has no later free frame (extremely unlikely — would
require a same-timestamp collision on the very last frame of the
episode), the later span is dropped with a warning rather than left
to poison the render.
New test ``test_plan_module_bumps_collocated_subtasks_to_distinct_frames``
locks in the contract; full vocabulary suite is 14/14 green.
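The collision handling described above can be sketched as follows. This is
an illustration under stated assumptions, not the code in this commit: spans
are assumed to be dicts whose ``start`` has already been snapped to an entry
of a sorted ``frame_timestamps`` list, and the function name is hypothetical.

```python
import logging

logger = logging.getLogger(__name__)


def dedupe_starts(spans, frame_timestamps):
    """Push spans whose snapped start collides onto the next free frame.

    Assumes every span["start"] is exactly one of frame_timestamps.
    """
    index_of = {ts: i for i, ts in enumerate(frame_timestamps)}
    used = set()
    out = []
    for span in spans:
        idx = index_of[span["start"]]
        # Walk forward until we find a frame no earlier span has claimed.
        while idx in used:
            idx += 1
        if idx >= len(frame_timestamps):
            # No later free frame: drop the span rather than emit a duplicate.
            logger.warning("dropping subtask %r: no free frame left", span["text"])
            continue
        used.add(idx)
        out.append({**span, "start": frame_timestamps[idx]})
    return out
```

Working on frame indices rather than raw float timestamps keeps the
"next free frame" step exact and avoids comparing floats for near-equality.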
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
The Jaccard-overlap snap was warping VLM output into wrong canonical
labels — e.g. an off-vocab "consult the wizard" span would silently
become "grasp blue cube" if that scored highest. Even with a higher
floor the operator can't tell which subtasks were paraphrases vs
genuine mislabels in the resulting dataset.
Replace with strict exact-match validation + a single targeted retry:
1. Generate subtasks as before.
2. If any returned subtask's normalised form (lowercased, articles
stripped, whitespace collapsed) isn't in the canonical vocab,
fire one retry call naming the offending strings and re-sending
the full canonical list. The retry prompt requires byte-identical
output from the vocab.
3. After the retry, validate again. Spans still off-vocab are
dropped — no fuzzy snapping ever produces a different canonical
label than the VLM actually emitted.
4. If every span ends up off-vocab even after the retry, warn loudly
so the operator extends ``meta/canonical_vocabulary.json`` to
cover the missing phase. The episode is left with empty subtasks
rather than silently fabricated ones — visibility > sweep-under-
the-rug.
Promote ``_NORMALIZE_STRIP_TOKENS`` to a class constant and split the
normalisation helper out so the retry-validation and the final
canonicalisation share one source of truth.
Tests:
- test_plan_module_accepts_article_only_difference: "grasp the blue
cube" still maps to canonical "grasp blue cube" (article-tolerant).
- test_plan_module_retries_when_subtask_off_vocab: paraphrase
triggers the retry which the VLM corrects in pass 2.
- test_plan_module_drops_off_vocab_subtask_after_retry: VLM that
refuses to correct → bad span dropped, in-vocab span kept.
- test_plan_module_empty_when_all_off_vocab_after_retry: every
span off-vocab → episode left empty (no warping).
All 13 vocabulary tests pass.
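A rough sketch of the validate, retry, and drop flow in steps 1-4, under
stated assumptions: the ``retry`` callable stands in for the second VLM call
and is assumed to return a full replacement list of subtask strings, and all
names here are hypothetical rather than taken from the module.

```python
import logging

logger = logging.getLogger(__name__)

# Assumed article set; the real _NORMALIZE_STRIP_TOKENS may differ.
ARTICLES = frozenset({"a", "an", "the"})


def normalize(text):
    """Lowercase, strip articles, collapse whitespace."""
    return " ".join(t for t in text.lower().split() if t not in ARTICLES)


def validate_subtasks(subtasks, vocab, retry):
    """Exact-match validation with a single retry; never fuzzy-snaps."""
    canonical = {normalize(v): v for v in vocab}
    offending = [s for s in subtasks if normalize(s) not in canonical]
    if offending:
        # One targeted retry: name the bad strings, re-send the full vocab.
        subtasks = retry(offending, list(vocab))
    kept = [canonical[normalize(s)] for s in subtasks if normalize(s) in canonical]
    if subtasks and not kept:
        logger.warning("all subtasks off-vocab after retry; extend the vocabulary")
    return kept
```

Both the pre-retry check and the final canonicalisation go through the same
``normalize`` function, which mirrors the single-source-of-truth point above.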
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
When the canonical vocabulary is enabled and the VLM produces spans
that don't overlap any canonical label, the previous Jaccard-floor
(0.5) dropped them and the episode came out with no subtasks at all
— invisible to the downstream policy. Observed on
``pepijn223/super_poulain_vocab``: some episodes had empty subtask
columns because every VLM-emitted phrase scored below 0.5 against
the discovered vocabulary.
Two-pass canonicalisation:
- First pass keeps the Jaccard floor (lowered from 0.5 → 0.25, to
let mild paraphrases through) and drops everything below.
- If that first pass leaves the episode with **zero** subtasks,
fall back to a second pass that always snaps each VLM span to
its nearest canonical label by Jaccard (no floor). The episode
ends up with subtasks even when the vocabulary missed a phase
— a slightly-wrong canonical label is still closer to the right
motion than nothing at all.
- Log loudly when the fallback fires so the operator can spot
coverage gaps in ``meta/canonical_vocabulary.json``.
- Log a per-episode count at INFO when some (but not all) spans
were dropped so it's visible without spamming the run output.
Promote the Jaccard floor + ignore-tokens to class constants so
they're a single edit point. Add ``force=True`` parameter to
``_canonicalize_subtask`` for the no-floor fallback path.
New test ``test_plan_module_snaps_when_all_off_vocab`` covers the
fallback; existing ``test_plan_module_drops_off_vocab_subtask`` is
adjusted to keep at least one in-vocab span so the floor path is
still exercised. All 12 vocabulary tests pass.
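For illustration, the two-pass canonicalisation could look roughly like the
sketch below. It assumes token-set Jaccard over whitespace-split words and
omits the ignore-tokens handling; the function names and signatures are
hypothetical and only the 0.25 floor and the ``force`` flag come from this
message.

```python
def jaccard(a, b):
    """Token-set Jaccard overlap between two phrases."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def canonicalize(span, vocab, floor=0.25, force=False):
    """Return the nearest canonical label, or None if below the floor."""
    best = max(vocab, key=lambda v: jaccard(span, v))
    if force or jaccard(span, best) >= floor:
        return best
    return None


def canonicalize_episode(spans, vocab, floor=0.25):
    # Pass 1: keep only spans that clear the floor.
    kept = [c for s in spans if (c := canonicalize(s, vocab, floor)) is not None]
    # Pass 2: if nothing survived, snap every span with no floor.
    if not kept and spans:
        kept = [canonicalize(s, vocab, force=True) for s in spans]
    return kept
```

Note that with zero overlap the forced pass picks an arbitrary label (here,
the first in the vocabulary), which is why the fallback is logged loudly.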
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>