annotations(steerable): remove Phase 0 canonical vocabulary discovery

Drops the optional Phase 0 vocabulary-discovery feature entirely. With
the new structured action records (Phase 1a + 1b) providing
cross-episode consistency via the deterministic template renderer, the
older vocabulary-constraint path is redundant and adds a second
constraint mechanism that wasn't well-validated in practice.

Removed:

* src/lerobot/annotations/steerable_pipeline/vocabulary.py
  (Vocabulary dataclass + VocabularyDiscoveryModule +
  load_/save_vocabulary helpers; canonical_vocabulary.json on-disk
  format)
* src/lerobot/annotations/steerable_pipeline/prompts/module_0_vocabulary.txt
  (Phase 0 VLM prompt)
* tests/annotations/test_vocabulary.py

Pruned wiring across:

* config.py: VocabularyConfig dataclass +
  AnnotationPipelineConfig.vocabulary field
* executor.py: vocabulary attribute on Executor +
  _run_vocabulary_phase method + Phase 0 phases.append call in run()
* modules/plan_subtasks_memory.py: Vocabulary import + vocabulary
  attribute + _subtask_vocabulary_block / _memory_vocabulary_block
  helpers + _canonicalize_subtask / _normalize / _invalid_subtasks /
  _build_subtask_retry_message methods + vocabulary-gated retry path
  in _generate_subtasks + empty-episode warning +
  _NORMALIZE_STRIP_TOKENS constant
* prompts/module_1_subtasks.txt: {vocabulary_block} placeholder
* prompts/module_1_memory.txt: {vocabulary_block} placeholder
* __init__.py: Vocabulary / VocabularyDiscoveryModule /
  load_vocabulary / save_vocabulary / vocabulary_path /
  VOCABULARY_FILENAME re-exports
* scripts/lerobot_annotate.py: VocabularyDiscoveryModule import +
  instantiation + executor argument
* examples/annotations/run_hf_job.py: --vocabulary.enabled=false flag
  + docstring references + inline phase-0 comment

The original free-form rephrasings path stays
(PlanConfig.n_task_rephrasings still works when
task_aug_axes.enabled=False). Action records remain the preferred
mechanism for cross-episode subtask consistency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 20:01:25 +00:00 · 2026-06-02 11:48:05 +02:00
parent 2bfaf44db2
commit 5dbf0fac5f
11 changed files with 4 additions and 981 deletions
--- a/examples/annotations/run_hf_job.py
+++ b/examples/annotations/run_hf_job.py
@@ -6,15 +6,11 @@ Spawns one ``h200x2`` job that:
  1. installs this branch of ``lerobot`` plus the annotation extras,
  2. boots two vllm servers (one per GPU) with Qwen3.6-35B-A3B-FP8,
  3. runs the plan / interjections / vqa modules across the dataset
-     in free-form mode (phase 0 canonical-vocabulary discovery is
-     disabled — each episode generates its own subtasks + memory),
+     in free-form mode (each episode generates its own subtasks +
+     memory),
  4. uploads the annotated dataset to ``--dest_repo_id`` (when set)
     or back to ``--repo_id``.

-Re-enable phase 0 with ``--vocabulary.enabled=true`` (optionally
-``--vocabulary.sample_episodes=N``) when the dataset is homogeneous
-enough to share one subtask + memory vocabulary across all episodes.
-
 Usage:

    HF_TOKEN=hf_... uv run python examples/annotations/run_hf_job.py
@@ -57,14 +53,6 @@ CMD = (
    "--executor.episode_parallelism=16 "
    "--vlm.chat_template_kwargs='{\"enable_thinking\": false}' "
    "--vlm.camera_key=observation.images.wrist "
-    # Phase 0 — canonical vocabulary discovery DISABLED by default.
-    # Heterogeneous datasets (different tasks/scenes across episodes)
-    # don't share a single small subtask + memory vocabulary, so each
-    # episode generates its subtasks + memory free-form. Flip to
-    # ``--vocabulary.enabled=true`` (optionally ``--vocabulary.sample_episodes=N``)
-    # for homogeneous datasets where a shared canonical vocabulary
-    # helps the downstream policy.
-    "--vocabulary.enabled=false "
    # Phase 1 — plan module (subtasks + plan + memory + task_aug).
    "--plan.frames_per_second=1.0 "
    "--plan.use_video_url=true "