Mirror of https://github.com/huggingface/lerobot.git (synced 2026-05-30 18:31:25 +00:00)
Heterogeneous datasets (different tasks/scenes across episodes) don't share a single small subtask + memory vocabulary, so the canonical vocabulary phase narrowed every episode to the wrong target distribution.

Flip the example to free-form generation by default and document the ``--vocabulary.enabled=true`` switch for homogeneous datasets where the canonical vocabulary still helps the downstream policy.

No pipeline-code changes: ``VocabularyConfig.enabled`` already gates phase 0 (see ``executor.py:_run_vocabulary_phase`` and ``VocabularyConfig`` docstring) and falls back to free-form generation.

Co-authored-by: Cursor <cursoragent@cursor.com>
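
For readers unfamiliar with the gating described above, here is a minimal sketch of the pattern, not the actual lerobot code. Only the ``VocabularyConfig.enabled`` field comes from this commit message; the ``run_vocabulary_phase`` and ``label_episode`` helpers, their signatures, and the closest-match rule are hypothetical stand-ins chosen for illustration.

```python
import difflib
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VocabularyConfig:
    # Stand-in for the real config; only `enabled` is named in the commit
    # message. Defaulting to False mirrors the new free-form default.
    enabled: bool = False


def run_vocabulary_phase(
    config: VocabularyConfig, labels: List[str]
) -> Optional[List[str]]:
    """Hypothetical phase 0: build a shared vocabulary, or skip when disabled."""
    if not config.enabled:
        return None  # caller falls back to free-form generation
    return sorted(set(labels))


def label_episode(raw_label: str, vocabulary: Optional[List[str]]) -> str:
    """Keep the free-form label, or snap it to the closest vocabulary entry."""
    if not vocabulary:
        return raw_label
    # cutoff=0.0 always returns the best-scoring entry, however poor the match.
    # That forced match is the narrowing that hurts heterogeneous datasets.
    return difflib.get_close_matches(raw_label, vocabulary, n=1, cutoff=0.0)[0]


if __name__ == "__main__":
    seen = ["pick up the cup", "place the cup on the shelf"]
    for enabled in (False, True):
        vocab = run_vocabulary_phase(VocabularyConfig(enabled=enabled), seen)
        print(enabled, "->", label_episode("fold the towel", vocab))
```

With the flag off, an out-of-vocabulary label such as "fold the towel" passes through unchanged. With it on, the label is forced onto one of the cup-related entries, which is acceptable only when every episode really does share the same small set of subtasks.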