``_load_hf_dataset`` was building the strict cast schema only from
``meta/info.json["features"]``. Datasets annotated by
``lerobot-annotate`` but still tagged at the older codebase version
(no ``language_persistent`` / ``language_events`` entry in
``info.json``) carry both columns in the parquet itself but not in the
features dict, so ``Dataset.from_parquet`` blew up with
``CastError: column names don't match`` when trying to project a
9-column parquet onto a 7-column schema.
Probe one parquet shard's actual schema; if either language column is
present in the parquet but missing from ``features``, graft it on
using PR 1's ``language_persistent_column_feature`` /
``language_events_column_feature`` helpers. No-op when neither column
is present (fully backwards-compatible with v3.0 datasets), no-op when
both are already registered (fully forwards-compatible with future
v3.1 ``info.json`` writes).
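A minimal sketch of the grafting rule described above, not the actual
patch. ``graft_language_features`` and ``placeholder_feature`` are
hypothetical names; the latter stands in for PR 1's helpers so the
example runs without the codebase. In practice the column names could
be read from one shard with ``pyarrow.parquet.read_schema(path).names``.

```python
# Columns that may exist in the parquet without being listed in features.
LANGUAGE_COLUMNS = ("language_persistent", "language_events")


def placeholder_feature(column: str) -> dict:
    # Hypothetical stand-in for language_persistent_column_feature /
    # language_events_column_feature; the real helpers return the
    # actual feature definitions.
    return {"placeholder_for": column}


def graft_language_features(parquet_columns, features: dict) -> dict:
    """Return a copy of ``features`` with unregistered language columns added.

    A column is grafted only if it is present in the parquet schema and
    missing from ``features``; otherwise the features are left unchanged.
    """
    patched = dict(features)  # never mutate the caller's dict
    for column in LANGUAGE_COLUMNS:
        if column in parquet_columns and column not in patched:
            patched[column] = placeholder_feature(column)
    return patched
```

Returning a patched copy leaves the original features dict unchanged,
so the check is safe to run on every load.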
This unblocks dry-run inference on PR 2-annotated datasets that
weren't re-tagged to v3.1 — including the ones in the field today.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(dataset): enhance dataset root directory handling and introduce hub cache support
- Updated DatasetConfig and LeRobotDatasetMetadata to clarify root directory behavior and introduce a dedicated hub cache for downloads.
- Refactored LeRobotDataset and StreamingLeRobotDataset to utilize the new hub cache and improve directory management.
- Added tests to ensure correct behavior when using the hub cache and handling different revisions without a specified root directory.
* refactor(dataset): improve root directory handling in LeRobotDataset
- Updated LeRobotDataset to store the requested root path separately from the actual root path.
- Adjusted metadata loading to use the requested root, enhancing clarity and consistency in directory management.
* refactor(dataset): minor improvements for hub cache support
* chore(datasets): guard in resume + assertion test
---------
Co-authored-by: AdilZouitine <adilzouitinegm@gmail.com>
Co-authored-by: mickaelChen <mickael.chen.levinson@gmail.com>
* refactor(dataset): split reader and writer
* chore(dataset): remove proxies
* refactor(dataset): better reader & writer encapsulation
* refactor(datasets): clean API + reduce leaky implementations
* refactor(dataset): API cleaning for writer, reader and meta
* refactor(dataset): expose writer & reader + other minor improvements
* refactor(dataset): improve teardown routine
* refactor(dataset): add hf_dataset property at the facade level
* chore(dataset): add init for dataset module
* docs(dataset): add docstrings for public API of the dataset classes
* tests(dataset): add tests for new classes
* fix(dataset): remove circular dependency