Mirror of https://github.com/huggingface/lerobot.git (synced 2026-06-04 21:01:26 +00:00)
add benchmark
@@ -32,8 +32,7 @@ Input should be the current image or whole video and the task goal specified in
 Architecture:
 _ inputs: video o1:T (or current o1:t), language z;
 _ DINO v3 ViT-B/16 (86M params): https://huggingface.co/facebook/dinov3-vitb16-pretrain-lvd1689m for vision encoding
-_ sentence-transformers/all-MiniLM-L12-v2: https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2 for text encoding
-\* Temporal module: small causal transformer ("cross-modal sequential aggregator"), with first-frame positional embedding (to avoid position cheating), frame-dropout, and stride sampling; outputs per-timestep logits.
+\_ sentence-transformers/all-MiniLM-L12-v2: https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2 for text encoding \* Temporal module: small causal transformer ("cross-modal sequential aggregator"), with first-frame positional embedding (to avoid position cheating), frame-dropout, and stride sampling; outputs per-timestep logits.

 Loss: See this chatgpt thread: https://chatgpt.com/s/t_68999a50a0b081919abc365cdd205e01

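Note on the temporal module described in this hunk: a minimal sketch of the stride sampling and frame-dropout steps, working on frame indices only. The function name `sample_frame_indices`, its parameters, and the choice to always keep the first sampled frame (so the first-frame positional embedding has an anchor) are assumptions for illustration, not code from the repository.

```python
import random


def sample_frame_indices(num_frames, stride, drop_prob, rng):
    """Pick frame indices with a fixed stride, then randomly drop some.

    Hypothetical helper: the name, signature, and the "always keep the
    first frame" rule are assumptions, not the repository's API.
    """
    if num_frames <= 0 or stride <= 0:
        raise ValueError("num_frames and stride must be positive")
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")

    # Stride sampling: keep every `stride`-th frame, starting at frame 0.
    strided = list(range(0, num_frames, stride))

    # Frame dropout: each later frame is dropped with probability drop_prob.
    kept = [strided[0]]
    for idx in strided[1:]:
        if rng.random() >= drop_prob:
            kept.append(idx)
    return kept


rng = random.Random(0)
indices = sample_frame_indices(num_frames=32, stride=4, drop_prob=0.25, rng=rng)
print(indices)
```

The surviving indices stay in temporal order, so a causal transformer can consume the corresponding frame embeddings directly.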
@@ -56,11 +55,13 @@ _ Epic-Kitchens-100
 _ Something-Something v. 2 Dataset https://www.qualcomm.com/developer/software/something-something-v-2-dataset
 _ Ego4D (3000 hours)
 _ Open X-Embodiment (OXE)
-_ Age bot world: https://huggingface.co/datasets/agibot-world/AgiBotWorld-Alpha
-_ GTEA+ Gaze: https://cbs.ic.gatech.edu/fpv/
-_ YouCook2 dataset
-_ HOWTO100M: https://www.di.ens.fr/willow/research/howto100m/
+\_ Agi bot world: https://huggingface.co/datasets/agibot-world/AgiBotWorld-Alpha
+
+- GalexiAI dataset: https://opengalaxea.github.io/G0/
+_ GTEA+ Gaze: https://cbs.ic.gatech.edu/fpv/
+_ YouCook2 dataset
+\_ HOWTO100M: https://www.di.ens.fr/willow/research/howto100m/
 - Genie generated dataset?

 ### TODOs:

@@ -77,11 +78,10 @@ _ HOWTO100M: https://www.di.ens.fr/willow/research/howto100m/
 - Only rewind loss [x]
 - Exactly similar to: https://github.com/lucidrains/rewind-reward-pytorch/blob/main/rewind_reward_pytorch/rewind_reward.py#L11 [x]
 - Try DINO v2 as encoder Base 86 M: with https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2 [x]
-- benchmark lucidrains vs this implementation forward pass []
-- Test rewind (evaluate) []
-- Cleanup code? []
-- Convert python -m lerobot.datasets.v21.convert_dataset_v20_to_v21 --repo-id=IPEC-COMMUNITY/bc_z_lerobot and train on 1 percent
------------------
+- Test rewind (evaluate) [x]
+- Cleanup code? []
+- benchmark lucidrains vs this implementation forward pass, debug speed []
+- Convert python -m lerobot.datasets.v21.convert_dataset_v20_to_v21 --repo-id=IPEC-COMMUNITY/bc_z_lerobot and train on 1 percent
 - Then on 10 percent
 - Ablation dino v2 vs dino v3 base 86 M
 - Add more artificial text to dataset generated by vlm (google gemini) []
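Note on the "benchmark ... forward pass, debug speed" TODO in this hunk: one way to compare two forward passes is a small timing harness with warmup runs and a median over repeats. This is a sketch under stated assumptions: `benchmark`, `forward_a`, and `forward_b` are hypothetical names, and the two stand-in callables do not call either real implementation.

```python
import statistics
import time


def benchmark(fn, *, warmup=3, repeats=20):
    """Time a zero-argument callable and return summary statistics.

    Hypothetical helper for illustration. For models running on a GPU,
    the callable itself should synchronize the device (for example with
    torch.cuda.synchronize()) before returning, otherwise the clock only
    measures kernel launch time.
    """
    # Warmup runs absorb one-off costs such as lazy initialization.
    for _ in range(warmup):
        fn()

    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)

    return {
        "median_s": statistics.median(timings),
        "min_s": min(timings),
        "repeats": repeats,
    }


# Stand-ins for the two forward passes being compared (assumptions).
def forward_a():
    return sum(i * i for i in range(20_000))


def forward_b():
    return sum(i * i for i in range(10_000))


result_a = benchmark(forward_a)
result_b = benchmark(forward_b)
print(f"a: {result_a['median_s']:.6f}s  b: {result_b['median_s']:.6f}s")
```

Reporting the median rather than the mean keeps a single slow run (for example a garbage-collection pause) from dominating the comparison.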
@@ -90,4 +90,5 @@ _ HOWTO100M: https://www.di.ens.fr/willow/research/howto100m/
 - How can we improve spatial aware learning? solve issue of Contrastive learning and position
 - Extend evaluation []
-- Add other datasets from OXE metioned in rewind []
-- Ablation for size vision encoder, language encoder, temporal head
+- Ablation for size vision encoder, language encoder, temporal head []
+- Add other datasets mentioned here []
+
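Note on the rewind items in the TODO list: assuming "rewind" here means the augmentation where a clip plays forward to some frame and then plays backward, with progress targets that rise and then fall, the index bookkeeping could look like the sketch below. The function name `rewind_clip`, the uniform choice of the turning point, and the linear progress targets are all assumptions, not the repository's or lucidrains' implementation.

```python
import random


def rewind_clip(num_frames, rng):
    """Return (frame_indices, progress_targets) for one rewound clip.

    Hypothetical helper: plays frames 0..peak forward, then steps back
    `back` frames. Progress is the frame index scaled to [0, 1].
    """
    if num_frames < 2:
        raise ValueError("need at least 2 frames to rewind")

    peak = rng.randint(1, num_frames - 1)  # last frame of the forward part
    back = rng.randint(1, peak)            # how many frames to rewind

    forward = list(range(0, peak + 1))
    rewound = list(range(peak - 1, peak - back - 1, -1))
    indices = forward + rewound

    # Linear progress: rises during the forward part, falls while rewinding.
    targets = [i / (num_frames - 1) for i in indices]
    return indices, targets


rng = random.Random(0)
indices, targets = rewind_clip(num_frames=10, rng=rng)
print(indices)
print([round(t, 2) for t in targets])
```

The decreasing targets on the rewound segment give the model examples of a task moving away from completion, which forward-only demonstrations do not contain.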