Files
lerobot-clone/src/lerobot
Pepijn 3174e14bc0 fix(smolvla2): feed all cameras to VQA generation, not just the chosen one
handle_vqa_query filtered the observation down to the single chosen
camera before calling the VLM. But training feeds every camera: the
ask_vqa_* recipes' image blocks are stripped before tokenization and
the frames reach the model via OBS_IMAGES_*, where embed_prefix
consumes all config.image_features regardless of the per-camera recipe
tag. Filtering to one camera changed the image-token count in the
prefix (the dropped camera zero-padded with mask=0) — a prefix shape
the model never saw at training.

Now the full observation is passed to select_message; the chosen
camera is used only to pick which frame the bbox/point overlay is
drawn on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:46:38 +02:00
..