14 Commits

Author SHA1 Message Date
Pepijn
63fad12e8d refactor: consolidate VLM classes into single QwenVL implementation
Remove Qwen2VL, Qwen3VL, Qwen35VL in favor of one QwenVL class that
uses AutoModelForImageTextToText and works with the whole Qwen VL
family. Moves shared _parse_skills_response to BaseVLM and extracts
_build_messages/_prepare_inputs/_decode helpers to reduce duplication.

Made-with: Cursor
2026-03-30 20:37:09 +02:00
Pepijn
2545f1a8ed fix: route video_metadata through videos_kwargs for Qwen3/3.5 processors
The Qwen3VLProcessor distributes kwargs to sub-processors via
_merge_kwargs. Flat kwargs like video_metadata and do_sample_frames
were not reaching the video processor, causing fps to default to 24
and producing shape mismatches.

Pass these kwargs explicitly under videos_kwargs so they reach
Qwen3VLVideoProcessor directly. Revert Qwen2VL to its simpler
original approach since its processor doesn't use videos_kwargs.

Made-with: Cursor
2026-03-30 19:09:46 +02:00
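The routing described in 2545f1a8ed can be sketched as below. This is a minimal illustration, not the repository's code: the helper name `build_video_processor_kwargs` and the placeholder metadata dict are hypothetical, and nothing here calls the real Qwen3VLProcessor. It only shows the difference between the flat kwargs the commit says were lost and the nested `videos_kwargs` form the commit says reaches the video processor.

```python
from typing import Any


def build_video_processor_kwargs(video_metadata: list[Any], nested: bool) -> dict[str, Any]:
    """Build the video-related keyword arguments for a processor call.

    Hypothetical helper. With nested=True the options are grouped under
    "videos_kwargs" (the form the commit message describes for Qwen3/3.5);
    with nested=False they are returned flat.
    """
    video_options: dict[str, Any] = {
        "video_metadata": video_metadata,
        "do_sample_frames": False,  # frames are assumed to be pre-sampled
    }
    if nested:
        return {"videos_kwargs": video_options}
    return video_options
```

A caller would spread the result into the processor call, for example `processor(text=..., videos=..., **build_video_processor_kwargs(meta, nested=True))`, where `processor` and `meta` stand for objects created elsewhere.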
Pepijn
5f85b572d7 fix: unpack video_metadata from tuples and pass separately to processor
The Qwen3.5 processor requires video_metadata as a separate parameter,
not embedded in the video tensors. Use return_video_metadata=True from
process_vision_info, then unpack the (tensor, metadata) tuples into
separate videos and video_metadata lists for the processor call.

Made-with: Cursor
2026-03-30 17:37:59 +02:00
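The unpacking step in 5f85b572d7 amounts to splitting a list of (tensor, metadata) pairs into two parallel lists. A minimal sketch, assuming each item really is a two-element tuple as the commit message states; the function name `split_videos_and_metadata` is hypothetical, and strings stand in for tensors so the example runs without any ML dependencies.

```python
from typing import Any, Sequence


def split_videos_and_metadata(
    pairs: Sequence[tuple[Any, Any]],
) -> tuple[list[Any], list[Any]]:
    """Split (video, metadata) pairs into separate, index-aligned lists."""
    videos: list[Any] = []
    video_metadata: list[Any] = []
    for video, metadata in pairs:
        videos.append(video)
        video_metadata.append(metadata)
    return videos, video_metadata
```

An explicit loop is used instead of `zip(*pairs)` so that an empty input yields two empty lists rather than raising on unpacking.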
Pepijn
72692525da fix: pass fps=1.0 scalar to processor instead of video_metadata tuples
The return_video_metadata=True approach causes 'list index out of range'
due to (tensor, metadata) tuple format issues. Since all extracted
videos are at 1fps (ffmpeg -r 1), directly pass fps=1.0 as a scalar
alongside do_sample_frames=False — this gives the processor the exact
fps for position embedding computation without format compatibility
issues across Qwen processor versions.

Made-with: Cursor
2026-03-30 17:32:30 +02:00
Pepijn
9a298524ca fix: pass video_metadata via process_vision_info for correct position embeddings
The Qwen3.5 processor needs video_metadata (fps, frame indices) to
compute temporal position embeddings. Use return_video_metadata=True
which embeds metadata inside the video tensors as (tensor, metadata)
tuples, and return_video_kwargs=True which returns {'do_sample_frames':
False} without the problematic fps list.

Made-with: Cursor
2026-03-30 17:23:44 +02:00
Pepijn
002a9dd0b9 fix: use do_sample_frames=False instead of video_kwargs fps list
The Qwen3.5 processor expects fps as a scalar, not a list, so passing
video_kwargs with fps=[...] fails validation. Since process_vision_info
already handles frame sampling, we only need do_sample_frames=False to
tell the processor to use the pre-sampled frames as-is.

Made-with: Cursor
2026-03-30 16:55:46 +02:00
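One way to picture the change in 002a9dd0b9: drop the list-valued `fps` entry and keep only `do_sample_frames=False`. The sketch below is an assumption about how that could be expressed, not the committed code; `sanitize_video_kwargs` is a hypothetical name.

```python
from typing import Any


def sanitize_video_kwargs(video_kwargs: dict[str, Any]) -> dict[str, Any]:
    """Return a copy without "fps" and with frame sampling disabled.

    Hypothetical helper: the commit message says a list-valued fps fails
    validation, and that pre-sampled frames should be used as-is.
    """
    cleaned = {key: value for key, value in video_kwargs.items() if key != "fps"}
    cleaned["do_sample_frames"] = False
    return cleaned
```

The input dict is left unmodified, so the original `video_kwargs` can still be inspected when debugging.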
Pepijn
e40985b013 fix: pass video_kwargs from process_vision_info to Qwen processor
The Qwen processor needs fps metadata (via return_video_kwargs=True)
to compute correct temporal position embeddings. Without it, the
processor defaults to fps=24 regardless of the actual video fps,
causing shape mismatches between expected and actual video tokens.

Made-with: Cursor
2026-03-30 16:50:34 +02:00
Pepijn
d03200bdb3 fix: force torchvision video backend instead of cv2 bypass
Replace manual cv2 frame reading with FORCE_QWENVL_VIDEO_READER=torchvision
env var. The torchvision backend (PyAV) properly reads video metadata and
respects the fps parameter, avoiding the torchcodec fps=24 default issue.

Made-with: Cursor
2026-03-30 16:42:52 +02:00
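The environment variable named in d03200bdb3 can be set from Python as sketched here. The wrapper function `force_torchvision_reader` is hypothetical; the sketch also assumes the variable is read when qwen-vl-utils first selects a video backend, so it should be set before that library is imported or used.

```python
import os


def force_torchvision_reader() -> str:
    """Select the torchvision video backend via the environment variable
    named in the commit message, and return the value now in effect."""
    os.environ["FORCE_QWENVL_VIDEO_READER"] = "torchvision"
    return os.environ["FORCE_QWENVL_VIDEO_READER"]


# Call this first; import qwen_vl_utils only afterwards (assumption: the
# variable is consulted when the backend is first chosen).
force_torchvision_reader()
```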
Pepijn
ac41cd6672 fix: bypass torchcodec video decoding by pre-reading frames via cv2
When torchcodec is installed, qwen-vl-utils ignores the fps parameter
and defaults to 24fps if video metadata is missing, causing shape
mismatches. Fix by reading video frames directly as PIL images and
passing them to the processor, bypassing torchcodec entirely.

Made-with: Cursor
2026-03-30 16:03:26 +02:00
Pepijn
9b211a45d6 fix: disable thinking mode in Qwen35VL single-episode fallback path
The single-episode `segment_skills` method was missing
`enable_thinking=False` in `apply_chat_template`, causing the model to
output reasoning traces instead of JSON when the batch path fails and
falls back to per-episode processing.

Made-with: Cursor
2026-03-30 15:31:18 +02:00
root
a6387da464 add license
2026-03-11 23:14:22 +00:00
Jade Choghari
0328b3f4aa Update src/lerobot/data_processing/data_annotations/vlm_annotations.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Jade Choghari <chogharijade@gmail.com>
2026-03-11 16:10:37 -07:00
root
819c1b9710 add tests/fixes
2026-03-11 22:49:06 +00:00
root
f0848c6887 add subtask
2026-03-11 19:51:48 +00:00