lerobot-clone

mirror of https://github.com/huggingface/lerobot.git synced 2026-06-01 11:21:27 +00:00

Author	SHA1	Message	Date
Pepijn	63fad12e8d	refactor: consolidate VLM classes into single QwenVL implementation Remove Qwen2VL, Qwen3VL, Qwen35VL in favor of one QwenVL class that uses AutoModelForImageTextToText and works with the whole Qwen VL family. Moves shared _parse_skills_response to BaseVLM and extracts _build_messages/_prepare_inputs/_decode helpers to reduce duplication. Made-with: Cursor	2026-03-30 20:37:09 +02:00
Pepijn	2545f1a8ed	fix: route video_metadata through videos_kwargs for Qwen3/3.5 processors The Qwen3VLProcessor distributes kwargs to sub-processors via _merge_kwargs. Flat kwargs like video_metadata and do_sample_frames were not reaching the video processor, causing fps to default to 24 and producing shape mismatches. Pass these kwargs explicitly under videos_kwargs so they reach Qwen3VLVideoProcessor directly. Revert Qwen2VL to its simpler original approach since its processor doesn't use videos_kwargs. Made-with: Cursor	2026-03-30 19:09:46 +02:00
Pepijn	5f85b572d7	fix: unpack video_metadata from tuples and pass separately to processor The Qwen3.5 processor requires video_metadata as a separate parameter, not embedded in the video tensors. Use return_video_metadata=True from process_vision_info, then unpack the (tensor, metadata) tuples into separate videos and video_metadata lists for the processor call. Made-with: Cursor	2026-03-30 17:37:59 +02:00
Pepijn	72692525da	fix: pass fps=1.0 scalar to processor instead of video_metadata tuples The return_video_metadata=True approach causes 'list index out of range' due to (tensor, metadata) tuple format issues. Since all extracted videos are at 1fps (ffmpeg -r 1), directly pass fps=1.0 as a scalar alongside do_sample_frames=False — this gives the processor the exact fps for position embedding computation without format compatibility issues across Qwen processor versions. Made-with: Cursor	2026-03-30 17:32:30 +02:00
Pepijn	9a298524ca	fix: pass video_metadata via process_vision_info for correct position embeddings The Qwen3.5 processor needs video_metadata (fps, frame indices) to compute temporal position embeddings. Use return_video_metadata=True which embeds metadata inside the video tensors as (tensor, metadata) tuples, and return_video_kwargs=True which returns {'do_sample_frames': False} without the problematic fps list. Made-with: Cursor	2026-03-30 17:23:44 +02:00
Pepijn	002a9dd0b9	fix: use do_sample_frames=False instead of video_kwargs fps list The Qwen3.5 processor expects fps as a scalar, not a list, so passing video_kwargs with fps=[...] fails validation. Since process_vision_info already handles frame sampling, we only need do_sample_frames=False to tell the processor to use the pre-sampled frames as-is. Made-with: Cursor	2026-03-30 16:55:46 +02:00
Pepijn	e40985b013	fix: pass video_kwargs from process_vision_info to Qwen processor The Qwen processor needs fps metadata (via return_video_kwargs=True) to compute correct temporal position embeddings. Without it, the processor defaults to fps=24 regardless of the actual video fps, causing shape mismatches between expected and actual video tokens. Made-with: Cursor	2026-03-30 16:50:34 +02:00
Pepijn	d03200bdb3	fix: force torchvision video backend instead of cv2 bypass Replace manual cv2 frame reading with FORCE_QWENVL_VIDEO_READER=torchvision env var. The torchvision backend (PyAV) properly reads video metadata and respects the fps parameter, avoiding the torchcodec fps=24 default issue. Made-with: Cursor	2026-03-30 16:42:52 +02:00
Pepijn	ac41cd6672	fix: bypass torchcodec video decoding by pre-reading frames via cv2 When torchcodec is installed, qwen-vl-utils ignores the fps parameter and defaults to 24fps if video metadata is missing, causing shape mismatches. Fix by reading video frames directly as PIL images and passing them to the processor, bypassing torchcodec entirely. Made-with: Cursor	2026-03-30 16:03:26 +02:00
Pepijn	9b211a45d6	fix: disable thinking mode in Qwen35VL single-episode fallback path The single-episode `segment_skills` method was missing `enable_thinking=False` in `apply_chat_template`, causing the model to output reasoning traces instead of JSON when the batch path fails and falls back to per-episode processing. Made-with: Cursor	2026-03-30 15:31:18 +02:00
root	a6387da464	add license	2026-03-11 23:14:22 +00:00
Jade Choghari	0328b3f4aa	Update src/lerobot/data_processing/data_annotations/vlm_annotations.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Jade Choghari <chogharijade@gmail.com>	2026-03-11 16:10:37 -07:00
root	819c1b9710	add tests/fixes	2026-03-11 22:49:06 +00:00
root	f0848c6887	add subtask	2026-03-11 19:51:48 +00:00

14 Commits
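
The video-metadata fixes in the log above (5f85b572d7 and 2545f1a8ed) describe two steps: unpacking (tensor, metadata) tuples into separate lists, and passing video_metadata and do_sample_frames under videos_kwargs rather than as flat kwargs. The following is a minimal sketch of that shape only, not the code from the commits. The helper names, the plain-string stand-ins for frame tensors, and the metadata dictionary keys are all hypothetical, and the real processor call is deliberately not shown.

```python
from typing import Any


def split_videos_and_metadata(video_inputs: list[Any]) -> tuple[list[Any], list[Any]]:
    """Split [(frames, metadata), ...] into two parallel lists.

    Hypothetical helper: items that are not 2-tuples are treated as bare
    frames with no metadata, so both input shapes are tolerated.
    """
    videos: list[Any] = []
    metadata: list[Any] = []
    for item in video_inputs:
        if isinstance(item, tuple) and len(item) == 2:
            frames, meta = item
        else:
            frames, meta = item, None
        videos.append(frames)
        metadata.append(meta)
    return videos, metadata


def build_processor_kwargs(video_inputs: list[Any]) -> dict[str, Any]:
    """Nest the video-only options under "videos_kwargs".

    The nesting mirrors what commit 2545f1a8ed describes; the exact
    keyword layout accepted by a given processor version is an assumption.
    """
    videos, metadata = split_videos_and_metadata(video_inputs)
    return {
        "videos": videos,
        "videos_kwargs": {
            "video_metadata": metadata,
            # Frames were already sampled upstream, so ask the processor
            # not to resample them.
            "do_sample_frames": False,
        },
    }


if __name__ == "__main__":
    # Placeholder strings stand in for frame tensors.
    example = [("frames_a", {"fps": 1.0}), ("frames_b", {"fps": 1.0})]
    print(build_processor_kwargs(example))
```

Keeping the split in its own function makes it easy to check in isolation that the videos and metadata lists stay aligned, which matters because the log attributes the earlier "list index out of range" failure to tuple-format issues.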