Remove Qwen2VL, Qwen3VL, Qwen35VL in favor of one QwenVL class that
uses AutoModelForImageTextToText and works with the whole Qwen VL
family. Move the shared _parse_skills_response to BaseVLM and extract
_build_messages/_prepare_inputs/_decode helpers to reduce duplication.
Made-with: Cursor
The Qwen3VLProcessor distributes kwargs to sub-processors via
_merge_kwargs. Flat kwargs like video_metadata and do_sample_frames
were not reaching the video processor, causing fps to default to 24
and producing shape mismatches.
Pass these kwargs explicitly under videos_kwargs so they reach
Qwen3VLVideoProcessor directly. Revert Qwen2VL to its simpler
original approach since its processor doesn't use videos_kwargs.
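
A minimal sketch of the nesting described above. The helper name
build_processor_kwargs and the text/videos/return_tensors arguments are
assumptions for illustration; only videos_kwargs, video_metadata and
do_sample_frames come from this change.

```python
from typing import Any, Optional


def build_processor_kwargs(
    text: list[str],
    videos: list[Any],
    video_metadata: Optional[list[dict]] = None,
) -> dict[str, Any]:
    """Assemble processor keyword arguments (hypothetical helper).

    Video-specific options are grouped under "videos_kwargs" instead of
    being passed flat, so they are routed to the video processor.
    """
    video_opts: dict[str, Any] = {"do_sample_frames": False}
    if video_metadata is not None:
        video_opts["video_metadata"] = video_metadata
    return {
        "text": text,
        "videos": videos,
        "return_tensors": "pt",
        "videos_kwargs": video_opts,
    }
```

The result would then be splatted into the call, for example
processor(**build_processor_kwargs(prompts, videos, metadata)).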
Made-with: Cursor
The Qwen3.5 processor requires video_metadata as a separate parameter,
not embedded in the video tensors. Use return_video_metadata=True from
process_vision_info, then unpack the (tensor, metadata) tuples into
separate videos and video_metadata lists for the processor call.
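
The unpacking step can be sketched as follows. The function name
split_video_metadata is hypothetical, and the input is assumed to be a
list of (tensor, metadata) pairs, or None when the request has no videos.

```python
from typing import Any, Optional


def split_video_metadata(
    video_inputs: Optional[list[tuple[Any, Any]]],
) -> tuple[Optional[list[Any]], Optional[list[Any]]]:
    """Split (video, metadata) pairs into two parallel lists.

    Returns (None, None) when there are no videos, so the caller can
    forward both values to the processor unchanged.
    """
    if not video_inputs:
        return None, None
    videos: list[Any] = []
    metadata: list[Any] = []
    for video, meta in video_inputs:
        videos.append(video)
        metadata.append(meta)
    return videos, metadata
```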
Made-with: Cursor
The return_video_metadata=True approach causes 'list index out of range'
due to (tensor, metadata) tuple format issues. Since all videos are
extracted at 1 fps (ffmpeg -r 1), pass fps=1.0 directly as a scalar
alongside do_sample_frames=False. This gives the processor the exact
fps for position embedding computation and avoids format compatibility
issues across Qwen processor versions.
Made-with: Cursor
The Qwen3.5 processor needs video_metadata (fps, frame indices) to
compute temporal position embeddings. Use return_video_metadata=True,
which returns each video as a (tensor, metadata) tuple, together with
return_video_kwargs=True, which returns {'do_sample_frames': False}
without the problematic fps list.
Made-with: Cursor
The Qwen3.5 processor expects fps as a scalar, not a list, so passing
video_kwargs with fps=[...] fails validation. Since process_vision_info
already handles frame sampling, we only need do_sample_frames=False to
tell the processor to use the pre-sampled frames as-is.
Made-with: Cursor
The Qwen processor needs fps metadata (via return_video_kwargs=True)
to compute correct temporal position embeddings. Without it, the
processor defaults to fps=24 regardless of the actual video fps,
causing shape mismatches between expected and actual video tokens.
Made-with: Cursor
Replace manual cv2 frame reading with the FORCE_QWENVL_VIDEO_READER=torchvision
env var. The torchvision backend (PyAV) properly reads video metadata and
respects the fps parameter, avoiding the torchcodec fps=24 default issue.
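
A minimal sketch of selecting the backend from Python. Setting the
variable before qwen_vl_utils is imported is an assumption made here to
be safe; the variable name and value are the ones used in this change.

```python
import os

# Select the torchvision (PyAV) video reader. This is set before any
# qwen_vl_utils import so the backend choice is visible on first use.
os.environ["FORCE_QWENVL_VIDEO_READER"] = "torchvision"

# The import would follow here, after the variable is set:
# from qwen_vl_utils import process_vision_info
```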
Made-with: Cursor
When torchcodec is installed, qwen-vl-utils ignores the fps parameter
and defaults to 24fps if video metadata is missing, causing shape
mismatches. Fix by reading video frames directly as PIL images and
passing them to the processor, bypassing torchcodec entirely.
Made-with: Cursor
The single-episode `segment_skills` method was missing
`enable_thinking=False` in `apply_chat_template`, causing the model to
output reasoning traces instead of JSON when the batch path fails and
falls back to per-episode processing.
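
One way to keep the batch and single-episode paths from drifting apart
again is to share the template arguments. This is a sketch only:
CHAT_TEMPLATE_KWARGS and render_prompt are hypothetical names, and the
tokenize/add_generation_prompt values are assumptions; only
`enable_thinking=False` and `apply_chat_template` come from this fix.

```python
from typing import Any

# Shared by the batch path and the per-episode fallback (hypothetical).
CHAT_TEMPLATE_KWARGS: dict[str, Any] = {
    "tokenize": False,
    "add_generation_prompt": True,
    "enable_thinking": False,  # request JSON output, not reasoning traces
}


def render_prompt(processor: Any, messages: list[dict]) -> str:
    """Render a prompt with the same template arguments on every path."""
    return processor.apply_chat_template(messages, **CHAT_TEMPLATE_KWARGS)
```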
Made-with: Cursor