Some prompts/models occasionally return pure prose with no JSON object
even on retry. Returning None (and logging a preview) lets the pipeline
skip that one VLM call cleanly instead of aborting the whole episode.
The modules already check for None / non-dict results and degrade
gracefully (no row emitted from that call).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Models often wrap JSON in prose or <think>...</think> blocks. Strip the
think tags first, then try direct json.loads, then fall back to scanning
for the first balanced {...} substring (ignoring braces inside strings).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fall back to huggingface_hub.get_token() when HF_TOKEN/HUGGINGFACE_API_KEY
env vars aren't set. That picks up the token cached by
'huggingface-cli login' so users don't need to export it on every shell.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Setting --vlm.use_hf_inference_providers=true routes requests through
https://router.huggingface.co/v1 using HF_TOKEN as the API key, and
disables auto_serve so no local server is spawned. Combine with a
provider-pinned model id like 'Qwen/Qwen3-VL-30B-A3B-Instruct:novita'
or any plain model id to let HF route.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
transformers serve returns HTTP 422 'Unexpected fields' when
mm_processor_kwargs is in extra_body — that field is vllm-specific.
Drop it by default; opt in via LEROBOT_OPENAI_SEND_MM_KWARGS=1 when
talking to vllm serve.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes for video_url with transformers serve:
- fps must be in extra_body.mm_processor_kwargs, not in the content
block; otherwise the server discards it as unknown kwargs.
- file:// URLs aren't fetched by transformers serve. Read the local mp4
and inline it as a base64 data:video/mp4 URL so the server sees the
bytes directly.
Both surface as std::bad_alloc on the server side when wrong, which is
unhelpful but explains what we hit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
transformers serve rescans the HF cache on every /v1/models request
which exceeds the 2s urllib timeout, leaving the probe loop spinning
even after Uvicorn is fully up. Watch the streamed server output for
'Uvicorn running' / 'Application startup complete' instead.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous logger-based output never appeared, leaving users in the
dark when auto_serve silently no-op'd. Switch to print(flush=True) so
the spawn decision is unmistakable, and stream the server's stdout to
the parent terminal in real-time on a background thread so model-load
progress and errors surface immediately.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Default auto_serve to True so lerobot-annotate can drive the entire
flow with one command. Probe api_base/models first — if a server is
already reachable (user started one manually, or it's a remote
endpoint), skip the spawn.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Setting --vlm.auto_serve=true with --vlm.backend=openai makes the CLI
launch 'transformers serve <model_id> --port <serve_port>
--continuous-batching' as a child process, poll /v1/models until ready
(up to serve_ready_timeout_s), run the pipeline, then SIGINT the
server on process exit.
Override the spawn command with --vlm.serve_command='vllm serve ...'
or any OpenAI-compatible launcher.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Module 1 can now send the episode's actual mp4 file as a video_url
content block instead of pre-decoded frames. The server (transformers
serve / vllm serve / ktransformers serve) handles frame sampling at
the configured fps. Default fps=1 (one frame per second is enough for
subtask-boundary detection on manipulation episodes).
A per-episode subclip is extracted to <root>/.annotate_staging/.video_clips/
via ffmpeg stream-copy (no re-encode) so the model sees only this
episode's frames, not the whole shard.
Enable with --module_1.use_video_url=true (and --vlm.backend=openai).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a third backend that talks to any OpenAI-compatible server. This
unblocks Qwen3.6 (and other models) that work in transformers serve /
ktransformers but not in vllm 0.10.2's fallback path:
- launch the server out-of-process (transformers serve, vllm serve,
ktransformers serve)
- point lerobot-annotate at it via --vlm.backend=openai
--vlm.api_base=http://localhost:8000/v1 --vlm.model_id=...
Image and video blocks are converted to OpenAI image_url/video_url
data URLs automatically.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vllm.generate() expects a string/TextPrompt; passing message dicts
fails. vllm.chat() applies the chat template and extracts image/video
blocks automatically, which is what we need for VL models.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vllm 0.10.2 expects guided_decoding to be a GuidedDecodingParams object,
not a dict. Different vllm versions differ here. The parser already has
a one-retry JSON-recovery path, so drop guided decoding entirely for
portability.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Loading Qwen3-VL via transformers + accelerate's device_map='auto'
fails with std::bad_alloc on hosts with abundant RAM. The bug is in
accelerate's post-load dispatch path. Bypassing accelerate by loading
to CPU first and then calling .to('cuda') manually avoids that path.
LEROBOT_TRANSFORMERS_DEVICE_MAP=auto switches back to the old behavior
for cases where it works.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cuDNN 9.x + torch 2.8 has a regression where the conv3d kernel used in
Qwen-VL vision tower patch embedders fails with
CUDNN_STATUS_NOT_INITIALIZED. The crash is independent of model size
and reproduces on both Qwen2.5-VL and Qwen3-VL because both use 3D conv
for video patch embedding.
Setting LEROBOT_DISABLE_CUDNN=1 falls back to native PyTorch conv3d
kernels (slower but functional) so the pipeline can run while the
torch/cuDNN stack is still on the broken combo.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Large VL models (Qwen3-VL-30B-A3B BF16) take ~58 GB of an 80 GB H100,
leaving only ~22 GB for KV cache + cuDNN workspace. The vision tower's
3D conv then fails with CUDNN_STATUS_NOT_INITIALIZED because cuDNN
can't grab a workspace large enough.
- vlm.gpu_memory_utilization (default 0.9) — drop to 0.7 when the vision
encoder needs more cuDNN workspace.
- vlm.max_model_len — cap context to free KV cache memory; the 262k
default for Qwen3 is wildly more than annotation prompts need.
- vlm.trust_remote_code — already plumbed; now also passed to LLM().
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Required for many newer VL checkpoints (Qwen3.x FP8 in particular) that
ship custom loader code in their repo. Without it, the FP8
weight_scale_inv parameters never bind to FP8Linear modules and the
post-load dispatch path bad-allocs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The std::bad_alloc we hit on Qwen3-line VL models is not a real OOM —
it triggers in the post-load tensor-placement path even on hosts with
2 TB RAM. low_cpu_mem_usage=True bypasses the offending intermediate
staging buffer and is the standard accelerate workaround.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Without device_map, transformers stages the full FP8 checkpoint in CPU
RAM before any GPU placement, OOMing the host on 27B+ models even when
the GPU has enough VRAM. device_map='auto' streams shards directly to
GPU memory.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Newer transformers versions renamed/removed AutoModelForVision2Seq in
favour of AutoModelForImageTextToText for VL models. Try the new name
first and fall back gracefully so the transformers backend works on
both transformers 4.45-4.5x and 5.x.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the visual-grounding gap flagged after the initial PR review:
modules now decode actual camera frames at the relevant timestamps and
attach them as `{"type":"image", "image":<PIL>}` content blocks to the
VLM prompts.
- New `frames.py`:
- `FrameProvider` Protocol; `VideoFrameProvider` decodes from the
dataset's first `observation.images.*` stream via
`LeRobotDatasetMetadata.get_video_file_path` and
`decode_video_frames`, with the same `from_timestamp` shift the main
dataset uses.
- Per-process LRU cache so co-timestamped Module 1 plan-update + Module
2 calls share decode work.
- `make_frame_provider` falls back to a null provider when the dataset
has no video tracks → text-only prompts (graceful absence).
- Modules 1/2/3 take an optional `frame_provider` (default null) and
prepend image blocks before the text block.
- Module 1 attaches `keyframes_per_episode` keyframes to the subtask
decomposition prompt.
- Module 2 attaches the frame at the interjection timestamp.
- Module 3 attaches the exact emission frame to each VQA pair.
- VlmConfig: backend now defaults to `vllm`; default model is
`Qwen/Qwen3.6-27B-FP8`. New knobs: `--vlm.tensor_parallel_size`,
`--vlm.camera_key` (override the keyframe stream).
- `_make_vllm_client` honours `tensor_parallel_size` so 27B-FP8 sharded
on 2× GPUs works out of the box.
- `test_module3_attaches_frame_image_block_to_prompt` asserts modules
emit one image block per VQA prompt at the exact emission timestamp.
- Docs: example switched to `imstevenpmwork/super_poulain_draft` +
Qwen3.6-27B-FP8 + tensor_parallel_size=2; documents the keyframe
attachment behaviour and the no-video fallback.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>