lerobot-clone

mirror of https://github.com/huggingface/lerobot.git synced 2026-06-02 20:01:25 +00:00

Author	SHA1	Message	Date
Pepijn	f1e3ab7794	fix(annotate): don't crash pipeline on persistent JSON parse failure Some prompts/models occasionally return pure prose with no JSON object even on retry. Returning None (and logging a preview) lets the pipeline skip that one VLM call cleanly instead of aborting the whole episode. The modules already check for None / non-dict results and degrade gracefully (no row emitted from that call). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:48:35 +02:00
Pepijn	585341ba9f	fix(annotate): robust JSON extraction (think tags + first balanced object) Models often wrap JSON in prose or <think>...</think> blocks. Strip the think tags first, then try direct json.loads, then fall back to scanning for the first balanced {...} substring (ignoring braces inside strings). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:48:35 +02:00
Pepijn	23ff346027	fix(annotate): stream child stdout char-by-char so tqdm \r progress flushes	2026-04-30 18:48:35 +02:00
Pepijn	c06c8d594a	feat(annotate): use cached HF token from huggingface-cli login Fall back to huggingface_hub.get_token() when HF_TOKEN/HUGGINGFACE_API_KEY env vars aren't set. That picks up the token cached by 'huggingface-cli login' so users don't need to export it on every shell. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:48:35 +02:00
Pepijn	c99ac45cd1	feat(annotate): one-flag HF Inference Providers backend Setting --vlm.use_hf_inference_providers=true routes requests through https://router.huggingface.co/v1 using HF_TOKEN as the API key, and disables auto_serve so no local server is spawned. Combine with a provider-pinned model id like 'Qwen/Qwen3-VL-30B-A3B-Instruct:novita' or any plain model id to let HF route. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:48:34 +02:00
Pepijn	13aaafeae0	fix(annotate): omit mm_processor_kwargs by default; transformers serve rejects it transformers serve returns HTTP 422 'Unexpected fields' when mm_processor_kwargs is in extra_body — that field is vllm-specific. Drop it by default; opt in via LEROBOT_OPENAI_SEND_MM_KWARGS=1 when talking to vllm serve. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:48:34 +02:00
Pepijn	2129648bf4	fix(annotate): mm_processor_kwargs in extra_body; inline file URLs as data URLs Two fixes for video_url with transformers serve: - fps must be in extra_body.mm_processor_kwargs, not in the content block; otherwise the server discards it as unknown kwargs. - file:// URLs aren't fetched by transformers serve. Read the local mp4 and inline it as a base64 data:video/mp4 URL so the server sees the bytes directly. Both surface as std::bad_alloc on the server side when wrong, which is unhelpful but explains what we hit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:48:34 +02:00
Pepijn	f5cd3f6e4e	fix(annotate): detect server ready via stdout banner, not /v1/models polls transformers serve rescans the HF cache on every /v1/models request which exceeds the 2s urllib timeout, leaving the probe loop spinning even after Uvicorn is fully up. Watch the streamed server output for 'Uvicorn running' / 'Application startup complete' instead. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:48:34 +02:00
Pepijn	ecf5766301	fix(annotate): visible auto_serve via stdout prints + live server log stream The previous logger-based output never appeared, leaving users in the dark when auto_serve silently no-op'd. Switch to print(flush=True) so the spawn decision is unmistakable, and stream the server's stdout to the parent terminal in real-time on a background thread so model-load progress and errors surface immediately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:48:34 +02:00
Pepijn	11597d4f71	fix(annotate): auto_serve defaults to True; probe before spawning Default auto_serve to True so lerobot-annotate can drive the entire flow with one command. Probe api_base/models first — if a server is already reachable (user started one manually, or it's a remote endpoint), skip the spawn. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:48:34 +02:00
Pepijn	8b9c598cf4	feat(annotate): auto_serve mode spawns and tears down inference server Setting --vlm.auto_serve=true with --vlm.backend=openai makes the CLI launch 'transformers serve <model_id> --port <serve_port> --continuous-batching' as a child process, poll /v1/models until ready (up to serve_ready_timeout_s), run the pipeline, then SIGINT the server on process exit. Override the spawn command with --vlm.serve_command='vllm serve ...' or any OpenAI-compatible launcher. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:48:34 +02:00
Pepijn	b325475b38	feat(annotate): video_url block for openai backend Module 1 can now send the episode's actual mp4 file as a video_url content block instead of pre-decoded frames. The server (transformers serve / vllm serve / ktransformers serve) handles frame sampling at the configured fps. Default fps=1 (one frame per second is enough for subtask-boundary detection on manipulation episodes). A per-episode subclip is extracted to <root>/.annotate_staging/.video_clips/ via ffmpeg stream-copy (no re-encode) so the model sees only this episode's frames, not the whole shard. Enable with --module_1.use_video_url=true (and --vlm.backend=openai). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:48:34 +02:00
Pepijn	ef137ff86a	feat(annotate): openai-compatible backend for transformers/ktransformers serve Adds a third backend that talks to any OpenAI-compatible server. This unblocks Qwen3.6 (and other models) that work in transformers serve / ktransformers but not in vllm 0.10.2's fallback path: - launch the server out-of-process (transformers serve, vllm serve, ktransformers serve) - point lerobot-annotate at it via --vlm.backend=openai --vlm.api_base=http://localhost:8000/v1 --vlm.model_id=... Image and video blocks are converted to OpenAI image_url/video_url data URLs automatically. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:48:34 +02:00
Pepijn	c5df821a96	fix(annotate): use vllm.chat() API for multimodal prompts vllm.generate() expects a string/TextPrompt; passing message dicts fails. vllm.chat() applies the chat template and extracts image/video blocks automatically, which is what we need for VL models. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:48:34 +02:00
Pepijn	7ec3d7999c	fix(annotate): drop guided_decoding=dict (api differs across vllm) vllm 0.10.2 expects guided_decoding to be a GuidedDecodingParams object, not a dict. Different vllm versions differ here. The parser already has a one-retry JSON-recovery path, so drop guided decoding entirely for portability. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:48:34 +02:00
Pepijn	e240305e8e	fix(annotate): default transformers backend to manual GPU placement Loading Qwen3-VL via transformers + accelerate's device_map='auto' fails with std::bad_alloc on hosts with abundant RAM. The bug is in accelerate's post-load dispatch path. Bypassing accelerate by loading to CPU first and then calling .to('cuda') manually avoids that path. LEROBOT_TRANSFORMERS_DEVICE_MAP=auto switches back to the old behavior for cases where it works. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:48:34 +02:00
Pepijn	ccd189b264	fix(annotate): LEROBOT_DISABLE_CUDNN escape hatch for conv3d crash cuDNN 9.x + torch 2.8 has a regression where the conv3d kernel used in Qwen-VL vision tower patch embedders fails with CUDNN_STATUS_NOT_INITIALIZED. The crash is independent of model size and reproduces on both Qwen2.5-VL and Qwen3-VL because both use 3D conv for video patch embedding. Setting LEROBOT_DISABLE_CUDNN=1 falls back to native PyTorch conv3d kernels (slower but functional) so the pipeline can run while the torch/cuDNN stack is still on the broken combo. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:48:34 +02:00
Pepijn	ef1242bbd4	fix(annotate): expose gpu_memory_utilization and max_model_len for vllm Large VL models (Qwen3-VL-30B-A3B BF16) take ~58 GB of an 80 GB H100, leaving only ~22 GB for KV cache + cuDNN workspace. The vision tower's 3D conv then fails with CUDNN_STATUS_NOT_INITIALIZED because cuDNN can't grab a workspace large enough. - vlm.gpu_memory_utilization (default 0.9) — drop to 0.7 when the vision encoder needs more cuDNN workspace. - vlm.max_model_len — cap context to free KV cache memory; the 262k default for Qwen3 is wildly more than annotation prompts need. - vlm.trust_remote_code — already plumbed; now also passed to LLM(). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:48:33 +02:00
Pepijn	ebf4a04d41	fix(annotate): pass trust_remote_code=True to HF auto-classes Required for many newer VL checkpoints (Qwen3.x FP8 in particular) that ship custom loader code in their repo. Without it, the FP8 weight_scale_inv parameters never bind to FP8Linear modules and the post-load dispatch path bad-allocs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:48:33 +02:00
Pepijn	4419b4ef1b	fix(annotate): low_cpu_mem_usage=True on transformers load path The std::bad_alloc we hit on Qwen3-line VL models is not a real OOM — it triggers in the post-load tensor-placement path even on hosts with 2 TB RAM. low_cpu_mem_usage=True bypasses the offending intermediate staging buffer and is the standard accelerate workaround. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:48:33 +02:00
Pepijn	ff06ca82d2	fix(annotate): use device_map='auto' for transformers backend Without device_map, transformers stages the full FP8 checkpoint in CPU RAM before any GPU placement, OOMing the host on 27B+ models even when the GPU has enough VRAM. device_map='auto' streams shards directly to GPU memory. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:48:33 +02:00
Pepijn	fcb01e73eb	fix(annotate): try AutoModelForImageTextToText first, fall back to AutoModelForVision2Seq Newer transformers versions renamed/removed AutoModelForVision2Seq in favour of AutoModelForImageTextToText for VL models. Try the new name first and fall back gracefully so the transformers backend works on both transformers 4.45-4.5x and 5.x. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:48:33 +02:00
Pepijn	9d6af804bf	feat(annotate): attach camera keyframes to module prompts; default to Qwen3.6-27B-FP8 Closes the visual-grounding gap flagged after the initial PR review: modules now decode actual camera frames at the relevant timestamps and attach them as `{"type":"image", "image":<PIL>}` content blocks to the VLM prompts. - New `frames.py`: - `FrameProvider` Protocol; `VideoFrameProvider` decodes from the dataset's first `observation.images.*` stream via `LeRobotDatasetMetadata.get_video_file_path` and `decode_video_frames`, with the same `from_timestamp` shift the main dataset uses. - Per-process LRU cache so co-timestamped Module 1 plan-update + Module 2 calls share decode work. - `make_frame_provider` falls back to a null provider when the dataset has no video tracks → text-only prompts (graceful absence). - Modules 1/2/3 take an optional `frame_provider` (default null) and prepend image blocks before the text block. - Module 1 attaches `keyframes_per_episode` keyframes to the subtask decomposition prompt. - Module 2 attaches the frame at the interjection timestamp. - Module 3 attaches the exact emission frame to each VQA pair. - VlmConfig: backend now defaults to `vllm`; default model is `Qwen/Qwen3.6-27B-FP8`. New knobs: `--vlm.tensor_parallel_size`, `--vlm.camera_key` (override the keyframe stream). - `_make_vllm_client` honours `tensor_parallel_size` so 27B-FP8 sharded on 2× GPUs works out of the box. - `test_module3_attaches_frame_image_block_to_prompt` asserts modules emit one image block per VQA prompt at the exact emission timestamp. - Docs: example switched to `imstevenpmwork/super_poulain_draft` + Qwen3.6-27B-FP8 + tensor_parallel_size=2; documents the keyframe attachment behaviour and the no-video fallback. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:48:33 +02:00
Pepijn	f763f85213	feat: language annotation pipeline (PR 2/3) Adds the steerable annotation pipeline (`lerobot-annotate`) that populates the `language_persistent` and `language_events` columns introduced in PR 1 directly into `data/chunk-/file-.parquet`. No flavor namespace, no sidecar tree. Modules produced: - Module 1 (plan_subtasks_memory): Pi0.7-style subtasks, plan (init + refresh on interjection), MEM-style memory at subtask boundaries. - Module 2 (interjections_and_speech): t=0 speech-only acknowledgement, mid-episode paired interjection + speech tool-call atom. - Module 3 (general_vqa): bbox/keypoint/count/attribute/spatial pairs at configurable cadence with one-retry JSON validation. Writer enforces: per-episode persistent identity, exact-frame event timestamps, column routing per `column_for_style`, dataset-level `tools` column with the `say` schema, drops legacy `subtask_index`. Validator runs against staged JSONL artifacts before the writer rewrites parquet. Adds `lerobot-annotate` console script, `annotations` extra (datatrove + optional vllm), `make annotation-e2e` opt-in smoke target, and `docs/source/annotation_pipeline.mdx`. Branched from PR 1 (`feat/language-columns`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:48:33 +02:00
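
Illustration for commits 585341ba9f and f1e3ab7794: the messages describe stripping <think>...</think> blocks, trying json.loads directly, then scanning for the first balanced {...} substring while ignoring braces inside strings, and returning None when nothing parses. The sketch below is one possible reading of that description, not the repository's code. The function names `first_balanced_object` and `extract_json` are hypothetical, and treating any non-dict result as None is an assumption made to keep the example short.

```python
import json
import re

# Assumption: think blocks are plain <think>...</think> pairs with no nesting.
_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def first_balanced_object(text):
    """Return the first balanced {...} substring of text, or None.

    Braces that appear inside double-quoted strings are ignored, and
    backslash escapes inside strings are honoured.
    """
    start = text.find("{")
    while start != -1:
        depth = 0
        in_string = False
        escaped = False
        for i in range(start, len(text)):
            ch = text[i]
            if in_string:
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    return text[start : i + 1]
        # This opening brace never closed; try the next one.
        start = text.find("{", start + 1)
    return None


def extract_json(raw):
    """Hypothetical helper: return a dict parsed from raw model output, or None."""
    text = _THINK_RE.sub("", raw).strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        candidate = first_balanced_object(text)
        if candidate is None:
            return None
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            return None
    # Assumption: callers only want JSON objects, so other types become None.
    return parsed if isinstance(parsed, dict) else None
```

A caller can then skip a single VLM call when `extract_json(reply)` is None instead of aborting the episode, which is the behaviour f1e3ab7794 describes.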

24 Commits
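
Illustration for commit 2129648bf4: the message says file:// URLs are not fetched by transformers serve, so the local mp4 is read and inlined as a base64 data:video/mp4 URL. A minimal sketch of that step follows. The helper names are hypothetical, and the `{"type": "video_url", "video_url": {"url": ...}}` block shape is an assumption modelled on the OpenAI-style image_url block, not something the log confirms.

```python
import base64
from pathlib import Path


def video_file_to_data_url(path):
    """Read a local mp4 and return it as a base64 data URL."""
    data = Path(path).read_bytes()
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:video/mp4;base64,{encoded}"


def video_url_block(path):
    """Hypothetical content block carrying the inlined video.

    Assumption: the server accepts a video_url block shaped like the
    OpenAI image_url block.
    """
    return {"type": "video_url", "video_url": {"url": video_file_to_data_url(path)}}
```

Base64 grows the payload by about a third, so this approach suits the short per-episode subclips mentioned in b325475b38 better than whole shards.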