annotate: drop local in-process VLM backends — HF Jobs (openai) only for now

The shipped workflow is Hugging Face Jobs (examples/annotations/run_hf_job.py): it serves the model with vLLM in the vllm/vllm-openai image and the pipeline talks to it over the OpenAI-compatible API. The in-process vllm / transformers local backends added surface area (and the vllm one pinned an old torch) without being part of that path, so they're removed for now.

* vlm_client.make_vlm_client: keep only backend='openai' (+ 'stub' rejected with the usual guidance). Requesting 'vllm'/'transformers' now raises a clear 'not supported for now — use the HF Jobs flow' error. Removed _make_vllm_client and _make_transformers_client.
* config: backend docstring updated (openai-only); default model_id bumped to Qwen/Qwen3.6-27B to match run_hf_job.
* docs/annotation_pipeline.mdx: remove the '## Running locally' section; the launcher description now says one vLLM server per GPU over the OpenAI API, and the 'One Qwen-VL pass' note drops the 'vLLM/transformers fallback' wording.

Tests are unaffected (they construct StubVlmClient directly; nothing referenced the removed backends).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
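
For reviewers, the backend dispatch described above can be sketched roughly as follows. This is an illustrative approximation, not the code in this commit: `VlmConfig`, `OpenAICompatibleClient`, the `base_url` field and its default, the exception type, and the exact error wording are all assumptions made for the example. Only the backend names, the default `model_id`, and the fact that tests build `StubVlmClient` directly come from the message.

```python
from dataclasses import dataclass

# Backends named in the commit message as removed from the factory.
REMOVED_BACKENDS = ("vllm", "transformers")


@dataclass
class VlmConfig:
    """Hypothetical stand-in for the real config object."""

    backend: str = "openai"
    model_id: str = "Qwen/Qwen3.6-27B"  # default named in the commit message
    base_url: str = "http://localhost:8000/v1"  # assumed vLLM server address


@dataclass
class OpenAICompatibleClient:
    """Hypothetical client that would talk to the job's vLLM server."""

    base_url: str
    model_id: str


def make_vlm_client(config: VlmConfig) -> OpenAICompatibleClient:
    """Return a client for the only supported backend, else raise."""
    if config.backend == "openai":
        return OpenAICompatibleClient(config.base_url, config.model_id)
    if config.backend in REMOVED_BACKENDS:
        # Error wording is illustrative; the real message may differ.
        raise ValueError(
            f"backend={config.backend!r} is not supported for now; "
            "use the HF Jobs flow (examples/annotations/run_hf_job.py)"
        )
    if config.backend == "stub":
        # Assumed guidance: tests construct StubVlmClient directly.
        raise ValueError(
            "backend='stub' is not built by the factory; "
            "construct StubVlmClient directly"
        )
    raise ValueError(f"unknown backend {config.backend!r}")
```

Under these assumptions, `make_vlm_client(VlmConfig())` returns the OpenAI-compatible client, while `make_vlm_client(VlmConfig(backend="vllm"))` fails fast with a message pointing at the HF Jobs launcher.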
2026-06-04 12:51:27 +00:00 · 2026-06-03 16:28:40 +02:00
parent a18d969753
commit b9a0187335
3 changed files with 32 additions and 158 deletions
--- a/docs/source/annotation_pipeline.mdx
+++ b/docs/source/annotation_pipeline.mdx
@@ -48,38 +48,6 @@ anything already there. Implementations of those tools live under
 `src/lerobot/tools/`; one file per tool, registered via
 `TOOL_REGISTRY`. See the [Tools](./tools) doc for the authoring guide.

-## Running locally
-
-Install the extra and invoke the console script. Episode-level
-concurrency comes from `--executor.episode_parallelism` (default 16);
-that is the only knob the in-process executor exposes.
-
-```bash
-uv sync --extra annotations
-uv run lerobot-annotate \
-  --root=/path/to/dataset \
-  --vlm.model_id=Qwen/Qwen2.5-VL-7B-Instruct
-```
-
-The pipeline attaches actual camera footage to every `plan` /
-`interjections` / `vqa` prompt by default, decoded from the dataset's
-first `observation.images.*` stream. Override with
-`--vlm.camera_key=observation.images.<name>` to pin a specific
-viewpoint. Datasets with no video tracks fall back to text-only prompts
-automatically.
-
-**The `plan` module sees the whole episode as one video block.** Subtask
-decomposition gets a `{"type":"video", "video":[<frames>]}` block
-covering the entire demonstration; Qwen-VL pools temporally on its own
-and decides where to cut. There is no keyframe stride or count knob —
-`--plan.max_video_frames` (default 128) only caps the frames packed
-into the video block as a model-capacity bound. The `interjections`
-module attaches a short window of frames straddling the interjection
-timestamp. The `vqa` module grounds each VQA pair on a single frame —
-its `--vqa.K` knob sets how many consecutive frames each emission tick
-anchors, and every anchored frame gets its own VQA pair on that one
-frame (there is no per-pair frame window).
-
 ## Running on Hugging Face Jobs

 Distributed annotation is delegated to
@@ -91,10 +59,11 @@ HF_TOKEN=hf_... uv run python examples/annotations/run_hf_job.py
 ```

 [`examples/annotations/run_hf_job.py`](https://github.com/huggingface/lerobot/blob/main/examples/annotations/run_hf_job.py)
-spawns one `h200x2` job that:
+spawns a multi-GPU `h200` job that:

 1. installs the branch under test plus the annotation extras,
-2. boots two vllm servers (one per GPU) for the chosen model,
+2. boots one vLLM server per GPU (in the `vllm/vllm-openai` image) for the
+   chosen model, which the pipeline drives over the OpenAI-compatible API,
 3. runs the `plan` / `interjections` / `vqa` modules across the dataset
   via `lerobot-annotate`,
 4. uploads the annotated dataset to `--push_to_hub`.
@@ -126,9 +95,9 @@ Two things drive the scope:
   speech) only appear on the exact frame whose timestamp matches the
   emission. The pipeline writes timestamps taken straight from the
   source parquet — no floating-point recomputation.
-2. **One Qwen-VL pass.** All three modules share a single VLM client
-   (vLLM if available, transformers fallback) so the cost is one model
-   load per dataset, not three.
+2. **One Qwen-VL pass.** All three modules share a single VLM client (the
+   OpenAI-compatible client talking to the job's vLLM server) so the cost
+   is one model load per dataset, not three.

 ## Module independence and staged reruns