From 82dffde7fad11cba91f7916b050fbe7d7eea35ab Mon Sep 17 00:00:00 2001
From: Pepijn <138571049+pkooij@users.noreply.github.com>
Date: Thu, 7 May 2026 13:37:16 +0200
Subject: [PATCH] fix(ci): speed up multi-task benchmark evals (parallelize + cap VLABench steps) (#3529)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* fix(ci): run multi-task benchmark evals 5-at-a-time in parallel

The eval script supports running tasks concurrently via a
ThreadPoolExecutor (env.max_parallel_tasks). Apply it to the four
multi-task benchmark CI jobs (RoboTwin, RoboCasa, RoboMME, LIBERO-plus;
8-10 tasks/task_ids each) so they finish in ~2 waves of 5 instead of
running sequentially. Single-task jobs (Libero, MetaWorld, RoboCerebra)
are unchanged.

* fix(ci): cap VLABench smoke eval at 50 steps per task

VLABench's default episode_length is 500 steps; with 10 tasks at ~1 it/s
the smoke eval took ~80 minutes of rollouts on top of the image build.
The eval is a pipeline smoke test (running_success_rate stays at 0% on
this short rollout anyway), so we don't need full episodes: cap each
task at 50 steps to bring total rollout time down ~10x.

* fix(ci): run VLABench tasks 5-at-a-time in parallel

The eval script already supports running multiple tasks concurrently via
a ThreadPoolExecutor (env.max_parallel_tasks). Set it to 5 so the 10
VLABench tasks finish in ~2 waves instead of running sequentially.
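
For intuition, the arithmetic behind the step cap and the "~2 waves of 5"
can be sketched in Python. This is illustrative only and is not the eval
script's code: `estimate_rollout_minutes` and `run_tasks` are hypothetical
helpers, and the estimate assumes ~1 it/s per task with perfect parallel
scaling, which threads sharing one CI runner will not fully achieve.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def estimate_rollout_minutes(n_tasks, steps_per_task, its_per_sec=1.0, max_parallel=1):
    """Idealized wall-clock estimate: tasks run in waves of `max_parallel`."""
    waves = math.ceil(n_tasks / max_parallel)
    return waves * steps_per_task / its_per_sec / 60.0


def run_tasks(task_names, run_one, max_parallel_tasks=5):
    """Call run_one(name) per task, with at most `max_parallel_tasks` in flight."""
    with ThreadPoolExecutor(max_workers=max_parallel_tasks) as pool:
        # pool.map preserves input order, so results line up with task_names.
        return dict(zip(task_names, pool.map(run_one, task_names)))


if __name__ == "__main__":
    # 10 VLABench tasks, sequential, default 500 steps each
    print(round(estimate_rollout_minutes(10, 500), 1))                 # 83.3
    # same tasks capped at 50 steps
    print(round(estimate_rollout_minutes(10, 50), 1))                  # 8.3
    # capped and run 5 at a time (2 waves), assuming ideal scaling
    print(round(estimate_rollout_minutes(10, 50, max_parallel=5), 1))  # 1.7
```

The first two figures match the ~80 minutes and ~10x reduction quoted
above; the third is an upper bound on what the parallel setting can save.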
---
 .github/workflows/benchmark_tests.yml | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/.github/workflows/benchmark_tests.yml b/.github/workflows/benchmark_tests.yml
index b07c8f8da..b82c59a8b 100644
--- a/.github/workflows/benchmark_tests.yml
+++ b/.github/workflows/benchmark_tests.yml
@@ -382,6 +382,7 @@ jobs:
  --policy.path=\"\$ROBOTWIN_POLICY\" \
  --env.type=robotwin \
  --env.task=\"\$ROBOTWIN_TASKS\" \
+ --env.max_parallel_tasks=5 \
  --eval.batch_size=1 \
  --eval.n_episodes=1 \
  --eval.use_async_envs=false \
@@ -482,6 +483,7 @@ jobs:
  --policy.path=lerobot/smolvla_robocasa \
  --env.type=robocasa \
  --env.task=CloseFridge,OpenCabinet,OpenDrawer,TurnOnMicrowave,TurnOffStove,CloseToasterOvenDoor,SlideDishwasherRack,TurnOnSinkFaucet,NavigateKitchen,TurnOnElectricKettle \
+ --env.max_parallel_tasks=5 \
  --eval.batch_size=1 \
  --eval.n_episodes=1 \
  --eval.use_async_envs=false \
@@ -693,6 +695,7 @@ jobs:
  --env.task=\"\$ROBOMME_TASKS\" \
  --env.dataset_split=test \
  --env.task_ids=[0] \
+ --env.max_parallel_tasks=5 \
  --eval.batch_size=1 \
  --eval.n_episodes=1 \
  --eval.use_async_envs=false \
@@ -800,6 +803,7 @@ jobs:
  --env.type=libero_plus \
  --env.task=\"\$LIBERO_PLUS_SUITE\" \
  --env.task_ids=\"\$LIBERO_PLUS_TASK_IDS\" \
+ --env.max_parallel_tasks=5 \
  --eval.batch_size=1 \
  --eval.n_episodes=1 \
  --eval.use_async_envs=false \
@@ -900,6 +904,8 @@ jobs:
  --policy.path=lerobot/smolvla_vlabench \
  --env.type=vlabench \
  --env.task=select_fruit,select_toy,select_book,select_painting,select_drink,select_ingredient,select_billiards,select_poker,add_condiment,insert_flower \
+ --env.episode_length=50 \
+ --env.max_parallel_tasks=5 \
  --eval.batch_size=1 \
  --eval.n_episodes=1 \
  --eval.use_async_envs=false \