examples/benchmark/bench_pi052_step.slurm

#!/bin/bash
#SBATCH --job-name=bench-pi052-attn
#SBATCH --partition=hopper-prod
#SBATCH --qos=high
#SBATCH --time=00:30:00
#SBATCH --ntasks=1
#SBATCH --gpus-per-task=1
#SBATCH --output=/fsx/pepijn/logs/bench_pi052_%j.out

set -euo pipefail

cd "${LEROBOT_ROOT:-$HOME/lerobot}"

export PATH="$HOME/miniconda3/bin:$HOME/.local/bin:$PATH"
export LD_LIBRARY_PATH="$HOME/miniconda3/lib:${LD_LIBRARY_PATH:-}"
export PYTORCH_CUDA_ALLOC_CONF="${PYTORCH_CUDA_ALLOC_CONF:-expandable_segments:True}"

echo "=== Node: $(hostname) ==="
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader

python -c "import torch; print('torch', torch.__version__, 'cuda', torch.version.cuda)"

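# Run one benchmark configuration. `|| true` keeps the sweep going under
# `set -e` when a configuration fails (for example, out of memory at a
# larger batch size).
run() {
    echo
    echo "--- $* ---"
    python examples/benchmark/bench_pi052_step.py "$@" || true
}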

# Attention parity benchmark — same shapes, different attention kernel.
run --attn eager --batch-size 8
run --attn sdpa  --batch-size 8

# Headroom benchmark — does SDPA's memory cut allow a bigger micro-batch?
run --attn sdpa  --batch-size 12
run --attn sdpa  --batch-size 16
run --attn sdpa  --batch-size 24
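
The `run` wrapper above uses `|| true` so that one failing configuration does not stop the sweep under `set -e`, but it also discards the exit status. A minimal sketch of a variant that records which configurations failed is shown below. It assumes a stand-in command `fake_bench` (hypothetical, used here in place of the Python benchmark so the snippet runs anywhere) and a `failed` array that the original script does not have.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical stand-in for the benchmark: fails for batch sizes above 16,
# imitating an out-of-memory exit.
fake_bench() {
    local bs="$1"
    if [ "$bs" -gt 16 ]; then
        echo "batch size $bs: simulated OOM" >&2
        return 1
    fi
    echo "batch size $bs: ok"
}

failed=()

# Run one configuration; record a failure instead of aborting the sweep.
# A command tested by `if` does not trigger `set -e`.
run() {
    echo "--- $* ---"
    if ! fake_bench "$@"; then
        failed+=("$*")
    fi
}

run 8
run 16
run 24

echo "failed configurations: ${#failed[@]}"
```

Printing the `failed` list at the end of the job log would make it easier to tell a skipped configuration apart from one that was never launched.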
pi052: SDPA attention port + selective AC + bench harness

Replaces the per-layer ``modeling_gemma.eager_attention_forward`` call with ``torch.nn.functional.scaled_dot_product_attention`` in ``compute_layer_complete`` (pi05) and ``_compute_layer_ki`` (pi052). PyTorch SDPA picks the memory-efficient kernel for the block-bidirectional 4D additive mask the dual-expert model uses (FA2 / FA3 reject it because they only accept causal / sliding-window / varlen patterns). The shared ``sdpa_attention_forward`` helper mirrors the eager signature so the call sites are unchanged.

Selective AC: removes the redundant outer ``_apply_checkpoint(forward_func, ...)`` wrap in ``PI05Pytorch.forward``. Per-layer checkpointing inside ``PaliGemmaWithExpertModel.forward`` already handles activation recompute; the outer wrap was double-recomputing the whole backbone. +14% steps/sec on its own (job 22161405 vs 22161398, 1xH100).

groot: drop ``@strict`` on ``GR00TN15Config``. Newer ``huggingface_hub`` rejects ``@strict`` on non-dataclass ``PretrainedConfig`` subclasses, which was blocking imports of any sibling policy through ``lerobot.policies.factory``.

New ``examples/benchmark/bench_pi052_step.py`` (+ slurm sweeps v1..v8) times PI052Policy.forward+backward (optionally with AdamW) on synthetic inputs.

Headline numbers on 1xH100 with KI=True, GC=True, L=512, 4.14 B trainable params, AdamW state in bf16:

    pre-SDPA eager BS=8             610ms   19.5 GiB -> 13.1 samples/s
    sdpa BS=8  + compile=default    413ms   19.5 GiB -> 19.3 samples/s
    sdpa BS=16 + compile=default    715ms   37.3 GiB -> 22.4 samples/s
    sdpa BS=32 + compile=default   1325ms   44.8 GiB -> 24.2 samples/s
    sdpa BS=40 + compile=default   1665ms   48.6 GiB -> 24.0 samples/s

Parity tests in ``tests/policies/pi052/test_pi052_sdpa_attention.py`` cover fp32 / bf16 / GQA / MHA forward + backward; output and grads match the eager path within bf16 tolerance.

Also ships ``examples/benchmark/fsdp_pi052.yaml`` (FSDP2 accelerate config wrapping GemmaDecoderLayer + SiglipEncoderLayer) for the follow-up multi-GPU memory sharding work.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-25 21:59:20 +00:00
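
The throughput figures in the headline numbers follow from batch size divided by step time. A minimal sketch of that arithmetic is shown below; `samples_per_sec` is a hypothetical helper written for illustration and is not part of `bench_pi052_step.py`. Recomputing from the rounded step times matches the listed values to within 0.1 samples/s (the BS=8 sdpa row comes out at 19.4 rather than 19.3, which is consistent with the step time having been rounded before it was reported).

```shell
#!/bin/bash
set -euo pipefail

# samples_per_sec BATCH_SIZE STEP_MS
# Prints throughput in samples/s with one decimal place.
# Hypothetical helper for illustration only.
samples_per_sec() {
    awk -v bs="$1" -v ms="$2" 'BEGIN { printf "%.1f\n", bs / (ms / 1000) }'
}

# (batch size, step time in ms) pairs copied from the headline numbers.
for pair in "8 610" "8 413" "16 715" "32 1325" "40 1665"; do
    read -r bs ms <<< "$pair"
    printf 'BS=%-3s %5s ms -> %s samples/s\n' "$bs" "$ms" "$(samples_per_sec "$bs" "$ms")"
done
```

Read this way, the numbers show throughput flattening between BS=32 and BS=40 even though memory use keeps growing.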