# SARM: Stage-Aware Reward Modeling

SARM (Stage-Aware Reward Modeling) is a video-based reward modeling framework for long-horizon robot manipulation tasks. This guide covers how to train SARM reward models and optionally use them with Reward-Aligned Behavior Cloning (RA-BC).

**Paper**: [SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation](https://arxiv.org/abs/2509.25358)

<img
  src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/lerobot-sarm.png"
  alt="An overview of SARM"
  width="80%"
/>

## Why Reward Models?

Standard behavior cloning treats all demonstration frames equally, but real-world robot datasets are messy: they contain hesitations, corrections, and variable-quality trajectories. Reward models address this by learning a generalizable notion of **task progress** from demonstrations: given video frames and a task description, they predict how close the robot is to completing the task (0→1). This learned "progress signal" can be used in multiple ways. Two promising applications are (1) **weighted imitation learning** (RA-BC), where high-progress frames receive more weight during policy training, and (2) **reinforcement learning**, where the reward model provides dense rewards for online or offline policy improvement.

## Overview

SARM has the following features:

1. **Stage-aware architecture**: Jointly predicts the high-level task stage and fine-grained progress within each stage
2. **Subtask annotations**: Uses natural language subtask annotations to derive consistent progress labels
3. **Temporal proportions**: Computes dataset-level priors (α̅\_k) for each subtask to normalize progress across variable-length demonstrations

SARM trains on a compact **stage+tau** target for each frame:

- **stage**: integer stage index `k ∈ {0, ..., K-1}`
- **τ (tau)**: within-stage progress `τ ∈ [0, 1]`
- **target encoding**: `y = k + τ` (this is what the dataset processor produces)

At inference time (and in downstream RA-BC), SARM converts the raw `k + τ` value into a **normalized progress** in `[0, 1]` using dataset-level **temporal proportions** `α̅_k` (stored in `meta/temporal_proportions_*.json`).

This matches **Formula (2)** from the paper:

```
progress_t = P_{k-1} + α̅_k × τ_t
```

Where:

- `τ_t = (t - s_k) / (e_k - s_k)` is within-subtask normalized time
- `P_{k-1}` is cumulative prior (sum of previous subtask proportions)
- `α̅_k` is the temporal proportion for subtask k

This ensures identical task states map to consistent progress values, even across demonstrations of different lengths.
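
To make the conversion concrete, here is a minimal Python sketch of Formula (2) applied to a raw `k + τ` value. The function name `normalized_progress` and the example proportions are assumptions made for illustration; the code uses a 0-based stage index, so the cumulative prior for stage `k` is the sum of the proportions of stages `0..k-1`.

```python
from itertools import accumulate


def normalized_progress(y: float, proportions: list[float]) -> float:
    """Map a raw stage+tau value y = k + tau to a progress value in [0, 1]."""
    num_stages = len(proportions)
    # Clamp so that y == K (last stage fully complete) maps to stage K-1, tau = 1.
    y = min(max(y, 0.0), float(num_stages))
    k = min(int(y), num_stages - 1)
    tau = y - k
    # cumulative[k] is the sum of the proportions of all stages before k.
    cumulative = [0.0, *accumulate(proportions)]
    return cumulative[k] + proportions[k] * tau


# Hypothetical temporal proportions for a two-stage task.
alphas = [0.25, 0.75]
print(normalized_progress(0.5, alphas))  # halfway through stage 0 -> 0.125
print(normalized_progress(1.5, alphas))  # halfway through stage 1 -> 0.625
```

Because the proportions are shared across the dataset, a frame halfway through stage 1 maps to the same progress value in a short episode and in a long one.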

## Inputs and Targets (What the new code expects)

SARM is trained through its processor (`src/lerobot/policies/sarm/processor_sarm.py`), which:

- **Encodes** images and task text with CLIP (ViT-B/32) into `video_features` and `text_features`
- **Pads/truncates** robot state into `state_features` (up to `max_state_dim`)
- **Builds targets** as `sparse_targets` (and `dense_targets` in `dense_only`/`dual`) using the stage+tau encoding `y = k + τ`
- **Masks rewind frames** using a per-sample `lengths` tensor (rewind is a training-time augmentation)

At minimum, each training sample needs:

- `task` (string): task description
- `policy.image_key` images and `policy.state_key` states from the dataset
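
As a rough illustration of the target construction and state padding described above, the sketch below builds `y = k + τ` from per-stage start and end frames and pads or truncates a state vector. The function names, the zero-padding value, and the clamping of frames that fall outside annotated spans are assumptions of this sketch, not the actual behavior of `processor_sarm.py`.

```python
def stage_tau_target(frame: int, starts: list[int], ends: list[int]) -> float:
    """Return y = k + tau for a frame, given per-stage start/end frame indices."""
    for k, (start, end) in enumerate(zip(starts, ends)):
        if frame < start:
            # Frame lies before stage k begins: clamp to the stage boundary (assumption).
            return float(k)
        if frame <= end:
            tau = (frame - start) / (end - start) if end > start else 1.0
            return k + tau
    # Frame lies after the last annotated stage: treat the task as complete (assumption).
    return float(len(starts))


def pad_state(state: list[float], max_state_dim: int) -> list[float]:
    """Zero-pad or truncate a state vector to exactly max_state_dim entries."""
    return (state + [0.0] * max_state_dim)[:max_state_dim]


# single_stage mode: one stage spanning the whole episode gives linear progress.
print(stage_tau_target(50, starts=[0], ends=[100]))        # 0.5
# Two annotated stages: frame 15 is halfway through stage 1.
print(stage_tau_target(15, starts=[0, 10], ends=[9, 20]))  # 1.5
print(pad_state([1.0, 2.0], max_state_dim=4))              # [1.0, 2.0, 0.0, 0.0]
```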

---

## Set Up Your Environment

1. Install LeRobot by following our [Installation Guide](./installation).
2. Install SARM dependencies by running:

```bash
pip install -e ".[sarm]"
```

---

## Annotation Modes

You can choose from **3 annotation modes** that determine how progress labels are computed:

| Mode           | Annotations Required | Heads                        | Use Case                                                     |
| -------------- | -------------------- | ---------------------------- | ------------------------------------------------------------ |
| `single_stage` | None                 | Sparse only                  | Simple tasks, quick experiments, no VLM needed               |
| `dense_only`   | Dense (VLM)          | Dual (sparse auto-generated) | Detailed subtask tracking without defining high-level stages |
| `dual`         | Sparse + Dense (VLM) | Dual                         | Full SARM paper setup with both granularities                |

### Mode Details

<hfoptions id="mode_explanation">
<hfoption id="single_stage">

**No annotations required.** The entire episode is treated as a single stage called `"task"`, and progress is linear from 0 to 1 over the episode duration.

- **Sparse head**: 1 stage ("task"), linear progress
- **Dense head**: Not used
- **Best for**: Simple tasks, quick experiments, or when VLM annotation is not available

Workflow:

```
1. Train SARM → 2. Visualize predictions → 3. (Optional) Train policy with RA-BC
```

</hfoption>
<hfoption id="dense_only">

**Only dense (fine-grained) annotations from a VLM.** The sparse head automatically uses a single `"task"` stage covering the full episode, while the dense head learns detailed subtask progression.

- **Sparse head**: 1 stage ("task"), linear progress (auto-generated)
- **Dense head**: Multiple fine-grained stages from VLM annotations
- **Best for**: When you want detailed subtask tracking but don't need to define high-level stages

Workflow:

```
1. Annotate (dense) → 2. Verify → 3. Train SARM → 4. Visualize → 5. (Optional) Train policy with RA-BC
```

</hfoption>
<hfoption id="dual">

**Both sparse and dense annotations from VLM.** Full dual-head mode as described in the SARM paper, with both high-level (sparse) and fine-grained (dense) stage predictions.

- **Sparse head**: High-level stages from VLM annotations
- **Dense head**: Fine-grained stages from VLM annotations
- **Best for**: Complex multi-stage tasks where both granularities are useful

Workflow:

```
1. Annotate (sparse+dense) → 2. Verify → 3. Train SARM → 4. Visualize → 5. (Optional) Train policy with RA-BC
```

</hfoption>
</hfoptions>

---

## Step 1: Subtask Annotation

<hfoptions id="annotation_mode">
<hfoption id="single_stage">

**No annotation required!** Skip this step entirely. The model will use the episode's task description and compute linear progress automatically.

</hfoption>
<hfoption id="dense_only">

Generate **dense (fine-grained) annotations only** using a VLM. The sparse stage will be auto-generated.

```bash
python src/lerobot/data_processing/sarm_annotations/subtask_annotation.py \
  --repo-id your-username/your-dataset \
  --dense-only \
  --dense-subtasks "Bring robot arms up from starting position,Grab near side and do 1st fold,Grab side and do 2nd fold,Grab side and do 3rd fold to finish folding" \
  --video-key observation.images.base \
  --num-workers 4 \
  --push-to-hub
```

**What gets saved:**

- `meta/temporal_proportions_sparse.json` - Auto-generated sparse proportions (`{"task": 1.0}`)
- `meta/temporal_proportions_dense.json` - Dense temporal proportions
- Per-episode columns in `episodes/*.parquet`:
  - `dense_subtask_names`, `dense_subtask_start_frames`, `dense_subtask_end_frames`
  - (also time-based columns: `dense_subtask_start_times`, `dense_subtask_end_times`)

</hfoption>
<hfoption id="dual">

Generate **both sparse (high-level) and dense (fine-grained) annotations** using a VLM.

```bash
python src/lerobot/data_processing/sarm_annotations/subtask_annotation.py \
  --repo-id your-username/your-dataset \
  --sparse-subtasks "Bring arms up from starting position,Fold the towel (3 folds in total)" \
  --dense-subtasks "Bring robot arms up from starting position,Grab near side and do 1st fold,Grab side and do 2nd fold,Grab side and do 3rd fold to finish folding" \
  --video-key observation.images.base \
  --num-workers 4 \
  --push-to-hub
```

**What gets saved:**

- `meta/temporal_proportions_sparse.json` - Sparse temporal proportions
- `meta/temporal_proportions_dense.json` - Dense temporal proportions
- Per-episode columns in `episodes/*.parquet`:
  - `sparse_subtask_names`, `sparse_subtask_start_frames`, `sparse_subtask_end_frames`
  - `dense_subtask_names`, `dense_subtask_start_frames`, `dense_subtask_end_frames`
  - (also time-based columns: `*_subtask_start_times`, `*_subtask_end_times`)

</hfoption>
</hfoptions>
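
For intuition about what the `temporal_proportions_*.json` files contain, the sketch below shows one way dataset-level proportions could be derived from per-episode subtask lengths: compute each subtask's share of its episode, average that share over episodes, and renormalize so the proportions sum to 1. This averaging scheme and the function name are assumptions of the sketch; the annotation script's exact computation may differ.

```python
def temporal_proportions(episodes: list[dict[str, int]]) -> dict[str, float]:
    """episodes: one dict per episode mapping subtask name -> length in frames."""
    names = list(episodes[0])
    means = {}
    for name in names:
        # Share of each episode spent in this subtask, averaged over episodes.
        shares = [ep.get(name, 0) / sum(ep.values()) for ep in episodes]
        means[name] = sum(shares) / len(shares)
    total = sum(means.values())
    # Renormalize so the proportions sum to 1.
    return {name: value / total for name, value in means.items()}


# Two hypothetical episodes with different pacing.
episodes = [
    {"reach": 20, "fold": 80},
    {"reach": 40, "fold": 60},
]
print(temporal_proportions(episodes))  # roughly {"reach": 0.3, "fold": 0.7}
```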

### Annotation Arguments

| Argument               | Description                                                                     |
| ---------------------- | ------------------------------------------------------------------------------- |
| `--repo-id`            | HuggingFace dataset repository ID                                               |
| `--sparse-subtasks`    | Comma-separated list of high-level subtask names                                |
| `--dense-subtasks`     | Comma-separated list of fine-grained subtask names                              |
| `--dense-only`         | Generate only dense annotations (auto-creates sparse "task" stage)              |
| `--video-key`          | Camera/video key to use (e.g., `observation.images.top`)                        |
| `--num-workers`        | Number of parallel GPU workers (default: 1)                                     |
| `--episodes`           | Specific episode indices to annotate (default: all)                             |
| `--skip-existing`      | Skip episodes that already have annotations                                     |
| `--model`              | VLM model (default: `Qwen/Qwen3-VL-30B-A3B-Instruct`)                           |
| `--num-visualizations` | Number of episodes to visualize after annotation (default: 5, set to 0 to skip) |

> **Note**: After annotation completes, 5 episodes are automatically visualized by default. Use `--num-visualizations 0` to skip this step.

---

## Step 2: Verify Annotations

<hfoptions id="verify_mode">
<hfoption id="single_stage">

**No verification needed!** Skip this step.

</hfoption>
<hfoption id="dense_only">

Visualize annotations using the `--visualize-only` flag:

```bash
python src/lerobot/data_processing/sarm_annotations/subtask_annotation.py \
  --repo-id your-username/your-dataset \
  --visualize-only \
  --visualize-type dense \
  --num-visualizations 5 \
  --video-key observation.images.base \
  --output-dir ./subtask_viz
```

</hfoption>
<hfoption id="dual">

Visualize annotations using the `--visualize-only` flag:

```bash
python src/lerobot/data_processing/sarm_annotations/subtask_annotation.py \
  --repo-id your-username/your-dataset \
  --visualize-only \
  --visualize-type both \
  --num-visualizations 5 \
  --video-key observation.images.base \
  --output-dir ./subtask_viz
```

</hfoption>
</hfoptions>

This generates visualizations showing video frames with subtask boundaries overlaid and a timeline of the subtasks.

### Visualization Arguments

| Argument               | Description                                                    |
| ---------------------- | -------------------------------------------------------------- |
| `--visualize-only`     | Only visualize existing annotations (no generation)            |
| `--num-visualizations` | Number of episodes to visualize (default: 5)                   |
| `--visualize-type`     | Type of annotations to visualize: `sparse`, `dense`, or `both` |

**Tip**: If annotations are inaccurate, adjust your subtask descriptions to be more specific and re-run.

---

## Step 3: Train SARM

<hfoptions id="train_mode">
<hfoption id="single_stage">

Train with **no annotations** - uses linear progress from 0 to 1:

```bash
python src/lerobot/scripts/lerobot_train.py \
  --dataset.repo_id=your-username/your-dataset \
  --policy.type=sarm \
  --policy.annotation_mode=single_stage \
  --policy.image_key=observation.images.base \
  --output_dir=outputs/train/sarm_single \
  --batch_size=32 \
  --steps=5000 \
  --wandb.enable=true \
  --wandb.project=sarm \
  --policy.repo_id=your-username/your-model-name
```

</hfoption>
<hfoption id="dense_only">

Train with **dense annotations only** (sparse auto-generated):

```bash
python src/lerobot/scripts/lerobot_train.py \
  --dataset.repo_id=your-username/your-dataset \
  --policy.type=sarm \
  --policy.annotation_mode=dense_only \
  --policy.image_key=observation.images.base \
  --output_dir=outputs/train/sarm_dense \
  --batch_size=32 \
  --steps=5000 \
  --wandb.enable=true \
  --wandb.project=sarm \
  --policy.repo_id=your-username/your-model-name
```

</hfoption>
<hfoption id="dual">

Train with **both sparse and dense annotations**:

```bash
python src/lerobot/scripts/lerobot_train.py \
  --dataset.repo_id=your-username/your-dataset \
  --policy.type=sarm \
  --policy.annotation_mode=dual \
  --policy.image_key=observation.images.base \
  --output_dir=outputs/train/sarm_dual \
  --batch_size=32 \
  --steps=5000 \
  --wandb.enable=true \
  --wandb.project=sarm \
  --policy.repo_id=your-username/your-model-name
```

</hfoption>
</hfoptions>

### Multi-GPU Training

To train on multiple GPUs, launch the training script with `accelerate launch --multi_gpu --num_processes=4` in place of `python`.

### Training Arguments

| Argument                   | Description                                                       | Default                  |
| -------------------------- | ----------------------------------------------------------------- | ------------------------ |
| `--policy.annotation_mode` | `single_stage`, `dense_only`, or `dual`                           | `single_stage`           |
| `--policy.image_key`       | Camera key for images                                             | `observation.images.top` |
| `--policy.state_key`       | Key for joint states                                              | `observation.state`      |
| `--policy.n_obs_steps`     | Observation history steps (total obs frames = `n_obs_steps + 1`)  | `8`                      |
| `--policy.frame_gap`       | Gap (in frames) between sampled observations (at 30 fps: 30 ≈ 1s) | `30`                     |
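
To see how `n_obs_steps` and `frame_gap` interact, the sketch below lists the frame indices of one observation window under one plausible reading: the window looks backward from the current frame in steps of `frame_gap`, and indices before the start of the episode are clamped to frame 0. Both the backward direction and the clamping are assumptions of this sketch, not a description of the actual sampler.

```python
def observation_indices(t: int, n_obs_steps: int = 8, frame_gap: int = 30) -> list[int]:
    """Frame indices for the n_obs_steps + 1 observations ending at frame t."""
    return [max(t - i * frame_gap, 0) for i in range(n_obs_steps, -1, -1)]


print(observation_indices(90, n_obs_steps=3, frame_gap=30))  # [0, 30, 60, 90]
print(observation_indices(45, n_obs_steps=3, frame_gap=30))  # [0, 0, 15, 45]
print(len(observation_indices(300)))                         # 9 frames with the defaults
```

With the defaults (`n_obs_steps=8`, `frame_gap=30`) and 30 fps video, a window under this reading spans about 8 seconds of history.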

---

## Step 4: Visualize Predictions

Use `compute_rabc_weights.py` with `--visualize-only` to visualize model predictions (and, if available, annotation-derived targets) without writing a parquet file.

<hfoptions id="viz_mode">
<hfoption id="single_stage">

```bash
python src/lerobot/policies/sarm/compute_rabc_weights.py \
  --dataset-repo-id your-username/your-dataset \
  --reward-model-path your-username/sarm-model \
  --visualize-only \
  --num-visualizations 5 \
  --head-mode sparse \
  --output-dir ./sarm_viz
```

</hfoption>
<hfoption id="dense_only">

```bash
python src/lerobot/policies/sarm/compute_rabc_weights.py \
  --dataset-repo-id your-username/your-dataset \
  --reward-model-path your-username/sarm-model \
  --visualize-only \
  --num-visualizations 5 \
  --head-mode dense \
  --output-dir ./sarm_viz
```

</hfoption>
<hfoption id="dual">

```bash
python src/lerobot/policies/sarm/compute_rabc_weights.py \
  --dataset-repo-id your-username/your-dataset \
  --reward-model-path your-username/sarm-model \
  --visualize-only \
  --num-visualizations 5 \
  --head-mode both \
  --output-dir ./sarm_viz
```

</hfoption>
</hfoptions>

The visualization shows:

- **Progress plot**: Predicted progress, plus annotation-derived ground truth (“GT”) when annotations are available and `--stride` is 1
- **Stage probabilities**: Stacked area plot of predicted stage probabilities
- **Sample frames**: Key frames from the episode with progress/stage labels

### Visualization Arguments

| Argument               | Description                                               |
| ---------------------- | --------------------------------------------------------- |
| `--visualize-only`     | Only visualize predictions (no RABC computation)          |
| `--num-visualizations` | Number of episodes to visualize (default: 5)              |
| `--head-mode`          | SARM head to use: `sparse`, `dense`, or `both`            |
| `--stride`             | Compute every N frames, interpolate the rest (default: 1) |
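
The effect of `--stride` can be pictured with a small sketch: predictions are computed only at every N-th frame and the frames in between are filled by interpolation. Linear interpolation, holding the last value past the final computed frame, and the function name are assumptions here; the script may interpolate differently.

```python
def interpolate_strided(values: list[float], stride: int, num_frames: int) -> list[float]:
    """values[j] is the prediction at frame j * stride; fill in the frames between."""
    out = []
    last = len(values) - 1
    for frame in range(num_frames):
        j, offset = divmod(frame, stride)
        left = values[min(j, last)]
        right = values[min(j + 1, last)]  # hold the last value past the end
        out.append(left + (right - left) * offset / stride)
    return out


# Predictions at frames 0, 2 and 4 expanded to all 5 frames.
print(interpolate_strided([0.0, 0.5, 1.0], stride=2, num_frames=5))
# [0.0, 0.25, 0.5, 0.75, 1.0]
```

A larger stride reduces inference cost roughly in proportion, at the price of smoothing over short hesitations between computed frames.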

---

## Step 5 (Optional): Train Policy with RA-BC

Reward-Aligned Behavior Cloning (RA-BC) uses the trained SARM model to weight training samples based on predicted progress improvement. This requires two steps:

1. **Precompute progress values** for all frames using the trained SARM model
2. **Train policy** with RA-BC weighting using the precomputed values

### How RA-BC Works

For each training sample, RA-BC computes the progress delta:

```
r_i = φ(o_{t+Δ}) - φ(o_t)
```

Where `φ` is the SARM progress prediction and `Δ` is the policy's `chunk_size`. Samples with positive progress (good demonstrations) get higher weights, while samples with negative or zero progress get down-weighted.

The weighting follows **Equations 8-9** from the paper:

- **Soft weight**: `w̃_i = clip((r_i − (μ − 2σ)) / (4σ + ε), 0, 1)`
- **Final weight**: `w_i = 𝟙{r_i > κ} + 𝟙{0 ≤ r_i ≤ κ} × w̃_i`
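
The two equations can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions: `μ` and `σ` are taken as the mean and population standard deviation of the deltas passed in, `ε` is set to `1e-6`, and the function name is hypothetical. The training code may compute these statistics over a different set of samples.

```python
from statistics import fmean, pstdev


def rabc_weights(deltas: list[float], kappa: float = 0.01, eps: float = 1e-6) -> list[float]:
    """Per-sample RA-BC weights from progress deltas r_i."""
    mu = fmean(deltas)
    sigma = pstdev(deltas)
    weights = []
    for r in deltas:
        # Soft weight: rescale r from [mu - 2*sigma, mu + 2*sigma] to [0, 1].
        soft = min(max((r - (mu - 2 * sigma)) / (4 * sigma + eps), 0.0), 1.0)
        if r > kappa:
            weights.append(1.0)   # clear progress: full weight
        elif r >= 0:
            weights.append(soft)  # small progress: soft weight
        else:
            weights.append(0.0)   # negative progress: masked out
    return weights


print(rabc_weights([-0.02, 0.0, 0.005, 0.05], kappa=0.01))
```

In this example the first sample is masked, the last receives full weight, and the two in between receive intermediate soft weights.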

### Step 5a: Compute SARM Progress Values

First, run the SARM model on all frames in your dataset to compute progress values:

```bash
python src/lerobot/policies/sarm/compute_rabc_weights.py \
  --dataset-repo-id your-username/your-dataset \
  --reward-model-path your-username/sarm-model \
  --head-mode sparse \
  --num-visualizations 5 \
  --push-to-hub
```

This script:

- Processes all frames and computes progress values
- Saves progress values to a parquet file next to the dataset on disk (defaults to `<dataset_root>/sarm_progress.parquet`)
- Generates visualizations of the first N episodes (default: 5)

**Arguments:**

| Argument               | Description                                                    | Default    |
| ---------------------- | -------------------------------------------------------------- | ---------- |
| `--reward-model-path`  | Path to trained SARM model                                     | (required) |
| `--head-mode`          | SARM head to use: `sparse`, `dense`, or `both`                 | `sparse`   |
| `--device`             | Device for inference                                           | `cuda`     |
| `--visualize-only`     | Only visualize predictions (no RA-BC computation)              | `false`    |
| `--num-visualizations` | Number of episodes to visualize (default: 5, set to 0 to skip) | `5`        |

**Output format** (`sarm_progress.parquet`):

| Column            | Description                                    |
| ----------------- | ---------------------------------------------- |
| `index`           | Global frame index in dataset                  |
| `episode_index`   | Episode number                                 |
| `frame_index`     | Local frame index within episode               |
| `progress_sparse` | Sparse head progress value [0, 1]              |
| `progress_dense`  | Dense head progress value [0, 1] (if computed) |
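
Given rows with the columns above, the progress delta for each frame can be computed per episode. The sketch below works on a plain list of dictionaries (such rows could be obtained with, for example, `pandas.read_parquet(...).to_dict("records")`). The function name and the choice to reuse the last frame's progress when `t + Δ` runs past the end of an episode are assumptions of this sketch.

```python
from collections import defaultdict


def progress_deltas(rows: list[dict], chunk_size: int, column: str = "progress_sparse") -> dict[int, float]:
    """Map global frame index -> progress(t + chunk_size) - progress(t)."""
    by_episode = defaultdict(list)
    for row in rows:
        by_episode[row["episode_index"]].append(row)

    deltas = {}
    for episode_rows in by_episode.values():
        episode_rows.sort(key=lambda r: r["frame_index"])
        last = len(episode_rows) - 1
        for i, row in enumerate(episode_rows):
            # Never look across episode boundaries: clamp to the last frame.
            future = episode_rows[min(i + chunk_size, last)]
            deltas[row["index"]] = future[column] - row[column]
    return deltas


# One hypothetical four-frame episode.
rows = [
    {"index": i, "episode_index": 0, "frame_index": i, "progress_sparse": p}
    for i, p in enumerate([0.0, 0.2, 0.5, 1.0])
]
print(progress_deltas(rows, chunk_size=2))
```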

### Step 5b: Train Policy with RA-BC

Once you have the progress file, train your policy with RA-BC weighting. The progress file is auto-detected from the dataset path (`sarm_progress.parquet`). Currently PI0, PI0.5 and SmolVLA are supported with RA-BC:

```bash
python src/lerobot/scripts/lerobot_train.py \
  --dataset.repo_id=your-username/your-dataset \
  --policy.type=pi0 \
  --use_rabc=true \
  --rabc_head_mode=sparse \
  --rabc_kappa=0.01 \
  --output_dir=outputs/train/policy_rabc \
  --batch_size=32 \
  --steps=40000
```

The training script automatically:

- Loads the precomputed progress values from the parquet file
- Uses the policy's `chunk_size` to compute progress deltas (Δ)
- Computes sample weights based on progress improvement
- Applies weighted loss during training

**RA-BC Arguments:**

| Argument               | Description                                                | Default                            |
| ---------------------- | ---------------------------------------------------------- | ---------------------------------- |
| `--use_rabc`           | Enable RA-BC sample weighting                              | `false`                            |
| `--rabc_progress_path` | Path to progress parquet file (auto-detected from dataset) | `sarm_progress.parquet` in dataset |
| `--rabc_head_mode`     | Which SARM head's progress to use: `sparse` or `dense`     | `sparse`                           |
| `--rabc_kappa`         | Threshold κ for high-quality samples                       | `0.01`                             |

### Tuning RA-BC Kappa

The `kappa` parameter is the threshold that determines which samples get full weight (w=1). Understanding how to tune it is critical for RA-BC to work effectively.

**How the weighting works:**

| Condition           | Weight                  |
| ------------------- | ----------------------- |
| `delta > kappa`     | 1.0 (hard threshold)    |
| `0 ≤ delta ≤ kappa` | Soft weight from Eq. 8  |
| `delta < 0`         | 0.0 (negative progress) |

**Diagnosing kappa issues:**

Monitor these WandB metrics during training:

| Metric             | Healthy Range | Problem Indicator         |
| ------------------ | ------------- | ------------------------- |
| `rabc_mean_weight` | 0.3 - 0.8     | ≈ 1.0 means kappa too low |
| `rabc_delta_mean`  | > 0           | Should be positive        |
| `rabc_delta_std`   | > 0           | Variance in data quality  |

**If `rabc_mean_weight ≈ 1.0`:** Your kappa is too low. Most samples have `delta > kappa` and bypass the soft-weighting entirely. RA-BC becomes equivalent to vanilla BC.

**Setting kappa based on your data:**

The default `kappa=0.01` was tuned for the paper's T-shirt folding task (~90s episodes at 30fps). For your dataset, check the logged `rabc_delta_mean` and `rabc_delta_std`:

```
# If delta_mean ≈ 0.03 and delta_std ≈ 0.02:
# Most deltas fall in range [0.01, 0.05]

# Option 1: Set kappa = delta_mean (medium selectivity)
--rabc_kappa=0.03

# Option 2: Set kappa = delta_mean + delta_std (high selectivity)
--rabc_kappa=0.05

# Option 3: Set kappa = delta_mean + 2*delta_std (very selective)
--rabc_kappa=0.07
```
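
The same options can be derived programmatically from the logged statistics. In the sketch below, the helper names are hypothetical, and `fraction_full_weight` is a quick way to estimate how many samples would bypass soft weighting for a given kappa.

```python
def kappa_candidates(delta_mean: float, delta_std: float) -> dict[str, float]:
    """Kappa values for medium, high, and very high selectivity."""
    return {
        "medium": delta_mean,
        "high": delta_mean + delta_std,
        "very_high": delta_mean + 2 * delta_std,
    }


def fraction_full_weight(deltas: list[float], kappa: float) -> float:
    """Fraction of samples with delta > kappa, i.e. samples that get weight 1."""
    return sum(d > kappa for d in deltas) / len(deltas)


print(kappa_candidates(0.03, 0.02))
# Hypothetical deltas: with kappa=0.01 most samples get full weight.
deltas = [0.0, 0.02, 0.03, 0.04, 0.06]
print(fraction_full_weight(deltas, kappa=0.01))  # 0.8
print(fraction_full_weight(deltas, kappa=0.05))  # 0.2
```

If the fraction is close to 1 at your current kappa, most samples skip the soft weighting and training behaves like vanilla BC.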

**When RA-BC may not help:**

If your dataset is already high quality (consistent progress across all demonstrations), RA-BC won't provide much benefit since there's nothing to filter.

### Multi-GPU Training with RA-BC

```bash
accelerate launch \
  --multi_gpu \
  --num_processes=4 \
  src/lerobot/scripts/lerobot_train.py \
  --dataset.repo_id=your-username/your-dataset \
  --policy.type=pi0 \
  --use_rabc=true \
  --rabc_kappa=0.01 \
  --output_dir=outputs/train/policy_rabc \
  --batch_size=32 \
  --steps=40000
```

---

## Tips & Best Practices

### Choosing a Mode

- **Start with `single_stage`** for quick experiments - no annotation overhead
- Use **`dense_only`** when you want detailed progress tracking but tasks don't have clear high-level stages
- Use **`dual`** for complex tasks where both coarse and fine-grained progress are meaningful

### Annotation Quality

1. **Be specific with subtask names**: Instead of "fold", use "grab near side and fold toward center"
2. **Verify with visualization**: Always check a few episodes before training
3. **Consistent naming**: Use the same subtask names across all episodes

### RA-BC

1. **Train SARM first**: RA-BC quality depends entirely on SARM quality
2. **Monitor `rabc_mean_weight`**: If it's ≈ 1.0, increase kappa (see [Tuning RA-BC Kappa](#tuning-ra-bc-kappa))

---

## Citation

```bibtex
@article{chen2025sarm,
  title={SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation},
  author={Chen, Qianzhong and Yu, Justin and Schwager, Mac and Abbeel, Pieter and Shentu, Yide and Wu, Philipp},
  journal={arXiv preprint arXiv:2509.25358},
  year={2025}
}
```
-												Add sarm (#2639)

* add initial modeling

* make rewind pretrained policy

* add annotation

* small fix

* add sarm

* subtasks

* fix spawn

* fix rewind discrepancies

* Add script to generate embedding for dataset (#2138)

* Add generate and validate script

* fix precommit

* Improve generate embeddings function by using dataset tools (#2206)

---------

Co-authored-by: Michel Aractingi <michel.aractingi@huggingface.co>

* cleanup

* change order train log

* print batch size

* update sarm processor

* add reward output

* change expected features

* add image validation

* change validation

* get state input from dataset stats

* raise if no state key is found

* pass stats

* cleanup and refactor

* add episode inddex to complementary data

* add subtask init and detection

* revert lerobot_train changes

* pass dataset metadata to policy

* change loadig subtasks

* add small logging

* fix progress conversion and adding initial frame

* use large offset for initial frame (ugly)

* Remove rewind, use clip tokenizer

* add tests, implement formula 1,2 correctly and cleanup

* use task from dataset, cleanup visualizer

* simplify

* simplify and cleanup code and move compute_temporal_proportions to utils

* fix normalization in visualization

* Fix visualization and change prompt

* fix formatting

* add visualize subtask annotations

* use qwen thinking

* try different prompt

* format

* update prompt

* higher temp, long output

* different settings

* use instruct

* show full resp

* split message

* Temp: increase tolerance dataset

* Fix RA-BC (#2572)

* Add next observation loading for RA-BC progress deltas

* Compute weights based on temporal progress deltas instead of static rewards

* Add hard-masking for negative progress deltas in weight computation

* Feat/add dual head (#2582)

* Add dual dense sparse head and annotation

* Add docs

* add dual to procesor

* cleanup

* change sampling in visualize and cleanup

* remove validation

* remove compile

* Feat/test uniform (#2587)

* test uniform

* add different string for misaligned

* Fix rewind and add tests

* uncomment text implementation

* run precommit

* Add head mode for ra-bc

* fix visalization of single task

* add

* return per sample loss

* Fix RA_BC (#2602)

* update rabc implementation

* compute rabc beforehand

* fix import

* add only progress calulation

* use precomputed progress

* multi gpu processing

* import

* fix dataset meta data extraction

* add logging

* logging

* log

* progress per episode

* split differently

* move clip to gpu

* pre decode frames for an episode

* fix cuda initalization

* fix import

* multi processing

* rename

* fix import

* fix

* fix rabc

* use last known progress if oob

* use last known progress if oob

* add misalignment loss with random embeddings

* discard previous changes

* add selection of models to docs for ra_bc

* add transformers dep

* extend tolerance

* initial commit with new codebase

* add tests

* fix

* remove temporal sampler

* drop last frame for sampler

* use original ref

* some fixes

* fix visualization

* remove smoothing and fix order subtasks

* add stride rabc computation

* add push to hub

* add explanation

* add kappa expllaination

* better rabc logging

* feedback pr

* remove dataset tolerance

* revert dataset tool

* revert dataset changes

* add credit

* run precommit

* change path for generate ra_bc

* fix type

* include sarm in all in pyproject

* fix precommit

* lazy import matplotlib

* lazy import qwen

* remove rich console

* skip if transformers is not installed?

* run only when we have faker

* place transformer lazy loading

* Dont test if low transformer version

* fix

* increase transformer

* increase as 4.57.0 is yanked

* remove pi from all

* go back

---------

Co-authored-by: Michel Aractingi <michel.aractingi@huggingface.co>
Co-authored-by: s1lent4gnt <kmeftah.khalil@gmail.com>
											
										
										
											2025-12-18 12:50:32 +01:00
+								# SARM: Stage-Aware Reward Modeling
 								SARM (Stage-Aware Reward Modeling) is a video-based reward modeling framework for long-horizon robot manipulation tasks. This guide covers how to train SARM reward models and optionally use them with Reward-Aligned Behavior Cloning (RA-BC).
 								**Paper**: [SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation](https://arxiv.org/abs/2509.25358)
-												docs: improve assets (#2777)

* add assets

* add libero results pifast:

* update

* update

* update size

* update naems:
:

* update training tokenizer
											
										
										
											2026-01-12 13:33:28 +01:00
+								<img
 								  src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/lerobot-sarm.png"
 								  alt="An overview of SARM"
 								  width="80%"
 								/>
-												Add sarm (#2639)

* add initial modeling

* make rewind pretrained policy

* add annotation

* small fix

* add sarm

* subtasks

* fix spawn

* fix rewind discrepancies

* Add script to generate embedding for dataset (#2138)

* Add generate and validate script

* fix precommit

* Improve generate embeddings function by using dataset tools (#2206)

---------

Co-authored-by: Michel Aractingi <michel.aractingi@huggingface.co>

* cleanup

* change order train log

* print batch size

* update sarm processor

* add reward output

* change expected features

* add image validation

* change validation

* get state input from dataset stats

* raise if no state key is found

* pass stats

* cleanup and refactor

* add episode inddex to complementary data

* add subtask init and detection

* revert lerobot_train changes

* pass dataset metadata to policy

* change loadig subtasks

* add small logging

* fix progress conversion and adding initial frame

* use large offset for initial frame (ugly)

* Remove rewind, use clip tokenizer

* add tests, implement formula 1,2 correctly and cleanup

* use task from dataset, cleanup visualizer

* simplify

* simplify and cleanup code and move compute_temporal_proportions to utils

* fix normalization in visualization

* Fix visualization and change prompt

* fix formatting

* add visualize subtask annotations

* use qwen thinking

* try different prompt

* format

* update prompt

* higher temp, long output

* different settings

* use instruct

* show full resp

* split message

## Why Reward Models?
Standard behavior cloning treats all demonstration frames equally, but real-world robot datasets are messy. They contain hesitations, corrections, and variable-quality trajectories. Reward models solve this by learning a generalizable notion of **task progress** from demonstrations: given video frames and a task description, they predict how close the robot is to completing the task (0→1). This learned "progress signal" can be used in several ways. Two promising applications are: (1) **weighted imitation learning** (RA-BC), where high-progress frames receive more weight during policy training, and (2) **reinforcement learning**, where the reward model provides dense rewards for online or offline policy improvement.
 								## Overview
SARM has the following features:

1. **Stage-aware architecture**: Jointly predicts the high-level task stage and fine-grained progress within each stage
2. **Subtask annotations**: Uses natural language subtask annotations to derive consistent progress labels
3. **Temporal proportions**: Computes dataset-level priors (α̅\_k) for each subtask to normalize progress across variable-length demonstrations
 								SARM trains on a compact **stage+tau** target for each frame:
 								- **stage**: integer stage index `k ∈ {0, ..., K-1}`
 								- **τ (tau)**: within-stage progress `τ ∈ [0, 1]`
 								- **target encoding**: `y = k + τ` (this is what the dataset processor produces)
 								At inference time (and in downstream RA-BC), SARM converts the raw `k + τ` value into a **normalized progress** in `[0, 1]` using dataset-level **temporal proportions** `α̅_k` (stored in `meta/temporal_proportions_*.json`).
 								This matches **Formula (2)** from the paper:
 								```
 								progress_t = P_{k-1} + α̅_k × τ_t
 								```
 								Where:
 								- `τ_t = (t - s_k) / (e_k - s_k)` is within-subtask normalized time
 								- `P_{k-1}` is cumulative prior (sum of previous subtask proportions)
 								- `α̅_k` is the temporal proportion for subtask k
 								This ensures identical task states map to consistent progress values, even across demonstrations of different lengths.
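
The conversion from a raw `k + τ` value to normalized progress can be sketched in a few lines of Python. This is only an illustration of Formula (2), not the code in the LeRobot repository: the function name `normalized_progress` and the example proportions are assumptions, and the clamping at the boundaries is one reasonable choice among several.

```python
from itertools import accumulate


def normalized_progress(y: float, proportions: list[float]) -> float:
    """Map a stage+tau value y = k + tau to a progress value in [0, 1].

    `proportions` holds the temporal proportions (one per stage, summing to 1).
    The function name and the clamping behavior are illustrative assumptions.
    """
    num_stages = len(proportions)
    # Clamp so that y = K (end of the last stage) maps to stage K-1 with tau = 1.
    y = min(max(y, 0.0), float(num_stages))
    k = min(int(y), num_stages - 1)
    tau = y - k
    # cumulative[k] is P_{k-1}: the summed proportions of all earlier stages.
    cumulative = [0.0, *accumulate(proportions)]
    return cumulative[k] + proportions[k] * tau


# Hypothetical proportions for a three-stage task.
example_proportions = [0.2, 0.5, 0.3]
# Halfway through the second stage: 0.2 + 0.5 * 0.5, i.e. roughly 0.45.
print(normalized_progress(1.5, example_proportions))
```

Because the cumulative prior depends only on dataset-level proportions, two demonstrations of different lengths that are both halfway through the same stage receive the same progress value.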
 								## Inputs and Targets (What the new code expects)
 								SARM is trained through its processor (`src/lerobot/policies/sarm/processor_sarm.py`), which:
 								- **Encodes** images and task text with CLIP (ViT-B/32) into `video_features` and `text_features`
 								- **Pads/truncates** robot state into `state_features` (up to `max_state_dim`)
 								- **Builds targets** as `sparse_targets` (and `dense_targets` in `dense_only`/`dual`) using the stage+tau encoding `y = k + τ`
 								- **Masks rewind frames** using a per-sample `lengths` tensor (rewind is a training-time augmentation)
 								At minimum, each training sample needs:
 								- `task` (string): task description
 								- `policy.image_key` images and `policy.state_key` states from the dataset
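
To make the stage+tau encoding concrete, here is a minimal sketch of how a target `y = k + τ` could be derived for one frame from per-subtask start and end frames. The helper `stage_tau_target` is hypothetical and is not the processor's actual implementation; how frames outside every annotated span are handled is an assumption.

```python
def stage_tau_target(frame: int, starts: list[int], ends: list[int]) -> float:
    """Return y = k + tau for `frame`, given per-subtask start/end frames.

    Hypothetical helper: `starts[k]` and `ends[k]` delimit subtask k.
    """
    for k, (s, e) in enumerate(zip(starts, ends)):
        if s <= frame <= e:
            # Within-stage progress; guard against single-frame subtasks.
            tau = (frame - s) / (e - s) if e > s else 1.0
            return k + tau
    # Assumption: frames outside every annotated span clamp to the nearest end.
    return 0.0 if frame < starts[0] else float(len(starts))


# Two subtasks: frames 0-30 and 30-90.
print(stage_tau_target(15, [0, 30], [30, 90]))  # 0.5 (stage 0, halfway)
print(stage_tau_target(60, [0, 30], [30, 90]))  # 1.5 (stage 1, halfway)
```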
 								---
 								## Annotation Modes
 								You can choose from **3 annotation modes** that determine how progress labels are computed:
 								| Mode           | Annotations Required | Heads                        | Use Case                                                     |
 								| -------------- | -------------------- | ---------------------------- | ------------------------------------------------------------ |
 								| `single_stage` | None                 | Sparse only                  | Simple tasks, quick experiments, no VLM needed               |
 								| `dense_only`   | Dense (VLM)          | Dual (sparse auto-generated) | Detailed subtask tracking without defining high-level stages |
 								| `dual`         | Sparse + Dense (VLM) | Dual                         | Full SARM paper setup with both granularities                |
 								### Mode Details
 								<hfoptions id="mode_explanation">
 								<hfoption id="single_stage">
 								**No annotations required.** The entire episode is treated as a single stage called `"task"`, and progress is linear from 0 to 1 over the episode duration.
 								- **Sparse head**: 1 stage ("task"), linear progress
 								- **Dense head**: Not used
 								- **Best for**: Simple tasks, quick experiments, or when VLM annotation is not available
 								## Set Up Your Environment
1. Install LeRobot by following our [Installation Guide](./installation).
2. Install SARM dependencies by running:
 								```bash
 								pip install -e ".[sarm]"
 								```
 								Workflow:
 								```
1. Train SARM → 2. Visualize predictions → 3. (Optional) Train policy with RA-BC
 								```
 								</hfoption>
 								<hfoption id="dense_only">
 								**Only dense (fine-grained) annotations from a VLM.** The sparse head automatically uses a single `"task"` stage covering the full episode, while the dense head learns detailed subtask progression.
 								- **Sparse head**: 1 stage ("task"), linear progress (auto-generated)
 								- **Dense head**: Multiple fine-grained stages from VLM annotations
 								- **Best for**: When you want detailed subtask tracking but don't need to define high-level stages
 								Workflow:
 								```
1. Annotate (dense) → 2. Verify → 3. Train SARM → 4. Visualize → 5. (Optional) Train policy with RA-BC
 								```
 								</hfoption>
 								<hfoption id="dual">
 								**Both sparse and dense annotations from VLM.** Full dual-head mode as described in the SARM paper, with both high-level (sparse) and fine-grained (dense) stage predictions.
 								- **Sparse head**: High-level stages from VLM annotations
 								- **Dense head**: Fine-grained stages from VLM annotations
 								- **Best for**: Complex multi-stage tasks where both granularities are useful
 								Workflow:
 								```
1. Annotate (sparse+dense) → 2. Verify → 3. Train SARM → 4. Visualize → 5. (Optional) Train policy with RA-BC
 								```
 								</hfoption>
 								</hfoptions>
 								---
 								## Step 1: Subtask Annotation
 								<hfoptions id="annotation_mode">
 								<hfoption id="single_stage">
 								**No annotation required!** Skip this step entirely. The model will use the episode's task description and compute linear progress automatically.
 								</hfoption>
 								<hfoption id="dense_only">
 								Generate **dense (fine-grained) annotations only** using a VLM. The sparse stage will be auto-generated.
 								```bash
 								python src/lerobot/data_processing/sarm_annotations/subtask_annotation.py \
 								  --repo-id your-username/your-dataset \
 								  --dense-only \
 								  --dense-subtasks "Bring robot arms up from starting position,Grab near side and do 1st fold,Grab side and do 2nd fold,Grab side and do 3rd fold to finish folding" \
 								  --video-key observation.images.base \
 								  --num-workers 4 \
 								  --push-to-hub
 								```
 								**What gets saved:**
 								- `meta/temporal_proportions_sparse.json` - Auto-generated sparse proportions (`{"task": 1.0}`)
 								- `meta/temporal_proportions_dense.json` - Dense temporal proportions
 								- Per-episode columns in `episodes/*.parquet`:
 								  - `dense_subtask_names`, `dense_subtask_start_frames`, `dense_subtask_end_frames`
 								  - (also time-based columns: `dense_subtask_start_times`, `dense_subtask_end_times`)
 								</hfoption>
 								<hfoption id="dual">
 								Generate **both sparse (high-level) and dense (fine-grained) annotations** using a VLM.
 								```bash
 								python src/lerobot/data_processing/sarm_annotations/subtask_annotation.py \
 								  --repo-id your-username/your-dataset \
 								  --sparse-subtasks "Bring arms up from starting position,Fold the towel (3 folds in total)" \
 								  --dense-subtasks "Bring robot arms up from starting position,Grab near side and do 1st fold,Grab side and do 2nd fold,Grab side and do 3rd fold to finish folding" \
 								  --video-key observation.images.base \
 								  --num-workers 4 \
 								  --push-to-hub
 								```
 								**What gets saved:**
 								- `meta/temporal_proportions_sparse.json` - Sparse temporal proportions
 								- `meta/temporal_proportions_dense.json` - Dense temporal proportions
 								- Per-episode columns in `episodes/*.parquet`:
 								  - `sparse_subtask_names`, `sparse_subtask_start_frames`, `sparse_subtask_end_frames`
 								  - `dense_subtask_names`, `dense_subtask_start_frames`, `dense_subtask_end_frames`
 								  - (also time-based columns: `*_subtask_start_times`, `*_subtask_end_times`)
 								</hfoption>
 								</hfoptions>
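
As a rough illustration of what the temporal proportions files contain, the sketch below averages each subtask's share of episode length across episodes. This assumes the proportions are the mean of per-episode fractions, which may differ in detail from the annotation script; the function name and the dictionary keys (`names`, `start_frames`, `end_frames`) are hypothetical.

```python
from collections import defaultdict


def temporal_proportions(episodes: list[dict]) -> dict[str, float]:
    """Average each subtask's fraction of episode length over all episodes.

    Assumption: every episode dict has `names`, `start_frames`, `end_frames`
    lists of equal length and a non-zero total annotated duration.
    """
    sums: dict[str, float] = defaultdict(float)
    for ep in episodes:
        lengths = [e - s for s, e in zip(ep["start_frames"], ep["end_frames"])]
        total = sum(lengths)
        for name, length in zip(ep["names"], lengths):
            sums[name] += length / total
    raw = {name: value / len(episodes) for name, value in sums.items()}
    # Renormalize so the proportions sum to 1 despite rounding.
    norm = sum(raw.values())
    return {name: value / norm for name, value in raw.items()}


episodes = [
    {"names": ["reach", "fold"], "start_frames": [0, 20], "end_frames": [20, 100]},
    {"names": ["reach", "fold"], "start_frames": [0, 40], "end_frames": [40, 100]},
]
print(temporal_proportions(episodes))  # reach ≈ 0.3, fold ≈ 0.7
```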
 								### Annotation Arguments
 								| Argument               | Description                                                                     |
 								| ---------------------- | ------------------------------------------------------------------------------- |
 								| `--repo-id`            | HuggingFace dataset repository ID                                               |
 								| `--sparse-subtasks`    | Comma-separated list of high-level subtask names                                |
 								| `--dense-subtasks`     | Comma-separated list of fine-grained subtask names                              |
 								| `--dense-only`         | Generate only dense annotations (auto-creates sparse "task" stage)              |
 								| `--video-key`          | Camera/video key to use (e.g., `observation.images.top`)                        |
 								| `--num-workers`        | Number of parallel GPU workers (default: 1)                                     |
 								| `--episodes`           | Specific episode indices to annotate (default: all)                             |
 								| `--skip-existing`      | Skip episodes that already have annotations                                     |
 								| `--model`              | VLM model (default: `Qwen/Qwen3-VL-30B-A3B-Instruct`)                           |
 								| `--num-visualizations` | Number of episodes to visualize after annotation (default: 5, set to 0 to skip) |
 								> **Note**: After annotation completes, 5 episodes are automatically visualized by default. Use `--num-visualizations 0` to skip this step.
 								---
 								## Step 2: Verify Annotations
 								<hfoptions id="verify_mode">
 								<hfoption id="single_stage">
 								**No verification needed!** Skip this step.
 								</hfoption>
 								<hfoption id="dense_only">
 								Visualize annotations using the `--visualize-only` flag:
 								```bash
 								python src/lerobot/data_processing/sarm_annotations/subtask_annotation.py \
 								  --repo-id your-username/your-dataset \
 								  --visualize-only \
 								  --visualize-type dense \
 								  --num-visualizations 5 \
 								  --video-key observation.images.base \
 								  --output-dir ./subtask_viz
 								```
 								</hfoption>
 								<hfoption id="dual">
 								Visualize annotations using the `--visualize-only` flag:
 								```bash
 								python src/lerobot/data_processing/sarm_annotations/subtask_annotation.py \
 								  --repo-id your-username/your-dataset \
 								  --visualize-only \
 								  --visualize-type both \
 								  --num-visualizations 5 \
 								  --video-key observation.images.base \
 								  --output-dir ./subtask_viz
 								```
 								</hfoption>
 								</hfoptions>
This generates visualizations showing video frames with subtask boundaries overlaid and a timeline of subtasks.
 								### Visualization Arguments
 								| Argument               | Description                                                    |
 								| ---------------------- | -------------------------------------------------------------- |
 								| `--visualize-only`     | Only visualize existing annotations (no generation)            |
 								| `--num-visualizations` | Number of episodes to visualize (default: 5)                   |
 								| `--visualize-type`     | Type of annotations to visualize: `sparse`, `dense`, or `both` |
 								**Tip**: If annotations are inaccurate, adjust your subtask descriptions to be more specific and re-run.
 								---
 								## Step 3: Train SARM
 								<hfoptions id="train_mode">
 								<hfoption id="single_stage">
 								Train with **no annotations** - uses linear progress from 0 to 1:
 								```bash
 								python src/lerobot/scripts/lerobot_train.py \
 								  --dataset.repo_id=your-username/your-dataset \
 								  --policy.type=sarm \
 								  --policy.annotation_mode=single_stage \
 								  --policy.image_key=observation.images.base \
 								  --output_dir=outputs/train/sarm_single \
 								  --batch_size=32 \
 								  --steps=5000 \
 								  --wandb.enable=true \
 								  --wandb.project=sarm \
 								  --policy.repo_id=your-username/your-model-name
 								```
 								</hfoption>
 								<hfoption id="dense_only">
 								Train with **dense annotations only** (sparse auto-generated):
 								```bash
 								python src/lerobot/scripts/lerobot_train.py \
 								  --dataset.repo_id=your-username/your-dataset \
 								  --policy.type=sarm \
 								  --policy.annotation_mode=dense_only \
 								  --policy.image_key=observation.images.base \
 								  --output_dir=outputs/train/sarm_dense \
 								  --batch_size=32 \
 								  --steps=5000 \
 								  --wandb.enable=true \
 								  --wandb.project=sarm \
 								  --policy.repo_id=your-username/your-model-name
 								```
 								</hfoption>
 								<hfoption id="dual">
 								Train with **both sparse and dense annotations**:
 								```bash
 								python src/lerobot/scripts/lerobot_train.py \
 								  --dataset.repo_id=your-username/your-dataset \
 								  --policy.type=sarm \
 								  --policy.annotation_mode=dual \
 								  --policy.image_key=observation.images.base \
 								  --output_dir=outputs/train/sarm_dual \
 								  --batch_size=32 \
 								  --steps=5000 \
 								  --wandb.enable=true \
 								  --wandb.project=sarm \
 								  --policy.repo_id=your-username/your-model-name
 								```
 								</hfoption>
 								</hfoptions>
 								### Multi-GPU Training
To train on multiple GPUs, replace `python` in the training command with `accelerate launch --multi_gpu --num_processes=4` (set `--num_processes` to the number of GPUs available).
 								### Training Arguments
 								| Argument                   | Description                                                       | Default                  |
 								| -------------------------- | ----------------------------------------------------------------- | ------------------------ |
 								| `--policy.annotation_mode` | `single_stage`, `dense_only`, or `dual`                           | `single_stage`           |
 								| `--policy.image_key`       | Camera key for images                                             | `observation.images.top` |
 								| `--policy.state_key`       | Key for joint states                                              | `observation.state`      |
 								| `--policy.n_obs_steps`     | Observation history steps (total obs frames = `n_obs_steps + 1`)  | `8`                      |
 								| `--policy.frame_gap`       | Gap (in frames) between sampled observations (at 30 fps: 30 ≈ 1s) | `30`                     |
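
To see how `n_obs_steps` and `frame_gap` interact, the following sketch lists which frame indices would make up one observation window. It assumes the history looks backward from the current frame and is clamped at the start of the episode; the actual sampling in the SARM processor may differ, and `observation_indices` is a hypothetical helper.

```python
def observation_indices(t: int, n_obs_steps: int = 8, frame_gap: int = 30) -> list[int]:
    """Frame indices for one sample: n_obs_steps + 1 frames spaced frame_gap apart.

    Assumption: the window ends at the current frame `t`, and indices that
    would fall before the episode start are clamped to frame 0.
    """
    return [max(0, t - i * frame_gap) for i in range(n_obs_steps, -1, -1)]


# With the defaults, a sample at frame 300 spans roughly the previous 8 seconds at 30 fps.
print(observation_indices(300))
# [60, 90, 120, 150, 180, 210, 240, 270, 300]
```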
 								---
 								## Step 4: Visualize Predictions
 								Use `compute_rabc_weights.py` with `--visualize-only` to visualize model predictions (and, if available, annotation-derived targets) without writing a parquet file.
 								<hfoptions id="viz_mode">
 								<hfoption id="single_stage">
 								```bash
 								python src/lerobot/policies/sarm/compute_rabc_weights.py \
 								  --dataset-repo-id your-username/your-dataset \
 								  --reward-model-path your-username/sarm-model \
 								  --visualize-only \
 								  --num-visualizations 5 \
 								  --head-mode sparse \
 								  --output-dir ./sarm_viz
 								```
 								</hfoption>
 								<hfoption id="dense_only">
 								```bash
 								python src/lerobot/policies/sarm/compute_rabc_weights.py \
 								  --dataset-repo-id your-username/your-dataset \
 								  --reward-model-path your-username/sarm-model \
 								  --visualize-only \
 								  --num-visualizations 5 \
 								  --head-mode dense \
 								  --output-dir ./sarm_viz
 								```
 								</hfoption>
 								<hfoption id="dual">
 								```bash
 								python src/lerobot/policies/sarm/compute_rabc_weights.py \
 								  --dataset-repo-id your-username/your-dataset \
 								  --reward-model-path your-username/sarm-model \
 								  --visualize-only \
 								  --num-visualizations 5 \
 								  --head-mode both \
 								  --output-dir ./sarm_viz
 								```
 								</hfoption>
 								</hfoptions>
 								The visualization shows:
 								- **Progress plot**: Predicted progress (and optional annotation-derived “GT” when available and `--stride 1`)
 								- **Stage probabilities**: Stacked area plot of predicted stage probabilities
 								- **Sample frames**: Key frames from the episode with progress/stage labels
 								### Visualization Arguments
 								| Argument               | Description                                               |
 								| ---------------------- | --------------------------------------------------------- |
 								| `--visualize-only`     | Only visualize predictions (no RABC computation)          |
 								| `--num-visualizations` | Number of episodes to visualize (default: 5)              |
 								| `--head-mode`          | SARM head to use: `sparse`, `dense`, or `both`            |
 								| `--stride`             | Compute every N frames, interpolate the rest (default: 1) |
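
The effect of `--stride` can be pictured with a small sketch: the model is evaluated only on every N-th frame and the frames in between are filled in. Linear interpolation is assumed here, and `interpolate_strided` is a hypothetical helper rather than the script's own function.

```python
def interpolate_strided(values: list[float], stride: int, num_frames: int) -> list[float]:
    """Expand predictions made every `stride` frames to all `num_frames` frames.

    `values[j]` is the prediction at frame j * stride. Frames past the last
    computed prediction reuse that last value (an assumption of this sketch).
    """
    out = []
    for t in range(num_frames):
        j, r = divmod(t, stride)
        if j + 1 < len(values):
            w = r / stride
            out.append((1 - w) * values[j] + w * values[j + 1])
        else:
            out.append(values[min(j, len(values) - 1)])
    return out


# Predictions at frames 0, 2, and 4, expanded to six frames.
print(interpolate_strided([0.0, 0.5, 1.0], stride=2, num_frames=6))
# [0.0, 0.25, 0.5, 0.75, 1.0, 1.0]
```

A larger stride reduces inference cost roughly in proportion, at the price of smoothing over short-lived changes in predicted progress.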
 								---
 								## Step 5 (Optional): Train Policy with RA-BC
 								Reward-Aligned Behavior Cloning (RA-BC) uses the trained SARM model to weight training samples based on predicted progress improvement. This requires two steps:
1. **Precompute progress values** for all frames using the trained SARM model
2. **Train policy** with RA-BC weighting using the precomputed values
 								### How RA-BC Works
 								For each training sample, RA-BC computes the progress delta:
 								```
 								r_i = φ(o_{t+Δ}) - φ(o_t)
 								```
 								Where `φ` is the SARM progress prediction and `Δ` is the policy's `chunk_size`. Samples with positive progress (good demonstrations) get higher weights, while samples with negative or zero progress get down-weighted.
 								The weighting follows **Equations 8-9** from the paper:
 								- **Soft weight**: `w̃_i = clip((r_i − (μ − 2σ)) / (4σ + ε), 0, 1)`
 								- **Final weight**: `w_i = 𝟙{r_i > κ} + 𝟙{0 ≤ r_i ≤ κ} × w̃_i`
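
The two weighting equations can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the training script's implementation: `μ` and `σ` are computed over the deltas passed in (the real code may use running or dataset-level statistics), the value of `ε` is a guess, and the function name `rabc_weights` is hypothetical.

```python
from statistics import fmean, pstdev


def rabc_weights(deltas: list[float], kappa: float = 0.01, eps: float = 1e-6) -> list[float]:
    """Compute per-sample weights from progress deltas r_i.

    Assumptions: mu and sigma are taken over `deltas` itself, and eps is an
    arbitrary small constant that avoids division by zero.
    """
    mu = fmean(deltas)
    sigma = pstdev(deltas)
    weights = []
    for r in deltas:
        # Soft weight: clip((r - (mu - 2*sigma)) / (4*sigma + eps), 0, 1)
        soft = (r - (mu - 2 * sigma)) / (4 * sigma + eps)
        soft = min(max(soft, 0.0), 1.0)
        if r > kappa:
            weights.append(1.0)   # clearly improving: full weight
        elif r >= 0:
            weights.append(soft)  # small improvement: soft weight
        else:
            weights.append(0.0)   # negative progress: masked out
    return weights


print(rabc_weights([-0.02, 0.0, 0.005, 0.03]))
```

Note that neither indicator in the final-weight formula fires for `r_i < 0`, so samples with negative progress receive zero weight.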
 								### Step 5a: Compute SARM Progress Values
 								First, run the SARM model on all frames in your dataset to compute progress values:
 								```bash
 								python src/lerobot/policies/sarm/compute_rabc_weights.py \
 								  --dataset-repo-id your-username/your-dataset \
 								  --reward-model-path your-username/sarm-model \
 								  --head-mode sparse \
 								  --num-visualizations 5 \
 								  --push-to-hub
 								```
 								This script:
 								- Processes all frames and computes progress values
 								- Saves progress values to a parquet file next to the dataset on disk (defaults to `<dataset_root>/sarm_progress.parquet`)
 								- Generates visualizations of the first N episodes (default: 5)
 								**Arguments:**
 								| Argument               | Description                                                    | Default    |
 								| ---------------------- | -------------------------------------------------------------- | ---------- |
 								| `--reward-model-path`  | Path to trained SARM model                                     | (required) |
 								| `--head-mode`          | SARM head to use: `sparse`, `dense`, or `both`                 | `sparse`   |
 								| `--device`             | Device for inference                                           | `cuda`     |
 								| `--visualize-only`     | Only visualize predictions (no RA-BC computation)              | `false`    |
 								| `--num-visualizations` | Number of episodes to visualize (default: 5, set to 0 to skip) | `5`        |
 								**Output format** (`sarm_progress.parquet`):
 								| Column            | Description                                    |
 								| ----------------- | ---------------------------------------------- |
 								| `index`           | Global frame index in dataset                  |
 								| `episode_index`   | Episode number                                 |
 								| `frame_index`     | Local frame index within episode               |
 								| `progress_sparse` | Sparse head progress value [0, 1]              |
 								| `progress_dense`  | Dense head progress value [0, 1] (if computed) |
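
Given rows with the columns above, the progress delta for each frame can be derived by looking `chunk_size` frames ahead within the same episode. The sketch below works on plain dictionaries to stay dependency-free; `progress_deltas` is a hypothetical helper, and clamping the look-ahead to the last frame of the episode is an assumption about how out-of-bounds frames are treated.

```python
def progress_deltas(rows: list[dict], chunk_size: int, column: str = "progress_sparse") -> dict[int, float]:
    """Map each global frame `index` to progress(t + chunk_size) - progress(t).

    Assumption: when t + chunk_size runs past the end of the episode, the
    last frame's progress is used instead.
    """
    by_episode: dict[int, list[dict]] = {}
    for row in sorted(rows, key=lambda r: (r["episode_index"], r["frame_index"])):
        by_episode.setdefault(row["episode_index"], []).append(row)

    deltas = {}
    for frames in by_episode.values():
        last = len(frames) - 1
        for i, row in enumerate(frames):
            future = frames[min(i + chunk_size, last)]
            deltas[row["index"]] = future[column] - row[column]
    return deltas


rows = [
    {"index": i, "episode_index": 0, "frame_index": i, "progress_sparse": p}
    for i, p in enumerate([0.0, 0.25, 0.5, 1.0])
]
print(progress_deltas(rows, chunk_size=2))
# {0: 0.5, 1: 0.75, 2: 0.5, 3: 0.0}
```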
 								### Step 5b: Train Policy with RA-BC
 								Once you have the progress file, train your policy with RA-BC weighting. The progress file is auto-detected from the dataset path (`sarm_progress.parquet`). Currently PI0, PI0.5 and SmolVLA are supported with RA-BC:
 								```bash
 								python src/lerobot/scripts/lerobot_train.py \
 								  --dataset.repo_id=your-username/your-dataset \
 								  --policy.type=pi0 \
 								  --use_rabc=true \
 								  --rabc_head_mode=sparse \
 								  --rabc_kappa=0.01 \
 								  --output_dir=outputs/train/policy_rabc \
 								  --batch_size=32 \
 								  --steps=40000
 								```
 								The training script automatically:
 								- Loads the precomputed progress values from the parquet file
 								- Uses the policy's `chunk_size` to compute progress deltas (Δ)
 								- Computes sample weights based on progress improvement
 								- Applies weighted loss during training
 								**RA-BC Arguments:**
 								| Argument               | Description                                                | Default                            |
 								| ---------------------- | ---------------------------------------------------------- | ---------------------------------- |
 								| `--use_rabc`           | Enable RA-BC sample weighting                              | `false`                            |
 								| `--rabc_progress_path` | Path to progress parquet file (auto-detected from dataset) | `sarm_progress.parquet` in dataset |
 								| `--rabc_head_mode`     | Which SARM head's progress to use: `sparse` or `dense`     | `sparse`                           |
 								| `--rabc_kappa`         | Threshold κ for high-quality samples                       | `0.01`                             |
 								### Tuning RA-BC Kappa
 								The `kappa` parameter is the threshold that determines which samples get full weight (w=1). Understanding how to tune it is critical for RA-BC to work effectively.
 								**How the weighting works:**
 								| Condition           | Weight                  |
 								| ------------------- | ----------------------- |
 								| `delta > kappa`     | 1.0 (hard threshold)    |
 								| `0 ≤ delta ≤ kappa` | Soft weight from Eq. 8  |
 								| `delta < 0`         | 0.0 (negative progress) |
 								**Diagnosing kappa issues:**
 								Monitor these WandB metrics during training:
 								| Metric             | Healthy Range | Problem Indicator         |
 								| ------------------ | ------------- | ------------------------- |
 								| `rabc_mean_weight` | 0.3 - 0.8     | ≈ 1.0 means kappa too low |
 								| `rabc_delta_mean`  | > 0           | Should be positive        |
 								| `rabc_delta_std`   | > 0           | Variance in data quality  |
 								**If `rabc_mean_weight ≈ 1.0`:** Your kappa is too low. Most samples have `delta > kappa` and bypass the soft-weighting entirely. RA-BC becomes equivalent to vanilla BC.
 								**Setting kappa based on your data:**
 								The default `kappa=0.01` was tuned for the paper's T-shirt folding task (~90s episodes at 30fps). For your dataset, check the logged `rabc_delta_mean` and `rabc_delta_std`:
 								```
 								# If delta_mean ≈ 0.03 and delta_std ≈ 0.02:
 								# Most deltas fall in range [0.01, 0.05]
 								# Option 1: Set kappa = delta_mean (medium selectivity)
 								--rabc_kappa=0.03
 								# Option 2: Set kappa = delta_mean + delta_std (high selectivity)
 								--rabc_kappa=0.05
 								# Option 3: Set kappa = delta_mean + 2*delta_std (very selective)
 								--rabc_kappa=0.07
 								```
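
If you have the progress deltas at hand, the three options above can be compared before training by checking what fraction of samples would exceed each candidate kappa and therefore receive full weight. This is a hypothetical diagnostic, not part of the LeRobot scripts; `kappa_report` and its labels are illustrative names.

```python
from statistics import fmean, pstdev


def kappa_report(deltas: list[float]) -> dict[str, tuple[float, float]]:
    """For each candidate kappa, return (kappa, fraction of deltas above it)."""
    mu, sigma = fmean(deltas), pstdev(deltas)
    candidates = (
        ("mean", mu),
        ("mean+1std", mu + sigma),
        ("mean+2std", mu + 2 * sigma),
    )
    report = {}
    for label, kappa in candidates:
        full_weight = sum(d > kappa for d in deltas) / len(deltas)
        report[label] = (kappa, full_weight)
    return report


for label, (kappa, frac) in kappa_report([0.0, 0.01, 0.02, 0.09]).items():
    print(f"{label}: kappa={kappa:.3f}, full-weight fraction={frac:.2f}")
```

A candidate for which nearly all samples exceed kappa would reproduce the `rabc_mean_weight ≈ 1.0` symptom described above.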
 								**When RA-BC may not help:**
 								If your dataset is already high quality (consistent progress across all demonstrations), RA-BC won't provide much benefit since there's nothing to filter.
 								### Multi-GPU Training with RA-BC
 								```bash
 								accelerate launch \
 								  --multi_gpu \
 								  --num_processes=4 \
 								  src/lerobot/scripts/lerobot_train.py \
 								  --dataset.repo_id=your-username/your-dataset \
 								  --policy.type=pi0 \
 								  --use_rabc=true \
 								  --rabc_kappa=0.01 \
 								  --output_dir=outputs/train/policy_rabc \
 								  --batch_size=32 \
 								  --steps=40000
 								```
 								---
 								## Tips & Best Practices
 								### Choosing a Mode
 								- **Start with `single_stage`** for quick experiments - no annotation overhead
 								- Use **`dense_only`** when you want detailed progress tracking but tasks don't have clear high-level stages
 								- Use **`dual`** for complex tasks where both coarse and fine-grained progress is meaningful
 								### Annotation Quality
1. **Be specific with subtask names**: Instead of "fold", use "grab near side and fold toward center"
2. **Verify with visualization**: Always check a few episodes before training
3. **Consistent naming**: Use the same subtask names across all episodes
 								### RA-BC
1. **Train SARM first**: RA-BC quality depends entirely on SARM quality
2. **Monitor `rabc_mean_weight`**: If it's ≈ 1.0, increase kappa (see [Tuning RA-BC Kappa](#tuning-ra-bc-kappa))
 								---
 								## Citation
 								```bibtex
 								@article{chen2025sarm,
 								  title={SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation},
 								  author={Chen, Qianzhong and Yu, Justin and Schwager, Mac and Abbeel, Pieter and Shentu, Yide and Wu, Philipp},
 								  journal={arXiv preprint arXiv:2509.25358},
 								  year={2025}
 								}
 								```