lerobot/common/policies/diffusion/configuration_diffusion.py

#!/usr/bin/env python

# Copyright 2024 Columbia Artificial Intelligence, Robotics Lab,
# and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from dataclasses import dataclass, field


@dataclass
class DiffusionConfig:
    """Configuration class for DiffusionPolicy.

    Defaults are configured for training with PushT providing proprioceptive and single camera observations.

    The parameters you will most likely need to change are the ones which depend on the environment / sensors.
    Those are: `input_shapes` and `output_shapes`.

    Notes on the inputs and outputs:
        - "observation.state" is required as an input key.
        - A key starting with "observation.image" is required as an input.
        - "action" is required as an output key.

    Args:
        n_obs_steps: Number of environment steps worth of observations to pass to the policy (takes the
            current step and additional steps going back).
        horizon: Diffusion model action prediction size as detailed in `DiffusionPolicy.select_action`.
        n_action_steps: The number of action steps to run in the environment for one invocation of the policy.
            See `DiffusionPolicy.select_action` for more details.
        input_shapes: A dictionary defining the shapes of the input data for the policy. The key represents
            the input data name, and the value is a list indicating the dimensions of the corresponding data.
            For example, "observation.image" refers to an input from a camera with dimensions [3, 96, 96],
            indicating it has three color channels and 96x96 resolution. Importantly, `input_shapes` doesn't
            include batch dimension or temporal dimension.
        output_shapes: A dictionary defining the shapes of the output data for the policy. The key represents
            the output data name, and the value is a list indicating the dimensions of the corresponding data.
            For example, "action" refers to an output shape of [14], indicating 14-dimensional actions.
            Importantly, `output_shapes` doesn't include batch dimension or temporal dimension.
        input_normalization_modes: A dictionary whose keys are modalities (e.g. "observation.state") and
            whose values specify the normalization mode to apply. The two available modes are "mean_std",
            which subtracts the mean and divides by the standard deviation, and "min_max", which rescales
            to the [-1, 1] range.
        output_normalization_modes: Similar dictionary to `input_normalization_modes`, but used to
            unnormalize to the original scale. Note that this is also used for normalizing the training
            targets.
        vision_backbone: Name of the torchvision resnet backbone to use for encoding images.
        crop_shape: (H, W) shape to crop images to as a preprocessing step for the vision backbone. Must fit
            within the image size. If None, no cropping is done.
        crop_is_random: Whether the crop should be random at training time (it's always a center crop in eval
            mode).
        pretrained_backbone_weights: Pretrained weights from torchvision to initialize the backbone.
            `None` means no pretrained weights.
        use_group_norm: Whether to replace batch normalization with group normalization in the backbone.
            The number of groups is set to feature_dim // 16, so each group spans about 16 channels.
        spatial_softmax_num_keypoints: Number of keypoints for SpatialSoftmax.
        down_dims: Feature dimension for each stage of temporal downsampling in the diffusion modeling Unet.
            You may provide a variable number of dimensions, therefore also controlling the degree of
            downsampling.
        kernel_size: The convolutional kernel size of the diffusion modeling Unet.
        n_groups: Number of groups used in the group norm of the Unet's convolutional blocks.
        diffusion_step_embed_dim: The Unet is conditioned on the diffusion timestep via a small non-linear
            network. This is the output dimension of that network, i.e., the embedding dimension.
        use_film_scale_modulation: FiLM (https://arxiv.org/abs/1709.07871) is used for the Unet conditioning.
            Bias modulation is used by default, while this parameter indicates whether to also use scale
            modulation.
        noise_scheduler_type: Name of the noise scheduler to use. Supported options: ["DDPM", "DDIM"].
        num_train_timesteps: Number of diffusion steps for the forward diffusion schedule.
        beta_schedule: Name of the diffusion beta schedule as per DDPMScheduler from Hugging Face diffusers.
        beta_start: Beta value for the first forward-diffusion step.
        beta_end: Beta value for the last forward-diffusion step.
        prediction_type: The type of prediction that the diffusion modeling Unet makes. Choose from "epsilon"
            or "sample". These have equivalent outcomes from a latent variable modeling perspective, but
            "epsilon" has been shown to work better in many deep neural network settings.
        clip_sample: Whether to clip the sample to [-`clip_sample_range`, +`clip_sample_range`] for each
            denoising step at inference time. WARNING: you will need to make sure your action-space is
            normalized to fit within this range.
        clip_sample_range: The magnitude of the clipping range as described above.
        num_inference_steps: Number of reverse diffusion steps to use at inference time (steps are evenly
            spaced). If not provided, this defaults to be the same as `num_train_timesteps`.
        do_mask_loss_for_padding: Whether to mask the loss when there are copy-padded actions. See
            `LeRobotDataset` and `load_previous_and_future_frames` for more information. Note, this defaults
            to False as the original Diffusion Policy implementation does the same.
    """

    # Inputs / output structure.
    n_obs_steps: int = 2
    horizon: int = 16
    n_action_steps: int = 8

    input_shapes: dict[str, list[int]] = field(
        default_factory=lambda: {
            "observation.image": [3, 96, 96],
            "observation.state": [2],
        }
    )
    output_shapes: dict[str, list[int]] = field(
        default_factory=lambda: {
            "action": [2],
        }
    )

    # Normalization / Unnormalization
    input_normalization_modes: dict[str, str] = field(
        default_factory=lambda: {
            "observation.image": "mean_std",
            "observation.state": "min_max",
        }
    )
    output_normalization_modes: dict[str, str] = field(default_factory=lambda: {"action": "min_max"})

    # Architecture / modeling.
    # Vision backbone.
    vision_backbone: str = "resnet18"
    crop_shape: tuple[int, int] | None = (84, 84)
    crop_is_random: bool = True
    pretrained_backbone_weights: str | None = None
    use_group_norm: bool = True
    spatial_softmax_num_keypoints: int = 32
    # Unet.
    down_dims: tuple[int, ...] = (512, 1024, 2048)
    kernel_size: int = 5
    n_groups: int = 8
    diffusion_step_embed_dim: int = 128
    use_film_scale_modulation: bool = True
    # Noise scheduler.
    noise_scheduler_type: str = "DDPM"
    num_train_timesteps: int = 100
    beta_schedule: str = "squaredcos_cap_v2"
    beta_start: float = 0.0001
    beta_end: float = 0.02
    prediction_type: str = "epsilon"
    clip_sample: bool = True
    clip_sample_range: float = 1.0

    # Inference
    num_inference_steps: int | None = None

    # Loss computation
    do_mask_loss_for_padding: bool = False

    def __post_init__(self):
        """Input validation (not exhaustive)."""
        if not self.vision_backbone.startswith("resnet"):
            raise ValueError(
                f"`vision_backbone` must be one of the ResNet variants. Got {self.vision_backbone}."
            )
        # There should only be one image key.
        image_keys = {k for k in self.input_shapes if k.startswith("observation.image")}
        if len(image_keys) != 1:
            raise ValueError(
                f"{self.__class__.__name__} only handles one image for now. Got image keys {image_keys}."
            )
        image_key = next(iter(image_keys))
        if self.crop_shape is not None and (
            self.crop_shape[0] > self.input_shapes[image_key][1]
            or self.crop_shape[1] > self.input_shapes[image_key][2]
        ):
            raise ValueError(
                f"`crop_shape` should fit within `input_shapes[{image_key}]`. Got {self.crop_shape} "
                f"for `crop_shape` and {self.input_shapes[image_key]} for "
                f"`input_shapes[{image_key}]`."
            )
        supported_prediction_types = ["epsilon", "sample"]
        if self.prediction_type not in supported_prediction_types:
            raise ValueError(
                f"`prediction_type` must be one of {supported_prediction_types}. Got {self.prediction_type}."
            )
        supported_noise_schedulers = ["DDPM", "DDIM"]
        if self.noise_scheduler_type not in supported_noise_schedulers:
            raise ValueError(
                f"`noise_scheduler_type` must be one of {supported_noise_schedulers}. "
                f"Got {self.noise_scheduler_type}."
            )