mirror of
https://github.com/huggingface/lerobot.git
synced 2026-06-01 03:11:29 +00:00
Adam optimizer states (exp_avg + exp_avg_sq) require ~16GB extra on top of model params and gradients for 4B parameter models, exceeding the 22GB GPU. SGD has zero optimizer state overhead and profiling only measures forward/backward timing anyway. Also adds torch.cuda.empty_cache() after deterministic forward to release transient memory before the training loop starts. Made-with: Cursor
5.6 KiB
5.6 KiB