Fixing the 'initial_lr' KeyError When Resuming Training from a Checkpoint #70

omarrayyann · 2024-12-07T14:08:29Z

If you attempt to resume training from a checkpoint, you might encounter the following error:

KeyError: "param 'initial_lr' is not specified in param_groups[0] when resuming an optimizer"

It can be fixed by adding an 'initial_lr' entry to each group when param_groups is initialized in train_diffusion_unet_image_workspace.py (around line 70):

param_groups = [
    {'params': self.model.model.parameters(), 'lr': cfg.optimizer.lr, 'initial_lr': cfg.optimizer.lr},
    {'params': obs_encorder_params, 'lr': obs_encorder_lr, 'initial_lr': obs_encorder_lr},
]
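
If more groups get added later, the same fix can be applied without repeating each learning rate by hand. A minimal sketch, assuming plain dict-based param groups; `ensure_initial_lr` is a hypothetical helper name (not part of the repository or of PyTorch), and the empty 'params' lists and learning rates are placeholders:

```python
def ensure_initial_lr(param_groups):
    """Hypothetical helper: make sure every group carries an 'initial_lr' key."""
    for group in param_groups:
        # Keep an existing 'initial_lr' if one is already present;
        # otherwise fall back to the group's configured 'lr'.
        group.setdefault('initial_lr', group['lr'])
    return param_groups


# Placeholder values stand in for the real parameters and learning rates.
param_groups = ensure_initial_lr([
    {'params': [], 'lr': 1e-4},
    {'params': [], 'lr': 1e-5},
])
```

The resulting list can be passed to the optimizer constructor in the same way as the hand-written version above.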

WilliamBonilla62 · 2024-12-09T19:45:15Z

Hey @omarrayyann, change this in train_diffusion_unet_image_workspace.py:

lr_scheduler = get_scheduler(
    cfg.training.lr_scheduler,
    optimizer=self.optimizer,
    num_warmup_steps=cfg.training.lr_warmup_steps,
    num_training_steps=(
        len(train_dataloader) * cfg.training.num_epochs
    ) // cfg.training.gradient_accumulate_every,
    # pytorch assumes stepping LRScheduler every epoch
    # however huggingface diffusers steps it every batch
    last_epoch=self.global_step - 1
)

to

lr_scheduler = get_scheduler(
    cfg.training.lr_scheduler,
    optimizer=self.optimizer,
    num_warmup_steps=cfg.training.lr_warmup_steps,
    num_training_steps=(
        len(train_dataloader) * cfg.training.num_epochs
    ) // cfg.training.gradient_accumulate_every,
    # pytorch assumes stepping LRScheduler every epoch
    # however huggingface diffusers steps it every batch
    last_epoch=-1
)
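
For context on why this avoids the error: PyTorch's scheduler constructor only fills in 'initial_lr' itself when last_epoch is -1, and for any other value it expects the key to be present already. The sketch below is a simplified pure-Python paraphrase of that check, not the actual PyTorch source; `check_param_groups` is a hypothetical name and the groups are placeholder dicts.

```python
def check_param_groups(param_groups, last_epoch=-1):
    """Simplified paraphrase of the scheduler's constructor check (hypothetical)."""
    if last_epoch == -1:
        # Fresh start: the scheduler records each group's starting rate itself.
        for group in param_groups:
            group.setdefault('initial_lr', group['lr'])
    else:
        # Resuming: every group must already carry 'initial_lr'.
        for i, group in enumerate(param_groups):
            if 'initial_lr' not in group:
                raise KeyError(
                    f"param 'initial_lr' is not specified in "
                    f"param_groups[{i}] when resuming an optimizer"
                )
    return param_groups


fresh = check_param_groups([{'params': [], 'lr': 1e-4}], last_epoch=-1)

try:
    check_param_groups([{'params': [], 'lr': 1e-4}], last_epoch=99)
    resumed_ok = True
except KeyError:
    resumed_ok = False
```

One trade-off to keep in mind: with last_epoch=-1 the schedule starts again from step 0, so any warmup is replayed after resuming unless the scheduler's position is restored some other way.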
