lightning_template.utils.callbacks.model_checkpoint
Classes
ModelCheckpointWithLinkBest | Save the model periodically by monitoring a quantity. Every metric logged with log() or log_dict() is a candidate for the monitor key.
Module Contents
- class lightning_template.utils.callbacks.model_checkpoint.ModelCheckpointWithLinkBest(save_best: bool | None = None, *args, **kwargs)
Bases: lightning.pytorch.callbacks.ModelCheckpoint
Save the model periodically by monitoring a quantity. Every metric logged with log() or log_dict() is a candidate for the monitor key. For more information, see checkpointing.
After training finishes, use best_model_path to retrieve the path to the best checkpoint file and best_model_score to retrieve its score.
- Parameters:
dirpath –
    directory to save the model file.
    Example:
        # custom path
        # saves a file like: my/path/epoch=0-step=10.ckpt
        >>> checkpoint_callback = ModelCheckpoint(dirpath='my/path/')
    By default, dirpath is None and will be set at runtime to the location specified by the Trainer’s default_root_dir argument, and if the Trainer uses a logger, the path will also contain logger name and version.
filename –
    checkpoint filename. Can contain named formatting options to be auto-filled.
    Example:
        # save any arbitrary metrics like `val_loss`, etc. in name
        # saves a file like: my/path/epoch=2-val_loss=0.02-other_metric=0.03.ckpt
        >>> checkpoint_callback = ModelCheckpoint(
        ...     dirpath='my/path',
        ...     filename='{epoch}-{val_loss:.2f}-{other_metric:.2f}'
        ... )
    By default, filename is None and will be set to '{epoch}-{step}', where “epoch” and “step” match the number of finished epochs and optimizer steps respectively.
monitor – quantity to monitor. By default it is None, which saves a checkpoint only for the last epoch.
verbose – verbosity mode. Default: False.
save_last – When True, saves a last.ckpt copy whenever a checkpoint file gets saved. Can be set to 'link' on a local filesystem to create a symbolic link. This allows accessing the latest checkpoint in a deterministic manner. Default: None.
save_top_k – if save_top_k == k, the best k models according to the quantity monitored will be saved. If save_top_k == 0, no models are saved. If save_top_k == -1, all models are saved. Please note that the monitors are checked every every_n_epochs epochs. If save_top_k >= 2 and the callback is called multiple times inside an epoch, and the filename remains unchanged, the name of the saved file will be appended with a version count starting with v1 to avoid collisions unless enable_version_counter is set to False. The version counter is unrelated to the top-k ranking of the checkpoint, and we recommend formatting the filename to include the monitored metric to avoid collisions.
mode – one of {min, max}. If save_top_k != 0, the decision to overwrite the current save file is made based on either the maximization or the minimization of the monitored quantity. For 'val_acc', this should be 'max', for 'val_loss' this should be 'min', etc.
auto_insert_metric_name – When True, the checkpoint filenames will contain the metric name. For example, filename='checkpoint_{epoch:02d}-{acc:02.0f}' with epoch 1 and acc 1.12 will resolve to checkpoint_epoch=01-acc=01.ckpt. It is useful to set this to False when metric names contain /, as this will result in extra folders. For example, filename='epoch={epoch}-step={step}-val_acc={val/acc:.2f}', auto_insert_metric_name=False
save_weights_only – if True, then only the model’s weights will be saved. Otherwise, the optimizer states, lr-scheduler states, etc. are added in the checkpoint too.
every_n_train_steps – Number of training steps between checkpoints. If every_n_train_steps == None or every_n_train_steps == 0, we skip saving during training. To disable, set every_n_train_steps = 0. This value must be None or non-negative. This must be mutually exclusive with train_time_interval and every_n_epochs.
train_time_interval – Checkpoints are monitored at the specified time interval. For all practical purposes, this cannot be smaller than the amount of time it takes to process a single training batch. This is not guaranteed to execute at the exact time specified, but should be close. This must be mutually exclusive with every_n_train_steps and every_n_epochs.
every_n_epochs – Number of epochs between checkpoints. This value must be None or non-negative. To disable saving top-k checkpoints, set every_n_epochs = 0. This argument does not impact the saving of save_last=True checkpoints. If all of every_n_epochs, every_n_train_steps and train_time_interval are None, we save a checkpoint at the end of every epoch (equivalent to every_n_epochs = 1). If every_n_epochs == None and either every_n_train_steps != None or train_time_interval != None, saving at the end of each epoch is disabled (equivalent to every_n_epochs = 0). This must be mutually exclusive with every_n_train_steps and train_time_interval. Setting both ModelCheckpoint(..., every_n_epochs=V, save_on_train_epoch_end=False) and Trainer(max_epochs=N, check_val_every_n_epoch=M) will only save checkpoints at epochs 0 < E <= N where both values for every_n_epochs and check_val_every_n_epoch evenly divide E.
save_on_train_epoch_end – Whether to run checkpointing at the end of the training epoch. If this is False, then the check runs at the end of the validation.
enable_version_counter – Whether to append a version to the existing file name. If this is False, then the checkpoint files will be overwritten.
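The interaction between every_n_epochs and check_val_every_n_epoch described above reduces to a divisibility check. The following is a minimal sketch of that arithmetic only, not Lightning's implementation. The helper name saved_epochs is hypothetical, epochs are counted from 1 to match the 0 < E <= N wording, and check_val_every_n_epoch is assumed to be at least 1.

```python
from __future__ import annotations


def saved_epochs(max_epochs: int, every_n_epochs: int, check_val_every_n_epoch: int) -> list[int]:
    """Return the epochs E (0 < E <= max_epochs) that both intervals evenly divide.

    Hypothetical helper: it models only the documented rule for
    save_on_train_epoch_end=False and ignores every other checkpoint trigger.
    """
    if check_val_every_n_epoch < 1:
        raise ValueError("this sketch assumes check_val_every_n_epoch >= 1")
    if every_n_epochs == 0:
        # Documented as the way to disable saving top-k checkpoints.
        return []
    return [
        epoch
        for epoch in range(1, max_epochs + 1)
        if epoch % every_n_epochs == 0 and epoch % check_val_every_n_epoch == 0
    ]


if __name__ == "__main__":
    # every_n_epochs=2 and check_val_every_n_epoch=3 only coincide at multiples of 6.
    print(saved_epochs(max_epochs=12, every_n_epochs=2, check_val_every_n_epoch=3))  # [6, 12]
```

If the two intervals share no small common multiple within max_epochs, the list is empty, which is worth checking when a run produces no top-k checkpoints.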
Note
For extra customization, ModelCheckpoint includes the following attributes:
CHECKPOINT_JOIN_CHAR = "-"
CHECKPOINT_EQUALS_CHAR = "="
CHECKPOINT_NAME_LAST = "last"
FILE_EXTENSION = ".ckpt"
STARTING_VERSION = 1
For example, you can change the default last checkpoint name by doing
checkpoint_callback.CHECKPOINT_NAME_LAST = "{epoch}-last"
If you want to checkpoint every N hours, every M train batches, and/or every K val epochs, then you should create multiple ModelCheckpoint callbacks.
If the checkpoint’s dirpath changed from what it was before while resuming the training, only best_model_path will be reloaded and a warning will be issued.
If you provide a filename on a mounted device where changing permissions is not allowed (causing chmod to raise a PermissionError), install fsspec>=2025.5.0. Then the error is caught, the file’s permissions remain unchanged, and the checkpoint is still saved. Otherwise, no checkpoint will be saved and training stops.
- Raises:
MisconfigurationException – If save_top_k is smaller than -1, if monitor is None and save_top_k is none of None, -1, and 0, or if mode is none of "min" or "max".
ValueError – If trainer.save_checkpoint is None.
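The save_top_k and mode parameters together describe a small retention policy. The following is a simplified sketch of that policy in plain Python, assuming scores are plain floats and checkpoints are identified by their file paths. The function update_top_k and the kept dictionary are hypothetical names for illustration; Lightning's own bookkeeping is more involved and is not reproduced here.

```python
from __future__ import annotations


def update_top_k(kept: dict[str, float], path: str, score: float, k: int, mode: str = "min") -> bool:
    """Offer a new checkpoint to the retained set; return True if it is retained.

    kept maps checkpoint path -> monitored score and is modified in place.
    k follows the documented meaning: 0 keeps nothing, -1 keeps everything,
    and k > 0 keeps the best k according to mode.
    """
    if mode not in ("min", "max"):
        raise ValueError("mode must be 'min' or 'max'")
    if k == 0:
        return False
    kept[path] = score
    if k == -1 or len(kept) <= k:
        return True
    # With mode='min' the worst entry has the largest score, and vice versa.
    pick_worst = max if mode == "min" else min
    worst = pick_worst(kept, key=kept.get)
    del kept[worst]
    return worst != path


if __name__ == "__main__":
    kept: dict[str, float] = {}
    for epoch, val_loss in enumerate([0.5, 0.3, 0.4, 0.2, 0.9]):
        update_top_k(kept, f"epoch={epoch}.ckpt", val_loss, k=2, mode="min")
    print(sorted(kept))  # ['epoch=1.ckpt', 'epoch=3.ckpt']
```

In a real run the evicted entry would also be deleted from disk; the sketch only tracks which paths remain.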
Example:
>>> from lightning.pytorch import Trainer
>>> from lightning.pytorch.callbacks import ModelCheckpoint

# saves checkpoints to 'my/path/' at every epoch
>>> checkpoint_callback = ModelCheckpoint(dirpath='my/path/')
>>> trainer = Trainer(callbacks=[checkpoint_callback])

# save epoch and val_loss in name
# saves a file like: my/path/sample-mnist-epoch=02-val_loss=0.32.ckpt
>>> checkpoint_callback = ModelCheckpoint(
...     monitor='val_loss',
...     dirpath='my/path/',
...     filename='sample-mnist-{epoch:02d}-{val_loss:.2f}'
... )

# save epoch and val_loss in name, but specify the formatting yourself (e.g. to avoid problems with Tensorboard
# or Neptune, due to the presence of characters like '=' or '/')
# saves a file like: my/path/sample-mnist-epoch02-val_loss0.32.ckpt
>>> checkpoint_callback = ModelCheckpoint(
...     monitor='val/loss',
...     dirpath='my/path/',
...     filename='sample-mnist-epoch{epoch:02d}-val_loss{val/loss:.2f}',
...     auto_insert_metric_name=False
... )

# retrieve the best checkpoint after training
checkpoint_callback = ModelCheckpoint(dirpath='my/path/')
trainer = Trainer(callbacks=[checkpoint_callback])
model = ...
trainer.fit(model)
checkpoint_callback.best_model_path
Tip
Saving and restoring multiple checkpoint callbacks at the same time is supported under variation in the following arguments:
monitor, mode, every_n_train_steps, every_n_epochs, train_time_interval
Read more: Persisting Callback State
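The filename templates used in the Example above, and the effect of auto_insert_metric_name, can be reproduced with ordinary string formatting. The following is a simplified sketch under stated assumptions: metrics are plain Python numbers, the function name resolve_filename is hypothetical, and the logic approximates the documented behaviour rather than copying Lightning's code.

```python
from __future__ import annotations

import re
from typing import Mapping


def resolve_filename(
    template: str,
    metrics: Mapping[str, float],
    auto_insert_metric_name: bool = True,
    extension: str = ".ckpt",
) -> str:
    """Fill a checkpoint filename template such as '{epoch}-{val_loss:.2f}'."""
    # Collect every '{name' prefix, longest first, so that a short name
    # never rewrites part of a longer one.
    groups = sorted(set(re.findall(r"(\{.*?)[:\}]", template)), key=len, reverse=True)
    values = dict(metrics)
    for group in groups:
        name = group[1:]
        if auto_insert_metric_name:
            # '{val_loss' becomes 'val_loss={val_loss'
            template = template.replace(group, name + "={" + name)
        # Index into a single mapping so names containing '/' still format.
        template = template.replace(group, "{0[" + name + "]")
        values.setdefault(name, 0)  # assumption: unknown metrics render as 0
    return template.format(values) + extension


if __name__ == "__main__":
    print(resolve_filename("checkpoint_{epoch:02d}-{acc:02.0f}", {"epoch": 1, "acc": 1.12}))
    # checkpoint_epoch=01-acc=01.ckpt
    print(
        resolve_filename(
            "sample-mnist-epoch{epoch:02d}-val_loss{val/loss:.2f}",
            {"epoch": 2, "val/loss": 0.32},
            auto_insert_metric_name=False,
        )
    )
    # sample-mnist-epoch02-val_loss0.32.ckpt
```

Leaving auto_insert_metric_name at True with a metric named val/loss would put a "/" into the resolved name, which is the extra-folder problem the parameter description warns about.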
- CHECKPOINT_NAME_BEST = 'best'
- save_best = None
- _update_best_and_save(current: torch.Tensor, trainer: lightning.pytorch.Trainer, monitor_candidates: Dict[str, lightning.pytorch.utilities.types._METRIC]) -> None
- _save_checkpoint(trainer: lightning.pytorch.Trainer, filepath: str) -> None
- _save_last_checkpoint(trainer: lightning.pytorch.Trainer, monitor_candidates: Dict[str, torch.Tensor]) -> None
- _save_best_checkpoint(trainer: lightning.pytorch.Trainer, monitor_candidates: Dict[str, lightning.pytorch.utilities.types._METRIC]) -> None
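This page does not describe what save_best and _save_best_checkpoint do. Judging only from the class name and CHECKPOINT_NAME_BEST = 'best', one plausible reading is that the callback keeps a best.ckpt link pointing at best_model_path, in the same spirit as save_last='link'. The following standard-library sketch illustrates that assumed behaviour; link_best and demo are hypothetical helpers, not members of the class.

```python
import os
import tempfile

CHECKPOINT_NAME_BEST = "best"  # mirrors the class attribute listed above
FILE_EXTENSION = ".ckpt"       # mirrors ModelCheckpoint.FILE_EXTENSION


def link_best(best_model_path: str, dirpath: str) -> str:
    """Create or refresh <dirpath>/best.ckpt as a relative symlink (assumed behaviour)."""
    link_path = os.path.join(dirpath, CHECKPOINT_NAME_BEST + FILE_EXTENSION)
    if os.path.lexists(link_path):
        # Remove a stale link left over from an earlier best checkpoint.
        os.remove(link_path)
    target = os.path.relpath(best_model_path, start=dirpath)
    os.symlink(target, link_path)
    return link_path


def demo() -> tuple:
    """Write a dummy checkpoint file and link it as best.ckpt in a temporary directory."""
    with tempfile.TemporaryDirectory() as dirpath:
        checkpoint = os.path.join(dirpath, "epoch=2-val_loss=0.02.ckpt")
        with open(checkpoint, "wb") as handle:
            handle.write(b"weights")
        link_path = link_best(checkpoint, dirpath)
        return os.path.basename(link_path), os.readlink(link_path)


if __name__ == "__main__":
    print(demo())
```

A relative link target keeps the link valid if the whole checkpoint directory is moved. Symbolic links need a filesystem and platform that support them, which matches the local-filesystem caveat given for save_last='link'.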