After learning yolov8I, I get an error that the best.pt file does not exist. #414

JoohyungDev · 2023-09-27T02:29:40Z

Search before asking

I have searched the HUB issues and found no similar bug report.

HUB Component

No response

Bug

Environment

Google Colab

Minimal Reproducible Example

I ran the Ultralytics example code as is.
The Colab runtime was interrupted once partway through.
I reconnected and ran the training to 100 epochs.
When training completed, it ended with an error saying that the best.pt file did not exist.

Additional

No response

UltralyticsAssistant · 2023-09-27T03:39:33Z

@jhp21c-dev the error message you're seeing typically means that the training run did not complete cleanly, which matches the detail you provided about the Colab runtime being interrupted. The best.pt file is saved in the run's weights directory whenever validation performance improves, so a run that is cut off can leave that directory incomplete or missing.

In your case, the interruption most likely halted the training process prematurely, which would explain the absence of a best.pt file.

In future, my advice would be to make sure training runs uninterrupted. If you experience frequent Colab session disconnections, consider saving intermediate checkpoint files periodically during training so that you don't lose all progress when a disconnection occurs.

Also, make sure you have enough disk space. If the disk is full, the model-saving step at the end of training can fail, which might also be contributing to the missing best.pt file.
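As a rough way to check the disk-space point above before starting a long run, the standard library's `shutil.disk_usage` reports free space for a given path. The helper names and the 1 GiB default threshold below are illustrative assumptions, not Ultralytics settings.

```python
import shutil


def free_gib(path="."):
    """Return the free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 2**30


def has_room_for_checkpoints(path=".", needed_gib=1.0):
    """Return True if at least `needed_gib` GiB are free (threshold is an assumption)."""
    return free_gib(path) >= needed_gib


# Example: check the current working directory before training.
# print(f"{free_gib('.'):.1f} GiB free")
```

Running this against the directory where the run is written (for example a mounted Drive folder) gives a quick indication of whether the final save is likely to fit.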

Keep up the good work with the learning and training process. Let me know if you encounter any other problems!

JoohyungDev · 2023-09-27T06:21:57Z

@UltralyticsAssistant Thanks for the answer.
If I run the Colab notebook again and train the same model, do I have to start from epoch 0?
Or can you tell me how to save a checkpoint?

UltralyticsAssistant · 2023-09-27T09:23:38Z

@jhp21c-dev yes, if your Colab runtime was interrupted and you run your notebook again, training will generally start from epoch 0 by default. As you pointed out, though, you can use checkpoints to avoid repeating epochs after an interruption.
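For local (non-HUB) Ultralytics runs, one way to avoid restarting from epoch 0 is to resume from the `last.pt` file of the interrupted run. The sketch below assumes the run directory survived the interruption (for example because it was written to a mounted Drive folder) and that it follows the default `runs/.../weights/last.pt` layout. `find_last_checkpoint` and `resume_training` are hypothetical helper names; `YOLO(...)` and `train(resume=True)` are the Ultralytics calls.

```python
from pathlib import Path


def find_last_checkpoint(runs_dir="runs"):
    """Return the most recently modified weights/last.pt under runs_dir, or None."""
    root = Path(runs_dir)
    if not root.is_dir():
        return None
    candidates = sorted(root.glob("**/weights/last.pt"),
                        key=lambda p: p.stat().st_mtime)
    return candidates[-1] if candidates else None


def resume_training(checkpoint):
    """Resume an interrupted Ultralytics run from its last.pt checkpoint."""
    # Imported lazily so the helper above can be used without ultralytics installed.
    from ultralytics import YOLO

    model = YOLO(str(checkpoint))
    return model.train(resume=True)


# Example usage (assumes a previous run exists under ./runs):
# ckpt = find_last_checkpoint("runs")
# if ckpt is not None:
#     resume_training(ckpt)
```

If no `last.pt` is found, there is nothing to resume from and training has to start over.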

A checkpoint is a saved copy of your model's state, written at regular intervals, for example after each epoch. It contains not only the model architecture but also the parameter values the model has learned so far. To resume training exactly where it stopped, the optimizer state and the current epoch number are usually saved as well. This is very useful when training stops unexpectedly, because it lets you reload the model from the last checkpoint.

You can modify your training script to save the model state after each epoch, or after a fixed number of epochs, depending on your requirements.

For implementation details, I recommend searching for "saving PyTorch model checkpoints". You'll find many useful examples that you can adapt to your needs.
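For orientation, here is a minimal, framework-agnostic sketch of the save-and-resume pattern described above. It uses the standard library's `pickle` as a stand-in so that it runs anywhere; in PyTorch the same idea is usually expressed with `torch.save` on a dictionary holding `model.state_dict()`, `optimizer.state_dict()` and the epoch number. All function names, the `stop_after` parameter and the toy "weights" value are assumptions for illustration.

```python
import os
import pickle
from pathlib import Path


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the target."""
    path = Path(path)
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    path = Path(path)
    if not path.exists():
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def train(total_epochs, ckpt_path, train_one_epoch, stop_after=None):
    """Train up to total_epochs, resuming from ckpt_path if it exists.

    stop_after simulates an interruption after that many epochs in this call.
    """
    state = load_checkpoint(ckpt_path) or {"epoch": 0, "weights": 0.0}
    start = state["epoch"]
    for epoch in range(start, total_epochs):
        if stop_after is not None and epoch - start >= stop_after:
            break
        state["weights"] = train_one_epoch(state["weights"])
        state["epoch"] = epoch + 1  # number of completed epochs
        save_checkpoint(ckpt_path, state)
    return state
```

Writing to a temporary file and renaming it means an interruption in the middle of a save does not corrupt the previous checkpoint. Calling `train` again with the same `ckpt_path` continues from the last completed epoch instead of epoch 0.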

If you have any more questions or need further clarification, let us know. We're here to help!

JoohyungDev added the bug label Sep 27, 2023

JoohyungDev closed this as completed Sep 28, 2023
