After learning yolov8I, I get an error that the best.pt file does not exist. #414

Closed
1 task done
JoohyungDev opened this issue Sep 27, 2023 · 3 comments
Labels
bug Something isn't working

Comments

@JoohyungDev

JoohyungDev commented Sep 27, 2023

Search before asking

  • I have searched the HUB issues and found no similar bug report.

HUB Component

No response

Bug

[Two attached images, not preserved in this copy]

Environment

  • Google Colab

Minimal Reproducible Example

  1. Run the Ultralytics example code as is.
  2. The Colab runtime was interrupted once partway through training.
  3. I reconnected and ran the training to 100 epochs.
  4. When training completed, it ended with an error saying that the best.pt file did not exist.

Additional

No response

@JoohyungDev JoohyungDev added the bug Something isn't working label Sep 27, 2023
@UltralyticsAssistant
Member

@jhp21c-dev this error typically means that training did not finish cleanly, which matches your note that the Colab runtime was interrupted. The best.pt file holds the weights from the epoch with the best validation performance, so it only exists if that checkpoint was written and is still on disk when training ends.

In your case, the interruption most likely halted the training process prematurely, which would explain the missing best.pt file.

In future, try to keep the training session running without interruption. If Colab disconnects often, consider saving intermediate checkpoint files periodically during training, so that a disconnection does not cost you all of your progress.

Also, make sure you have enough disk space. If the disk is full, the model-saving step at the end of training can fail, which could also contribute to the missing best.pt file.
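A quick way to rule out the disk-space cause is to check free space before starting a long run. The snippet below is a minimal sketch using only the Python standard library; the helper name `has_free_space` and the 1 GiB threshold are illustrative assumptions, not part of Ultralytics.

```python
import shutil


def has_free_space(path=".", min_free_gb=1.0):
    """Return True if the filesystem holding `path` has at least `min_free_gb` GiB free.

    Hypothetical helper for illustration; pick a threshold that fits your model size.
    """
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= min_free_gb * 1024**3


if __name__ == "__main__":
    usage = shutil.disk_usage(".")
    print(f"free: {usage.free / 1024**3:.1f} GiB of {usage.total / 1024**3:.1f} GiB")
    if not has_free_space(".", min_free_gb=1.0):
        print("Warning: low disk space, saving checkpoints may fail.")
```

In Colab, the path to check would be wherever the run directory lives, for example the working directory or a mounted drive.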

Keep up the good work with the learning and training process. Let me know if you encounter any other problems!

@JoohyungDev
Author

@UltralyticsAssistant Thanks for the answer.
If I run Colab again and train the same model, do I have to start from epoch 0?
Or can you tell me how to save a checkpoint?

@UltralyticsAssistant
Member

@jhp21c-dev yes, if the Colab runtime was interrupted and you simply run the notebook again, training will generally start from epoch 0 by default. To avoid repeating epochs after an interruption, you can use checkpoints.

A checkpoint is a saved copy of the model's state, written at regular intervals, for example after each epoch. It contains not only the model architecture but also the parameter values the model has learned so far. If training stops unexpectedly, you can reload the last checkpoint and continue from that point.

You can modify your training script to save the model state after each epoch, or at a fixed interval of epochs, depending on your requirements.
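The save-and-resume pattern can be sketched in a framework-independent way. The toy loop below uses only the standard library: a running sum stands in for the model parameters, and `stop_after` simulates an interruption. All names here (`save_checkpoint`, `load_checkpoint`, `train`) are hypothetical. In a real PyTorch script you would typically store `model.state_dict()`, the optimizer state, and the epoch number with `torch.save`, and restore them with `torch.load` and `load_state_dict`.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write `state` atomically: dump to a temp file, then rename over the target.

    The rename keeps a half-written file from replacing a good checkpoint
    if the process dies mid-save.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def train(total_epochs, ckpt_path, stop_after=None):
    """Toy training loop that resumes from `ckpt_path` when it exists."""
    state = load_checkpoint(ckpt_path) or {"epoch": 0, "weights": 0.0}
    for epoch in range(state["epoch"], total_epochs):
        if stop_after is not None and epoch >= stop_after:
            break  # simulate the runtime being interrupted
        state["weights"] += 0.5       # stand-in for one epoch of learning
        state["epoch"] = epoch + 1    # next epoch to run after a resume
        save_checkpoint(state, ckpt_path)
    return state
```

Calling `train(5, path, stop_after=2)` and then `train(5, path)` runs epochs 0 and 1 in the first call and epochs 2 to 4 in the second, instead of starting over.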

For more detail, I highly recommend searching for "saving PyTorch model checkpoints" to get started. You'll find many useful examples that you can adapt to your needs.
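For training with the `ultralytics` Python package specifically, the trainer already writes `last.pt` (and `best.pt`) into the run's `weights` folder as training proceeds, and `train(resume=True)` continues an interrupted run from `last.pt`. The sketch below assumes local training with the default `runs` directory rather than a HUB-managed session. The helper names, the `yolov8n.pt` weights, the `coco128.yaml` dataset, and the `save_period=10` value are placeholders for illustration.

```python
from pathlib import Path


def find_latest_checkpoint(runs_dir="runs"):
    """Return the most recently modified last.pt under `runs_dir`, or None."""
    root = Path(runs_dir)
    if not root.is_dir():
        return None
    candidates = list(root.rglob("last.pt"))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def resume_or_start(runs_dir="runs", data="coco128.yaml", epochs=100):
    """Resume an interrupted run if a last.pt exists, else start a new one."""
    # Imported lazily so the helper above works without ultralytics installed.
    from ultralytics import YOLO

    ckpt = find_latest_checkpoint(runs_dir)
    if ckpt is not None:
        # Continue from the saved epoch, optimizer state, and settings.
        return YOLO(str(ckpt)).train(resume=True)
    # Fresh run; save_period additionally keeps a checkpoint every 10 epochs.
    return YOLO("yolov8n.pt").train(data=data, epochs=epochs, save_period=10)
```

Resuming applies to a run that stopped before finishing. On Colab the run directory also has to survive the disconnect, so one option is to point the `project` training argument at a mounted Google Drive folder.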

If you have any more questions or need further clarification, let us know. We're here to help!
