Not able to resume training #493

Closed
1 task done
VarunNelakanti opened this issue Dec 11, 2023 · 6 comments
Labels
bug (Something isn't working), Stale

Comments

@VarunNelakanti

Search before asking

  • I have searched the HUB issues and found no similar bug report.

HUB Component

Training

Bug

I was training a model when Google Colab ran out of RAM and the session stopped. I tried to restart the training, but it always throws the error shown in the attached screenshot.
[Image: Screenshot 2023-12-11 at 19 55 02]

As you can see, there is a shape error, and I do not understand why it is happening. The model trained through the 79th epoch, but training does not resume. Any help would be appreciated.

Environment

- Google Colab, Ultralytics

Minimal Reproducible Example

  1. Start a custom segmentation training with YOLOv8
  2. Stop the training
  3. Restart the training

Additional

No response

VarunNelakanti added the bug label Dec 11, 2023

👋 Hello @VarunNelakanti, thank you for raising an issue about Ultralytics HUB 🚀! Please visit our HUB Docs to learn more:

  • Quickstart. Start training and deploying YOLO models with HUB in seconds.
  • Datasets: Preparing and Uploading. Learn how to prepare and upload your datasets to HUB in YOLO format.
  • Projects: Creating and Managing. Group your models into projects for improved organization.
  • Models: Training and Exporting. Train YOLOv5 and YOLOv8 models on your custom datasets and export them to various formats for deployment.
  • Integrations. Explore different integration options for your trained models, such as TensorFlow, ONNX, OpenVINO, CoreML, and PaddlePaddle.
  • Ultralytics HUB App. Learn about the Ultralytics App for iOS and Android, which allows you to run models directly on your mobile device.
    • iOS. Learn about YOLO CoreML models accelerated on Apple's Neural Engine on iPhones and iPads.
    • Android. Explore TFLite acceleration on mobile devices.
  • Inference API. Understand how to use the Inference API for running your trained models in the cloud to generate predictions.

If this is a 🐛 Bug Report, please provide screenshots and steps to reproduce your problem to help us get started working on a fix.

If this is a ❓ Question, please provide as much information as possible, including dataset, model, environment details etc. so that we might provide the most helpful response.

We try to respond to all issues as promptly as possible. Thank you for your patience!

@VarunNelakanti
Author

Also, no changes have been made to either the model or the data. I tried to restart the training within 5 minutes of the crash.

@UltralyticsAssistant
Member

@VarunNelakanti hello! I'm sorry to hear that you're having trouble resuming your training.

The error screenshot suggests there's a mismatch between the expected shape of certain data and the shape of data being received during training, which commonly arises from discrepancies in dataset or checkpoint structure.
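To make the idea of a shape mismatch concrete, here is a minimal sketch that compares two name-to-shape mappings and reports where they disagree. It assumes you have already extracted the shapes yourself (for a PyTorch model, for example, from the tensors in `state_dict()`); the layer names and shapes below are hypothetical and are not taken from the screenshot.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def find_shape_mismatches(expected: Dict[str, Shape], actual: Dict[str, Shape]) -> List[str]:
    """Return human-readable differences between two name -> shape mappings."""
    problems = []
    for name in sorted(expected.keys() | actual.keys()):
        if name not in actual:
            problems.append(f"{name}: missing from checkpoint")
        elif name not in expected:
            problems.append(f"{name}: unexpected in checkpoint")
        elif expected[name] != actual[name]:
            problems.append(f"{name}: expected {expected[name]}, got {actual[name]}")
    return problems


# Hypothetical layer names and shapes, for illustration only.
model_shapes = {"head.conv.weight": (80, 256, 1, 1), "head.conv.bias": (80,)}
ckpt_shapes = {"head.conv.weight": (3, 256, 1, 1), "head.conv.bias": (3,)}

for line in find_shape_mismatches(model_shapes, ckpt_shapes):
    print(line)
```

A report like this narrows the problem down to specific layers, which is usually more useful than the raw traceback alone.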

It's possible that the abrupt interruption caused some inconsistencies in the training state. Here are a few steps to check:

  1. Verify that the checkpoint file has not been corrupted. Check if its file size looks reasonable compared to other checkpoints.
  2. Ensure the dataset is correctly set up and accessible, with no changes to file paths or structures after the crash.
  3. Confirm that the configurations (hyperparameters, model architecture) used to resume training are the same as those used initially.
  4. If possible, try to resume with a slightly earlier checkpoint to see if the most recent one is the issue.
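Steps 1 and 4 above can be sketched with the standard library alone. The snippet lists every `.pt` file in a weights folder with its size and flags files that are much smaller than the largest one, which may point to a write that was cut off by the crash. The 0.5 ratio is an arbitrary heuristic, and the folder path in the usage comment is an assumption about where a local run stores its weights; adjust both for your setup.

```python
from pathlib import Path
from typing import List, Tuple


def list_checkpoints(weights_dir: str) -> List[Tuple[str, int]]:
    """Return (filename, size in bytes) for every .pt file, sorted by name."""
    return sorted((p.name, p.stat().st_size) for p in Path(weights_dir).glob("*.pt"))


def suspiciously_small(checkpoints: List[Tuple[str, int]], ratio: float = 0.5) -> List[str]:
    """Flag files smaller than `ratio` times the largest checkpoint (heuristic only)."""
    if not checkpoints:
        return []
    largest = max(size for _, size in checkpoints)
    return [name for name, size in checkpoints if size < ratio * largest]


# Example usage (the path is an assumption, not taken from this issue):
# found = list_checkpoints("runs/segment/train/weights")
# print(found)
# print(suspiciously_small(found))
```

A size check cannot prove a file is intact, only that it is not obviously truncated; actually loading the checkpoint is the stronger test.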

Remember to check the Ultralytics HUB Docs for further guidance on resuming training processes. If you've gone through these steps and the issue persists, please provide the full error message (text version, if possible) and any relevant logs to better diagnose the problem.

Thank you for your patience, and let's get this sorted out! 🛠️

@VarunNelakanti
Author

Hello,

  1. The checkpoint file is not corrupted; its file size looks the same (assuming the reference is the size of the downloaded model checkpoint).
  2. No changes were made to the file paths or the dataset.
  3. No changes were made to the configuration either.
  4. I'm not sure how to do this, and I can't find it in the documentation. Could you point me to it?

@UltralyticsAssistant
Member

@VarunNelakanti great to hear that you've verified the checkpoint file and the dataset integrity.

To resume training from an earlier checkpoint:

  1. Locate the checkpoint files. These should be named with the epoch number or similar identifier.
  2. Select an earlier checkpoint file than the most recent one.
  3. Use the training command you usually would, but specify the path to the earlier checkpoint file as the resume argument.

This process will tell the training script to pick up from the epoch that the earlier checkpoint corresponds to, rather than the latest one.
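As a rough sketch of the steps above for a local run with the `ultralytics` Python package: the helper picks the newest epoch-numbered checkpoint before a given epoch, and `resume_from` passes it to `YOLO(...).train(resume=True)`. Two assumptions to verify for your setup: that periodic checkpoints named `epoch<N>.pt` exist at all (as far as I know they are only written when `save_period` was set for the original run; otherwise only `last.pt` and `best.pt` are saved), and that the weights folder is where the usage comment says. The function names here are illustrative, not part of any library.

```python
import re
from pathlib import Path
from typing import Optional


def pick_earlier_checkpoint(weights_dir: str, before_epoch: int) -> Optional[Path]:
    """Return the newest epoch<N>.pt file with N < before_epoch, or None if there is none."""
    candidates = []
    for path in Path(weights_dir).glob("epoch*.pt"):
        match = re.fullmatch(r"epoch(\d+)\.pt", path.name)
        if match and int(match.group(1)) < before_epoch:
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


def resume_from(checkpoint: Path) -> None:
    """Resume training from the given checkpoint (requires the `ultralytics` package)."""
    # Imported lazily so the helper above can be used without ultralytics installed.
    from ultralytics import YOLO

    model = YOLO(str(checkpoint))
    model.train(resume=True)


# Example usage (path and epoch number are assumptions for illustration):
# ckpt = pick_earlier_checkpoint("runs/segment/train/weights", before_epoch=79)
# if ckpt is not None:
#     resume_from(ckpt)
```

If no epoch-numbered files are present, `last.pt` is the only resumable state, and the helper returns `None` rather than guessing.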

If you’re having trouble with the specific command syntax, please consult the Ultralytics HUB Docs on the Training section, which should provide the necessary guidance for resuming from a given checkpoint.

Should you encounter any further issues after trying to resume from an earlier checkpoint, please provide the exact error text (not as a screenshot) here, and I’ll be happy to assist you in troubleshooting the matter. Keep up the good work! 🌟


👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

github-actions bot added the Stale label Jan 12, 2024
github-actions bot closed this as not planned (stale) Jan 22, 2024

2 participants