Not able to resume training #493

VarunNelakanti · 2023-12-11T18:58:45Z

Search before asking

I have searched the HUB issues and found no similar bug report.

HUB Component

Training

Bug

I was training a model in Google Colab when the session ran out of RAM and stopped running. I tried to restart the training, but it always throws the error I'm attaching to this issue.

As you can see, there is a shape error, and I do not understand why it is happening. Training reached the 79th epoch, and it will not resume from there. Any help would be appreciated.

Environment

- Google Colab, Ultralytics

Minimal Reproducible Example

Start a custom segmentation training on YOLOv8
Stop the training
Restart the training

Additional

No response

github-actions · 2023-12-11T18:59:12Z

👋 Hello @VarunNelakanti, thank you for raising an issue about Ultralytics HUB 🚀! Please visit our HUB Docs to learn more:

Quickstart. Start training and deploying YOLO models with HUB in seconds.
Datasets: Preparing and Uploading. Learn how to prepare and upload your datasets to HUB in YOLO format.
Projects: Creating and Managing. Group your models into projects for improved organization.
Models: Training and Exporting. Train YOLOv5 and YOLOv8 models on your custom datasets and export them to various formats for deployment.
Integrations. Explore different integration options for your trained models, such as TensorFlow, ONNX, OpenVINO, CoreML, and PaddlePaddle.
Ultralytics HUB App. Learn about the Ultralytics App for iOS and Android, which allows you to run models directly on your mobile device.
- iOS. Learn about YOLO CoreML models accelerated on Apple's Neural Engine on iPhones and iPads.
- Android. Explore TFLite acceleration on mobile devices.
Inference API. Understand how to use the Inference API for running your trained models in the cloud to generate predictions.

If this is a 🐛 Bug Report, please provide screenshots and steps to reproduce your problem to help us get started working on a fix.

If this is a ❓ Question, please provide as much information as possible, including dataset, model, environment details etc. so that we might provide the most helpful response.

We try to respond to all issues as promptly as possible. Thank you for your patience!

VarunNelakanti · 2023-12-11T19:03:33Z

Also, no changes have been made to either the model or the data. I tried to restart the training within 5 minutes of the crash.

UltralyticsAssistant · 2023-12-12T01:22:25Z

@VarunNelakanti hello! I'm sorry to hear that you're having trouble resuming your training.

The error screenshot suggests there's a mismatch between the expected shape of certain data and the shape of data being received during training, which commonly arises from discrepancies in dataset or checkpoint structure.

It's possible that the abrupt interruption caused some inconsistencies in the training state. Here are a few steps to check:

1. Verify that the checkpoint file has not been corrupted. Check if its file size looks reasonable compared to other checkpoints.
2. Ensure the dataset is correctly set up and accessible, with no changes to file paths or structures after the crash.
3. Confirm that the configurations (hyperparameters, model architecture) used to resume training are the same as those used initially.
4. If possible, try to resume with a slightly earlier checkpoint to see if the most recent one is the issue.
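
The file-size check in the first step could be sketched in Python as follows. This is a minimal sketch, assuming the checkpoints are `.pt` files sitting together in one local weights directory; the function name `find_suspicious_checkpoints` and the 50% threshold are illustrative choices, not part of Ultralytics. A small file only hints at a truncated write and does not prove corruption.

```python
from pathlib import Path
from statistics import median


def find_suspicious_checkpoints(weights_dir, min_ratio=0.5):
    """Return names of .pt files much smaller than the median checkpoint size.

    `weights_dir` and `min_ratio` are illustrative assumptions: any file
    smaller than `min_ratio` times the median size is flagged for inspection.
    """
    sizes = {p.name: p.stat().st_size for p in Path(weights_dir).glob("*.pt")}
    if not sizes:
        return []  # nothing to compare
    typical = median(sizes.values())
    return sorted(name for name, size in sizes.items() if size < min_ratio * typical)
```

An empty list means no checkpoint stands out by size alone; a flagged file is a candidate to skip when resuming.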

Remember to check the Ultralytics HUB Docs for further guidance on resuming training processes. If you've gone through these steps and the issue persists, please provide the full error message (text version, if possible) and any relevant logs to better diagnose the problem.

Thank you for your patience, and let's get this sorted out! 🛠️

VarunNelakanti · 2023-12-12T12:56:33Z

Hello,

1. The checkpoint file is not corrupted; its file size looks the same (if that refers to the size of the downloaded model checkpoint).
2. No changes were made to the file paths or the dataset.
3. No changes were made to the configurations either.
4. I'm not sure how to resume from an earlier checkpoint. I am not able to find it in the documentation. Could you point me to it?

UltralyticsAssistant · 2023-12-12T20:18:49Z

@VarunNelakanti great to hear that you've verified the checkpoint file and the dataset integrity.

To resume training from an earlier checkpoint:

Locate the checkpoint files. These should be named with the epoch number or similar identifier.
Select an earlier checkpoint file than the most recent one.
Use the training command you usually would, but specify the path to the earlier checkpoint file as the resume argument.

This process will tell the training script to pick up from the epoch that the earlier checkpoint corresponds to, rather than the latest one.
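
The selection step above can be sketched in Python. This is a sketch under stated assumptions, not the HUB procedure: the `epoch<N>.pt` naming is assumed (to my knowledge the open-source `ultralytics` package only writes such files when training was started with `save_period` greater than 0, and otherwise keeps just `last.pt` and `best.pt`), and `pick_earlier_checkpoint` and `resume_from` are hypothetical helper names. HUB-managed runs may store checkpoints differently.

```python
import re
from pathlib import Path


def pick_earlier_checkpoint(weights_dir, before_epoch):
    """Return the newest epoch<N>.pt with N < before_epoch, or None if none exists.

    The epoch<N>.pt naming scheme is an assumption about how periodic
    checkpoints were saved.
    """
    pattern = re.compile(r"epoch(\d+)\.pt")
    candidates = []
    for path in Path(weights_dir).glob("epoch*.pt"):
        match = pattern.fullmatch(path.name)
        if match and int(match.group(1)) < before_epoch:
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


def resume_from(checkpoint_path):
    """Resume training from a checkpoint using the ultralytics Python API."""
    # Imported lazily so the helper above works without ultralytics installed.
    from ultralytics import YOLO

    model = YOLO(str(checkpoint_path))
    return model.train(resume=True)
```

If `pick_earlier_checkpoint` returns `None`, no earlier periodic checkpoint was saved, and the only files to try are `last.pt` and `best.pt`.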

If you’re having trouble with the specific command syntax, please consult the Ultralytics HUB Docs on the Training section, which should provide the necessary guidance for resuming from a given checkpoint.

Should you encounter any further issues after trying to resume from an earlier checkpoint, please provide the exact error text (not as a screenshot) here, and I’ll be happy to assist you in troubleshooting the matter. Keep up the good work! 🌟

github-actions · 2024-01-12T00:19:19Z

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see the links below:

Docs: https://docs.ultralytics.com
HUB: https://hub.ultralytics.com
Community: https://community.ultralytics.com

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

VarunNelakanti added the bug label Dec 11, 2023

github-actions bot added the Stale label Jan 12, 2024

github-actions bot closed this as not planned Jan 22, 2024
