Not able to resume training #493
Comments
👋 Hello @VarunNelakanti, thank you for raising an issue about Ultralytics HUB 🚀! Please visit our HUB Docs to learn more:
If this is a 🐛 Bug Report, please provide screenshots and steps to reproduce your problem to help us get started working on a fix. If this is a ❓ Question, please provide as much information as possible, including dataset, model, environment details etc. so that we might provide the most helpful response. We try to respond to all issues as promptly as possible. Thank you for your patience!
Also, no changes have been made to either the model or the data. I tried to restart the training within 5 minutes of it crashing.
@VarunNelakanti hello! I'm sorry to hear that you're having trouble resuming your training. The error screenshot suggests there's a mismatch between the expected shape of certain data and the shape of data being received during training, which commonly arises from discrepancies in dataset or checkpoint structure. It's possible that the abrupt interruption caused some inconsistencies in the training state. Here are a few steps to check:
Remember to check the Ultralytics HUB Docs for further guidance on resuming training processes. If you've gone through these steps and the issue persists, please provide the full error message (text version, if possible) and any relevant logs to better diagnose the problem. Thank you for your patience, and let's get this sorted out! 🛠️
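One way to check whether the interruption left a damaged checkpoint file is sketched below. It assumes the checkpoint was written with PyTorch's default zip-based serialization (PyTorch 1.6 or later), so a file truncated by the crash no longer parses as a zip archive. The function name `checkpoint_archive_ok` is hypothetical, and the check only catches file-level corruption; it cannot explain a shape mismatch in a checkpoint that is otherwise intact.

```python
import zipfile
from pathlib import Path


def checkpoint_archive_ok(path: Path) -> bool:
    """Return True if `path` is a readable zip archive whose members pass CRC checks.

    Assumption: the checkpoint uses PyTorch's zip-based format. Files saved
    with the legacy (non-zip) format would be reported as not OK.
    """
    if not path.is_file() or not zipfile.is_zipfile(path):
        return False
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() returns the name of the first corrupt member,
            # or None when every member passes its CRC check.
            return archive.testzip() is None
    except zipfile.BadZipFile:
        return False
```

If this returns `False` for the latest checkpoint, the file was probably cut off while being written, and an older checkpoint is the safer starting point.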
Hello,
@VarunNelakanti great to hear that you've verified the checkpoint file and the dataset integrity. To resume training from an earlier checkpoint:
This process will tell the training script to pick up from the epoch that the earlier checkpoint corresponds to, rather than the latest one. If you’re having trouble with the specific command syntax, please consult the Ultralytics HUB Docs on the Training section, which should provide the necessary guidance for resuming from a given checkpoint. Should you encounter any further issues after trying to resume from an earlier checkpoint, please provide the exact error text (not as a screenshot) here, and I’ll be happy to assist you in troubleshooting the matter. Keep up the good work! 🌟
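As a rough illustration of choosing an earlier checkpoint, the sketch below picks the most recent periodic checkpoint saved before a given epoch. It assumes periodic checkpoints exist in the weights directory and are named `epoch<N>.pt`; that naming, the directory layout, and the helper `pick_earlier_checkpoint` are assumptions for illustration and are not confirmed in this thread.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed naming scheme for periodic checkpoints: epoch<N>.pt
EPOCH_FILE = re.compile(r"^epoch(\d+)\.pt$")


def pick_earlier_checkpoint(weights_dir: Path, before_epoch: int) -> Optional[Path]:
    """Return the newest epoch<N>.pt with N < before_epoch, or None if there is none."""
    best_epoch = -1
    best_path: Optional[Path] = None
    for path in weights_dir.iterdir():
        match = EPOCH_FILE.match(path.name)
        if match is None:
            continue  # skip last.pt, best.pt, and unrelated files
        epoch = int(match.group(1))
        if best_epoch < epoch < before_epoch:
            best_epoch, best_path = epoch, path
    return best_path
```

With the `ultralytics` package, the returned path could then be loaded with `YOLO(str(path))` and training continued with `model.train(resume=True)`. Whether a HUB-managed run accepts a local checkpoint this way is not established here, so treat it as something to verify against the HUB Docs.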
👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help. For additional resources and information, please see the links below:
Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed! Thank you for your contributions to YOLO 🚀 and Vision AI ⭐
Search before asking
HUB Component
Training
Bug
I was training a model when Google Colab ran out of RAM and the session stopped. I tried to restart the training, but it always throws the error shown in the screenshot attached to this issue.
![Screenshot 2023-12-11 at 19 55 02](https://private-user-images.githubusercontent.com/32270482/289640405-c3dd00f7-192c-474f-a54f-48b812dfd908.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjAwMjczMjQsIm5iZiI6MTcyMDAyNzAyNCwicGF0aCI6Ii8zMjI3MDQ4Mi8yODk2NDA0MDUtYzNkZDAwZjctMTkyYy00NzRmLWE1NGYtNDhiODEyZGZkOTA4LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MDMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzAzVDE3MTcwNFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWZiNDAwNDFkOGE0ZTkxNThmNmU2ZjZiYzhjNDgzZTY0MzNjMGUzMTQ1ZmU5ODI5MGY5ODg5ZmYyZWQyY2I0OTEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.l9T3BpJesQXDVB_TzPb-JBaVDz1c9qZboAm-h0FR5QM)
As you can see, there is a shape error, and I do not understand why it is happening. The model trained until the 79th epoch, and now it will not resume training. Any help would be appreciated.
Environment
- Google Colab, Ultralytics
Minimal Reproducible Example
Additional
No response