Memory leak when using custom DataGenerator #19907
Comments
I have nothing to provide you but solidarity. I am running into this same problem with a TFRecords data pipeline:
I'll spare you the plot, but I am having the same issue with a vanilla TF dataset. I've tried removing interleave, removing GZIP compression, calling TFRecordDataset directly, removing batching, removing prefetching... nothing. I believe this is a TensorFlow problem and (in particular) a TF Dataset problem: tensorflow/tensorflow#65675 This TF 2.16 + K3 era has been a disaster. Not the Keras part -- just some growing pains. But TF, man...
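For context, a minimal self-contained sketch of the kind of pipeline described above (GZIP-compressed TFRecord shards read back through `interleave`, batching, and prefetching). The file names, feature key, and record contents here are invented for illustration; this is not the commenter's actual pipeline.

```python
import os
import tempfile
import tensorflow as tf

# Write a few small GZIP-compressed TFRecord shards (illustrative data).
tmpdir = tempfile.mkdtemp()
paths = []
opts = tf.io.TFRecordOptions(compression_type="GZIP")
for i in range(3):
    path = os.path.join(tmpdir, f"shard-{i}.tfrecord.gz")
    with tf.io.TFRecordWriter(path, opts) as w:
        for j in range(4):
            ex = tf.train.Example(features=tf.train.Features(feature={
                "x": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=[i * 4 + j])),
            }))
            w.write(ex.SerializeToString())
    paths.append(path)

def parse(rec):
    # "x" is a made-up feature key for this sketch.
    return tf.io.parse_single_example(
        rec, {"x": tf.io.FixedLenFeature([], tf.int64)})["x"]

# Interleave shards, then map/batch/prefetch -- the stages the
# commenter tried removing one by one while chasing the leak.
ds = (tf.data.Dataset.from_tensor_slices(paths)
      .interleave(lambda p: tf.data.TFRecordDataset(p, compression_type="GZIP"),
                  cycle_length=2)
      .map(parse)
      .batch(4)
      .prefetch(tf.data.AUTOTUNE))

total = sum(int(b.shape[0]) for b in ds)
print(total)  # 3 shards x 4 records = 12
```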
I am facing the same problem, using scripts from here: https://github.com/kpertsch/rlds_dataset_mod which also rely on certain features of TensorFlow Datasets. The scripts are intended to apply some modifications to an existing TensorFlow dataset stored in TFRecord format.
Hi @Omitg24! Are you facing this issue in the TensorFlow backend? Have you tested other backends (jax or torch)? Do you see this issue with other backends?
This issue is stale because it has been open for 14 days with no activity. It will be closed if no further activity occurs. Thank you.
Does this still happen if you use the Also use keras |
For the past 3 weeks I've been searching nonstop for a solution to this problem: when training an LSTM model with a custom DataGenerator, Keras ends up using all my RAM. The context of the project is to predict sleep stages. In this script, the idea is to parallelize 15 different participants, each with 10 folds (10 train and 10 validation), and in a following phase to test with the respective partition.
Having said that, this is the LSTM Network I'm currently using:
I'm using:
This network has been used in this project
Then, I've implemented this custom DataGenerator, which suits my problem.
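For readers unfamiliar with the pattern: a custom Keras DataGenerator typically implements the `Sequence`/`PyDataset` contract, i.e. `__len__` (batches per epoch) and `__getitem__` (one batch). Below is a generic, framework-free sketch of that contract using only NumPy; the class and variable names are illustrative and not taken from the issue author's code.

```python
import math
import numpy as np

class FoldGenerator:
    """Yields one batch at a time so only the current batch is in RAM."""

    def __init__(self, x, y, batch_size):
        self.x, self.y, self.batch_size = x, y, batch_size

    def __len__(self):
        # Number of batches per epoch.
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        # Slice out batch `idx`; with memory-mapped arrays this avoids
        # materializing the whole fold at once.
        lo = idx * self.batch_size
        hi = lo + self.batch_size
        return self.x[lo:hi], self.y[lo:hi]

# Illustrative data: 50 samples of 2 features each.
x = np.arange(100, dtype="float32").reshape(50, 2)
y = np.arange(50)
gen = FoldGenerator(x, y, batch_size=16)
print(len(gen))          # ceil(50 / 16) = 4 batches
print(gen[3][0].shape)   # last batch holds 50 - 48 = 2 samples: (2, 2)
```

In real Keras code the class would subclass `keras.utils.Sequence` (or `keras.utils.PyDataset` in Keras 3), but the batching logic is the same.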
And finally, the training phase is the following:
With that, I get this output file (I'm showing the first and the last epoch), where we can see how it ends up consuming 80 GB of RAM on just one participant with 10 epochs and 10 folds.
I've tried explicitly deleting variables, calling the garbage collector, and using clear_session() after finishing training each model; since it's an incremental training, I don't think I'm supposed to use it between folds.
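The cleanup steps mentioned above can be sketched as follows. This is a generic illustration, not the author's script; the Keras-specific `keras.backend.clear_session()` call is left as a comment so the snippet runs without TensorFlow installed.

```python
import gc

class Node:
    """Stand-in for a model/history object that can form reference cycles."""
    def __init__(self):
        self.ref = None

# Build a reference cycle: plain refcounting alone cannot free these.
a, b = Node(), Node()
a.ref, b.ref = b, a

del a, b                   # drop our references; the cycle keeps both alive
collected = gc.collect()   # the cyclic collector reclaims them
print(collected >= 2)      # True: at least the two Node objects were freed

# In a real Keras training loop one would additionally call, per model:
#   keras.backend.clear_session()
# to drop the global graph/state Keras accumulates across models.
```

Note that `gc.collect()` only releases objects Python can actually reach and free; if the leak lives in TensorFlow's C++ runtime (as the linked tensorflow/tensorflow#65675 suggests), these calls will not recover that memory.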
Finally, in case it helps demonstrate my issue, I've also run memory_profiler to check whether memory was really being freed (it was, but not enough); this is the result for one epoch with 10 folds on one participant.
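As a stdlib alternative to memory_profiler, `tracemalloc` can confirm whether a Python-level allocation is actually released between folds. A minimal sketch (the "fold" here is a made-up list, not the author's data):

```python
import tracemalloc

tracemalloc.start()

big = [0] * 1_000_000                       # simulate one fold loaded into RAM
during, _ = tracemalloc.get_traced_memory() # traced bytes while fold is alive

del big                                     # free the fold before the next one
after, _ = tracemalloc.get_traced_memory()  # traced bytes after the delete

tracemalloc.stop()
print(after < during)  # True: the allocation was really released
```

If a profiler like this shows Python releasing memory while the process's resident set size keeps growing, that points at allocations outside the Python heap (e.g. inside the TensorFlow runtime), which matches the suspicion in the comments above.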
Hope someone knows how to fix this issue. Thanks!
What I've tried
I've tried reading the folds only when needed, explicitly freeing memory by deleting variables and calling the garbage collector, and using different parallelization techniques, but I've always faced the issue of a single participant consuming too much memory to handle.