Out of memory error on workers while running Beam+Dataflow #4525
Comments
Some naive ideas to cope with this:
@albertvillanova We were finally able to process the full NQ dataset on our machines using 600 GB with 5 workers. Maybe these numbers will work for you as well.
Thanks a lot for the hint, @seirasto. I have one question: what runner did you use? Direct, Apache Flink/Nemo/Samza/Spark, Google Dataflow...? Thank you.
I asked my colleague who ran the code and he said Apache Beam.
@albertvillanova Since we have already processed the NQ dataset on our machines, can we upload it to datasets so the NQ PR can be merged?
Maybe @lhoestq can give a more accurate answer, as I am not sure about the authentication requirements to upload those files to our cloud bucket. Anyway, I propose to continue this discussion on the dedicated PR for the Natural Questions dataset:
He looked into it further and he just used DirectRunner. @albertvillanova |
OK, thank you @seirasto for your hint. That explains why you did not encounter the out-of-memory error: it only appears when the processing is distributed (and thus constrained by each worker's memory), and DirectRunner does not distribute the processing (everything runs on a single machine).
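For reference, a minimal sketch of how the runner choice was expressed when loading a Beam-based dataset, assuming the `beam_runner`/`beam_options` arguments that datasets exposed for Beam builders at the time (the project ID and bucket names below are placeholders, not values from this thread):

```python
from apache_beam.options.pipeline_options import PipelineOptions
from datasets import load_dataset

# DirectRunner: everything runs on the local machine, so that single machine's
# RAM is the limit (which may be why one large machine could finish the job).
ds = load_dataset("natural_questions", "default", beam_runner="DirectRunner")

# DataflowRunner: processing is distributed, so per-worker memory matters instead.
options = PipelineOptions(
    project="my-gcp-project",             # placeholder project id
    region="us-central1",
    temp_location="gs://my-bucket/temp",  # placeholder bucket
)
ds = load_dataset(
    "natural_questions",
    "default",
    beam_runner="DataflowRunner",
    beam_options=options,
)
```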
@albertvillanova Doesn't DirectRunner offer distributed processing, though? https://beam.apache.org/documentation/runners/direct/
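As far as I know, the DirectRunner can parallelize locally (multiple threads or processes on one machine) but does not distribute work across a cluster. A minimal sketch using Beam's DirectRunner options (worker count is illustrative):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Local parallelism only: several worker processes on the same machine,
# all sharing that machine's memory and disk.
local_options = PipelineOptions(
    direct_num_workers=4,                     # illustrative worker count
    direct_running_mode="multi_processing",   # or "multi_threading" / "in_memory"
)
```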
Unrelated to the OOM issue, but we deprecated datasets with Beam scripts in #6474. I think we can close this issue.
Describe the bug
While running the preprocessing of the natural_question dataset (see PR #4368), there is an issue for the "default" config (train+dev files).
Previously, we ran the preprocessing for the "dev" config (only dev files) successfully.
The train data files are larger than the dev ones, and apparently the workers run out of memory while processing them.
Any help/hint is welcome!
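(For reference, worker memory and disk on Dataflow are controlled through standard Beam pipeline options; a hedged sketch follows, with option names from Apache Beam's worker options and purely illustrative values, not settings verified to fix this particular job:)

```python
from apache_beam.options.pipeline_options import PipelineOptions

dataflow_options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",             # placeholder project id
    region="us-central1",
    temp_location="gs://my-bucket/temp",  # placeholder bucket
    machine_type="n1-highmem-8",          # more RAM per worker
    disk_size_gb=600,                     # larger worker disk
    num_workers=5,                        # illustrative worker count
)
```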
Error message:
Info from the Diagnostics tab:
Additional information
Stack trace
Logs