Out of memory error on workers while running Beam+Dataflow #4525

Open
albertvillanova opened this issue Jun 20, 2022 · 10 comments
Labels
bug Something isn't working

Comments

@albertvillanova
Member

albertvillanova commented Jun 20, 2022

Describe the bug

While running the preprocessing of the natural_questions dataset (see PR #4368), there is an issue for the "default" config (train + dev files).

Previously, we ran the preprocessing for the "dev" config (dev files only) successfully.

The train data files are larger than the dev ones, and the workers apparently run out of memory while processing them.

Any help/hint is welcome!

Error message:

Data channel closed, unable to receive additional data from SDK sdk-0-0

Info from the Diagnostics tab:

Out of memory: Killed process 1882 (python) total-vm:6041764kB, anon-rss:3290928kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:9520kB oom_score_adj:900
The worker VM had to shut down one or more processes due to lack of memory.

Additional information

Stack trace

Traceback (most recent call last):
  File "/home/albert_huggingface_co/natural_questions/venv/bin/datasets-cli", line 8, in <module>
    sys.exit(main())
  File "/home/albert_huggingface_co/natural_questions/venv/lib/python3.9/site-packages/datasets/commands/datasets_cli.py", line 39, in main
    service.run()
  File "/home/albert_huggingface_co/natural_questions/venv/lib/python3.9/site-packages/datasets/commands/run_beam.py", line 127, in run
    builder.download_and_prepare(
  File "/home/albert_huggingface_co/natural_questions/venv/lib/python3.9/site-packages/datasets/builder.py", line 704, in download_and_prepare
    self._download_and_prepare(
  File "/home/albert_huggingface_co/natural_questions/venv/lib/python3.9/site-packages/datasets/builder.py", line 1389, in _download_and_prepare
    pipeline_results.wait_until_finish()
  File "/home/albert_huggingface_co/natural_questions/venv/lib/python3.9/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 1667, in wait_until_finish
    raise DataflowRuntimeException(
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Data channel closed, unable to receive additional data from SDK sdk-0-0

Logs

Error message from worker: Data channel closed, unable to receive additional data from SDK sdk-0-0

Workflow failed. Causes: S30:train/ReadAllFromText/ReadAllFiles/Reshard/ReshufflePerKey/GroupByKey/Read+train/ReadAllFromText/ReadAllFiles/Reshard/ReshufflePerKey/GroupByKey/GroupByWindow+train/ReadAllFromText/ReadAllFiles/Reshard/ReshufflePerKey/FlatMap(restore_timestamps)+train/ReadAllFromText/ReadAllFiles/Reshard/RemoveRandomKeys+train/ReadAllFromText/ReadAllFiles/ReadRange+train/Map(_parse_example)+train/Encode+train/Count N. Examples+train/Get values/Values+train/Save to parquet/Write/WriteImpl/WindowInto(WindowIntoFn)+train/Save to parquet/Write/WriteImpl/WriteBundles+train/Save to parquet/Write/WriteImpl/Pair+train/Save to parquet/Write/WriteImpl/GroupByKey/Write failed., The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures. For more information, see https://cloud.google.com/dataflow/docs/guides/common-errors. The work item was attempted on these workers: beamapp-alberthuggingface-06170554-5p23-harness-t4v9 Root cause: Data channel closed, unable to receive additional data from SDK sdk-0-0, beamapp-alberthuggingface-06170554-5p23-harness-t4v9 Root cause: The worker lost contact with the service., beamapp-alberthuggingface-06170554-5p23-harness-bwsj Root cause: The worker lost contact with the service., beamapp-alberthuggingface-06170554-5p23-harness-5052 Root cause: The worker lost contact with the service.
@albertvillanova albertvillanova added the "bug (Something isn't working)" label on Jun 20, 2022
@albertvillanova albertvillanova changed the title from "Out of memory error on workers while running Apache Beam + Google Dataflow" to "Out of memory error on workers while running Beam+Dataflow" on Jun 20, 2022
@albertvillanova
Member Author

Some naive ideas to cope with this (a rough sketch of the corresponding Dataflow pipeline options follows this list):

  • enable more RAM on each worker
  • force the spawning of more workers
  • others?
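
For illustration only, the first two ideas translate roughly into Dataflow worker options like the following; the machine type, worker counts, and disk size are placeholders, not values that have been tested on this dataset:

from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder values: a high-memory machine type gives each worker more RAM,
# while num_workers / max_num_workers push Dataflow to spawn more workers.
options = PipelineOptions(
    runner="DataflowRunner",
    machine_type="n1-highmem-8",  # more RAM per worker (placeholder)
    num_workers=10,               # initial number of workers (placeholder)
    max_num_workers=20,           # autoscaling ceiling (placeholder)
    disk_size_gb=250,             # larger worker disks (placeholder)
)

The options object would then be passed to the pipeline (e.g. beam.Pipeline(options=options)); the same settings can also be given as the equivalent command-line flags (--machine_type, --num_workers, --max_num_workers, --disk_size_gb) wherever the pipeline options are built from flags.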

@seirasto
Contributor

@albertvillanova We were finally able to process the full NQ dataset on our machines using 600 GB with 5 workers. Maybe these numbers will work for you as well.

@albertvillanova
Member Author

Thanks a lot for the hint, @seirasto.

I have one question: what runner did you use? Direct, Apache Flink/Nemo/Samza/Spark, Google Dataflow...? Thank you.

@seirasto
Contributor

I asked my colleague who ran the code and he said Apache Beam.

@seirasto
Contributor

@albertvillanova Since we have already processed the NQ dataset on our machines, can we upload it to datasets so the NQ PR can be merged?

@albertvillanova
Member Author

albertvillanova commented Jun 28, 2022

Maybe @lhoestq can give a more accurate answer, as I am not sure about the authentication requirements for uploading those files to our cloud bucket.

Anyway, I propose to continue this discussion on the dedicated PR for the Natural Questions dataset:

@seirasto
Contributor

I asked my colleague who ran the code and he said Apache Beam.

He looked into it further and he just used DirectRunner. @albertvillanova

@albertvillanova
Member Author

albertvillanova commented Jun 30, 2022

OK, thank you @seirasto for your hint.

That explains why you did not encounter the out-of-memory error: it only appears when the processing is distributed (across workers' memory), and DirectRunner does not distribute the processing (everything runs on a single machine).

@jdwillard19

@albertvillanova Doesn't DirectRunner offer distributed processing, though?

https://beam.apache.org/documentation/runners/direct/

Setting parallelism

The number of threads or subprocesses is defined by setting the direct_num_workers pipeline option. From 2.22.0, direct_num_workers = 0 is supported. When direct_num_workers is set to 0, the number of threads/subprocesses is set to the number of cores of the machine where the pipeline is running.

Setting running mode

In Beam 2.19.0 and newer, you can use the direct_running_mode pipeline option to set the running mode. direct_running_mode can be one of ['in_memory', 'multi_threading', 'multi_processing'].

in_memory: Runner and workers’ communication happens in memory (not through gRPC). This is the default mode.

multi_threading: Runner and workers communicate through gRPC and each worker runs in a thread.

multi_processing: Runner and workers communicate through gRPC and each worker runs in a subprocess.
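
As a concrete sketch of the options quoted above (the values are arbitrary, for illustration only), a local DirectRunner pipeline could be configured like this:

from apache_beam.options.pipeline_options import PipelineOptions

# direct_num_workers=0 uses one worker per CPU core (Beam >= 2.22.0);
# multi_processing runs each worker in its own subprocess, communicating over gRPC.
options = PipelineOptions(
    runner="DirectRunner",
    direct_num_workers=0,
    direct_running_mode="multi_processing",
)

Even with multi_processing, everything still runs on a single machine, so this parallelizes the work but does not distribute it across separate worker VMs the way Dataflow does.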

@lhoestq
Member

lhoestq commented Mar 26, 2024

Unrelated to the OOM issue, but we deprecated Beam-based dataset scripts in #6474. I think we can close this issue.
