Out of memory error on workers while running Beam+Dataflow #4525

Open
albertvillanova opened this issue Jun 20, 2022 · 10 comments
Labels
bug Something isn't working

Comments

@albertvillanova
Member

albertvillanova commented Jun 20, 2022

Describe the bug

While running the preprocessing of the natural_questions dataset (see PR #4368), there is an issue for the "default" config (train + dev files).

Previously, we ran the preprocessing for the "dev" config (dev files only) successfully.

The train data files are larger than the dev ones, and the workers apparently run out of memory while processing them.

Any help/hint is welcome!

Error message:

Data channel closed, unable to receive additional data from SDK sdk-0-0

Info from the Diagnostics tab:

Out of memory: Killed process 1882 (python) total-vm:6041764kB, anon-rss:3290928kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:9520kB oom_score_adj:900
The worker VM had to shut down one or more processes due to lack of memory.

Additional information

Stack trace

Traceback (most recent call last):
  File "/home/albert_huggingface_co/natural_questions/venv/bin/datasets-cli", line 8, in <module>
    sys.exit(main())
  File "/home/albert_huggingface_co/natural_questions/venv/lib/python3.9/site-packages/datasets/commands/datasets_cli.py", line 39, in main
    service.run()
  File "/home/albert_huggingface_co/natural_questions/venv/lib/python3.9/site-packages/datasets/commands/run_beam.py", line 127, in run
    builder.download_and_prepare(
  File "/home/albert_huggingface_co/natural_questions/venv/lib/python3.9/site-packages/datasets/builder.py", line 704, in download_and_prepare
    self._download_and_prepare(
  File "/home/albert_huggingface_co/natural_questions/venv/lib/python3.9/site-packages/datasets/builder.py", line 1389, in _download_and_prepare
    pipeline_results.wait_until_finish()
  File "/home/albert_huggingface_co/natural_questions/venv/lib/python3.9/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 1667, in wait_until_finish
    raise DataflowRuntimeException(
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Data channel closed, unable to receive additional data from SDK sdk-0-0

Logs

Error message from worker: Data channel closed, unable to receive additional data from SDK sdk-0-0

Workflow failed. Causes: S30:train/ReadAllFromText/ReadAllFiles/Reshard/ReshufflePerKey/GroupByKey/Read+train/ReadAllFromText/ReadAllFiles/Reshard/ReshufflePerKey/GroupByKey/GroupByWindow+train/ReadAllFromText/ReadAllFiles/Reshard/ReshufflePerKey/FlatMap(restore_timestamps)+train/ReadAllFromText/ReadAllFiles/Reshard/RemoveRandomKeys+train/ReadAllFromText/ReadAllFiles/ReadRange+train/Map(_parse_example)+train/Encode+train/Count N. Examples+train/Get values/Values+train/Save to parquet/Write/WriteImpl/WindowInto(WindowIntoFn)+train/Save to parquet/Write/WriteImpl/WriteBundles+train/Save to parquet/Write/WriteImpl/Pair+train/Save to parquet/Write/WriteImpl/GroupByKey/Write failed., The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures. For more information, see https://cloud.google.com/dataflow/docs/guides/common-errors. The work item was attempted on these workers: beamapp-alberthuggingface-06170554-5p23-harness-t4v9 Root cause: Data channel closed, unable to receive additional data from SDK sdk-0-0, beamapp-alberthuggingface-06170554-5p23-harness-t4v9 Root cause: The worker lost contact with the service., beamapp-alberthuggingface-06170554-5p23-harness-bwsj Root cause: The worker lost contact with the service., beamapp-alberthuggingface-06170554-5p23-harness-5052 Root cause: The worker lost contact with the service.
@albertvillanova albertvillanova added the "bug (Something isn't working)" label on Jun 20, 2022
@albertvillanova albertvillanova changed the title from "Out of memory error on workers while running Apache Beam + Google Dataflow" to "Out of memory error on workers while running Beam+Dataflow" on Jun 20, 2022
@albertvillanova
Member Author

Some naive ideas to cope with this (a rough sketch of the corresponding Dataflow pipeline options follows this list):

  • enable more RAM on each worker
  • force the spawning of more workers
  • others?
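
For illustration only, the first two ideas translate roughly into Dataflow worker options like the following; the machine type, worker counts, and disk size are placeholders, not values that have been tested on this dataset:

from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder values: a high-memory machine type gives each worker more RAM,
# while num_workers / max_num_workers push Dataflow to spawn more workers.
options = PipelineOptions(
    runner="DataflowRunner",
    machine_type="n1-highmem-8",  # more RAM per worker (placeholder)
    num_workers=10,               # initial number of workers (placeholder)
    max_num_workers=20,           # autoscaling ceiling (placeholder)
    disk_size_gb=250,             # larger worker disks (placeholder)
)

The options object would then be passed to the pipeline (e.g. beam.Pipeline(options=options)); the same settings can also be given as the equivalent command-line flags (--machine_type, --num_workers, --max_num_workers, --disk_size_gb) wherever the pipeline options are built from flags.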

@seirasto
Contributor

@albertvillanova We were finally able to process the full NQ dataset on our machines using 600 GB with 5 workers. Maybe these numbers will work for you as well.

@albertvillanova
Member Author

Thanks a lot for the hint, @seirasto.

I have one question: what runner did you use? Direct, Apache Flink/Nemo/Samza/Spark, Google Dataflow...? Thank you.

@seirasto
Contributor

I asked my colleague who ran the code and he said Apache Beam.

@seirasto
Contributor

@albertvillanova Since we have already processed the NQ dataset on our machines, can we upload it to datasets so the NQ PR can be merged?

@albertvillanova
Member Author

albertvillanova commented Jun 28, 2022

Maybe @lhoestq can give a more accurate answer, as I am not sure about the authentication requirements for uploading those files to our cloud bucket.

Anyway, I propose to continue this discussion on the dedicated PR for the Natural Questions dataset:

@seirasto
Contributor

I asked my colleague who ran the code and he said Apache Beam.

He looked into it further and he just used DirectRunner. @albertvillanova

@albertvillanova
Member Author

albertvillanova commented Jun 30, 2022

OK, thank you @seirasto for your hint.

That explains why you did not encounter the out-of-memory error: it only appears when the processing is distributed (across workers' memory), and DirectRunner does not distribute the processing (everything runs on a single machine).

@jdwillard19

@albertvillanova Doesn't DirectRunner offer distributed processing, though?

https://beam.apache.org/documentation/runners/direct/

Setting parallelism

The number of threads or subprocesses is defined by setting the direct_num_workers pipeline option. From 2.22.0, direct_num_workers = 0 is supported. When direct_num_workers is set to 0, the number of threads/subprocesses is set to the number of cores of the machine where the pipeline is running.

Setting running mode

In Beam 2.19.0 and newer, you can use the direct_running_mode pipeline option to set the running mode. direct_running_mode can be one of ['in_memory', 'multi_threading', 'multi_processing'].

in_memory: Runner and workers’ communication happens in memory (not through gRPC). This is the default mode.

multi_threading: Runner and workers communicate through gRPC and each worker runs in a thread.

multi_processing: Runner and workers communicate through gRPC and each worker runs in a subprocess.
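
As a concrete sketch of the options quoted above (the values are arbitrary, for illustration only), a local DirectRunner pipeline could be configured like this:

from apache_beam.options.pipeline_options import PipelineOptions

# direct_num_workers=0 uses one worker per CPU core (Beam >= 2.22.0);
# multi_processing runs each worker in its own subprocess, communicating over gRPC.
options = PipelineOptions(
    runner="DirectRunner",
    direct_num_workers=0,
    direct_running_mode="multi_processing",
)

Even with multi_processing, everything still runs on a single machine, so this parallelizes the work but does not distribute it across separate worker VMs the way Dataflow does.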

@lhoestq
Member

lhoestq commented Mar 26, 2024

Unrelated to the OOM issue, but we deprecated Beam-based dataset scripts in #6474. I think we can close this issue.
