
[BUG] Java OOM when testing non-UTC time zone with lots of cases fallback. #9829

Closed
res-life opened this issue Nov 22, 2023 · 16 comments · Fixed by #9944

@res-life (Collaborator) commented Nov 22, 2023

Describe the bug

Premerge reports an OOM when running CI.
Refer to: #9773

Errors are:

[2023-11-21T12:23:14.166Z] ----------------------------- Captured stderr call -----------------------------
[2023-11-21T12:23:14.166Z] 2023-11-21 15:50:18 INFO     Error while receiving.
[2023-11-21T12:23:14.166Z] Traceback (most recent call last):
[2023-11-21T12:23:14.166Z]   File "/home/jenkins/agent/workspace/jenkins-rapids_premerge-github-8481-ci-2/.download/spark-3.1.1-bin-hadoop3.2/python/pyspark/sql/utils.py", line 111, in deco
[2023-11-21T12:23:14.166Z]     return f(*a, **kw)
[2023-11-21T12:23:14.166Z]   File "/home/jenkins/agent/workspace/jenkins-rapids_premerge-github-8481-ci-2/.download/spark-3.1.1-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
[2023-11-21T12:23:14.166Z]     raise Py4JJavaError(
[2023-11-21T12:23:14.166Z] py4j.protocol.Py4JJavaError: An error occurred while calling o650151.collectToPython.
[2023-11-21T12:23:14.166Z] : java.lang.OutOfMemoryError: Java heap space
[2023-11-21T12:23:14.166Z] 

Steps/Code to reproduce bug
Based on PR #9773:

export TEST_PARALLEL=1
export TZ=Iran
./integration_tests/run_pyspark_from_build.sh -s

Additional context
Locally, I set TEST_PARALLEL to 1 on Spark 341 and it also fails.

Changes in #9652

  • user.timezone=$TZ
  • PYSP_TEST_spark_sql_session_timeZone=$TZ

Changes in #9773:

  • Set TZ=non-UTC
res-life added the 'bug (Something isn't working)' and '? - Needs Triage (Need team to review and classify)' labels on Nov 22, 2023
@winningsix (Collaborator):

There's more context here. One possibly related point is the way we run these test cases: when we set a non-UTC time zone, most cases fall back to the CPU, which may lead to higher Java heap memory demand. If that's the case, it's more likely a test-setup problem than a RAPIDS plugin bug.

@res-life (Collaborator, Author) commented Nov 23, 2023

When TEST_PARALLEL > 1, PYSP_TEST_spark_rapids_memory_gpu_allocSize is set to 1536m, and the following GPU error occurs:

23/11/23 12:17:01 ERROR Executor: Exception in task 0.0 in stage 71540.0 (TID 265099)
com.nvidia.spark.rapids.jni.SplitAndRetryOOM: GPU OutOfMemory: could not split input target batch size to less than 10 MiB
  at com.nvidia.spark.rapids.GpuParquetScan$.splitTargetBatchSize(GpuParquetScan.scala:266)
  at com.nvidia.spark.rapids.MultiFileParquetPartitionReader.chunkedSplit(GpuParquetScan.scala:1968)
  at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readBatch$4(GpuMultiFileReader.scala:1068)
  at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.split(RmmRapidsRetryIterator.scala:442)
  at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:557)
  at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:495)
  at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
  at scala.collection.TraversableOnce$FlattenOps$$anon$2.hasNext(TraversableOnce.scala:469)
  at com.nvidia.spark.rapids.GpuColumnarBatchWithPartitionValuesIterator.hasNext(GpuColumnarBatchIterator.scala:114)
  at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
  at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.next(GpuMultiFileReader.scala:1025)
  at com.nvidia.spark.rapids.PartitionIterator.hasNext(dataSourceUtil.scala:29)
  at com.nvidia.spark.rapids.MetricsBatchIterator.hasNext(dataSourceUtil.scala:46)
  at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
  at org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:474)
  at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:285)
  at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
  at com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:284)
  at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:257)
  at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:301)
  at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)

@res-life (Collaborator, Author):

About the GPU OOM: I checked the passing CIs for the UTC time zone, and they also report OOM:

[2023-11-24T02:28:23.045Z] 23/11/24 02:28:22 WARN DeviceMemoryEventHandler: [RETRY 1] Retrying allocation of 2147483712 after a synchronize. Total RMM allocated is 137038848 bytes.
[2023-11-24T02:28:23.045Z] 23/11/24 02:28:22 WARN DeviceMemoryEventHandler: [RETRY 2] Retrying allocation of 2147483712 after a synchronize. Total RMM allocated is 137038848 bytes.
[2023-11-24T02:28:23.045Z] 23/11/24 02:28:22 WARN DeviceMemoryEventHandler: Device store exhausted, unable to allocate 2147483712 bytes. Total RMM allocated is 137038848 bytes.
[2023-11-24T02:28:23.045Z] 23/11/24 02:28:22 WARN DeviceMemoryEventHandler: [RETRY 1] Retrying allocation of 2147483712 after a synchronize. Total RMM allocated is 137038848 bytes.
[2023-11-24T02:28:23.045Z] 23/11/24 02:28:22 WARN DeviceMemoryEventHandler: [RETRY 2] Retrying allocation of 2147483712 after a synchronize. Total RMM allocated is 137038848 bytes.
[2023-11-24T02:28:23.045Z] 23/11/24 02:28:22 WARN DeviceMemoryEventHandler: Device store exhausted, unable to allocate 2147483712 bytes. Total RMM allocated is 137038848 bytes.
[2023-11-24T02:28:23.045Z] 23/11/24 02:28:22 WARN DeviceMemoryEventHandler: [RETRY 1] Retrying allocation of 2147483712 after a synchronize. Total RMM allocated is 137038848 bytes.
[2023-11-24T02:28:23.045Z] 23/11/24 02:28:22 WARN DeviceMemoryEventHandler: [RETRY 2] Retrying allocation of 2147483712 after a synchronize. Total RMM allocated is 137038848 bytes.
[2023-11-24T02:28:23.045Z] 23/11/24 02:28:22 WARN DeviceMemoryEventHandler: Device store exhausted, unable to allocate 2147483712 bytes. Total RMM allocated is 137038848 bytes.
[2023-11-24T02:28:23.045Z] 23/11/24 02:28:22 WARN DeviceMemoryEventHandler: [RETRY 1] Retrying allocation of 2147483712 after a synchronize. Total RMM allocated is 137038848 bytes.
[2023-11-24T02:28:23.045Z] 23/11/24 02:28:22 WARN DeviceMemoryEventHandler: [RETRY 2] Retrying allocation of 2147483712 after a synchronize. Total RMM allocated is 137038848 bytes.
[2023-11-24T02:28:23.045Z] 23/11/24 02:28:22 WARN DeviceMemoryEventHandler: Device store exhausted, unable to allocate 2147483712 bytes. Total RMM allocated is 137038848 bytes.
[2023-11-24T02:28:23.045Z] 23/11/24 02:28:22 ERROR Executor: Exception in task 0.0 in stage 22016.0 (TID 65075)
[2023-11-24T02:28:23.045Z] com.nvidia.spark.rapids.jni.SplitAndRetryOOM: GPU OutOfMemory: could not split input target batch size to less than 10 MiB
[2023-11-24T02:28:23.045Z] 	at com.nvidia.spark.rapids.GpuParquetScan$.splitTargetBatchSize(GpuParquetScan.scala:266)
[2023-11-24T02:28:23.045Z] 	at com.nvidia.spark.rapids.MultiFileParquetPartitionReader.chunkedSplit(GpuParquetScan.scala:1968)
[2023-11-24T02:28:23.045Z] 	at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readBatch$4(GpuMultiFileReader.scala:1068)
[2023-11-24T02:28:23.045Z] 	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.split(RmmRapidsRetryIterator.scala:442)
[2023-11-24T02:28:23.045Z] 	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:557)
[2023-11-24T02:28:23.045Z] 	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:495)
[2023-11-24T02:28:23.045Z] 	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
[2023-11-24T02:28:23.045Z] 	at scala.collection.TraversableOnce$FlattenOps$$anon$2.hasNext(TraversableOnce.scala:469)
[2023-11-24T02:28:23.045Z] 	at com.nvidia.spark.rapids.GpuColumnarBatchWithPartitionValuesIterator.hasNext(GpuColumnarBatchIterator.scala:114)
[2023-11-24T02:28:23.045Z] 	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
[2023-11-24T02:28:23.045Z] 	at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.next(GpuMultiFileReader.scala:1025)
[2023-11-24T02:28:23.045Z] 	at com.nvidia.spark.rapids.PartitionIterator.hasNext(dataSourceUtil.scala:29)
[2023-11-24T02:28:23.045Z] 	at com.nvidia.spark.rapids.MetricsBatchIterator.hasNext(dataSourceUtil.scala:46)
[2023-11-24T02:28:23.045Z] 	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
[2023-11-24T02:28:23.045Z] 	at org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:474)
[2023-11-24T02:28:23.045Z] 	at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:285)
[2023-11-24T02:28:23.045Z] 	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
[2023-11-24T02:28:23.045Z] 	at com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:284)
[2023-11-24T02:28:23.045Z] 	at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:257)
[2023-11-24T02:28:23.045Z] 	at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:301)
[2023-11-24T02:28:23.045Z] 	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
[2023-11-24T02:28:23.045Z] 	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
[2023-11-24T02:28:23.046Z] 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
[2023-11-24T02:28:23.046Z] 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
[2023-11-24T02:28:23.046Z] 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[2023-11-24T02:28:23.046Z] 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
[2023-11-24T02:28:23.046Z] 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
[2023-11-24T02:28:23.046Z] 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
[2023-11-24T02:28:23.046Z] 	at org.apache.spark.scheduler.Task.run(Task.scala:131)
[2023-11-24T02:28:23.046Z] 	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
[2023-11-24T02:28:23.046Z] 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
[2023-11-24T02:28:23.046Z] 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
[2023-11-24T02:28:23.046Z] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2023-11-24T02:28:23.046Z] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2023-11-24T02:28:23.046Z] 	at java.lang.Thread.run(Thread.java:750)

Refer to CI runs 8525 and 8527: select Blue Ocean, then select Premerge CI 2.

@revans2 Is this an issue?

@winningsix (Collaborator):

From the log, the Parquet reader is trying to split its batch to below 10 MiB. It's reasonable for GpuParquetScan to reject a further split on such a small batch, so most likely the issue is in another part: it seems there are other outstanding allocations consuming the memory even though we configured it at a higher size.

Is there any error message about a memory leak like the one below?

DeviceMemoryBuffer: A DEVICE BUFFER WAS LEAKED

@winningsix (Collaborator) commented Nov 27, 2023

Per offline discussion with @res-life @GaryShen2008, some further information:

  • Local environment (TEST_PARALLEL=1, i.e. 1 process): CPU cores = 16, Mem = 32 GB, GPU Mem = 50 GB
  • CI (TEST_PARALLEL=5, i.e. 5 processes): per-process CPU cores = TBA, Mem = 2 GB, GPU Mem = 1.2 GB
  • The tests passed in the local environment using the configuration above but failed in CI.

A few follow-ups are needed to further narrow down the issue:

  • Understand the peak GPU memory needed for the non-UTC test environment. A suspected difference from the UTC case is that we have many more fallbacks there, with more columnar-to-row (c2r) operators.
  • Update the CI configuration so that our new pipeline passes.

cc @pxLi

@res-life (Collaborator, Author):

Tested after this commit: the OOM disappears, because this commit converts lots of fallback cases into regular passing cases. We can close this after that commit is merged.

res-life added the 'P2 (Not required for release)' label on Nov 27, 2023
revans2 removed the 'P2 (Not required for release)' label on Nov 27, 2023
@revans2 (Collaborator) commented Nov 27, 2023

There are a number of different issues listed here, and I think we need to split them into separate issues so we can address them properly.

The original issue is a Java heap OOM when running the tests. The simplest way to debug this is to get a heap dump on OOM (-XX:+HeapDumpOnOutOfMemoryError); with that we can look through the heap, see what is taking all of the memory, and know whether we have a leak or something else. Even if the issue goes away after a specific patch (#9829 (comment)), it should be simple enough to reproduce this and understand what is happening. But I would time-box it to 1 day at the most.

As for the GPU out-of-memory error (#9829 (comment)): that one is more concerning, and we should try to understand what is happening. The "split" on a Parquet read is a hack that does not do much, but it indicates that we really ran out of GPU memory and need to do some debugging to see what happened. It could be a memory leak, or it could be something else that is problematic. That should be filed as a separate issue, and I would suggest we first look for leaks (#9829 (comment)), but with -Dai.rapids.refcount.debug=true. This prints out a stack trace for every ref-count increment or decrement on the memory that was leaked, which should let us track down why it leaked. But it could also be a red herring: the memory may have been leaked after the exception was thrown, not before. We could also enable RMM logging and see all of the allocations, to get an idea of how much memory was in use at any point in time.

res-life changed the title from '[BUG] Java OOM when testing non-UTC time zone' to '[BUG] Java OOM when testing non-UTC time zone with lots of cases fallback.' on Nov 28, 2023
mattahrens removed the '? - Needs Triage (Need team to review and classify)' label on Nov 28, 2023

@wjxiz1992 (Collaborator):

Updates:

  1. I can stably reproduce the OOM at one single test case: test_explode_array_data[Array(Array(String))][DATAGEN_SEED=1701258438, INJECT_OOM, IGNORE_ORDER({'local': True})]
  2. First, I took heap dumps at two points: one when the OOM occurs, the other right before the OOM test case starts. Both show large memory usage.

When OOM:
[heap dump screenshot]
Right before the OOM test case:
[heap dump screenshot]

  3. Checking the large objects, we found that the top consumers are
    [screenshot]
    The ...execution.joins... objects are not related to the test_explode_array_data test case but are still there, which raises the suspicion of a memory leak.
  4. Checking the incoming references of the largest MutableURLClassLoader, we found that ExecutionPlanCaptureCallback seems to have many self-references.
    [screenshot]
  5. I commented out all jvm.org.apache.spark.sql.rapids.ExecutionPlanCaptureCallback.assert... calls in asserts.py; then there is no OOM, and the heap dump at the test case looks like below.
    [screenshot]

No MutableURLClassLoader or other large objects in the heap this time:
[heap dump screenshot]

@revans2 (Collaborator) commented Dec 1, 2023

It sounds like ExecutionPlanCaptureCallback has a memory leak in it.

@revans2 (Collaborator) commented Dec 1, 2023

Okay, we should keep this open, but I am not as concerned about it as I was initially. We can drop the priority on it.

@jlowe (Member) commented Dec 1, 2023

@wjxiz1992 I'm not sure how ExecutionPlanCaptureCallback is tied into the problem if we are just running test_explode_array_data. That test does not perform plan capturing, nor does it call the methods in asserts.py that leverage ExecutionPlanCaptureCallback (it's just calling assert_gpu_and_cpu_are_equal_collect, not one of the assert forms with plan capture). Therefore ExecutionPlanCaptureCallback's methods should not be getting called with the exception of captureIfNeeded since it is an installed Spark plan listener, but shouldCapture should always be false since nothing should be asking to capture plans.

Indeed I ran test_explode_array_data with a debugger attached to the JVM and breakpoints set on every ExecutionPlanCaptureCallback method, but only captureIfNeeded is getting called in practice and shouldCapture is always false, therefore it's not capturing any plans. I'm not sure how this is supposed to be contributing a ton of space to the heap, nor do I see how commenting out the ExecutionPlanCaptureCallback.assert... code in asserts.py would affect how test_explode_array_data executes. Any insight here?

@winningsix (Collaborator):

I am still investigating, with no further conclusion yet, but the leak is in execPlans from ShimmedExecutionPlanCaptureCallbackImpl (see below), and the OOM is fixed after I clear execPlans within the assertContains method. I am still confused, because shouldCapture works as a pair, getResultsWithTimeout and startCapture, and that looks correct in the asserts.py file.

[screenshot]

@jlowe (Member) commented Dec 2, 2023

Yes, I totally agree that the OOM is due to leaked plans in ExecutionPlanCaptureCallback. What's likely is that there's a test that needs a plan capture, it calls startCapture which triggers the plan listener to capture everything until told otherwise, and then that test throws an exception and the listener is never told to stop capturing. It then proceeds to capture for subsequent tests that don't actually interact with the listener capturing, but the previous test that threw an exception and didn't cleanup properly left the listener in a bad state.

That all makes sense, but what I don't get is how to reproduce it solely with test_explode_array_data. That test doesn't trigger capturing in any way that I can see (and debugger breakpoints prove it, at least for my environment). In order for ExecutionPlanCaptureCallback to start leaking plans, it needs to start capturing first, and I don't see how test_explode_array_data would do that on its own.

@winningsix (Collaborator) commented Dec 4, 2023

That all makes sense, but what I don't get is how to reproduce it solely with test_explode_array_data.

Yes, your guess is correct. shouldCapture failed to be reset after an exception was thrown in test_csv_save_as_table_fallback. This is because commit 6aef5c2 failed to include Coalesce in the allow-list, and the xfail marker made the issue easy to overlook:

Encountered an exception applying GPU overrides java.lang.IllegalArgumentException: Part of the plan is not columnar class org.apache.spark.sql.execution.CoalesceExec
Coalesce 1

@wjxiz1992 (Collaborator) commented Dec 4, 2023

but what I don't get is how to reproduce it solely with test_explode_array_data

Sorry, I missed some important things.

I also did a test where I ran only test_explode_array_data and took a dump: no memory objects lead to ExecutionPlanCaptureCallback.

I don't think this is related to test_explode_array_data itself. It just happens to run at the point where the previous tests have already leaked a lot, so its small extra memory request breaks the whole JVM.
