[BUG] SplitAndRetryOOM query78 at 100TB with spark.executor.cores=64 #9204

Closed
abellina opened this issue Sep 7, 2023 · 1 comment · Fixed by #9217
Labels
bug (Something isn't working), reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin)

Comments

abellina (Collaborator) commented Sep 7, 2023

I am trying out some scaling tests with spark.executor.cores=64 and I ran into a SplitAndRetryOOM with NDS query78.

The failure looks to be in the actual aggregation, since we should have already concatenated the tables before calling into it. The unfortunate part is that we may need to split things up right after we concatenate them:

com.nvidia.spark.rapids.jni.SplitAndRetryOOM: GPU OutOfMemory: could not split inputs and retry
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.split(RmmRapidsRetryIterator.scala:428)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:546)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:484)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.drainSingleWithVerification(RmmRapidsRetryIterator.scala:276)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.withRetryNoSplit(RmmRapidsRetryIterator.scala:129)
	at com.nvidia.spark.rapids.AggHelper.aggregate(aggregate.scala:336)
	at com.nvidia.spark.rapids.GpuAggregateIterator$.aggregate(aggregate.scala:449)
	at com.nvidia.spark.rapids.GpuAggregateIterator$.$anonfun$computeAggregateAndClose$1(aggregate.scala:478)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.GpuAggregateIterator$.computeAggregateAndClose(aggregate.scala:469)
	at com.nvidia.spark.rapids.GpuMergeAggregateIterator.concatenateAndMerge(aggregate.scala:905)
	at com.nvidia.spark.rapids.GpuMergeAggregateIterator.mergePass(aggregate.scala:868)
	at com.nvidia.spark.rapids.GpuMergeAggregateIterator.$anonfun$tryMergeAggregatedBatches$1(aggregate.scala:810)
	at com.nvidia.spark.rapids.GpuMergeAggregateIterator.$anonfun$tryMergeAggregatedBatches$1$adapted(aggregate.scala:808)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.GpuMergeAggregateIterator.tryMergeAggregatedBatches(aggregate.scala:808)
	at com.nvidia.spark.rapids.GpuMergeAggregateIterator.$anonfun$next$2(aggregate.scala:753)
	at scala.Option.getOrElse(Option.scala:189)
	at com.nvidia.spark.rapids.GpuMergeAggregateIterator.next(aggregate.scala:749)
	at com.nvidia.spark.rapids.GpuMergeAggregateIterator.next(aggregate.scala:711)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.$anonfun$next$6(aggregate.scala:2034)
	at scala.Option.map(Option.scala:230)
	at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.next(aggregate.scala:2034)
	at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.next(aggregate.scala:1898)
	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.partNextBatch(GpuShuffleExchangeExecBase.scala:320)
	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.hasNext(GpuShuffleExchangeExecBase.scala:342)
	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$2(RapidsShuffleInternalManagerBase.scala:282)
	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$2$adapted(RapidsShuffleInternalManagerBase.scala:275)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$1(RapidsShuffleInternalManagerBase.scala:275)
	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$1$adapted(RapidsShuffleInternalManagerBase.scala:274)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.write(RapidsShuffleInternalManagerBase.scala:274)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
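
For context, the exception above comes from the retry framework giving up on a split-and-retry: once an attempt can no longer be split (here, a batch that was just concatenated for the merge aggregation), there is nothing left to halve. Below is a minimal, self-contained sketch of that behavior; it is not the actual RmmRapidsRetryIterator implementation, and all names in it (Attempt, withSplitRetrySketch, SketchSplitAndRetryOOM) are hypothetical.

```scala
import scala.collection.mutable

object SplitRetrySketch {
  // Hypothetical OOM signal used only in this sketch.
  final class SketchSplitAndRetryOOM(msg: String) extends RuntimeException(msg)

  // Stand-in for a batch of rows being aggregated.
  final case class Attempt(rows: Long)

  // Halve the failed attempt; once it cannot be split any further, there is
  // nothing left to try and the OOM is rethrown (the failure in the trace above).
  def splitInHalf(a: Attempt): Seq[Attempt] = {
    if (a.rows <= 1) {
      throw new SketchSplitAndRetryOOM("GPU OutOfMemory: could not split inputs and retry")
    }
    val half = a.rows / 2
    Seq(Attempt(half), Attempt(a.rows - half))
  }

  // Very simplified split-and-retry loop over a single input.
  def withSplitRetrySketch[T](input: Attempt)(body: Attempt => T): Seq[T] = {
    val pending = mutable.Queue(input)
    val results = mutable.Buffer.empty[T]
    while (pending.nonEmpty) {
      val attempt = pending.dequeue()
      try {
        results += body(attempt)
      } catch {
        case _: SketchSplitAndRetryOOM =>
          // Requeue both halves; splitInHalf rethrows when splitting is impossible.
          pending ++= splitInHalf(attempt)
      }
    }
    results.toSeq
  }

  def main(args: Array[String]): Unit = {
    // Example: a body that "runs out of memory" for any attempt above 2 rows.
    val out = withSplitRetrySketch(Attempt(8)) { a =>
      if (a.rows > 2) throw new SketchSplitAndRetryOOM("too big") else a.rows
    }
    println(out) // prints a Seq with four attempts of 2 rows each
  }
}
```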
abellina added the bug (Something isn't working), ? - Needs Triage (Need team to review and classify), and reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin) labels on Sep 7, 2023
abellina (Collaborator, Author) commented Sep 8, 2023

If this is reproducible, let's print out the estimated aggregate size (potentialSize) computed right before we enter concatenateAndMerge, around aggregate.scala line 856, and see whether it is some really large number.

That said, after talking to @revans2 some more, he thinks the likely issue is a race condition: this particular task should have first retried with the same input, and only gone to a split-and-retry if that retry failed. The race arises because two other threads were failing at the same time, and they were not under the retry framework (that other failure is already handled by #9102). I'll retry with that fix by @firestarman, but the issue I filed here is an example of the race condition @revans2 was worried about, which we may want to fix in spark-rapids-jni.
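
To make the intended ordering concrete, here is a tiny illustrative sketch of the escalation policy described above; it is not the spark-rapids-jni state machine, and the names (OomResolution, resolve) are hypothetical. The point is only that SplitAndRetry should be reached because this task's own plain retry already failed, not because unrelated threads happened to OOM at the same moment.

```scala
object OomEscalationSketch {
  // Two ways a task blocked on GPU memory can be asked to recover.
  sealed trait OomResolution
  case object Retry extends OomResolution          // re-run with the same input
  case object SplitAndRetry extends OomResolution  // halve the input, then re-run

  // Hypothetical policy: escalate to SplitAndRetry only after a plain retry of
  // the same input has already been attempted and failed for this task.
  def resolve(plainRetriesAlreadyFailed: Int): OomResolution =
    if (plainRetriesAlreadyFailed == 0) Retry else SplitAndRetry
}
```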
