[BUG] SplitAndRetryOOM query78 at 100TB with spark.executor.cores=64 #9204

Closed
abellina opened this issue Sep 7, 2023 · 1 comment · Fixed by #9217
Labels
bug (Something isn't working), reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin)

Comments

abellina (Collaborator) commented Sep 7, 2023

I am trying out some scaling tests with spark.executor.cores=64 and I ran into a SplitAndRetryOOM with NDS query78.

The failure looks to be in the actual aggregation, since we should have already concatenated the tables before calling into it. The unfortunate part is that we may need to split things up right after we concatenate them:

com.nvidia.spark.rapids.jni.SplitAndRetryOOM: GPU OutOfMemory: could not split inputs and retry
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.split(RmmRapidsRetryIterator.scala:428)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:546)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:484)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.drainSingleWithVerification(RmmRapidsRetryIterator.scala:276)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.withRetryNoSplit(RmmRapidsRetryIterator.scala:129)
	at com.nvidia.spark.rapids.AggHelper.aggregate(aggregate.scala:336)
	at com.nvidia.spark.rapids.GpuAggregateIterator$.aggregate(aggregate.scala:449)
	at com.nvidia.spark.rapids.GpuAggregateIterator$.$anonfun$computeAggregateAndClose$1(aggregate.scala:478)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.GpuAggregateIterator$.computeAggregateAndClose(aggregate.scala:469)
	at com.nvidia.spark.rapids.GpuMergeAggregateIterator.concatenateAndMerge(aggregate.scala:905)
	at com.nvidia.spark.rapids.GpuMergeAggregateIterator.mergePass(aggregate.scala:868)
	at com.nvidia.spark.rapids.GpuMergeAggregateIterator.$anonfun$tryMergeAggregatedBatches$1(aggregate.scala:810)
	at com.nvidia.spark.rapids.GpuMergeAggregateIterator.$anonfun$tryMergeAggregatedBatches$1$adapted(aggregate.scala:808)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.GpuMergeAggregateIterator.tryMergeAggregatedBatches(aggregate.scala:808)
	at com.nvidia.spark.rapids.GpuMergeAggregateIterator.$anonfun$next$2(aggregate.scala:753)
	at scala.Option.getOrElse(Option.scala:189)
	at com.nvidia.spark.rapids.GpuMergeAggregateIterator.next(aggregate.scala:749)
	at com.nvidia.spark.rapids.GpuMergeAggregateIterator.next(aggregate.scala:711)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.$anonfun$next$6(aggregate.scala:2034)
	at scala.Option.map(Option.scala:230)
	at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.next(aggregate.scala:2034)
	at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.next(aggregate.scala:1898)
	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.partNextBatch(GpuShuffleExchangeExecBase.scala:320)
	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.hasNext(GpuShuffleExchangeExecBase.scala:342)
	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$2(RapidsShuffleInternalManagerBase.scala:282)
	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$2$adapted(RapidsShuffleInternalManagerBase.scala:275)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$1(RapidsShuffleInternalManagerBase.scala:275)
	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$1$adapted(RapidsShuffleInternalManagerBase.scala:274)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.write(RapidsShuffleInternalManagerBase.scala:274)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
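
For context, the exception above comes from the retry framework giving up on a split-and-retry: once an attempt can no longer be split (here, a batch that was just concatenated for the merge aggregation), there is nothing left to halve. Below is a minimal, self-contained sketch of that behavior; it is not the actual RmmRapidsRetryIterator implementation, and all names in it (Attempt, withSplitRetrySketch, SketchSplitAndRetryOOM) are hypothetical.

```scala
import scala.collection.mutable

object SplitRetrySketch {
  // Hypothetical OOM signal used only in this sketch.
  final class SketchSplitAndRetryOOM(msg: String) extends RuntimeException(msg)

  // Stand-in for a batch of rows being aggregated.
  final case class Attempt(rows: Long)

  // Halve the failed attempt; once it cannot be split any further, there is
  // nothing left to try and the OOM is rethrown (the failure in the trace above).
  def splitInHalf(a: Attempt): Seq[Attempt] = {
    if (a.rows <= 1) {
      throw new SketchSplitAndRetryOOM("GPU OutOfMemory: could not split inputs and retry")
    }
    val half = a.rows / 2
    Seq(Attempt(half), Attempt(a.rows - half))
  }

  // Very simplified split-and-retry loop over a single input.
  def withSplitRetrySketch[T](input: Attempt)(body: Attempt => T): Seq[T] = {
    val pending = mutable.Queue(input)
    val results = mutable.Buffer.empty[T]
    while (pending.nonEmpty) {
      val attempt = pending.dequeue()
      try {
        results += body(attempt)
      } catch {
        case _: SketchSplitAndRetryOOM =>
          // Requeue both halves; splitInHalf rethrows when splitting is impossible.
          pending ++= splitInHalf(attempt)
      }
    }
    results.toSeq
  }

  def main(args: Array[String]): Unit = {
    // Example: a body that "runs out of memory" for any attempt above 2 rows.
    val out = withSplitRetrySketch(Attempt(8)) { a =>
      if (a.rows > 2) throw new SketchSplitAndRetryOOM("too big") else a.rows
    }
    println(out) // prints a Seq with four attempts of 2 rows each
  }
}
```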
abellina added the bug (Something isn't working), ? - Needs Triage (Need team to review and classify), and reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin) labels on Sep 7, 2023
abellina (Collaborator, Author) commented Sep 8, 2023

If this is reproducible, let's print out the estimated aggregate size (potentialSize) computed right before we enter concatenateAndMerge, around aggregate.scala line 856, and see whether it is some really large number.

That said, after talking to @revans2 some more, he thinks the likely issue is a race condition: this particular task should have first retried with the same input, and only gone to a split-and-retry if that retry failed. The race arises because two other threads were failing at the same time, and they were not under the retry framework (that other failure is already handled by #9102). I'll retry with that fix by @firestarman, but the issue I filed here is an example of the race condition @revans2 was worried about, which we may want to fix in spark-rapids-jni.
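
To make the intended ordering concrete, here is a tiny illustrative sketch of the escalation policy described above; it is not the spark-rapids-jni state machine, and the names (OomResolution, resolve) are hypothetical. The point is only that SplitAndRetry should be reached because this task's own plain retry already failed, not because unrelated threads happened to OOM at the same moment.

```scala
object OomEscalationSketch {
  // Two ways a task blocked on GPU memory can be asked to recover.
  sealed trait OomResolution
  case object Retry extends OomResolution          // re-run with the same input
  case object SplitAndRetry extends OomResolution  // halve the input, then re-run

  // Hypothetical policy: escalate to SplitAndRetry only after a plain retry of
  // the same input has already been attempted and failed for this task.
  def resolve(plainRetriesAlreadyFailed: Int): OomResolution =
    if (plainRetriesAlreadyFailed == 0) Retry else SplitAndRetry
}
```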
