
[BUG] NDS query 14 parts 1 and 2 both fail at SF100K #8942

Closed
mattahrens opened this issue Aug 7, 2023 · 1 comment · Fixed by #8989
Labels: bug (Something isn't working), reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin)

Comments

@mattahrens (Collaborator):

query14_part1 exception:

Job aborted due to stage failure: Task 60 in stage 44.0 failed 4 times, most recent failure: Lost task 60.3 in stage 44.0 (TID 16851) (10.153.158.11 executor 5): com.nvidia.spark.rapids.jni.SplitAndRetryOOM: GPU OutOfMemory: could not split inputs and retry
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$NoInputSpliterator.split(RmmRapidsRetryIterator.scala:359)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:530)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:468)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.drainSingleWithVerification(RmmRapidsRetryIterator.scala:275)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.withRetryNoSplit(RmmRapidsRetryIterator.scala:181)
	at org.apache.spark.sql.rapids.execution.BaseHashJoinIterator.createGatherer(GpuHashJoin.scala:305)
	at com.nvidia.spark.rapids.SplittableJoinIterator.$anonfun$setupNextGatherer$2(AbstractGpuJoinIterator.scala:239)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.SplittableJoinIterator.$anonfun$setupNextGatherer$1(AbstractGpuJoinIterator.scala:221)
	at com.nvidia.spark.rapids.GpuMetric.ns(GpuExec.scala:150)
	at com.nvidia.spark.rapids.SplittableJoinIterator.setupNextGatherer(AbstractGpuJoinIterator.scala:221)
	at com.nvidia.spark.rapids.AbstractGpuJoinIterator.hasNext(AbstractGpuJoinIterator.scala:95)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.sql.rapids.execution.BaseSubHashJoinIterator.$anonfun$hasNext$9(GpuSubPartitionHashJoin.scala:537)
	at org.apache.spark.sql.rapids.execution.BaseSubHashJoinIterator.$anonfun$hasNext$9$adapted(GpuSubPartitionHashJoin.scala:537)
	at scala.Option.exists(Option.scala:376)
	at org.apache.spark.sql.rapids.execution.BaseSubHashJoinIterator.hasNext(GpuSubPartitionHashJoin.scala:537)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
	at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
	at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
	at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
	at scala.collection.AbstractIterator.to(Iterator.scala:1431)
	at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
	at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1431)
	at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
	at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1431)
	at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
	at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2254)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
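The trace shows the failure surfacing from `withRetryNoSplit`: the join gatherer is built under a retry policy that can re-attempt on GPU OOM but has no way to split its input, so when the framework is asked to split (`NoInputSpliterator.split`) it can only throw `SplitAndRetryOOM`. A minimal Python model of that control flow (all names hypothetical, not the plugin's actual API):

```python
class GpuOOM(Exception):
    """Stand-in for a recoverable GPU out-of-memory error."""

class SplitAndRetryOOM(Exception):
    """Raised when an OOM persists and the input cannot be split further."""

def with_retry_no_split(attempt, max_retries=3):
    """Retry `attempt` on OOM without splitting its input.

    Models the behavior seen in the trace: because there is no splitter
    for this input, a persistent OOM escalates to SplitAndRetryOOM
    instead of being handled by splitting the batch.
    """
    for _ in range(max_retries):
        try:
            return attempt()
        except GpuOOM:
            continue  # a real implementation would spill/free memory here
    # No retries left and no way to split: surface the terminal error.
    raise SplitAndRetryOOM("GPU OutOfMemory: could not split inputs and retry")
```

This is only a sketch of the retry-without-split pattern named in the trace; the real iterator coordinates with the RMM retry framework rather than looping locally.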

query14_part2 exception (appears to be the same as the part 1 exception):

Job aborted due to stage failure: Task 60 in stage 60.0 failed 4 times, most recent failure: Lost task 60.3 in stage 60.0 (TID 21368) (10.153.158.16 executor 6): com.nvidia.spark.rapids.jni.SplitAndRetryOOM: GPU OutOfMemory: could not split inputs and retry
	[remaining stack trace identical to the query14_part1 trace above]

Note that this failure is still reproducible even with the potential q95 fix for SF30K/SF100K applied (ref: #8936).

@mattahrens added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Aug 7, 2023
@abellina (Collaborator) commented Aug 8, 2023:

Chatted with @revans2, and he suspects this is a race condition: with the more accurate calculation around the splits for this gatherer, we could hold everything on the GPU, but instead of retrying one last time before asking to split (once only one thread remains), we jump straight to the split. It should be relatively easy to mock out a test on the plugin side to confirm that's the issue; then we can come up with a real solution in spark-rapids-jni...
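The suggested policy, retry once more when this is the last thread still holding GPU memory before escalating to a split, can be sketched as follows. This is an illustrative model only; the names `retry_or_split`, `active_threads`, and `GpuOOM` are hypothetical and do not reflect the plugin's or spark-rapids-jni's actual API:

```python
class GpuOOM(Exception):
    """Stand-in for a recoverable GPU out-of-memory error."""

def retry_or_split(attempt, split, active_threads):
    """On OOM, if this is the last active thread, retry once more before
    splitting: memory held by the other threads may have just been
    released, so jumping straight to split (the suspected race) can
    fail unnecessarily when the input cannot be split.
    """
    try:
        return attempt()
    except GpuOOM:
        if active_threads() == 1:
            # Last thread standing: one final retry before splitting.
            try:
                return attempt()
            except GpuOOM:
                pass
        # Retry exhausted (or other threads still hold memory): split.
        return split()
```

Under this policy, the `SplitAndRetryOOM` above would only be reached if the final single-threaded retry also failed, rather than on the first OOM.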

@mattahrens added the reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin) label on Aug 8, 2023
@mattahrens removed the ? - Needs Triage (Need team to review and classify) label on Aug 9, 2023