
[BUG] CudfException during conditional hash join while running nds query64 #10047

Closed
jbrennan333 opened this issue Dec 13, 2023 · 3 comments
Labels: bug (Something isn't working)

jbrennan333 (Collaborator) commented Dec 13, 2023

Describe the bug
When running an NDS power run on my desktop, I am seeing the following cudf exception during query64:

23/12/13 21:23:23 ERROR Executor: Exception in task 0.0 in stage 450.0 (TID 14627)
ai.rapids.cudf.CudfException: CUDF failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-618-cuda11/thirdparty/cudf/cpp/src/ast/expression_parser.cpp:149: An AST expression was provided non-matching operand types.
        at ai.rapids.cudf.Table.mixedInnerJoinGatherMaps(Native Method)
        at ai.rapids.cudf.Table.mixedInnerJoinGatherMaps(Table.java:3139)
        at org.apache.spark.sql.rapids.execution.ConditionalHashJoinIterator.$anonfun$joinGathererLeftRight$6(GpuHashJoin.scala:558)
        at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
        at org.apache.spark.sql.rapids.execution.ConditionalHashJoinIterator.$anonfun$joinGathererLeftRight$5(GpuHashJoin.scala:554)
        at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
        at org.apache.spark.sql.rapids.execution.ConditionalHashJoinIterator.$anonfun$joinGathererLeftRight$4(GpuHashJoin.scala:553)
        at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
        at org.apache.spark.sql.rapids.execution.ConditionalHashJoinIterator.joinGathererLeftRight(GpuHashJoin.scala:552)
        at org.apache.spark.sql.rapids.execution.BaseHashJoinIterator.$anonfun$joinGathererLeftRight$2(GpuHashJoin.scala:403)
        at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
        at org.apache.spark.sql.rapids.execution.BaseHashJoinIterator.$anonfun$joinGathererLeftRight$1(GpuHashJoin.scala:402)
        at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
        at org.apache.spark.sql.rapids.execution.BaseHashJoinIterator.joinGathererLeftRight(GpuHashJoin.scala:401)
        at org.apache.spark.sql.rapids.execution.BaseHashJoinIterator.joinGatherer(GpuHashJoin.scala:415)
        at org.apache.spark.sql.rapids.execution.BaseHashJoinIterator.$anonfun$joinGatherer$1(GpuHashJoin.scala:428)
        at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
        at org.apache.spark.sql.rapids.execution.BaseHashJoinIterator.joinGatherer(GpuHashJoin.scala:425)
        at org.apache.spark.sql.rapids.execution.BaseHashJoinIterator.$anonfun$createGatherer$5(GpuHashJoin.scala:358)
        at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
        at org.apache.spark.sql.rapids.execution.BaseHashJoinIterator.$anonfun$createGatherer$4(GpuHashJoin.scala:355)
        at com.nvidia.spark.rapids.Arm$.closeOnExcept(Arm.scala:88)
        at org.apache.spark.sql.rapids.execution.BaseHashJoinIterator.$anonfun$createGatherer$3(GpuHashJoin.scala:349)
        at org.apache.spark.sql.rapids.execution.BaseHashJoinIterator.$anonfun$createGatherer$2(GpuHashJoin.scala:348)
        at com.nvidia.spark.rapids.RmmRapidsRetryIterator$NoInputSpliterator.next(RmmRapidsRetryIterator.scala:395)
        at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:600)
        at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:517)
        at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.drainSingleWithVerification(RmmRapidsRetryIterator.scala:291)
        at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.withRetryNoSplit(RmmRapidsRetryIterator.scala:185)
        at org.apache.spark.sql.rapids.execution.BaseHashJoinIterator.createGatherer(GpuHashJoin.scala:346)
        at com.nvidia.spark.rapids.SplittableJoinIterator.$anonfun$setupNextGatherer$2(AbstractGpuJoinIterator.scala:245)
        at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
        at com.nvidia.spark.rapids.SplittableJoinIterator.$anonfun$setupNextGatherer$1(AbstractGpuJoinIterator.scala:227)
        at com.nvidia.spark.rapids.GpuMetric.ns(GpuExec.scala:150)
        at com.nvidia.spark.rapids.SplittableJoinIterator.setupNextGatherer(AbstractGpuJoinIterator.scala:227)
        at com.nvidia.spark.rapids.AbstractGpuJoinIterator.hasNext(AbstractGpuJoinIterator.scala:101)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at com.nvidia.spark.rapids.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:174)
        at com.nvidia.spark.rapids.GpuRangePartitioner$.$anonfun$sketch$1(GpuRangePartitioner.scala:54)
        at com.nvidia.spark.rapids.GpuRangePartitioner$.$anonfun$sketch$1$adapted(GpuRangePartitioner.scala:51)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:915)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:915)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

Steps/Code to reproduce bug
Run the NDS full power run, or just query64, and look for the ERROR in the executor/driver logs.

Expected behavior
NDS queries should run without exceptions.

I am wondering if this could be related to #9760?
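For context, the cudf AST expression parser rejects binary operations whose two operands do not have the same data type, which is what the "non-matching operand types" message above indicates: the join condition handed to `mixedInnerJoinGatherMaps` apparently compared columns of differing types. The sketch below is a hypothetical illustration of that kind of type check (it is not the actual cudf C++ implementation, and `ColumnRef`/`check_binary_op` are invented names):

```python
# Hypothetical sketch of an AST operand type check, illustrating why a join
# condition comparing, e.g., int32 to int64 columns would be rejected.
# Not the actual cudf expression_parser.cpp logic.
from dataclasses import dataclass


@dataclass
class ColumnRef:
    """A column reference in a join-condition AST, with its data type."""
    name: str
    dtype: str  # e.g. "int32", "int64", "decimal64"


def check_binary_op(op: str, left: ColumnRef, right: ColumnRef) -> str:
    """Validate operand types for a binary AST node; return the result type."""
    if left.dtype != right.dtype:
        raise TypeError(
            "An AST expression was provided non-matching operand types: "
            f"{left.dtype} {op} {right.dtype}"
        )
    # Comparisons yield a boolean; other ops keep the operand type.
    return "bool" if op in ("==", "!=", "<", ">", "<=", ">=") else left.dtype


# Matching types pass:
check_binary_op("==", ColumnRef("cs_item_sk", "int32"),
                ColumnRef("ss_item_sk", "int32"))
# Mismatched types raise, analogous to the CudfException above:
# check_binary_op("==", ColumnRef("a", "int32"), ColumnRef("b", "int64"))
```

If the change in #9760 altered the types (or casts) of columns feeding the join condition for query64, that would be consistent with this failure mode.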

jbrennan333 added the bug and "? - Needs Triage" labels on Dec 13, 2023

jbrennan333 (Collaborator, Author) commented:

I have confirmed that if I revert commit 7c307d4 (#9760), I do not see this exception, and query64 succeeds.
@winningsix can you take a look?

jbrennan333 (Collaborator, Author) commented:

Note that this does not fail for me when I run on spark2a at scale 3000; it fails consistently when I run on my desktop at scale 100.

mattahrens removed the "? - Needs Triage" label on Dec 18, 2023

sameerz (Collaborator) commented Jan 23, 2024:

Closing this issue as #9759 is tracking the original request, and this has been resolved by the revert.

sameerz closed this as completed on Jan 23, 2024