
[BUG] Parquet unsigned int scan test failure #7213

Closed
jlowe opened this issue Nov 30, 2022 · 3 comments
Labels: bug, cudf_dependency
jlowe commented Nov 30, 2022

Have seen the following failure in premerge and nightly builds of 23.02 where one of the tests in ParquetScanSuite fails:

[2022-11-30T17:37:00.861Z] - Test Parquet nested unsigned int: uint8, uint16, uint32 *** FAILED ***
[2022-11-30T17:37:00.861Z]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 251.0 failed 1 times, most recent failure: Lost task 0.0 in stage 251.0 (TID 1125) (nightly-work2-690-q0bgr-jdx4z executor driver): java.lang.IllegalArgumentException: Cannot grow BufferHolder by size -8 because the size is negative
[2022-11-30T17:37:00.861Z] 	at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:67)
[2022-11-30T17:37:00.861Z] 	at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter.initialize(UnsafeArrayWriter.java:61)
[2022-11-30T17:37:00.861Z] 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_3$(Unknown Source)
[2022-11-30T17:37:00.861Z] 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
[2022-11-30T17:37:00.861Z] 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
[2022-11-30T17:37:00.861Z] 	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
[2022-11-30T17:37:00.861Z] 	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:365)
[2022-11-30T17:37:00.861Z] 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
[2022-11-30T17:37:00.861Z] 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
[2022-11-30T17:37:00.861Z] 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[2022-11-30T17:37:00.861Z] 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
[2022-11-30T17:37:00.861Z] 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
[2022-11-30T17:37:00.861Z] 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
[2022-11-30T17:37:00.861Z] 	at org.apache.spark.scheduler.Task.run(Task.scala:136)
[2022-11-30T17:37:00.861Z] 	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
[2022-11-30T17:37:00.861Z] 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
[2022-11-30T17:37:00.861Z] 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
[2022-11-30T17:37:00.861Z] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2022-11-30T17:37:00.861Z] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2022-11-30T17:37:00.861Z] 	at java.lang.Thread.run(Thread.java:750)
[2022-11-30T17:37:00.861Z] 
[2022-11-30T17:37:00.861Z] Driver stacktrace:
[2022-11-30T17:37:00.861Z]   at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)
[2022-11-30T17:37:00.862Z]   at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608)
[2022-11-30T17:37:00.862Z]   at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607)
[2022-11-30T17:37:00.862Z]   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
[2022-11-30T17:37:00.862Z]   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
[2022-11-30T17:37:00.862Z]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
[2022-11-30T17:37:00.862Z]   at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607)
[2022-11-30T17:37:00.862Z]   at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
[2022-11-30T17:37:00.862Z]   at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
[2022-11-30T17:37:00.862Z]   at scala.Option.foreach(Option.scala:407)
[2022-11-30T17:37:00.862Z]   ...
[2022-11-30T17:37:00.862Z]   Cause: java.lang.IllegalArgumentException: Cannot grow BufferHolder by size -8 because the size is negative
[2022-11-30T17:37:00.862Z]   at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:67)
[2022-11-30T17:37:00.862Z]   at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter.initialize(UnsafeArrayWriter.java:61)
[2022-11-30T17:37:00.862Z]   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_3$(Unknown Source)
[2022-11-30T17:37:00.862Z]   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
[2022-11-30T17:37:00.862Z]   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
[2022-11-30T17:37:00.862Z]   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
[2022-11-30T17:37:00.862Z]   at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:365)
[2022-11-30T17:37:00.862Z]   at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
[2022-11-30T17:37:00.862Z]   at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
[2022-11-30T17:37:00.862Z]   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[2022-11-30T17:37:00.862Z]   ...
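
For anyone decoding the error: Spark's UnsafeArrayWriter.initialize sizes its output buffer from the array's element count, and BufferHolder.grow rejects a negative request with exactly this message. So a list row whose length comes out negative (e.g. from corrupt offsets) is enough to trigger it. A simplified illustration of the arithmetic, pasteable into a Scala REPL (the offsets are hypothetical corrupt data, not the plugin's code, and the header math is simplified):

// Lengths of list rows are derived from adjacent offsets, so a
// non-monotonic offsets buffer yields a negative element count.
val offsets = Array(0, 4, 2)               // offsets must be non-decreasing
val numElements = offsets(2) - offsets(1)  // -2
val elementSize = 8                        // e.g. an 8-byte element type
// Roughly the size UnsafeArrayWriter asks BufferHolder to grow by:
val neededSize = 8 + elementSize * numElements  // -8 => IllegalArgumentException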
jlowe added the bug and ? - Needs Triage labels on Nov 30, 2022
jlowe commented Dec 2, 2022

I can make this reliably fail on my desktop by applying the following patch:

diff --git a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuDeviceManager.scala b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuDeviceManager.scala
index 7f34bb1ae..9ed55e3be 100644
--- a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuDeviceManager.scala
+++ b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuDeviceManager.scala
@@ -293,7 +293,7 @@ object GpuDeviceManager extends Logging {
       }
 
       Cuda.setDevice(gpuId)
-      Rmm.initialize(init, logConf, poolAllocation)
+      Rmm.initialize(RmmAllocationMode.CUDA_DEFAULT, logConf, 0L)
       RapidsBufferCatalog.init(conf)
 
       GpuShuffleEnv.init(conf, RapidsBufferCatalog.getDiskBlockManager())

and running:

mvn clean package -Dbuildver=330 -DwildcardSuites=com.nvidia.spark.rapids.ParquetScanSuite
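
For context on the patch: RmmAllocationMode.CUDA_DEFAULT with a zero-size pool bypasses the RMM pooling allocator and routes every device allocation straight through cudaMalloc. I can't say definitively why that matters, but changing the allocation pattern appears to be what makes the failure deterministic here.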

I tried running this with compute-sanitizer, but it did not find any errors.

jlowe commented Dec 2, 2022

Tracked this down to some bad offsets for a LIST column being generated from the chunked Parquet reader. @nvdbaranec was able to reproduce the problem in pure C++ code for libcudf.
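
For reference, a list column's offsets buffer for N rows should hold N+1 non-decreasing values starting at 0; the bad offsets violate that invariant. A quick host-side sanity check along those lines (a hypothetical helper, not plugin code):

// Hypothetical validator for a list column's offsets copied back to the host.
def checkListOffsets(offsets: Array[Int], numRows: Int): Unit = {
  require(offsets.length == numRows + 1,
    s"expected ${numRows + 1} offsets, got ${offsets.length}")
  require(offsets.head == 0, s"first offset must be 0, got ${offsets.head}")
  offsets.sliding(2).zipWithIndex.foreach { case (Array(a, b), i) =>
    require(b >= a, s"offsets decrease at row $i: $a -> $b")
  }
}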

jlowe added the cudf_dependency label on Dec 2, 2022
nvdbaranec commented Dec 2, 2022

Triaged this. It turns out this is another case of a malformed (but plausible in the wild) Parquet file. Essentially, we have a table with 2 rows in it. However, 3 of the columns (members of a struct) contain 4 values, which we translate as 4 rows. This causes the figure-out-chunks code to blow up quietly. We had a similar issue a while back that got fixed in the reader itself, but it looks like it has re-appeared in this new code path.

The filename jogged my memory too: nested-unsigned.parquet. It turns out this is the same file that caused the earlier issues rapidsai/cudf#11353 and #6147.
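
To make that shape concrete, here is a toy model of the mismatch (the 2-row/4-value counts come from the description above; the column names and types are made up):

// Toy model: the file footer declares 2 rows, but three struct-member
// columns each carry 4 values. Chunk-assembly logic that infers row
// counts from per-column value counts then disagrees with the footer.
case class ChunkMeta(column: String, numValues: Long)

val declaredRows = 2L
val chunks = Seq(ChunkMeta("s.a", 4L), ChunkMeta("s.b", 4L), ChunkMeta("s.c", 4L))
val suspect = chunks.filter(_.numValues != declaredRows)
// suspect is non-empty => the file is malformed and should be rejected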

sameerz removed the ? - Needs Triage label on Dec 6, 2022
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Dec 17, 2022
…2360)

Fixes:

NVIDIA/spark-rapids#7213
NVIDIA/spark-rapids#7228

This adds code to detect a subset of possibly malformed Parquet page data. Specifically: where the input file contains N rows, but the page data for some (non-list) columns contains a number of values != N. This is a very lightweight check.
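
A rough Scala sketch of the idea (illustrative only; the actual check lives in libcudf's C++ reader, and these names are made up):

// Per the description above: for each non-list column, the values across
// its pages must sum to the file's declared row count.
case class PageHeader(numValues: Long)

def validatePages(column: String, pages: Seq[PageHeader], fileRows: Long): Unit = {
  val total = pages.map(_.numValues).sum
  require(total == fileRows,
    s"Malformed Parquet: column '$column' has $total values but the file declares $fileRows rows")
}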

There is an associated PR for the spark plugin that should be merged immediately after this one (otherwise builds will fail) so I'm adding the Do Not Merge tag.

Authors:
  - https://github.com/nvdbaranec

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #12360