
[BUG] Parquet unsigned int scan test failure #7213

Closed
jlowe opened this issue Nov 30, 2022 · 3 comments
Labels: bug, cudf_dependency
jlowe commented Nov 30, 2022

Have seen the following failure in premerge and nightly builds of 23.02 where one of the tests in ParquetScanSuite fails:

[2022-11-30T17:37:00.861Z] - Test Parquet nested unsigned int: uint8, uint16, uint32 *** FAILED ***
[2022-11-30T17:37:00.861Z]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 251.0 failed 1 times, most recent failure: Lost task 0.0 in stage 251.0 (TID 1125) (nightly-work2-690-q0bgr-jdx4z executor driver): java.lang.IllegalArgumentException: Cannot grow BufferHolder by size -8 because the size is negative
[2022-11-30T17:37:00.861Z] 	at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:67)
[2022-11-30T17:37:00.861Z] 	at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter.initialize(UnsafeArrayWriter.java:61)
[2022-11-30T17:37:00.861Z] 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_3$(Unknown Source)
[2022-11-30T17:37:00.861Z] 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
[2022-11-30T17:37:00.861Z] 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
[2022-11-30T17:37:00.861Z] 	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
[2022-11-30T17:37:00.861Z] 	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:365)
[2022-11-30T17:37:00.861Z] 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
[2022-11-30T17:37:00.861Z] 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
[2022-11-30T17:37:00.861Z] 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[2022-11-30T17:37:00.861Z] 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
[2022-11-30T17:37:00.861Z] 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
[2022-11-30T17:37:00.861Z] 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
[2022-11-30T17:37:00.861Z] 	at org.apache.spark.scheduler.Task.run(Task.scala:136)
[2022-11-30T17:37:00.861Z] 	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
[2022-11-30T17:37:00.861Z] 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
[2022-11-30T17:37:00.861Z] 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
[2022-11-30T17:37:00.861Z] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2022-11-30T17:37:00.861Z] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2022-11-30T17:37:00.861Z] 	at java.lang.Thread.run(Thread.java:750)
[2022-11-30T17:37:00.861Z] 
[2022-11-30T17:37:00.861Z] Driver stacktrace:
[2022-11-30T17:37:00.861Z]   at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)
[2022-11-30T17:37:00.862Z]   at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608)
[2022-11-30T17:37:00.862Z]   at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607)
[2022-11-30T17:37:00.862Z]   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
[2022-11-30T17:37:00.862Z]   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
[2022-11-30T17:37:00.862Z]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
[2022-11-30T17:37:00.862Z]   at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607)
[2022-11-30T17:37:00.862Z]   at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
[2022-11-30T17:37:00.862Z]   at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
[2022-11-30T17:37:00.862Z]   at scala.Option.foreach(Option.scala:407)
[2022-11-30T17:37:00.862Z]   ...
[2022-11-30T17:37:00.862Z]   Cause: java.lang.IllegalArgumentException: Cannot grow BufferHolder by size -8 because the size is negative
[2022-11-30T17:37:00.862Z]   at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:67)
[2022-11-30T17:37:00.862Z]   at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter.initialize(UnsafeArrayWriter.java:61)
[2022-11-30T17:37:00.862Z]   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_3$(Unknown Source)
[2022-11-30T17:37:00.862Z]   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
[2022-11-30T17:37:00.862Z]   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
[2022-11-30T17:37:00.862Z]   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
[2022-11-30T17:37:00.862Z]   at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:365)
[2022-11-30T17:37:00.862Z]   at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
[2022-11-30T17:37:00.862Z]   at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
[2022-11-30T17:37:00.862Z]   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[2022-11-30T17:37:00.862Z]   ...
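
For anyone decoding the error: Spark's UnsafeArrayWriter.initialize sizes its output buffer from the array's element count, and BufferHolder.grow rejects a negative request with exactly this message. So a list row whose length comes out negative (e.g. from corrupt offsets) is enough to trigger it. A simplified illustration of the arithmetic, pasteable into a Scala REPL (the offsets are hypothetical corrupt data, not the plugin's code, and the header math is simplified):

// Lengths of list rows are derived from adjacent offsets, so a
// non-monotonic offsets buffer yields a negative element count.
val offsets = Array(0, 4, 2)               // offsets must be non-decreasing
val numElements = offsets(2) - offsets(1)  // -2
val elementSize = 8                        // e.g. an 8-byte element type
// Roughly the size UnsafeArrayWriter asks BufferHolder to grow by:
val neededSize = 8 + elementSize * numElements  // -8 => IllegalArgumentException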
jlowe added the bug and ? - Needs Triage labels on Nov 30, 2022
jlowe commented Dec 2, 2022

I can make this reliably fail on my desktop by applying the following patch:

diff --git a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuDeviceManager.scala b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuDeviceManager.scala
index 7f34bb1ae..9ed55e3be 100644
--- a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuDeviceManager.scala
+++ b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuDeviceManager.scala
@@ -293,7 +293,7 @@ object GpuDeviceManager extends Logging {
       }
 
       Cuda.setDevice(gpuId)
-      Rmm.initialize(init, logConf, poolAllocation)
+      Rmm.initialize(RmmAllocationMode.CUDA_DEFAULT, logConf, 0L)
       RapidsBufferCatalog.init(conf)
 
       GpuShuffleEnv.init(conf, RapidsBufferCatalog.getDiskBlockManager())

and running:

mvn clean package -Dbuildver=330 -DwildcardSuites=com.nvidia.spark.rapids.ParquetScanSuite
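
For context on the patch: RmmAllocationMode.CUDA_DEFAULT with a zero-size pool bypasses the RMM pooling allocator and routes every device allocation straight through cudaMalloc. I can't say definitively why that matters, but changing the allocation pattern appears to be what makes the failure deterministic here.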

I tried running this with compute-sanitizer, but it did not find any errors.

jlowe commented Dec 2, 2022

Tracked this down to some bad offsets for a LIST column being generated from the chunked Parquet reader. @nvdbaranec was able to reproduce the problem in pure C++ code for libcudf.
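
For reference, a list column's offsets buffer for N rows should hold N+1 non-decreasing values starting at 0; the bad offsets violate that invariant. A quick host-side sanity check along those lines (a hypothetical helper, not plugin code):

// Hypothetical validator for a list column's offsets copied back to the host.
def checkListOffsets(offsets: Array[Int], numRows: Int): Unit = {
  require(offsets.length == numRows + 1,
    s"expected ${numRows + 1} offsets, got ${offsets.length}")
  require(offsets.head == 0, s"first offset must be 0, got ${offsets.head}")
  offsets.sliding(2).zipWithIndex.foreach { case (Array(a, b), i) =>
    require(b >= a, s"offsets decrease at row $i: $a -> $b")
  }
}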

jlowe added the cudf_dependency label on Dec 2, 2022
nvdbaranec commented Dec 2, 2022

Triaged this. It turns out this is another case of a malformed (but plausible in the wild) Parquet file. Essentially, we have a table with 2 rows in it. However, 3 of the columns (members of a struct) contain 4 values, which we translate as 4 rows. This causes the figure-out-chunks code to blow up quietly. We had a similar issue a while back that got fixed in the reader itself, but it looks like it has re-appeared in this new code path.

The filename jogged my memory too: nested-unsigned.parquet. It turns out this is the same file that caused the earlier issues rapidsai/cudf#11353 and #6147.
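
To make that shape concrete, here is a toy model of the mismatch (the 2-row/4-value counts come from the description above; the column names and types are made up):

// Toy model: the file footer declares 2 rows, but three struct-member
// columns each carry 4 values. Chunk-assembly logic that infers row
// counts from per-column value counts then disagrees with the footer.
case class ChunkMeta(column: String, numValues: Long)

val declaredRows = 2L
val chunks = Seq(ChunkMeta("s.a", 4L), ChunkMeta("s.b", 4L), ChunkMeta("s.c", 4L))
val suspect = chunks.filter(_.numValues != declaredRows)
// suspect is non-empty => the file is malformed and should be rejected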

sameerz removed the ? - Needs Triage label on Dec 6, 2022
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Dec 17, 2022
…2360)

Fixes:

NVIDIA/spark-rapids#7213
NVIDIA/spark-rapids#7228

This adds code to detect a subset of possibly malformed Parquet page data. Specifically: where the input file contains N rows, but the page data for some (non-list) columns contains a number of values != N. This is a very lightweight check.
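
A rough Scala sketch of the idea (illustrative only; the actual check lives in libcudf's C++ reader, and these names are made up):

// Per the description above: for each non-list column, the values across
// its pages must sum to the file's declared row count.
case class PageHeader(numValues: Long)

def validatePages(column: String, pages: Seq[PageHeader], fileRows: Long): Unit = {
  val total = pages.map(_.numValues).sum
  require(total == fileRows,
    s"Malformed Parquet: column '$column' has $total values but the file declares $fileRows rows")
}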

There is an associated PR for the spark plugin that should be merged immediately after this one (otherwise builds will fail) so I'm adding the Do Not Merge tag.

Authors:
  - https://github.com/nvdbaranec

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #12360