
[BUG] Reading some lists in parquet produces incorrect results #9240

Closed
revans2 opened this issue Sep 16, 2021 · 4 comments · Fixed by #9848
Labels
bug (Something isn't working), cuIO (cuIO issue), Spark (Functionality that helps Spark RAPIDS)

Comments

revans2 (Contributor) commented Sep 16, 2021

Describe the bug
When testing parquet list support, we were able to produce a file that Spark can read correctly but cudf cannot. Thanks @res-life for finding this. The data is simple:

+---------+
|a        |
+---------+
|[0, 1, 2]|
|[1, 2, 3]|
|[2, 3, 4]|
|[3, 4, 5]|
+---------+

But when cudf tries to read it, it gets confused and returns a flat column of 6 values instead:
1, 2, 1, 2, 3, 2
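
The same mismatch is visible through the Python API; a minimal sketch, assuming the attached file sits in the current working directory:

import cudf

# With the bug, column "a" comes back as a flat INT32 column of 6 values
# instead of a list<int32> column of 4 rows.
df = cudf.read_parquet("array-of-int32.parquet")
print(df["a"].dtype)  # should be ListDtype(int32)
print(len(df))        # should be 4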

Steps/Code to reproduce bug
Apply the following patch

diff --git a/cpp/tests/io/parquet_test.cpp b/cpp/tests/io/parquet_test.cpp
index 7260aa9e68..1344041f05 100644
--- a/cpp/tests/io/parquet_test.cpp
+++ b/cpp/tests/io/parquet_test.cpp
@@ -2184,6 +2184,17 @@ TEST_F(ParquetWriterStressTest, DeviceWriteLargeTableWithValids)
   CUDF_TEST_EXPECT_TABLES_EQUAL(custom_tbl.tbl->view(), expected->view());
 }
 
+TEST_F(ParquetReaderTest, SpecialArray)
+{
+  cudf_io::parquet_reader_options read_opts =
+    cudf_io::parquet_reader_options::builder(cudf_io::source_info{"./array-of-int32.parquet"});
+  auto result = cudf_io::read_parquet(read_opts);
+
+  // we should only get back 4 rows
+  EXPECT_EQ(result.tbl->view().column(0).type().id(), cudf::type_id::LIST);
+  EXPECT_EQ(result.tbl->view().column(0).size(), 4);
+}
+
 TEST_F(ParquetReaderTest, UserBounds)
 {
   // trying to read more rows than there are should result in

Then download and unzip array-of-int32.parquet.zip, place the extracted file in the directory you will run the tests from, and build and run just the new test:

./gtests/PARQUET_TEST --gtest_filter=ParquetReaderTest.SpecialArray
Note: Google Test filter = ParquetReaderTest.SpecialArray
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from ParquetReaderTest
[ RUN      ] ParquetReaderTest.SpecialArray
/home/roberte/src/cudf/cpp/tests/io/parquet_test.cpp:2194: Failure
Expected equality of these values:
  result.tbl->view().column(0).type().id()
    Which is: 4-byte object <03-00 00-00>
  cudf::type_id::LIST
    Which is: 4-byte object <18-00 00-00>
/home/roberte/src/cudf/cpp/tests/io/parquet_test.cpp:2195: Failure
Expected equality of these values:
  result.tbl->view().column(0).size()
    Which is: 6
  4
[  FAILED  ] ParquetReaderTest.SpecialArray (2 ms)
[----------] 1 test from ParquetReaderTest (2 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (2 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] ParquetReaderTest.SpecialArray

 1 FAILED TEST

Expected behavior
cudf should read back the same four list rows that Spark does.

Environment overview
I reproduced this on 21.10, but we see the same behavior in 21.08 too.

Additional context

Spark does not usually write out files like this. If I read the file in with Spark and write it out again, the test passes using that rewritten file instead of the attached one. The attached file came from a modified version of a Spark unit test; this pattern evidently occurs often enough that Spark wants to be sure it works.
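
For context, the attached file uses Parquet's legacy one-level list encoding (this is what the fix referenced below addresses), in which the repeated field itself carries the list values, while Spark normally writes the standard three-level encoding. A rough sketch of how to inspect this from Python with pyarrow, assuming the attached file is in the working directory:

import pyarrow.parquet as pq

# Print the low-level Parquet schema stored in the file footer.
print(pq.ParquetFile("array-of-int32.parquet").schema)

# A legacy one-level list is encoded roughly as:
#     repeated int32 a;
# whereas the standard three-level encoding Spark writes looks like:
#     optional group a (LIST) {
#       repeated group list {
#         optional int32 element;
#       }
#     }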

revans2 added the bug, Needs Triage, and Spark labels on Sep 16, 2021
beckernick added the cuIO label and removed the Needs Triage label on Sep 16, 2021
beckernick added this to the IO Data Type Expansion milestone on Sep 16, 2021
vuule (Contributor) commented Sep 20, 2021

@devavret

johnnyzhon commented Sep 24, 2021

Why can't I reproduce it?

Test data frame:

from datetime import date, datetime
from pyspark.sql.types import (ArrayType, BooleanType, ByteType, DateType,
                               DoubleType, FloatType, IntegerType, LongType,
                               ShortType, StringType, StructField, StructType,
                               TimestampType)

schema_array_of_primitive = StructType([
    StructField("arrayF_str", ArrayType(StringType())),
    StructField("arrayF_byte", ArrayType(ByteType())),
    StructField("arrayF_short", ArrayType(ShortType())),
    StructField("arrayF_int", ArrayType(IntegerType())),
    StructField("arrayF_long", ArrayType(LongType())),
    StructField("arrayF_float", ArrayType(FloatType())),
    StructField("arrayF_double", ArrayType(DoubleType())),
    StructField("arrayF_boolean", ArrayType(BooleanType())),
    StructField("arrayF_timestamp", ArrayType(TimestampType())),
    StructField("arrayF_date", ArrayType(DateType()))
])

data_array_of_primitive = [
    (["NVIDIA", "intel", "amd"], [40, 30, 20], [800, 700, 600], [4200, 3200, 2200], [-40000, 30001, 50000], [40.12, 40.13, 40.14], [4.00013, 4.00013, 4.00015], None, [datetime(2020, 2, 1, 12, 1, 1)], [date(1990, 10, 12)]),
    (["spark", "hdfs", "hadoop"], [30, 20, 10], [700, 600, 500], [3200, 2200, 2300], [-40000, 30000, 50001], None, [3.000013, 3.000013, 3.000014], [True, True, True], [datetime(2020, 2, 10, 12, 30, 30)], [date(2019, 3, 15)]),
    (["take out", "come back"], [20, 20, 10], [600, 400, 500], None, [-40001, 30000, 50000], [20.12, 20.13, 20.14], [2.000013, 12.000013, 2.000016], [True, True, True], [datetime(2020, 3, 1, 12, 1, 1)], [date(1992, 1, 1)]),
    (["Mellanox", "deepminD"], [40, 40, 10], [800, 800, 200], [4200, 5200, 6200], [-40000, 30000, 50000], [-20.12, 20.12, 20.13], [4.00012, 4.00018, 24.00013], [False, False, False], [datetime(2020, 4, 1, 12, 1, 1)], [date(1993, 1, 1)]),
    (["TEST", "dev", "product"], [30, 30, 20], [500, 600, 700], [3200, 7200, 1200], [40000, 30002, 50000], [20.12, 20.13, 20.14], [14.000013, 24.000013, 34.000013], [True, False, False], [datetime(2020, 7, 1, 12, 1, 1)], [date(1996, 1, 1)])
]
Steps:

  • df = spark.createDataFrame(data_array_of_primitive, schema_array_of_primitive)
  • df.write.parquet("/usr/app/test.parquet")
  • df_new = spark.read.parquet("/usr/app/test.parquet")
  • df_new.createOrReplaceTempView('test')
  • spark.sql("select arrayF_byte from test").show()

21/09/24 02:27:08 WARN GpuOverrides:
*Exec will run on GPU
*Partitioning <SinglePartition$> will run on GPU
!Exec cannot run on GPU because not all expressions can be replaced
@expression cast(arrayF_byte#51 as string) AS arrayF_byte#79 could run on GPU
!Expression cast(arrayF_byte#51 as string) cannot run on GPU because Cast from ArrayType(ByteType,true) to StringType is not supported
@expression arrayF_byte#51 could run on GPU
*Exec will run on GPU

+------------+
| arrayF_byte|
+------------+
|[30, 30, 20]|
|[40, 30, 20]|
|[20, 20, 10]|
|[40, 40, 10]|
|[30, 20, 10]|
+------------+
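
One way to see why this reproduction differs from the attached file is to inspect the schema of the file Spark wrote; a short sketch, assuming pyarrow is installed and the output path used in the steps above:

import glob
import pyarrow.parquet as pq

# Spark writes a directory of part files; look at the first one. Its list
# columns use the standard three-level (LIST) encoding rather than the
# legacy one-level encoding found in the attached array-of-int32.parquet.
part = glob.glob("/usr/app/test.parquet/part-*")[0]
print(pq.ParquetFile(part).schema)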

revans2 (Contributor, Author) commented Oct 1, 2021

@smone123 you cannot produce this type of parquet file with Spark, as I said before in the "Additional context" section. To make this work, we had to use the low-level Parquet Java API and do some very specific optimizations. Spark has a unit test that does this, and that is how we found it.

PointKernel (Member) commented:
Some updates on this issue: there could be kernel errors as well, but first, the schema is not properly interpreted when parsing Parquet's Thrift Compact Protocol encoded metadata. More specifically, the converted type returned by the read schema should contain LIST instead of INT32 or UNKNOWN.

rapids-bot bot pushed a commit that referenced this issue Dec 10, 2021
Closes #9240

This PR adds [one-level list encoding](https://github.com/apache/parquet-cpp/blob/master/src/parquet/schema.h#L43-L77) support to the Parquet reader. It also includes cleanups such as removing an unused stream argument and fixing typos in docs/comments.

Authors:
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Devavret Makkar (https://github.com/devavret)
  - Vukasin Milovanovic (https://github.com/vuule)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #9848