Use C++ to parse and filter parquet footers. #5310

revans2 · 2022-04-25T20:24:55Z

This still needs tests and documentation, but I wanted to get a PR up sooner rather than later.

This depends on NVIDIA/spark-rapids-jni#199

Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetScan.scala

sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala


gerashegalov · 2022-04-28T22:34:25Z

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetScan.scala

-          ParquetMetadataConverter.range(file.start, file.start + file.length))
+      val footer = footerReader match {
+        case ParquetFooterReaderType.NATIVE =>
+          System.err.println("NATIVE FOOTER READER...")


Switch to real logging once done with WIP
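A minimal sketch of what routing this message through a logger could look like. It uses `java.util.logging` only so the example has no dependencies; the actual plugin would presumably use whatever logging facility the rest of the file already uses. `FooterReaderType`, `FooterLogging`, `describe`, and `readFooter` are hypothetical names for illustration and are not part of the PR.

```scala
import java.util.logging.{Level, Logger}

// Hypothetical stand-in for the reader-type enum matched on in the diff above.
object FooterReaderType extends Enumeration {
  val JAVA, NATIVE = Value
}

object FooterLogging {
  // java.util.logging is an assumption made to keep the sketch self-contained.
  private val log = Logger.getLogger("FooterLogging")

  def describe(readerType: FooterReaderType.Value): String = readerType match {
    case FooterReaderType.NATIVE => "native footer reader"
    case FooterReaderType.JAVA => "java footer reader"
  }

  def readFooter(readerType: FooterReaderType.Value): String = {
    val msg = s"Reading parquet footer with the ${describe(readerType)}"
    // FINE is a debug-style level: it is filtered by logger configuration
    // instead of always being written to stderr like System.err.println.
    log.log(Level.FINE, msg)
    msg
  }
}
```

The message is returned as well as logged only so the sketch is easy to check; real code would not need that.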


abellina · 2022-04-29T22:03:51Z

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetScan.scala

-          ParquetMetadataConverter.range(file.start, file.start + file.length))
+      val footer = footerReader match {
+        case ParquetFooterReaderType.NATIVE =>
+          System.err.println("NATIVE FOOTER READER...")


agree with @gerashegalov

abellina · 2022-04-29T22:10:46Z

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetScan.scala

+
+  @throws[IOException]
+  override def read(buf: ByteBuffer): Int =
+    if (buf.hasArray) {


Maybe add a small comment somewhere on why we need two implementations, one for direct and one for JVM (heap) byte buffers.
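A minimal sketch of why a `read(buf: ByteBuffer)` can end up with two branches. It assumes the data source can only copy into a `byte[]`, as an off-heap buffer wrapper typically would; that is an assumption about the motivation, not something the PR states. `ArrayOnlySource` and `ByteBufferRead` are hypothetical names, not the classes from the PR.

```scala
import java.nio.ByteBuffer

// Hypothetical source that can only copy its bytes into a byte array.
final class ArrayOnlySource(data: Array[Byte]) {
  private var pos = 0
  def available: Int = data.length - pos
  def getBytes(dst: Array[Byte], dstOffset: Int, len: Int): Unit = {
    System.arraycopy(data, pos, dst, dstOffset, len)
    pos += len
  }
}

object ByteBufferRead {
  // Returns the number of bytes copied into buf, or -1 when the source is exhausted.
  def read(src: ArrayOnlySource, buf: ByteBuffer): Int = {
    if (src.available <= 0) {
      -1
    } else {
      val amount = math.min(src.available, buf.remaining())
      if (buf.hasArray) {
        // Heap buffer: copy straight into the backing array. The array write does
        // not move the buffer's position, so advance it by hand.
        src.getBytes(buf.array(), buf.arrayOffset() + buf.position(), amount)
        buf.position(buf.position() + amount)
      } else {
        // Direct buffer: there is no accessible backing array, so stage the bytes
        // in a temporary array. put() advances the position itself.
        val tmp = new Array[Byte](amount)
        src.getBytes(tmp, 0, amount)
        buf.put(tmp)
      }
      amount
    }
  }
}
```

The heap branch avoids the extra temporary array and copy; the direct branch pays for them because `array()` is not available on a direct buffer.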

abellina · 2022-04-29T22:11:45Z

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetScan.scala

-    "msg=constructor ParquetFileReader in class ParquetFileReader is deprecated"
-  )
+  private def addNamesAndCount(names: ArrayBuffer[String], children: ArrayBuffer[Int],
+      name: String, num_children: Int): Unit = {


Suggested change

-      name: String, num_children: Int): Unit = {
+      name: String, numChildren: Int): Unit = {

revans2 · 2022-05-09T19:39:29Z

build

revans2 · 2022-05-09T20:29:02Z

build

revans2 · 2022-05-09T20:30:02Z

@jlowe @abellina @gerashegalov could you please take another look? I think I have addressed all of the review comments.

sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetScan.scala

jlowe

All my comments are minor, mostly about potentially simplifying/optimizing the copies by leveraging HostMemoryBuffer's ability to alias as a DirectByteBuffer. Nothing is must-fix on my end.

revans2 · 2022-05-10T13:50:21Z

build

revans2 · 2022-05-10T13:51:08Z

Thanks for the review, @jlowe. I think I have addressed everything.

abellina · 2022-05-10T14:09:21Z

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetScan.scala

+  @throws[IOException]
+  override def readFully(buf: ByteBuffer): Unit = {
+    val requested = buf.remaining()
+    val avail = available()


I was curious why this readFully checks available() and throws before the read, whereas the others check the amount read and throw after the read. But this is not a blocker.
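The two styles being contrasted can be sketched as follows. This is an illustration under assumptions, not the PR's code: `BoundedStream` is a hypothetical in-memory stream whose `available()` is exact, which is what makes the up-front check possible at all.

```scala
import java.io.EOFException
import java.nio.ByteBuffer

// Hypothetical in-memory stream; available() is exact because the length is known.
final class BoundedStream(data: Array[Byte]) {
  private var pos = 0
  def available(): Int = data.length - pos

  // Style A: validate before reading. On failure neither the stream
  // nor the destination buffer has been modified.
  def readFullyCheckFirst(buf: ByteBuffer): Unit = {
    val requested = buf.remaining()
    val avail = available()
    if (requested > avail) {
      throw new EOFException(s"requested $requested bytes, only $avail available")
    }
    buf.put(data, pos, requested)
    pos += requested
  }

  // Style B: read what is there, then throw if it was short. On failure
  // the partial data has already been consumed and written to buf.
  def readFullyCheckAfter(buf: ByteBuffer): Unit = {
    val requested = buf.remaining()
    val amount = math.min(requested, available())
    buf.put(data, pos, amount)
    pos += amount
    if (amount < requested) {
      throw new EOFException(s"requested $requested bytes, read only $amount")
    }
  }
}
```

Both throw `EOFException` on a short stream; the observable difference is whether any bytes were consumed before the exception.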

revans2 · 2022-05-10T19:23:41Z

Converted to draft because I hit a bug with the latest code that I need to debug.


revans2 · 2022-05-10T19:46:15Z

build

revans2 · 2022-05-10T19:48:02Z

I reverted the last set of changes to the StreamReader. They were causing issues when reading large files. I don't know why ByteBuffers are only used for large files, but they are. Because the comments were nits, I decided to revert this, and I will file a follow-on issue to see if we can make it work properly, along with some investigation into why the tests were not taking that code path.

revans2 · 2022-05-10T19:51:46Z

I filed #5452 as the follow-on issue.

revans2 · 2022-05-11T15:47:16Z

build

Use C++ to parse and filter parquet footers.

53428ed

Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>

jlowe added the "performance" label (a performance related task/issue) Apr 26, 2022

jlowe added this to the Apr 18 - Apr 29 milestone Apr 26, 2022

jlowe reviewed Apr 26, 2022


gerashegalov marked this pull request as ready for review April 27, 2022 04:34

revans2 mentioned this pull request Apr 27, 2022

[FEA] Fix case insensitive match on native parquet column pruning rapidsai/cudf#10747

Closed

revans2 added 3 commits April 28, 2022 09:58

Deal with empty schema and added tests

263f09e

Updated docs

e9a67c5

Update tests

75a5519

gerashegalov reviewed Apr 28, 2022


sameerz modified the milestones: Apr 18 - Apr 29, May 2 - May 20 Apr 29, 2022

abellina reviewed Apr 29, 2022


revans2 added 3 commits May 6, 2022 11:34

Merge branch 'branch-22.06' into cpp_parquet_footer_parse

479d0f1

Some updates and rework

78658b8

Merge branch 'branch-22.06' into cpp_parquet_footer_parse

dcdc255

More rework

645117e

jlowe reviewed May 9, 2022


jlowe previously approved these changes May 9, 2022


Addressed a few more comments

62de08b

revans2 dismissed jlowe’s stale review via 62de08b May 10, 2022 13:50

abellina reviewed May 10, 2022


abellina previously approved these changes May 10, 2022


jlowe previously approved these changes May 10, 2022


revans2 marked this pull request as draft May 10, 2022 19:23

revans2 added 2 commits May 10, 2022 14:36

Revert "Addressed a few more comments"

769b73e

This reverts commit 62de08b.

Fix minor nit

3ff8b21

revans2 dismissed stale reviews from jlowe and abellina via 3ff8b21 May 10, 2022 19:45

revans2 marked this pull request as ready for review May 10, 2022 19:46

revans2 mentioned this pull request May 10, 2022

[FEA] explore using ByteBuffers for HMBSeekableInputStream #5452

Open

jlowe approved these changes May 10, 2022


abellina approved these changes May 10, 2022


revans2 merged commit 00c0a6c into NVIDIA:branch-22.06 May 11, 2022

revans2 deleted the cpp_parquet_footer_parse branch May 11, 2022 17:32
