Support multi-threaded reading for avro[databricks] #5255

firestarman · 2022-04-14T07:57:33Z

This PR enables multi-threaded reading for avro.

It mainly:

- adds 4 relevant configs.
- creates the classes for the multi-threaded reading factory and reader, along with some utils.
- does some small refactoring to reduce duplicated code.
- updates the avro tests.

closes #5148

Signed-off-by: Firestarman <firestarmanllc@gmail.com>

firestarman · 2022-04-14T08:03:35Z

build

tgravescs · 2022-04-14T13:13:27Z

Can you file an issue for this? The description says this is multi-threaded; are we doing coalescing as well in the future? Please clarify in the issue.

tgravescs

I need to look a bit more at core logic still.

What testing was done on this? Did you manually verify the multi-reader is working? Did we do any perf test to show it's helping?
I know we didn't do it for other readers, but it would be nice if we had some test to verify we are picking up the right reader and the reader is doing what we expect.

docs/configs.md

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuFileSourceScanExec.scala

sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuMultiFileReader.scala

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/ExternalSource.scala

jlowe · 2022-04-14T14:29:20Z

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuAvroScan.scala

+        None
+      } else {
+        // Dump buffer for debugging when required
+        dumpDataToFile(hostBuf, bufSize, splits, Option(debugDumpPrefix), Some("avro"))


Nit: Rather than create a new Option object every time this is called, why not create it once when the RapidsConf value is read and pass that along to be reused? Similarly we're creating an "avro" Option object here that is almost never used, would be good to cache this and reuse to avoid unnecessary garbage creation.

Good suggestion.
Done
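
As a rough illustration of the suggestion above, the sketch below wraps the dump prefix in an `Option` once at construction time and shares a single `Some("avro")` across readers. All names here (`DumpOptions`, `ReaderConf`, `Reader`, `dumpPath`) are hypothetical and are not taken from the PR; only the idea of hoisting the `Option` creation out of the hot path comes from the review comment.

```scala
// Hypothetical sketch: names are illustrative, not the plugin's actual API.
object DumpOptions {
  // Stand-in for the value read from RapidsConf; null means dumping is disabled.
  final class ReaderConf(val debugDumpPrefix: String)

  // Created once and shared by every reader instance.
  val avroSuffix: Option[String] = Some("avro")

  final class Reader(conf: ReaderConf) {
    // Wrapped once here instead of calling Option(...) on every read.
    private val dumpPrefix: Option[String] = Option(conf.debugDumpPrefix)

    // Returns the path a debug dump would use, or None when dumping is disabled.
    def dumpPath(batchId: Int): Option[String] =
      for {
        prefix <- dumpPrefix
        suffix <- avroSuffix
      } yield s"$prefix$batchId.$suffix"
  }
}
```

In the common case where dumping is disabled, `dumpPrefix` is `None` and no new objects are allocated per call.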

wbo4958 · 2022-04-14T22:39:21Z

I need to look a bit more at core logic still.

What testing was done on this? Did you manually verify the multi-reader is working? Did we do any perf test to show it's helping? I know we didn't do it for other readers, but it would be nice if we had some test to verify we are picking up the right reader and the reader is doing what we expect.

@firestarman, please figure out how to add some unit tests for choosing the right reader for 'avro' in GpuReaderSuites.scala

wbo4958

Overall, LGTM

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuMultiFileReader.scala

wbo4958 · 2022-04-15T00:06:15Z

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuAvroScan.scala

+  private val maxNumFileProcessed = rapidsConf.maxNumAvroFilesParallel
+
+  // Disable coalescing reading until it is supported.
+  override val canUseCoalesceFilesReader: Boolean = false


Could we enable this, force the reader to multithreaded, and add the unit tests in GpuReaderSuite.scala?

Instead I changed the tests a little to cover the cases for the avro reader type check.
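
To make the reader type check discussed here concrete, the following is a deliberately simplified sketch of how a factory might fall back from a coalescing reader to a multi-threaded one when coalescing is not supported, as with `canUseCoalesceFilesReader = false` above. The `ReaderSelection` object, the `ReaderType` values and the `choose` function are assumptions made for illustration; the plugin's real selection logic is more involved and is not shown in this thread.

```scala
// Hypothetical sketch: a simplified model of reader selection, not the plugin's code.
object ReaderSelection {
  sealed trait ReaderType
  case object PerFile extends ReaderType
  case object Coalescing extends ReaderType
  case object MultiThreaded extends ReaderType

  // requested: the reader type asked for by configuration.
  // canUseCoalescing / canUseMultiThreaded: what the file format supports.
  def choose(
      requested: ReaderType,
      canUseCoalescing: Boolean,
      canUseMultiThreaded: Boolean): ReaderType =
    requested match {
      case Coalescing if canUseCoalescing => Coalescing
      // Coalescing was requested but is unsupported, or multi-threaded was requested.
      case Coalescing | MultiThreaded if canUseMultiThreaded => MultiThreaded
      case _ => PerFile
    }
}
```

A unit test for the avro case would then assert that a coalescing request resolves to the multi-threaded reader while coalescing stays disabled.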

wbo4958 · 2022-04-15T00:44:26Z

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuAvroScan.scala

+ * A PartitionReader that can read multiple AVRO files in parallel.
+ * This is most efficient running in a cloud environment where the I/O of reading is slow.
+ */
+class GpuMultiFileCloudAvroPartitionReader(


please add the parameters description

wbo4958 · 2022-04-15T00:50:06Z

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuAvroScan.scala

-case class AvroBlockMeta(header: Header, blocks: Seq[BlockInfo])
+  /** Estimate the total size from the given block meta */
+  private def estimateOutputSize(blockMeta: AvroBlockMeta): Long = {
+    // For simplicity, we just copy the whole header of AVRO


For now, it's OK to copy the whole header, but from a long-term point of view, we may need to figure out how to copy just the useful information.

Yes, we may do it in a later PR.
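
For readers following along, a minimal sketch of the "copy the whole header" estimate could look like the code below. The shape `AvroBlockMeta(header, blocks)` comes from the diff above, but the `sizeInBytes` and `offset` fields on `Header` and `BlockInfo` are assumptions made for this example, and the real `estimateOutputSize` in the PR may differ.

```scala
// Hypothetical sketch: field names on Header and BlockInfo are assumed, not from the PR.
object AvroSizeEstimate {
  final case class Header(sizeInBytes: Long)
  final case class BlockInfo(offset: Long, sizeInBytes: Long)
  final case class AvroBlockMeta(header: Header, blocks: Seq[BlockInfo])

  // Estimate the output buffer size as the full header plus every selected block.
  def estimateOutputSize(meta: AvroBlockMeta): Long =
    meta.header.sizeInBytes + meta.blocks.map(_.sizeInBytes).sum
}
```

Copying only the useful parts of the header, as suggested, would shrink the first term of this sum.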

firestarman · 2022-04-15T00:59:26Z

Can you file an issue for this? The description says this is multi-threaded; are we doing coalescing as well in the future? Please clarify in the issue.

Nice finding. We already have #5148 but forgot to link to it. Updated.
BTW, we have the epic issue #4831 to track all the TODOs for avro reading, including the coalescing.

firestarman · 2022-04-15T05:34:09Z

@firestarman, please figure out how to add some unit tests for choosing the right reader for 'avro' in GpuReaderSuites.scala

Added it, thanks for the information.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>

firestarman · 2022-04-15T08:00:17Z

I need to look a bit more at core logic still.

It mainly follows the logic of the ORC and Parquet multi-threaded framework.

What testing was done on this? Did you manually verify the multi-reader is working? Did we do any perf test to show it's helping?
I know we didn't do it for other readers, but it would be nice if we had some test to verify we are picking up the right reader and the reader is doing what we expect.

I have now added unit tests in GpuReaderSuites to verify that the reader type is picked up correctly, and integration tests for the multi-threaded reading functionality.
No perf test for now, but I plan to do it after supporting coalescing. If a perf test is required for this PR, I can prioritize it.

firestarman · 2022-04-15T08:01:09Z

build

firestarman · 2022-04-18T01:33:14Z

build

firestarman · 2022-04-18T01:34:04Z

Rerun premerge to verify it for databricks.

tgravescs · 2022-04-18T13:34:15Z

No perf test for now, but I plan to do it after supporting coalescing. If a perf test is required for this PR, I can prioritize it.

Personally I find it odd to put in something that is supposed to be for performance without showing it will help. I guess in this case we are pretty confident based on other readers already using it, and as long as we are doing it before 22.06 ships I guess I'm OK, but I'd prefer not to see it in the future.

firestarman · 2022-04-20T01:02:28Z

Waiting for benchmark numbers, and @HaoYang670 is helping on this.

firestarman · 2022-04-25T03:13:43Z

According to the tests we ran, the perf is quite bad for cloud envs. We have a solution, but it will take a long time, so I am closing this for now.

tgravescs · 2022-04-25T14:03:41Z

@firestarman can you provide some details about the issue and the proposed solution, if it's going to take time?

tgravescs · 2022-04-25T14:16:55Z

looks like described in #5304

GaryShen2008 · 2022-04-27T01:24:27Z

looks like described in #5304

Yes, @tgravescs could you please review the proposal?

Support multi-threaded reading for avro

50d681a

Signed-off-by: Firestarman <firestarmanllc@gmail.com>

wbo4958 requested review from wbo4958, jlowe and tgravescs April 14, 2022 08:01

tgravescs reviewed Apr 14, 2022

jlowe reviewed Apr 14, 2022

wbo4958 reviewed Apr 15, 2022

Address the comments

d17da6e

Signed-off-by: Firestarman <firestarmanllc@gmail.com>

firestarman requested review from revans2, GaryShen2008 and NvTimLiu as code owners April 15, 2022 07:34

doc updates

1c1c0d8

Signed-off-by: Firestarman <firestarmanllc@gmail.com>

firestarman requested review from wbo4958, jlowe and tgravescs April 15, 2022 07:54

jlowe approved these changes Apr 15, 2022

firestarman changed the title ~~Support multi-threaded reading for avro~~ Support multi-threaded reading for avro[databricks] Apr 18, 2022

sameerz added the performance A performance related task/issue label Apr 18, 2022

firestarman closed this Apr 25, 2022

firestarman deleted the avro-read branch April 26, 2022 08:18
