
Support multi-threaded reading for avro #5421

Merged
merged 9 commits into NVIDIA:branch-22.06 on May 20, 2022

Conversation

firestarman
Collaborator

@firestarman firestarman commented May 4, 2022

This PR enables multi-threaded reading for Avro.

It mainly

  • renamed the original AvroDataFileReader to AvroMetaFileReader to reflect its actual behavior, namely collecting blocks' metadata.
  • created a new reader named AvroDataFileReader to read the data block by block in the iterator pattern.
  • created a new AvroFileReader as the parent of AvroDataFileReader and AvroMetaFileReader.
  • implemented the core GpuMultiFileCloudAvroPartitionReader, which leverages the newly added AvroDataFileReader to read the data from cloud files directly instead of collecting blocks' metadata first. We do this because an Avro file has no dedicated section for blocks' metadata, so collecting the metadata across a whole file can take a very long time when the file is on the cloud and has many blocks, leading to quite bad performance.
  • added tests for it.
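The block-by-block iterator reading described above can be sketched, in greatly simplified form, as follows (Java, with illustrative names only, not the plugin's actual API; the real reader decodes each block's row count, size, and sync marker from the stream):

```java
import java.util.Iterator;
import java.util.List;

// Simplified sketch of the iterator-pattern reading: the reader yields one
// raw block at a time while streaming through the file, instead of scanning
// all blocks' metadata up front.
class BlockIteratorSketch {
    // Stand-in for a raw Avro block: a row count plus the raw payload size.
    static final class RawBlock {
        final long rowCount;
        final long payloadSize;
        RawBlock(long rowCount, long payloadSize) {
            this.rowCount = rowCount;
            this.payloadSize = payloadSize;
        }
    }

    // Consume blocks one by one (hasNextBlock/readNextRawBlock in spirit)
    // and return the total row count read.
    static long totalRows(List<RawBlock> file) {
        long rows = 0;
        Iterator<RawBlock> it = file.iterator();
        while (it.hasNext()) {
            rows += it.next().rowCount;
        }
        return rows;
    }
}
```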

Performance for s3 compatible storage (seconds)

  • Data files are on the cloud

  • The test ran on the local machine (CPU 12 cores, and one GPU (Titan V, with 12GB memory))

    Data Size CPU PERFILE MULTI-THREADED
    147MB 103.027 830.735 81.51
    1GB 679.514 3632.736 229.859

closes #5148
closes #5304

Signed-off-by: Firestarman firestarmanllc@gmail.com

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

@sameerz sameerz added the performance A performance related task/issue label May 4, 2022
@sameerz sameerz added this to the May 2 - May 20 milestone May 4, 2022
@firestarman
Collaborator Author

@tgravescs Could you help review this?

@firestarman firestarman requested a review from jlowe May 5, 2022 01:59
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

1 similar comment
@firestarman
Collaborator Author

build

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

fix conflicts and rebase to the top commit

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

2 similar comments
@firestarman
Collaborator Author

build

@pxLi
Collaborator

pxLi commented May 9, 2022

build

@tgravescs
Collaborator

I haven't looked at code yet so maybe it will be answered there, but why is the PERFILE so much worse here, can't you use the same read technique there?

Collaborator

@tgravescs tgravescs left a comment

sorry haven't made it all through yet, just posting what I have so far

@firestarman
Collaborator Author

firestarman commented May 11, 2022

I haven't looked at code yet so maybe it will be answered there, but why is the PERFILE so much worse here, can't you use the same read technique there?

It is due to the same reason, and I tried to apply this new Avro reader to PERFILE locally, but it got a little worse perf for reading local files. Since we already have the multi-threaded reader for cloud cases, I am not sure it would be good to do this.

Local:
            Files     CPU        PERFILE-new   PERFILE-old
            400       25.023     10.172        9.582
            1000      27.141     16.441        12.722
            2000      27.179     26.146        21.882

Cloud:
            Data Size CPU        PERFILE-new   PERFILE-old
            128GB     94.134     88.834        830.753
            1GB       402.612    341.626       3632.736

We can discuss more in this tracking issue #5458

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

Resolve the conflicts.

@firestarman
Collaborator Author

build

@firestarman
Collaborator Author

@tgravescs Could you review this again?

Collaborator

@tgravescs tgravescs left a comment

Main concern here is compression on the avro files and our reading estimation? Maybe I'm missing something there.
Have we tested with different avro files with different compression codecs?

@@ -282,6 +287,10 @@ trait GpuAvroReaderBase extends Arm with Logging { self: FilePartitionReaderBase

def readDataSchema: StructType

def conf: Configuration

val cacheBufferSize = conf.getInt("avro.read.allocation.size", 8 * 1024 * 1024)
Collaborator

comment where this default came from and why 8MB makes sense

Collaborator Author

Just copied it from the Parquet reader code.
val copyBufferSize = conf.getInt("parquet.read.allocation.size", 8 * 1024 * 1024)

@@ -221,10 +216,20 @@ case class GpuAvroMultiFilePartitionReaderFactory(
/**
* Build the PartitionReader for cloud reading
*/
Collaborator

it would be nice to have a high level description of the approach it took, similar to what we describe in #5458. Really we should have done it for all the readers, about how each goes about filtering and copying. Either here or in the GpuMultiFileCloudAvroPartitionReader class. The reason I put it here is that the coalescing one filters the blocks in buildBaseColumnarReaderForCoalescing

Collaborator Author

Added

AvroBlockMeta(reader.header, reader.headerSize, filteredBlocks)
}
val reader = closeOnExcept(in) { _ => AvroFileReader.openMetaReader(in) }
withResource(reader) { _ =>
Collaborator

this is odd, just do withResource(AvroFileReader.openMetaReader(in))

Collaborator Author

This is to ensure the opened in will be closed if an exception is thrown when calling AvroFileReader.openMetaReader.

}

/**
* AvroDataFileReader reads the Avro file data in the iterator pattern.
Collaborator

this is only used by the multifile cloud reader, correct? It might be good to expand the comment on what we mean by iterator pattern.

Collaborator Author

Yes, added

*/
def readNextRawBlock(out: OutputStream): Unit = {
// This is designed to reduce the data copy as much as possible.
// Currently it leverages the BinaryDecoder, and data will be copied twice.
Collaborator

specify where its copied twice, ie once to x, again to y

Collaborator Author

Done

val startingBytesRead = fileSystemBytesRead()
val in = new FsInput(new Path(new URI(partFile.filePath)), config)
val reader = closeOnExcept(in) { _ => AvroFileReader.openDataReader(in) }
withResource(reader) { _ =>
Collaborator

just use withResource(AvroFileReader.openDataReader(in)).. or am I missing something you were trying to catch with the closeOnExcept here?

Collaborator Author

This is to ensure the opened in will be closed if an exception is thrown when calling AvroFileReader.openMetaReader.

Collaborator

I don't understand the comment, I don't see you using reader outside of withResource and withResource has a finally block that will close it as well.

Collaborator Author

@firestarman firestarman May 19, 2022

First, it opens an FsInput as in, which should be closed.
Next, it tries to open a reader with this in. If that succeeds, a reader is created, and this in will be closed when the reader is closed by withResource(reader). If it fails, closeOnExcept(in) makes sure this in gets closed.

withResource(AvroFileReader.openDataReader(in)) would ensure both the reader and in are closed only when everything succeeds, but in would leak if an exception were thrown while opening the reader.
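The ownership hand-off being described can be sketched as a Java analog (illustrative names only, not the plugin's Scala API): `in` is closed when opening the reader throws, and on success closing the reader also closes `in`.

```java
import java.io.ByteArrayInputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;

// Java analog of the closeOnExcept pattern: `in` must not leak if the
// reader's constructor throws; once the reader opens, it owns `in`.
class CloseOnExceptSketch {
    // Input stream that records whether it was closed (test helper).
    static class TrackedInput extends ByteArrayInputStream {
        boolean closed = false;
        TrackedInput() { super(new byte[0]); }
        @Override public void close() { closed = true; }
    }

    static class Reader implements Closeable {
        final InputStream in;
        Reader(InputStream in, boolean failOnOpen) throws IOException {
            if (failOnOpen) throw new IOException("bad header");
            this.in = in;
        }
        @Override public void close() throws IOException { in.close(); }
    }

    // Mirrors closeOnExcept(in) { _ => openDataReader(in) }.
    static Reader open(InputStream in, boolean failOnOpen) throws IOException {
        try {
            return new Reader(in, failOnOpen);
        } catch (IOException e) {
            in.close(); // avoid leaking `in` when opening the reader fails
            throw e;
        }
    }
}
```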

Collaborator

Ah, I see, thanks I missed that, this is still a weird way to do that because if something fails between the closeOnExcept and the withResources then reader is leaked. To me it makes more sense to have the openDatareader handle closing in if an exception is thrown. What if you changed openDataReader to pass in the filepath and have it create the FSInput and have the closeOnException?

Collaborator

actually I think you can pass the file path all the way into AvroFileReader and have it deal with it. That way its close method can skip closing it if was never opened.

Collaborator Author

Thanks a lot, I will take it as a follow up.

Collaborator Author

Here it is #5554

* Better to check its existence by calling 'hasNextBlock' first.
* This will not move the reader position forward.
*/
def peekBlock(reuse: MutableBlockInfo): MutableBlockInfo = {
Collaborator

why is this mutable? that is definitely not a normal scala thing to do. did you specifically see issues with it creating new ones? If not please make it immutable

Collaborator Author

@firestarman firestarman May 17, 2022

No issue has been seen yet, but this greatly reduces the number of temporary objects in the JVM, which can cut down GC time, just like the handling below, which is from Avro's DataFileStream.

  public D next(D reuse) throws IOException {
    if (!hasNext())
      throw new NoSuchElementException();
    D result = reader.read(reuse, datumIn);
    if (0 == --blockRemaining) {
      blockFinished();
    }
    return result;
  }

Without this reuse, each block will create a temporary object in the JVM. In our tests, that is about 1000 objects for a file, but the number can be reduced to 2 with this reuse. Besides, we can reuse an instance because we do not need all the info objects at the same time, and an object is no longer needed right after a new call to peekBlock.

Anyway I am fine to make it immutable if you prefer.
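The reuse idea above can be sketched minimally (Java, illustrative names only): one mutable holder is refilled per block instead of allocating a fresh info object each time.

```java
// Sketch of the peekBlock(reuse) pattern: refill the caller-supplied holder
// when one is given, allocate only when there is none, so a loop over many
// blocks creates O(1) temporary objects instead of one per block.
class BlockInfoReuseSketch {
    static final class MutableBlockInfo {
        long count;     // rows in the block
        long blockSize; // bytes in the block
    }

    static MutableBlockInfo peek(long count, long blockSize,
                                 MutableBlockInfo reuse) {
        MutableBlockInfo out = (reuse != null) ? reuse : new MutableBlockInfo();
        out.count = count;
        out.blockSize = blockSize;
        return out;
    }
}
```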

// block meta through the file is quite expensive for files in cloud. So we do
// not know the target buffer size ahead. Then we have to do an estimation.
// "the estimated total block size = partFile.length + additional space"
// Letting "additional space = one block length * 1.2" is we may move the
Collaborator

not exactly sure what is meant here. perhaps "is we may" should be "allows us to" ?

Collaborator Author

Sorry for the confusion, updated. It should be "is because we may".
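The estimate described in the snippet can be sketched as simple arithmetic (illustrative names): since the blocks' metadata is not known ahead of time for cloud files, the target buffer is sized as the partition length plus roughly 1.2x one block of headroom, because sync(start) may land the reader up to one block before the partition.

```java
// Sketch of: estimated total block size = partFile.length + additional space,
// with additional space = one block length * 1.2.
class BufferEstimateSketch {
    static long estimateBufferSize(long partitionLength, long oneBlockLength) {
        return partitionLength + (long) (oneBlockLength * 1.2);
    }
}
```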

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

firestarman commented May 17, 2022

Main concern here is compression on the avro files and our reading estimation? Maybe I'm missing something there.
Have we tested with different avro files with different compression codecs?

Not yet, but this will not be a problem. The partFile.length is the compressed size, and the reader also copies the compressed data to batch buffers as is.

@firestarman
Collaborator Author

build

@tgravescs
Collaborator

ok, can we make sure we add tests for all the compression types, we should have them in parquet and orc, especially want to make sure if its a type we don't support that we fail and don't return wrong data or something like that. If that is a followup thats fine since avro is off by default.

@firestarman
Collaborator Author

ok, can we make sure we add tests for all the compression types, we should have them in parquet and orc, especially want to make sure if its a type we don't support that we fail and don't return wrong data or something like that. If that is a followup thats fine since avro is off by default.

Sure, added it as a follow-up in #4831

Collaborator

@wbo4958 wbo4958 left a comment

Overall, LGTM

*
* When reading a file, it
* - seeks to the start position of the first block located in this partition.
* - next, parses the meta and sync, rewirtes the meta and sync, and copies the data to a
Collaborator

typo: rewirtes -> rewrites

Collaborator Author

nice catch

Collaborator Author

updated

// Search for the sequence of bytes in the stream using Knuth-Morris-Pratt
var i = 0L
var j = 0
var b = in.read()
Collaborator

maybe we can create a followup for improving this. in.read() reads only 1 byte each time; maybe reading into some buffer could be better.

Collaborator Author

@firestarman firestarman May 18, 2022

IIUC, the underlying StreamSource in BinaryDecoder does this.
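A runnable sketch of the Knuth-Morris-Pratt scan in the snippet above (illustrative names; the real code searches the stream for the file's 16-byte sync marker), reading one byte at a time the way in.read() does:

```java
import java.io.IOException;
import java.io.InputStream;

// KMP search over a byte stream: returns the 0-based offset where the
// pattern starts, or -1 if it is not found before end of stream.
class SyncMarkerSearch {
    static long kmpFind(InputStream in, byte[] pattern) throws IOException {
        // Build the KMP failure table for the pattern.
        int[] fail = new int[pattern.length];
        for (int p = 1, k = 0; p < pattern.length; p++) {
            while (k > 0 && pattern[p] != pattern[k]) k = fail[k - 1];
            if (pattern[p] == pattern[k]) k++;
            fail[p] = k;
        }
        long i = 0; // bytes consumed from the stream
        int j = 0;  // pattern bytes matched so far
        int b;
        while ((b = in.read()) != -1) {
            i++;
            while (j > 0 && (byte) b != pattern[j]) j = fail[j - 1];
            if ((byte) b == pattern[j]) j++;
            if (j == pattern.length) return i - pattern.length; // match start
        }
        return -1;
    }
}
```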

withResource(reader) { _ =>
// Go to the start of the first block after the start position
reader.sync(partFile.start)
if (!reader.hasNextBlock || isDone) {
Collaborator

Looks like this piece of code is redundant, since the "while" below can cover this?

Collaborator Author

No, this is for the following estimation of the buffer size. Besides, it checks the isDone flag here at the same time.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

1 similar comment
@firestarman
Collaborator Author

build

Collaborator

@wbo4958 wbo4958 left a comment

LGTM

@firestarman firestarman merged commit aabe71a into NVIDIA:branch-22.06 May 20, 2022
@firestarman firestarman deleted the avro-multi branch May 20, 2022 02:12
Successfully merging this pull request may close these issues.

  • [FEA] Optimize remote Avro reading for a PartitionFile
  • Add the MULTI-THREADED reading support for avro
5 participants