Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support multi-threaded reading for avro[databricks] #5255

Closed
wants to merge 3 commits into from

Conversation

firestarman
Copy link
Collaborator

@firestarman firestarman commented Apr 14, 2022

This PR is to enable the multi-threaded reading for avro.

It has mainly

  • added 4 relevant configs.
  • created the classes for the multi-threaded reading factory and reader, along with some utils.
  • done some small refactor to reduce some duplicated code.
  • updated the avro tests.

closes #5148

Signed-off-by: Firestarman firestarmanllc@gmail.com

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Copy link
Collaborator Author

build

@tgravescs
Copy link
Collaborator

can you file an issue for this? By description if this is multi-threaded, are we doing coalescing as well in the future - clarify in issue please.

Copy link
Collaborator

@tgravescs tgravescs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to look a bit more at core logic still.

What testing was done on this? Did you manually verify the multi-reader is working, did we do any perf test to show its helping?
I know we didn't do it for other readers but it would be nice if we had some test to verify we are picking up the right reader and the reader is doing what we expect.

docs/configs.md Outdated Show resolved Hide resolved
None
} else {
// Dump buffer for debugging when required
dumpDataToFile(hostBuf, bufSize, splits, Option(debugDumpPrefix), Some("avro"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Rather than create a new Option object every time this is called, why not create it once when the RapidsConf value is read and pass that along to be reused? Similarly we're creating an "avro" Option object here that is almost never used, would be good to cache this and reuse to avoid unnecessary garbage creation.

Copy link
Collaborator Author

@firestarman firestarman Apr 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion.
Done

@wbo4958
Copy link
Collaborator

wbo4958 commented Apr 14, 2022

I need to look a bit more at core logic still.

What testing was done on this? Did you manually verify the multi-reader is working, did we do any perf test to show its helping? I know we didn't do it for other readers but it would be nice if we had some test to verify we are picking up the right reader and the reader is doing what we expect.

@firestarman, please figure out to add some unit tests for choosing the right reader for 'avro' in GpuReaderSuites.scala

Copy link
Collaborator

@wbo4958 wbo4958 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, LGTM

private val maxNumFileProcessed = rapidsConf.maxNumAvroFilesParallel

// Disable coalescing reading until it is supported.
override val canUseCoalesceFilesReader: Boolean = false
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we enable this and force the reader to multithreaded and add the unit tests in GpuReaderSuite.scala

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead I changed the tests a little to cover the cases for the avro reader type check.

* A PartitionReader that can read multiple AVRO files in parallel.
* This is most efficient running in a cloud environment where the I/O of reading is slow.
*/
class GpuMultiFileCloudAvroPartitionReader(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add the parameters description

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

case class AvroBlockMeta(header: Header, blocks: Seq[BlockInfo])
/** Estimate the total size from the given block meta */
private def estimateOutputSize(blockMeta: AvroBlockMeta): Long = {
// For simplicity, we just copy the whole header of AVRO
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, it's ok to copy the whole header, but from a long-term point of view, We may need to figure out how to just copy the useful information.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we may do it in the later PR.

@firestarman
Copy link
Collaborator Author

firestarman commented Apr 15, 2022

can you file an issue for this? By description if this is multi-threaded, are we doing coalescing as well in the future - clarify in issue please.

Nice finding. We already have #5148 but forgot to link to it. Updated.
BTW, we have the epic issue #4831 to track all the TODOs for avro reading, including the coalescing.

@firestarman
Copy link
Collaborator Author

@firestarman, please figure out to add some unit tests for choosing the right reader for 'avro' in GpuReaderSuites.scala

Added it, thanks for the information.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Copy link
Collaborator Author

firestarman commented Apr 15, 2022

I need to look a bit more at core logic still.

It mainly follows the logic of orc and parquet multi-threaded framework.

What testing was done on this? Did you manually verify the multi-reader is working, did we do any perf test to show its helping?
I know we didn't do it for other readers but it would be nice if we had some test to verify we are picking up the right reader and the reader is doing what we expect.

Now I have added unit tests in GpuReaderSuites to veirfy the reader type being picked up correctly and integration tests for the multi-threaded reading functionality.
No perf test for now, but I plan to do it after supporting the coalescing. If perf test is required for this PR, I can priority it.

@firestarman
Copy link
Collaborator Author

build

@firestarman firestarman changed the title Support multi-threaded reading for avro Support multi-threaded reading for avro[databricks] Apr 18, 2022
@firestarman
Copy link
Collaborator Author

build

@firestarman
Copy link
Collaborator Author

firestarman commented Apr 18, 2022

Rerun premerge to verify it for databricks.

@sameerz sameerz added the performance A performance related task/issue label Apr 18, 2022
@tgravescs
Copy link
Collaborator

No perf test for now, but I plan to do it after supporting the coalescing. If perf test is required for this PR, I can priority it.

Personally I find it odd to put in something that is supposed to be for performance without showing it will help. I guess in this case we are pretty confident based on other readers already using it and I as long as we are doing it before 22.06 ships I guess I"m ok, but prefer not to see it in the future.

@firestarman
Copy link
Collaborator Author

Waiting for benchmark numbers, and @HaoYang670 is helping on this.

@firestarman
Copy link
Collaborator Author

According to the tests we ran, the perf is quite bad for cloud envs. We have a solution and it will take much time, so close this first.

@tgravescs
Copy link
Collaborator

@firestarman can you provide some details about the issue and then proposed solution if its going to take time?

@tgravescs
Copy link
Collaborator

looks like described in #5304

@firestarman firestarman deleted the avro-read branch April 26, 2022 08:18
@GaryShen2008
Copy link
Collaborator

looks like described in #5304

Yes, @tgravescs could you please review the proposal?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance A performance related task/issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add the MULTI-THREADED reading support for avro
6 participants