Support multi-threaded reading for avro[databricks] #5255
Conversation
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
build
Can you file an issue for this? Based on the description, if this is multi-threaded, are we also going to do coalescing in the future? Please clarify that in the issue.
I still need to look a bit more at the core logic.
What testing was done on this? Did you manually verify the multi-threaded reader is working, and did we run any perf test to show it's helping?
I know we didn't do it for the other readers, but it would be nice if we had some test to verify we are picking up the right reader and that it is doing what we expect.
Resolved review threads:
- sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuFileSourceScanExec.scala (outdated)
- sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala (outdated)
- sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuMultiFileReader.scala
- sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuMultiFileReader.scala (outdated)
- sql-plugin/src/main/scala/org/apache/spark/sql/rapids/ExternalSource.scala (outdated)
```scala
      None
    } else {
      // Dump buffer for debugging when required
      dumpDataToFile(hostBuf, bufSize, splits, Option(debugDumpPrefix), Some("avro"))
```
Nit: Rather than create a new Option object every time this is called, why not create it once when the RapidsConf value is read and pass that along to be reused? Similarly we're creating an "avro" Option object here that is almost never used, would be good to cache this and reuse to avoid unnecessary garbage creation.
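The reviewer's suggestion could look roughly like the sketch below: build the `Option` wrappers once up front and reuse them on every call. The surrounding class and names here are invented for illustration; only `debugDumpPrefix` and the `"avro"` tag come from the diff above.

```scala
// Hypothetical sketch: cache the Option wrappers once instead of allocating
// `Option(debugDumpPrefix)` and `Some("avro")` on every dump call.
class AvroDumpHelper(debugDumpPrefix: String) {
  // Created once when the config value is read, then reused
  private val dumpPrefix: Option[String] = Option(debugDumpPrefix)
  // The "avro" tag is constant, so it can also be cached
  private val dumpType: Option[String] = Some("avro")

  // Callers pass these along instead of re-wrapping the values each time
  def dumpArgs: (Option[String], Option[String]) = (dumpPrefix, dumpType)
}
```

`Option(x)` (rather than `Some(x)`) also handles a null prefix gracefully, mapping it to `None`.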
Good suggestion.
Done
@firestarman, please figure out how to add some unit tests for choosing the right reader for 'avro'.
Overall, LGTM
Resolved review thread:
- sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuMultiFileReader.scala
```scala
  private val maxNumFileProcessed = rapidsConf.maxNumAvroFilesParallel

  // Disable coalescing reading until it is supported.
  override val canUseCoalesceFilesReader: Boolean = false
```
Could we enable this, force the reader to the multi-threaded one, and add the unit tests in GpuReaderSuite.scala?
Instead I changed the tests a little to cover the cases for the avro reader type check.
```scala
/**
 * A PartitionReader that can read multiple AVRO files in parallel.
 * This is most efficient running in a cloud environment where the I/O of reading is slow.
 */
class GpuMultiFileCloudAvroPartitionReader(
```
Please add the parameter descriptions.
Done.
```scala
case class AvroBlockMeta(header: Header, blocks: Seq[BlockInfo])

/** Estimate the total size from the given block meta */
private def estimateOutputSize(blockMeta: AvroBlockMeta): Long = {
  // For simplicity, we just copy the whole header of AVRO
```
For now it's OK to copy the whole header, but from a long-term point of view we may need to figure out how to copy only the useful information.
Yes, we may do it in the later PR.
Nice finding. We already have #5148 but forgot to link to it. Updated.
Added it, thanks for the information.
It mainly follows the logic of the ORC and Parquet multi-threaded reading framework.
Now I have added unit tests in
build
build
Rerun premerge to verify it for databricks.
Personally I find it odd to put in something that is supposed to improve performance without showing that it helps. I guess in this case we are pretty confident, based on the other readers already using it, and as long as we are doing it before 22.06 ships I'm OK with it, but I'd prefer not to see this in the future.
Waiting for benchmark numbers, and @HaoYang670 is helping on this.
According to the tests we ran, the perf is quite bad for cloud envs. We have a solution, but it will take much time, so we'll close this first.
@firestarman can you provide some details about the issue and the proposed solution, if it's going to take time?
Looks like it is described in #5304.
Yes, @tgravescs could you please review the proposal?
This PR is to enable the multi-threaded reading for avro.
It has mainly
closes #5148
Signed-off-by: Firestarman <firestarmanllc@gmail.com>