Parquet coalesce file reader for local filesystems #990
Conversation
* Add back the small file consolidation for the parquet reader for non-cloud environments
* make resolveURI local
* debug
* fix debug
* Cleanup
* rework names
* Fix bug in footer position
* Add input file transition logic back and update tests
* Update configs so we can control multi-file optimization, multi-file read, and the coalesce reader
* remove debug
* Update tests for 3 parquet readers and small bug fix
* Update logging
* test fixes
* various fixes
* Update configs and fix parameters to GpuParquetScan
* remove unneeded function dbshim
* remove debug log and update configs
* cleanup and debug
* Update configs.md
* cleanup
* create a common function for getting small file opts for fileSourceScan
* Fix extra line and update config text
* Update text
* change to use close on exception
* update configs doc
* Fix missing imports
* Fix import order
* Rework the parquet multi-file configs to have a single configuration and change the way they are passed around for the InputFileName
* make rapidsConf transient
* fix typo
* forward rapidsconf
* update test and fix missed config check
* Add log statement for original per file reader
* Update text and fix test
* add space
* update config.md
* Fix parameter to Spark 3.1.0 parquet scan
* Update sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala (Co-authored-by: Jason Lowe <jlowe@nvidia.com>, four commits)
* Update sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetScan.scala (Co-authored-by: Jason Lowe <jlowe@nvidia.com>)
* Fix scalastyle line length
* Update docs and change tests to copy reader confs
* Update GpuColumnVector.from call to handle MapTypes

Signed-off-by: Thomas Graves <tgraves@apache.org>, <tgraves@nvidia.com>
Co-authored-by: Jason Lowe <jlowe@nvidia.com>
If I use HDFS, would it take the local-filesystem path? But in another issue (#1200) you said I should set spark.rapids.sql.format.parquet.reader.type="MULTITHREADED", so I am confused. Please help me. Also, since I use HDFS, how can I confirm my multi-threaded reading of small parquet files is working well? My parquet files are about 250MB and executor.cores = 16. I tried setting spark.rapids.sql.format.parquet.multiThreadedRead.numThreads = 48, but the query execution time is almost the same.
Please do not ask questions on closed PRs; an issue would be much better. In our testing the coalesce reader works better for local filesystems (i.e. local disks on the same node as the executors). But that assumes the disks are fast enough, and I think the case you ran into is partitioning. So it might not always be the better choice, which is why it's configurable.
Add back in support for the parquet coalesce file reader. We introduced this in 0.2 here dcd119c#diff-6c8f47c497382f147106e8e82020d1828eee055c2f2a7c6d60807262e8b63cf9
It was then replaced with the multi-threaded parquet reader, which worked much better in the cloud. Unfortunately I didn't do enough testing on local filesystems (file: and hdfs:), where the coalesce reader works much better. So I'm adding it back, and we now have three parquet readers: the original per-file reader, the multi-threaded reader which works well in the cloud, and the coalesce reader which works well with local filesystems.
This implementation is very similar to the one we had before (the code was just copied), with the addition that it now copies blocks from a single file to the host memory buffer in background threads. We allocate the buffer, slice it, run the actual copies in parallel in separate threads, wait for them to finish, and then pass that single buffer down to the GPU.
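The allocate-slice-copy-in-parallel scheme described above can be sketched as follows. This is a simplified Python illustration, not the plugin's actual Scala code; the function name, thread count, and use of an in-memory buffer are all assumptions for the example:

```python
from concurrent.futures import ThreadPoolExecutor

def coalesce_blocks(blocks, num_threads=4):
    # Allocate one host buffer large enough to hold all blocks.
    total = sum(len(b) for b in blocks)
    buf = bytearray(total)
    view = memoryview(buf)

    # Pre-compute the destination offset (slice) for each block.
    offsets = []
    pos = 0
    for b in blocks:
        offsets.append(pos)
        pos += len(b)

    def copy(i):
        # Each worker copies one block into its pre-assigned slice;
        # the slices are disjoint, so the copies are independent.
        view[offsets[i]:offsets[i] + len(blocks[i])] = blocks[i]

    # Run the copies in parallel and wait for all of them to finish.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(copy, range(len(blocks))))
    return bytes(buf)
```

Because each thread writes only its own disjoint slice, no locking is needed; the caller gets back a single contiguous buffer, analogous to the one handed down to the GPU.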
Note this required me to put back the checks for inputFileName and input file blocks; if we find the user relying on those, we change the plan to not use the coalescing reader.
There is now one config that selects the reader you want to use. By default it is on auto, where it chooses either the coalescing or the multi-threaded reader based on whether the filesystem scheme is a cloud one.
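The auto selection described above boils down to a scheme check. A minimal Python sketch (the real logic lives in the plugin's Scala code; MULTITHREADED comes from the discussion above, but the other value names and the default scheme set are illustrative assumptions):

```python
def pick_reader(scheme, reader_type="AUTO",
                cloud_schemes=frozenset({"s3a", "s3n", "wasbs", "gs"})):
    # AUTO: cloud filesystem schemes get the multi-threaded reader,
    # local schemes (file, hdfs) get the coalescing reader.
    if reader_type != "AUTO":
        return reader_type  # the user forced a specific reader
    return "MULTITHREADED" if scheme in cloud_schemes else "COALESCING"
```

Setting the reader type explicitly (as suggested for the HDFS question above) simply bypasses the scheme check.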
I hardcoded the filesystem schemes that I knew were cloud-based, and added a config so users can add more: spark.rapids.cloudSchemes. I thought about putting all the schemes in there, but you can always force one of the other two readers via the reader config if you really want a specific one.
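Merging the user-supplied schemes with the hardcoded set might look like the following. This is a hypothetical sketch: only the spark.rapids.cloudSchemes key comes from the PR; the comma-separated value format, helper name, and built-in scheme list are assumptions:

```python
def effective_cloud_schemes(conf, builtin=("s3a", "s3n", "wasbs", "gs")):
    # Assume spark.rapids.cloudSchemes holds a comma-separated list of
    # extra schemes extending the hardcoded cloud filesystem schemes.
    extra = conf.get("spark.rapids.cloudSchemes", "")
    return set(builtin) | {s.strip() for s in extra.split(",") if s.strip()}
```

Any scheme in the resulting set would then route reads through the multi-threaded reader under the auto setting.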