
Parquet coalesce file reader for local filesystems #990

Merged
merged 49 commits into from
Oct 27, 2020

Conversation

@tgravescs tgravescs commented Oct 20, 2020

Add back support for the parquet coalesce file reader. We introduced this in 0.2 here dcd119c#diff-6c8f47c497382f147106e8e82020d1828eee055c2f2a7c6d60807262e8b63cf9
It was then replaced with the multi-threaded parquet reader, which worked much better in the cloud. Unfortunately I didn't do enough testing on local filesystems (file: and hdfs:), where the coalesce reader works much better. So I'm adding it back, and we now have 3 parquet readers: the original per-file reader, the multi-threaded reader that works well with cloud filesystems, and the coalesce reader that works well with local filesystems.

This implementation is very similar to the one we had before (the code was copied), with the addition that it now copies the blocks from a single file to the host memory buffer in background threads. We allocate the buffer, slice it, run the actual copies in parallel in separate threads, wait for them to finish, and then pass that single buffer down to the GPU.
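The slice-and-copy-in-parallel step described above can be sketched as follows. This is a simplified illustration, not the plugin's actual code: `Block`, `coalesce`, and the use of `ByteBuffer` as a stand-in for the host memory buffer are all assumptions made for the example.

```scala
import java.nio.ByteBuffer
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object CoalesceCopySketch {
  // Hypothetical stand-in for a parquet block: where it lands in the
  // coalesced buffer, and the bytes read from the source file.
  final case class Block(destOffset: Int, data: Array[Byte])

  // Allocate one output buffer, "slice" it per block, copy each block on a
  // background thread, and wait for all copies before returning the buffer.
  def coalesce(blocks: Seq[Block], totalSize: Int, numThreads: Int): ByteBuffer = {
    val out = ByteBuffer.allocate(totalSize) // stands in for the host memory buffer
    val pool = Executors.newFixedThreadPool(numThreads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    val copies = blocks.map { b =>
      Future {
        // Each task writes through its own view of the buffer at its own
        // offset, so the copies need no synchronization between them.
        val slice = out.duplicate()
        slice.position(b.destOffset)
        slice.put(b.data)
      }
    }
    // Block until every copy is done before the buffer is handed to the GPU.
    Await.result(Future.sequence(copies), Duration.Inf)
    pool.shutdown()
    out
  }
}
```

The key property this relies on is that the destination ranges are disjoint, so the background threads never write overlapping bytes.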

Note this required me to put back the checks for inputFileName and blocks; if we find the user relying on those, we fall back to not using the coalescing reader.

There is now one config to select the reader you want to use. By default it's on auto, where it chooses either the coalescing or the multi-threaded reader based on whether the filesystem scheme is a cloud one.

I hardcoded the filesystem schemes I knew were cloud-based, and added a config so users can add more: spark.rapids.cloudSchemes. I thought about putting all the schemes in there, but you can instead turn off one of the other 2 readers via config if you really want to force a particular one.
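As a hedged sketch of how this might be set, the fragment below uses the config names that appear in this PR's discussion (`spark.rapids.sql.format.parquet.reader.type` and `spark.rapids.cloudSchemes`); the exact value names other than AUTO and MULTITHREADED, and the example schemes, are assumptions.

```shell
# AUTO (the default) picks the coalescing reader for local filesystems and the
# multi-threaded reader for cloud schemes; set it explicitly to force a reader.
# cloudSchemes appends extra schemes to treat as cloud-based (example values).
spark-submit \
  --conf spark.rapids.sql.format.parquet.reader.type=AUTO \
  --conf spark.rapids.cloudSchemes=s3a,wasbs \
  ...
```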

tgravescs and others added 25 commits October 1, 2020 15:55
…cloud environments

Signed-off-by: Thomas Graves <tgraves@apache.org>
@tgravescs tgravescs self-assigned this Oct 20, 2020
@tgravescs tgravescs added the performance A performance related task/issue label Oct 20, 2020
@tgravescs tgravescs added this to the Oct 12 - Oct 23 milestone Oct 20, 2020
@tgravescs

build

jlowe
jlowe previously approved these changes Oct 26, 2020
@tgravescs

build

@tgravescs tgravescs dismissed stale reviews from jlowe and revans2 via 78f9033 October 26, 2020 19:21
@tgravescs

build

jlowe
jlowe previously approved these changes Oct 26, 2020
revans2
revans2 previously approved these changes Oct 26, 2020
@tgravescs tgravescs dismissed stale reviews from revans2 and jlowe via 0b12aca October 26, 2020 20:50
@tgravescs

build

@tgravescs tgravescs merged commit 4c451ab into NVIDIA:branch-0.3 Oct 27, 2020
@tgravescs tgravescs deleted the consolidateFilesWithCloud branch October 27, 2020 13:21
sperlingxx pushed a commit to sperlingxx/spark-rapids that referenced this pull request Nov 20, 2020
* Add back the small file consolidation for the parquet reader for non-cloud environments

Signed-off-by: Thomas Graves <tgraves@apache.org>

* make resolveURI local

Signed-off-by: Thomas Graves <tgraves@apache.org>

* debug

* fix debug

* Cleanup

* rework names

* Fix bug in footer position

* Add input file transition logic back and update tests

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

* Update configs so can control multi file optimization, multi file read, and coalesce reader

* remove debug

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

* Update tests for 3 parquet readers and small bug fix

* Update logging

* test fixes

* various fixes

* Update configs and fix parameters to GpuParquetScan

Signed-off-by: Thomas Graves <tgraves@apache.org>

* remove unneeded function dbshim

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

* remove debug log and update configs

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

* cleanup and debug

* Update configs.md

* cleanup

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

* create a common function for getting small file opts for fileSourceScan

* Fix extra line and update config text

* Update text

* change to use close on exception

Signed-off-by: Thomas Graves <tgraves@apache.org>

* update configs doc

* Fix missing imports

* Fix import order

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

* Rework the parquet multi-file configs to have a single configuration and change the way they are passed around
for the InputFileName

Signed-off-by: Thomas Graves <tgraves@apache.org>

* make rapidsConf transient

Signed-off-by: Thomas Graves <tgraves@apache.org>

* fix typo

Signed-off-by: Thomas Graves <tgraves@apache.org>

* forward rapidsconf

Signed-off-by: Thomas Graves <tgraves@apache.org>

* update test and fix missed config check

* Add log statement for original per file reader

* Update text and fix test

* add space

Signed-off-by: Thomas Graves <tgraves@apache.org>

* update config.md

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

* Fix parameter to spark 3.1.0 parquet scan

* Update sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetScan.scala

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Fix scalastyle line length

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

* Update docs and change tests to copy reader confs

* Update GpuColumnVector.from call to handle MapTypes

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
@chenrui17
chenrui17 commented Nov 27, 2020

It was then replaced with the multi-threaded parquet reader which worked much better in the cloud. Unfortunately I didn't do enough testing on local file systems (file: and hdfs:).

@tgravescs

If I use HDFS, does it go down the local filesystem path? But in another issue, #1200, you said I should set spark.rapids.sql.format.parquet.reader.type="MULTITHREADED". I am confused; please help me.

By the way, with HDFS, how do I confirm my multi-threaded read of small parquet files is working well? My parquet files are about 250MB and executor.cores = 16. I tried setting spark.rapids.sql.format.parquet.multiThreadedRead.numThreads = 48, but the query execution time is almost the same.

@tgravescs

Please do not ask questions on closed PRs; an issue would be much better. In our testing the coalesce reader works better for local filesystems (i.e. local disks on the same node as the executors). But that assumes the disks are fast enough, and I think the case you ran into involves partitioning. So it might not always hold, which is why it's configurable.

nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023
…IDIA#990)

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
Labels
performance A performance related task/issue
5 participants