
Use multi-threaded parquet read with small files #677

Merged
tgravescs merged 8 commits into NVIDIA:branch-0.2 on Sep 9, 2020

Conversation

tgravescs
Collaborator

@tgravescs tgravescs commented Sep 8, 2020

closes #627
closes #608

This changes the parquet reader small file improvements to use a multi-threaded read instead of aggregating the files together. I changed the config names to match this implementation. There are three configs: one to turn the feature on and off, one to control the number of threads used per executor (the pool is shared across all tasks), and one to control the maximum number of files processed in parallel per task before the data is copied to the GPU. The last config gives you some control over host memory usage if you are limited.
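As a rough illustration, the three configs might be set on a Spark session as shown below. The spark.rapids.sql.format.parquet.multiThreadedRead.* names follow the pattern spark-rapids used around this release, and the values are assumptions, not tuned recommendations.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("multi-threaded-parquet-read")
  // turn the multi-threaded small-file reader on or off
  .config("spark.rapids.sql.format.parquet.multiThreadedRead.enabled", "true")
  // threads per executor, shared across all tasks; fixed once the executor starts
  .config("spark.rapids.sql.format.parquet.multiThreadedRead.numThreads", "20")
  // cap on files buffered in host memory per task before copying to the GPU
  .config("spark.rapids.sql.format.parquet.multiThreadedRead.maxNumFilesParallel", "64")
  .getOrCreate()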

Essentially what happens now is that we do everything for each file in a separate thread. We get the list of files for each task and launch multiple threads in parallel, and each thread does everything from reading the footer, to filtering the blocks down, to copying the data into host memory buffers. This makes it pretty straightforward to see what is going on, and it allows us to support the input_file_name and mergeSchema options that we didn't before.
The code launches the files to be processed in the order in which they come in, to match the CPU side, and they run in parallel up to the number of free threads.
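A minimal sketch of this per-file pipeline, not the actual spark-rapids implementation: readFooter, filterBlocks, and copyBlocksToHost are hypothetical stand-ins for the real reader internals.

import java.util.concurrent.{Callable, Executors, Future}

case class HostBuffer(path: String, data: Array[Byte])

// Hypothetical stand-ins for the real footer read, block filter, and host copy.
def readFooter(path: String): String = s"footer:$path"
def filterBlocks(footer: String): Seq[Int] = Seq(0)
def copyBlocksToHost(path: String, blocks: Seq[Int]): HostBuffer =
  HostBuffer(path, Array.emptyByteArray)

// One pool per executor, sized by the numThreads config and shared by all tasks.
val pool = Executors.newFixedThreadPool(20)

// Each thread runs the whole pipeline for one file: footer, filter, host copy.
def processFile(path: String): HostBuffer = {
  val footer = readFooter(path)
  val blocks = filterBlocks(footer)
  copyBlocksToHost(path, blocks)
}

// Submit files in their incoming order to match the CPU side, then collect.
def readAll(paths: Seq[String]): Seq[HostBuffer] = {
  val futures: Seq[Future[HostBuffer]] =
    paths.map(p => pool.submit(new Callable[HostBuffer] {
      override def call(): HostBuffer = processFile(p)
    }))
  futures.map(_.get()) // get() rethrows any exception from the worker thread
}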

I tested on two queries, both with partitioning, and this showed great improvement: one went from about 11 minutes down to 4.5 minutes, the other from 7-8 minutes down to 2.4 minutes. Those runs used executors with 4 or 6 cores each (on Databricks, so standalone mode), and 20 threads worked very well there. The downside is that if you have executors with lots of cores, 20 might not be enough.

I also tested with consolidated files on the first query without partitioning, where each task only got a part of a file, and the results were the same as or slightly better than the previous implementation.

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

Signed-off-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
@tgravescs
Collaborator Author

build

"parquet files in parallel. This can not be changed at runtime after the executor has" +
"started.")
.integerConf
.createWithDefault(20)
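For context, a hedged reconstruction of the full config definition this diff fragment belongs to, following the RapidsConf builder pattern; the val name and exact doc text are assumptions based on the snippet and the PR description.

val PARQUET_MULTITHREAD_READ_NUM_THREADS =
  conf("spark.rapids.sql.format.parquet.multiThreadedRead.numThreads")
    .doc("The maximum number of threads, on the executor, used to read small " +
      "parquet files in parallel. This cannot be changed at runtime after the executor has " +
      "started.")
    .integerConf
    .createWithDefault(20)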
Collaborator Author

This worked well on Databricks with hosts with 4 to 6 cores. If we have executors with lots of cores this might not be ideal. I could try to adjust this based on executor cores, but with standalone mode I don't know what that is. Any thoughts on this? I figure this is a good start and we can adjust later.
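A sketch of the kind of sizing heuristic being discussed, assuming a hypothetical defaultReaderThreads helper and a 2x oversubscription factor; spark.executor.cores may be unset in standalone mode, which is exactly the problem raised above.

import org.apache.spark.SparkConf

// Hypothetical heuristic: derive the thread count from executor cores when
// known, otherwise fall back to the fixed default of 20.
def defaultReaderThreads(conf: SparkConf, fallback: Int = 20): Int = {
  conf.getOption("spark.executor.cores")           // unset in standalone mode
    .flatMap(c => scala.util.Try(c.toInt).toOption)
    .map(cores => math.max(cores * 2, fallback))   // assumed 2x oversubscription
    .getOrElse(fallback)
}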

Collaborator

+1 for putting something in that we can adjust later on if needed.

@sameerz sameerz added the performance A performance related task/issue label Sep 8, 2020
@tgravescs tgravescs added this to the Aug 31 - Sep 11 milestone Sep 8, 2020
revans2
revans2 previously approved these changes Sep 8, 2020
Collaborator

@revans2 revans2 left a comment


Just a question.

Signed-off-by: Thomas Graves <tgraves@nvidia.com>
@tgravescs
Collaborator Author

build

revans2
revans2 previously approved these changes Sep 8, 2020
Signed-off-by: Thomas Graves <tgraves@apache.org>
Signed-off-by: Thomas Graves <tgraves@apache.org>
@jlowe
Member

jlowe commented Sep 8, 2020

build

@tgravescs tgravescs merged commit be48350 into NVIDIA:branch-0.2 Sep 9, 2020
@tgravescs tgravescs deleted the multithreadparquetReadrebase branch September 9, 2020 00:46
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* Try multi-threaded read with parquet with small files

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

* cleanup and comments

* comment config

Signed-off-by: Thomas Graves <tgraves@apache.org>

* Add note about TaskContext not being set in the threadpool

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

* remove extra import and use closeOnExcept

* try just using future throw

* let future throw

Signed-off-by: Thomas Graves <tgraves@apache.org>

* Use safeclose()

Signed-off-by: Thomas Graves <tgraves@apache.org>

Co-authored-by: Thomas Graves <tgraves@nvidia.com>
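The closeOnExcept mentioned in these commits is a spark-rapids resource-management helper; a minimal sketch of the idea, assuming this simplified signature rather than the project's actual Arm trait:

// Close the resource only if the body throws; on success the caller keeps
// ownership and closes it later. Sketch only, not the spark-rapids source.
def closeOnExcept[T <: AutoCloseable, V](resource: T)(body: T => V): V = {
  try {
    body(resource)
  } catch {
    case t: Throwable =>
      resource.close()
      throw t
  }
}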
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023
…IDIA#677)

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
Labels
performance A performance related task/issue