
Accelerate the coalescing parquet reader when reading files from multiple partitioned folders #1401

Merged: 13 commits, Dec 17, 2020

Conversation

tgravescs (Collaborator):

Fixes #1200.

Accelerate the scan speed for coalescing parquet reader when reading files from multiple partitioned folders.

Previously, whenever we hit a file that was in a different partition we split the batch so we could easily add the partition values. This produces many small batches when there are many partitioned files, which is bad for performance.
To fix that, we now combine files with different partitioning by keeping track of the partition values and which rows those values apply to. After reading the files, we add the columns built from those partition values and row counts. This works because we read the files in the same order as when we constructed what goes into each batch.
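The bookkeeping described above can be sketched in Python (the real implementation is Scala in the spark-rapids reader; the function name and data shapes here are illustrative assumptions, not the actual API):

```python
# Illustrative sketch: instead of starting a new batch whenever the
# partition values change, record (partition_values, row_count) pairs
# in file-read order while planning a coalesced batch.
def plan_coalesced_batch(files):
    """files: list of (partition_values_tuple, row_count) in read order.
    Returns the total row count plus the ordered (partition_values,
    row_count) pairs needed to build the partition columns after the read."""
    pairs = []
    total_rows = 0
    for part_values, rows in files:
        if pairs and pairs[-1][0] == part_values:
            # Same partition as the previous file: just extend the row count.
            pairs[-1] = (part_values, pairs[-1][1] + rows)
        else:
            pairs.append((part_values, rows))
        total_rows += rows
    return total_rows, pairs
```

Because the files are later read in this same order, the recorded row counts line up exactly with the rows of the coalesced batch.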

In this PR I added tracking of the partition values and the corresponding number of rows; after we read the files into the columnar batch, we add all the partition values. The partition value columns are constructed one partition column at a time: for each column, it generates the individual partition value columns for the required number of rows, concatenates them all together, and then moves on to the next partition column. That is, if you have paths with multiple partition keys such as ../key1=2/key2=foo/, it builds key1=X for all values of X first, then key2=Y afterwards.
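The one-column-at-a-time build order can be illustrated with a small Python sketch, using plain lists in place of GPU column vectors (a simplified assumption, not the spark-rapids code):

```python
# Illustrative sketch of the column-build step: for each partition key
# (key1 first, then key2, ...), repeat each file group's value for its
# row count and concatenate the pieces into one full-length column.
def build_partition_columns(pairs, num_keys):
    """pairs: ordered (partition_values_tuple, row_count) from planning.
    num_keys: number of partition columns (the partition depth).
    Returns one list per partition key, each as long as the batch."""
    columns = []
    for key_idx in range(num_keys):
        pieces = [[values[key_idx]] * rows for values, rows in pairs]
        # Concatenate all pieces for this key before moving to the next key.
        columns.append([v for piece in pieces for v in piece])
    return columns
```

For paths like ../key1=2/key2=foo/, this yields the full key1 column first, then the full key2 column, matching the order described above.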

@tgravescs tgravescs added the feature request New feature or request label Dec 15, 2020
@tgravescs tgravescs added this to the Dec 7 - Dec 18 milestone Dec 15, 2020
@tgravescs tgravescs self-assigned this Dec 15, 2020
@tgravescs (Collaborator Author): build

@github-actions

👎 Promotion blocked, new vulnerability found

Vulnerability report

Component: Guava: Google Core Libraries for Java
Vulnerability: CVE-2020-8908
Severity: LOW
Description: A temp directory creation vulnerability exists in Guava versions prior to 30.0, allowing an attacker with access to the machine to potentially access data in a temporary directory created by com.google.common.io.Files.createTempDir(). The permissions granted to the directory default to the standard unix-like /tmp ones, leaving the files open. We recommend updating Guava to version 30.0 or later, updating to Java 7 or later, or explicitly changing the permissions after creating the directory if neither is possible.

@@ -77,19 +77,29 @@ object ColumnarPartitionReaderWithPartitionValues extends Arm {
var partitionColumns: Array[GpuColumnVector] = null
try {
Member


The original code would always close fileBatch, but this now leaks it if something throws before we get to addGpuColumnVectorsToBatch (e.g., buildPartitionColumns).
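The leak the reviewer describes is the classic close-on-exception problem; per the later commit message, the PR switches to withResource/closeOnExcept. A minimal Python sketch of the closeOnExcept idea, with a hypothetical resource type standing in for the batch:

```python
# Sketch of the closeOnExcept pattern: close the resource only if an
# exception escapes before ownership is handed off; on success, the
# caller (or the returned value) stays responsible for closing it.
def close_on_except(resource, body):
    try:
        return body(resource)
    except BaseException:
        resource.close()  # avoid leaking on the error path
        raise

class FakeBatch:
    """Hypothetical stand-in for a columnar batch with a close() method."""
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True
```

This keeps the batch open on the success path (where the partition columns are added to it) while guaranteeing cleanup if an intermediate step like building the partition columns throws.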

Collaborator Author


I think I'm handling this now, let me know if you see anything I missed.

@pxLi (Collaborator) commented Dec 16, 2020:

build

1 similar comment

@tgravescs (Collaborator Author): Thanks Jason, updated.

@tgravescs (Collaborator Author): build

jlowe previously approved these changes Dec 17, 2020

@jlowe (Member) left a comment:


Just a small nit for what appears to be an unnecessary null check, but otherwise looks good to me.

@tgravescs (Collaborator Author): build

@tgravescs tgravescs merged commit 801ba80 into NVIDIA:branch-0.4 Dec 17, 2020
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
…iple partitioned folders (NVIDIA#1401)

* Accelerate the coalescing parquet reader when reading files from multiple partitioned folders
Signed-off-by: Thomas Graves <tgraves@nvidia.com>

* Properly close and change to use withResource and closeOnExcept

* remove null check
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023
…IDIA#1401)

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
3 participants