
Accelerate the coalescing parquet reader when reading files from multiple partitioned folders #1401

Merged: 13 commits, Dec 17, 2020

Conversation

tgravescs (Collaborator):

Fixes #1200.

Accelerate the scan speed for coalescing parquet reader when reading files from multiple partitioned folders.

Previously, whenever we hit a file that was in a different partition we split the batch so we could easily add the partition values. This produces many small batches when there are many partitioned files, which is bad for performance.
To fix that, we now combine files with different partitioning by keeping track of the partition values and which rows those values apply to. After reading the files, we add the columns built from those partition values and row counts. This works because we read the files in the same order as when we constructed what goes into each batch.
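The bookkeeping described above can be sketched in Python (the real implementation is Scala in the spark-rapids reader; the function name and data shapes here are illustrative assumptions, not the actual API):

```python
# Illustrative sketch: instead of starting a new batch whenever the
# partition values change, record (partition_values, row_count) pairs
# in file-read order while planning a coalesced batch.
def plan_coalesced_batch(files):
    """files: list of (partition_values_tuple, row_count) in read order.
    Returns the total row count plus the ordered (partition_values,
    row_count) pairs needed to build the partition columns after the read."""
    pairs = []
    total_rows = 0
    for part_values, rows in files:
        if pairs and pairs[-1][0] == part_values:
            # Same partition as the previous file: just extend the row count.
            pairs[-1] = (part_values, pairs[-1][1] + rows)
        else:
            pairs.append((part_values, rows))
        total_rows += rows
    return total_rows, pairs
```

Because the files are later read in this same order, the recorded row counts line up exactly with the rows of the coalesced batch.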

In this PR I added tracking of the partition values and the corresponding number of rows; after we read the files into the columnar batch, we add all the partition values. The partition value columns are constructed one partition column at a time: for each column, it generates the individual partition value columns for the required number of rows, concatenates them all together, and then moves on to the next partition column. That is, if you have paths with multiple partition keys such as ../key1=2/key2=foo/, it builds key1=X for all values of X first, then key2=Y afterwards.
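The one-column-at-a-time build order can be illustrated with a small Python sketch, using plain lists in place of GPU column vectors (a simplified assumption, not the spark-rapids code):

```python
# Illustrative sketch of the column-build step: for each partition key
# (key1 first, then key2, ...), repeat each file group's value for its
# row count and concatenate the pieces into one full-length column.
def build_partition_columns(pairs, num_keys):
    """pairs: ordered (partition_values_tuple, row_count) from planning.
    num_keys: number of partition columns (the partition depth).
    Returns one list per partition key, each as long as the batch."""
    columns = []
    for key_idx in range(num_keys):
        pieces = [[values[key_idx]] * rows for values, rows in pairs]
        # Concatenate all pieces for this key before moving to the next key.
        columns.append([v for piece in pieces for v in piece])
    return columns
```

For paths like ../key1=2/key2=foo/, this yields the full key1 column first, then the full key2 column, matching the order described above.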

@tgravescs tgravescs added the feature request New feature or request label Dec 15, 2020
@tgravescs tgravescs added this to the Dec 7 - Dec 18 milestone Dec 15, 2020
@tgravescs tgravescs self-assigned this Dec 15, 2020
@tgravescs (Collaborator Author): build

@github-actions

👎 Promotion blocked, new vulnerability found

Vulnerability report

Component: Guava: Google Core Libraries for Java
Vulnerability: CVE-2020-8908
Severity: LOW
Description: A temp directory creation vulnerability exists in Guava versions prior to 30.0, allowing an attacker with access to the machine to potentially access data in a temporary directory created by com.google.common.io.Files.createTempDir(). The permissions granted to the directory default to the standard unix-like /tmp ones, leaving the files open. We recommend updating Guava to version 30.0 or later, updating to Java 7 or later, or explicitly changing the permissions after creating the directory if neither is possible.

@@ -77,19 +77,29 @@ object ColumnarPartitionReaderWithPartitionValues extends Arm {
var partitionColumns: Array[GpuColumnVector] = null
try {
Member


The original code would always close fileBatch, but this now leaks it if something throws before we get to addGpuColumnVectorsToBatch (e.g., buildPartitionColumns).
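The leak the reviewer describes is the classic close-on-exception problem; per the later commit message, the PR switches to withResource/closeOnExcept. A minimal Python sketch of the closeOnExcept idea, with a hypothetical resource type standing in for the batch:

```python
# Sketch of the closeOnExcept pattern: close the resource only if an
# exception escapes before ownership is handed off; on success, the
# caller (or the returned value) stays responsible for closing it.
def close_on_except(resource, body):
    try:
        return body(resource)
    except BaseException:
        resource.close()  # avoid leaking on the error path
        raise

class FakeBatch:
    """Hypothetical stand-in for a columnar batch with a close() method."""
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True
```

This keeps the batch open on the success path (where the partition columns are added to it) while guaranteeing cleanup if an intermediate step like building the partition columns throws.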

Collaborator Author


I think I'm handling this now, let me know if you see anything I missed.

@pxLi (Collaborator) commented Dec 16, 2020:

build

1 similar comment

@tgravescs (Collaborator Author): Thanks Jason, updated.

@tgravescs (Collaborator Author): build

jlowe previously approved these changes Dec 17, 2020

@jlowe (Member) left a comment:


Just a small nit for what appears to be an unnecessary null check, but otherwise looks good to me.

@tgravescs (Collaborator Author): build

@tgravescs tgravescs merged commit 801ba80 into NVIDIA:branch-0.4 Dec 17, 2020
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
…iple partitioned folders (NVIDIA#1401)

* Accelerate the coalescing parquet reader when reading files from multiple partitioned folders
Signed-off-by: Thomas Graves <tgraves@nvidia.com>

* Properly close and change to use withResource and closeOnExcept

* remove null check
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023
…IDIA#1401)

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
3 participants