Parquet coalesce file reader for local filesystems #990

Merged · 49 commits · Oct 27, 2020

Commits
dc5ecf5
Add back the small file consolidation for the parquet reader for non-…
tgravescs Oct 1, 2020
7f99fa5
make resolveURI local
tgravescs Oct 1, 2020
ac053d6
debug
tgravescs Oct 1, 2020
aa19029
fix debug
tgravescs Oct 1, 2020
ed90c71
Cleanup
tgravescs Oct 12, 2020
95fa108
rework names
tgravescs Oct 12, 2020
4887bb1
Fix bug in footer position
tgravescs Oct 12, 2020
83ae5a2
Add input file transition logic back and update tests
tgravescs Oct 13, 2020
b346b85
Update configs so can control multi file optimization, multi file read…
tgravescs Oct 13, 2020
81c17a8
remove debug
tgravescs Oct 13, 2020
b2daf51
Update tests for 3 parquet readers and small bug fix
tgravescs Oct 14, 2020
79fdd53
Update logging
tgravescs Oct 14, 2020
060aaae
test fixes
tgravescs Oct 14, 2020
cd3730c
various fixes
tgravescs Oct 14, 2020
6e57beb
Update configs and fix parameters to GpuParquetScan
tgravescs Oct 14, 2020
5bcb6e8
remove unneeded function dbshim
tgravescs Oct 16, 2020
5111876
remove debug log and update configs
tgravescs Oct 19, 2020
455a573
cleanup and debug
tgravescs Oct 19, 2020
cc31a51
Update configs.md
tgravescs Oct 19, 2020
7a6268f
cleanup
tgravescs Oct 20, 2020
588cfb4
create a common function for getting small file opts for fileSourceScan
tgravescs Oct 20, 2020
584cd85
Fix extra line and update config text
tgravescs Oct 20, 2020
6251f5f
Update text
tgravescs Oct 20, 2020
d638ed0
change to use close on exception
tgravescs Oct 20, 2020
f80efb0
update configs doc
tgravescs Oct 20, 2020
eae6a2e
Merge remote-tracking branch 'origin/branch-0.3' into consolidateFile…
tgravescs Oct 21, 2020
27130a6
Fix missing imports
tgravescs Oct 21, 2020
48fb270
Fix import order
tgravescs Oct 21, 2020
de3acd0
Rework the parquet multi-file configs to have a single configuration …
tgravescs Oct 23, 2020
eb4a379
make rapidsConf transient
tgravescs Oct 23, 2020
eaf5ad3
fix typo
tgravescs Oct 23, 2020
5a1322a
forward rapidsconf
tgravescs Oct 23, 2020
2088bb4
update test and fix missed config check
tgravescs Oct 23, 2020
7d1057f
Add log statement for original per file reader
tgravescs Oct 23, 2020
6309069
Update text and fix test
tgravescs Oct 23, 2020
277a2f6
add space
tgravescs Oct 23, 2020
311bf55
update config.md
tgravescs Oct 23, 2020
2aad748
Fix parameter to spark 3.1.0 parquet scan
tgravescs Oct 23, 2020
ee7cd7e
Merge remote-tracking branch 'origin/branch-0.3' into consolidateFile…
tgravescs Oct 23, 2020
735a1c2
Update sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.s…
tgravescs Oct 26, 2020
bf32b89
Update sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.s…
tgravescs Oct 26, 2020
38ccef8
Update sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.s…
tgravescs Oct 26, 2020
7733db6
Update sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.s…
tgravescs Oct 26, 2020
b0ea686
Update sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetSc…
tgravescs Oct 26, 2020
76dc867
Merge remote-tracking branch 'origin/branch-0.3' into consolidateFile…
tgravescs Oct 26, 2020
ef6be79
Fix scalastyle line length
tgravescs Oct 26, 2020
78f9033
Update docs and change tests to copy reader confs
tgravescs Oct 26, 2020
e829623
Merge remote-tracking branch 'origin/branch-0.3' into consolidateFile…
tgravescs Oct 26, 2020
0b12aca
Update GpuColumnVector.from call to handle MapTypes
tgravescs Oct 26, 2020
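
Before the file diff, a rough sketch of the COALESCING strategy this PR adds, going by the commit messages above and the spark.rapids.sql.format.parquet.reader.type description in the diff below: copy the blocks of every file assigned to a task into one host buffer using a pool of threads, then hand that single buffer to the GPU in one transfer. Everything below (object name, byte-level layout, the fixed pool size) is a hypothetical illustration, not the plugin's actual code:

// Hypothetical sketch of the coalescing idea: one host buffer per task,
// filled from many small files in parallel, sent to the GPU once.
import java.nio.file.{Files, Paths}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object CoalescingSketch {
  def main(args: Array[String]): Unit = {
    val files = args.toSeq // the files assigned to this task
    val sizes = files.map(f => Files.size(Paths.get(f)))
    val offsets = sizes.scanLeft(0L)(_ + _) // where each file lands in the buffer

    // One host buffer big enough for all files in the task
    // (toInt is fine only for this small illustration).
    val hostBuffer = new Array[Byte](sizes.sum.toInt)

    // Copy each file into its slice of the buffer on a separate thread,
    // bounded by a fixed pool (cf. ...multiThreadedRead.numThreads below).
    val pool = Executors.newFixedThreadPool(4)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    val copies = files.zip(offsets).map { case (f, off) =>
      Future {
        val bytes = Files.readAllBytes(Paths.get(f))
        System.arraycopy(bytes, 0, hostBuffer, off.toInt, bytes.length)
      }
    }
    Await.result(Future.sequence(copies), Duration.Inf)
    pool.shutdown()

    // In the plugin, this single buffer would now go to the GPU in one
    // transfer instead of one transfer per file.
    println(s"coalesced ${files.size} files into ${hostBuffer.length} bytes")
  }
}

A real parquet implementation also has to rewrite the combined footer so block offsets match their new positions in the merged buffer, which is presumably what the footer-position fix in the commit list above addresses.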
docs/configs.md (7 changes: 4 additions & 3 deletions)
@@ -29,6 +29,7 @@ scala> spark.conf.set("spark.rapids.sql.incompatibleOps.enabled", true)

Name | Description | Default Value
-----|-------------|--------------
<a name="cloudSchemes"></a>spark.rapids.cloudSchemes|Comma separated list of additional URI schemes that are to be considered cloud based filesystems. Schemes already included: dbfs, s3, s3a, s3n, wasbs, gs. Cloud based stores generally would be total separate from the executors and likely have a higher I/O read cost. Many times the cloud filesystems also get better throughput when you have multiple readers in parallel. This is used with spark.rapids.sql.format.parquet.reader.type|None
<a name="memory.gpu.allocFraction"></a>spark.rapids.memory.gpu.allocFraction|The fraction of total GPU memory that should be initially allocated for pooled memory. Extra memory will be allocated as needed, but it may result in more fragmentation. This must be less than or equal to the maximum limit configured via spark.rapids.memory.gpu.maxAllocFraction.|0.9
<a name="memory.gpu.debug"></a>spark.rapids.memory.gpu.debug|Provides a log of GPU memory allocations and frees. If set to STDOUT or STDERR the logging will go there. Setting it to NONE disables logging. All other values are reserved for possible future expansion and in the mean time will disable logging.|NONE
<a name="memory.gpu.maxAllocFraction"></a>spark.rapids.memory.gpu.maxAllocFraction|The fraction of total GPU memory that limits the maximum size of the RMM pool. The value must be greater than or equal to the setting for spark.rapids.memory.gpu.allocFraction. Note that this limit will be reduced by the reserve memory configured in spark.rapids.memory.gpu.reserve.|1.0
@@ -61,10 +62,10 @@ Name | Description | Default Value
<a name="sql.format.orc.read.enabled"></a>spark.rapids.sql.format.orc.read.enabled|When set to false disables orc input acceleration|true
<a name="sql.format.orc.write.enabled"></a>spark.rapids.sql.format.orc.write.enabled|When set to false disables orc output acceleration|true
<a name="sql.format.parquet.enabled"></a>spark.rapids.sql.format.parquet.enabled|When set to false disables all parquet input and output acceleration|true
<a name="sql.format.parquet.multiThreadedRead.enabled"></a>spark.rapids.sql.format.parquet.multiThreadedRead.enabled|When set to true, reads multiple small files within a partition more efficiently by reading each file in a separate thread in parallel on the CPU side before sending to the GPU. Limited by spark.rapids.sql.format.parquet.multiThreadedRead.numThreads and spark.rapids.sql.format.parquet.multiThreadedRead.maxNumFilesParallel|true
<a name="sql.format.parquet.multiThreadedRead.maxNumFilesParallel"></a>spark.rapids.sql.format.parquet.multiThreadedRead.maxNumFilesParallel|A limit on the maximum number of files per task processed in parallel on the CPU side before the file is sent to the GPU. This affects the amount of host memory used when reading the files in parallel.|2147483647
<a name="sql.format.parquet.multiThreadedRead.numThreads"></a>spark.rapids.sql.format.parquet.multiThreadedRead.numThreads|The maximum number of threads, on the executor, to use for reading small parquet files in parallel. This can not be changed at runtime after the executor has started.|20
<a name="sql.format.parquet.multiThreadedRead.maxNumFilesParallel"></a>spark.rapids.sql.format.parquet.multiThreadedRead.maxNumFilesParallel|A limit on the maximum number of files per task processed in parallel on the CPU side before the file is sent to the GPU. This affects the amount of host memory used when reading the files in parallel. Used with MULTITHREADED reader, see spark.rapids.sql.format.parquet.reader.type|2147483647
<a name="sql.format.parquet.multiThreadedRead.numThreads"></a>spark.rapids.sql.format.parquet.multiThreadedRead.numThreads|The maximum number of threads, on the executor, to use for reading small parquet files in parallel. This can not be changed at runtime after the executor has started. Used with COALESCING and MULTITHREADED reader, see spark.rapids.sql.format.parquet.reader.type.|20
<a name="sql.format.parquet.read.enabled"></a>spark.rapids.sql.format.parquet.read.enabled|When set to false disables parquet input acceleration|true
<a name="sql.format.parquet.reader.type"></a>spark.rapids.sql.format.parquet.reader.type|Sets the parquet reader type. We support different types that are optimized for different environments. The original Spark style reader can be selected by setting this to PERFILE which individually reads and copies files to the GPU. Loading many small files individually has high overhead, and using either COALESCING or MULTITHREADED is recommended instead. The COALESCING reader is good when using a local file system where the executors are on the same nodes or close to the nodes the data is being read on. This reader coalesces all the files assigned to a task into a single host buffer before sending it down to the GPU. It copies blocks from a single file into a host buffer in separate threads in parallel, see spark.rapids.sql.format.parquet.multiThreadedRead.numThreads. MULTITHREADED is good for cloud environments where you are reading from a blobstore that is totally separate and likely has a higher I/O read cost. Many times the cloud environments also get better throughput when you have multiple readers in parallel. This reader uses multiple threads to read each file in parallel and each file is sent to the GPU separately. This allows the CPU to keep reading while GPU is also doing work. See spark.rapids.sql.format.parquet.multiThreadedRead.numThreads and spark.rapids.sql.format.parquet.multiThreadedRead.maxNumFilesParallel to control the number of threads and amount of memory used. By default this is set to AUTO so we select the reader we think is best. This will either be the COALESCING or the MULTITHREADED based on whether we think the file is in the cloud. See spark.rapids.cloudSchemes.|AUTO
<a name="sql.format.parquet.write.enabled"></a>spark.rapids.sql.format.parquet.write.enabled|When set to false disables parquet output acceleration|true
<a name="sql.hasNans"></a>spark.rapids.sql.hasNans|Config to indicate if your data has NaN's. Cudf doesn't currently support NaN's properly so you can get corrupt data if you have NaN's in your data and it runs on the GPU.|true
<a name="sql.hashOptimizeSort.enabled"></a>spark.rapids.sql.hashOptimizeSort.enabled|Whether sorts should be inserted after some hashed operations to improve output ordering. This can improve output file sizes when saving to columnar formats.|false
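
To close, a minimal spark-shell style snippet putting the new configs together. The builder boilerplate, the input path, and the values (including the abfss scheme) are illustrative assumptions; only the spark.rapids.* keys and their documented semantics come from the configs.md entries above:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-reader-type-example") // hypothetical app name
  // Force the coalescing reader instead of AUTO, e.g. for a local filesystem
  // where executors are on (or near) the nodes holding the data.
  .config("spark.rapids.sql.format.parquet.reader.type", "COALESCING")
  // Threads used to copy parquet blocks into the host buffer (COALESCING) or
  // to read whole files in parallel (MULTITHREADED); fixed once the executor starts.
  .config("spark.rapids.sql.format.parquet.multiThreadedRead.numThreads", "20")
  // Cap on files buffered per task on the CPU side, to bound host memory use.
  .config("spark.rapids.sql.format.parquet.multiThreadedRead.maxNumFilesParallel", "1000")
  // Treat an extra URI scheme as cloud storage so AUTO would pick MULTITHREADED for it
  // ("abfss" here is just an example scheme not already in the built-in list).
  .config("spark.rapids.cloudSchemes", "abfss")
  .getOrCreate()

val df = spark.read.parquet("/data/small_files") // hypothetical path with many small files
println(df.count())

Under the default AUTO setting none of this is required: per the reader.type entry above, the plugin picks COALESCING or MULTITHREADED on its own, based on whether the file's URI scheme is in the cloud list.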