forked from NVIDIA/spark-rapids
Commit
Parquet small file reading optimization (NVIDIA#595)
* Initial prototype small files parquet
* Change datasource v1 to use small files
* Working but has 72 bytes off in size
* Copy filesourcescan to databricks and fix merge error
* Fix databricks package name
* Try to debug size calculation - adds lots of warnings
* Cleanup and have file source scan small files only work for parquet
* Switch to use ArrayBuffer so order correct
* debug
* Fix order issue
* add more to calculated size
* cleanup
* Try to handle partition values
* fix passing partitionValues
* refactor
* disable mergeschema
* add check for mergeSchema
* Add tests for both small file optimization on and off
* handle input file - but doesn't totally work
* remove extra values reader
* Fixes
* Debug
* Check to see if Inputfile execs used
* Finding InputFileName works
* finding input file working
* cleanup and add tests for V2 datasource
* Add check for input file to GpuParquetScan
* Add more tests
* Add GPU metrics to GpuFileSourceScanExec
  Signed-off-by: Jason Lowe <jlowe@nvidia.com>
* remove log messages
* Docs
* cleanup
* Update 300db and 310 FileSourceScanExecs passing unit tests
* Add test for bucketing
* Add in logic for datetime corrected rebase mode
* Commonize some code
* Cleanup
* fixes
* Extract GpuFileSourceScanExec from shims
  Signed-off-by: Jason Lowe <jlowe@nvidia.com>
* Add more tests
* comments
* update test
* Pass metrics via GPU file format rather than custom options map
  Signed-off-by: Jason Lowe <jlowe@nvidia.com>
* working
* pass schema around properly
* fix value from tuple
* Rename case class
* Update tests
* Update code checking for DataSourceScanExec
  Signed-off-by: Jason Lowe <jlowe@nvidia.com>
* Fix scaladoc warning and unused imports
  Signed-off-by: Jason Lowe <jlowe@nvidia.com>
* Add realloc if over memory size
* refactor memory checks
* Fix copyright
  Signed-off-by: Jason Lowe <jlowe@nvidia.com>
* Upmerge to latest FileSourceScanExec changes for metrics
* Add missing check Filesource scan mergeSchema and cleanup
* Cleanup
* remove bucket test for now
* formatting
* Fixes
* Add more tests
* Merge conflict
  Signed-off-by: Thomas Graves <tgraves@nvidia.com>
* Fix merge conflict
  Signed-off-by: Thomas Graves <tgraves@nvidia.com>
* enable parquet bucket tests and change warning
  Signed-off-by: Thomas Graves <tgraves@nvidia.com>
* cleanup
  Signed-off-by: Thomas Graves <tgraves@nvidia.com>
* remove debug logs
  Signed-off-by: Thomas Graves <tgraves@nvidia.com>
* Move FilePartition creation to shim
  Signed-off-by: Thomas Graves <tgraves@apache.org>
* Add better message for mergeSchema
  Signed-off-by: Thomas Graves <tgraves@apache.org>
* Address review comments. Add in withResources and closeOnExcept and minor things.
  Signed-off-by: Thomas Graves <tgraves@nvidia.com>
* Fix spacing
  Signed-off-by: Thomas Graves <tgraves@nvidia.com>
* Fix databricks support and passing arguments
  Signed-off-by: Thomas Graves <tgraves@nvidia.com>
* fix typo in db
  Signed-off-by: Thomas Graves <tgraves@nvidia.com>
* Update config description
  Signed-off-by: Thomas Graves <tgraves@nvidia.com>
* Rework
  Signed-off-by: Thomas Graves <tgraves@nvidia.com>

Co-authored-by: Thomas Graves <tgraves@nvidia.com>
Co-authored-by: Jason Lowe <jlowe@nvidia.com>
1 parent 05acc5f, commit 7ac919b
Showing 18 changed files with 1,094 additions and 280 deletions.