
[WIP] POC add data total size information to Parquet file metadata #12974

Closed
wants to merge 58 commits

Conversation

@etseidl etseidl (Contributor) commented Mar 17, 2023

Description

Parquet files that do not fit entirely within RAM must be read in chunks. But determining the ideal chunk size is made difficult by the fact that the Parquet file metadata does not give any indication of the fully decompressed/decoded data sizes. #11867 is a partial fix for this, but requires reading the entire file into RAM and decompressing all pages to be able to compute sensible batch sizes. Reading very large files will still result in out-of-memory errors.

This PR takes a different tack. During the file write, page size information is gathered from the input table and saved in the Parquet file footer, immediately preceding the page indexes. The size and location of this metadata are saved in key/value pairs contained in the column chunk metadata. Because the custom metadata is stored this way, other Parquet readers can still consume these files without noticing the information is there. This WIP also modifies the chunked reader so that it can use this metadata to compute splits without having to read any page data, while still functioning if the metadata is not present.
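To make the mechanism concrete, below is a minimal, hypothetical sketch in C++ (libcudf's language) of the kind of bookkeeping involved: per-page decoded sizes collected at write time, and key/value entries that point a cooperating reader at the size block while remaining invisible to other readers. The struct layout, key names (`cudf.page_size_info.*`), and function names here are illustrative assumptions, not the PR's actual serialization.

```cpp
#include <cstdint>
#include <map>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical per-page record gathered while encoding a column chunk.
struct page_size_info {
  uint32_t num_rows;       // rows in the page
  uint64_t decoded_bytes;  // size after decompression and decoding
};

// Hypothetical location of the size block written just before the page indexes.
struct size_block_location {
  uint64_t offset;  // absolute file offset of the size block
  uint32_t length;  // length of the size block in bytes
};

// Writer side: encode the location as key/value pairs that a standard
// Parquet reader will simply carry along and ignore.
std::map<std::string, std::string> to_key_value(size_block_location loc)
{
  return {{"cudf.page_size_info.offset", std::to_string(loc.offset)},
          {"cudf.page_size_info.length", std::to_string(loc.length)}};
}

// Reader side: only take the fast path when the keys are present; otherwise
// fall back to decompressing pages to discover sizes.
bool has_size_metadata(std::map<std::string, std::string> const& kv)
{
  return kv.count("cudf.page_size_info.offset") > 0 &&
         kv.count("cudf.page_size_info.length") > 0;
}

// With the size block decoded, a chunk's in-memory footprint is just a sum,
// which is what lets the reader compute splits without touching page data.
uint64_t total_decoded_bytes(std::vector<page_size_info> const& pages)
{
  return std::accumulate(pages.begin(), pages.end(), uint64_t{0},
                         [](uint64_t acc, page_size_info const& p) {
                           return acc + p.decoded_bytes;
                         });
}
```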

I'm hoping this PR will kick off a discussion of how to handle Parquet data sets that do not fit in RAM, with this being one possible solution.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rapids-bot rapids-bot bot commented Mar 17, 2023

Pull requests from external contributors require approval from a rapidsai organization member with write or admin permissions before CI can begin.

@github-actions github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code) label Mar 17, 2023
@revans2 revans2 (Contributor) commented Mar 20, 2023

I'm hoping this PR will kick off a discussion of how to handle Parquet data sets that do not fit in RAM, with this being one possible solution.

From the Spark side of things, Spark has a config, spark.sql.files.maxPartitionBytes. This is a generic config, but for Parquet it splits large files into ranges of at most that many bytes and assigns a task to each of these partitions. Any row group whose midpoint falls into that file range is then fully processed by that task.
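For illustration, here is a small sketch of that assignment rule, written in C++ rather than Spark's Scala to match the other code in this thread; the struct and function names are hypothetical. A task owns a byte range carved out by spark.sql.files.maxPartitionBytes and processes every row group whose midpoint lands inside that range.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of where a row group sits in the file.
struct row_group_span {
  uint64_t offset;  // byte offset of the row group in the file
  uint64_t size;    // compressed size of the row group in bytes
};

// Return the indices of the row groups a task owning [range_start, range_end)
// will fully process: those whose midpoint falls inside the range.
std::vector<std::size_t> row_groups_for_task(std::vector<row_group_span> const& groups,
                                             uint64_t range_start,
                                             uint64_t range_end)
{
  std::vector<std::size_t> owned;
  for (std::size_t i = 0; i < groups.size(); ++i) {
    uint64_t mid = groups[i].offset + groups[i].size / 2;
    if (mid >= range_start && mid < range_end) { owned.push_back(i); }
  }
  return owned;
}
```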

This was originally designed to work with HDFS, which stores the data in blocks; if the partition size matches the HDFS block size, there is a higher probability that the computation can move to the data and get locality. In practice, most data today is read from blob stores like S3, which is almost always a remote read.

In addition to this downside, there is also the problem that it does not take column pruning or predicate pushdown into account in any way. Nor does it take into account the compression of the file.

Because of this, when running with the RAPIDS Accelerator for Apache Spark, we recommend that users set this number quite a bit larger (typically 2 GiB). This is to make sure that the GPU gets enough data to actually keep it busy while reading the files, and we added the chunked reader feature to try and handle the cases when the data sizes really explode. For the most part it has worked for us, but we still run out of memory if the data explodes in size.

It would be great if we had a way to better optimize the reads so that we stayed within a target memory budget, but at the same time we need to avoid crashing on files that were not written by CUDF.

I am going to have to dig into this proposal more before I can really comment further on it, or say whether there is other information that would be good to have stored with it.

For your information, we don't use CUDF to access files directly. Instead, we download the parts of the files ourselves, piece them back together into a single Parquet file, and hand it to CUDF as a buffer to parse. This is because the file reader API in Spark is pluggable, so we cannot know exactly what is happening, especially with security tokens, transparent encryption, and transparent caching, without jumping through a number of hoops. It also allows us to play games with things like combining small files together into larger files to speed up parsing. So if we wanted to use this feature, we would likely need a way to parse the metadata separately from parsing the file and a way to write it back out again.

@etseidl etseidl changed the base branch from branch-23.04 to branch-23.08 June 30, 2023 00:18
@github-actions github-actions bot removed the Python (Affects Python cuDF API), CMake (CMake build issue), conda, and Java (Affects Java cuDF API) labels Jun 30, 2023
@m29498 m29498 commented Jun 30, 2023

We have use cases where the approach in this PR would prove very useful. When we need to load multiple Parquet files at once to be processed and merged into one file (for example, merging multiple sorted files together), this would be very valuable for staying within a certain memory budget on the GPU.

This would greatly simplify some of the processing code that deals with datasets larger than GPU memory; today we have to try to predict memory usage ahead of time and adapt the window size as we process files.

@etseidl etseidl changed the base branch from branch-23.08 to branch-23.10 July 26, 2023 03:52
@etseidl etseidl (Contributor, Author) commented Aug 22, 2023

Seems this will be superseded by PARQUET-2261. Will close this.
