
[WIP] POC add data total size information to Parquet file metadata #12974

Closed
wants to merge 58 commits

Conversation

@etseidl etseidl (Contributor) commented Mar 17, 2023

Description

Parquet files that do not fit entirely within RAM must be read in chunks. But determining the ideal chunk size is made difficult by the fact that the Parquet file metadata does not give any indication of the fully decompressed/decoded data sizes. #11867 is a partial fix for this, but requires reading the entire file into RAM and decompressing all pages to be able to compute sensible batch sizes. Reading very large files will still result in out-of-memory errors.

This PR takes a different tack. During the file write, page size information is gathered from the input table and saved in the Parquet file footer, immediately preceding the page indexes. The size and location of this metadata are saved in key/value pairs contained in the column chunk metadata. Because the custom metadata is stored this way, other Parquet readers can still consume these files without noticing the information is there. This WIP also modifies the chunked reader so that it can use this metadata to compute splits without having to read any page data, while still functioning if the metadata is not present.
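To make the mechanism concrete, below is a minimal, hypothetical sketch in C++ (libcudf's language) of the kind of bookkeeping involved: per-page decoded sizes collected at write time, and key/value entries that point a cooperating reader at the size block while remaining invisible to other readers. The struct layout, key names (`cudf.page_size_info.*`), and function names here are illustrative assumptions, not the PR's actual serialization.

```cpp
#include <cstdint>
#include <map>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical per-page record gathered while encoding a column chunk.
struct page_size_info {
  uint32_t num_rows;       // rows in the page
  uint64_t decoded_bytes;  // size after decompression and decoding
};

// Hypothetical location of the size block written just before the page indexes.
struct size_block_location {
  uint64_t offset;  // absolute file offset of the size block
  uint32_t length;  // length of the size block in bytes
};

// Writer side: encode the location as key/value pairs that a standard
// Parquet reader will simply carry along and ignore.
std::map<std::string, std::string> to_key_value(size_block_location loc)
{
  return {{"cudf.page_size_info.offset", std::to_string(loc.offset)},
          {"cudf.page_size_info.length", std::to_string(loc.length)}};
}

// Reader side: only take the fast path when the keys are present; otherwise
// fall back to decompressing pages to discover sizes.
bool has_size_metadata(std::map<std::string, std::string> const& kv)
{
  return kv.count("cudf.page_size_info.offset") > 0 &&
         kv.count("cudf.page_size_info.length") > 0;
}

// With the size block decoded, a chunk's in-memory footprint is just a sum,
// which is what lets the reader compute splits without touching page data.
uint64_t total_decoded_bytes(std::vector<page_size_info> const& pages)
{
  return std::accumulate(pages.begin(), pages.end(), uint64_t{0},
                         [](uint64_t acc, page_size_info const& p) {
                           return acc + p.decoded_bytes;
                         });
}
```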

I'm hoping this PR will kick off a discussion of how to handle Parquet data sets that do not fit in RAM, with this being one possible solution.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rapids-bot rapids-bot bot commented Mar 17, 2023

Pull requests from external contributors require approval from a rapidsai organization member with write or admin permissions before CI can begin.

@github-actions github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code) label Mar 17, 2023
@revans2 revans2 (Contributor) commented Mar 20, 2023

I'm hoping this PR will kick off a discussion of how to handle Parquet data sets that do not fit in RAM, with this being one possible solution.

From the Spark side of things, Spark has a config, spark.sql.files.maxPartitionBytes. This is a generic config, but for Parquet it splits large files into ranges of at most that many bytes and assigns a task to each of these partitions. Any row group whose midpoint falls into that file range is then fully processed by that task.
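For illustration, here is a small sketch of that assignment rule, written in C++ rather than Spark's Scala to match the other code in this thread; the struct and function names are hypothetical. A task owns a byte range carved out by spark.sql.files.maxPartitionBytes and processes every row group whose midpoint lands inside that range.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of where a row group sits in the file.
struct row_group_span {
  uint64_t offset;  // byte offset of the row group in the file
  uint64_t size;    // compressed size of the row group in bytes
};

// Return the indices of the row groups a task owning [range_start, range_end)
// will fully process: those whose midpoint falls inside the range.
std::vector<std::size_t> row_groups_for_task(std::vector<row_group_span> const& groups,
                                             uint64_t range_start,
                                             uint64_t range_end)
{
  std::vector<std::size_t> owned;
  for (std::size_t i = 0; i < groups.size(); ++i) {
    uint64_t mid = groups[i].offset + groups[i].size / 2;
    if (mid >= range_start && mid < range_end) { owned.push_back(i); }
  }
  return owned;
}
```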

This was originally designed to work with HDFS, which stores the data in blocks; if the partition size matches the HDFS block size, there is a higher probability that the computation can move to the data and get locality. In practice, most data today is read from blob stores like S3, which is almost always a remote read.

In addition to this downside, there is also the problem that it does not take column pruning or predicate pushdown into account in any way. Nor does it take into account the compression of the file.

Because of this, when running with the RAPIDS Accelerator for Apache Spark, we recommend that users set this number quite a bit larger (typically 2 GiB). This is to make sure that the GPU gets enough data to actually keep it busy while reading the files, and we added the chunked reader feature to try and handle the cases when the data sizes really explode. For the most part it has worked for us, but we still run out of memory if the data explodes in size.

It would be great if we had a way to better optimize the reads so that we stayed within a target memory budget, but at the same time we need to avoid crashing on files that were not written by CUDF.

I am going to have to dig into this proposal more before I can really comment further on it, or say whether there is other information that would be good to have stored with it.

For your information, we don't use CUDF to access files directly. Instead, we download the parts of the files ourselves, piece them back together into a single Parquet file, and hand it to CUDF as a buffer to parse. This is because the file reader API in Spark is pluggable, so we cannot know exactly what is happening, especially with security tokens, transparent encryption, and transparent caching, without jumping through a number of hoops. It also allows us to play games with things like combining small files together into larger files to speed up parsing. So if we wanted to use this feature, we would likely need a way to parse the metadata separately from parsing the file and a way to write it back out again.

@etseidl etseidl changed the base branch from branch-23.04 to branch-23.08 June 30, 2023 00:18
@github-actions github-actions bot removed the Python (Affects Python cuDF API), CMake (CMake build issue), conda, and Java (Affects Java cuDF API) labels Jun 30, 2023
@m29498 m29498 commented Jun 30, 2023

We have use cases where the approach in this PR would prove very useful. When we need to load multiple Parquet files at once to be processed and merged into one file (for example, merging multiple sorted files together), this would be very valuable for staying within a certain memory budget on the GPU.

This would greatly simplify some of the processing code that deals with datasets larger than GPU memory; today we have to try to predict memory usage ahead of time and adapt the window size as we process files.

@etseidl etseidl changed the base branch from branch-23.08 to branch-23.10 July 26, 2023 03:52
@etseidl etseidl (Contributor, Author) commented Aug 22, 2023

Seems this will be superseded by PARQUET-2261. Will close this.
