Example of reading and writing parquet metadata outside the file #6081

alamb · 2024-07-17T21:42:07Z

Which issue does this PR close?

Related to #6002
Closes #6504

Rationale for this change

To figure out a good API we need an example of what we are trying to do

What changes are included in this PR?

Adds an example, with comments
The example is based on my interpretation of @adriangb's description in "API for encoding/decoding ParquetMetadata with more control" #6002 (comment)

Are there any user-facing changes?

Not yet, just an example

adriangb · 2024-07-17T23:17:54Z

parquet/examples/external_metadata.rs

+/// Specifically it:
+/// 1. It reads the metadata of a Parquet file
+/// 2. Removes some column statistics from the metadata (to make them smaller)
+/// 3. Stores the metadata in a separate file
+/// 4. Reads the metadata from the separate file and uses that to read the Parquet file


Hmmm I feel like we can simplify this example a bit. My use case is essentially along the lines of https://github.com/apache/datafusion/pull/10701/files#diff-81450b08df2ee29b3a9069865fc4f0c94883023c9d75bde729756c6bb4ec630d but instead of the metadata cache being in-memory you can imagine it's on disk (so that e.g. I can cache more metadata than would fit in memory).

Maybe this can be modeled something like:

struct KeyValueStore {
    storage: HashMap<String, Vec<u8>>,
}

impl KeyValueStore {
    pub async fn get(&self, key: String) -> &[u8];
    pub async fn set(&self, key: String, value: Vec<u8>);
}

The point being that we serialize the metadata to the key value store and then deserialize it from there, passing it into the reader instead of having the reader get it from the file itself. I don't think the editing of the metadata is necessary to get this example across.
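For illustration, here is a minimal synchronous sketch of the key-value idea above, using only the standard library. The `KeyValueStore` name and its methods come from the pseudocode in this comment and are hypothetical, not part of the parquet crate. The stored bytes are treated as opaque, on the assumption that they were produced by whatever metadata encoder is used.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for an external metadata store
/// (a file on disk, a cache, a database, ...).
struct KeyValueStore {
    storage: HashMap<String, Vec<u8>>,
}

impl KeyValueStore {
    fn new() -> Self {
        Self {
            storage: HashMap::new(),
        }
    }

    /// Returns the bytes stored under `key`, or `None` if nothing was stored.
    fn get(&self, key: &str) -> Option<&[u8]> {
        self.storage.get(key).map(|bytes| bytes.as_slice())
    }

    /// Stores `value` under `key`, replacing any previous entry.
    fn set(&mut self, key: &str, value: Vec<u8>) {
        self.storage.insert(key.to_string(), value);
    }
}
```

A caller would `set` the encoded metadata bytes once, keyed by something like the file path, then `get` them on later reads and hand them to the metadata decoder instead of reading the footer from the Parquet file itself.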

OK, makes sense. I agree that may be overly ambitious.

I think reading/writing to a file is pretty similar, and actually using a KV store might make it more complicated, so I kept a file for now.

File is fine by me 😄. Maybe a comment about storing the metadata in a fast cache like Redis or in a metadata store will be enough to spark imagination?


alamb · 2024-08-06T22:12:59Z

parquet/examples/external_metadata.rs

+/// This function reads the format written by `write_metadata_to_file`
+fn read_metadata_from_file(file: impl AsRef<Path>) -> ParquetMetaData {
+    let mut file = std::fs::File::open(file).unwrap();
+    // This API is kind of awkward compared to the writer


I also filled out this part of the PR showing how to read the metadata back -- it is (very) ugly compared to the nice ParquetMetadataWriter

@adriangb any interest in creating a ParquetMetadataReader API similar to ParquetMetadataWriter that handles these details? If so I can create a ticket / review a PR

Yes certainly interested!

What do you think about this as a plan: #6002 (comment)

alamb · 2024-08-06T22:22:04Z

parquet/examples/external_metadata.rs

+    let file = std::fs::File::open(file).unwrap();
+    let options = ArrowReaderOptions::new()
+        // tell the reader to read the page index
+        .with_page_index(true);


this is also kind of awkward -- it would be great if the actual reading of the parquet metadata could do this...

Do you mean loading the page index?

Yeah, what I was trying to get at was that since the ColumnIndex and OffsetIndex (aka the "Page index structures") are not stored inline, decode_metadata doesn't read them -- the logic to do so is baked into this reader

https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderOptions.html#method.with_page_index

I tried to describe this more on #6002 (comment)

Can we now get rid of with_page_index? Didn't ParquetMetaDataReader already load it?

I believe we can...it appears that options.page_index is only used in ArrowReaderMetadata::load, so it should have no effect.

I think this is now pretty clear -- we still need to explicitly call for loading the page index, but it makes sense I think

We could potentially change the default to be "read the page index unless told not to"


adriangb · 2024-08-07T02:45:38Z

parquet/examples/external_metadata.rs

+/// This function reads the format written by `write_metadata_to_file`
+fn read_metadata_from_file(file: impl AsRef<Path>) -> ParquetMetaData {
+    let mut file = std::fs::File::open(file).unwrap();
+    // This API is kind of awkward compared to the writer


Yes certainly interested!

adriangb · 2024-08-07T02:46:03Z

parquet/examples/external_metadata.rs

+    let file = std::fs::File::open(file).unwrap();
+    let options = ArrowReaderOptions::new()
+        // tell the reader to read the page index
+        .with_page_index(true);


Do you mean loading the page index?

mapleFU

👍

mapleFU · 2024-08-07T11:52:15Z

parquet/src/file/metadata/writer.rs

+        let column_indexes = self.convert_column_indexes();
+        let offset_indexes = self.convert_offset_index();
+
+        let mut encoder = ThriftMetadataWriter::new(


Would `serializer` be a better name than `encoder` here?

adriangb · 2024-09-24T17:30:26Z

parquet/examples/external_metadata.rs

+    ParquetMetaDataReader::new()
+        .with_column_indexes(true)
+        .with_offset_indexes(true)
+        .parse_and_finish(&mut file)
+        .unwrap()


Maybe I'm lost but I thought you'd need to pass in the original file size here to adjust offsets.

I bet the problem is that the example file doesn't actually have page offsets / page indexes 🤔 to load so the problem isn't hit.

Because an actual File is being passed, we can seek wherever we need to to find the page indexes. We only need to pass the original file size if we're passing a buffer with just the tail of the file.

But doesn't that file only have the tail of the original file? In other words, the file it's opening has for example 100 bytes but there's byte offsets referencing byte 101 to load the page index.
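To make the offset question concrete, here is a small sketch of the arithmetic involved when only the tail of a file is available. The function name and parameters are hypothetical and this is not the crate's implementation. It only shows why the original file size is needed to turn an absolute offset into a position inside a tail buffer.

```rust
/// Maps an absolute byte offset in the original file to an index into a
/// buffer that holds only the last `tail_len` bytes of that file.
///
/// Returns `None` if the offset falls before the start of the tail, at or
/// past the end of the file, or if the tail is longer than the file.
fn tail_index(original_file_size: u64, tail_len: u64, absolute_offset: u64) -> Option<u64> {
    // First absolute offset that the tail buffer covers
    let tail_start = original_file_size.checked_sub(tail_len)?;
    if absolute_offset < tail_start || absolute_offset >= original_file_size {
        return None;
    }
    Some(absolute_offset - tail_start)
}
```

With a 1000-byte original file and a 100-byte tail, absolute offset 950 maps to index 50 in the buffer, while absolute offset 101 is not in the buffer at all. Without knowing the original size (1000), there is no way to tell these cases apart.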

Ah, needed to read more of the example 😅. I think @alamb is correct...the original file has no page indexes anyway, so on the second read, even though we ask for them, since there are no offsets specified, there's no seeking done to find them. It would be interesting to see what happens if we start with a different file (alltypes_tiny_pages.parquet, for instance): since we don't ask for the page indexes in the first read, what happens to the page index offsets when we write the metadata back out?

Interestingly, this has also come up downstream in DataFusion with @progval on apache/datafusion#12593 (where the semi-automatic loading of page indexes causes unintended accesses and slowdowns)

I wonder if the metadata writer needs to modify the page index offsets/lengths in the ColumnMetaData if the indexes are not present in the ParquetMetaData. Then again, I could see wanting to preserve the page index offsets of the original file if you only want to save the footer metadata externally...perhaps an option on the metadata writer to preserve page index offsets if desired?
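One way to picture the option floated above is the sketch below. Everything in it is a hypothetical model: `ColumnChunkInfo`, `PageIndexOffsets`, and `prepare_for_write` are illustrative names, not existing parquet crate APIs. It only covers the case where the indexes were not loaded and leaves the column untouched otherwise.

```rust
/// Hypothetical, simplified stand-in for the page index location fields
/// carried in per-column-chunk metadata.
#[derive(Debug, Clone, PartialEq)]
struct ColumnChunkInfo {
    column_index_offset: Option<i64>,
    column_index_length: Option<i32>,
    offset_index_offset: Option<i64>,
    offset_index_length: Option<i32>,
}

/// Hypothetical writer option: what to do with page index offsets when the
/// indexes themselves were never loaded into the in-memory metadata.
#[derive(Debug, Clone, Copy)]
enum PageIndexOffsets {
    /// Keep the offsets, which still point into the original Parquet file
    Preserve,
    /// Drop the offsets so a later reader does not try to seek to them
    Clear,
}

/// Applies `policy` to one column chunk before its metadata is written out.
fn prepare_for_write(
    mut col: ColumnChunkInfo,
    indexes_loaded: bool,
    policy: PageIndexOffsets,
) -> ColumnChunkInfo {
    if !indexes_loaded {
        if let PageIndexOffsets::Clear = policy {
            col.column_index_offset = None;
            col.column_index_length = None;
            col.offset_index_offset = None;
            col.offset_index_length = None;
        }
    }
    col
}
```

`Preserve` suits the case where only the footer is stored externally and the original file is still the source for page indexes. `Clear` suits the case where the external copy must be readable on its own.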

This is an excellent point. This is why I think it is so important to have this example to motivate the API design

As I understand the use case, it is to:

1. Store the Parquet metadata as bytes externally (e.g. in a database like Redis, or some other location)

2. Use that metadata both for pruning and for actually reading the Parquet data when needed

I'll update the example to also have a file with page indexes and see what happens 🏃

Yep that's exactly my use case. I've done it by implementing AsyncFileReader::get_metadata so it works for both pruning and reading the file.

I wrote up some tests here: #6463

The good news is that as long as you load the page indexes with the initial metadata load, the ParquetMetadataWriter will correctly update the offsets so the metadata can be read again

The bad news is that #6464 happens (precisely as @etseidl predicted above):

I wonder if the metadata writer needs to modify the page index offsets/lengths in the ColumnMetaData if the indexes are not present in the ParquetMetaData.

I also made a PR with some tests for this use case: #6463


alamb · 2024-09-24T17:36:15Z

parquet/examples/external_metadata.rs

+    ParquetMetaDataReader::new()
+        .with_column_indexes(true)
+        .with_offset_indexes(true)
+        .parse_and_finish(&mut file)
+        .unwrap()


I bet the problem is that the example file doesn't actually have page offsets / page indexes 🤔 to load so the problem isn't hit.

alamb · 2024-10-03T13:18:02Z

Current status:

I am pretty excited about how this example is looking
I made a PR to improve some underlying documentation: Improve parquet MetadataFetch and AsyncFileReader docs #6505
The only thing remaining in my mind is to update the example to strip statistics from ColumnMetadata but I think that will need some additional APIs as well

alamb · 2024-10-07T19:21:56Z

OK, I think this PR is now basically ready to go.

I have one final small API addition (for modifying ColumnChunkMetadata) here: #6523 but once that is merged then this PR will be ready for review

alamb · 2024-10-08T20:32:31Z

parquet/src/file/metadata/mod.rs

-//!   and [`decode_metadata`]
-//! * Read from an `async` source to `ParquetMetaData`: [`MetadataLoader`]
-//! * Read from bytes or from an async source to `ParquetMetaData`: [`ParquetMetaDataReader`]
+//! * [`ParquetMetaDataReader`] for reading


this also updates the docs to point at the new APIs added in the previous releases

I can pull these changes out into their own PR if we prefer

alamb · 2024-10-08T20:32:55Z

Ok, 3 months later this PR is now ready for a good review!

etseidl

Looks great! Thanks for doing this, it really helped drive the development to have a concrete example.

parquet/src/file/metadata/mod.rs

parquet/examples/external_metadata.rs

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>

alamb · 2024-10-10T12:56:09Z

Thanks again for the reviews and inspiration @adriangb and @etseidl -- I think these APIs are looking quite good now ❤️

github-actions bot added the parquet Changes to the parquet crate label Jul 17, 2024

This was referenced Jul 17, 2024

API for encoding/decoding ParquetMetadata with more control #6002

Closed

Add ParquetMetadataWriter allow ad-hoc encoding of ParquetMetadata #6000

Closed

adriangb reviewed Jul 17, 2024


alamb force-pushed the alamb/parquet-stats-example branch from ea603d4 to ddd4240 Compare August 6, 2024 22:10

alamb commented Aug 6, 2024


alamb mentioned this pull request Aug 6, 2024

Add ThriftMetadataWriter for writing Parquet metadata #6197

Merged

adriangb reviewed Aug 7, 2024


mapleFU reviewed Aug 7, 2024


etseidl mentioned this pull request Sep 13, 2024

POC: Add ParquetMetaDataReader #6392

Closed

alamb force-pushed the alamb/parquet-stats-example branch from 578b8be to 9bcf552 Compare September 24, 2024 16:10

alamb mentioned this pull request Sep 24, 2024

Add ParquetMetaDataReader #6431

Merged

adriangb reviewed Sep 24, 2024


alamb commented Sep 24, 2024


alamb force-pushed the alamb/parquet-stats-example branch 3 times, most recently from 622cc7a to b6454ab Compare September 26, 2024 14:41

alamb force-pushed the alamb/parquet-stats-example branch 3 times, most recently from e1d54e4 to 3d6b976 Compare October 3, 2024 13:10

alamb mentioned this pull request Oct 3, 2024

Improve parquet MetadataFetch and AsyncFileReader docs #6505

Merged

alamb force-pushed the alamb/parquet-stats-example branch 2 times, most recently from a112ae4 to f73a96d Compare October 7, 2024 19:20

alamb force-pushed the alamb/parquet-stats-example branch from f73a96d to 0a125e9 Compare October 7, 2024 19:30

Example of reading and writing parquet metadata outside the file

9d926cf

alamb force-pushed the alamb/parquet-stats-example branch from 0a125e9 to 9d926cf Compare October 8, 2024 20:31

alamb commented Oct 8, 2024


alamb marked this pull request as ready for review October 8, 2024 20:32

etseidl approved these changes Oct 8, 2024


Apply suggestions from code review

ffb4fa7

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>

alamb added the documentation Improvements or additions to documentation label Oct 10, 2024

alamb merged commit 77dcdc0 into apache:master Oct 10, 2024
17 checks passed

alamb deleted the alamb/parquet-stats-example branch October 10, 2024 12:56
