-
Notifications
You must be signed in to change notification settings - Fork 891
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use page statistics in Parquet reader #14973
Conversation
parquet_read_decode benchmark
|
parquet_read_chunks benchmark
and the same but forcing PLAIN encoding
|
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I already review the previous version of this feature, but going over it again since I didn't quite approve it back then (I almost did, IIRC).
Few nitpicks so far.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, just one comment that reads a bit odd to me.
|
||
// calculate num_nulls if not available from column index | ||
if (not pg_info.num_nulls.has_value()) { | ||
pg_info.num_nulls = std::reduce(h, h + max_def_level); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the least verbose use of a standard algorithm I've ever seen :D
// info is missing or insufficient, then just return without modifying the row_group_info. | ||
if (not pg_info.num_nulls.has_value() or not pg_info.num_valid.has_value()) { return; } | ||
|
||
// if using older page indexes that lack size info, don't use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// if using older page indexes that lack size info, don't use | |
// if using older page indexes that lack size info, don't use `pg_info` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cleaned this up and added a more detailed TODO...will push once CI finishes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking at a few other PRs, this might be an "if CI finishes" situation 😬
/ok to test |
/ok to test |
Looks like #15020 broke this (db 3 ets 0 😮 😭 🤣). Regrouping... |
Co-authored-by: Yunsong Wang <yunsongw@nvidia.com>
/ok to test |
/merge |
Description
#14000 added the ability to write new page statistics to the Parquet writer. This PR uses these new statistics to avoid some string size computations. Benchmarks show an improvement in read times of up to 20%.
Checklist