Parquet: Make row-group filters cooperate to filter #6893

zhongyujiang · 2023-02-21T12:33:28Z

We found that Parquet row-group filters sometimes work poorly, specifically when evaluating an OR expression whose child expressions can only be answered by different row-group filters.

For example, suppose we have a sorted column foo whose null values are all clustered together after sorting, so a query like foo IS NULL can filter out most of the data. But when we combine it with another condition, for example bar IN (x, y, z) OR foo IS NULL (column bar is not sorted), the row-group filters work poorly. We found this is because ParquetMetricsRowGroupFilter is ineffective at evaluating bar IN (x, y, z), while at the same time ParquetDictionaryRowGroupFilter cannot answer foo IS NULL because a Parquet dictionary has no null stats. The same happens when one child of an OR can only be answered by ParquetBloomRowGroupFilter and the other can only be answered by ParquetMetricsRowGroupFilter or ParquetDictionaryRowGroupFilter.

This PR tries to solve this kind of issue. It borrows the idea of ResidualEvaluator: each row-group filter eliminates the predicates for which it can reach a ROWS_CANNOT_MATCH / ROWS_ALL_MATCH conclusion during evaluation, so the expression is reduced to a residual, which is then passed to the next row-group filter for evaluation. In this way, the three row-group filters work together to evaluate an expression.
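To illustrate the residual idea, here is a minimal, self-contained sketch. It does not use Iceberg's classes: `Expr`, `Pred`, `Or`, `Const`, `residual` and `shouldRead` are hypothetical names made up for this example, each "filter" is modeled as a plain map from the predicates it can decide to their outcome for one row group, and only OR is handled to keep it short.

```java
import java.util.List;
import java.util.Map;

public class ResidualChainSketch {
  // Toy expression model; illustrative only, not Iceberg's Expression classes.
  public interface Expr {}

  public enum Const implements Expr { ALWAYS_TRUE, ALWAYS_FALSE }

  public static final class Pred implements Expr {
    final String text;
    public Pred(String text) { this.text = text; }
  }

  public static final class Or implements Expr {
    final Expr left;
    final Expr right;
    public Or(Expr left, Expr right) { this.left = left; this.right = right; }
  }

  // A "filter" is modeled as the predicates it can decide for one row group.
  // Predicates it cannot decide are kept in the residual for the next filter.
  public static Expr residual(Expr expr, Map<String, Const> decided) {
    if (expr instanceof Pred) {
      Const outcome = decided.get(((Pred) expr).text);
      return outcome != null ? outcome : expr;
    }
    if (expr instanceof Or) {
      Expr left = residual(((Or) expr).left, decided);
      Expr right = residual(((Or) expr).right, decided);
      if (left == Const.ALWAYS_TRUE || right == Const.ALWAYS_TRUE) {
        return Const.ALWAYS_TRUE;
      }
      if (left == Const.ALWAYS_FALSE) {
        return right;
      }
      if (right == Const.ALWAYS_FALSE) {
        return left;
      }
      return new Or(left, right);
    }
    return expr; // constants pass through unchanged
  }

  // Pass the residual from one filter to the next; stop once it is decided.
  public static boolean shouldRead(Expr expr, List<Map<String, Const>> filters) {
    Expr current = expr;
    for (Map<String, Const> filter : filters) {
      current = residual(current, filter);
      if (current == Const.ALWAYS_FALSE) {
        return false; // no rows can match: skip the row group
      }
      if (current == Const.ALWAYS_TRUE) {
        return true;
      }
    }
    return true; // still undecided: the row group has to be read
  }

  public static void main(String[] args) {
    Expr expr = new Or(new Pred("bar IN (x, y, z)"), new Pred("foo IS NULL"));
    // Assumed outcomes for one row group: metrics rule out nulls in foo,
    // the dictionary rules out x, y, z in bar.
    Map<String, Const> metrics = Map.of("foo IS NULL", Const.ALWAYS_FALSE);
    Map<String, Const> dictionary = Map.of("bar IN (x, y, z)", Const.ALWAYS_FALSE);
    System.out.println(shouldRead(expr, List.of(metrics)));
    System.out.println(shouldRead(expr, List.of(dictionary)));
    System.out.println(shouldRead(expr, List.of(metrics, dictionary)));
  }
}
```

In this toy setup, neither filter alone can rule out the row group, because each leaves one child of the OR undecided. Chained, the first filter's residual is just the bar predicate, which the second filter can then eliminate.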

UPDATE:
I tested this part of the code, and the results show that it does improve the kind of queries mentioned above, filtering out a lot of files. Another minor benefit I can think of is that when an expression is eliminated in the metric filter, there is no need to load its dictionary in the subsequent dictionary filter.

zhongyujiang · 2023-02-21T12:34:51Z

@rdblue Could you help review this?

clesaec · 2023-02-22T08:27:18Z

parquet/src/main/java/org/apache/iceberg/parquet/ParquetBloomRowGroupFilter.java

-    public Boolean and(Boolean leftResult, Boolean rightResult) {
-      return leftResult && rightResult;
+    public Expression and(Supplier<Expression> left, Supplier<Expression> right) {
+      Expression leftResult = left.get();


Since the "and" and "or" methods always evaluate at least the left expression, using Suppliers does not seem relevant here; just changing
BloomEvalVisitor extends BoundExpressionVisitor<Boolean> to
BloomEvalVisitor extends BoundExpressionVisitor<Expression> would be enough to delay expression evaluation itself.
(or just have public abstract static class FindsResidualVisitor extends BoundExpressionVisitor<Expression>)
WDYT?

zhongyujiang · 2023-02-22

Yes, at least one node will be evaluated; the purpose of using a supplier is to allow us to skip evaluating the other node when appropriate. Before this PR, this short-circuit logic was implemented through ExpressionVisitors#visitEvaluator, but that can only be used for Boolean visitors.
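A rough sketch of the short-circuit described here, assuming a visitor whose results are expressions rather than Booleans. It is not the PR's code: `LazyVisitorSketch` and its string-based results are hypothetical stand-ins, with "true" and "false" playing the role of the always-true and always-false expressions.

```java
import java.util.function.Supplier;

public class LazyVisitorSketch {
  // Placeholder results; a real visitor would return Expression instances.
  public static final String ALWAYS_TRUE = "true";
  public static final String ALWAYS_FALSE = "false";

  // Children arrive as suppliers, so the right child is only evaluated
  // when the left result does not already decide the outcome.
  public static String or(Supplier<String> left, Supplier<String> right) {
    String leftResult = left.get();
    if (ALWAYS_TRUE.equals(leftResult)) {
      return ALWAYS_TRUE; // right.get() is never called
    }
    String rightResult = right.get();
    if (ALWAYS_TRUE.equals(rightResult)) {
      return ALWAYS_TRUE;
    }
    if (ALWAYS_FALSE.equals(leftResult)) {
      return rightResult;
    }
    if (ALWAYS_FALSE.equals(rightResult)) {
      return leftResult;
    }
    return "(" + leftResult + " OR " + rightResult + ")";
  }

  public static String and(Supplier<String> left, Supplier<String> right) {
    String leftResult = left.get();
    if (ALWAYS_FALSE.equals(leftResult)) {
      return ALWAYS_FALSE; // right.get() is never called
    }
    String rightResult = right.get();
    if (ALWAYS_FALSE.equals(rightResult)) {
      return ALWAYS_FALSE;
    }
    if (ALWAYS_TRUE.equals(leftResult)) {
      return rightResult;
    }
    if (ALWAYS_TRUE.equals(rightResult)) {
      return leftResult;
    }
    return "(" + leftResult + " AND " + rightResult + ")";
  }
}
```

With eagerly computed child results, the right-hand side would already have been evaluated by the time `or` or `and` runs; the suppliers are what make skipping it possible.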

cccs-jc · 2024-04-03T17:51:12Z

I'm interested in getting that PR into the upstream Iceberg. @zhongyujiang any reason why you stopped pursuing it? Are you using it in production?

zhongyujiang · 2024-04-04T01:14:56Z

any reason why you stopped pursuing it?

There was relatively little feedback from the community after opening this PR, so I did not proceed with it further. Since more people are encountering the same issue now, I'll resolve these conflicts and open a new one to get more eyes from the community.

Are you using it in production?

No, we don't.

cccs-jc · 2024-04-04T17:23:24Z

It would be great to revive your PR. I think it's the best approach and it's a major improvement over the current implementation. The query speed is much faster with this fix.

I have updated the test case in my own branch; feel free to consult it if you need to. https://github.com/CybercentreCanada/iceberg/tree/iceberg-improve_parq_row_group_filter

thanks @zhongyujiang

zhongyujiang · 2024-04-07T03:26:16Z

Replaced by #10090.

Parquet: Make row-group filters cooperate to filter.

fa89c03

github-actions bot added API data parquet labels Feb 21, 2023

clesaec reviewed Feb 22, 2023

zhongyujiang added 2 commits April 7, 2023 19:30

Switch to AssertJ and Junit5.

e32b235

Fix Junit5 assume.

5742b56

zhongyujiang mentioned this pull request Mar 25, 2024

OR condition does not leverage all parquet metadata (metrics, dictionary, bloom filter) causing inefficient queries #10029

Open

zhongyujiang mentioned this pull request Apr 7, 2024

Parquet: Make row-group filters cooperate to filter #10090

Open

zhongyujiang closed this Apr 7, 2024
