Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet: Make row-group filters cooperate to filter #6893

Closed

Conversation

zhongyujiang
Copy link
Contributor

@zhongyujiang zhongyujiang commented Feb 21, 2023

We found that Parquet row-group filters may not work well sometimes, specifically, when evaluating expressions connected by OR and if the child expressions of this OR expression can only be evaluated by different row-group filters.

For example, suppose we have a sorted column foo, its null values are all clustered together after sorting,so queries like foo IS NULL can filter out most of the data. But when we want to combine other conditions to query, for example: bar IN (x, y, z) OR foo IS NULL(column bar is not sorted), row group filters can't work well, we found this is because that ParquetMetricRowGroupFilter has poor effect on evaluating bar IN (x, y, z) while at the same time ParquetDictionaryRowGroupFilter cannot answer foo IS NULL because Parquet dictionary has no nulls stats. This also happens when one child node of OR can only be answered by ParquetBloomRowGroupFilter but the other can only be answered by ParquetMetricRowGroupFilter or ParquetDictionaryRowGroupFilter.


This PR tries to solve this kind of issue. It borrows the idea of ResidualEvaluator, allowing row-group filters to eliminate those predicates that can get ROWS_CANNOT_MATCH / ROWS_ALL_MATCH conclusions during the evaluation process, so that an expression can be evaluated for residuals, which is then passed to the next row-group filter for evaluation. In this way, it makes three row-group filters to work together to evaluate an expression.

UPDATE:
I tested this part of the code and the result shows that it does improve the kind of queries mentioned above, filtering out a lot of files. Another minor benefit of this I can think of is that when an expression can be eliminated in the metric filter, there is no need to load its dictionary in the subsequent dictionary filter.

@zhongyujiang
Copy link
Contributor Author

@rdblue Could you help review this?

public Boolean and(Boolean leftResult, Boolean rightResult) {
return leftResult && rightResult;
public Expression and(Supplier<Expression> left, Supplier<Expression> right) {
Expression leftResult = left.get();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As on methods "and" & "or", you always evaluate, at least left expression, it's like using Suppliers is not relevant here; and just having
BloomEvalVisitor extends BoundExpressionVisitor<Boolean> to
BloomEvalVisitor extends BoundExpressionVisitor<Expression> would be enough to delay expression evaluation itself.
(or just have public abstract static class FindsResidualVisitor extends BoundExpressionVisitor<Expression>)
WDYT ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, at least one node will be evaluated, the purpose of using supplier is to allow us to skip the evaluation of the other node as appropriate. Before this PR, this short-circuit logic is implemented through Expressions#visitEvaluator, but it can only be used for Boolean visitors.

@cccs-jc
Copy link
Contributor

cccs-jc commented Apr 3, 2024

I'm interested in getting that PR into the upstream Iceberg. @zhongyujiang any reason why you stopped pursuing it? Are you using it in production?

@zhongyujiang
Copy link
Contributor Author

any reason why you stopped pursuing it?

There was relatively little feedback from the community after openning this PR, so I did not proceed with it further. Since more people are encountering the same issue now, I'll resolve these conflicts and open a new one to get more eyes from the community.

Are you using it in production?

No, we don't.

@cccs-jc
Copy link
Contributor

cccs-jc commented Apr 4, 2024

It would be great to revive your PR. I think it's the best approach and it's a major improvement over the current implementation. The query speed is much faster with this fix.

I have update the test case in my own branch, feel free to consult if you need to. https://github.com/CybercentreCanada/iceberg/tree/iceberg-improve_parq_row_group_filter

thanks @zhongyujiang

@zhongyujiang
Copy link
Contributor Author

Replaced by #10090.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants