Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet: Make row-group filters cooperate to filter #10090

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

zhongyujiang
Copy link
Contributor

This PR refactors three Parquet row-group filters into a form that computes residual expressions, allowing it to return a residual expression for the given row-groups. The residual computed by the previous filter can be passed to the next filter, allowing the three Parquet row-group filters to work together. This improves the handling of some OR condition queries.

For example:
Let's assume we have a query a = 'foo' OR b = 'bar', where column a is dictionary-encoded in a Parquet row-group, while column b is not entirely dictionary-encoded in all data pages but has a bloom filter. Therefore, a = 'foo' can only be evaluated by the dictionary filter, and b = 'bar' can only be evaluated by the bloom filter. In the current situation, even if both filters evaluate the expressions as ROWS_CANNOT_MATCH individually, because each filter can only evaluate one sub-expression, the final output would still be ROWS_MIGHT_MATCH (let's assume the metric filter evaluates both sub-expressions as ROWS_MIGHT_MATCH).
After refactoring into the form of computing residuals, the dictionary filter will compute the residual for a = 'foo' OR b = 'bar' as b = 'bar'. Then this residual expression will be passed to the bloom filter and evaluated as Expressions.alwaysFalse(). As a result, the reading of this row-group can be skipped.

This is a revive of #6893, and can close #10029.

cc @cccs-jc @rdblue @huaxingao @amogh-jahagirdar @RussellSpitzer Could you please review this? Thanks!

@@ -290,7 +299,7 @@ private <T> boolean shouldRead(
hashValue = bloom.hash(((Number) value).intValue());
return bloom.findHash(hashValue);
default:
return ROWS_MIGHT_MATCH;
return true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is more readable to keep the constants:

Suggested change
return true;
return ROWS_MIGHT_MATCH;

A few more occurrences below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because the types of these two constants have been changed from boolean to the generic class Expression of FindsResidualVisitor.

For readability, maybe we can add a comment after the boolean value ?

return true; /* rows might match */

@amogh-jahagirdar
Copy link
Contributor

@zhongyujiang I'm really sorry for the delayed review on my part. I think this is an important improvement!

I will be taking a deeper look at the implementation this week. A while back, I did check this code out locally and run some tests on some samples to get confidence on correctness but of course we'll also want unit test coverage as much as reasonably possible.

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your patience @zhongyujiang . Had some comments, I think the main one is around test structure.

public Boolean not(Boolean result) {
return !result;
public Expression not(Expression result) {
throw new UnsupportedOperationException("This path shouldn't be reached.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really following why this throws UnsupportedOperationException? Should it have already been rewritten by RewriteNot or something?

Copy link
Contributor Author

@zhongyujiang zhongyujiang Jun 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're right, it should already been rewritten.

It's just when I was refactoring this part, I found that the dict filter and metric filter handle not inconsistently compared to the bloom filter. The bloom filter will directly throw an UnsupportedOperationException because it cannot evaluate not. In fact, the dict filter and metric filter also cannot handle not since they can only conclude ROWS_CANNOT_MATCH or ROWS_MIGHT_MATCH. So, I made their handling of not consistent with the bloom filter. Anyway, this should not be reached.

Is this ok? I can revert this if not.

Comment on lines 1206 to 1227
Expression expected = Binder.bind(SCHEMA.asStruct(), Expressions.or(bloom, dict), true);
ParquetMetricsRowGroupFilter metricFilter = new ParquetMetricsRowGroupFilter(SCHEMA, expr);
Expression metricResidual = metricFilter.residualFor(parquetSchema, rowGroupMetadata);
assertThat(expected.isEquivalentTo(metricResidual))
.as("Expected residual: %s, actual residual: %s", expected, metricResidual);

expected = Binder.bind(SCHEMA.asStruct(), bloom, true);
ParquetDictionaryRowGroupFilter dictFilter =
new ParquetDictionaryRowGroupFilter(SCHEMA, metricResidual);
Expression dictResidual =
dictFilter.residualFor(
parquetSchema, rowGroupMetadata, reader.getDictionaryReader(rowGroupMetadata));

assertThat(expected.isEquivalentTo(dictResidual))
.as("Expected residual: %s, actual residual: %s", expected, dictResidual);

expected = Expressions.alwaysFalse();
ParquetBloomRowGroupFilter bloomFilter = new ParquetBloomRowGroupFilter(SCHEMA, dictResidual);
Expression bloomResidual =
bloomFilter.residualFor(
parquetSchema, rowGroupMetadata, reader.getBloomFilterDataReader(rowGroupMetadata));

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels a bit odd to test the combined filter residual logic in this test method in TestBloomRowGroupFilter. See my comment above on a separate test class which encapsulates the logic, which should also make it easier for testing since we can then move this to a separate test class.

As far as testing goes, here are the cases I think:

1.) We know the existing tests should cover the cases where an individual filter is always true or always false.
2.) So following 1, then we'd want the following tests for the combined tests:
a.) Where the metrics filter has a residual that's always true/false
b.) Where the metrics filter has a residual that's not true/false and the dictionary filter has one that is true/false.
c.) Where the metrics filter has a residual that's not true/false, the dictionary filter does not have one that is true/false, and the bloom filter returns a residual that's not true/false.
d.) Same as c but the bloom filter does return true/false

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @amogh-jahagirdar, thanks for the comments.

I'm sorry for the delay, comments have been addressed, please take a look when you have time.

@zhongyujiang zhongyujiang force-pushed the residual-parq-rowgroup-evaluator-2 branch from 47302b4 to dd70a4c Compare June 29, 2024 14:25
@zhongyujiang zhongyujiang force-pushed the residual-parq-rowgroup-evaluator-2 branch from dd70a4c to 9b8cbe5 Compare June 30, 2024 03:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants