Parquet: Make row-group filters cooperate to filter #10090

zhongyujiang · 2024-04-07T02:54:52Z

This PR refactors the three Parquet row-group filters so that each one computes a residual expression for a given row group instead of a plain boolean. The residual computed by one filter can be passed to the next, which lets the three filters work together. This improves the handling of some OR queries.

For example, assume we have the query a = 'foo' OR b = 'bar', where column a is dictionary-encoded in a Parquet row group, while column b is not dictionary-encoded in every data page but has a bloom filter. Therefore, a = 'foo' can only be evaluated by the dictionary filter, and b = 'bar' can only be evaluated by the bloom filter. Currently, even if each filter evaluates its own sub-expression as ROWS_CANNOT_MATCH, neither filter can evaluate the other sub-expression, so the final result is still ROWS_MIGHT_MATCH (assuming the metrics filter evaluates both sub-expressions as ROWS_MIGHT_MATCH).
After the refactoring, the dictionary filter computes the residual of a = 'foo' OR b = 'bar' as b = 'bar'. This residual is then passed to the bloom filter, which evaluates it to Expressions.alwaysFalse(). As a result, reading this row group can be skipped.
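The residual hand-off in this example can be sketched with a small, self-contained toy model. It does not use the Iceberg classes: Expr, Eq, Or, and ToyFilter are hypothetical stand-ins, and the rule that a filter keeps any predicate it cannot rule out is a simplifying assumption, not necessarily what the real filters return.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy model only: Expr, Eq, Or and ToyFilter are hypothetical, not Iceberg classes.
final class ResidualChainSketch {
  interface Expr {}

  static final Expr TRUE = new Expr() {
    @Override public String toString() { return "true"; }
  };
  static final Expr FALSE = new Expr() {
    @Override public String toString() { return "false"; }
  };

  static final class Eq implements Expr {
    final String column;
    final String value;
    Eq(String column, String value) { this.column = column; this.value = value; }
    @Override public String toString() { return column + " = '" + value + "'"; }
  }

  static final class Or implements Expr {
    final Expr left;
    final Expr right;
    Or(Expr left, Expr right) { this.left = left; this.right = right; }
    @Override public String toString() { return "(" + left + " OR " + right + ")"; }
  }

  // Builds an OR and folds away constant operands.
  static Expr or(Expr left, Expr right) {
    if (left == TRUE || right == TRUE) return TRUE;
    if (left == FALSE) return right;
    if (right == FALSE) return left;
    return new Or(left, right);
  }

  // A filter that can only rule out predicates on the columns it has data for.
  static final class ToyFilter {
    private final Set<String> decidableColumns;
    private final Set<String> possibleValues; // "column=value" pairs that may be present

    ToyFilter(Set<String> decidableColumns, Set<String> possibleValues) {
      this.decidableColumns = decidableColumns;
      this.possibleValues = possibleValues;
    }

    Expr residualFor(Expr expr) {
      if (expr instanceof Or) {
        Or o = (Or) expr;
        return or(residualFor(o.left), residualFor(o.right));
      }
      if (expr instanceof Eq) {
        Eq eq = (Eq) expr;
        if (!decidableColumns.contains(eq.column)) return eq; // cannot evaluate: keep it
        // Toy simplification: a predicate that might match is kept as-is.
        return possibleValues.contains(eq.column + "=" + eq.value) ? eq : FALSE;
      }
      return expr; // constants pass through unchanged
    }
  }

  static Set<String> setOf(String... items) { return new HashSet<>(Arrays.asList(items)); }

  // Returns the residual after the metrics, dictionary and bloom stages, in that order.
  static Expr[] demo() {
    Expr query = or(new Eq("a", "foo"), new Eq("b", "bar"));
    ToyFilter metrics = new ToyFilter(setOf(), setOf());             // decides nothing here
    ToyFilter dictionary = new ToyFilter(setOf("a"), setOf("a=baz")); // 'foo' not in dictionary
    ToyFilter bloom = new ToyFilter(setOf("b"), setOf("b=qux"));      // 'bar' not in bloom filter
    Expr afterMetrics = metrics.residualFor(query);
    Expr afterDictionary = dictionary.residualFor(afterMetrics);
    Expr afterBloom = bloom.residualFor(afterDictionary);
    return new Expr[] {afterMetrics, afterDictionary, afterBloom};
  }

  public static void main(String[] args) {
    for (Expr residual : demo()) System.out.println(residual);
  }
}
```

In this sketch the metrics stage leaves the OR untouched, the dictionary stage reduces it to b = 'bar', and the bloom stage reduces that to false, which is the point at which the row group could be skipped.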

This revives #6893 and can close #10029.

cc @cccs-jc @rdblue @huaxingao @amogh-jahagirdar @RussellSpitzer Could you please review this? Thanks!

Fokko · 2024-06-08T06:53:36Z

parquet/src/main/java/org/apache/iceberg/parquet/ParquetBloomRowGroupFilter.java

@@ -290,7 +299,7 @@ private <T> boolean shouldRead(
              hashValue = bloom.hash(((Number) value).intValue());
              return bloom.findHash(hashValue);
            default:
-              return ROWS_MIGHT_MATCH;
+              return true;


I think it is more readable to keep the constants:

Suggested change

-              return true;
+              return ROWS_MIGHT_MATCH;

A few more occurrences below

zhongyujiang · 2024-06-11

This is because the type of these two constants has changed from boolean to Expression, the generic type used by FindsResidualVisitor, so shouldRead, which still returns a boolean, can no longer return them.

For readability, maybe we can add a comment after the boolean value?

return true; /* rows might match */

amogh-jahagirdar · 2024-06-10T06:16:28Z

@zhongyujiang I'm really sorry for the delayed review on my part. I think this is an important improvement!

I will be taking a deeper look at the implementation this week. A while back, I did check this code out locally and ran some tests on some samples to gain confidence in its correctness, but of course we'll also want as much unit test coverage as is reasonably possible.

amogh-jahagirdar

Thank you for your patience @zhongyujiang. I had some comments; I think the main one is around test structure.

parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java

amogh-jahagirdar · 2024-06-14T22:32:11Z

parquet/src/main/java/org/apache/iceberg/parquet/ParquetDictionaryRowGroupFilter.java

-    public Boolean not(Boolean result) {
-      return !result;
+    public Expression not(Expression result) {
+      throw new UnsupportedOperationException("This path shouldn't be reached.");


I'm not really following why this throws UnsupportedOperationException? Should it have already been rewritten by RewriteNot or something?

zhongyujiang · 2024-06-29

Yes, you're right, it should already have been rewritten.

When I was refactoring this part, I found that the dictionary filter and the metrics filter handled not differently from the bloom filter. The bloom filter throws an UnsupportedOperationException directly because it cannot evaluate not. In fact, the dictionary filter and the metrics filter cannot handle not either, since they can only conclude ROWS_CANNOT_MATCH or ROWS_MIGHT_MATCH, and negating "might match" does not prove "cannot match". So I made their handling of not consistent with the bloom filter. In any case, this path should not be reached.

Is this ok? I can revert this if not.
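For context on why not should never reach the filters, here is a minimal, self-contained sketch of pushing negation down to the leaf predicates before evaluation. It is a toy stand-in for that kind of rewrite, not the Iceberg RewriteNot implementation: the Expr class, the Op enum, and the helper names are all hypothetical.

```java
// Toy model only: this Expr type and its helpers are hypothetical, not Iceberg classes.
final class RewriteNotSketch {
  enum Op { EQ, NOT_EQ, AND, OR, NOT }

  static final class Expr {
    final Op op;
    final String column; // leaf predicates only
    final String value;  // leaf predicates only
    final Expr left;     // AND, OR, NOT
    final Expr right;    // AND, OR

    private Expr(Op op, String column, String value, Expr left, Expr right) {
      this.op = op;
      this.column = column;
      this.value = value;
      this.left = left;
      this.right = right;
    }

    @Override public String toString() {
      switch (op) {
        case EQ: return column + " = '" + value + "'";
        case NOT_EQ: return column + " != '" + value + "'";
        case NOT: return "NOT " + left;
        default: return "(" + left + " " + op + " " + right + ")";
      }
    }
  }

  static Expr eq(String column, String value) { return new Expr(Op.EQ, column, value, null, null); }
  static Expr notEq(String column, String value) { return new Expr(Op.NOT_EQ, column, value, null, null); }
  static Expr and(Expr left, Expr right) { return new Expr(Op.AND, null, null, left, right); }
  static Expr or(Expr left, Expr right) { return new Expr(Op.OR, null, null, left, right); }
  static Expr not(Expr child) { return new Expr(Op.NOT, null, null, child, null); }

  // Removes every NOT node by applying De Morgan's laws and negating leaf predicates.
  static Expr rewriteNot(Expr expr) { return rewrite(expr, false); }

  private static Expr rewrite(Expr e, boolean negated) {
    switch (e.op) {
      case NOT:
        return rewrite(e.left, !negated); // flip the pending negation and drop the node
      case AND:
        return negated
            ? or(rewrite(e.left, true), rewrite(e.right, true))
            : and(rewrite(e.left, false), rewrite(e.right, false));
      case OR:
        return negated
            ? and(rewrite(e.left, true), rewrite(e.right, true))
            : or(rewrite(e.left, false), rewrite(e.right, false));
      case EQ:
        return negated ? notEq(e.column, e.value) : e;
      case NOT_EQ:
        return negated ? eq(e.column, e.value) : e;
      default:
        throw new IllegalStateException("Unknown op: " + e.op);
    }
  }

  // After rewriting, a visitor that still meets NOT has hit a bug, hence the exception in the filters.
  static boolean containsNot(Expr e) {
    if (e == null) return false;
    return e.op == Op.NOT || containsNot(e.left) || containsNot(e.right);
  }

  public static void main(String[] args) {
    Expr original = not(or(eq("a", "foo"), not(eq("b", "bar"))));
    System.out.println(original);
    System.out.println(rewriteNot(original));
  }
}
```

Once an expression has been rewritten this way, a row-group filter only ever sees and, or, and leaf predicates, so throwing from not documents an invariant rather than rejecting valid input.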

parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java

amogh-jahagirdar · 2024-06-15T16:42:18Z

parquet/src/test/java/org/apache/iceberg/parquet/TestBloomRowGroupFilter.java

+    Expression expected = Binder.bind(SCHEMA.asStruct(), Expressions.or(bloom, dict), true);
+    ParquetMetricsRowGroupFilter metricFilter = new ParquetMetricsRowGroupFilter(SCHEMA, expr);
+    Expression metricResidual = metricFilter.residualFor(parquetSchema, rowGroupMetadata);
+    assertThat(expected.isEquivalentTo(metricResidual))
+        .as("Expected residual: %s, actual residual: %s", expected, metricResidual);
+
+    expected = Binder.bind(SCHEMA.asStruct(), bloom, true);
+    ParquetDictionaryRowGroupFilter dictFilter =
+        new ParquetDictionaryRowGroupFilter(SCHEMA, metricResidual);
+    Expression dictResidual =
+        dictFilter.residualFor(
+            parquetSchema, rowGroupMetadata, reader.getDictionaryReader(rowGroupMetadata));
+
+    assertThat(expected.isEquivalentTo(dictResidual))
+        .as("Expected residual: %s, actual residual: %s", expected, dictResidual);
+
+    expected = Expressions.alwaysFalse();
+    ParquetBloomRowGroupFilter bloomFilter = new ParquetBloomRowGroupFilter(SCHEMA, dictResidual);
+    Expression bloomResidual =
+        bloomFilter.residualFor(
+            parquetSchema, rowGroupMetadata, reader.getBloomFilterDataReader(rowGroupMetadata));
+


It feels a bit odd to test the combined filter residual logic in this test method in TestBloomRowGroupFilter. See my comment above on a separate test class which encapsulates the logic, which should also make it easier for testing since we can then move this to a separate test class.

As far as testing goes, here are the cases I think we need:

1.) We know the existing tests should cover the cases where an individual filter is always true or always false.
2.) Following from 1, we'd want these tests for the combined filters:
a.) Where the metrics filter has a residual that's always true/false
b.) Where the metrics filter has a residual that's not true/false and the dictionary filter has one that is true/false.
c.) Where the metrics filter has a residual that's not true/false, the dictionary filter does not have one that is true/false, and the bloom filter returns a residual that's not true/false.
d.) Same as c but the bloom filter does return true/false
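Cases a through d can be pictured with a small, self-contained sketch of a combined evaluator. Everything in it is hypothetical: expressions are plain strings, Stage and Outcome are made-up types, and stopping the chain as soon as a residual becomes constant is an assumption about how a combined class might behave, not a description of this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Toy model only: Stage and Outcome are hypothetical, and expressions are plain strings.
final class CombinedFilterCasesSketch {
  static final class Stage {
    final String name;
    final UnaryOperator<String> residualFor;
    Stage(String name, UnaryOperator<String> residualFor) {
      this.name = name;
      this.residualFor = residualFor;
    }
  }

  static final class Outcome {
    final List<String> stagesRun = new ArrayList<>();
    String residual;
    // Only a residual of "false" proves that the row group can be skipped.
    boolean shouldRead() { return !"false".equals(residual); }
  }

  static boolean isConstant(String expr) {
    return "true".equals(expr) || "false".equals(expr);
  }

  // Feeds each stage the previous residual; stops once the residual is constant (assumption).
  static Outcome run(String expr, List<Stage> stages) {
    Outcome out = new Outcome();
    out.residual = expr;
    for (Stage stage : stages) {
      if (isConstant(out.residual)) break;
      out.residual = stage.residualFor.apply(out.residual);
      out.stagesRun.add(stage.name);
    }
    return out;
  }

  // A stage that fully decides the expression.
  static Stage decides(String name, String result) { return new Stage(name, e -> result); }

  // A stage that cannot decide anything and returns its input as the residual.
  static Stage undecided(String name) { return new Stage(name, e -> e); }

  public static void main(String[] args) {
    String expr = "a = 'foo'";
    List<List<Stage>> cases = Arrays.asList(
        Arrays.asList(decides("metrics", "false"), undecided("dict"), undecided("bloom")),  // a
        Arrays.asList(undecided("metrics"), decides("dict", "false"), undecided("bloom")),  // b
        Arrays.asList(undecided("metrics"), undecided("dict"), undecided("bloom")),         // c
        Arrays.asList(undecided("metrics"), undecided("dict"), decides("bloom", "false"))); // d
    for (List<Stage> stages : cases) {
      Outcome out = run(expr, stages);
      System.out.println(out.stagesRun + " -> " + out.residual + ", read=" + out.shouldRead());
    }
  }
}
```

A dedicated test class for the combined logic could assert on both the final residual and which filters were consulted, which is what the stagesRun list stands in for here.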

zhongyujiang · 2024-06-29

Hi @amogh-jahagirdar, thanks for the comments.

Sorry for the delay. The comments have been addressed; please take a look when you have time.

github-actions bot added API parquet data labels Apr 7, 2024

zhongyujiang mentioned this pull request Apr 7, 2024

Parquet: Make row-group filters cooperate to filter #6893

Closed

amogh-jahagirdar self-requested a review April 9, 2024 15:23

Fokko self-requested a review April 30, 2024 18:30

Fokko requested a review from nastra May 8, 2024 21:24

danielcweeks self-requested a review May 16, 2024 15:45

Fokko reviewed Jun 8, 2024

amogh-jahagirdar reviewed Jun 15, 2024

zhongyujiang force-pushed the residual-parq-rowgroup-evaluator-2 branch from 47302b4 to dd70a4c June 29, 2024 14:25

zhongyujiang added 7 commits June 30, 2024 11:40

Parquet: Make row-group filters cooperate to filter.

7e97545

Switch to AssertJ and Junit5.

76ec914

Remove SupportsLazyEvaluation.

2d1b36d

Fix extra line.

a4f47d2

Add comment.

bb55c3e

Improve tests.

a680ece

Fix.

9b8cbe5

zhongyujiang force-pushed the residual-parq-rowgroup-evaluator-2 branch from dd70a4c to 9b8cbe5 June 30, 2024 03:55

zhongyujiang commented Apr 7, 2024

Fokko Jun 8, 2024

zhongyujiang Jun 11, 2024

amogh-jahagirdar commented Jun 10, 2024

amogh-jahagirdar left a comment

amogh-jahagirdar Jun 14, 2024

zhongyujiang Jun 29, 2024

amogh-jahagirdar Jun 15, 2024

zhongyujiang Jun 29, 2024
