Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix update_expr for projection pushdown #9096

Merged
merged 6 commits into from
Feb 2, 2024

Conversation

viirya
Copy link
Member

@viirya viirya commented Feb 1, 2024

Which issue does this PR close?

Closes #9091.

Rationale for this change

projection_pushdown rule uses function update_expr to check if a plan's expressions could be updated with projection expressions. The function only checks column name so if there are columns with same name, the check will return incorrect RewrittenValid state. Column index should be checked too in the function update_expr.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Feb 1, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much @viirya -- super sluthing 🕵️

FYI @DDtKey

@@ -916,7 +916,9 @@ fn update_expr(
.find_map(|(index, (projected_expr, alias))| {
projected_expr.as_any().downcast_ref::<Column>().and_then(
|projected_column| {
column.name().eq(projected_column.name()).then(|| {
(column.name().eq(projected_column.name())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is another similar check a bit further down in the file in new_columns_for_join_on

https://github.com/apache/arrow-datafusion/blob/c9049eda85683d958a39eca5fd3a382943ee7fa6/datafusion/core/src/physical_optimizer/projection_pushdown.rs#L1046

Likewise in new_indices_for_join_filter

https://github.com/apache/arrow-datafusion/blob/c9049eda85683d958a39eca5fd3a382943ee7fa6/datafusion/core/src/physical_optimizer/projection_pushdown.rs#L1132

I wonder if those checks need to extended to check the index as well? Maybe we could extract the "find matching projection column" logic as a function 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am working on a new, more comprehensive and less error-prone projection optimizer. I plan to open a PR after our internal review. Just so you know, all these utils will be removed in the new version.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If so, then I don't touch other places like new_columns_for_join_on, but only address the reported issue.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides, for new_columns_for_join_on, because it applies on particular joining side (left or right) with left/right joining keys and left/right projection expressions. It should not have the same issue as update_expr.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto for new_indices_for_join_filter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If another rewrite is imminent, it is all the more important that we thoroughly test for this behavior

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for new_columns_for_join_on and new_indices_for_join_filter, you won't hit a similar issue like this. The reason is #9096 (comment).

Alice 100
Alice 50
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems that after this PR the results matches expected ones, in comment above.
Should we update the comment?

the current query results are incorrect, becuase the query was incorrectly rewritten as:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably we should have both test-cases (just in case):

  • SELECT t1.a, t1.b FROM t1 JOIN t2 ON t1.a = t2.a ORDER BY t1.a, t1.b;
Alice 	50
Alice 	50
Alice 	100
Alice 	100
  • SELECT t1.a, t1.b FROM t1 JOIN t2 ON t1.a = t2.a ORDER BY t1.a, t2.b;
Alice 	50
Alice 	100
Alice 	50
Alice 	100

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, why do we have test-cases with incorrect results, knowing this?
It seems it would be useful to have a set of tests that can fail, but do not block CI, etc (something like TDD, we know it fails and we can make it work)

It will be much easier to find test-cases that don't work right now. The tests won't even have to be changed, except perhaps excluded from the "expectedly fail" category.

сс @alamb

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch. Yea, this existing test now matches correct behaviors after this PR. Actually it is same issue (order by with same column name in joined sides).

Updated the test case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems it would be useful to have a set of tests that can fail, but do not block CI, etc (something like TDD, we know it fails and we can make it work)

I agree this would be useful -- we can do it in the rust tests with #[ignore]. Maybe we need to add something similar to sqllogictests

It will be much easier to find test-cases that don't work right now. The tests won't even have to be changed, except perhaps excluded from the "expectedly fail" category.

One way to do this is to search for issue links https://github.com...

@DDtKey
Copy link
Contributor

DDtKey commented Feb 1, 2024

@viirya huge thanks for such a reactive response to the issue! 🚀

Operator::Modulo,
Arc::new(Column::new("e", 5)),
Arc::new(Column::new("e", 4)),
))),
)?),
];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than changing this test like that,

Suggested change
];
let projected_exprs: Vec<(Arc<dyn PhysicalExpr>, String)> = vec![
(Arc::new(Column::new("a", 3)), "a".to_owned()),
(Arc::new(Column::new("b", 1)), "b_new".to_owned()),
(Arc::new(Column::new("c", 0)), "c".to_owned()),
(Arc::new(Column::new("d", 2)), "d_new".to_owned()),
(Arc::new(Column::new("e", 5)), "e".to_owned()),
(Arc::new(Column::new("f", 4)), "f_new".to_owned()),
];

these changes will be simpler and greater coverage

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, updated.

@viirya viirya merged commit a6ef1be into apache:main Feb 2, 2024
22 checks passed
@viirya
Copy link
Member Author

viirya commented Feb 2, 2024

Thank you @alamb @DDtKey @berkaysynnada

@alamb
Copy link
Contributor

alamb commented Feb 2, 2024

BTW #9111 tracks the work @berkaysynnada is doing for ProjectionPushdown in case anyone has thoughts they would like to share

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regression: incorrect order on a joined table column (without selecting its fields)
4 participants