Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support DictionaryString for Regex matching operators #12768

Merged
merged 16 commits into from
Oct 10, 2024

Conversation

blaginin
Copy link
Contributor

@blaginin blaginin commented Oct 4, 2024

Which issue does this PR close?

Closes #12618

Rationale for this change

As explained in the original PR, regex comparison operations don't support dictionaries like ~*.

What changes are included in this PR?

When building a logical query, the query is initially correctly coerced to Utf8 (which is supported in BinaryExpr), but then unwrap_cast_in_comparison removes the column cast.

image

Interestingly, for the Operator::LikeMatch operator, which is similar to regex operators, unwrap_cast_in_comparison doesn't rewrite the query. This happens because of the function currently named is_comparison_operator.

According to its usages, it is only used in interval propagation, and the number of actual arguments supported is actually smaller than the number mentioned in the method.

Moreover, the name is_comparison_operator seems a bit confusing, as it doesn't match the list of comparison operators in the docs. Therefore, I have renamed the method.

Are these changes tested?

Yes, I uncommented tests in datafusion/sqllogictest/test_files/string.

Are there any user-facing changes?

No

@github-actions github-actions bot added physical-expr Physical Expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Oct 4, 2024
@alamb
Copy link
Contributor

alamb commented Oct 5, 2024

Thank you for this contribution @blaginin -- I agree this PR seems to have fixed the issue, but I am a little worried about it.

  1. It is probably slower ( I think this PR effectively leaves the cast to Utf8 there (which would basically unpack the dictionary array))
  2. It seems to change interval analysis and propagation to not support LIKE etc.

FYI @ozankabak I am a bit worried that the changes to the propagation logic did not cause any test failures 🤔 Maybe we need to increase test coverage in that area.

@alamb
Copy link
Contributor

alamb commented Oct 5, 2024

What would you think about explicitly implementing physical operator support for DictionaryArrays? That would mean basically applying the operation on the DictionaryArray::values

That would look something like

let input_array: DictionaryArray<Int32, String> = ...;
let values = input_array.values();
// apply regexp match on the values of the array
let regexp_match_result = regexp_match(values, ..);
// form the output boolean array by looking up the result
let mut bool_builder = BooleanBuilder::new()
for key in input_array.keys() {
  bool_builder.push(regexp_match_result.value(key))
}
bool_builder.build(); // create output boolean arary

@alamb
Copy link
Contributor

alamb commented Oct 5, 2024

You would have to find the relevant place to plumb this in to binary.rs too I think

# Conflicts:
#	datafusion/sqllogictest/test_files/string/large_string.slt
#	datafusion/sqllogictest/test_files/string/string_view.slt
@alamb alamb marked this pull request as draft October 6, 2024 11:31
@alamb
Copy link
Contributor

alamb commented Oct 6, 2024

Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look

@blaginin
Copy link
Contributor Author

blaginin commented Oct 8, 2024

Thanks for the review and detailed feedback, @alamb! 😍

To give a bit of context, when submitting the PR, I was choosing between removing Regex operators from is_comparison_operator and manually unwrapping the dict. I chose the first option because expressions like dict_column ~* 'scalar' are the only outlier. dict_column ~~ 'scalar' and even dict_column ~* dict_column are both cast to UTF from dicts, so I thought consistent behavior was desirable.

However, your performance point is really valid!! Unwrapping the dict with bools directly can be much more efficient than unwrapping strings and then constructing booleans... I think I've made a change; could you take a look?

A few things to consider on top of this PR:

  • Maybe we should stop casting in other cases too (mentioed above)
  • Feels like we might want to return results as a dict, rather than unwrapping it. I tried to make that change in this PR, but it became bigger than the actual fix 😅

@blaginin blaginin marked this pull request as ready for review October 8, 2024 12:23
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @blaginin -- this looks great.

The only thing I think we should do is to deprecate is_comparison_op to ease people's upgrades

/// For example, 'Binary(a, >, b)' would be a comparison expression.
pub fn is_comparison_operator(&self) -> bool {
/// For example, 'Binary(a, >, b)' expression supports propagation.
pub fn supports_propagation(&self) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is technically an API change -- Can you please avoid the change and help people upgrade by adding back is_comparison_operator that calls supports_propagation and mark it deprecated?

#[deprecated(
since = "40.0.0",
note = "please use `statistics_from_parquet_meta_calc` instead"
)]
pub async fn statistics_from_parquet_meta(
metadata: &ParquetMetaData,
table_schema: SchemaRef,
) -> Result<Statistics> {
statistics_from_parquet_meta_calc(metadata, table_schema)
}

@alamb
Copy link
Contributor

alamb commented Oct 9, 2024

(We could add the deprecation as a follow on)

/// propagation
///
/// For example, 'Binary(a, >, b)' expression supports propagation.
#[deprecated(since = "43.0.0", note = "please use `supports_propagation` instead")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@goldmedal goldmedal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @blaginin for working on this. It looks good to me 👍

@alamb alamb merged commit 4534a28 into apache:main Oct 10, 2024
24 checks passed
@alamb
Copy link
Contributor

alamb commented Oct 10, 2024

Thanks @blaginin and @goldmedal for the review

@blaginin blaginin deleted the fix-dict-string-regex-match branch October 10, 2024 22:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
optimizer Optimizer rules physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support DictionaryString for Regex matching operators
3 participants