Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Binary/LargeBinary --> Utf8/LargeUtf8 in ilike and string functions #7840

Merged
merged 1 commit into from
Oct 18, 2023

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Oct 16, 2023

Which issue does this PR close?

Closes #7342
Closes #7345
Closes #7344

It also closes several draft / PRs:
Closes #7349
Closes #7350
Closes #7365

Rationale for this change

Some of the clickbench queries have this pattern

What changes are included in this PR?

  1. Add a bunch of comments
  2. Add automatic coercion rules from Binary and LargeBinary to the appropriate Utf8 (string) variants for LIKE and ILIKE
  3. Add appropriate binary --> string logic in return type calculation for functions

Are these changes tested?

Yes, new tests

Are there any user-facing changes?

Yes now some queries will not error

@github-actions github-actions bot added logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt) labels Oct 16, 2023
@alamb alamb marked this pull request as draft October 16, 2023 21:28
@alamb alamb changed the title Support automatic Binary/LargeBinary --> Utf8/LargeUtf8 coercion Support Binary/LargeBinary --> Utf8/LargeUtf8 in ilike Oct 17, 2023
@alamb alamb changed the title Support Binary/LargeBinary --> Utf8/LargeUtf8 in ilike Support Binary/LargeBinary --> Utf8/LargeUtf8 in ilike and string functions Oct 17, 2023
@alamb alamb marked this pull request as ready for review October 17, 2023 11:29
DataType::LargeUtf8 => $largeUtf8Type,
// LargeBinary inputs are automatically coerced to Utf8
DataType::LargeBinary => $largeUtf8Type,
DataType::Utf8 => $utf8Type,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the fix for string functions taking binary arguments

The new cases for Binary/LargeBinary are all that is new here -- the rest is just comments

/// Coercion rules for binary (Binary/LargeBinary) to string (Utf8/LargeUtf8):
/// If one argument is binary and the other is a string then coerce to string
/// (e.g. for `like`)
fn binary_to_string_coercion(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the new rule, only used for ilke

I originally tried modifying string_coercion to also include Binary --> String coercion, however, that has the (bad) side effect of making a binary = string comparison into cast(binary)= string which is bad because:

  1. The comparison is valid even if the binary column contains non UTF8 data (but the cast will fail in this case)
  2. The cast from binary --> utf8 is not free (it has to check each value for utf8 correctness)

@@ -681,6 +708,7 @@ fn binary_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType>
/// This is a union of string coercion rules and dictionary coercion rules
pub fn like_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> {
string_coercion(lhs_type, rhs_type)
.or_else(|| binary_to_string_coercion(lhs_type, rhs_type))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the fix for like

let (lhs, rhs) = get_input_types(&$A_TYPE, &$OP, &$B_TYPE)?;
assert_eq!(lhs, $C_TYPE);
assert_eq!(rhs, $C_TYPE);
($LHS_TYPE:expr, $RHS_TYPE:expr, $OP:expr, $RESULT_TYPE:expr) => {{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same logic, but I renamed the arguments for clarity


query error DataFusion error: type_coercion
SELECT largebinary FROM t where largebinary LIKE '%F';
query ?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They work 🎉

@alamb
Copy link
Contributor Author

alamb commented Oct 17, 2023

@jonahgao I wonder if you have some time to review this PR

cc @Weijun-H and @JayjeetAtGithub and @parkma99

Copy link
Member

@jackwener jackwener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I review this PR carefully, a great job!
Thanks @alamb .

@jonahgao
Copy link
Member

@alamb LGTM 👍.

I also tested some other string functions on this branch, and they worked as expected.
Although some outputs may look strange, it’s just a display issue.

DataFusion CLI v32.0.0
❯ select lower(X'0000102023');
+-------------------------------+
| lower(Binary("0,0,16,32,35")) |
+-------------------------------+
|  #                            |
+-------------------------------+
1 row in set. Query took 0.004 seconds.

Copy link
Member

@Weijun-H Weijun-H left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks @alamb 👍

@alamb
Copy link
Contributor Author

alamb commented Oct 18, 2023

Thanks everyone for the reviews!

@alamb alamb merged commit 7772453 into apache:main Oct 18, 2023
23 of 24 checks passed
make_utf8_to_return_type!(utf8_to_str_type, DataType::LargeUtf8, DataType::Utf8);

// `utf8_to_str_type`: returns either a Int32 or Int64 based on the input type size.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utf8_to_str_type -> utf8_to_int_type

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice find -- filed #7861

@@ -648,8 +652,9 @@ fn string_concat_internal_coercion(
}
}

/// Coercion rules for Strings: the type that both lhs and rhs can be
/// casted to for the purpose of a string computation
/// Coercion rules for string types (Utf8/LargeUtf8): If at least on argument is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at least on argument -> at least one argument

}
}

/// Coercion rules for binary types (Binary/LargeBinary): If at least on argument is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at least on argument -> at least one argument

///
/// For example, if a function `func` accepts arguments of `(int64, int64)`,
/// but was called with `(int32, int64)`, this function could match the
/// valid_types by by coercing the first argument to `int64`, and would return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

valid_types by by coercing -> valid_types by coercing

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found a few typos.

@alamb
Copy link
Contributor Author

alamb commented Oct 19, 2023

Thank you @viirya -- I will fix the typos

@alamb alamb deleted the alamb/ilike_coercion branch October 19, 2023 15:52
@alamb
Copy link
Contributor Author

alamb commented Oct 19, 2023

Found a few typos.

PR to fix: #7873

@andygrove andygrove added the enhancement New feature or request label Nov 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
6 participants