Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error: There isn't a common type to coerce Binary and Utf8 in LIKE expression #7342

Closed
JayjeetAtGithub opened this issue Aug 21, 2023 · 5 comments · Fixed by #7840
Closed
Labels
bug Something isn't working good first issue Good for newcomers

Comments

@JayjeetAtGithub
Copy link
Contributor

JayjeetAtGithub commented Aug 21, 2023

Describe the bug

On running the 4 queries below on the Clickbench multi file dataset,

SELECT COUNT(*) FROM hits WHERE "URL" LIKE '%google%';

SELECT "SearchPhrase", MIN("URL"), COUNT(*) AS c FROM hits WHERE "URL" LIKE '%google%' AND "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10;

SELECT "SearchPhrase", MIN("URL"), MIN("Title"), COUNT(*) AS c, COUNT(DISTINCT "UserID") FROM hits WHERE "Title" LIKE '%Google%' AND "URL" NOT LIKE '%.google.%' AND "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10;

SELECT * FROM hits WHERE "URL" LIKE '%google%' ORDER BY to_timestamp_seconds("EventTime") LIMIT 10;

we get this error,

type_coercion caused by Error during planning: There isn't a common type to coerce Binary and Utf8 in LIKE expression

To Reproduce

Download the data using,

 ./benchmarks/bench.sh data clickbench_partitioned

A hits_multi directory with the parquet files will be created.

Execute the above queries,

datafusion-cli -c "CREATE EXTERNAL TABLE hits STORED AS PARQUET LOCATION 'hits_multi';" "{query}"

Expected behavior

The queries should run successfully without erroring.

Additional context

Datafusion 29.0.0

@JayjeetAtGithub JayjeetAtGithub added the bug Something isn't working label Aug 21, 2023
@alamb
Copy link
Contributor

alamb commented Aug 21, 2023

This looks similar to #7039 which @jonahgao fixed by adding a coercion from binary --> UTF8 for comparison. I think we could do something similar here.

@alamb alamb added the good first issue Good for newcomers label Aug 21, 2023
@alamb
Copy link
Contributor

alamb commented Aug 21, 2023

Marking as a good first issue as there is a reproducer and I think the fix should be relatively straightforward

@JayjeetAtGithub
Copy link
Contributor Author

Working on a fix

@JayjeetAtGithub
Copy link
Contributor Author

I looked into this issue a little bit. Looks like there needs to be changes in arrow-string which is basically a part of arrow-rs. Specifically, I found out that there is no support for like operation for binary operands in arrow-datafusion. It only exists for utf8 operands. Support for the like operation in utf8 stems from this code in arrow-string. We basically need to add a binary version of this function in arrow-rs. Does that sound correct ?

@alamb
Copy link
Contributor

alamb commented Aug 23, 2023

We basically need to add a binary version of this function in arrow-rs. Does that sound correct ?

I am not quite sure

I think that LIKE only really makes sense on string (UTF-8) data.

Thus, a better approach might be to try and automatically coerce (cast) the BinaryArray to a StringArray (which will verify that the input in the BinaryArray is valid UTF-8) and then call the existing String based LIKE kernel

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good first issue Good for newcomers
Projects
None yet
2 participants