Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HEX + CONVERT / INSERT with utf8 returns different values with MySQL #32599

Closed
espresso98 opened this issue Feb 24, 2022 · 9 comments
Closed
Assignees
Labels
affects-5.4 This bug affects 5.4.x versions. affects-6.0 affects-6.1 affects-6.2 affects-6.3 severity/major sig/sql-infra SIG: SQL Infra type/bug The issue is confirmed as a bug.

Comments

@espresso98
Copy link
Collaborator

espresso98 commented Feb 24, 2022

Bug Report

1. Minimal reproduce step

select hex(convert(char(2557 using latin1) using utf8));
SELECT HEX(INSERT(_utf8 0xD18FD18E, 2, 1, 0x20));

2. What did you expect to see?

mysql>  select hex(convert(char(2557 using latin1) using utf8));
+--------------------------------------------------+
| hex(convert(char(2557 using latin1) using utf8)) |
+--------------------------------------------------+
| 09C3BD                                           |
+--------------------------------------------------+
1 row in set, 1 warning (0.01 sec)

mysql> SELECT HEX(INSERT(_utf8 0xD18FD18E, 2, 1, 0x20));
+-------------------------------------------+
| HEX(INSERT(_utf8 0xD18FD18E, 2, 1, 0x20)) |
+-------------------------------------------+
| D18F20                                    |
+-------------------------------------------+
1 row in set, 1 warning (0.01 sec)

mysql> select version();
+-----------+
| version() |
+-----------+
| 8.0.28    |
+-----------+
1 row in set (0.03 sec)

3. What did you see instead

tidb> select hex(convert(char(2557 using latin1) using utf8));
+--------------------------------------------------+
| hex(convert(char(2557 using latin1) using utf8)) |
+--------------------------------------------------+
| 093F                                             |
+--------------------------------------------------+
1 row in set (0.01 sec)

tidb> SELECT HEX(INSERT(_utf8 0xD18FD18E, 2, 1, 0x20));
+-------------------------------------------+
| HEX(INSERT(_utf8 0xD18FD18E, 2, 1, 0x20)) |
+-------------------------------------------+
| D120D18E                                  |
+-------------------------------------------+
1 row in set (0.02 sec)

4. What is your TiDB version?

tidb_version(): Release Version: v5.5.0-alpha-210-g11f4ca802
Edition: Community
Git Commit Hash: 11f4ca802083ee38d5972730ba8f9b72395316fb
Git Branch: master
UTC Build Time: 2022-02-24 04:52:01
GoVersion: go1.17.2
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false
@espresso98 espresso98 added the type/bug The issue is confirmed as a bug. label Feb 24, 2022
@espresso98 espresso98 changed the title HEX with CONVERT returns different values HEX(CONVERT(CHAR (N using charset) returns different values Feb 24, 2022
@espresso98 espresso98 changed the title HEX(CONVERT(CHAR (N using charset) returns different values HEX + CONVERT/INSERT returns different values Feb 24, 2022
@espresso98 espresso98 changed the title HEX + CONVERT/INSERT returns different values HEX + CONVERT / INSERT with utf8 returns different values with MySQL Feb 24, 2022
@aytrack
Copy link
Contributor

aytrack commented Feb 24, 2022

for the first case, it's may cause by this as describe in https://docs.pingcap.com/tidb/stable/character-set-and-collation#character-sets-and-collations-supported-by-tidb

TiDB incorrectly treats latin1 as a subset of utf8. This can lead to unexpected behaviors when you store characters that differ between latin1 and utf8 encodings. It is strongly recommended to the utf8mb4 character set. See TiDB #18955 for more details.

but there maybe another bug here

MySQL root@127.0.0.1:(none)> select char(2557 using latin1)
'utf-8' codec can't decode byte 0xfd in position 1: invalid start byte
MySQL root@127.0.0.1:(none)> use test;
Packet sequence number wrong - got 5 expected 1

case 2:
tidb has the same behavior with mysql 8.0.23

mysql> select version();
+-----------+
| version() |
+-----------+
| 8.0.26    |
+-----------+
1 row in set (0.04 sec)

mysql> SELECT HEX(INSERT(_utf8 0xD18FD18E, 2, 1, 0x20));
+-------------------------------------------+
| HEX(INSERT(_utf8 0xD18FD18E, 2, 1, 0x20)) |
+-------------------------------------------+
| D18F20                                    |
+-------------------------------------------+
1 row in set, 1 warning (0.04 sec)


# mysql 8.0.23

MySQL root@172.16.5.103:test> SELECT HEX(INSERT(_utf8 0xD18FD18E, 2, 1, 0x20));
+-------------------------------------------+
| HEX(INSERT(_utf8 0xD18FD18E, 2, 1, 0x20)) |
+-------------------------------------------+
| D120D18E                                  |
+-------------------------------------------+

1 row in set
Time: 0.062s
MySQL root@172.16.5.103:test>
MySQL root@172.16.5.103:test> select version();
+-----------+
| version() |
+-----------+
| 8.0.23    |
+-----------+

@ti-chi-bot ti-chi-bot added may-affects-4.0 This bug maybe affects 4.0.x versions. may-affects-5.0 This bug maybe affects 5.0.x versions. may-affects-5.1 This bug maybe affects 5.1.x versions. may-affects-5.2 This bug maybe affects 5.2.x versions. may-affects-5.3 This bug maybe affects 5.3.x versions. may-affects-5.4 This bug maybe affects 5.4.x versions. labels Feb 24, 2022
@aytrack
Copy link
Contributor

aytrack commented Feb 24, 2022

tidb v5.3.0 has a same behavior with mysql 8.0.26, but since v5.4.0 the behavior is changed.

MySQL root@127.0.0.1:(none)> select tidb_version()\G
***************************[ 1. row ]***************************
tidb_version() | Release Version: v5.3.0
Edition: Community
Git Commit Hash: 4a1b2e9fe5b5afb1068c56de47adb07098d768d6
Git Branch: heads/refs/tags/v5.3.0
UTC Build Time: 2021-11-24 13:31:24
GoVersion: go1.16.5
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false

1 row in set
Time: 0.001s
MySQL root@127.0.0.1:(none)> SELECT HEX(INSERT(_utf8 0xD18FD18E, 2, 1, 0x20));
                         ->
+-------------------------------------------+
| HEX(INSERT(_utf8 0xD18FD18E, 2, 1, 0x20)) |
+-------------------------------------------+
| D18F20                                    |
+-------------------------------------------+

@aytrack aytrack added affects-5.4 This bug affects 5.4.x versions. and removed may-affects-5.4 This bug maybe affects 5.4.x versions. may-affects-4.0 This bug maybe affects 4.0.x versions. may-affects-5.1 This bug maybe affects 5.1.x versions. may-affects-5.2 This bug maybe affects 5.2.x versions. may-affects-5.3 This bug maybe affects 5.3.x versions. may-affects-5.0 This bug maybe affects 5.0.x versions. labels Feb 24, 2022
@espresso98
Copy link
Collaborator Author

espresso98 commented Feb 24, 2022

Thanks for the in-depth investigation. I am working on MySQL v.8.0.28

@solotzg
Copy link
Contributor

solotzg commented Jun 6, 2022

For select hex(convert(char(2557 using latin1) using utf8));

https://nicj.net/mysql-converting-an-incorrect-latin1-column-to-utf8/

latin1 is a single-byte encoding, so each of the 256 characters are just a single byte.

The UTF-8 encoding was designed to be backward-compatible with ASCII documents, for the first 128 characters. For characters above 128, a multi-byte sequence describes the character. According to http://en.wikipedia.org/wiki/UTF-8#Description.

For 2557(0x09FD), whose size is 2 bytes, will be covert to utf8 code 0x09C3BD, whose size is 3 bytes.

2557 to binary 1001 1111 1101, 0xFD is bigger than 127, need to be converted to 2 bytes 11000011 10111101. We got 1001 11000011 10111101 which is 0x9c3bd.

But for tidb >= 5.4, result is 0x093F.


Root cause is #30288

@solotzg
Copy link
Contributor

solotzg commented Jun 17, 2022

For SELECT HEX(INSERT(_utf8 0xD18FD18E, 2, 1, 0x20));

According to https://dev.mysql.com/doc/refman/8.0/en/string-functions.html#function_insert, INSERT(str,pos,len,newstr) use str as input. Convert _utf8 0xD18FD18E to str will be [0xD18F, 0xD18E] in hex.

INSERT(_utf8 0xD18FD18E, 2, 1, 0x20) will replace 0xD18E by 0x20 and get 0xD18F20 finally.

tidb >= 5.4 will get unexpected result.

if types.IsBinaryStr(bf.tp) {
sig = &builtinInsertSig{bf, maxAllowedPacket}
sig.setPbCode(tipb.ScalarFuncSig_Insert)
} else {
sig = &builtinInsertUTF8Sig{bf, maxAllowedPacket}
sig.setPbCode(tipb.ScalarFuncSig_InsertUTF8)
}
return sig, nil
The type is binary rather than utf8.

@zanmato1984
Copy link
Contributor

Seems like some encoding/collation change broke some of our old behavior, thus this issue. Changing to sig/sql-infra.

@zanmato1984 zanmato1984 added sig/sql-infra SIG: SQL Infra and removed sig/execution SIG execution labels Jun 23, 2022
@YangKeao
Copy link
Member

YangKeao commented Oct 12, 2022

For SELECT HEX(INSERT(_utf8 0xD18FD18E, 2, 1, 0x20));

Another interesting case:

MySQL [(none)]> SELECT HEX(INSERT(_utf8 0xD18FD18E, 2, 1, _utf8 0x20));
+-------------------------------------------------+
| HEX(INSERT(_utf8 0xD18FD18E, 2, 1, _utf8 0x20)) |
+-------------------------------------------------+
| D18F20                                          |
+-------------------------------------------------+
1 row in set (1.646 sec)

It's related with the following codes:

if dstCollation == charset.CollationBin || argCollation == charset.CollationBin {
if coercibility > arg.Coercibility() || (coercibility == arg.Coercibility() && argCollation == charset.CollationBin) {
coercibility, dstCharset, dstCollation = arg.Coercibility(), argCharset, argCollation
}
repertoire |= arg.Repertoire()
continue
}

If the third argument is binary, the return collation will be inferred as binary 🤔

@YangKeao YangKeao self-assigned this Oct 12, 2022
@YangKeao
Copy link
Member

YangKeao commented Oct 12, 2022

And the insert chooses to cast charset according to the return type, but not the first argument. This change is brought by #29905 .

@YangKeao
Copy link
Member

well, I remember the behavior is the same with the MySQL version before 8.0.24

The behavior of SELECT HEX(INSERT(_utf8 0xD18FD18E, 2, 1, 0x20)); is compatible with older version of MySQL (so does lpad/rpad/...)

I'll close this issue as it's expected (both problems). If you need further discussion, feel free to reopen this issue 😃 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects-5.4 This bug affects 5.4.x versions. affects-6.0 affects-6.1 affects-6.2 affects-6.3 severity/major sig/sql-infra SIG: SQL Infra type/bug The issue is confirmed as a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants