Weird discrepancy with ICU4X #105

Open
aumetra opened this issue Oct 9, 2024 · 3 comments
@aumetra
aumetra commented Oct 9, 2024

To set the scene: I have a proptest set up that compares two libraries. One of them uses unicode-normalization under the hood, the other icu_normalizer.

I expected both to produce the same output, but my CI eventually failed on the weird string "\u{11366}\u{113ce}".
Putting it through each library's NFC normalizer gives two different outputs:

  1. unicode-normalization: "\u{113ce}\u{11366}"
  2. icu_normalizer: "\u{11366}\u{113ce}"

Just a fun little thing I thought I'd report since it's technically a correctness issue (I'm just not good enough with Unicode to determine whether it's an issue with ICU4X or this crate).
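For anyone wanting to report a similar mismatch, here is a minimal, std-only sketch of the comparison step. The names `escape_code_points` and `diff_normalizers` are hypothetical helpers, not part of either crate, and the two closures are stand-ins: in a real setup they would wrap each library's NFC entry point. Printing the code points as `\u{...}` escapes keeps combining marks readable in CI logs.

```rust
/// Render a string as `\u{...}` escapes so that combining marks,
/// which are often invisible in logs, can be read and compared.
fn escape_code_points(s: &str) -> String {
    s.chars().map(|c| format!("\\u{{{:x}}}", c as u32)).collect()
}

/// Run two normalizers on the same input and describe the mismatch, if any.
/// `left` and `right` are hypothetical stand-ins for the NFC functions
/// of the two libraries under test.
fn diff_normalizers<L, R>(input: &str, left: L, right: R) -> Option<String>
where
    L: Fn(&str) -> String,
    R: Fn(&str) -> String,
{
    let (a, b) = (left(input), right(input));
    if a == b {
        // Both implementations agree on this input.
        None
    } else {
        Some(format!(
            "input: {} left: {} right: {}",
            escape_code_points(input),
            escape_code_points(&a),
            escape_code_points(&b)
        ))
    }
}
```

Inside a property test, a `Some(..)` result would be turned into a test failure carrying the message, which gives a report in the same shape as the one above.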

@aumetra
Author

aumetra commented Oct 9, 2024

One more. I don't know why proptest suddenly finds so many:

Original: "\u{113c2}\u{113b8}"
unicode-normalization: "\u{113c7}"
icu_normalizer: "\u{113c2}\u{113b8}"

@Manishearth
Member

This crate hasn't been updated to Unicode 16.0 yet. Doing so is not super straightforward this time due to some of the newer characters having interesting combinations of properties.

@aumetra
Author

aumetra commented Oct 9, 2024

Ah, interesting. Thanks for the info! Good to know this is due to a new standard revision and not a bug.
