Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ICU-22112 word break updates for @,colon; colon tailorings for fi,sv #2159

Conversation

pedberg-icu
Copy link
Contributor

Checklist
  • Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22112
  • Required: The PR title must be prefixed with a JIRA Issue number.
  • Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • Required: Each commit message must be prefixed with a JIRA Issue number.
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

Word break updates:

@pedberg-icu
Copy link
Contributor Author

/azp run CI-Exhaustive

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

richgillam
richgillam previously approved these changes Aug 23, 2022
macchiati
macchiati previously approved these changes Aug 23, 2022
@pedberg-icu
Copy link
Contributor Author

pedberg-icu commented Aug 23, 2022

Hmm, failures on MSVC and Cygwin for C, likely the non-ASCII in the MidLetter set, will change to \u notation (not necessary for J but changing there too for consistency)

@pedberg-icu pedberg-icu dismissed stale reviews from macchiati and richgillam via 5808ced August 23, 2022 15:02
@pedberg-icu pedberg-icu force-pushed the ICU-22112-word-break-updates-for-@-colon-tailor-for-fi-sv branch from 5808ced to 1679750 Compare August 23, 2022 15:03
@jira-pull-request-webhook
Copy link

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@pedberg-icu
Copy link
Contributor Author

/azp run CI-Exhaustive

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@pedberg-icu
Copy link
Contributor Author

Hmm, failures on MSVC and Cygwin for C, likely the non-ASCII in the MidLetter set, will change to \u notation (not necessary for J but changing there too for consistency)

And that did fix the MSVC etc. failures.

@pedberg-icu
Copy link
Contributor Author

pedberg-icu commented Aug 23, 2022

@macchiati @richgillam Thanks for prior approvals, I had to fix CI build failures on MSVC due to the non-ASCII colons in the MidLetter set removals, changed to using \uNNNN notation. Please re-approve when you have a chance... Actually wait, there is now a merge conflict I need to resolve.

richgillam
richgillam previously approved these changes Aug 23, 2022
@pedberg-icu pedberg-icu force-pushed the ICU-22112-word-break-updates-for-@-colon-tailor-for-fi-sv branch from a2e8712 to b8c6de9 Compare August 23, 2022 17:56
@jira-pull-request-webhook
Copy link

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@pedberg-icu
Copy link
Contributor Author

/azp run CI-Exhaustive

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@pedberg-icu
Copy link
Contributor Author

@macchiati @richgillam Thanks for prior approvals, I had to fix CI build failures on MSVC due to the non-ASCII colons in the MidLetter set removals, changed to using \uNNNN notation, then I had to resolve a merge conflict (just rebuilding the jars). Please re-approve when you have a chance... (now ready)

@pedberg-icu pedberg-icu merged commit 49d192f into unicode-org:main Aug 23, 2022
@pedberg-icu pedberg-icu deleted the ICU-22112-word-break-updates-for-@-colon-tailor-for-fi-sv branch August 23, 2022 19:45
@FrankYFTang
Copy link
Contributor

FrankYFTang commented Feb 7, 2023

we now got complain from chrome users on this https://bugs.chromium.org/p/chromium/issues/detail?id=1410331

@pedberg-icu
Copy link
Contributor Author

we now got complain from chrome users on this https://bugs.chromium.org/p/chromium/issues/detail?id=1410331

@FrankYFTang That is interesting, since the change was initially promoted by Google, see https://unicode-org.atlassian.net/browse/CLDR-15767

@FrankYFTang
Copy link
Contributor

initially promoted by Google

hum? who from Google lobby for that?

@pedberg-icu
Copy link
Contributor Author

hum? who from Google lobby for that?

@FrankYFTang Mark and Markus as I recall. This was not something that came from me, or from Apple.

@markmentovai
Copy link

#2159 (comment):

we now got complain from chrome users on this https://bugs.chromium.org/p/chromium/issues/detail?id=1410331

This is covered by Chromium bug 1410331.

The problem in Chromium was that we already had a local patch to break at '.', so in Chromium, ICU 72’s new behavior of not breaking at '@' didn’t have the intended effect of not breaking e-mail addresses. Instead, it made e-mail addresses break incredibly awkwardly.

Previously in Chromium, "user.name@example.com" would break into {"user", ".", "name", "@", "example", ".", "com"}. Not breaking at '@' meant that this address began breaking into {"user", ".", "name@example", ".", "com"}. This was unexpected and undesirable.

Given that the problematic breaking in Chromium was triggered by the interaction of ICU’s new don’t-break-at-'@' behavior and Chromium’s local break-at-'.' behavior, I don’t think that there’s any action necessary in ICU, with one exception:

word_POSIX.txt appears to have the same behavior as Chromium of breaking at '.' (lines 44, 46), in addition to the new ICU behavior of not breaking at '@' (line 41), so it’s possible that it’s breaking e-mail addresses in the same undesirable way that we saw in Chromium. More attention may be needed to word_POSIX.txt’s handling, even in upstream ICU.

Chromium’s break-at-'.' behavior was a change made to restore previous ICU behavior when upgrading from 3.8 to 4.2.1 (https://codereview.chromium.org/202011, https://crbug.com/21142, https://src.chromium.org/viewvc/chrome?revision=25651&view=revision, 2009-09-08). It’s unclear whether Chromium still needs to break at '.', and if we’re able to stop doing so, we can likely also start breaking at '@'.

pedberg-icu added a commit to pedberg-icu/icu that referenced this pull request May 5, 2023
pedberg-icu added a commit that referenced this pull request May 7, 2023
FrankYFTang pushed a commit to FrankYFTang/icu that referenced this pull request May 11, 2023
pedberg-icu added a commit to pedberg-icu/icu that referenced this pull request May 12, 2023
…tter for wordbreak, update tests

(cherry picked from commit 5618203)
pedberg-icu added a commit that referenced this pull request May 12, 2023
…rdbreak, update tests

(cherry picked from commit 5618203)
catamorphism pushed a commit to catamorphism/icu that referenced this pull request Nov 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants