normalize search string to NFC before comparison #1272

megahirt · 2022-01-07T14:06:37Z

Description

Normalize the search string to NFC, since all data in LF is normalized to NFC on disk. This allows both exact-match and ignore-diacritics queries to work regardless of normalization form or language (e.g. Korean).

A note about this fix:

All data is normalized to NFC in the database on write. It's been this way for years.
@longrunningprocess's addition in "Ignore diacritics by default when searching" (#1243) normalized the query to NFD in order to remove diacritics from both the data and the query. This is a fine approach.
Given the second point above, this PR could have normalized all data to NFD for comparison in all circumstances; however, I chose to stick with NFC since that is the form the underlying data is stored in. Either way works.
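
To illustrate the "either way works" point, here is a minimal sketch, not the code from this PR. The `equalUnder` helper and the sample strings are assumptions made up for the illustration; only the standard `String.prototype.normalize` call is relied on.

```typescript
type Form = 'NFC' | 'NFD';

// Hypothetical helper: compare two strings after bringing both to the same form.
function equalUnder(form: Form, a: string, b: string): boolean {
  return a.normalize(form) === b.normalize(form);
}

const stored = '\u00E9'; // "e with acute" as stored on disk (NFC, one code point)
const typed = 'e\u0301'; // the same letter as a user might enter it (NFD, two code points)

console.log(stored === typed);                 // false: a raw comparison misses
console.log(equalUnder('NFC', stored, typed)); // true
console.log(equalUnder('NFD', stored, typed)); // true
```

Because the stored data is already NFC, normalizing only the query to NFC is enough to make the two sides comparable, which is the cheaper of the two options.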

Fixes #1244

Type of Change

Bug fix (non-breaking change which fixes an issue)

Tests and Test Data

Consider the single Korean character below:

감=감

To the left of the equals sign is the NFC single composed character. To the right is the NFD decomposed form (3 code points). They are canonically equivalent and should display identically where Korean is properly supported. Interestingly, my Windows machine isn't rendering the NFD portion correctly, as seen in the character identifier screenshot below. My web browser displays it just fine, as you see it in this PR description.
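
The two forms can also be inspected in code, which sidesteps any rendering differences between machines. This is a small sketch; the `codePoints` helper is a hypothetical name introduced for the illustration, and the syllable is written with escapes so the source does not depend on how an editor stores it.

```typescript
// Hypothetical helper: list the code points of a string as "U+XXXX" labels.
function codePoints(text: string): string[] {
  return Array.from(text).map(
    (ch) => 'U+' + (ch.codePointAt(0) as number).toString(16).toUpperCase()
  );
}

const composed = '\uAC10';                    // the NFC syllable from the example above
const decomposed = composed.normalize('NFD'); // its canonical decomposition

console.log(codePoints(composed));   // U+AC10
console.log(codePoints(decomposed)); // U+1100, U+1161, U+11B7
console.log(composed === decomposed);                  // false
console.log(composed === decomposed.normalize('NFC')); // true
```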

Test 1 - Query match with "match diacritics"

Steps:

Paste the NFC character 감 (left side) into a LF entry data field
Do a search using the NFC character 감 to verify a match
Do a search using the NFD characters 감 to verify a match (currently this fails on production)

Test 2 - Query match with "ignore diacritics" (default behavior)

Steps:

Paste the NFC character 감 (left side) into a LF entry data field
Do a search using the NFC character 감 to verify a match
Do a search using the NFD characters 감 to verify a match
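
The two manual tests above can be approximated in a script. This is a sketch under stated assumptions, not the actual code in `editor-data.service.ts`: `removeDiacritics` and `fieldMatches` are hypothetical names, and the diacritic-stripping range (U+0300 to U+036F) is an assumption chosen for the illustration.

```typescript
// Hypothetical helper: strip combining diacritics in U+0300-U+036F.
// Decompose first so accents become separate code points, then recompose
// so the result stays comparable with NFC data.
function removeDiacritics(text: string): string {
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '').normalize('NFC');
}

// Hypothetical matcher; `fieldValue` is assumed to be NFC already, as stored.
function fieldMatches(fieldValue: string, query: string, matchDiacritics: boolean): boolean {
  const normalizedQuery = query.normalize('NFC');
  if (matchDiacritics) {
    return fieldValue.includes(normalizedQuery);
  }
  return removeDiacritics(fieldValue).includes(removeDiacritics(normalizedQuery));
}

const field = '\uAC10';                // NFC syllable pasted into the entry
const nfcQuery = '\uAC10';             // same syllable, composed
const nfdQuery = '\u1100\u1161\u11B7'; // same syllable, decomposed

// Test 1: "match diacritics"
console.log(fieldMatches(field, nfcQuery, true));  // true
console.log(fieldMatches(field, nfdQuery, true));  // true
// Test 2: "ignore diacritics"
console.log(fieldMatches(field, nfcQuery, false)); // true
console.log(fieldMatches(field, nfdQuery, false)); // true
```

The Hangul jamo produced by NFD fall outside the stripped range, so in this sketch they survive the diacritic removal and recompose into the original syllable.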

Screencast demo from this branch

Checklist:

I have performed a self-review of my own code
I have reviewed the title/description of this PR which will be used as the squashed PR commit message
I have commented my code, particularly in hard-to-understand areas
The tests above demonstrate my fix is effective or that my feature works

Normalize the search string to NFC since all data in LF is normalized to NFC on disk. This allows exact-match queries to work regardless of form. Attempt to fix a bug where the default behavior of ignoring diacritics would cause missing search results for complex scripts with combining characters that are not diacritics (e.g. Japanese or Korean). Fixes #1244.

longrunningprocess

one little thought to simplify things if possible.

src/angular-app/bellows/core/offline/editor-data.service.ts

Add a comment explaining NFD/NFC conversion to remove diacritics

megahirt marked this pull request as ready for review January 10, 2022 10:19

longrunningprocess approved these changes Jan 10, 2022

src/angular-app/bellows/core/offline/editor-data.service.ts (resolved)

megahirt commented Jan 11, 2022

src/angular-app/bellows/core/offline/editor-data.service.ts (outdated, resolved)

address review comments

fb466f2

Add a comment explaining NFD/NFC conversion to remove diacritics

longrunningprocess approved these changes Jan 11, 2022

megahirt merged commit 453a09a into develop Jan 11, 2022

megahirt deleted the bugfix/nfcSearch branch January 11, 2022 08:58
