Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

normalize search string to NFC before comparison #1272

Merged
merged 2 commits into from
Jan 11, 2022
Merged

Conversation

megahirt
Copy link
Collaborator

@megahirt megahirt commented Jan 7, 2022

Description

Normalize the search string to NFC since all data in LF is normalized to NFC on disk. This allows for exact match or ignore diacritic queries to work regardless of form or language, e.g. Korean.

A note about this fix:

  • All data is normalized to NFC in the database on write. It's been this way for years.
  • @longrunningprocess 's addition in Ignore diacritics by default when searching #1243 normalized the query to NFD for the purposes of removing diacritics from the data and query. This a fine approach.
  • This PR could have chosen to normalize all data to NFD for comparison under all circumstances given the second point above, however I chose to stick with NFC since that is what the data is underneath. Either way works.

Fixes #1244

Type of Change

  • Bug fix (non-breaking change which fixes an issue)

Tests and Test Data

Consider the single Korean character below:

감=감

To the left of the equals sign is the NFC single composed character. To the right is the NFD decomposed form (3 code points). They are canonically equivalent and should display identically where Korean is properly supported. Interestingly, my Windows machine isn't rendering the NFD portion correctly, as seen in the character identifier screenshot below. My web browser displays it just fine, as you see it in this PR description.
image

Test 1 - Query match with "match diacritics"

Steps:

  1. Paste the NFC character 감 (left side) into a LF entry data field
  2. Do a search using the NFC character 감 to verify a match
  3. Do a search using the NFD characters 감 to verify a match (currently this fails on production)

Test 2 - Query match with "ignore diacritics" (default behavior)

Steps:

  1. Paste the NFC character 감 (left side) into a LF entry data field
  2. Do a search using the NFC character 감 to verify a match
  3. Do a search using the NFD characters 감 to verify a match

Screencast demo from this branch

Animation2

Checklist:

  • I have performed a self-review of my own code
  • I have reviewed the title/description of this PR which will be used as the squashed PR commit message
  • I have commented my code, particularly in hard-to-understand areas
  • The tests above demonstrate my fix is effective or that my feature works

Normalize the search string to NFC since all data in LF is normalized to NFC on disk.  This allows for exact match queries to work regardless of form.

Attempt to fix a bug where the default behavior of ignoring diacritics would cause missing search results for complex scripts with combining characters that are not diacritics (e.g. Japanese or Korean)

fixes #1244
@megahirt megahirt marked this pull request as ready for review January 10, 2022 10:19
Copy link
Contributor

@longrunningprocess longrunningprocess left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one little thought to simplify things if possible.

Add a comment explaining NFD/NFC conversion to remove diacritics
@megahirt megahirt merged commit 453a09a into develop Jan 11, 2022
@megahirt megahirt deleted the bugfix/nfcSearch branch January 11, 2022 08:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Convert search terms to NFC
2 participants