Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore diacritics by default when searching #1243

Conversation

longrunningprocess
Copy link
Contributor

@longrunningprocess longrunningprocess commented Nov 9, 2021

Description

This feature will change the default search to include "diacritic-agnostic" results. It will also add a new advanced option allowing users to narrow results based on diacritics.

Fixes #1114

Type of Change

  • New feature (non-breaking change which adds functionality)
  • UI change

Screenshots

Provided with test cases below

How Has This Been Tested?

Project used for testing (5491 entries): Alex's project.zip

Functional testing (should work the same in both list and entry view)

  • Searching for chaca or CHACA should result in 124 matches with no advanced options selected

image

  • Searching for chaca or CHACA should result in 90 matches when "matching diacritics"

image

  • Searching for chacä or CHACÄ should result in 124 matches with no advanced options selected

image

  • Searching for chacä or CHACÄ should result in 9 matches when "matching diacritics"

image

  • Searching for chacå or CHACÅ should result in 124 matches with no advanced options selected

image

  • Searching for chacå or CHACÅ should result in 0 matches when "matching diacritics"

image

  • Checking the "Match diacritics" option should cause the "Options" changed indicator to appear

image

Checklist:

  • I have performed a self-review of my own code
  • I have reviewed the title/description of this PR which will be used as the squashed PR commit message
  • I have commented my code, particularly in hard-to-understand areas
  • I have added tests that prove my fix is effective or that my feature works

@longrunningprocess
Copy link
Contributor Author

longrunningprocess commented Nov 9, 2021

Reviewer can check this for browser support: https://caniuse.com/?search=unicode%20property%20escapes. Please ensure these are ok.

I saw something about another property called InCombiningDiacriticalMarks and I'm not sure yet whether that is more appropriate or not, if reviewer has some thoughts I'd appreciate them.

Copy link
Collaborator

@megahirt megahirt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved with a ⭐ !
I'm impressed with your independent work to take advantage of research resources (C Hubbard) and to design and implement the solution. You also took care to make sure that the UX is consistent across products and consulted Alex as power user to make sure we are getting it right.

This is great work and will be immediately helpful to users when it is shipped.

Your PR description and test scenarios are first class - well done.

// https://stackoverflow.com/a/37511463/10818013


return input.normalize('NFD').replace(/\p{Diacritic}/gu, '')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💥 This is it! Well done, and elegant too. I didn't know about Regex Unicode Property Escapes so my own solution probably would have been multiple lines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great, I'm glad you're pleased with it, thanks for reviewing. Just to confirm, you're good with the browser support and you think \p{Diacritic} is good, no need to further investigate InCombiningDiacriticalMarks?

@megahirt
Copy link
Collaborator

I would like to add an additional test case that demonstrates diacritics keyed in both NFC and NFD form. The test should demonstrate that diacritics are ignored regardless of NFD/NFC input. I snagged some images from the Unicode docs on normalization forms to show some sample characters to search for with their codepoints, where these diacritics are safely ignored in search. How you key in these characters depends on your OS and keyboard software.

image

image

As I write this, I realize that "exact match" might not work in certain edge cases where the input query is not in the same normalized form (NFC/NFD) as the data being searched. The fix is of course to normalize the query and data before comparison. This would be a separate issue to look into, not part of this PR.

billy clark and others added 4 commits November 10, 2021 10:00
Co-authored-by: Christopher Hirt <chris@hirtfamily.net>
Co-authored-by: Christopher Hirt <chris@hirtfamily.net>
Co-authored-by: Christopher Hirt <chris@hirtfamily.net>
@longrunningprocess
Copy link
Contributor Author

longrunningprocess commented Nov 10, 2021

I would like to add an additional test case that demonstrates diacritics keyed in both NFC and NFD form. The test should demonstrate that diacritics are ignored regardless of NFD/NFC input. I snagged some images from the Unicode docs on normalization forms to show some sample characters to search for with their codepoints, where these diacritics are safely ignored in search. How you key in these characters depends on your OS and keyboard software.

image

image

As I write this, I realize that "exact match" might not work in certain edge cases where the input query is not in the same normalized form (NFC/NFD) as the data being searched. The fix is of course to normalize the query and data before comparison. This would be a separate issue to look into, not part of this PR.

ok, thanks for thinking of some additional use cases, I've felt like I was on shaking ground with my limited testing. I'll see if I can figure out if I can doing something in the browser console to test this.

for my ref:

String.fromCharCode(0x1e69)
'ṩ'
String.fromCharCode(0x1e0b, 0x0323)
'ḍ̇'
String.fromCharCode(0x0071, 0x0307, 0x0323)
'q̣̇'

@longrunningprocess
Copy link
Contributor Author

FYI @megahirt a brief NFD/NFC test:

good (except for highlighting) in the default search mode:
image

no good when attempting to match diacritics:
image

@longrunningprocess
Copy link
Contributor Author

I'm going to go ahead and merge this under the assumption you did check the browser support and were ok as well as anticipating InCombiningDiacriticalMarks is probably irrelevant.

@longrunningprocess longrunningprocess merged commit 19a82a4 into develop Nov 10, 2021
@longrunningprocess longrunningprocess deleted the feature/1114-ignore-diacritics-by-default-when-searching branch November 10, 2021 18:36
@megahirt
Copy link
Collaborator

megahirt commented Dec 1, 2021

@longrunningprocess in my testing this, I realized that we didn't address the highlighting that matches a string. I'm pretty sure you are already aware of it, and that's fine. I'm mostly just commenting about it explicitly for future reference.

The highlightMatches() method would need to be rewritten or modified to take into a account the part about ignoring diacritics.

When I search for "chu" in the default mode, only the exact matches for "chu" are selected, even though the results show those that include diacritics, see below.

image
(whole word is selected as an option to reduce hits)

What I would expect to see is that all headwords that contain a form of "chu" are highlighted, including:

  • ćhü
  • ćhu
    etc.

@longrunningprocess
Copy link
Contributor Author

@longrunningprocess in my testing this, I realized that we didn't address the highlighting that matches a string. I'm pretty sure you are already aware of it, and that's fine. I'm mostly just commenting about it explicitly for future reference.

I wasn't even looking at it, I'm glad you put fresh eyes on it. I'll convert your finding to a new issue.

megahirt added a commit that referenced this pull request Jan 11, 2022
Normalize the search string to NFC since all data in LF is normalized to NFC on disk.  This allows for exact match or ignore diacritic queries to work regardless of form or language, e.g. Korean.

A note about this fix:
- All data is normalized to NFC in the database on write.  It's been this way for years.
- @longrunningprocess 's addition in #1243 normalized the query to NFD for the purposes of removing diacritics from the data and query.  This a fine approach.
- This PR could have chosen to normalize all data to NFD for comparison under all circumstances given the second point above, however I chose to stick with NFC since that is what the data is underneath.  Either way works.

Fixes #1244
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat: ignore diacritics as new default search behavior; add "match diacritics" search option for "exact match"
2 participants