Ignore diacritics by default when searching #1243

longrunningprocess · 2021-11-09T19:21:33Z

Description

This feature will change the default search to include "diacritic-agnostic" results. It will also add a new advanced option allowing users to narrow results based on diacritics.

Fixes #1114

Type of Change

New feature (non-breaking change which adds functionality)
UI change

Screenshots

Provided with test cases below

How Has This Been Tested?

Project used for testing (5491 entries): Alex's project.zip

Functional testing (should work the same in both list and entry view)

Searching for chaca or CHACA should result in 124 matches with no advanced options selected

Searching for chaca or CHACA should result in 90 matches when "matching diacritics"

Searching for chacä or CHACÄ should result in 124 matches with no advanced options selected

Searching for chacä or CHACÄ should result in 9 matches when "matching diacritics"

Searching for chacå or CHACÅ should result in 124 matches with no advanced options selected

Searching for chacå or CHACÅ should result in 0 matches when "matching diacritics"

Checking the "Match diacritics" option should cause the "Options" changed indicator to appear

see Add "whole word" search option #1229 for more regression tests
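
For reference, the two search modes exercised by the cases above can be sketched roughly as follows. This is a minimal illustration, not the application's actual code: `stripDiacritics` and `entryMatches` are hypothetical names, the case folding via `toLowerCase()` is an assumption drawn from the chaca/CHACA cases, and only the `normalize('NFD').replace(...)` expression comes from the PR diff.

```typescript
// Hypothetical helper; the body mirrors the expression added in this PR.
function stripDiacritics(input: string): string {
  return input.normalize('NFD').replace(/\p{Diacritic}/gu, '');
}

// Hypothetical matcher: case-insensitive substring search that either
// keeps diacritics (matchDiacritics = true) or strips them (default mode).
function entryMatches(text: string, query: string, matchDiacritics: boolean): boolean {
  const fold = (s: string): string => {
    const lowered = s.toLowerCase();
    return matchDiacritics ? lowered : stripDiacritics(lowered);
  };
  return fold(text).includes(fold(query));
}

// Made-up sample headwords; \u00e4 is a precomposed "a with diaeresis".
const headwords = ['chaca', 'chac\u00e4', 'CHACA'];

// Default mode: the diacritic in the query is ignored, so every headword matches.
const defaultHits = headwords.filter((w) => entryMatches(w, 'chac\u00e4', false));

// "Match diacritics" mode: only the headword that carries the diaeresis matches.
const strictHits = headwords.filter((w) => entryMatches(w, 'CHAC\u00c4', true));
```

In this sketch the strict mode compares strings as given, so it still depends on the query and the data being in the same normalization form.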

Checklist:

I have performed a self-review of my own code
I have reviewed the title/description of this PR which will be used as the squashed PR commit message
I have commented my code, particularly in hard-to-understand areas
I have added tests that prove my fix is effective or that my feature works

longrunningprocess · 2021-11-09T22:42:54Z

Reviewer can check this for browser support: https://caniuse.com/?search=unicode%20property%20escapes. Please ensure these are ok.

I saw something about another property called InCombiningDiacriticalMarks, and I'm not sure yet whether it is more appropriate. If the reviewer has thoughts on this, I'd appreciate them.
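
One way to look at this question in a browser console is sketched below. It assumes that `InCombiningDiacriticalMarks` refers to the Unicode block U+0300 to U+036F; JavaScript regular expressions do not accept block names in `\p{...}`, so the block is written as an explicit range here. The names `byProperty` and `byBlock` are hypothetical.

```typescript
// Strip marks using the Unicode binary property, as in this PR.
const byProperty = (s: string): string =>
  s.normalize('NFD').replace(/\p{Diacritic}/gu, '');

// Strip marks using the Combining Diacritical Marks block as an explicit range.
const byBlock = (s: string): string =>
  s.normalize('NFD').replace(/[\u0300-\u036f]/g, '');

// Samples taken from this thread: a-diaeresis, a-ring, U+1E69, and q + two dots.
const samples = ['chac\u00e4', 'chac\u00e5', '\u1e69', 'q\u0307\u0323'];

// For each sample, record whether the two approaches produce the same string.
const agreement = samples.map((s) => byProperty(s) === byBlock(s));
console.log(agreement);
```

The two approaches agree on these samples. They are not the same set in general, though: the property is defined across scripts, while the block is one fixed range of code points, so data in other writing systems may behave differently.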

megahirt

Approved with a ⭐ !
I'm impressed with your independent work to take advantage of research resources (C Hubbard) and to design and implement the solution. You also took care to make sure that the UX is consistent across products, and consulted Alex as a power user to make sure we are getting it right.

This is great work and will be immediately helpful to users when it is shipped.

Your PR description and test scenarios are first class - well done.

megahirt · 2021-11-10T03:21:22Z

src/angular-app/bellows/core/offline/editor-data.service.ts

+    // https://stackoverflow.com/a/37511463/10818013
+
+
+    return input.normalize('NFD').replace(/\p{Diacritic}/gu, '')


💥 This is it! Well done, and elegant too. I didn't know about Regex Unicode Property Escapes so my own solution probably would have been multiple lines.

great, I'm glad you're pleased with it, thanks for reviewing. Just to confirm, you're good with the browser support and you think \p{Diacritic} is good, no need to further investigate InCombiningDiacriticalMarks?

src/angular-app/bellows/core/offline/editor-data.service.ts

megahirt · 2021-11-10T05:00:38Z

I would like to add an additional test case that demonstrates diacritics keyed in both NFC and NFD form. The test should demonstrate that diacritics are ignored regardless of NFD/NFC input. I snagged some images from the Unicode docs on normalization forms to show some sample characters to search for with their codepoints, where these diacritics are safely ignored in search. How you key in these characters depends on your OS and keyboard software.

As I write this, I realize that "exact match" might not work in certain edge cases where the input query is not in the same normalized form (NFC/NFD) as the data being searched. The fix is of course to normalize the query and data before comparison. This would be a separate issue to look into, not part of this PR.
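
A minimal sketch of that edge case, assuming NFC as the common form (either form works as long as both sides use the same one). `sameText` is a hypothetical helper, not a function from the codebase.

```typescript
// Hypothetical helper: compare two strings after bringing both to NFC.
function sameText(a: string, b: string): boolean {
  return a.normalize('NFC') === b.normalize('NFC');
}

const composed = '\u1e69';          // one precomposed code point
const decomposed = 's\u0323\u0307'; // "s" + combining dot below + combining dot above

// A raw comparison sees different code point sequences.
const rawEqual = composed === decomposed;

// After normalization the two spellings compare equal.
const normalizedEqual = sameText(composed, decomposed);

console.log(rawEqual, normalizedEqual); // false true
```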

Co-authored-by: Christopher Hirt <chris@hirtfamily.net>

longrunningprocess · 2021-11-10T16:26:35Z

I would like to add an additional test case that demonstrates diacritics keyed in both NFC and NFD form. The test should demonstrate that diacritics are ignored regardless of NFD/NFC input. I snagged some images from the Unicode docs on normalization forms to show some sample characters to search for with their codepoints, where these diacritics are safely ignored in search. How you key in these characters depends on your OS and keyboard software.

As I write this, I realize that "exact match" might not work in certain edge cases where the input query is not in the same normalized form (NFC/NFD) as the data being searched. The fix is of course to normalize the query and data before comparison. This would be a separate issue to look into, not part of this PR.

OK, thanks for thinking of some additional use cases. I've felt like I was on shaky ground with my limited testing. I'll see if I can do something in the browser console to test this.

for my ref:

String.fromCharCode(0x1e69)
'ṩ'
String.fromCharCode(0x1e0b, 0x0323)
'ḍ̇'
String.fromCharCode(0x0071, 0x0307, 0x0323)
'q̣̇'

longrunningprocess · 2021-11-10T18:31:59Z

FYI @megahirt a brief NFD/NFC test:

good (except for highlighting) in the default search mode:

no good when attempting to match diacritics:

longrunningprocess · 2021-11-10T18:35:40Z

I'm going to go ahead and merge this, on the assumption that you checked the browser support and were OK with it, and that InCombiningDiacriticalMarks is probably irrelevant.

megahirt · 2021-12-01T07:25:14Z

@longrunningprocess in my testing this, I realized that we didn't address the highlighting that matches a string. I'm pretty sure you are already aware of it, and that's fine. I'm mostly just commenting about it explicitly for future reference.

The highlightMatches() method would need to be rewritten or modified to take into account the part about ignoring diacritics.

When I search for "chu" in the default mode, only the exact matches for "chu" are selected, even though the results show those that include diacritics, see below.

(whole word is selected as an option to reduce hits)

What I would expect to see is that all headwords that contain a form of "chu" are highlighted, including:

ćhü
ćhu
etc.
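
One possible direction, sketched under stated assumptions: fold the text one code point at a time, remember which original position produced each folded character, search the folded text, and map the hits back. `findFoldedMatches` and `MatchRange` are hypothetical names, and this is not how highlightMatches() is actually implemented. Folding per code point is a simplification that ignores context-sensitive case mapping.

```typescript
function stripDiacritics(input: string): string {
  return input.normalize('NFD').replace(/\p{Diacritic}/gu, '');
}

// Offsets are UTF-16 indices into the original text; end is exclusive.
interface MatchRange { start: number; end: number; }

function findFoldedMatches(text: string, query: string): MatchRange[] {
  const fold = (s: string): string => stripDiacritics(s).toLowerCase();
  const needle = fold(query);
  if (needle.length === 0) return [];

  let folded = '';
  const origin: number[] = []; // origin[i]: index in text that produced folded[i]
  let pos = 0;
  for (const ch of text) {     // iterates by code point
    const f = fold(ch);        // empty for a standalone combining mark
    for (let i = 0; i < f.length; i++) origin.push(pos);
    folded += f;
    pos += ch.length;
  }
  origin.push(text.length);    // sentinel for matches that reach the end

  const ranges: MatchRange[] = [];
  let from = folded.indexOf(needle);
  while (from !== -1) {
    // The end offset is where the next folded character starts, so any
    // combining marks trailing the last matched letter stay in the range.
    ranges.push({ start: origin[from], end: origin[from + needle.length] });
    from = folded.indexOf(needle, from + needle.length);
  }
  return ranges;
}

// "\u0107h\u00fc and chu" contains an accented and a plain form of "chu".
const ranges = findFoldedMatches('\u0107h\u00fc and chu', 'chu');
console.log(ranges); // [ { start: 0, end: 3 }, { start: 8, end: 11 } ]
```

The returned ranges refer to the original string, so a highlighter could wrap `text.slice(start, end)` without altering the displayed diacritics.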

longrunningprocess · 2021-12-01T13:10:11Z

@longrunningprocess in my testing this, I realized that we didn't address the highlighting that matches a string. I'm pretty sure you are already aware of it, and that's fine. I'm mostly just commenting about it explicitly for future reference.

I wasn't even looking at it, I'm glad you put fresh eyes on it. I'll convert your finding to a new issue.

@longrunningprocess

Normalize the search string to NFC, since all data in LF is normalized to NFC on disk. This allows exact-match and ignore-diacritics queries to work regardless of form or language, e.g. Korean.

A note about this fix:
- All data is normalized to NFC in the database on write. It's been this way for years.
- @longrunningprocess's addition in #1243 normalized the query to NFD for the purpose of removing diacritics from the data and query. This is a fine approach.
- This PR could have chosen to normalize all data to NFD for comparison under all circumstances, given the second point above. However, I chose to stick with NFC since that is what the data is underneath. Either way works.

Fixes #1244

billy clark added 3 commits November 8, 2021 13:42

added checkbox and got it looking right on all viewports

928a247

establish awareness of new option

521ac4a

integrate diacritic-agnostic logic

079012e

longrunningprocess requested review from rmunn, megahirt and palmtreefrb November 9, 2021 19:37

made entry view consistent with list view

617c739

megahirt approved these changes Nov 10, 2021


billy clark and others added 4 commits November 10, 2021 10:00

PR feedback, name of function was not accurate

8e248f0

Co-authored-by: Christopher Hirt <chris@hirtfamily.net>

PR feedback, name of function was not accurate

c41a546

Co-authored-by: Christopher Hirt <chris@hirtfamily.net>

PR feedback, name of function was not accurate

a538798

Co-authored-by: Christopher Hirt <chris@hirtfamily.net>

fixed some whitespace

426a328

longrunningprocess merged commit 19a82a4 into develop Nov 10, 2021

longrunningprocess deleted the feature/1114-ignore-diacritics-by-default-when-searching branch November 10, 2021 18:36

longrunningprocess mentioned this pull request Nov 11, 2021

Convert search terms to NFC #1244

Closed

longrunningprocess mentioned this pull request Dec 1, 2021

Diacritic matches not being highlighted properly #1260

Closed

megahirt mentioned this pull request Jan 10, 2022

normalize search string to NFC before comparison #1272

Merged


rmunn mentioned this pull request Mar 1, 2022

feat: Consider using "compatibility decomposed form" (NFKD) when searching #1316

Open
