Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LT-20351: Fix LT-20351 and implement Novel Root Guesser #245

Merged
merged 30 commits into from
Sep 30, 2024
Merged

Conversation

jtmaxwell3
Copy link
Collaborator

@jtmaxwell3 jtmaxwell3 commented Sep 18, 2024

  1. This adds a one line fix for LT-20351 proposed by Andy Black in its comments.
  2. This adds a MaxUnapplications parameter to make debugging very slow words possible.
  3. This implements the Novel Root Guesser requested in LT-21911 by adding lexical patterns using natural classes and some regular operators (e.g. [Seg], ([Seg]), [Seg]*).

Sorry for all of the formatting errors. I don't have a Resharper license and I'm new to C#.


This change is Reviewable

@codecov-commenter
Copy link

codecov-commenter commented Sep 18, 2024

Codecov Report

Attention: Patch coverage is 69.31818% with 81 lines in your changes missing coverage. Please review.

Project coverage is 69.88%. Comparing base (217b6cf) to head (2f824de).

Files with missing lines Patch % Lines
src/SIL.Machine.Morphology.HermitCrab/Morpher.cs 69.27% 48 Missing and 3 partials ⚠️
....Morphology.HermitCrab/CharacterDefinitionTable.cs 62.85% 17 Missing and 9 partials ⚠️
...chine.Morphology.HermitCrab/AnalysisStratumRule.cs 0.00% 1 Missing and 1 partial ⚠️
...hine.Morphology.HermitCrab/HermitCrabExtensions.cs 75.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #245      +/-   ##
==========================================
+ Coverage   69.80%   69.88%   +0.07%     
==========================================
  Files         379      379              
  Lines       31480    31732     +252     
  Branches     4400     4450      +50     
==========================================
+ Hits        21976    22175     +199     
- Misses       8483     8525      +42     
- Partials     1021     1032      +11     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 8 of 9 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @isaac091 and @jtmaxwell3)


src/SIL.Machine/Annotations/Annotation.cs line 142 at r2 (raw file):

        /// </value>

        public bool IsNaturalClass

The Annotation class is intended to be general-purpose. These properties are specific to HC. You can either use the Data property or add these properties to the FeatureStruct. If you add them as features, you will need to update HC's feature system, i.e. HCFeatureSystem.


src/SIL.Machine.Morphology.HermitCrab/CharacterDefinitionTable.cs line 138 at r2 (raw file):

                // Check for pattern language.
                // NB: This only happens when the characters don't match.
                if (normalized[i] == '[')

Did you consider using a regex to parse the pattern language?


src/SIL.Machine.Morphology.HermitCrab/Morpher.cs line 88 at r2 (raw file):

        }

        public IList<RootAllomorph> LexicalPatterns

This should return a read-only type, such as IEnumerable.


src/SIL.Machine.Morphology.HermitCrab/Morpher.cs line 135 at r2 (raw file):

            File.WriteAllLines("analyses.txt", lines.OrderBy(l => l));
#endif
            var origAnalyses = guessRoot ? analyses.ToList() : null;

Please only use var when the type is explicit elsewhere in the line.


src/SIL.Machine.Morphology.HermitCrab/Morpher.cs line 137 at r2 (raw file):

            var origAnalyses = guessRoot ? analyses.ToList() : null;
            var syntheses = Synthesize(word, analyses);
            if (guessRoot && syntheses.Count() == 0)

C# uses lazy evaluation for LINQ functions. If you are going to iterate over an enumerable multiple times (Count() will iterate over the enumerable), you should first convert it to a list or array.


src/SIL.Machine.Morphology.HermitCrab/Morpher.cs line 372 at r2 (raw file):

            IEnumerable<ShapeNode> shapeNodes = input.Shape.GetNodes(input.Range);
            HashSet<string> shapeSet = new HashSet<string>();
            foreach (RootAllomorph lexicalPattern in _lexicalPatterns)

Did you consider implementing this using an FSA or the Pattern class?


src/SIL.Machine.Morphology.HermitCrab/Morpher.cs line 430 at r2 (raw file):

            if (prefix == null)
                prefix = new List<ShapeNode>();
            if (pattern.Count() == p)

You should use the Count property.


src/SIL.Machine.Morphology.HermitCrab/Morpher.cs line 440 at r2 (raw file):

                // Try skipping this item in the pattern.
                results.AddRange(MatchNodesWithPattern(nodes, pattern, n, p + 1, false, prefix));
            if (nodes.Count() == n)

You should use the Count property.


src/SIL.Machine.Morphology.HermitCrab/XmlLanguageLoader.cs line 720 at r2 (raw file):

            _language.NaturalClasses.Add(nc);
            _natClasses[(string)natClassElem.Attribute("id")] = nc;
            foreach (var table in _language.CharacterDefinitionTables)

Please only use var when the type is explicit elsewhere in the line.

@jtmaxwell3
Copy link
Collaborator Author

I didn't use a regular expression to find or match patterns because I wanted this to be backwards compatible with existing projects. The characters '[', ']', '(', ')', and '*' all could be valid characters in the character table. So, I can't tell if a lexeme with [...] in it is a pattern without analyzing the shape. I only recognize the pattern if the lexeme would have an invalid shape otherwise. If I have to analyze the shape anyway, it isn't hard to make the shape a pattern to match against other shapes.

I fixed the other issues.

@jtmaxwell3
Copy link
Collaborator Author

One advantage of matching the shapes rather than using a regular expression is that I unify the feature structures while matching. So, the pattern can add some details to the lexeme being matched (such as accent marks or tones).

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 7 files at r3, 1 of 1 files at r5, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @isaac091 and @jtmaxwell3)


src/SIL.Machine/Annotations/Annotation.cs line 140 at r5 (raw file):

        /// 	<c>true</c> if this annotation is iterative, otherwise <c>false</c>.
        /// </value>
        public bool Iterative

This property is also specific to HC.


src/SIL.Machine.Morphology.HermitCrab/CharacterDefinitionTable.cs line 138 at r5 (raw file):

                // Check for pattern language.
                // NB: This only happens when the characters don't match.
                if (normalized[i] == '[')

This should be restricted so that it only applies to root allomorphs. You could pass in an optional Boolean that enables this parsing code.


src/SIL.Machine.Morphology.HermitCrab/Morpher.cs line 372 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Did you consider implementing this using an FSA or the Pattern class?

Machine already provides multiple ways to implement this type of functionality (FSA or Pattern class). One of these methods should be used. If you don't have time to implement this right now, you should add a TODO comment that indicates that this should be reimplemented in the future using FSAs.


src/SIL.Machine.Morphology.HermitCrab/RootAllomorph.cs line 35 at r5 (raw file):

        /// Does this represent a lexical pattern (e.g. [Seg]*)?
        /// </summary>
        public bool IsPattern

I would move this code to the constructor, so that it is only computed once.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 9 of 9 files at r6, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @isaac091 and @jtmaxwell3)


src/SIL.Machine/Annotations/ShapeNode.cs line 51 at r6 (raw file):

        /// Whether this is an iterative node in a lexical pattern.
        /// </summary>
        public bool Iterative

You should use an extension method. See the HermitCrabExtensions.cs file for examples.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 5 of 5 files at r7, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @isaac091)

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r8, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @isaac091)

@ddaspit ddaspit merged commit 4442ce3 into master Sep 30, 2024
4 checks passed
@ddaspit ddaspit deleted the LT-20351 branch September 30, 2024 19:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants