Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for non-verse text segments in Scripture corpora #179

Merged
merged 2 commits into from
Apr 11, 2024

Conversation

ddaspit
Copy link
Contributor

@ddaspit ddaspit commented Apr 2, 2024

  • add new ScriptureRef corpus ref class
  • update Scripture corpora classes to use ScriptureRef
  • add ScriptureRefUsfmParserHandlerBase class to track ScriptureRef in USFM
  • update UsfmTextUpdater and UsfmTextBase to use ScriptureRefUsfmParserHandlerBase
  • add support for updating non-Scripture paragraphs and notes
  • update NmtPreprocessBuildJob to support non-Scripture segments

This change is Reviewable

@codecov-commenter
Copy link

codecov-commenter commented Apr 2, 2024

Codecov Report

Attention: Patch coverage is 88.82083% with 73 lines in your changes are missing coverage. Please review.

Project coverage is 67.04%. Comparing base (fa65835) to head (a894ab9).

Files Patch % Lines
src/SIL.Machine/Corpora/ScriptureRef.cs 70.73% 22 Missing and 2 partials ⚠️
src/SIL.Machine/Corpora/ScriptureElement.cs 61.11% 13 Missing and 1 partial ⚠️
src/SIL.Machine/Corpora/UsfmTextUpdater.cs 91.74% 5 Missing and 4 partials ⚠️
src/SIL.Machine/Corpora/ScriptureText.cs 82.60% 7 Missing and 1 partial ⚠️
...chine/Corpora/ScriptureRefUsfmParserHandlerBase.cs 95.27% 7 Missing ⚠️
src/SIL.Machine/Corpora/CorporaExtensions.cs 50.00% 6 Missing ⚠️
src/SIL.Machine/Corpora/ParallelTextRow.cs 0.00% 1 Missing and 1 partial ⚠️
...chine.AspNetCore/Services/NmtPreprocessBuildJob.cs 98.30% 0 Missing and 1 partial ⚠️
src/SIL.Machine/Corpora/UsfmTextBase.cs 98.61% 1 Missing ⚠️
.../SIL.Machine/Corpora/UsxFileAlignmentCollection.cs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #179      +/-   ##
==========================================
+ Coverage   66.72%   67.04%   +0.31%     
==========================================
  Files         438      441       +3     
  Lines       34396    34831     +435     
  Branches     4613     4670      +57     
==========================================
+ Hits        22951    23351     +400     
- Misses      10355    10387      +32     
- Partials     1090     1093       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@johnml1135 johnml1135 self-requested a review April 3, 2024 12:10
Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Big update! Pretty cool to have this capability - thank you for your hard work ❤️ . Not much I feel equipped to comment on - the rest is all on you, John 😆 .

Reviewed 34 of 34 files at r1, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/SIL.Machine/Corpora/ScriptureRef.cs line 42 at r1 (raw file):

        public VerseRef VerseRef { get; }
        public IReadOnlyList<ScriptureElement> Path { get; }

Why is this called Path given it's a list of scripture elements associated with the ref?


src/SIL.Machine/Corpora/TextCorpusBase.cs line 10 at r1 (raw file):

        public abstract IEnumerable<IText> Texts { get; }
        public abstract bool IsTokenized { get; }
        public abstract ScrVers Versification { get; }

I'm sure you've thought this through better than I can, but is there no more semantically appropriate alternative to having all the corpora have versifications (even if they're just null)? Couldn't you have parallel scripture/non-scripture inheritance structures? Is what you've done preferable because there's no easy way to reuse code across parallel paths like that?


src/SIL.Machine/Corpora/UsfmTextBase.cs line 182 at r1 (raw file):

                else if (
                    (CurrentTextType == ScriptureTextType.Verse && state.IsVerseText)
                    || CurrentTextType != ScriptureTextType.Verse

Is there a reason you can't/shouldn't simplify?

Code snippet:

CurrentTextType != ScriptureTextType.Verse || state.IsVerseText

src/SIL.Machine/Corpora/UsfmTextBase.cs line 218 at r1 (raw file):

                    && (
                        (CurrentTextType == ScriptureTextType.Verse && state.IsVerseText)
                        || CurrentTextType != ScriptureTextType.Verse

Same as note above about simplification.

Copy link
Contributor Author

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @Enkidu93 and @johnml1135)


src/SIL.Machine/Corpora/ScriptureRef.cs line 42 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Why is this called Path given it's a list of scripture elements associated with the ref?

I agree. It is not a great name, but I wasn't sure what else to call it. I called it Path, because it captures the path to the particular segment in the USFM hierarchy.


src/SIL.Machine/Corpora/TextCorpusBase.cs line 10 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

I'm sure you've thought this through better than I can, but is there no more semantically appropriate alternative to having all the corpora have versifications (even if they're just null)? Couldn't you have parallel scripture/non-scripture inheritance structures? Is what you've done preferable because there's no easy way to reuse code across parallel paths like that?

I didn't love this change, but I couldn't come up with a better way. I need some way to easily tell if a corpus is a Scripture corpus and uses ScriptureRefs. Machine supports operations on corpora, which are implemented through decorator classes. These decorator classes are generic and can be applied to any type of corpus. This obscures the original type of the corpus, so we can't reliably use is ScriptureTextCorpus or anything like that. The other way to determine if it is a Scripture corpus was to actually iterate through the rows, grab the ref of the first row, and check if it is a ScriptureRef, but that is really hacky. I think this is the best option.


src/SIL.Machine/Corpora/UsfmTextBase.cs line 182 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Is there a reason you can't/shouldn't simplify?

There isn't really any reason I can't simplify to the suggested conditional. I was just trying to make the intent clearer. I will go ahead and simplify it.

Copy link
Contributor Author

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 33 of 34 files reviewed, 4 unresolved discussions (waiting on @Enkidu93 and @johnml1135)


src/SIL.Machine/Corpora/UsfmTextBase.cs line 182 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

There isn't really any reason I can't simplify to the suggested conditional. I was just trying to make the intent clearer. I will go ahead and simplify it.

Done


src/SIL.Machine/Corpora/UsfmTextBase.cs line 218 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Same as note above about simplification.

Done.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/SIL.Machine/Corpora/ScriptureRef.cs line 42 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I agree. It is not a great name, but I wasn't sure what else to call it. I called it Path, because it captures the path to the particular segment in the USFM hierarchy.

I see. OK. Nothing better comes to mind. Maybe John will have a thought.


src/SIL.Machine/Corpora/TextCorpusBase.cs line 10 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I didn't love this change, but I couldn't come up with a better way. I need some way to easily tell if a corpus is a Scripture corpus and uses ScriptureRefs. Machine supports operations on corpora, which are implemented through decorator classes. These decorator classes are generic and can be applied to any type of corpus. This obscures the original type of the corpus, so we can't reliably use is ScriptureTextCorpus or anything like that. The other way to determine if it is a Scripture corpus was to actually iterate through the rows, grab the ref of the first row, and check if it is a ScriptureRef, but that is really hacky. I think this is the best option.

That makes sense. This might be even more hacky, but is there some way you could make a helper method that uses reflection to see if the corpus has a versification property? I suppose this isn't really practically so different than having the versification where inapplicable set to null, but could it keep Versification from floating around in classes where it doesn't belong?

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ScriptureRef.cs line 29 at r2 (raw file):

                    path.Add(new ScriptureElement(0, elem[0]));
                else
                    path.Add(new ScriptureElement(int.Parse(elem[0], CultureInfo.InvariantCulture), elem[1]));

This looks fragile. There should be a spec that this is designed to. Either it is general, or there is a specific set of labels that should work in a very specific format. I notice that something like ABC:123 would break it, in that it would be split into 2, and ABC would not be able to parse into a number.

@johnml1135
Copy link
Collaborator

Notes from the code coverage:

  • Overlaps, comparisons and hash codes untested for ScriptureReference and ScriptureElement
  • CreateRow in ScriptureText untested
  • Ref in UsfmTextUpdater untested

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ScriptureRef.cs line 29 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

This looks fragile. There should be a spec that this is designed to. Either it is general, or there is a specific set of labels that should work in a very specific format. I notice that something like ABC:123 would break it, in that it would be split into 2, and ABC would not be able to parse into a number.

We should make a wiki page explaining how the parsing works - or an extended comment at the beginning of this file.

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 16 at r2 (raw file):

    public abstract class ScriptureRefUsfmParserHandlerBase : UsfmParserHandlerBase
    {
        private VerseRef _verseRef;

This should be _curVerseRef - as it can change.

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 16 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

This should be _curVerseRef - as it can change.

And _curElements

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 16 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

And _curElements

Hmm. I'm looking at it more. I am just wondering why you use _curTextType, but not _cur for the other ones...

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 230 at r2 (raw file):

                _verseRef.HasMultiple ? _verseRef.AllVerses().Last() : _verseRef,
                _elements.Where(e => e.Position > 0).Reverse()
            );

Genius.

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 205 at r2 (raw file):

            ScriptureElement prevElem = _elements.Pop();
            _elements.Push(new ScriptureElement(prevElem.Position + 1, marker));
        }

I think I got it. The reference would be:

  • Verse reference: MAT 3:12-22
  • Elements: Section 4, Paragraph 3, Cell 18.
  • Elements as they are in here: Section 4, Section 0, Paragraph 3, Paragraph 0, Cell 18 (0's get removed)
  • "Parent elements" can contain others - element "0" symbolically

There should be an explanation of this (at least some examples) - it took me a while to figure out.

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 230 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Genius.

So you create these for Notes (all types), Paragraphs (all types) and Cells (no other types of characters). Is this the intention (it seems reasonable from little that I see)? This needs to be documented in Swagger.

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 35 at r2 (raw file):

            {
                EndVerseText(state, CreateVerseRefs());
                _curTextType.Pop();

Why is this popped here? It would not be used anymore. I would assume either it would be cleared or left alone...

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 69 at r2 (raw file):

                    EndVerseText(state, CreateVerseRefs());
                    _curTextType.Pop();
                }

So, for duplicated verses, just take the first one and then ignore all duplicates? Needs documentation...

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 69 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

So, for duplicated verses, just take the first one and then ignore all duplicates? Needs documentation...

Why was the duplicate / merge verse logic needed to be added now? Was it just moved from somewhere else? It is not specific to non-verse text...

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 71 at r2 (raw file):

                }
                _duplicateVerse = true;
                UpdateVerseRef(state.VerseRef, marker);

If it's truly a duplicate verse (and we are just throwing it on the ground), why are we also updating the verse reference? Wouldn't the marker be the same? And if it was not, we would want the first marker, not the one that came from the duplicated text.

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 60 at r2 (raw file):

            string marker,
            string altNumber,
            string pubNumber

What happens for alternative verses in descriptive sections? They would not be handled as verses.. https://ubsicap.github.io/usfm/chapters_verses/index.html#va-va. Would /va be counted as a verse? Would then the description be counted as duplicate?

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 102 at r2 (raw file):

        {
            if (_verseRef.IsDefault)
                UpdateVerseRef(state.VerseRef, marker);

What is intended here? Only at the beginning of books do we update from the parser state, but not later? Why not later?

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 18 at r2 (raw file):

        private VerseRef _verseRef;
        private readonly Stack<ScriptureElement> _elements;
        private readonly Stack<ScriptureTextType> _curTextType;

As another question - _verseRef is never initialized and I assume it starts as a null. Should this be the case? Or should it be initialized to default?

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 126 at r2 (raw file):

            if (CurrentTextType == ScriptureTextType.NonVerse)
                StartParentElement(marker);
        }

So, if it is a verse (or a note), just strip out all formatting? If there is a paragraph or a table in a verse or cell, squish it together and translate the whole thing?

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 126 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

So, if it is a verse (or a note), just strip out all formatting? If there is a paragraph or a table in a verse or cell, squish it together and translate the whole thing?

Documentation...

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ScriptureText.cs line 104 at r2 (raw file):

        }

        protected TextRow CreateRow(

This appears to never be called

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/TextCorpusBase.cs line 10 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

That makes sense. This might be even more hacky, but is there some way you could make a helper method that uses reflection to see if the corpus has a versification property? I suppose this isn't really practically so different than having the versification where inapplicable set to null, but could it keep Versification from floating around in classes where it doesn't belong?

Or, add another property to TextCorpusBase called "CorpusType" or similar that would be hard coded to either a string or an enum of "TextCorpus" or "ScriptureCorpus". This should survive decorators and be more explicit when trying to determine the underlying type.

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/UsfmParser.cs line 144 at r2 (raw file):

            if (State.Index >= State.Tokens.Count - 1)
            {
                CloseAll();

So, close all USFM markers on the end of USFM (makes sense), the beginnings of notes, chapters and books as well as verses that don't have verse text? Is that correct?

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/UsfmTextBase.cs line 70 at r2 (raw file):

            public IEnumerable<TextRow> Rows => _rows;

            public override void Verse(

This is where _verseRef was set to default - but not anymore. But it is still asked if it IsDefault...

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/UsfmTextBase.cs line 78 at r2 (raw file):

            )
            {
                base.Verse(state, number, marker, altNumber, pubNumber);

This logic is retranscribed - but also duplicates are checked, with odd logic (by my estimation).

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/UsfmTextBase.cs line 111 at r2 (raw file):

                    OutputMarker(state);
                }
                else if (CurrentTextType == ScriptureTextType.Verse)

This is an odd dance between the base and parent class - handling non-verse cells in the base and verse cells right here...

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/UsfmTextBase.cs line 180 at r2 (raw file):

                    _rowTexts.Peek().Append("//");
                }
                else if (CurrentTextType != ScriptureTextType.Verse || state.IsVerseText)

Ok, so if it isn't a verse or is a verse? I don't understand...

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/UsfmTextBase.cs line 254 at r2 (raw file):

                _rowTexts.Push(new StringBuilder());
            }

So, if I understand, If it's a verse, it is all glumbed together. If it is out of a verse, each part is treated individually. If markers are included, keep the notes in place.

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/UsfmTextBase.cs line 293 at r2 (raw file):

                }
                if (!state.IsVersePara)
                    _sentenceStart = true;

This logic replaces Sentence completed function. Has the logic changed?

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 16 of 34 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: all files reviewed, 21 unresolved discussions (waiting on @ddaspit and @Enkidu93)

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/UsfmTextUpdater.cs line 351 at r2 (raw file):

        private void SkipTokens(UsfmParserState state)
        {
            _tokenIndex = state.Index + 1 + state.SpecialTokenCount;

So, alternate verses and other special tokens will be skipped over if we are replacing the text? Would we get those from the translated text? Are we sure we want to blow away the special tokens?

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/UsfmTextUpdater.cs line 268 at r2 (raw file):

            {
                _tokens.Add(state.Token);
                _tokens.Add(new UsfmToken(UsfmTokenType.Character, "ft", null, "ft*"));

Are we sure there are only footnotes? What about endnotes? Don't we have the marker in the Scripture reference that we can pull out and use?

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 34 files at r1.
Reviewable status: all files reviewed, 23 unresolved discussions (waiting on @ddaspit and @Enkidu93)

- add new ScriptureRef corpus ref class
- update Scripture corpora classes to use ScriptureRef
- add ScriptureRefUsfmParserHandlerBase class to track ScriptureRef in USFM
- update UsfmTextUpdater and UsfmTextBase to use ScriptureRefUsfmParserHandlerBase
- add support for updating non-Scripture paragraphs and notes
- update NmtPreprocessBuildJob to support non-Scripture segments
Copy link
Contributor Author

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 32 of 34 files reviewed, 22 unresolved discussions (waiting on @Enkidu93 and @johnml1135)


src/SIL.Machine/Corpora/ScriptureRef.cs line 29 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

We should make a wiki page explaining how the parsing works - or an extended comment at the beginning of this file.

Unfortunately, there is no spec for specifying a reference to a particular text segment in a USFM/USX file. I had to make one up that fits our needs. The ScriptureRef gets serialized to a specific format. If the format is invalid, then it might fail to parse.


src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 16 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Hmm. I'm looking at it more. I am just wondering why you use _curTextType, but not _cur for the other ones...

Done.


src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 18 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

As another question - _verseRef is never initialized and I assume it starts as a null. Should this be the case? Or should it be initialized to default?

VerseRef is a struct (value type), so it isn't nullable by default.


src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 35 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Why is this popped here? It would not be used anymore. I would assume either it would be cleared or left alone...

It doesn't have to be. I was just keeping the logic consistent across the entire class.


src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 60 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

What happens for alternative verses in descriptive sections? They would not be handled as verses.. https://ubsicap.github.io/usfm/chapters_verses/index.html#va-va. Would /va be counted as a verse? Would then the description be counted as duplicate?

They would be ignored.


src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 69 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Why was the duplicate / merge verse logic needed to be added now? Was it just moved from somewhere else? It is not specific to non-verse text...

This logic was moved from the UsfmTextBase class.


src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 71 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

If it's truly a duplicate verse (and we are just throwing it on the ground), why are we also updating the verse reference? Wouldn't the marker be the same? And if it was not, we would want the first marker, not the one that came from the duplicated text.

Done.


src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 102 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

What is intended here? Only at the beginning of books do we update from the parser state, but not later? Why not later?

We only update the verse ref once we hit the first paragraph. Prior to that we are in the book identification section.


src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 126 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Documentation...

Yes, that is correct.


src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 205 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

I think I got it. The reference would be:

  • Verse reference: MAT 3:12-22
  • Elements: Section 4, Paragraph 3, Cell 18.
  • Elements as they are in here: Section 4, Section 0, Paragraph 3, Paragraph 0, Cell 18 (0's get removed)
  • "Parent elements" can contain others - element "0" symbolically

There should be an explanation of this (at least some examples) - it took me a while to figure out.

I added a description of how scripture refs work in the class documentation.


src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 230 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

So you create these for Notes (all types), Paragraphs (all types) and Cells (no other types of characters). Is this the intention (it seems reasonable from little that I see)? This needs to be documented in Swagger.

Yes, that is correct.


src/SIL.Machine/Corpora/ScriptureText.cs line 104 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

This appears to never be called

I left it in as a helper method for classes that extend this abstract base class.


src/SIL.Machine/Corpora/TextCorpusBase.cs line 10 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Or, add another property to TextCorpusBase called "CorpusType" or similar that would be hard coded to either a string or an enum of "TextCorpus" or "ScriptureCorpus". This should survive decorators and be more explicit when trying to determine the underlying type.

@Enkidu93 Reflection should only be used as a last resort.

@johnml1135 Yes, but we would still need to have the Versification property, so I figured it was easier to just add this one property and return null if it isn't a Scripture corpus. There is an IsScripture extension method on ITextCorpus that checks this property.


src/SIL.Machine/Corpora/UsfmParser.cs line 144 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

So, close all USFM markers on the end of USFM (makes sense), the beginnings of notes, chapters and books as well as verses that don't have verse text? Is that correct?

This is a fix for what was essentially a bug in the UsfmParser where the End<element> methods would not get called at the end of the USFM file. This fix ensures that they get called properly.


src/SIL.Machine/Corpora/UsfmTextBase.cs line 70 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

This is where _verseRef was set to default - but not anymore. But it is still asked if it IsDefault...

I'm not sure I understand what you mean.


src/SIL.Machine/Corpora/UsfmTextBase.cs line 78 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

This logic is retranscribed - but also duplicates are checked, with odd logic (by my estimation).

I'm not sure what you mean.


src/SIL.Machine/Corpora/UsfmTextBase.cs line 111 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

This is an odd dance between the base and parent class - handling non-verse cells in the base and verse cells right here...

If a table cell occurs inside of a verse, then we need to make sure that we insert a space in the collected verse text.


src/SIL.Machine/Corpora/UsfmTextBase.cs line 180 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Ok, so if it isn't a verse or is a verse? I don't understand...

The state.IsVerseText refers specifically to verse text within a verse. A verse can contain text that isn't verse text.


src/SIL.Machine/Corpora/UsfmTextBase.cs line 254 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

So, if I understand, If it's a verse, it is all glumbed together. If it is out of a verse, each part is treated individually. If markers are included, keep the notes in place.

Yes, that sound about right.


src/SIL.Machine/Corpora/UsfmTextBase.cs line 293 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

This logic replaces Sentence completed function. Has the logic changed?

No, it hasn't. I added this in so that the sentence start property is set correctly for non-verse rows.


src/SIL.Machine/Corpora/UsfmTextUpdater.cs line 268 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Are we sure there are only footnotes? What about endnotes? Don't we have the marker in the Scripture reference that we can pull out and use?

This marker is used for both footnotes and endnotes.


src/SIL.Machine/Corpora/UsfmTextUpdater.cs line 351 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

So, alternate verses and other special tokens will be skipped over if we are replacing the text? Would we get those from the translated text? Are we sure we want to blow away the special tokens?

Yes, if we are skipping a marker, then we should also skip over all of the special tokens that are related to that marker.

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ScriptureRef.cs line 29 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Unfortunately, there is no spec for specifying a reference to a particular text segment in a USFM/USX file. I had to make one up that fits our needs. The ScriptureRef gets serialized to a specific format. If the format is invalid, then it might fail to parse.

I like the description - that tied together with the tests make it understandable for someone coming in for the first time. We can also add something to Swagger to help people if they pull it from the JSON: sillsdev/serval#356.

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 18 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

VerseRef is a struct (value type), so it isn't nullable by default.

Ah, the details of Dotnet.

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/UsfmTextBase.cs line 70 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I'm not sure I understand what you mean.

I misread.

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/UsfmTextBase.cs line 78 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I'm not sure what you mean.

The logic was moved of looking for overlapping verses, etc. from here to ScriptureRefParserHandlerBase. Also when it was moved duplicate verse checking was modified. I am ok with it.

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/UsfmTextUpdater.cs line 268 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This marker is used for both footnotes and endnotes.

https://ubsicap.github.io/usfm/notes_basic/fnotes.html?highlight=endnote. Reading the documentation is a little confusing but a few thoughts:

  • \ft always appears to be in a broader \f ... \f* frame - and I see no \ft* examples
  • There are some other things that would likely need to be translated including keywords (fk)
  • quotes (fq) should also be translated
  • The footnote markers appear to change contexts - do we only translate stuff in the "ft" context?

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 16 of 34 files at r1, 2 of 2 files at r3, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ddaspit)

@johnml1135
Copy link
Collaborator

Did you update any tests for the code coverage?

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updates to test coverage?

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ddaspit)

Copy link
Contributor Author

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @johnml1135)


src/SIL.Machine/Corpora/UsfmTextUpdater.cs line 268 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

https://ubsicap.github.io/usfm/notes_basic/fnotes.html?highlight=endnote. Reading the documentation is a little confusing but a few thoughts:

  • \ft always appears to be in a broader \f ... \f* frame - and I see no \ft* examples
  • There are some other things that would likely need to be translated including keywords (fk)
  • quotes (fq) should also be translated
  • The footnote markers appear to change contexts - do we only translate stuff in the "ft" context?

We translate everything, we just don't have any way to preserve the markers, so we just put everything in an ft marker.

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ScriptureRef.cs line 29 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

I like the description - that tied together with the tests make it understandable for someone coming in for the first time. We can also add something to Swagger to help people if they pull it from the JSON: sillsdev/serval#356.

I added a wiki page to explain the parsing: https://github.com/sillsdev/serval/wiki/USFM-Parsing-and-Translation

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/UsfmTextUpdater.cs line 268 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

We translate everything, we just don't have any way to preserve the markers, so we just put everything in an ft marker.

So, here is an example:

\f + \fr 7.38: \ft Jesus' words in verses
37-38 may be translated: \fqa “Whoever is thirsty should come to me and drink.
\+fv 38\+fv* As the scripture says, ‘Streams of life-giving water will pour out from
within anyone who believes in me.’”\f*

So you are saying that we just translate the whole ft and strip out the \fr, \fqa, +fv's while keeping the \f and \f* bracketing marks?

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit)

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/UsfmTextUpdater.cs line 268 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

So, here is an example:

\f + \fr 7.38: \ft Jesus' words in verses
37-38 may be translated: \fqa “Whoever is thirsty should come to me and drink.
\+fv 38\+fv* As the scripture says, ‘Streams of life-giving water will pour out from
within anyone who believes in me.’”\f*

So you are saying that we just translate the whole ft and strip out the \fr, \fqa, +fv's while keeping the \f and \f* bracketing marks?

Whatever it is is good enough. If it becomes an issue, we can make a later improvement.

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @ddaspit)

@johnml1135
Copy link
Collaborator

I'll make a new issue for test coverage - #181

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @ddaspit)

@johnml1135 johnml1135 merged commit a9058ce into master Apr 11, 2024
4 checks passed
@ddaspit ddaspit deleted the non-scripture branch April 16, 2024 16:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants