Add support for non-verse text segments in Scripture corpora #179

ddaspit · 2024-04-02T21:50:44Z

add new ScriptureRef corpus ref class
update Scripture corpora classes to use ScriptureRef
add ScriptureRefUsfmParserHandlerBase class to track ScriptureRef in USFM
update UsfmTextUpdater and UsfmTextBase to use ScriptureRefUsfmParserHandlerBase
add support for updating non-Scripture paragraphs and notes
update NmtPreprocessBuildJob to support non-Scripture segments

This change is

codecov-commenter · 2024-04-02T21:52:40Z

Codecov Report

Attention: Patch coverage is 88.82083% with 73 lines in your changes are missing coverage. Please review.

Project coverage is 67.04%. Comparing base (fa65835) to head (a894ab9).

Files	Patch %	Lines
src/SIL.Machine/Corpora/ScriptureRef.cs	70.73%	22 Missing and 2 partials ⚠️
src/SIL.Machine/Corpora/ScriptureElement.cs	61.11%	13 Missing and 1 partial ⚠️
src/SIL.Machine/Corpora/UsfmTextUpdater.cs	91.74%	5 Missing and 4 partials ⚠️
src/SIL.Machine/Corpora/ScriptureText.cs	82.60%	7 Missing and 1 partial ⚠️
...chine/Corpora/ScriptureRefUsfmParserHandlerBase.cs	95.27%	7 Missing ⚠️
src/SIL.Machine/Corpora/CorporaExtensions.cs	50.00%	6 Missing ⚠️
src/SIL.Machine/Corpora/ParallelTextRow.cs	0.00%	1 Missing and 1 partial ⚠️
...chine.AspNetCore/Services/NmtPreprocessBuildJob.cs	98.30%	0 Missing and 1 partial ⚠️
src/SIL.Machine/Corpora/UsfmTextBase.cs	98.61%	1 Missing ⚠️
.../SIL.Machine/Corpora/UsxFileAlignmentCollection.cs	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #179      +/-   ##
==========================================
+ Coverage   66.72%   67.04%   +0.31%     
==========================================
  Files         438      441       +3     
  Lines       34396    34831     +435     
  Branches     4613     4670      +57     
==========================================
+ Hits        22951    23351     +400     
- Misses      10355    10387      +32     
- Partials     1090     1093       +3

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Enkidu93

Big update! Pretty cool to have this capability - thank you for your hard work ❤️ . Not much I feel equipped to comment on - the rest is all on you, John 😆 .

Reviewed 34 of 34 files at r1, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @ddaspit and @johnml1135)

src/SIL.Machine/Corpora/ScriptureRef.cs line 42 at r1 (raw file):

        public VerseRef VerseRef { get; }
        public IReadOnlyList<ScriptureElement> Path { get; }

Why is this called Path given it's a list of scripture elements associated with the ref?

src/SIL.Machine/Corpora/TextCorpusBase.cs line 10 at r1 (raw file):

        public abstract IEnumerable<IText> Texts { get; }
        public abstract bool IsTokenized { get; }
        public abstract ScrVers Versification { get; }

I'm sure you've thought this through better than I can, but is there no more semantically appropriate alternative to having all the corpora have versifications (even if they're just null)? Couldn't you have parallel scripture/non-scripture inheritance structures? Is what you've done preferable because there's no easy way to reuse code across parallel paths like that?

src/SIL.Machine/Corpora/UsfmTextBase.cs line 182 at r1 (raw file):

                else if (
                    (CurrentTextType == ScriptureTextType.Verse && state.IsVerseText)
                    || CurrentTextType != ScriptureTextType.Verse

Is there a reason you can't/shouldn't simplify?

Code snippet:

CurrentTextType != ScriptureTextType.Verse || state.IsVerseText

src/SIL.Machine/Corpora/UsfmTextBase.cs line 218 at r1 (raw file):

                    && (
                        (CurrentTextType == ScriptureTextType.Verse && state.IsVerseText)
                        || CurrentTextType != ScriptureTextType.Verse

Same as note above about simplification.

ddaspit

Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @Enkidu93 and @johnml1135)

src/SIL.Machine/Corpora/ScriptureRef.cs line 42 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Why is this called Path given it's a list of scripture elements associated with the ref?

I agree. It is not a great name, but I wasn't sure what else to call it. I called it Path, because it captures the path to the particular segment in the USFM hierarchy.

src/SIL.Machine/Corpora/TextCorpusBase.cs line 10 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

I'm sure you've thought this through better than I can, but is there no more semantically appropriate alternative to having all the corpora have versifications (even if they're just null)? Couldn't you have parallel scripture/non-scripture inheritance structures? Is what you've done preferable because there's no easy way to reuse code across parallel paths like that?

I didn't love this change, but I couldn't come up with a better way. I need some way to easily tell if a corpus is a Scripture corpus and uses ScriptureRefs. Machine supports operations on corpora, which are implemented through decorator classes. These decorator classes are generic and can be applied to any type of corpus. This obscures the original type of the corpus, so we can't reliably use is ScriptureTextCorpus or anything like that. The other way to determine if it is a Scripture corpus was to actually iterate through the rows, grab the ref of the first row, and check if it is a ScriptureRef, but that is really hacky. I think this is the best option.

src/SIL.Machine/Corpora/UsfmTextBase.cs line 182 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Is there a reason you can't/shouldn't simplify?

There isn't really any reason I can't simplify to the suggested conditional. I was just trying to make the intent clearer. I will go ahead and simplify it.

ddaspit

Reviewable status: 33 of 34 files reviewed, 4 unresolved discussions (waiting on @Enkidu93 and @johnml1135)

src/SIL.Machine/Corpora/UsfmTextBase.cs line 182 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

There isn't really any reason I can't simplify to the suggested conditional. I was just trying to make the intent clearer. I will go ahead and simplify it.

Done

src/SIL.Machine/Corpora/UsfmTextBase.cs line 218 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Same as note above about simplification.

Done.

Enkidu93

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ddaspit and @johnml1135)

src/SIL.Machine/Corpora/ScriptureRef.cs line 42 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I agree. It is not a great name, but I wasn't sure what else to call it. I called it Path, because it captures the path to the particular segment in the USFM hierarchy.

I see. OK. Nothing better comes to mind. Maybe John will have a thought.

src/SIL.Machine/Corpora/TextCorpusBase.cs line 10 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I didn't love this change, but I couldn't come up with a better way. I need some way to easily tell if a corpus is a Scripture corpus and uses ScriptureRefs. Machine supports operations on corpora, which are implemented through decorator classes. These decorator classes are generic and can be applied to any type of corpus. This obscures the original type of the corpus, so we can't reliably use is ScriptureTextCorpus or anything like that. The other way to determine if it is a Scripture corpus was to actually iterate through the rows, grab the ref of the first row, and check if it is a ScriptureRef, but that is really hacky. I think this is the best option.

That makes sense. This might be even more hacky, but is there some way you could make a helper method that uses reflection to see if the corpus has a versification property? I suppose this isn't really practically so different than having the versification where inapplicable set to null, but could it keep Versification from floating around in classes where it doesn't belong?

johnml1135 · 2024-04-03T21:40:57Z

src/SIL.Machine/Corpora/ScriptureRef.cs line 29 at r2 (raw file):

                    path.Add(new ScriptureElement(0, elem[0]));
                else
                    path.Add(new ScriptureElement(int.Parse(elem[0], CultureInfo.InvariantCulture), elem[1]));

This looks fragile. There should be a spec that this is designed to. Either it is general, or there is a specific set of labels that should work in a very specific format. I notice that something like ABC:123 would break it, in that it would be split into 2, and ABC would not be able to parse into a number.

johnml1135 · 2024-04-04T11:19:23Z

Notes from the code coverage:

Overlaps, comparisons and hash codes untested for ScriptureReference and ScriptureElement
CreateRow in ScriptureText untested
Ref in UsfmTextUpdater untested

johnml1135 · 2024-04-04T11:20:32Z

src/SIL.Machine/Corpora/ScriptureRef.cs line 29 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

This looks fragile. There should be a spec that this is designed to. Either it is general, or there is a specific set of labels that should work in a very specific format. I notice that something like ABC:123 would break it, in that it would be split into 2, and ABC would not be able to parse into a number.

We should make a wiki page explaining how the parsing works - or an extended comment at the beginning of this file.

johnml1135 · 2024-04-04T11:59:57Z

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 16 at r2 (raw file):

    public abstract class ScriptureRefUsfmParserHandlerBase : UsfmParserHandlerBase
    {
        private VerseRef _verseRef;

This should be _curVerseRef - as it can change.

johnml1135 · 2024-04-04T12:00:29Z

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 16 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

This should be _curVerseRef - as it can change.

And _curElements

johnml1135 · 2024-04-04T13:09:46Z

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 16 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

And _curElements

Hmm. I'm looking at it more. I am just wondering why you use _curTextType, but not _cur for the other ones...

johnml1135 · 2024-04-04T13:10:06Z

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 230 at r2 (raw file):

                _verseRef.HasMultiple ? _verseRef.AllVerses().Last() : _verseRef,
                _elements.Where(e => e.Position > 0).Reverse()
            );

Genius.

johnml1135 · 2024-04-04T13:37:28Z

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 205 at r2 (raw file):

            ScriptureElement prevElem = _elements.Pop();
            _elements.Push(new ScriptureElement(prevElem.Position + 1, marker));
        }

I think I got it. The reference would be:

Verse reference: MAT 3:12-22
Elements: Section 4, Paragraph 3, Cell 18.
Elements as they are in here: Section 4, Section 0, Paragraph 3, Paragraph 0, Cell 18 (0's get removed)
"Parent elements" can contain others - element "0" symbolically

There should be an explanation of this (at least some examples) - it took me a while to figure out.

johnml1135 · 2024-04-04T14:32:19Z

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 230 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Genius.

So you create these for Notes (all types), Paragraphs (all types) and Cells (no other types of characters). Is this the intention (it seems reasonable from little that I see)? This needs to be documented in Swagger.

johnml1135 · 2024-04-04T14:35:38Z

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 35 at r2 (raw file):

            {
                EndVerseText(state, CreateVerseRefs());
                _curTextType.Pop();

Why is this popped here? It would not be used anymore. I would assume either it would be cleared or left alone...

johnml1135 · 2024-04-04T14:40:57Z

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 69 at r2 (raw file):

                    EndVerseText(state, CreateVerseRefs());
                    _curTextType.Pop();
                }

So, for duplicated verses, just take the first one and then ignore all duplicates? Needs documentation...

johnml1135 · 2024-04-04T14:41:58Z

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 69 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

So, for duplicated verses, just take the first one and then ignore all duplicates? Needs documentation...

Why was the duplicate / merge verse logic needed to be added now? Was it just moved from somewhere else? It is not specific to non-verse text...

johnml1135 · 2024-04-04T14:44:31Z

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 71 at r2 (raw file):

                }
                _duplicateVerse = true;
                UpdateVerseRef(state.VerseRef, marker);

If it's truly a duplicate verse (and we are just throwing it on the ground), why are we also updating the verse reference? Wouldn't the marker be the same? And if it was not, we would want the first marker, not the one that came from the duplicated text.

johnml1135 · 2024-04-04T14:58:09Z

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 60 at r2 (raw file):

            string marker,
            string altNumber,
            string pubNumber

What happens for alternative verses in descriptive sections? They would not be handled as verses.. https://ubsicap.github.io/usfm/chapters_verses/index.html#va-va. Would /va be counted as a verse? Would then the description be counted as duplicate?

johnml1135 · 2024-04-04T15:00:56Z

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 102 at r2 (raw file):

        {
            if (_verseRef.IsDefault)
                UpdateVerseRef(state.VerseRef, marker);

What is intended here? Only at the beginning of books do we update from the parser state, but not later? Why not later?

johnml1135 · 2024-04-04T16:56:06Z

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 18 at r2 (raw file):

        private VerseRef _verseRef;
        private readonly Stack<ScriptureElement> _elements;
        private readonly Stack<ScriptureTextType> _curTextType;

As another question - _verseRef is never initialized and I assume it starts as a null. Should this be the case? Or should it be initialized to default?

johnml1135 · 2024-04-04T17:24:28Z

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 126 at r2 (raw file):

            if (CurrentTextType == ScriptureTextType.NonVerse)
                StartParentElement(marker);
        }

So, if it is a verse (or a note), just strip out all formatting? If there is a paragraph or a table in a verse or cell, squish it together and translate the whole thing?

johnml1135 · 2024-04-04T17:24:36Z

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 126 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

So, if it is a verse (or a note), just strip out all formatting? If there is a paragraph or a table in a verse or cell, squish it together and translate the whole thing?

Documentation...

johnml1135 · 2024-04-04T17:34:26Z

src/SIL.Machine/Corpora/ScriptureText.cs line 104 at r2 (raw file):

        }

        protected TextRow CreateRow(

This appears to never be called

johnml1135 · 2024-04-04T17:40:32Z

src/SIL.Machine/Corpora/TextCorpusBase.cs line 10 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

That makes sense. This might be even more hacky, but is there some way you could make a helper method that uses reflection to see if the corpus has a versification property? I suppose this isn't really practically so different than having the versification where inapplicable set to null, but could it keep Versification from floating around in classes where it doesn't belong?

Or, add another property to TextCorpusBase called "CorpusType" or similar that would be hard coded to either a string or an enum of "TextCorpus" or "ScriptureCorpus". This should survive decorators and be more explicit when trying to determine the underlying type.

johnml1135 · 2024-04-04T17:52:43Z

src/SIL.Machine/Corpora/UsfmParser.cs line 144 at r2 (raw file):

            if (State.Index >= State.Tokens.Count - 1)
            {
                CloseAll();

So, close all USFM markers on the end of USFM (makes sense), the beginnings of notes, chapters and books as well as verses that don't have verse text? Is that correct?

johnml1135 · 2024-04-04T17:54:31Z

src/SIL.Machine/Corpora/UsfmTextBase.cs line 70 at r2 (raw file):

            public IEnumerable<TextRow> Rows => _rows;

            public override void Verse(

This is where _verseRef was set to default - but not anymore. But it is still asked if it IsDefault...

johnml1135 · 2024-04-04T17:55:17Z

src/SIL.Machine/Corpora/UsfmTextBase.cs line 78 at r2 (raw file):

            )
            {
                base.Verse(state, number, marker, altNumber, pubNumber);

This logic is retranscribed - but also duplicates are checked, with odd logic (by my estimation).

johnml1135 · 2024-04-04T17:57:16Z

src/SIL.Machine/Corpora/UsfmTextBase.cs line 111 at r2 (raw file):

                    OutputMarker(state);
                }
                else if (CurrentTextType == ScriptureTextType.Verse)

This is an odd dance between the base and parent class - handling non-verse cells in the base and verse cells right here...

johnml1135 · 2024-04-04T18:12:46Z

src/SIL.Machine/Corpora/UsfmTextBase.cs line 180 at r2 (raw file):

                    _rowTexts.Peek().Append("//");
                }
                else if (CurrentTextType != ScriptureTextType.Verse || state.IsVerseText)

Ok, so if it isn't a verse or is a verse? I don't understand...

johnml1135 · 2024-04-04T19:16:33Z

src/SIL.Machine/Corpora/UsfmTextBase.cs line 254 at r2 (raw file):

                _rowTexts.Push(new StringBuilder());
            }

So, if I understand, If it's a verse, it is all glumbed together. If it is out of a verse, each part is treated individually. If markers are included, keep the notes in place.

johnml1135 · 2024-04-04T19:21:20Z

src/SIL.Machine/Corpora/UsfmTextBase.cs line 293 at r2 (raw file):

                }
                if (!state.IsVersePara)
                    _sentenceStart = true;

This logic replaces Sentence completed function. Has the logic changed?

johnml1135

Reviewed 16 of 34 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: all files reviewed, 21 unresolved discussions (waiting on @ddaspit and @Enkidu93)

johnml1135 · 2024-04-04T20:58:26Z

src/SIL.Machine/Corpora/UsfmTextUpdater.cs line 351 at r2 (raw file):

        private void SkipTokens(UsfmParserState state)
        {
            _tokenIndex = state.Index + 1 + state.SpecialTokenCount;

So, alternate verses and other special tokens will be skipped over if we are replacing the text? Would we get those from the translated text? Are we sure we want to blow away the special tokens?

johnml1135 · 2024-04-04T21:01:52Z

src/SIL.Machine/Corpora/UsfmTextUpdater.cs line 268 at r2 (raw file):

            {
                _tokens.Add(state.Token);
                _tokens.Add(new UsfmToken(UsfmTokenType.Character, "ft", null, "ft*"));

Are we sure there are only footnotes? What about endnotes? Don't we have the marker in the Scripture reference that we can pull out and use?

johnml1135

Reviewed 1 of 34 files at r1.
Reviewable status: all files reviewed, 23 unresolved discussions (waiting on @ddaspit and @Enkidu93)

- add new ScriptureRef corpus ref class - update Scripture corpora classes to use ScriptureRef - add ScriptureRefUsfmParserHandlerBase class to track ScriptureRef in USFM - update UsfmTextUpdater and UsfmTextBase to use ScriptureRefUsfmParserHandlerBase - add support for updating non-Scripture paragraphs and notes - update NmtPreprocessBuildJob to support non-Scripture segments

ddaspit

Reviewable status: 32 of 34 files reviewed, 22 unresolved discussions (waiting on @Enkidu93 and @johnml1135)

src/SIL.Machine/Corpora/ScriptureRef.cs line 29 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

We should make a wiki page explaining how the parsing works - or an extended comment at the beginning of this file.

Unfortunately, there is no spec for specifying a reference to a particular text segment in a USFM/USX file. I had to make one up that fits our needs. The ScriptureRef gets serialized to a specific format. If the format is invalid, then it might fail to parse.

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 16 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Hmm. I'm looking at it more. I am just wondering why you use _curTextType, but not _cur for the other ones...

Done.

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 18 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

As another question - _verseRef is never initialized and I assume it starts as a null. Should this be the case? Or should it be initialized to default?

VerseRef is a struct (value type), so it isn't nullable by default.

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 35 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Why is this popped here? It would not be used anymore. I would assume either it would be cleared or left alone...

It doesn't have to be. I was just keeping the logic consistent across the entire class.

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 60 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

What happens for alternative verses in descriptive sections? They would not be handled as verses.. https://ubsicap.github.io/usfm/chapters_verses/index.html#va-va. Would /va be counted as a verse? Would then the description be counted as duplicate?

They would be ignored.

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 69 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Why was the duplicate / merge verse logic needed to be added now? Was it just moved from somewhere else? It is not specific to non-verse text...

This logic was moved from the UsfmTextBase class.

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 71 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

If it's truly a duplicate verse (and we are just throwing it on the ground), why are we also updating the verse reference? Wouldn't the marker be the same? And if it was not, we would want the first marker, not the one that came from the duplicated text.

Done.

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 102 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

What is intended here? Only at the beginning of books do we update from the parser state, but not later? Why not later?

We only update the verse ref once we hit the first paragraph. Prior to that we are in the book identification section.

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 126 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Documentation...

Yes, that is correct.

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 205 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

I think I got it. The reference would be:

Verse reference: MAT 3:12-22

Elements: Section 4, Paragraph 3, Cell 18.

Elements as they are in here: Section 4, Section 0, Paragraph 3, Paragraph 0, Cell 18 (0's get removed)

"Parent elements" can contain others - element "0" symbolically

There should be an explanation of this (at least some examples) - it took me a while to figure out.

I added a description of how scripture refs work in the class documentation.

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 230 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

So you create these for Notes (all types), Paragraphs (all types) and Cells (no other types of characters). Is this the intention (it seems reasonable from little that I see)? This needs to be documented in Swagger.

Yes, that is correct.

src/SIL.Machine/Corpora/ScriptureText.cs line 104 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

This appears to never be called

I left it in as a helper method for classes that extend this abstract base class.

src/SIL.Machine/Corpora/TextCorpusBase.cs line 10 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Or, add another property to TextCorpusBase called "CorpusType" or similar that would be hard coded to either a string or an enum of "TextCorpus" or "ScriptureCorpus". This should survive decorators and be more explicit when trying to determine the underlying type.

@Enkidu93 Reflection should only be used as a last resort.

@johnml1135 Yes, but we would still need to have the Versification property, so I figured it was easier to just add this one property and return null if it isn't a Scripture corpus. There is an IsScripture extension method on ITextCorpus that checks this property.

src/SIL.Machine/Corpora/UsfmParser.cs line 144 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

So, close all USFM markers on the end of USFM (makes sense), the beginnings of notes, chapters and books as well as verses that don't have verse text? Is that correct?

This is a fix for what was essentially a bug in the UsfmParser where the End<element> methods would not get called at the end of the USFM file. This fix ensures that they get called properly.

src/SIL.Machine/Corpora/UsfmTextBase.cs line 70 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

This is where _verseRef was set to default - but not anymore. But it is still asked if it IsDefault...

I'm not sure I understand what you mean.

src/SIL.Machine/Corpora/UsfmTextBase.cs line 78 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

This logic is retranscribed - but also duplicates are checked, with odd logic (by my estimation).

I'm not sure what you mean.

src/SIL.Machine/Corpora/UsfmTextBase.cs line 111 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

This is an odd dance between the base and parent class - handling non-verse cells in the base and verse cells right here...

If a table cell occurs inside of a verse, then we need to make sure that we insert a space in the collected verse text.

src/SIL.Machine/Corpora/UsfmTextBase.cs line 180 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Ok, so if it isn't a verse or is a verse? I don't understand...

The state.IsVerseText refers specifically to verse text within a verse. A verse can contain text that isn't verse text.

src/SIL.Machine/Corpora/UsfmTextBase.cs line 254 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

So, if I understand, If it's a verse, it is all glumbed together. If it is out of a verse, each part is treated individually. If markers are included, keep the notes in place.

Yes, that sound about right.

src/SIL.Machine/Corpora/UsfmTextBase.cs line 293 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

This logic replaces Sentence completed function. Has the logic changed?

No, it hasn't. I added this in so that the sentence start property is set correctly for non-verse rows.

src/SIL.Machine/Corpora/UsfmTextUpdater.cs line 268 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Are we sure there are only footnotes? What about endnotes? Don't we have the marker in the Scripture reference that we can pull out and use?

This marker is used for both footnotes and endnotes.

src/SIL.Machine/Corpora/UsfmTextUpdater.cs line 351 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

So, alternate verses and other special tokens will be skipped over if we are replacing the text? Would we get those from the translated text? Are we sure we want to blow away the special tokens?

Yes, if we are skipping a marker, then we should also skip over all of the special tokens that are related to that marker.

johnml1135 · 2024-04-08T12:17:36Z

src/SIL.Machine/Corpora/ScriptureRef.cs line 29 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Unfortunately, there is no spec for specifying a reference to a particular text segment in a USFM/USX file. I had to make one up that fits our needs. The ScriptureRef gets serialized to a specific format. If the format is invalid, then it might fail to parse.

I like the description - that tied together with the tests make it understandable for someone coming in for the first time. We can also add something to Swagger to help people if they pull it from the JSON: sillsdev/serval#356.

johnml1135 · 2024-04-08T12:18:18Z

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 18 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

VerseRef is a struct (value type), so it isn't nullable by default.

Ah, the details of Dotnet.

johnml1135 · 2024-04-08T12:32:19Z

src/SIL.Machine/Corpora/UsfmTextBase.cs line 70 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I'm not sure I understand what you mean.

I misread.

johnml1135 · 2024-04-08T12:35:29Z

src/SIL.Machine/Corpora/UsfmTextBase.cs line 78 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I'm not sure what you mean.

The logic was moved of looking for overlapping verses, etc. from here to ScriptureRefParserHandlerBase. Also when it was moved duplicate verse checking was modified. I am ok with it.

johnml1135 · 2024-04-08T12:52:44Z

src/SIL.Machine/Corpora/UsfmTextUpdater.cs line 268 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This marker is used for both footnotes and endnotes.

https://ubsicap.github.io/usfm/notes_basic/fnotes.html?highlight=endnote. Reading the documentation is a little confusing but a few thoughts:

\ft always appears to be in a broader \f ... \f* frame - and I see no \ft* examples
There are some other things that would likely need to be translated including keywords (fk)
quotes (fq) should also be translated
The footnote markers appear to change contexts - do we only translate stuff in the "ft" context?

johnml1135

Reviewed 16 of 34 files at r1, 2 of 2 files at r3, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ddaspit)

johnml1135 · 2024-04-08T13:04:31Z

Did you update any tests for the code coverage?

johnml1135

Updates to test coverage?

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ddaspit)

ddaspit

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @johnml1135)

src/SIL.Machine/Corpora/UsfmTextUpdater.cs line 268 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

https://ubsicap.github.io/usfm/notes_basic/fnotes.html?highlight=endnote. Reading the documentation is a little confusing but a few thoughts:

\ft always appears to be in a broader \f ... \f* frame - and I see no \ft* examples

There are some other things that would likely need to be translated including keywords (fk)

quotes (fq) should also be translated

The footnote markers appear to change contexts - do we only translate stuff in the "ft" context?

We translate everything, we just don't have any way to preserve the markers, so we just put everything in an ft marker.

johnml1135 · 2024-04-10T18:10:22Z

src/SIL.Machine/Corpora/ScriptureRef.cs line 29 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

I like the description - that tied together with the tests make it understandable for someone coming in for the first time. We can also add something to Swagger to help people if they pull it from the JSON: sillsdev/serval#356.

I added a wiki page to explain the parsing: https://github.com/sillsdev/serval/wiki/USFM-Parsing-and-Translation

johnml1135 · 2024-04-10T18:15:45Z

src/SIL.Machine/Corpora/UsfmTextUpdater.cs line 268 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

We translate everything, we just don't have any way to preserve the markers, so we just put everything in an ft marker.

So, here is an example:

\f + \fr 7.38: \ft Jesus' words in verses
37-38 may be translated: \fqa “Whoever is thirsty should come to me and drink.
\+fv 38\+fv* As the scripture says, ‘Streams of life-giving water will pour out from
within anyone who believes in me.’”\f*

So you are saying that we just translate the whole ft and strip out the \fr, \fqa, +fv's while keeping the \f and \f* bracketing marks?

johnml1135

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit)

johnml1135 · 2024-04-11T15:13:51Z

src/SIL.Machine/Corpora/UsfmTextUpdater.cs line 268 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

So, here is an example:
\f + \fr 7.38: \ft Jesus' words in verses
37-38 may be translated: \fqa “Whoever is thirsty should come to me and drink.
\+fv 38\+fv* As the scripture says, ‘Streams of life-giving water will pour out from
within anyone who believes in me.’”\f*
So you are saying that we just translate the whole ft and strip out the \fr, \fqa, +fv's while keeping the \f and \f* bracketing marks?

Whatever it is is good enough. If it becomes an issue, we can make a later improvement.

johnml1135

Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @ddaspit)

johnml1135 · 2024-04-11T19:13:48Z

I'll make a new issue for test coverage - #181

johnml1135

Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @ddaspit)

johnml1135 self-requested a review April 3, 2024 12:10

Enkidu93 requested changes Apr 3, 2024

View reviewed changes

ddaspit commented Apr 3, 2024

View reviewed changes

ddaspit force-pushed the non-scripture branch from 6915586 to 4387991 Compare April 3, 2024 16:39

ddaspit commented Apr 3, 2024

View reviewed changes

Enkidu93 approved these changes Apr 3, 2024

View reviewed changes

johnml1135 reviewed Apr 4, 2024

View reviewed changes

ddaspit force-pushed the non-scripture branch from 4387991 to cc9b473 Compare April 5, 2024 21:55

ddaspit commented Apr 5, 2024

View reviewed changes

johnml1135 reviewed Apr 8, 2024

View reviewed changes

johnml1135 requested changes Apr 9, 2024

View reviewed changes

ddaspit commented Apr 10, 2024

View reviewed changes

johnml1135 requested changes Apr 10, 2024

View reviewed changes

johnml1135 approved these changes Apr 11, 2024

View reviewed changes

Merge branch 'master' into non-scripture

a894ab9

johnml1135 merged commit a9058ce into master Apr 11, 2024
4 checks passed

ddaspit deleted the non-scripture branch April 16, 2024 16:13

Add support for non-verse text segments in Scripture corpora #179

Add support for non-verse text segments in Scripture corpora #179

Conversation

ddaspit commented Apr 2, 2024 • edited Loading

codecov-commenter commented Apr 2, 2024 • edited Loading

Codecov Report

Enkidu93 left a comment

Choose a reason for hiding this comment

ddaspit left a comment

Choose a reason for hiding this comment

ddaspit left a comment

Choose a reason for hiding this comment

Enkidu93 left a comment

Choose a reason for hiding this comment

johnml1135 commented Apr 3, 2024

johnml1135 commented Apr 4, 2024

johnml1135 commented Apr 4, 2024

johnml1135 commented Apr 4, 2024

johnml1135 commented Apr 4, 2024

johnml1135 commented Apr 4, 2024

johnml1135 commented Apr 4, 2024

johnml1135 commented Apr 4, 2024

johnml1135 commented Apr 4, 2024

johnml1135 commented Apr 4, 2024

johnml1135 commented Apr 4, 2024

johnml1135 commented Apr 4, 2024

johnml1135 commented Apr 4, 2024

johnml1135 commented Apr 4, 2024

johnml1135 commented Apr 4, 2024

johnml1135 commented Apr 4, 2024

johnml1135 commented Apr 4, 2024

johnml1135 commented Apr 4, 2024

johnml1135 commented Apr 4, 2024

johnml1135 commented Apr 4, 2024

johnml1135 commented Apr 4, 2024

johnml1135 commented Apr 4, 2024

johnml1135 commented Apr 4, 2024

johnml1135 commented Apr 4, 2024

johnml1135 commented Apr 4, 2024

johnml1135 commented Apr 4, 2024

johnml1135 commented Apr 4, 2024

johnml1135 left a comment

Choose a reason for hiding this comment

johnml1135 commented Apr 4, 2024

johnml1135 commented Apr 4, 2024

johnml1135 left a comment

Choose a reason for hiding this comment

ddaspit left a comment

Choose a reason for hiding this comment

johnml1135 commented Apr 8, 2024

johnml1135 commented Apr 8, 2024

johnml1135 commented Apr 8, 2024

johnml1135 commented Apr 8, 2024

johnml1135 commented Apr 8, 2024

johnml1135 left a comment

Choose a reason for hiding this comment

johnml1135 commented Apr 8, 2024

johnml1135 left a comment

Choose a reason for hiding this comment

ddaspit left a comment

Choose a reason for hiding this comment

johnml1135 commented Apr 10, 2024

johnml1135 commented Apr 10, 2024

johnml1135 left a comment

Choose a reason for hiding this comment

johnml1135 commented Apr 11, 2024

johnml1135 left a comment

Choose a reason for hiding this comment

johnml1135 commented Apr 11, 2024

johnml1135 left a comment

Choose a reason for hiding this comment

ddaspit commented Apr 2, 2024 •

edited

Loading

codecov-commenter commented Apr 2, 2024 •

edited

Loading