-
-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for non-verse text segments in Scripture corpora #179
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Big update! Pretty cool to have this capability - thank you for your hard work ❤️ . Not much I feel equipped to comment on - the rest is all on you, John 😆 .
Reviewed 34 of 34 files at r1, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @ddaspit and @johnml1135)
src/SIL.Machine/Corpora/ScriptureRef.cs
line 42 at r1 (raw file):
public VerseRef VerseRef { get; } public IReadOnlyList<ScriptureElement> Path { get; }
Why is this called Path
given it's a list of scripture elements associated with the ref?
src/SIL.Machine/Corpora/TextCorpusBase.cs
line 10 at r1 (raw file):
public abstract IEnumerable<IText> Texts { get; } public abstract bool IsTokenized { get; } public abstract ScrVers Versification { get; }
I'm sure you've thought this through better than I can, but is there no more semantically appropriate alternative to having all the corpora have versifications (even if they're just null)? Couldn't you have parallel scripture/non-scripture inheritance structures? Is what you've done preferable because there's no easy way to reuse code across parallel paths like that?
src/SIL.Machine/Corpora/UsfmTextBase.cs
line 182 at r1 (raw file):
else if ( (CurrentTextType == ScriptureTextType.Verse && state.IsVerseText) || CurrentTextType != ScriptureTextType.Verse
Is there a reason you can't/shouldn't simplify?
Code snippet:
CurrentTextType != ScriptureTextType.Verse || state.IsVerseText
src/SIL.Machine/Corpora/UsfmTextBase.cs
line 218 at r1 (raw file):
&& ( (CurrentTextType == ScriptureTextType.Verse && state.IsVerseText) || CurrentTextType != ScriptureTextType.Verse
Same as note above about simplification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @Enkidu93 and @johnml1135)
src/SIL.Machine/Corpora/ScriptureRef.cs
line 42 at r1 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
Why is this called
Path
given it's a list of scripture elements associated with the ref?
I agree. It is not a great name, but I wasn't sure what else to call it. I called it Path
, because it captures the path to the particular segment in the USFM hierarchy.
src/SIL.Machine/Corpora/TextCorpusBase.cs
line 10 at r1 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
I'm sure you've thought this through better than I can, but is there no more semantically appropriate alternative to having all the corpora have versifications (even if they're just null)? Couldn't you have parallel scripture/non-scripture inheritance structures? Is what you've done preferable because there's no easy way to reuse code across parallel paths like that?
I didn't love this change, but I couldn't come up with a better way. I need some way to easily tell if a corpus is a Scripture corpus and uses ScriptureRef
s. Machine supports operations on corpora, which are implemented through decorator classes. These decorator classes are generic and can be applied to any type of corpus. This obscures the original type of the corpus, so we can't reliably use is ScriptureTextCorpus
or anything like that. The other way to determine if it is a Scripture corpus was to actually iterate through the rows, grab the ref of the first row, and check if it is a ScriptureRef
, but that is really hacky. I think this is the best option.
src/SIL.Machine/Corpora/UsfmTextBase.cs
line 182 at r1 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
Is there a reason you can't/shouldn't simplify?
There isn't really any reason I can't simplify to the suggested conditional. I was just trying to make the intent clearer. I will go ahead and simplify it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 33 of 34 files reviewed, 4 unresolved discussions (waiting on @Enkidu93 and @johnml1135)
src/SIL.Machine/Corpora/UsfmTextBase.cs
line 182 at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
There isn't really any reason I can't simplify to the suggested conditional. I was just trying to make the intent clearer. I will go ahead and simplify it.
Done
src/SIL.Machine/Corpora/UsfmTextBase.cs
line 218 at r1 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
Same as note above about simplification.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ddaspit and @johnml1135)
src/SIL.Machine/Corpora/ScriptureRef.cs
line 42 at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
I agree. It is not a great name, but I wasn't sure what else to call it. I called it
Path
, because it captures the path to the particular segment in the USFM hierarchy.
I see. OK. Nothing better comes to mind. Maybe John will have a thought.
src/SIL.Machine/Corpora/TextCorpusBase.cs
line 10 at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
I didn't love this change, but I couldn't come up with a better way. I need some way to easily tell if a corpus is a Scripture corpus and uses
ScriptureRef
s. Machine supports operations on corpora, which are implemented through decorator classes. These decorator classes are generic and can be applied to any type of corpus. This obscures the original type of the corpus, so we can't reliably useis ScriptureTextCorpus
or anything like that. The other way to determine if it is a Scripture corpus was to actually iterate through the rows, grab the ref of the first row, and check if it is aScriptureRef
, but that is really hacky. I think this is the best option.
That makes sense. This might be even more hacky, but is there some way you could make a helper method that uses reflection to see if the corpus has a versification property? I suppose this isn't really practically so different than having the versification where inapplicable set to null, but could it keep Versification
from floating around in classes where it doesn't belong?
This looks fragile. There should be a spec that this is designed to. Either it is general, or there is a specific set of labels that should work in a very specific format. I notice that something like ABC:123 would break it, in that it would be split into 2, and ABC would not be able to parse into a number. |
Notes from the code coverage:
|
Previously, johnml1135 (John Lambert) wrote…
We should make a wiki page explaining how the parsing works - or an extended comment at the beginning of this file. |
This should be _curVerseRef - as it can change. |
Previously, johnml1135 (John Lambert) wrote…
And _curElements |
Previously, johnml1135 (John Lambert) wrote…
Hmm. I'm looking at it more. I am just wondering why you use _curTextType, but not _cur for the other ones... |
Genius. |
I think I got it. The reference would be:
There should be an explanation of this (at least some examples) - it took me a while to figure out. |
Previously, johnml1135 (John Lambert) wrote…
So you create these for Notes (all types), Paragraphs (all types) and Cells (no other types of characters). Is this the intention (it seems reasonable from little that I see)? This needs to be documented in Swagger. |
Why is this popped here? It would not be used anymore. I would assume either it would be cleared or left alone... |
So, for duplicated verses, just take the first one and then ignore all duplicates? Needs documentation... |
Previously, johnml1135 (John Lambert) wrote…
Why was the duplicate / merge verse logic needed to be added now? Was it just moved from somewhere else? It is not specific to non-verse text... |
If it's truly a duplicate verse (and we are just throwing it on the ground), why are we also updating the verse reference? Wouldn't the marker be the same? And if it was not, we would want the first marker, not the one that came from the duplicated text. |
What happens for alternative verses in descriptive sections? They would not be handled as verses.. https://ubsicap.github.io/usfm/chapters_verses/index.html#va-va. Would /va be counted as a verse? Would then the description be counted as duplicate? |
What is intended here? Only at the beginning of books do we update from the parser state, but not later? Why not later? |
As another question - _verseRef is never initialized and I assume it starts as a null. Should this be the case? Or should it be initialized to default? |
So, if it is a verse (or a note), just strip out all formatting? If there is a paragraph or a table in a verse or cell, squish it together and translate the whole thing? |
Previously, johnml1135 (John Lambert) wrote…
Documentation... |
This appears to never be called |
Previously, Enkidu93 (Eli C. Lowry) wrote…
Or, add another property to TextCorpusBase called "CorpusType" or similar that would be hard coded to either a string or an enum of "TextCorpus" or "ScriptureCorpus". This should survive decorators and be more explicit when trying to determine the underlying type. |
So, close all USFM markers on the end of USFM (makes sense), the beginnings of notes, chapters and books as well as verses that don't have verse text? Is that correct? |
This is where _verseRef was set to default - but not anymore. But it is still asked if it IsDefault... |
This logic is retranscribed - but also duplicates are checked, with odd logic (by my estimation). |
This is an odd dance between the base and parent class - handling non-verse cells in the base and verse cells right here... |
Ok, so if it isn't a verse or is a verse? I don't understand... |
So, if I understand, If it's a verse, it is all glumbed together. If it is out of a verse, each part is treated individually. If markers are included, keep the notes in place. |
This logic replaces Sentence completed function. Has the logic changed? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 16 of 34 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: all files reviewed, 21 unresolved discussions (waiting on @ddaspit and @Enkidu93)
So, alternate verses and other special tokens will be skipped over if we are replacing the text? Would we get those from the translated text? Are we sure we want to blow away the special tokens? |
Are we sure there are only footnotes? What about endnotes? Don't we have the marker in the Scripture reference that we can pull out and use? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 34 files at r1.
Reviewable status: all files reviewed, 23 unresolved discussions (waiting on @ddaspit and @Enkidu93)
- add new ScriptureRef corpus ref class - update Scripture corpora classes to use ScriptureRef - add ScriptureRefUsfmParserHandlerBase class to track ScriptureRef in USFM - update UsfmTextUpdater and UsfmTextBase to use ScriptureRefUsfmParserHandlerBase - add support for updating non-Scripture paragraphs and notes - update NmtPreprocessBuildJob to support non-Scripture segments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 32 of 34 files reviewed, 22 unresolved discussions (waiting on @Enkidu93 and @johnml1135)
src/SIL.Machine/Corpora/ScriptureRef.cs
line 29 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
We should make a wiki page explaining how the parsing works - or an extended comment at the beginning of this file.
Unfortunately, there is no spec for specifying a reference to a particular text segment in a USFM/USX file. I had to make one up that fits our needs. The ScriptureRef
gets serialized to a specific format. If the format is invalid, then it might fail to parse.
src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs
line 16 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Hmm. I'm looking at it more. I am just wondering why you use _curTextType, but not _cur for the other ones...
Done.
src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs
line 18 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
As another question - _verseRef is never initialized and I assume it starts as a null. Should this be the case? Or should it be initialized to default?
VerseRef
is a struct (value type), so it isn't nullable by default.
src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs
line 35 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Why is this popped here? It would not be used anymore. I would assume either it would be cleared or left alone...
It doesn't have to be. I was just keeping the logic consistent across the entire class.
src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs
line 60 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
What happens for alternative verses in descriptive sections? They would not be handled as verses.. https://ubsicap.github.io/usfm/chapters_verses/index.html#va-va. Would /va be counted as a verse? Would then the description be counted as duplicate?
They would be ignored.
src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs
line 69 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Why was the duplicate / merge verse logic needed to be added now? Was it just moved from somewhere else? It is not specific to non-verse text...
This logic was moved from the UsfmTextBase
class.
src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs
line 71 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
If it's truly a duplicate verse (and we are just throwing it on the ground), why are we also updating the verse reference? Wouldn't the marker be the same? And if it was not, we would want the first marker, not the one that came from the duplicated text.
Done.
src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs
line 102 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
What is intended here? Only at the beginning of books do we update from the parser state, but not later? Why not later?
We only update the verse ref once we hit the first paragraph. Prior to that we are in the book identification section.
src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs
line 126 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Documentation...
Yes, that is correct.
src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs
line 205 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
I think I got it. The reference would be:
- Verse reference: MAT 3:12-22
- Elements: Section 4, Paragraph 3, Cell 18.
- Elements as they are in here: Section 4, Section 0, Paragraph 3, Paragraph 0, Cell 18 (0's get removed)
- "Parent elements" can contain others - element "0" symbolically
There should be an explanation of this (at least some examples) - it took me a while to figure out.
I added a description of how scripture refs work in the class documentation.
src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs
line 230 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
So you create these for Notes (all types), Paragraphs (all types) and Cells (no other types of characters). Is this the intention (it seems reasonable from little that I see)? This needs to be documented in Swagger.
Yes, that is correct.
src/SIL.Machine/Corpora/ScriptureText.cs
line 104 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
This appears to never be called
I left it in as a helper method for classes that extend this abstract base class.
src/SIL.Machine/Corpora/TextCorpusBase.cs
line 10 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Or, add another property to TextCorpusBase called "CorpusType" or similar that would be hard coded to either a string or an enum of "TextCorpus" or "ScriptureCorpus". This should survive decorators and be more explicit when trying to determine the underlying type.
@Enkidu93 Reflection should only be used as a last resort.
@johnml1135 Yes, but we would still need to have the Versification
property, so I figured it was easier to just add this one property and return null
if it isn't a Scripture corpus. There is an IsScripture
extension method on ITextCorpus
that checks this property.
src/SIL.Machine/Corpora/UsfmParser.cs
line 144 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
So, close all USFM markers on the end of USFM (makes sense), the beginnings of notes, chapters and books as well as verses that don't have verse text? Is that correct?
This is a fix for what was essentially a bug in the UsfmParser
where the End<element>
methods would not get called at the end of the USFM file. This fix ensures that they get called properly.
src/SIL.Machine/Corpora/UsfmTextBase.cs
line 70 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
This is where _verseRef was set to default - but not anymore. But it is still asked if it IsDefault...
I'm not sure I understand what you mean.
src/SIL.Machine/Corpora/UsfmTextBase.cs
line 78 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
This logic is retranscribed - but also duplicates are checked, with odd logic (by my estimation).
I'm not sure what you mean.
src/SIL.Machine/Corpora/UsfmTextBase.cs
line 111 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
This is an odd dance between the base and parent class - handling non-verse cells in the base and verse cells right here...
If a table cell occurs inside of a verse, then we need to make sure that we insert a space in the collected verse text.
src/SIL.Machine/Corpora/UsfmTextBase.cs
line 180 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Ok, so if it isn't a verse or is a verse? I don't understand...
The state.IsVerseText
refers specifically to verse text within a verse. A verse can contain text that isn't verse text.
src/SIL.Machine/Corpora/UsfmTextBase.cs
line 254 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
So, if I understand, If it's a verse, it is all glumbed together. If it is out of a verse, each part is treated individually. If markers are included, keep the notes in place.
Yes, that sound about right.
src/SIL.Machine/Corpora/UsfmTextBase.cs
line 293 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
This logic replaces Sentence completed function. Has the logic changed?
No, it hasn't. I added this in so that the sentence start property is set correctly for non-verse rows.
src/SIL.Machine/Corpora/UsfmTextUpdater.cs
line 268 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Are we sure there are only footnotes? What about endnotes? Don't we have the marker in the Scripture reference that we can pull out and use?
This marker is used for both footnotes and endnotes.
src/SIL.Machine/Corpora/UsfmTextUpdater.cs
line 351 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
So, alternate verses and other special tokens will be skipped over if we are replacing the text? Would we get those from the translated text? Are we sure we want to blow away the special tokens?
Yes, if we are skipping a marker, then we should also skip over all of the special tokens that are related to that marker.
Previously, ddaspit (Damien Daspit) wrote…
I like the description - that tied together with the tests make it understandable for someone coming in for the first time. We can also add something to Swagger to help people if they pull it from the JSON: sillsdev/serval#356. |
Previously, ddaspit (Damien Daspit) wrote…
Ah, the details of Dotnet. |
Previously, ddaspit (Damien Daspit) wrote…
I misread. |
Previously, ddaspit (Damien Daspit) wrote…
The logic was moved of looking for overlapping verses, etc. from here to ScriptureRefParserHandlerBase. Also when it was moved duplicate verse checking was modified. I am ok with it. |
Previously, ddaspit (Damien Daspit) wrote…
https://ubsicap.github.io/usfm/notes_basic/fnotes.html?highlight=endnote. Reading the documentation is a little confusing but a few thoughts:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 16 of 34 files at r1, 2 of 2 files at r3, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ddaspit)
Did you update any tests for the code coverage? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updates to test coverage?
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ddaspit)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @johnml1135)
src/SIL.Machine/Corpora/UsfmTextUpdater.cs
line 268 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
https://ubsicap.github.io/usfm/notes_basic/fnotes.html?highlight=endnote. Reading the documentation is a little confusing but a few thoughts:
- \ft always appears to be in a broader \f ... \f* frame - and I see no \ft* examples
- There are some other things that would likely need to be translated including keywords (fk)
- quotes (fq) should also be translated
- The footnote markers appear to change contexts - do we only translate stuff in the "ft" context?
We translate everything, we just don't have any way to preserve the markers, so we just put everything in an ft
marker.
Previously, johnml1135 (John Lambert) wrote…
I added a wiki page to explain the parsing: https://github.com/sillsdev/serval/wiki/USFM-Parsing-and-Translation |
Previously, ddaspit (Damien Daspit) wrote…
So, here is an example:
So you are saying that we just translate the whole ft and strip out the \fr, \fqa, +fv's while keeping the \f and \f* bracketing marks? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit)
Previously, johnml1135 (John Lambert) wrote…
Whatever it is is good enough. If it becomes an issue, we can make a later improvement. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @ddaspit)
I'll make a new issue for test coverage - #181 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @ddaspit)
This change is