Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added chapter-level filtering #161

Merged
merged 4 commits into from
Feb 6, 2024
Merged

Conversation

Enkidu93
Copy link
Collaborator

@Enkidu93 Enkidu93 commented Jan 26, 2024

Addresses sillsdev/serval#150

I'm noticing that (as far as I can tell) the "trainOn" logic is not present in SMT jobs. It makes sense why pretranslate logic wouldn't be there, but should we incorporate the trainOn logic? I won't add the biblicalRange logic to SMT for training until it's been verified that that was a mistake.


This change is Reviewable

@Enkidu93
Copy link
Collaborator Author

My csharpier was broken (+ I forgot about the mismatched .NET versions). I'll go ahead and take care of that.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than require that all engine implementations be able to parse the Scripture range string, I think it would be better to perform the parsing in Serval. Serval will already have a dependency on SIL.Machine for USFM generation, so we can do the same for the Scripture range parsing.

Reviewed 13 of 13 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93 and @johnml1135)


src/SIL.Machine.AspNetCore/Services/BiblicalRangeStringParser.cs line 0 at r1 (raw file):
I order to better match up with Machine.py:

  • this class should be moved to SIL.Machine/Scripture
  • Turned into a static class and renamed to ScriptureRangeParser
  • The Parse should be made static and renamed to GetChapters.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And then pass a dictionary rather than a string from Serval to Machine? You're not suggesting the other filtering logic is somehow put into Serval, right?

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit and @johnml1135)


src/SIL.Machine.AspNetCore/Services/BiblicalRangeStringParser.cs line at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I order to better match up with Machine.py:

  • this class should be moved to SIL.Machine/Scripture
  • Turned into a static class and renamed to ScriptureRangeParser
  • The Parse should be made static and renamed to GetChapters.

That seems reasonable.

@johnml1135
Copy link
Collaborator

src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs line 141 at r1 (raw file):

                        var parser = new BiblicalRangeStringParser(stc.Versification);
                        if(corpus.TrainOnBiblicalRange != null && corpus.TrainOnBiblicalRange != ""){
                            Dictionary<string, List<int>> trainOnBiblicalRangeChapters = parser.Parse(corpus.TrainOnBiblicalRange); //TODO calculate once

So, it's calculated for every row of text? That should be fixed...

@johnml1135
Copy link
Collaborator

src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs line 142 at r1 (raw file):

                        if(corpus.TrainOnBiblicalRange != null && corpus.TrainOnBiblicalRange != ""){
                            Dictionary<string, List<int>> trainOnBiblicalRangeChapters = parser.Parse(corpus.TrainOnBiblicalRange); //TODO calculate once
                            isInTrainOnRange = rowChaptersPerBook.Join(trainOnBiblicalRangeChapters, rcpb => rcpb.Key, tobrc => tobrc.Key, (rcbp, tobrc) =>

So, to be clear, if the verse is in either the source or the target versification, it will work? Is that what we want?

@Enkidu93
Copy link
Collaborator Author

src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs line 141 at r1 (raw file):

                        var parser = new BiblicalRangeStringParser(stc.Versification);
                        if(corpus.TrainOnBiblicalRange != null && corpus.TrainOnBiblicalRange != ""){
                            Dictionary<string, List<int>> trainOnBiblicalRangeChapters = parser.Parse(corpus.TrainOnBiblicalRange); //TODO calculate once

So, it's calculated for every row of text? That should be fixed...

Yes, apparently, I missed my own TODO haha. Thank you!

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we would just parse the string into a data structure that we pass to Machine.

Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @Enkidu93 and @johnml1135)

@johnml1135
Copy link
Collaborator

src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs line 147 at r1 (raw file):

                        }
                        if(corpus.PretranslateBiblicalRange != null && corpus.PretranslateBiblicalRange != ""){
                            Dictionary<string, List<int>> pretranslateBiblicalRangeChapters = parser.Parse(corpus.PretranslateBiblicalRange);

Also this one needs pulled out of the loop too.

@Enkidu93
Copy link
Collaborator Author

src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs line 147 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Also this one needs pulled out of the loop too.

Yep! Thank you

@johnml1135
Copy link
Collaborator

src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs line 137 at r1 (raw file):

                    bool isInTrainOnRange = false;
                    bool isInPretranslateRange = false;
                    if(targetCorpora[CorpusType.Text] is ScriptureTextCorpus stc && row.Refs.All(r => r is VerseRef)){

We should probably only do these checks if there is a biblical range - if not, just skip it. Less computation?

@johnml1135
Copy link
Collaborator

tests/SIL.Machine.AspNetCore.Tests/Services/data/paratext2/TermRenderings.xml line 2 at r1 (raw file):

<TermRenderingsList>
  <TermRendering Id="Abba" Guess="false">

What is the purpose of these term renderings and other files?

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 13 of 13 files at r1, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @ddaspit and @Enkidu93)


tests/SIL.Machine.AspNetCore.Tests/Services/BiblicalRangeStringParserTests.cs line 73 at r1 (raw file):

        yield return new TestCaseData("NT,OT,-MRK,-EXO", new Dictionary<string, List<int>>(), true);
        yield return new TestCaseData("OT,MAT1", new Dictionary<string, List<int>>(), true);
        yield return new TestCaseData("OT,MAT-LUK", new Dictionary<string, List<int>>(), true);

Is this the same list from python?

@johnml1135
Copy link
Collaborator

Hmmm. It could make errors easier to handle. Wouldn't it also be good then to pull textId filtering into Serval as well?

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I think the idea is to keep the filtering in Machine, just move the actual parsing of the string to Machine. Ultimately, there will still be filtration code in all engine jobs regardless - it just means that Serval is responsible for parsing the range - which kind of makes sense semantically. This way too, we can catch that parsing error earlier on.

Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @ddaspit and @johnml1135)


tests/SIL.Machine.AspNetCore.Tests/Services/BiblicalRangeStringParserTests.cs line 73 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Is this the same list from python?

Yep, all the same test cases


tests/SIL.Machine.AspNetCore.Tests/Services/data/paratext2/TermRenderings.xml line 2 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

What is the purpose of these term renderings and other files?

Originally, our only Paratext NmtPreprocessBuildJob tests were mapping a single Paratext project as source and target. That didn't allow me enough control to make good biblical range tests.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ddaspit I'm realizing the downside of this approach is that we don't have access to the versification in Serval (which the ScriptureRangeParser is using to determine books/counts accurately). I'm not sure how we'd get around that unless you don't imagine that will really be an issue. Let me know as soon as possible what you think.

Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @ddaspit and @johnml1135)

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to write code that accesses the versification for the USFM generation PR. I will abstract that functionality into a service that you can use.

Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @Enkidu93 and @johnml1135)

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I guess if you're interdependent, you might as well go all the way haha. Thank you Let me know when that's in place.

Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @ddaspit and @johnml1135)

@johnml1135
Copy link
Collaborator

src/SIL.Machine.AspNetCore/Models/Corpus.cs line 11 at r2 (raw file):

    public bool PretranslateAll { get; set; }
    public Dictionary<string, List<int>>? TrainOnChapters { get; set; }
    public Dictionary<string, List<int>>? PretranslateChapters { get; set; }

Chapters or ScriptureRange?

@johnml1135
Copy link
Collaborator

src/SIL.Machine.AspNetCore/Models/Corpus.cs line 11 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Chapters or ScriptureRange?

I see Damien's comment at the bottom. Ok.

@johnml1135
Copy link
Collaborator

src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs line 156 at r2 (raw file):

                                            || (rcbp.Value.Count() > 0 && tobrc.Value.Count() == 0) //Empty list means all chapters from book
                                    )
                                    .Any(b => b);

That logic appears solid, namely:

  • Both train on chapters and rows are dictionaries: book (string): chapters (list of ints)
  • Join the two, looking for intersections. If there is an intersection, well and good. If, not, if both row and filter have a book, but the filter has no chapter, it is valid as well. Any matches, include the verse.

The only thing I wonder is if could be optimized more (see here for intersection explanation - https://stackoverflow.com/questions/14527595/intersection-of-two-sets-in-most-optimized-way). An alternative could be:

  • Per row verse reference, get filter's book. if ref chapter in filter list, good, or if filter list size == 0. Any hits and it is included.
  • This would be O(# of verse references in row * number of chapters in book * rows)
  • We could be crazy about optimization and make an array 256 boolean for the filter and then check the chapter as a lookup in that array and get O(# of verse references in row * rows), but it would be more obtuse and give diminishing returns.

It may save some cycles - what do you think? I am fine with it as is as I am not anticipating a meaningful performance hit even with O(NMP).

@johnml1135
Copy link
Collaborator

src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs line 183 at r2 (raw file):

                            corpus.PretranslateAll
                            || corpus.PretranslateTextIds.Contains(row.TextId)
                            || isInPretranslateChapters

Just to clarify how references work. There can be multiple references - is it the source and target reference? Are we assuming only one actual verse reference? And therefore only one book (Daniel vs. Bell and the Dragon)? It's a severe edge case - just checking.

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 9 of 9 files at r2, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @ddaspit and @Enkidu93)

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 8 of 9 files at r2, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @Enkidu93)


src/SIL.Machine/Scripture/ScriptureRangeParser.cs line 8 at r2 (raw file):

using SIL.Scripture;

public class ScriptureRangeParser

It would be nice to add a static convenience method GetChapters(string chapterSelection, ScrVers versification = null).


src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs line 138 at r2 (raw file):

                    {
                        Dictionary<string, List<int>>? rowChaptersPerBook = null;
                        if (corpus.TrainOnChapters != null || corpus.PretranslateChapters != null)

This seems much simpler and clearer to me:

bool IsInChapters(Dictionary<string, List<int>> bookChapters, object rowRef)
{
    if (rowRef is not VerseRef vr)
        return false;
    return bookChapters.TryGetValue(vr.Book, out List<int>? chapters)
        && chapters.Contains(vr.ChapterNum);
}

if (corpus.TrainOnChapters is not null)
    isInTrainOnChapters = row.Refs.Any(r => IsInChapters(corpus.TrainOnChapters, r));
if (corpus.PretranslateChapters is not null)
    isInPretranslateChapters = row.Refs.Any(r => IsInChapters(corpus.PretranslateChapters, r));

Also, it would be slightly more efficient ifTrainOnChapters and PretranslateChapters were Dictionary<string, HashSet<int>>.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs line 137 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

We should probably only do these checks if there is a biblical range - if not, just skip it. Less computation?

Yeah - and that should be rolled in with the refactoring of the logic Damien suggests.


src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs line 142 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

So, to be clear, if the verse is in either the source or the target versification, it will work? Is that what we want?

I'm not sure what you mean here. Are you saying we should just use one of either SourceRefs or TargetRefs here (whichever we use Serval-side)? The refs is kind of a shorthand for SourceRefs if there are any; otehrwise, targetrefs.


src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs line 138 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This seems much simpler and clearer to me:

bool IsInChapters(Dictionary<string, List<int>> bookChapters, object rowRef)
{
    if (rowRef is not VerseRef vr)
        return false;
    return bookChapters.TryGetValue(vr.Book, out List<int>? chapters)
        && chapters.Contains(vr.ChapterNum);
}

if (corpus.TrainOnChapters is not null)
    isInTrainOnChapters = row.Refs.Any(r => IsInChapters(corpus.TrainOnChapters, r));
if (corpus.PretranslateChapters is not null)
    isInPretranslateChapters = row.Refs.Any(r => IsInChapters(corpus.PretranslateChapters, r));

Also, it would be slightly more efficient ifTrainOnChapters and PretranslateChapters were Dictionary<string, HashSet<int>>.

Good call. This can be collapsed in a way that was more difficult before when the parsing was here. My mind is working at flu speed at the moment, but I don't think your code would properly handle the 'empty list = all chapters' idiom, right? It's an easy addition, but it just brought to my attention that we should make sure we're all on the same page about that.

As for the HashSet - definitely, I can convert it to a hash set when it's being mapped (which is a cost in and of itself), but probably still cleaner.


src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs line 156 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

That logic appears solid, namely:

  • Both train on chapters and rows are dictionaries: book (string): chapters (list of ints)
  • Join the two, looking for intersections. If there is an intersection, well and good. If, not, if both row and filter have a book, but the filter has no chapter, it is valid as well. Any matches, include the verse.

The only thing I wonder is if could be optimized more (see here for intersection explanation - https://stackoverflow.com/questions/14527595/intersection-of-two-sets-in-most-optimized-way). An alternative could be:

  • Per row verse reference, get filter's book. if ref chapter in filter list, good, or if filter list size == 0. Any hits and it is included.
  • This would be O(# of verse references in row * number of chapters in book * rows)
  • We could be crazy about optimization and make an array 256 boolean for the filter and then check the chapter as a lookup in that array and get O(# of verse references in row * rows), but it would be more obtuse and give diminishing returns.

It may save some cycles - what do you think? I am fine with it as is as I am not anticipating a meaningful performance hit even with O(NMP).

Definitely - the optimizations are endless haha. Because this is bounded (we know how many chapters there are), I don't know how crazy we need to get here. Let me know if you'd like a motion in this direction.


src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs line 183 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Just to clarify how references work. There can be multiple references - is it the source and target reference? Are we assuming only one actual verse reference? And therefore only one book (Daniel vs. Bell and the Dragon)? It's a severe edge case - just checking.

No, not source and target. If you peek at ParallelTextRow, you'll see that the refs is shorthand for source refs if there are any; otherwise, target refs. For the most part, there should only be one verse reference. Damien can correct me, but I think the idea of multiple references is more that sometimes a chunk of verses is translated as a unit because of the peculiarities of a language (like say it's topic-comment and rearranging phrases from contiguous verses is necessary to make the reading natural). I could be way off haha. @ddaspit ?

@johnml1135
Copy link
Collaborator

src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs line 156 at r2 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Definitely - the optimizations are endless haha. Because this is bounded (we know how many chapters there are), I don't know how crazy we need to get here. Let me know if you'd like a motion in this direction.

I think Damien resolves all optimization issues, including the hashset - bringing it to O(n).

@johnml1135
Copy link
Collaborator

src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs line 142 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

I'm not sure what you mean here. Are you saying we should just use one of either SourceRefs or TargetRefs here (whichever we use Serval-side)? The refs is kind of a shorthand for SourceRefs if there are any; otehrwise, targetrefs.

It should be good enough. If we run into future problems, we can seek to address it.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 5 files at r3, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @Enkidu93)


src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs line 183 at r2 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

No, not source and target. If you peek at ParallelTextRow, you'll see that the refs is shorthand for source refs if there are any; otherwise, target refs. For the most part, there should only be one verse reference. Damien can correct me, but I think the idea of multiple references is more that sometimes a chunk of verses is translated as a unit because of the peculiarities of a language (like say it's topic-comment and rearranging phrases from contiguous verses is necessary to make the reading natural). I could be way off haha. @ddaspit ?

@Enkidu93, you are correct. That is an accurate description.


src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs line 142 at r3 (raw file):

                                return false;
                            return bookChapters.TryGetValue(vr.Book, out HashSet<int>? chapters)
                                && (chapters.Contains(vr.ChapterNum) || chapters.Count() == 0);

You should use the Count property instead of the Count() LINQ method. Using Count() can be dangerous, because in certain situations it computes the count by enumerating over the whole collection.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 14 of 15 files reviewed, 2 unresolved discussions (waiting on @ddaspit)


src/SIL.Machine/Scripture/ScriptureRangeParser.cs line 8 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

It would be nice to add a static convenience method GetChapters(string chapterSelection, ScrVers versification = null).

Done.


src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs line 142 at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You should use the Count property instead of the Count() LINQ method. Using Count() can be dangerous, because in certain situations it computes the count by enumerating over the whole collection.

Done. Thank you! Good catch. I'm so accustomed to using the method because I'm working with a generic IEnumerable.

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 5 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit)

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r5, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)

@johnml1135 johnml1135 merged commit 4ddecac into master Feb 6, 2024
1 of 3 checks passed
@ddaspit ddaspit deleted the chapter_level_specification branch February 7, 2024 14:28
Enkidu93 added a commit that referenced this pull request Feb 7, 2024
@Enkidu93 Enkidu93 restored the chapter_level_specification branch February 7, 2024 15:12
ddaspit added a commit that referenced this pull request Feb 7, 2024
@ddaspit ddaspit deleted the chapter_level_specification branch February 7, 2024 19:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants