Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move preprocess logic to toolkit #512

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

Enkidu93
Copy link
Collaborator

@Enkidu93 Enkidu93 commented Oct 16, 2024

Fixes #396

How should we go about testing the new Echo engine functionality? Just add integration tests?


This change is Reviewable

@johnml1135
Copy link
Collaborator

src/Serval/src/Serval.Shared/Serval.Shared.csproj line 22 at r1 (raw file):

		<PackageReference Include="Grpc.HealthCheck" Version="2.65.0" />
		<PackageReference Include="Grpc.Net.ClientFactory" Version="2.65.0" />
		<PackageReference Include="SIL.Machine" Version="3.2.8" Condition="!Exists('..\..\..\..\..\machine\src\SIL.Machine\SIL.Machine.csproj')" />

Pull the most recent from main - we are up to Machine 3.4.0 - and there are 4 locations that all need to be updated together.

@johnml1135
Copy link
Collaborator

src/ServiceToolkit/src/SIL.ServiceToolkit/Utils/ParallelCorpusPreprocessor.cs line 157 at r1 (raw file):

            }

            foreach (Row row in AlignPretranslateCorpus(sourcePretranslateCorpora, targetCorpora[0].TextCorpus))

Pretranslations are not universal. Could this be broken up into a few major pieces - and then one Machine specific function inherit from this class to tie the main pieces together? The main processing of "monocorpus" and "parallelCorpus" are universal - and we should be able to have a few routines to slice and dice these things so that the 10 line linq expression only happens once...

@johnml1135
Copy link
Collaborator

src/ServiceToolkit/src/SIL.ServiceToolkit/Utils/ParallelCorpusPreprocessor.cs line 179 at r1 (raw file):

    )
    {
        srcCorpora = srcCorpora.Select(sc => sc.Transform(CleanSegment)).ToArray();

So does this follow the logic - Mix the sources, chose the first target? I can't see it clearly - should it be documented here?

@johnml1135
Copy link
Collaborator

src/ServiceToolkit/src/SIL.ServiceToolkit/Utils/ParallelCorpusPreprocessor.cs line 295 at r1 (raw file):

    }

    private static IEnumerable<Row> AlignPretranslateCorpus(ITextCorpus[] srcCorpora, ITextCorpus trgCorpus)

Align Train and Align Pretranslate. If these will be in the ServiceToolkit, could we change the name to something more generic? AlignTrainCorpus may be good enough, but I would prefer something like "AlignResultsGenerationCorpus" or even more generically, "AlignMixSource" and "AlignChooseFirstSource". Would that work?

@johnml1135
Copy link
Collaborator

Add a new set of unit tests in service toolkit?

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 22 of 22 files at r1, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @ddaspit and @Enkidu93)

Fix build error

Update Echo engine to use toolkit

Fix async stream issue

Fix test: Add ability to specify CorpusService mock
@Enkidu93 Enkidu93 force-pushed the move_preprocess_logic_to_toolkit branch from 5a61291 to 3ec2bce Compare October 16, 2024 18:46
Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a unit test in the toolkit. I think ideally we'll port almost all the current PreprocessBuildTests to the toolkit. I'm wondering though if we should try to get this in first and then follow up with that in a separate PR since this PR will be difficult to merge once other changes have been put through.

Reviewable status: 17 of 29 files reviewed, 7 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/Echo/src/EchoTranslationEngine/TranslationEngineServiceV1.cs line 98 at r3 (raw file):

                                    },
                                    cancellationToken
                                );

@ddaspit, it might just be because it's late, but it's not clear to me how to go about doing this: I seem to be doing something wrong with how I'm passing in this async delegate and it's causing the Echo engine to hang and not respond. Any thoughts? I've not exhausted my debugging (although the debugger keeps crashing on me), but I thought maybe you'd just see it right off.


src/Machine/test/Serval.Machine.Shared.Tests/Services/PreprocessBuildJobTests.cs line 94 at r3 (raw file):

    [Test]
    public async Task RunAsync_PretranslateTextIdsOverlapWithTrainOnTextIds()

I also fixed a bug regarding this situation and added a test.


src/ServiceToolkit/src/SIL.ServiceToolkit/Utils/ParallelCorpusPreprocessor.cs line 157 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Pretranslations are not universal. Could this be broken up into a few major pieces - and then one Machine specific function inherit from this class to tie the main pieces together? The main processing of "monocorpus" and "parallelCorpus" are universal - and we should be able to have a few routines to slice and dice these things so that the 10 line linq expression only happens once...

Discussed in meeting


src/ServiceToolkit/src/SIL.ServiceToolkit/Utils/ParallelCorpusPreprocessor.cs line 179 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

So does this follow the logic - Mix the sources, chose the first target? I can't see it clearly - should it be documented here?

That happens elsewhere, but yep, that's the logic. I can add more comments if something seems unclear.


src/ServiceToolkit/src/SIL.ServiceToolkit/Utils/ParallelCorpusPreprocessor.cs line 295 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Align Train and Align Pretranslate. If these will be in the ServiceToolkit, could we change the name to something more generic? AlignTrainCorpus may be good enough, but I would prefer something like "AlignResultsGenerationCorpus" or even more generically, "AlignMixSource" and "AlignChooseFirstSource". Would that work?

Discussed in meeting.


src/ServiceToolkit/src/SIL.ServiceToolkit/Utils/ParallelCorpusPreprocessor.cs line 313 at r3 (raw file):

                .SelectMany(sc => trgCorpora.Select(tc => sc.AlignRows(tc, allSourceRows: true)))
                .ZipMany(rows =>
                    rows.Where(r => r.SourceSegment.Count > 0 && r.TargetSegment.Count == 0).FirstOrDefault()

This addresses #514


src/Serval/src/Serval.Shared/Serval.Shared.csproj line 22 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Pull the most recent from main - we are up to Machine 3.4.0 - and there are 4 locations that all need to be updated together.

Done.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 17 of 29 files reviewed, 7 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/Echo/src/EchoTranslationEngine/TranslationEngineServiceV1.cs line 98 at r3 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

@ddaspit, it might just be because it's late, but it's not clear to me how to go about doing this: I seem to be doing something wrong with how I'm passing in this async delegate and it's causing the Echo engine to hang and not respond. Any thoughts? I've not exhausted my debugging (although the debugger keeps crashing on me), but I thought maybe you'd just see it right off.

Nevermind. It didn't have to do with that at all. I just had an outdated version of Machine locally and was hitting the bug John had fixed in the ZipMany function where it hits and infinite loop.

@johnml1135
Copy link
Collaborator

src/Machine/test/Serval.Machine.Shared.Tests/Services/PreprocessBuildJobTests.cs line 209 at r4 (raw file):

            Assert.That(termCount, Is.EqualTo(0));
        });
        Assert.That(

Were these tests wrong originally? Why did the values change?

@johnml1135
Copy link
Collaborator

src/ServiceToolkit/test/SIL.ServiceToolkit/Utils/data/source2.txt line 1 at r4 (raw file):

Source two, Line 1

These should be removed from Machine/test, they are no longer needed.

@johnml1135
Copy link
Collaborator

Agreed. Let's just do this small thing and get it released to solve the immediate issue.

@johnml1135
Copy link
Collaborator

src/Machine/test/Serval.Machine.Shared.Tests/Services/PreprocessBuildJobTests.cs line 94 at r3 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

I also fixed a bug regarding this situation and added a test.

Great!

@johnml1135
Copy link
Collaborator

src/ServiceToolkit/src/SIL.ServiceToolkit/Utils/ParallelCorpusPreprocessor.cs line 313 at r3 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

This addresses #514

Looks good to me.

@codecov-commenter
Copy link

codecov-commenter commented Oct 17, 2024

Codecov Report

Attention: Patch coverage is 90.53254% with 32 lines in your changes missing coverage. Please review.

Project coverage is 56.71%. Comparing base (bdf43fa) to head (d3300c7).

Files with missing lines Patch % Lines
...ServiceToolkit/Utils/ParallelCorpusPreprocessor.cs 90.81% 22 Missing and 5 partials ⚠️
...hared/Services/ServalTranslationEngineServiceV1.cs 0.00% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #512      +/-   ##
==========================================
+ Coverage   56.67%   56.71%   +0.04%     
==========================================
  Files         299      301       +2     
  Lines       15626    15676      +50     
  Branches     2154     2160       +6     
==========================================
+ Hits         8856     8891      +35     
- Misses       6114     6128      +14     
- Partials      656      657       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To my knowledge, everything is working properly now on this branch. The echo engine is only covered by an E2E test, so we should consider either giving up on testing the Echo engine too extensively OR adding an echo-specific suite of tests. I'm not too bothered by this since the logic is now shared via the toolkit (i.e., the tricky part is tested elsewhere). Also, like I mentioned above, the preprocess build job tests should be ported eventually.

Reviewable status: 16 of 31 files reviewed, 10 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/Machine/test/Serval.Machine.Shared.Tests/Services/PreprocessBuildJobTests.cs line 209 at r4 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Were these tests wrong originally? Why did the values change?

Most of these are reversions to the values they had pre-mixed-source-support. I was confused about the intended functionality regarding pretranslation, so I changed the values. But with the changes in this PR, they can be changed back. Let me hand reckon the numbers one more time to make sure.


src/Serval/test/Serval.E2ETests/ServalClientHelper.cs line 194 at r6 (raw file):

                FileFormat.Text,
                targetLanguage,
                isTarget: true

@johnml1135, do you mind if I just do it like this? I found the echo-specific logic in the helper to be confusing.


src/ServiceToolkit/test/SIL.ServiceToolkit/Utils/data/source2.txt line 1 at r4 (raw file):

Previously, johnml1135 (John Lambert) wrote…

These should be removed from Machine/test, they are no longer needed.

Once the other tests are ported, they won't be needed, right. But for now, they're duplicated.

@johnml1135
Copy link
Collaborator

src/Serval/src/Serval.Translation/Services/EngineService.cs line 687 at r6 (raw file):

        if (sourceCorpus.Files.Count > 0)
            corpus.SourceCorpora.Add(sourceCorpus);
        if (targetCorpus.Files.Count > 0)

What is the significance of the logic change?

@johnml1135
Copy link
Collaborator

src/Serval/test/Serval.E2ETests/ServalClientHelper.cs line 194 at r6 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

@johnml1135, do you mind if I just do it like this? I found the echo-specific logic in the helper to be confusing.

That's fine - it looks clearer.

@johnml1135
Copy link
Collaborator

I agree - all complicated things in echo are either tested elsewhere (like in the service toolkit) or are themselves needed for testing, at which time we would be testing test code. The E2E echo tests are more about testing the Serval API independently of the Machine engines.

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 3 files at r2, 8 of 10 files at r3, 2 of 2 files at r4, 2 of 3 files at r5, 3 of 3 files at r6, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit and @Enkidu93)

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

Still working on a pretranslation bug. I'll ping when it should finally be reviewed.

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit)

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 16 of 22 files at r1, 2 of 3 files at r2, 7 of 10 files at r3, 1 of 2 files at r4, 2 of 3 files at r5, 3 of 3 files at r6, 1 of 1 files at r7, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @Enkidu93)


src/ServiceToolkit/src/SIL.ServiceToolkit/Utils/ParallelCorpusPreprocessor.cs line 3 at r6 (raw file):

namespace SIL.ServiceToolkit.Utils;

public class ParallelCorpusPreprocessor

This should be implemented as a service that we inject using DI. You should create an extension method that adds the required services to the DI in an IServiceCollectionExtensions class.


src/ServiceToolkit/test/SIL.ServiceToolkit/SIL.ServiceToolkit.Tests.csproj line 0 at r7 (raw file):
The directory name should match the assembly name.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Echo translation engine doesn't work for Paratext projects
4 participants