-
-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move preprocess logic to toolkit #512
base: main
Are you sure you want to change the base?
Conversation
Pull the most recent from main - we are up to Machine 3.4.0 - and there are 4 locations that all need to be updated together. |
Pretranslations are not universal. Could this be broken up into a few major pieces - and then one Machine specific function inherit from this class to tie the main pieces together? The main processing of "monocorpus" and "parallelCorpus" are universal - and we should be able to have a few routines to slice and dice these things so that the 10 line linq expression only happens once... |
So does this follow the logic - Mix the sources, chose the first target? I can't see it clearly - should it be documented here? |
Align Train and Align Pretranslate. If these will be in the ServiceToolkit, could we change the name to something more generic? AlignTrainCorpus may be good enough, but I would prefer something like "AlignResultsGenerationCorpus" or even more generically, "AlignMixSource" and "AlignChooseFirstSource". Would that work? |
Add a new set of unit tests in service toolkit? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 22 of 22 files at r1, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @ddaspit and @Enkidu93)
Fix build error Update Echo engine to use toolkit Fix async stream issue Fix test: Add ability to specify CorpusService mock
5a61291
to
3ec2bce
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a unit test in the toolkit. I think ideally we'll port almost all the current PreprocessBuildTests to the toolkit. I'm wondering though if we should try to get this in first and then follow up with that in a separate PR since this PR will be difficult to merge once other changes have been put through.
Reviewable status: 17 of 29 files reviewed, 7 unresolved discussions (waiting on @ddaspit and @johnml1135)
src/Echo/src/EchoTranslationEngine/TranslationEngineServiceV1.cs
line 98 at r3 (raw file):
}, cancellationToken );
@ddaspit, it might just be because it's late, but it's not clear to me how to go about doing this: I seem to be doing something wrong with how I'm passing in this async delegate and it's causing the Echo engine to hang and not respond. Any thoughts? I've not exhausted my debugging (although the debugger keeps crashing on me), but I thought maybe you'd just see it right off.
src/Machine/test/Serval.Machine.Shared.Tests/Services/PreprocessBuildJobTests.cs
line 94 at r3 (raw file):
[Test] public async Task RunAsync_PretranslateTextIdsOverlapWithTrainOnTextIds()
I also fixed a bug regarding this situation and added a test.
src/ServiceToolkit/src/SIL.ServiceToolkit/Utils/ParallelCorpusPreprocessor.cs
line 157 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Pretranslations are not universal. Could this be broken up into a few major pieces - and then one Machine specific function inherit from this class to tie the main pieces together? The main processing of "monocorpus" and "parallelCorpus" are universal - and we should be able to have a few routines to slice and dice these things so that the 10 line linq expression only happens once...
Discussed in meeting
src/ServiceToolkit/src/SIL.ServiceToolkit/Utils/ParallelCorpusPreprocessor.cs
line 179 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
So does this follow the logic - Mix the sources, chose the first target? I can't see it clearly - should it be documented here?
That happens elsewhere, but yep, that's the logic. I can add more comments if something seems unclear.
src/ServiceToolkit/src/SIL.ServiceToolkit/Utils/ParallelCorpusPreprocessor.cs
line 295 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Align Train and Align Pretranslate. If these will be in the ServiceToolkit, could we change the name to something more generic? AlignTrainCorpus may be good enough, but I would prefer something like "AlignResultsGenerationCorpus" or even more generically, "AlignMixSource" and "AlignChooseFirstSource". Would that work?
Discussed in meeting.
src/ServiceToolkit/src/SIL.ServiceToolkit/Utils/ParallelCorpusPreprocessor.cs
line 313 at r3 (raw file):
.SelectMany(sc => trgCorpora.Select(tc => sc.AlignRows(tc, allSourceRows: true))) .ZipMany(rows => rows.Where(r => r.SourceSegment.Count > 0 && r.TargetSegment.Count == 0).FirstOrDefault()
This addresses #514
src/Serval/src/Serval.Shared/Serval.Shared.csproj
line 22 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Pull the most recent from main - we are up to Machine 3.4.0 - and there are 4 locations that all need to be updated together.
Done.
…anslations for segments with target text
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 17 of 29 files reviewed, 7 unresolved discussions (waiting on @ddaspit and @johnml1135)
src/Echo/src/EchoTranslationEngine/TranslationEngineServiceV1.cs
line 98 at r3 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
@ddaspit, it might just be because it's late, but it's not clear to me how to go about doing this: I seem to be doing something wrong with how I'm passing in this async delegate and it's causing the Echo engine to hang and not respond. Any thoughts? I've not exhausted my debugging (although the debugger keeps crashing on me), but I thought maybe you'd just see it right off.
Nevermind. It didn't have to do with that at all. I just had an outdated version of Machine locally and was hitting the bug John had fixed in the ZipMany
function where it hits and infinite loop.
Were these tests wrong originally? Why did the values change? |
These should be removed from Machine/test, they are no longer needed. |
Agreed. Let's just do this small thing and get it released to solve the immediate issue. |
Previously, Enkidu93 (Eli C. Lowry) wrote…
Great! |
Previously, Enkidu93 (Eli C. Lowry) wrote…
Looks good to me. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #512 +/- ##
==========================================
+ Coverage 56.67% 56.71% +0.04%
==========================================
Files 299 301 +2
Lines 15626 15676 +50
Branches 2154 2160 +6
==========================================
+ Hits 8856 8891 +35
- Misses 6114 6128 +14
- Partials 656 657 +1 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To my knowledge, everything is working properly now on this branch. The echo engine is only covered by an E2E test, so we should consider either giving up on testing the Echo engine too extensively OR adding an echo-specific suite of tests. I'm not too bothered by this since the logic is now shared via the toolkit (i.e., the tricky part is tested elsewhere). Also, like I mentioned above, the preprocess build job tests should be ported eventually.
Reviewable status: 16 of 31 files reviewed, 10 unresolved discussions (waiting on @ddaspit and @johnml1135)
src/Machine/test/Serval.Machine.Shared.Tests/Services/PreprocessBuildJobTests.cs
line 209 at r4 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Were these tests wrong originally? Why did the values change?
Most of these are reversions to the values they had pre-mixed-source-support. I was confused about the intended functionality regarding pretranslation, so I changed the values. But with the changes in this PR, they can be changed back. Let me hand reckon the numbers one more time to make sure.
src/Serval/test/Serval.E2ETests/ServalClientHelper.cs
line 194 at r6 (raw file):
FileFormat.Text, targetLanguage, isTarget: true
@johnml1135, do you mind if I just do it like this? I found the echo-specific logic in the helper to be confusing.
src/ServiceToolkit/test/SIL.ServiceToolkit/Utils/data/source2.txt
line 1 at r4 (raw file):
Previously, johnml1135 (John Lambert) wrote…
These should be removed from Machine/test, they are no longer needed.
Once the other tests are ported, they won't be needed, right. But for now, they're duplicated.
What is the significance of the logic change? |
Previously, Enkidu93 (Eli C. Lowry) wrote…
That's fine - it looks clearer. |
I agree - all complicated things in echo are either tested elsewhere (like in the service toolkit) or are themselves needed for testing, at which time we would be testing test code. The E2E echo tests are more about testing the Serval API independently of the Machine engines. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 3 files at r2, 8 of 10 files at r3, 2 of 2 files at r4, 2 of 3 files at r5, 3 of 3 files at r6, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit and @Enkidu93)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
Still working on a pretranslation bug. I'll ping when it should finally be reviewed.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 16 of 22 files at r1, 2 of 3 files at r2, 7 of 10 files at r3, 1 of 2 files at r4, 2 of 3 files at r5, 3 of 3 files at r6, 1 of 1 files at r7, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @Enkidu93)
src/ServiceToolkit/src/SIL.ServiceToolkit/Utils/ParallelCorpusPreprocessor.cs
line 3 at r6 (raw file):
namespace SIL.ServiceToolkit.Utils; public class ParallelCorpusPreprocessor
This should be implemented as a service that we inject using DI. You should create an extension method that adds the required services to the DI in an IServiceCollectionExtensions
class.
src/ServiceToolkit/test/SIL.ServiceToolkit/SIL.ServiceToolkit.Tests.csproj
line 0 at r7 (raw file):
The directory name should match the assembly name.
Fixes #396
How should we go about testing the new Echo engine functionality? Just add integration tests?
This change is