-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for parallel data curation #193
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…set test, fix a few data and import bugs Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…rget Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…ataset is sometimes turned into its parent class by mistake, add write to simple bitext functionality, update bitext tutorial Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…xtensions are removed twice before writing Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…ntScoreFilter can take more than one fields for source and target Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…date README Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…, run black formatter Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…eakage Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Such good work! I have a few additional requests around documentation. Can you create one or more documentation pages about curating parallel datasets in our docs/user-guide/
? You can see how it currently is rendered here. Also, please add all the classes and functions you expect a user to use in our API reference
) | ||
|
||
for idx, (src_line, tgt_line) in enumerate(zip(open(src_file), open(tgt_file))): | ||
assert ds.df["src"].compute()[idx] == src_line.rstrip("\n") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: instead of calling compute multiple times, call it once and then do all the assert statements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interestingly, I get errors when I call it only once:
TypeError: Trying to convert dd.Scalar<series-..., dtype=bool> to a boolean value. Because Dask objects are lazily evaluated, they cannot be converted to a boolean value or used in boolean conditions like if statements. Try calling .compute() to force computation prior to converting to a boolean value or using in a conditional statement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahhh I love this! Thanks for adding a tutorial!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the very thorough review! I left a few questions and comments. mainly concerned about the refactoring you proposed. Meanwhile I'll proceed with the lower-level changes.
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…accomodate custom field names, pause doc repartition since it causes problems Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…tern, test currently failing Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…led, fix tutorial Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Description
This PR adds support for parallel data curation. Namely:
ParallelDataset
that supports loading and writing parallel data in simple bitext format.ScoreFilter
subclassParallelScoreFilter
that allows application of existing monolingual filters on parallel data while maintaining the alignment of sentence/document pairs.ScoreFilter
subclassJointScoreFilter
that allows implementation of filters that takes both fields of the parallel sentence/document pairs.HistogramFilter
andLengthRatioFilter
.QualityEstimationFilter
.comet
andcometoid
.Joint work at MTMA 2024 with @nverma1.
Usage
See
tutorials/bitext_cleaning/main.py
.Checklist