Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for parallel data curation #193

Open
wants to merge 50 commits into
base: main
Choose a base branch
from

Conversation

shuoyangd
Copy link

Description

This PR adds support for parallel data curation. Namely:

  • A new dataset class ParallelDataset that supports loading and writing parallel data in simple bitext format.
  • A new ScoreFilter subclass ParallelScoreFilter that allows application of existing monolingual filters on parallel data while maintaining the alignment of sentence/document pairs.
  • A new ScoreFilter subclass JointScoreFilter that allows implementation of filters that takes both fields of the parallel sentence/document pairs.
  • New heuristic filters: HistogramFilter and LengthRatioFilter.
  • Adding model-based filters with quality estimation models: QualityEstimationFilter.
  • Support for two families of quality estimation models: comet and cometoid.
  • A tutorial for parallel data curation.
  • Tests accompanying new features.

Joint work at MTMA 2024 with @nverma1.

Usage

See tutorials/bitext_cleaning/main.py.

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

shuoyangd and others added 30 commits August 8, 2024 14:28
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…set test, fix a few data and import bugs

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…rget

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…ataset is sometimes turned into its parent class by mistake, add write to simple bitext functionality, update bitext tutorial

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…xtensions are removed twice before writing

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…ntScoreFilter can take more than one fields for source and target

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…date README

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…, run black formatter

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…eakage

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
@ryantwolf ryantwolf self-requested a review August 15, 2024 18:11
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Copy link
Collaborator

@ryantwolf ryantwolf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Such good work! I have a few additional requests around documentation. Can you create one or more documentation pages about curating parallel datasets in our docs/user-guide/? You can see how it currently is rendered here. Also, please add all the classes and functions you expect a user to use in our API reference

nemo_curator/datasets/__init__.py Outdated Show resolved Hide resolved
nemo_curator/datasets/doc_dataset.py Outdated Show resolved Hide resolved
nemo_curator/datasets/doc_dataset.py Outdated Show resolved Hide resolved
nemo_curator/datasets/doc_dataset.py Outdated Show resolved Hide resolved
nemo_curator/utils/distributed_utils.py Outdated Show resolved Hide resolved
nemo_curator/utils/file_utils.py Outdated Show resolved Hide resolved
)

for idx, (src_line, tgt_line) in enumerate(zip(open(src_file), open(tgt_file))):
assert ds.df["src"].compute()[idx] == src_line.rstrip("\n")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: instead of calling compute multiple times, call it once and then do all the assert statements.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly, I get errors when I call it only once:

TypeError: Trying to convert dd.Scalar<series-..., dtype=bool> to a boolean value. Because Dask objects are lazily evaluated, they cannot be converted to a boolean value or used in boolean conditions like if statements. Try calling .compute() to force computation prior to converting to a boolean value or using in a conditional statement.

tutorials/bitext_cleaning/README.md Outdated Show resolved Hide resolved
tutorials/bitext_cleaning/main.py Outdated Show resolved Hide resolved
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh I love this! Thanks for adding a tutorial!

Copy link
Author

@shuoyangd shuoyangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the very thorough review! I left a few questions and comments. mainly concerned about the refactoring you proposed. Meanwhile I'll proceed with the lower-level changes.

nemo_curator/filters/heuristic_filter.py Outdated Show resolved Hide resolved
nemo_curator/modules/filter.py Show resolved Hide resolved
nemo_curator/utils/distributed_utils.py Outdated Show resolved Hide resolved
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…accomodate custom field names, pause doc repartition since it causes problems

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…tern, test currently failing

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…led, fix tutorial

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants