Add Jupyter notebook tutorial for single node multilingual dataset #30

Merged
merged 36 commits into NVIDIA:main on May 24, 2024

Conversation

nicoleeeluo
Contributor

This PR adds a Jupyter notebook workflow for a sample curation pipeline for Thai Wikipedia data.

The modules included in this workflow are:

  1. Wikipedia downloading
  2. Language separation
  3. Unicode formatter
  4. Add ID helper
  5. GPU exact deduplication
  6. GPU fuzzy deduplication
  7. Heuristic filtering for non-English datasets

@nicoleeeluo nicoleeeluo marked this pull request as ready for review April 11, 2024 13:06
@arhamm1
Collaborator

arhamm1 commented Apr 25, 2024

@ryantwolf / @Maghoumi can we review and get this merged?

@arhamm1
Collaborator

arhamm1 commented May 13, 2024

Pinging again. @Maghoumi, can you help review this? It will help teams trying to create multilingual datasets with Curator!

@ayushdg
Collaborator

ayushdg commented May 14, 2024

Thanks for opening the PR, @nicoleeeluo. All our PRs require commits to be verified (signed) and signed off.
To achieve this, you need to commit with the -sS flags (more info in the Contributing Guide).
More details on signing commits can be found in this guide: https://docs.github.com/en/authentication/managing-commit-signature-verification/signing-commits

@nicoleeeluo
Contributor Author

@ayushdg Thanks for the reminder!

Collaborator

@ryantwolf ryantwolf left a comment

Thanks so much for creating this tutorial! It's quite extensive, which is super helpful. There are a couple of changes I'd appreciate if you could make. Please let me know if you have any concerns or disagree with my requests. Thanks again.

tutorials/single_node_tutorial/single_gpu_tutorial.ipynb (outdated, resolved)
"def pre_imports():\n",
" import cudf \n",
"\n",
"def load_dataset(input_data_dir, file_type='jsonl'):\n",
Collaborator

Do you think you could use the DocumentDataset.read_json and DocumentDataset.read_parquet methods we have added recently? Let me know if something about them would prevent you from using them.

Contributor Author

Fixed
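
For readers following the thread, the suggestion above can be sketched roughly as follows. This is a minimal illustration, not the notebook's final code. It assumes that `DocumentDataset.read_json` and `DocumentDataset.read_parquet` accept a list of file paths and a `backend` argument, and that `get_all_files_paths_under` is available from `nemo_curator.utils.file_utils`, as in NeMo Curator around the time of this PR. The `load_dataset` wrapper and `SUPPORTED_FILE_TYPES` are hypothetical names.

```python
# Hypothetical wrapper; the NeMo Curator calls below are assumptions about the
# library's API around the time of this PR (check your installed version).
SUPPORTED_FILE_TYPES = ("jsonl", "parquet")


def load_dataset(input_data_dir, file_type="jsonl", backend="cudf"):
    """Load every file under input_data_dir into a DocumentDataset."""
    if file_type not in SUPPORTED_FILE_TYPES:
        raise ValueError(
            f"Unsupported file_type {file_type!r}; expected one of {SUPPORTED_FILE_TYPES}"
        )

    # Imported lazily so the argument check above works without the library.
    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.utils.file_utils import get_all_files_paths_under

    # Collects every file under the directory, regardless of extension.
    files = list(get_all_files_paths_under(input_data_dir))
    if file_type == "jsonl":
        return DocumentDataset.read_json(files, backend=backend)
    return DocumentDataset.read_parquet(files, backend=backend)
```

Delegating to the library readers keeps the notebook free of hand-written cuDF loading code, which appears to be the point of the request.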

"source": [
"## 2.Language separation and unicode fixing\n",
"\n",
"**Note**: In order to be run on interactive python. Please comment `from.code import *` and the related imports in `./nemo_curator/filters/__init__.py`"
Collaborator

Can you try this again with the latest version of curator? And please let us know what errors you get if you get any.

Contributor Author

This is fixed. I will amend the notebook accordingly

"TH wikipedia data do have `id` field, but the `id` field contains number only. It will be better if we unified the `id` field and transform it to the format of `<prefix>_<id>`. In this way, when handling multiple dataset, we will able to know which document from which dataset has been removed. This `id` will be useful when we are running deduplication and heuristic filtering. The function we will be using is `AddID()`. Arguments for this function include:\n",
"- `id_field`: fields will be added to input .json file. If the key already exists in the .jsonl, it's value will be replaced.\n",
"- `id_prefix`: prefix used in ID. Default is 'doc-id'\n",
"- `start_index`: starting index in ID. Default is 0"
Collaborator

This recently changed. The default is now None and the id is considered "unordered" by default to improve speed.

Contributor Author

Fixed the description. In the code section, I keep the start_index = 0 for easier reference
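
To make the `<prefix>_<id>` idea concrete, here is a small standalone sketch. It does not call NeMo Curator's ID module and is not its implementation: `add_prefixed_ids` is a hypothetical helper, and the exact ID string the library produces may differ. It only illustrates how `id_field`, `id_prefix`, and `start_index` interact, including the replacement of an existing `id` value.

```python
def add_prefixed_ids(records, id_field="id", id_prefix="doc_id", start_index=0):
    """Return copies of records whose id_field is '<prefix>_<running index>'."""
    result = []
    for offset, record in enumerate(records):
        updated = dict(record)  # copy so the caller's records are not mutated
        # Any existing value under id_field is replaced.
        updated[id_field] = f"{id_prefix}_{start_index + offset}"
        result.append(updated)
    return result


docs = [{"id": "545", "text": "first"}, {"id": "547", "text": "second"}]
tagged = add_prefixed_ids(docs, id_prefix="TH_wiki", start_index=0)
print([d["id"] for d in tagged])  # ['TH_wiki_0', 'TH_wiki_1']
```

With a distinct prefix per dataset, an ID that shows up in a deduplication or filtering report identifies its source dataset directly.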

tutorials/single_node_tutorial/single_gpu_tutorial.ipynb (outdated, resolved)
" 2. Fuzzy deduplication\n",
"4. Heuristic filtering\n",
"\n",
"What is not included:\n",
Collaborator

It's probably also worth mentioning that the tutorial doesn't include:

  1. Distributed data classification with PyTorch models
  2. Personally identifiable information (PII) redaction

Contributor Author

Added

@nicoleeeluo
Contributor Author

@ryantwolf Hi Ryan, I have pushed a new version to include the fixes you mentioned.

Collaborator

@ryantwolf ryantwolf left a comment

Perfect! I have one last comment about the Docker image you refer to, but other than that it looks great!

Though, as Ayush mentioned, all our PRs require commits to be verified (signed) and signed off. Quoting from him earlier:

To achieve this, you need to commit with the -sS flags (more info in the Contributing Guide).
More details on signing commits can be found in this guide: https://docs.github.com/en/authentication/managing-commit-signature-verification/signing-commits

Also, it looks like the style check is failing for the config file you added. You should be able to fix it by running pip install pre-commit && pre-commit install && pre-commit run --all-files. Thanks again!

" Password: <Your NGC Key>\n",
"- Get NeMo NeMo Framework Training Container\n",
" ```bash\n",
" docker pull nvcr.io/ea-bignlp/ga-participants/nemofw-training:24.01\n"
Collaborator

I'd either change this to the dev container or the 24.05 tag (which hasn't been released yet).

docker pull nvcr.io/nvidia/nemo:dev.framework
docker pull nvcr.io/nvidia/nemo:24.05.framework

The other container versions don't have the latest version of NeMo Curator that the tutorial uses.

Contributor Author

Fixed

nicoleeeluo and others added 22 commits May 20, 2024 07:54
Signed-off-by: Nicole Luo <nluo@nvidia.com>
* Fix metadata inference with pandas and dask

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Fix datatypes for task decontamination

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Use targeted import

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

---------

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>
* Move tokenizer import

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Reduce inductor threads

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Change env int to string

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Change location of env var

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add comment linking issue

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

---------

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>
* Add fast id method

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add type conversion

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Fix off by one errors in tests

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

---------

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>
* Move GPU imports and make them optional

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Move gpu dependencies to a separate install

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Remove unused import

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Switch to placeholder import that raises on usage

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Remove deprecated utils usage

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Add cuML attribution

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Safe import tests, improve install instruction, update gha workflow

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Fix pytests due to loc bug

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* update install instructions

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Raise on non module-not-found errors, update logging

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Update logging to not change root logger

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

---------

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>
* [K8s]: Adds a helper script to create a dask cluster on k8s and includes
instructions for how to run a Curator workload on k8s

Signed-off-by: Terry Kong <terryk@nvidia.com>

* black formatting

Signed-off-by: Terry Kong <terryk@nvidia.com>

* big_english -> my_dataset

Signed-off-by: Terry Kong <terryk@nvidia.com>

* 24.01 -> 24.03 default container

Signed-off-by: Terry Kong <terryk@nvidia.com>

* Add help kwarg to all flags

Signed-off-by: Terry Kong <terryk@nvidia.com>

* Clarify why venv is needed

Signed-off-by: Terry Kong <terryk@nvidia.com>

* fix precommit failures

Signed-off-by: Terry Kong <terryk@nvidia.com>

---------

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>
* Refactor common utils and remove unused code

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* More cleanup

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* More updates/shuffling

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Move gpu_dedup scripts into subfolder

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Remove gpu_deduplication subfolder

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Add readme to fuzzy dedup scripts section

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Fix typo and relative links

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Remove legacy script entrypoints

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Remove legacy scripts and add init file

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Update GpuDeduplication.rst

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

---------

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>
* Fix lang id example

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add classifier unit tests

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add test for failure

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Remove failure test

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

---------

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>
* Add initial dataset blending function

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add blend unit tests

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add self parameter

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Fix return type of blend dataset

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Fix blending tests

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Change assert statement for very uneven blend

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Fix key error

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add proper proportion blending test

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add four dataset blend and clarify docs

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add shuffle module

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add blend example and tests

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Fix random method name

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Wrap return type in DocumentDataset

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Save result of column drop

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Change equality check for shuffle tests

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Fix expected order after shuffle

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add more documents to shuffle test

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add assert statement

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add within partition shuffle

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Refactor add rand column for shuffle

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Fix filename tests

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add determinism handling for shuffle

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Change numpy random function

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Fix tests with new random method

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Remove length call from blending

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Improve scaling of blending function

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Fix blend tests

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add blending script

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add additional file paths call

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add documentation

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Reformat docs

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Remove backticks

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add context manager for shuffle tests

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add better deterministic shuffle path

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Update documentation and reset index

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

---------

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>
* Initial pass at fuzzy dedup api

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Update deprecated shuffle arg

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* dask_cuda gpu only import

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Move fuzzy_dedup imports to optional

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* more tests

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Move FuzzyDeDupConfig to its own class

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Add example script and config file, fix typo

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Remove slurm examples for gpu dedup

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Add config module

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Rename FuzzyDeDupConfig and minhash_length to  FuzzyDuplicatesConfig, num_hashes

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Add comments and update example

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Write to same format as input in fuzzy dedup example

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

---------

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>
* Fix pii index issue

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add sequential wrapper

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Fix pii tests

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

---------

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>
Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>
…g speed (NVIDIA#57)

This commit fixes issue NVIDIA#43 (empty files created when invoking the reshard_jsonl method in nemo_curator.utils.file_utils.py) by checking the size of each file after it is generated and deleting any file of size zero.

In addition, I noticed that there is no need to parse each line into a JSON object, since the lines should already be in JSON format. Removing that extra parsing significantly speeds up this method.

Signed-off-by: Miguel Martínez <26169771+miguelusque@users.noreply.github.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>
This PR adds a new tutorial to demonstrate data curation for PEFT
use-cases.

Signed-off-by: Mehran Maghoumi <Maghoumi@users.noreply.github.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>
* Move PII constants to a separate file that does not import presidio/spacy and other GPU dependencies

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Add comment around import, move constant import to global scope

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

---------

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>
Signed-off-by: Nicoel Luo <nluo@nvidia.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>
Co-authored-by: Ryan Wolf <ryantwolf1@gmail.com>
Signed-off-by: nicoleeeluo <157772168+nicoleeeluo@users.noreply.github.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>
nicoleeeluo and others added 10 commits May 20, 2024 07:54
Co-authored-by: Ryan Wolf <ryantwolf1@gmail.com>
Signed-off-by: nicoleeeluo <157772168+nicoleeeluo@users.noreply.github.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>
…zy deduplication wrapper example

Signed-off-by: Nicole Luo <nluo@nvidia.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>
nicoleeeluo and others added 2 commits May 20, 2024 16:28
Signed-off-by: nicoleeeluo <157772168+nicoleeeluo@users.noreply.github.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>
@nicoleeeluo
Contributor Author

@ryantwolf Hi Ryan, I have fixed the commits accordingly. Could you review them and let me know if there are any issues? Thank you!

}
],
"source": [
"client = get_client(args, args.device)\n",
Collaborator

Ah, we just merged a change to this function's signature so that it no longer requires an argparse object. It should be easier to use, but it does mean that you need to update the call here. Ping me again when this is changed and I'll merge it in ASAP.

Contributor Author

Thanks for noting. I have updated accordingly.
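
The change under discussion can be pictured roughly like this. The sketch assumes that the updated `get_client` in `nemo_curator.utils.distributed_utils` takes a `cluster_type` keyword (`"cpu"` or `"gpu"`) instead of an argparse namespace. That keyword is an assumption based on the library around the time of this PR, and `start_client` is a hypothetical wrapper, so check the signature in your installed version.

```python
def start_client(device="gpu"):
    """Start a local Dask client for the tutorial (hypothetical helper)."""
    if device not in ("cpu", "gpu"):
        raise ValueError(f"device must be 'cpu' or 'gpu', got {device!r}")

    # Imported lazily; requires NeMo Curator to be installed.
    from nemo_curator.utils.distributed_utils import get_client

    # Old call in the notebook: get_client(args, args.device)
    # Assumed new call: plain keyword arguments, no argparse object.
    return get_client(cluster_type=device)
```

Passing plain keyword arguments is what makes the function usable from a notebook cell, where no command-line parser exists.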

Collaborator

@ryantwolf ryantwolf left a comment

Thanks for the tutorial! This is super great

@ryantwolf ryantwolf merged commit 6fbc3ad into NVIDIA:main May 24, 2024
3 checks passed