Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Align extract_partitioning_index logic with upstream shuffling #60

Merged
merged 10 commits into from
May 15, 2024

Conversation

rjzamora
Copy link
Contributor

@rjzamora rjzamora commented May 10, 2024

Dask modified how partitioning_index is used for shuffling in dask/dask#10705 (included in dask>=2023.12.1). This PR modifies extract_partitioning_index to use the same logic.

TODO:

  • Testing

@rjzamora
Copy link
Contributor Author

cc @VibhuJawa

Copy link
Collaborator

@ayushdg ayushdg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial review since this is a wip.

  • I can test the fix out on larger datasets if you feel it's ready for testing.
  • You might need to amend/update existing commits to include a signoff (commit -s), to pass the DCO check

nemo_curator/utils/fuzzy_dedup_utils/merge_utils.py Outdated Show resolved Hide resolved
rjzamora and others added 3 commits May 13, 2024 07:07
Signed-off-by: rjzamora <rzamora217@gmail.com>
This PR adds a new tutorial to demonstrate data curation for PEFT
use-cases.

Signed-off-by: Mehran Maghoumi <Maghoumi@users.noreply.github.com>
Signed-off-by: rjzamora <rzamora217@gmail.com>
Signed-off-by: rjzamora <rzamora217@gmail.com>
@rjzamora rjzamora force-pushed the revise-extract_partitioning_index branch from d453b6e to 1f28a35 Compare May 13, 2024 14:08
@rjzamora
Copy link
Contributor Author

rjzamora commented May 13, 2024

@ayushdg - thanks for the review. Still don't have an evironment to test this myself. I will try to do that later today, but if it's easy for you to test it is very welcome on my end :)

@ayushdg
Copy link
Collaborator

ayushdg commented May 13, 2024

Can confirm I'm seeing expected results on the larger scale dataset with this fix. I'll run a few more tests on my end but it's generally looking good.
@rjzamora Do you have a small example that checks consistency of the behavior of these two shuffle approaches, that could be added as a unit test with the PR?

ayushdg and others added 2 commits May 14, 2024 06:38
* Move PII constants to a seperate file that does not import presidio/spacy and other GPU dependencies

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Add comment around import, move constant import to global scope

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

---------

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
@rjzamora
Copy link
Contributor Author

Do you have a small example that checks consistency of the behavior of these two shuffle approaches, that could be added as a unit test with the PR?

@VibhuJawa - Still trying to fix my environment so I can confirm locally, but won't the FuzzyDuplicates tests fail without this fix in place?

@ayushdg
Copy link
Collaborator

ayushdg commented May 14, 2024

Do you have a small example that checks consistency of the behavior of these two shuffle approaches, that could be added as a unit test with the PR?

@VibhuJawa - Still trying to fix my environment so I can confirm locally, but won't the FuzzyDuplicates tests fail without this fix in place?

In theory the shuffle gives incorrect results, but the dataset/num_partitions here is small enough that it doesn't impact the correctness of final results (duplicate documents detected)

Copy link
Collaborator

@VibhuJawa VibhuJawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving, lets merge it in when we add the requested unit test.

Signed-off-by: rjzamora <rzamora217@gmail.com>
@rjzamora rjzamora marked this pull request as ready for review May 15, 2024 03:58
tests/test_fuzzy_dedup.py Outdated Show resolved Hide resolved
Comment on lines +418 to +421
npartitions_right,
npartitions_right,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having trouble wrapping my head around the "correct" way to test the batched case. However, my sense is that this test already covers the critical requirement that we are extracting a partitioning index that is consistent with ddf.shuffle.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I don't think this specific bug impacts the batched case any differently than the reproducer here so I think it should be good to go.

Signed-off-by: rjzamora <rzamora217@gmail.com>
Signed-off-by: rjzamora <rzamora217@gmail.com>
@rjzamora rjzamora force-pushed the revise-extract_partitioning_index branch from b0e41ab to 647406f Compare May 15, 2024 13:50
Signed-off-by: rjzamora <rzamora217@gmail.com>
Comment on lines +215 to +217
from nemo_curator.utils.fuzzy_dedup_utils.shuffle_utils import (
rearange_by_column_direct,
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that I moved this import into merge_left_to_shuffled_right, because shuffle_utils currently requires cudf/dask_cuda, while other logic in this module does not.

Signed-off-by: rjzamora <rzamora217@gmail.com>
Copy link
Collaborator

@ayushdg ayushdg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @rjzamora

Comment on lines +418 to +421
npartitions_right,
npartitions_right,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I don't think this specific bug impacts the batched case any differently than the reproducer here so I think it should be good to go.

@ayushdg ayushdg merged commit ecd4f4b into NVIDIA:main May 15, 2024
3 checks passed
@rjzamora rjzamora deleted the revise-extract_partitioning_index branch May 15, 2024 19:14
VibhuJawa pushed a commit to VibhuJawa/NeMo-Curator that referenced this pull request May 16, 2024
…DIA#60)

* update extract_partitioning_index with compat code

Signed-off-by: rjzamora <rzamora217@gmail.com>

* [Tutorials] Add a tutorial for PEFT data curation (NVIDIA#45)

This PR adds a new tutorial to demonstrate data curation for PEFT
use-cases.

Signed-off-by: Mehran Maghoumi <Maghoumi@users.noreply.github.com>
Signed-off-by: rjzamora <rzamora217@gmail.com>

* move compat code to _compat file

Signed-off-by: rjzamora <rzamora217@gmail.com>

* Only import PII constants during Curator import (NVIDIA#61)

* Move PII constants to a seperate file that does not import presidio/spacy and other GPU dependencies

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Add comment around import, move constant import to global scope

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

---------

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* add unit test

Signed-off-by: rjzamora <rzamora217@gmail.com>

* add pytest.mark.gpu

Signed-off-by: rjzamora <rzamora217@gmail.com>

* move extract_partitioning_index import for now

Signed-off-by: rjzamora <rzamora217@gmail.com>

* test both cudf and pandas

Signed-off-by: rjzamora <rzamora217@gmail.com>

* spelling

Signed-off-by: rjzamora <rzamora217@gmail.com>

---------

Signed-off-by: rjzamora <rzamora217@gmail.com>
Signed-off-by: Mehran Maghoumi <Maghoumi@users.noreply.github.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Co-authored-by: Mehran Maghoumi <Maghoumi@users.noreply.github.com>
Co-authored-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants