Add RPV2 Pre-training Data Curation Tutorial #267

Closed
wants to merge 7 commits into from

yyu22
Contributor

@yyu22 yyu22 commented Sep 26, 2024

Description

Adds a tutorial on using NeMo-Curator for LLM pre-training data curation with the RPV2 dataset. The tutorial uses the following Curator modules:

  • Data resharding
  • Add ID
  • Language ID and Separation
  • Text cleaning
  • Exact Deduplication
  • Fuzzy Deduplication
  • Quality filtering
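
To make the order of the steps concrete, here is a minimal plain-Python sketch of a few of the stages listed above (Add ID, text cleaning, exact deduplication, quality filtering). It does not use NeMo-Curator's API: the function names, the ID format, and the word-count threshold are all hypothetical, chosen only for illustration, and the actual tutorial notebook relies on the Curator modules instead.

```python
import hashlib
import re
import unicodedata


def add_ids(docs, prefix="doc"):
    """Attach a sequential, zero-padded ID to each document (hypothetical ID format)."""
    return [{"id": f"{prefix}-{i:010d}", "text": text} for i, text in enumerate(docs)]


def clean_text(record):
    """Normalize Unicode to NFC and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", record["text"])
    return {**record, "text": re.sub(r"\s+", " ", text).strip()}


def exact_dedup(records):
    """Keep only the first record seen for each distinct text hash."""
    seen, kept = set(), []
    for record in records:
        digest = hashlib.sha256(record["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(record)
    return kept


def quality_filter(records, min_words=3):
    """Drop documents with fewer than min_words words (a stand-in heuristic)."""
    return [r for r in records if len(r["text"].split()) >= min_words]


raw = [
    "The quick  brown fox jumps.",
    "The quick brown fox jumps.",
    "Too short",
    "A second, distinct document about data curation.",
]

# IDs are assigned before cleaning and deduplication so that removed
# documents can still be traced back to their original position.
curated = quality_filter(exact_dedup([clean_text(r) for r in add_ids(raw)]))
for record in curated:
    print(record["id"], record["text"])
```

In this toy run the first two documents become identical after cleaning, so only the first survives exact deduplication, and the two-word document is dropped by the filter.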

Usage

Run the notebook, which walks through the data curation workflow.

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

@yyu22 yyu22 marked this pull request as ready for review September 26, 2024 22:49
Collaborator

@ryantwolf ryantwolf left a comment

Looks good overall, thank you so much for doing this! Just have a few minor changes.

tutorials/pretraining-data-curation/README.md (two outdated review threads, resolved)
"outputs": [],
"source": [
"from nemo_curator import AddId\n",
"from nemo_curator.utils.distributed_utils import read_data, write_to_disk\n",
Collaborator

You import these twice in the notebook. Can you remove one instance of the imports? Probably this one so all the imports are in one place.
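
One way to spot repeated imports like these before review is to scan the notebook's JSON. The sketch below is an assumption about how such a check could look, not part of the PR or of NeMo-Curator: `duplicate_imports` is a hypothetical helper that only uses the standard library, and it needs Python 3.9 or later for `ast.unparse`.

```python
import ast
import json
from collections import defaultdict


def duplicate_imports(notebook_json):
    """Return import statements that occur more than once, with their cell indices."""
    notebook = json.loads(notebook_json)
    seen = defaultdict(list)
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        # "source" is stored either as a list of lines or as a single string.
        source = "".join(cell.get("source", []))
        try:
            tree = ast.parse(source)
        except SyntaxError:
            continue  # e.g. cells containing IPython magics are skipped
        for node in ast.walk(tree):
            if isinstance(node, (ast.Import, ast.ImportFrom)):
                seen[ast.unparse(node)].append(index)
    return {stmt: cells for stmt, cells in seen.items() if len(cells) > 1}


# A tiny in-memory notebook standing in for the real .ipynb file.
example = json.dumps({"cells": [
    {"cell_type": "code", "source": ["from nemo_curator import AddId\n", "import os\n"]},
    {"cell_type": "markdown", "source": ["# Add ID"]},
    {"cell_type": "code", "source": ["from nemo_curator import AddId\n"]},
]})
print(duplicate_imports(example))
```

The source is only parsed, never executed, so the check runs even where the imported packages are not installed.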

Collaborator

@ryantwolf ryantwolf left a comment

Two more things I forgot.

@ryantwolf
Collaborator

Looks like the precommit and DCO checks are failing too, could you fix them? Our CONTRIBUTING.md has information on how to fix it.

Yang Yu added 4 commits October 2, 2024 11:09
Signed-off-by: Yang Yu <yayu@yayu-mlt.client.nvidia.com>
Signed-off-by: Yang Yu <yayu@yayu-mlt.client.nvidia.com>
Signed-off-by: Yang Yu <yayu@yayu-mlt.client.nvidia.com>
Signed-off-by: Yang Yu <yayu@yayu-mlt.client.nvidia.com>
Yang Yu added 2 commits October 2, 2024 16:57
Signed-off-by: Yang Yu <yayu@yayu-mlt.client.nvidia.com>
@yyu22
Contributor Author

yyu22 commented Oct 4, 2024

Looks like the precommit and DCO checks are failing too, could you fix them? Our CONTRIBUTING.md has information on how to fix it.

Fixed

@yyu22
Contributor Author

yyu22 commented Oct 4, 2024

@ryantwolf I've addressed your comments. Can you review again?

@sarahyurick sarahyurick added the documentation Improvements or additions to documentation label Oct 7, 2024
@yyu22 yyu22 requested a review from ryantwolf October 8, 2024 21:42
Collaborator

@ryantwolf ryantwolf left a comment

Wonderful work!

@ryantwolf
Collaborator

Ah shoot, not all commits are signed. Can you follow the instructions here to retroactively sign all your commits?

@yyu22 yyu22 force-pushed the rpv2-tutorial branch 6 times, most recently from f04416e to 4a08100 Compare October 10, 2024 05:55
Signed-off-by: Yang Yu <yayu@nvidia.com>
@ryantwolf
Collaborator

@yyu22 looks like the commits still aren't signed. There are two options you need to specify when committing: one for "signing" (-S) and one for "signing off" (-s). You can combine them as -sS, as in git commit -sS. This page has more details on the "signing" part: https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification. Make sure you are doing both, as specified in the PR guidelines.
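
The "signing off" half is visible in the commit message itself as a Signed-off-by trailer, so it can be checked with a few lines of Python. This is a rough sketch under stated assumptions: `has_signoff` and the regular expression are hypothetical and are not the check the DCO bot actually runs. The cryptographic signature added by -S is stored in the commit object rather than in the message, so this check cannot detect it.

```python
import re

# Matches a trailer line such as "Signed-off-by: Jane Doe <jane@example.com>".
SIGNOFF = re.compile(r"^Signed-off-by: .+ <[^<>\s]+@[^<>\s]+>$", re.MULTILINE)


def has_signoff(message):
    """Return True if the commit message contains a Signed-off-by trailer line."""
    return bool(SIGNOFF.search(message))


signed_off = "Add RPV2 tutorial\n\nSigned-off-by: Yang Yu <yayu@nvidia.com>\n"
not_signed_off = "Add RPV2 tutorial\n"

print(has_signoff(signed_off))
print(has_signoff(not_signed_off))
```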

@ryantwolf
Collaborator

Closing in favor of #292

@ryantwolf ryantwolf closed this Oct 10, 2024