NeMo Curator ReadMe Updates #62

Closed
wants to merge 6 commits into from

@jgerh jgerh commented May 10, 2024

No description provided.

@jgerh jgerh marked this pull request as ready for review May 10, 2024 21:42
Collaborator

@Maghoumi Maghoumi left a comment

Thanks for putting this PR together. I took one pass and left some minor comments.

I'll let the rest of the team also review and leave feedback.

README.md Outdated
These modules offer flexibility and permit reordering, with only a few exceptions. In addition, the [NeMo Framework Launcher](https://github.com/NVIDIA/NeMo-Megatron-Launcher) provides pre-built pipelines that can serve as a foundation for your customization use cases.

## Get Started

Collaborator

We don't want users to jump around too much to get started. Thus, I would add Installation as the first subheading here, like so:

Get Started

Installation

Before installing, ensure the following requirements are met

- blah blah

Clone the repo....
To install the CPU-only modules.... etc.

Use the Python Library

(The code snippet you have below).

Tutorials

Link to the two tutorials we have (here: https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials):
- tinystories, which focuses on data curation for training from scratch.
- peft-curation, which focuses on data curation for parameter-efficient fine-tuning use cases.

Collaborator

Done

README.md Outdated

- Python 3.10 or above
- CUDA 12 (or above)
- NVIDIA GPU
Collaborator

NVIDIA GPU is optional.

Collaborator

Done

README.md Outdated

## Prerequisites

- Python 3.10 or above
Collaborator

(or above)

Collaborator

Done
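
For readers who want to verify these prerequisites before installing, here is a minimal shell sketch. The helper names `version_at_least` and `check_prereqs` are hypothetical and not part of NeMo Curator. The check treats the GPU as optional, as noted in the review, and uses the presence of `nvidia-smi` only as a rough hint that GPU tooling is installed; it does not verify the CUDA version.

```shell
# Hypothetical helper: succeeds when "major.minor" version $1 is >= $2.
# Components are compared numerically, so 3.9 is correctly older than 3.10.
version_at_least() {
  have_major=${1%%.*}
  have_minor=${1#*.}; have_minor=${have_minor%%.*}
  want_major=${2%%.*}
  want_minor=${2#*.}; want_minor=${want_minor%%.*}
  [ "$have_major" -gt "$want_major" ] ||
    { [ "$have_major" -eq "$want_major" ] && [ "$have_minor" -ge "$want_minor" ]; }
}

# Hypothetical helper: reports whether the listed prerequisites look satisfied.
check_prereqs() {
  py_version=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])') || return 1
  if version_at_least "$py_version" 3.10; then
    echo "Python $py_version: OK"
  else
    echo "Python $py_version: need 3.10 or above"
    return 1
  fi
  # The GPU is optional, so its absence is reported but is not an error.
  if command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi found: GPU tooling appears to be installed"
  else
    echo "nvidia-smi not found: only the CPU modules would be usable"
  fi
}
```

Calling `check_prereqs` prints one line per check. The numeric comparison is the point of the sketch: a plain string comparison would rank `3.9` above `3.10`.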

NeMo Curator is a Python library composed of several scalable data-mining modules, specifically designed for curating Natural Language Processing (NLP) data to train Large Language Models (LLMs). It enables NLP researchers to extract high-quality text from vast, uncurated web corpora efficiently, supporting the development of more accurate and powerful language models.

NeMo Curator leverages [Dask](https://www.dask.org/) and [RAPIDS](https://developer.nvidia.com/rapids) to scale data curation and provide GPU acceleration. The Python library offers easy-to-use methods for expanding the functionality of your curation pipeline while eliminating scalability concerns.

Collaborator

Let's add a table of contents here right after the intro. I think it's standard practice nowadays.

Collaborator

> enhancing LLM training accuracy through GPU-accelerated data curation using ...

This sentence is not technically accurate. The usage of GPU acceleration alone does not directly lead to enhanced LLM training accuracy.

It's best to say it greatly accelerates data curation using GPUs, thus saving time.

Contributor Author

Revised introductory paragraphs to include comment

Collaborator

Thanks. Two more minor things:

  • DocumentDataset -> `DocumentDataset` (wrap in backticks)
  • dataframe -> `DataFrame` (CapitalCase and backticks)

Contributor Author

Added backticks


NeMo Curator leverages [Dask](https://www.dask.org/) and [RAPIDS](https://developer.nvidia.com/rapids) to scale data curation and provide GPU acceleration. The Python library offers easy-to-use methods for expanding the functionality of your curation pipeline while eliminating scalability concerns.

## Key Features
Collaborator

This reads a bit long to me. Maybe we could retain the original main bullet points/links, but remove the sub-bullet points? What do you think?

Collaborator

Done.

README.md Outdated

Note: Other methods are available to run NeMo Curator on SLURM. For example, refer to the example scripts in [`examples/slurm`](examples/slurm/) for information on how to run NeMo Curator on SLURM without the NeMo Framework Launcher.

## Implement NeMo Curator
Collaborator

I think the contents here are already covered by everything else you've added. I'd vote for removing it for brevity.

Collaborator

Done

README.md Outdated

## Installation
NeMo Curator is available in the [NeMo Framework Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo/tags). The latest release of NeMo Curator comes preinstalled in the container.
Collaborator

FYI, there are two ways to use NeMo Curator: either install from this repo (which you are already explaining), or through the NeMo framework container.

Since you are explaining how to install from this repo, I suggest moving this a few lines up (before Install NeMo Curator).

Contributor Author

Revised section and added two subtopics for tasks

README.md Outdated

We provide CLI scripts to use as well in case those are more convenient. The scripts under `nemo_curator/scripts` map closely with each of the created Python modules. Visit the [documentation](docs) for each of the Python modules for more information about the scripts associated with it.
+- tinystories which focuses on data curation for training from scratch.
Collaborator

Wrap in backticks so it appears as `tinystories`

Contributor Author

Added backticks

README.md Outdated

We provide CLI scripts to use as well in case those are more convenient. The scripts under `nemo_curator/scripts` map closely with each of the created Python modules. Visit the [documentation](docs) for each of the Python modules for more information about the scripts associated with it.
+- tinystories which focuses on data curation for training from scratch.
+- peft-curation which focuses on data curation for parameter-efficient fine-tuning use-cases.
Collaborator

Wrap in backticks so it appears as `peft-curation`

Contributor Author

Added backticks

README.md Outdated
@@ -132,8 +171,8 @@ Additionally, using the CPU-based modules the table below shows the time require
</tbody>
</table>

## Implementation
## Contribute to NeMo
Collaborator

Contribute to NeMo Curator

Contributor Author

Revised topic title


README.md Outdated
To install the CPU-only modules:

```
pip install
```
Collaborator

This command should have a dot in the end:

pip install .

Contributor Author

Added dot


### Install from the Repository

1. Clone the NeMo Curator repository in GitHub.
Collaborator

  1. Clone the NeMo Curator repository in GitHub.

Could you change this to:

  1. Clone the NeMo Curator repository in GitHub.
git clone https://github.com/NVIDIA/NeMo-Curator.git
cd NeMo-Curator

Contributor Author

Added the command lines.
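
Putting the two requested fixes together (the clone commands and the trailing dot in `pip install .`), the install-from-repository steps for the CPU-only modules might look like the following sketch. The `run` and `install_from_repo` helpers and the `DRY_RUN` variable are hypothetical conveniences added for illustration; only the three underlying commands come from this review.

```shell
# Hypothetical helper: print each command, and execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# Hypothetical wrapper around the steps discussed in the review.
install_from_repo() {
  run git clone https://github.com/NVIDIA/NeMo-Curator.git
  run cd NeMo-Curator
  # The trailing dot tells pip to install the package in the current directory.
  run pip install .
}
```

By default `install_from_repo` only prints the commands it would run; `DRY_RUN=0 install_from_repo` executes them, which requires network access and a working `pip`.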

README.md Outdated

### Python Library
To download your dataset, build your pipeline, and curate your dataset:
Collaborator

This doesn't read well. Could you please modify so it says something along the following lines:

The snippet below demonstrates the creation of a small data curation pipeline that downloads and curates a small subset of the Common Crawl dataset:

(the snippet goes here)

Contributor Author

Added sentence.


## Access Python Modules
Collaborator

It's a little too verbose. Could you please paraphrase and shorten?

Contributor Author

@jgerh jgerh May 28, 2024

Deleted the repetitive sentences; they are not needed in this section, and the concept is explained earlier.

@Maghoumi
Collaborator

Looks good to me @jgerh, thanks for addressing all the comments.

@ayushdg Could you please review/approve? It seems I cannot resolve conversations.

@ayushdg
Collaborator

ayushdg commented May 29, 2024

Thanks for the ping. @jgerh It looks like the commits for the PR were not signed and signed off using the `git commit -sS` flags, which is a requirement for Curator. More info on signing commits.

Additionally, the style check failures can be fixed by running pre-commit locally.

pip install pre-commit && pre-commit install && pre-commit run --all
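
As a rough illustration of the sign-off half of `git commit -sS`, the sketch below makes one commit with `-s` in a throwaway repository and prints the resulting message. The repository, identity, and file are made up for the demo. The `-S` flag, which additionally GPG-signs the commit, is left out because it needs a configured signing key.

```shell
# Work in a throwaway repository so the demo never touches a real checkout.
demo_repo=$(mktemp -d)
git -C "$demo_repo" init -q
git -C "$demo_repo" config user.name "Example Dev"
git -C "$demo_repo" config user.email "dev@example.com"
# Keep the demo independent of any global GPG-signing setting.
git -C "$demo_repo" config commit.gpgsign false

echo "hello" > "$demo_repo/README.md"
git -C "$demo_repo" add README.md

# -s appends a "Signed-off-by: Name <email>" trailer to the commit message.
git -C "$demo_repo" commit -q -s -m "Add README"

# Show the full message of the commit that was just created.
last_message=$(git -C "$demo_repo" log -1 --format=%B)
echo "$last_message"

rm -rf "$demo_repo"
```

For an existing unsigned commit at the tip of a branch, `git commit --amend --no-edit -sS` is one way to add both the sign-off and the signature, assuming a signing key is already set up; older commits would need a rebase.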

This was referenced May 30, 2024
@ayushdg
Collaborator

ayushdg commented May 31, 2024

Closing in favor of #93

@ayushdg ayushdg closed this May 31, 2024