NeMo Curator ReadMe Updates #62
Thanks for putting this PR together. I took one pass and left some minor comments.
I'll let the rest of the team also review and leave feedback.
README.md
Outdated
These modules offer flexibility and permit reordering, with only a few exceptions. In addition, the [NeMo Framework Launcher](https://github.com/NVIDIA/NeMo-Megatron-Launcher) provides pre-built pipelines that can serve as a foundation for your customization use cases.

## Get Started
We don't want users to jump around too much to get started. Thus, I would add `Installation` as the first subheading here, like so:
Get Started
Installation
Before installing, ensure the following requirements are met
- blah blah
Clone the repo....
To install the CPU-only modules.... etc.
Use the Python Library
(The code snippet you have below).
Tutorials
Link to the two tutorials we have (here: https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials)
- `tinystories`, which focuses on data curation for training from scratch.
- `peft-curation`, which focuses on data curation for parameter-efficient fine-tuning use-cases.
Done
README.md
Outdated
- Python 3.10 or above
- CUDA 12 (or above)
- NVIDIA GPU
NVIDIA GPU is optional.
Done
README.md
Outdated
## Prerequisites

- Python 3.10 or above
(or above)
Done
NeMo Curator is a Python library composed of several scalable data-mining modules, specifically designed for curating Natural Language Processing (NLP) data to train Large Language Models (LLMs). It enables NLP researchers to extract high-quality text from vast, uncurated web corpora efficiently, supporting the development of more accurate and powerful language models.

NeMo Curator leverages [Dask](https://www.dask.org/) and [RAPIDS](https://developer.nvidia.com/rapids) to scale data curation and provide GPU acceleration. The Python library offers easy-to-use methods for expanding the functionality of your curation pipeline while eliminating scalability concerns.
Let's add a table of contents here right after the intro. I think it's standard practice nowadays.
enhancing LLM training accuracy through GPU-accelerated data curation using ...
This sentence is not technically accurate. The usage of GPU acceleration alone does not directly lead to enhanced LLM training accuracy.
It's best to say it greatly accelerates data curation using GPUs, thus saving time.
Revised introductory paragraphs to include comment
Thanks. Two more minor things:
- DocumentDataset -> `DocumentDataset` (wrap in backticks)
- dataframe -> `DataFrame` (CapitalCase and backticks)
Added backticks
NeMo Curator leverages [Dask](https://www.dask.org/) and [RAPIDS](https://developer.nvidia.com/rapids) to scale data curation and provide GPU acceleration. The Python library offers easy-to-use methods for expanding the functionality of your curation pipeline while eliminating scalability concerns.

## Key Features
This reads a bit long to me. Maybe we could retain the original main bullet points/links, but remove the sub-bullet points? What do you think?
Done.
README.md
Outdated
Note: Other methods are available to run NeMo Curator on SLURM. For example, refer to the example scripts in [`examples/slurm`](examples/slurm/) for information on how to run NeMo Curator on SLURM without the NeMo Framework Launcher.

## Implement NeMo Curator
I think the contents here are already covered by everything else you've added. I'd vote for removing it for brevity.
Done
README.md
Outdated
## Installation
NeMo Curator is available in the [NeMo Framework Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo/tags). The latest release of NeMo Curator comes preinstalled in the container.
FYI, there are two ways to use NeMo Curator: either install it from this repo (which you are already explaining), or use it through the NeMo Framework Container.
Since you are explaining how to install from this repo, I suggest moving this a few lines up (before Install NeMo Curator).
Revised section and added two subtopics for tasks
README.md
Outdated
We provide CLI scripts to use as well in case those are more convienent. The scripts under `nemo_curator/scripts` map closely with each of the created python modules. Visit the [documentation](docs) for each of the python modules for more information about the scripts associated with it.
+- tinystories which focuses on data curation for training from scratch.
Wrap in backticks so it appears as `tinystories`
Added backticks
README.md
Outdated
We provide CLI scripts to use as well in case those are more convienent. The scripts under `nemo_curator/scripts` map closely with each of the created python modules. Visit the [documentation](docs) for each of the python modules for more information about the scripts associated with it.
+- tinystories which focuses on data curation for training from scratch.
+- peft-curation which focuses on data curation for parameter-efficient fine-tuning use-cases.
Wrap in backticks so it appears as `peft-curation`
Added backticks
README.md
Outdated
@@ -132,8 +171,8 @@ Additionally, using the CPU-based modules the table below shows the time require
</tbody>
</table>

## Implementation
## Contribute to NeMo
Contribute to NeMo Curator
Revised topic title
README.md
Outdated
To install the CPU-only modules:

```
pip install
This command should have a dot at the end:
`pip install .`
Added dot
### Install from the Repository

1. Clone the NeMo Curator repository in GitHub.
- Clone the NeMo Curator repository in GitHub.
Could you change this to:
- Clone the NeMo Curator repository in GitHub.
git clone https://github.com/NVIDIA/NeMo-Curator.git
cd NeMo-Curator
Added the command lines.
README.md
Outdated
### Python Library
To download your dataset, build your pipeline, and curate your dataset:
This doesn't read well. Could you please modify so it says something along the following lines:
The snippet below demonstrates the creation of a small data curation pipeline that downloads and curates a small subset of the Common Crawl dataset:
(the snippet goes here)
Added sentence.
## Access Python Modules
It's a little too verbose. Could you please paraphrase and shorten?
Deleted repetitive sentences, not needed in this section, and concept explained earlier.
Thanks for the ping. @jgerh, it looks like the commits for the PR were not signed and signed off. Additionally, the style check failures can be fixed by running
|
Closing in favor of #93 |
No description provided.