NeMo Curator ReadMe Updates #62

jgerh · 2024-05-10T21:41:49Z

No description provided.

Maghoumi

Thanks for putting this PR together. I took one pass and left some minor comments.

I'll let the rest of the team also review and leave feedback.

Maghoumi · 2024-05-13T16:47:36Z

README.md

+These modules offer flexibility and permit reordering, with only a few exceptions. In addition, the [NeMo Framework Launcher](https://github.com/NVIDIA/NeMo-Megatron-Launcher) provides pre-built pipelines that can serve as a foundation for your customization use cases.
+
+## Get Started
+


We don't want users to jump around too much to get started. Thus, I would add Installation as the first subheading here, like so:

Get Started

Installation

Before installing, ensure the following requirements are met

- blah blah

Clone the repo....
To install the CPU-only modules.... etc.

Use the Python Library

(The code snippet you have below).

Tutorials

Link to the two tutorials we have (here: https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials):
- tinystories, which focuses on data curation for training from scratch.
- peft-curation, which focuses on data curation for parameter-efficient fine-tuning use cases.

Maghoumi · 2024-05-13T16:50:36Z

README.md

+
+- Python 3.10 or above
+- CUDA 12 (or above)
+- NVIDIA GPU


NVIDIA GPU is optional.
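
To illustrate the distinction, here is a minimal Python sketch that treats the Python version as a hard requirement and the GPU as optional. The function name `check_prerequisites`, the result keys, and the use of `nvidia-smi` on `PATH` as a rough GPU indicator are assumptions made for illustration, not part of NeMo Curator. The sketch does not verify the CUDA version.

```python
import shutil
import sys

# "Python 3.10 or above" comes from the README diff under review.
MIN_PYTHON = (3, 10)


def check_prerequisites() -> dict:
    """Report which listed prerequisites appear to be met (hypothetical helper)."""
    python_ok = sys.version_info[:2] >= MIN_PYTHON
    # Assumption: finding nvidia-smi on PATH is only a rough hint that an
    # NVIDIA GPU and driver are present; it says nothing about CUDA 12.
    gpu_tool_found = shutil.which("nvidia-smi") is not None
    return {
        "python_ok": python_ok,
        "gpu_tool_found": gpu_tool_found,
        # The GPU is optional, so CPU-only use depends on Python alone.
        "cpu_only_ok": python_ok,
    }


if __name__ == "__main__":
    for name, value in check_prerequisites().items():
        print(f"{name}: {value}")
```

A missing GPU leaves `cpu_only_ok` unaffected, which mirrors the suggestion that the README should not list the GPU as a hard requirement.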

Maghoumi · 2024-05-13T16:50:47Z

README.md

+
+## Prerequisites
+
+- Python 3.10 or above


Maghoumi · 2024-05-13T16:56:00Z

README.md

+NeMo Curator is a Python library composed of several scalable data-mining modules, specifically designed for curating Natural Language Processing (NLP) data to train Large Language Models (LLMs). It enables NLP researchers to extract high-quality text from vast, uncurated web corpora efficiently, supporting the development of more accurate and powerful language models.
+
+NeMo Curator leverages [Dask](https://www.dask.org/) and [RAPIDS](https://developer.nvidia.com/rapids) to scale data curation and provide GPU acceleration. The Python library offers easy-to-use methods for expanding the functionality of your curation pipeline while eliminating scalability concerns.
+


Let's add a table of contents here right after the intro. I think it's standard practice nowadays.
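
One way to draft such a table of contents is to generate it from the README headings. The following is a minimal sketch, not tooling used in this PR: `build_toc` is a hypothetical helper, and the slug rule (lowercase, drop punctuation, spaces to hyphens) only approximates how GitHub derives heading anchors.

```python
import re


def build_toc(markdown: str) -> str:
    """Build a bulleted table of contents from level-2 and level-3 headings."""
    entries = []
    in_fence = False
    for line in markdown.splitlines():
        # Skip anything inside fenced code blocks so "## ..." in code is ignored.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = re.match(r"^(#{2,3})\s+(.+?)\s*$", line)
        if not match:
            continue
        hashes, title = match.groups()
        # Approximate GitHub anchor: lowercase, strip punctuation, hyphenate spaces.
        slug = re.sub(r"[^\w\- ]", "", title.lower()).replace(" ", "-")
        indent = "  " * (len(hashes) - 2)
        entries.append(f"{indent}- [{title}](#{slug})")
    return "\n".join(entries)


if __name__ == "__main__":
    sample = "# NeMo Curator\n\n## Get Started\n\n### Installation\n\n## Key Features\n"
    print(build_toc(sample))
```

The generated links should be checked against the rendered README, since duplicate or unusual headings can produce different anchors.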

"enhancing LLM training accuracy through GPU-accelerated data curation using ..."
This sentence is not technically accurate. Using GPU acceleration alone does not directly lead to enhanced LLM training accuracy.

It's best to say it greatly accelerates data curation using GPUs, thus saving time.

Revised introductory paragraphs to include comment

Thanks. Two more minor things:

DocumentDataset -> `DocumentDataset` (wrap in backticks)

dataframe -> `DataFrame` (CapitalCase and backticks)

Added backticks

Maghoumi · 2024-05-13T17:32:37Z

README.md

+
+NeMo Curator leverages [Dask](https://www.dask.org/) and [RAPIDS](https://developer.nvidia.com/rapids) to scale data curation and provide GPU acceleration. The Python library offers easy-to-use methods for expanding the functionality of your curation pipeline while eliminating scalability concerns.
+
+## Key Features


This reads a bit long to me. Maybe we could retain the original main bullet points/links, but remove the sub-bullet points? What do you think?

Maghoumi · 2024-05-13T17:37:35Z

README.md

+
+Note: Other methods are available to run NeMo Curator on SLURM. For example, refer to the example scripts in [`examples/slurm`](examples/slurm/) for information on how to run NeMo Curator on SLURM without the NeMo Framework Launcher.
+
+## Implement NeMo Curator


I think the contents here are already covered by everything else you've added. I'd vote for removing it for brevity.

Maghoumi · 2024-05-25T14:17:19Z

README.md

+NeMo Curator is a Python library composed of several scalable data-mining modules, specifically designed for curating Natural Language Processing (NLP) data to train Large Language Models (LLMs). It enables NLP researchers to extract high-quality text from vast, uncurated web corpora efficiently, supporting the development of more accurate and powerful language models.
+
+NeMo Curator leverages [Dask](https://www.dask.org/) and [RAPIDS](https://developer.nvidia.com/rapids) to scale data curation and provide GPU acceleration. The Python library offers easy-to-use methods for expanding the functionality of your curation pipeline while eliminating scalability concerns.
+


"enhancing LLM training accuracy through GPU-accelerated data curation using ..."
This sentence is not technically accurate. Using GPU acceleration alone does not directly lead to enhanced LLM training accuracy.

It's best to say it greatly accelerates data curation using GPUs, thus saving time.

Maghoumi · 2024-05-25T14:19:14Z

README.md


-## Installation
+NeMo Curator is available in the [NeMo Framework Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo/tags). The latest release of NeMo Curator comes preinstalled in the container.


FYI, there are two ways to use NeMo Curator: either install from this repo (which you are already explaining), or through the NeMo framework container.

Since you are explaining how to install from this repo, I suggest moving this a few lines up (before Install NeMo Curator).

Revised section and added two subtopics for tasks

Maghoumi · 2024-05-25T14:20:17Z

README.md


-We provide CLI scripts to use as well in case those are more convienent. The scripts under `nemo_curator/scripts` map closely with each of the created python modules. Visit the [documentation](docs) for each of the python modules for more information about the scripts associated with it.
+- tinystories which focuses on data curation for training from scratch.


Wrap in backticks so it appears as `tinystories`

Added backticks

Maghoumi · 2024-05-25T14:20:26Z

README.md


-We provide CLI scripts to use as well in case those are more convienent. The scripts under `nemo_curator/scripts` map closely with each of the created python modules. Visit the [documentation](docs) for each of the python modules for more information about the scripts associated with it.
+- tinystories which focuses on data curation for training from scratch.
+- peft-curation which focuses on data curation for parameter-efficient fine-tuning use-cases.


Wrap in backticks so it appears as `peft-curation`

Added backticks

Maghoumi · 2024-05-25T14:22:09Z

README.md

@@ -132,8 +171,8 @@ Additionally, using the CPU-based modules the table below shows the time require
  </tbody>
 </table>

-## Implementation
+## Contribute to NeMo


Contribute to NeMo Curator

Revised topic title

Maghoumi · 2024-05-28T17:52:38Z

README.md

+
+NeMo Curator leverages [Dask](https://www.dask.org/) and [RAPIDS](https://developer.nvidia.com/rapids) to scale data curation and provide GPU acceleration. The Python library offers easy-to-use methods for expanding the functionality of your curation pipeline while eliminating scalability concerns.
+
+## Key Features


Maghoumi · 2024-05-28T17:53:55Z

README.md

+NeMo Curator is a Python library composed of several scalable data-mining modules, specifically designed for curating Natural Language Processing (NLP) data to train Large Language Models (LLMs). It enables NLP researchers to extract high-quality text from vast, uncurated web corpora efficiently, supporting the development of more accurate and powerful language models.
+
+NeMo Curator leverages [Dask](https://www.dask.org/) and [RAPIDS](https://developer.nvidia.com/rapids) to scale data curation and provide GPU acceleration. The Python library offers easy-to-use methods for expanding the functionality of your curation pipeline while eliminating scalability concerns.
+


Thanks. Two more minor things:

DocumentDataset -> `DocumentDataset` (wrap in backticks)

dataframe -> `DataFrame` (CapitalCase and backticks)

Maghoumi · 2024-05-28T18:10:58Z

README.md

+    To install the CPU-only modules:
+
+    ```
+    pip install


This command should have a dot at the end:

pip install .

Maghoumi · 2024-05-28T18:12:17Z

README.md

+
+### Install from the Repository
+
+1. Clone the NeMo Curator repository in GitHub.


Clone the NeMo Curator repository in GitHub.

Could you change this to:

Clone the NeMo Curator repository in GitHub.

git clone https://github.com/NVIDIA/NeMo-Curator.git
cd NeMo-Curator

add command line

Maghoumi · 2024-05-28T18:14:53Z

README.md


-### Python Library
+To download your dataset, build your pipeline, and curate your dataset:


This doesn't read well. Could you please modify so it says something along the following lines:

The snippet below demonstrates the creation of a small data curation pipeline that downloads and curates a small subset of the Common Crawl dataset:

(the snippet goes here)

Added sentence.

Maghoumi · 2024-05-28T18:16:04Z

README.md


+## Access Python Modules


It's a little too verbose. Could you please paraphrase and shorten?

Deleted repetitive sentences; they are not needed in this section, and the concept is explained earlier.

Maghoumi · 2024-05-28T20:59:10Z

Looks good to me @jgerh, thanks for addressing all the comments.

@ayushdg Could you please review/approve? It seems I cannot resolve conversations.

ayushdg · 2024-05-29T00:53:04Z

Thanks for the ping. @jgerh It looks like the commits for this PR were not signed and signed off using the `git commit -sS` flags, which is a requirement for Curator. More info on signing commits.

Additionally, the style check failures can be fixed by running pre-commit locally.

pip install pre-commit && pre-commit install && pre-commit run --all
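
For context on the sign-off requirement: the `-s` flag appends a `Signed-off-by:` trailer to the commit message, while `-S` adds a cryptographic signature. Below is a minimal Python sketch of checking a commit message for that trailer. `has_signoff` is a hypothetical helper, the regular expression is a simplified assumption about the trailer format, and the sketch does not verify the signature added by `-S`.

```python
import re

# Simplified assumption about the trailer format: "Signed-off-by: Name <email>".
SIGNOFF_RE = re.compile(
    r"^Signed-off-by: .+ <[^<>\s]+@[^<>\s]+>$",
    re.MULTILINE,
)


def has_signoff(commit_message: str) -> bool:
    """Return True if the message contains a Signed-off-by trailer line."""
    return SIGNOFF_RE.search(commit_message) is not None


if __name__ == "__main__":
    message = "Update readme\n\nSigned-off-by: Jane Doe <jane@example.com>\n"
    print(has_signoff(message))
```

A check like this only inspects the message text; whether a commit is actually signed has to be confirmed with Git itself.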

ayushdg · 2024-05-31T16:54:59Z

Closing in favor of #93

jgerh added 2 commits May 8, 2024 16:11

Update

7e270fd

NeMo Curator ReadMe Updates

17fdf61

jgerh marked this pull request as ready for review May 10, 2024 21:42

Maghoumi reviewed May 13, 2024

NeMo Curator Updates

8914de5

Maghoumi reviewed May 25, 2024

NeMo Curator ReadMe Updates Rev

06bcba4

Maghoumi reviewed May 28, 2024

NeMo Curator ReadMe Updates

998914c

This was referenced May 30, 2024

Update documentation for new version #83

Merged

Update readme #93

Merged

Merge branch 'NVIDIA:main' into update-readme

ab9d317

ayushdg closed this May 31, 2024
