Update documentation for new version #83

Merged 3 commits on Jun 3, 2024
33 changes: 28 additions & 5 deletions README.md
@@ -67,13 +67,29 @@ Before installing NeMo Curator, ensure that the following requirements are met:

## Install NeMo Curator

-Two options are available for installing NeMo Curator. You can install it from the repository or through the NeMo Framework container.
+You can install NeMo-Curator from PyPi, from source or get it through the NeMo Framework container.

-### Install from the Repository
+### PyPi

NeMo Curator can be installed via PyPi as follows -

To install the CPU-only modules:

```bash
pip install nemo-curator
```

To install the CPU and CUDA-accelerated modules:

```bash
pip install --extra-index-url https://pypi.nvidia.com nemo-curator[cuda12x]
```
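
After either PyPI install command, a quick check can confirm which version ended up in the environment. The snippet below is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and it assumes the distribution is registered under the same name used in the `pip install` commands above (`nemo-curator`).

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "nemo-curator" is the distribution name used in the pip commands above.
version = installed_version("nemo-curator")
if version is None:
    print("nemo-curator is not installed in this environment")
else:
    print(f"nemo-curator {version} is installed")
```

This only inspects package metadata, so it does not import the library or require a GPU.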

### From Source

1. Clone the NeMo Curator repository in GitHub.

```
```bash
git clone https://github.com/NVIDIA/NeMo-Curator.git
cd NeMo-Curator
```
@@ -82,20 +98,27 @@ Two options are available for installing NeMo Curator. You can install it from

To install the CPU-only modules:

```
```bash
pip install .
```

To install the CPU and CUDA-accelerated modules:

```
```bash
pip install --extra-index-url https://pypi.nvidia.com ".[cuda12x]"
```
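
The PyPI and source installs differ only in the install target, and the CUDA variants add the NVIDIA package index and the `cuda12x` extra. As an illustration, the four documented commands can be generated from two flags. This is a sketch, not part of NeMo Curator: the function `pip_install_command` and the constant `NVIDIA_INDEX` are hypothetical names, and the from-source variants assume the working directory is the cloned repository root.

```python
from typing import List

# Extra index URL taken from the CUDA install commands in this README.
NVIDIA_INDEX = "https://pypi.nvidia.com"


def pip_install_command(from_source: bool, cuda: bool) -> List[str]:
    """Build the pip argument list for one of the four documented install variants."""
    # "." installs from the current checkout; "nemo-curator" installs from PyPI.
    target = "." if from_source else "nemo-curator"
    command = ["pip", "install"]
    if cuda:
        command += ["--extra-index-url", NVIDIA_INDEX]
        target += "[cuda12x]"
    return command + [target]


# Print all four variants without running them.
for from_source in (False, True):
    for cuda in (False, True):
        print(" ".join(pip_install_command(from_source, cuda)))
```

Returning a list rather than a single string means the result can be passed to `subprocess.run(command, check=True)` without shell quoting around the square brackets.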

### Install from the NeMo Framework Container

NeMo Curator is available in the [NeMo Framework Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo/tags). The latest release of NeMo Curator comes preinstalled in the container.

If you want the latest commit inside the container, uninstall the existing version using:

```bash
pip uninstall nemo-curator
```
And follow the instructions for installing from source from [above](#from-source).

## Use the Python Library

The following snippet demonstrates how to create a small data curation pipeline that downloads and curates a small subset of the Common Crawl dataset.
4 changes: 2 additions & 2 deletions setup.py
@@ -21,7 +21,7 @@

setup(
name="nemo_curator",
-version="0.2.0",
+version="0.3.0",
description="Scalable Data Preprocessing Tool for "
"Training Large Language Models",
long_description=long_description,
@@ -54,7 +54,7 @@
"jieba==0.42.1",
"comment_parser",
"beautifulsoup4",
-"mwparserfromhell @ git+https://github.com/earwig/mwparserfromhell.git@0f89f44",
+"mwparserfromhell==0.6.5",
"spacy>=3.6.0, <4.0.0",
"presidio-analyzer==2.2.351",
"presidio-anonymizer==2.2.351",
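
The `mwparserfromhell` change in setup.py swaps a direct reference to a Git commit for an exact version pin. PyPI generally rejects uploads that declare direct-reference dependencies, which may be why this change accompanies the new PyPI install instructions; the PR itself does not state the reason. The sketch below shows one rough way to tell the requirement styles in this hunk apart. It is a simplified string check, not a full PEP 508 parser, and `classify_requirement` is a hypothetical helper.

```python
import re


def classify_requirement(requirement: str) -> str:
    """Roughly classify a requirement string (simplified; not a full PEP 508 parser)."""
    if " @ " in requirement:
        # Direct reference, e.g. "name @ git+https://..."
        return "direct-reference"
    if re.fullmatch(r"[A-Za-z0-9._-]+==[^,*\s]+", requirement):
        # A single "==" clause with no wildcard
        return "exact-pin"
    if re.search(r"[<>=!~]", requirement):
        # Any other version specifier, e.g. ">=3.6.0, <4.0.0"
        return "version-range"
    return "unconstrained"


# Requirement strings copied from the setup.py hunk above.
for req in (
    "mwparserfromhell @ git+https://github.com/earwig/mwparserfromhell.git@0f89f44",
    "mwparserfromhell==0.6.5",
    "spacy>=3.6.0, <4.0.0",
    "comment_parser",
):
    print(f"{classify_requirement(req):17} {req}")
```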