Commit

Merge branch 'NVIDIA:main' into main
dpadmanabhan03 authored Jul 19, 2024
2 parents e812559 + fb12646 commit 30a5e68
Showing 37 changed files with 4,375 additions and 83 deletions.
4 changes: 2 additions & 2 deletions .pre-commit-config.yaml
100755 → 100644
@@ -22,7 +22,7 @@ ci:

repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.5.0
rev: v4.6.0
hooks:
- id: check-added-large-files
args: ['--maxkb=1000']
@@ -35,7 +35,7 @@ repos:
- id: trailing-whitespace

- repo: https://github.com/psf/black
rev: 24.3.0
rev: 24.4.2
hooks:
- id: black
name: Format code
7 changes: 7 additions & 0 deletions docs/user-guide/index.rst
@@ -18,6 +18,9 @@
:ref:`GPU Accelerated Exact and Fuzzy Deduplication <data-curator-gpu-deduplication>`
Both exact and fuzzy deduplication functionalities are supported in NeMo Curator and accelerated using RAPIDS cuDF.

:ref:`Synthetic Data Generation <data-curator-syntheticdata>`
Synthetic data generation tools and example pipelines are available within NeMo Curator.

:ref:`Downstream Task Decontamination <data-curator-downstream>`
After training, large language models are usually evaluated by their performance on downstream tasks consisting of unseen test data. When dealing with large datasets, there is a potential for leakage of this test data into the model’s training dataset. NeMo Curator allows you to remove sections of documents in your dataset that are present in downstream tasks.

@@ -27,6 +30,9 @@
:ref:`NeMo Curator on Kubernetes <data-curator-kubernetes>`
Demonstration of how to run NeMo Curator on a Dask cluster deployed on top of Kubernetes.

:ref:`NeMo Curator with NeMo SDK <data-curator-nemo-sdk>`
Example of how to use NeMo Curator with NeMo SDK to run on various platforms

`Tutorials <https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials>`__
To get started, you can explore the NeMo Curator GitHub repository and follow the available tutorials and notebooks. These resources cover various aspects of data curation, including training from scratch and Parameter-Efficient Fine-Tuning (PEFT).

@@ -46,3 +52,4 @@
personalidentifiableinformationidentificationandremoval.rst
distributeddataclassification.rst
kubernetescurator.rst
nemosdk.rst
127 changes: 127 additions & 0 deletions docs/user-guide/nemosdk.rst
@@ -0,0 +1,127 @@
.. _data-curator-nemo-sdk:

======================================
NeMo Curator with NeMo SDK
======================================
-----------------------------------------
NeMo SDK
-----------------------------------------

The NeMo SDK is a general-purpose tool for configuring and executing Python functions and scripts across various computing environments.
It is used across the NeMo Framework for managing machine learning experiments.
One of the key features of the NeMo SDK is the ability to run code locally or on platforms like SLURM with minimal changes.

-----------------------------------------
Usage
-----------------------------------------

We recommend becoming at least somewhat familiar with the NeMo SDK before diving in; see its documentation for details.

Let's walk through an example of how to launch a Slurm job using `examples/nemo_sdk/launch_slurm.py <https://github.com/NVIDIA/NeMo-Curator/blob/main/examples/nemo_sdk/launch_slurm.py>`_.

.. code-block:: python

    import nemo_sdk as sdk
    from nemo_sdk.core.execution import SlurmExecutor

    from nemo_curator.nemo_sdk import SlurmJobConfig


    @sdk.factory
    def nemo_curator_slurm_executor() -> SlurmExecutor:
        """
        Configure the following function with the details of your SLURM cluster
        """
        return SlurmExecutor(
            job_name_prefix="nemo-curator",
            account="my-account",
            nodes=2,
            exclusive=True,
            time="04:00:00",
            container_image="nvcr.io/nvidia/nemo:dev",
            container_mounts=["/path/on/machine:/path/in/container"],
        )

First, we need to define a factory that can produce a ``SlurmExecutor``.
This executor is where you define all of your cluster parameters. Note: NeMo SDK currently only supports running on SLURM clusters with `Pyxis <https://github.com/NVIDIA/pyxis>`_.
After this comes the main function:

.. code-block:: python

    # Path to NeMo-Curator/examples/slurm/container_entrypoint.sh on the SLURM cluster
    container_entrypoint = "/cluster/path/slurm/container_entrypoint.sh"
    # The NeMo Curator command to run
    curator_command = "text_cleaning --input-data-dir=/path/to/data --output-clean-dir=/path/to/output"
    curator_job = SlurmJobConfig(
        job_dir="/home/user/jobs",
        container_entrypoint=container_entrypoint,
        script_command=curator_command,
    )

First, we need to specify the path to `examples/slurm/container-entrypoint.sh <https://github.com/NVIDIA/NeMo-Curator/blob/main/examples/slurm/container-entrypoint.sh>`_ on the cluster.
This shell script is responsible for setting up the Dask cluster on Slurm, and it is the main script that will be run.

Second, we need to establish the NeMo Curator script we want to run.
This can be a command-line utility like ``text_cleaning`` shown above, or your own custom script run with ``python path/to/script.py`` (as in the sketch below).
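
For example, a hypothetical custom script could be wired in like this (the script path and flag below are illustrative, not part of the repository):

.. code-block:: python

    # Hypothetical: run your own curation script instead of a built-in utility
    curator_command = "python /path/to/my_curation_script.py --input-data-dir=/path/to/data"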


Finally, we combine all of these into a ``SlurmJobConfig``. This config has many options for configuring the Dask cluster.
We'll highlight a couple of important ones:

* ``device="cpu"`` determines the type of Dask cluster to initialize. If you are using GPU modules, set this to ``"gpu"``.
* ``interface="eth0"`` specifies the network interface to use for communication within the Dask cluster. It will likely be different on your Slurm cluster, so modify it as needed. You can determine which interfaces are available by running the following function on your cluster (a combined sketch follows the snippet below).

.. code-block:: python

    from nemo_curator import get_network_interfaces
    print(get_network_interfaces())
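
Putting these options together, a sketch of a GPU-oriented job configuration might look like the following (``device`` and ``interface`` are the options highlighted above; the values are illustrative):

.. code-block:: python

    curator_job = SlurmJobConfig(
        job_dir="/home/user/jobs",
        container_entrypoint=container_entrypoint,
        script_command=curator_command,
        device="gpu",  # initialize a GPU Dask cluster for GPU modules
        interface="eth0",  # use an interface reported by get_network_interfaces()
    )
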
.. code-block:: python

    executor = sdk.resolve(SlurmExecutor, "nemo_curator_slurm_executor")
    with sdk.Experiment("example_nemo_curator_exp", executor=executor) as exp:
        exp.add(curator_job.to_script(), tail_logs=True)
        exp.run(detach=False)

After configuring the job, we can finally run it.
First, we use the sdk to resolve our custom factory.
Next, we use it to begin an experiment named "example_nemo_curator_exp" running on our Slurm executor.

``exp.add(curator_job.to_script(), tail_logs=True)`` adds the NeMo Curator script to be part of the experiment.
It converts the ``SlurmJobConfig`` to a ``sdk.Script``.
This ``curator_job.to_script()`` has two important parameters:

* ``add_scheduler_file=True``
* ``add_device=True``

Both of these modify the command specified in ``curator_command``.
Setting both to ``True`` (the default) transforms the original command from:

.. code-block:: bash

    # Original command
    text_cleaning \
        --input-data-dir=/path/to/data \
        --output-clean-dir=/path/to/output

to:

.. code-block:: bash

    # Modified command
    text_cleaning \
        --input-data-dir=/path/to/data \
        --output-clean-dir=/path/to/output \
        --scheduler-file=/path/to/scheduler/file \
        --device="cpu"

As you can see, ``add_scheduler_file=True`` causes ``--scheduler-file=/path/to/scheduler/file`` to be appended to the command, and ``add_device=True`` causes ``--device="cpu"`` (or whatever the device is set to) to be appended.
``/path/to/scheduler/file`` is determined by ``SlurmJobConfig``, and ``device`` takes whatever value you specified for the ``device`` parameter earlier.

The scheduler file argument is necessary to connect to the Dask cluster on Slurm.
All NeMo Curator scripts accept both arguments, so the default is to automatically add them.
If your script is configured differently, feel free to turn these off.
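
For example, a minimal sketch of disabling both (using the ``to_script`` parameters named above):

.. code-block:: python

    # Keep curator_command exactly as written, with no --scheduler-file or --device appended
    exp.add(
        curator_job.to_script(add_scheduler_file=False, add_device=False),
        tail_logs=True,
    )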

The final line ``exp.run(detach=False)`` starts the experiment on the Slurm cluster.
18 changes: 18 additions & 0 deletions docs/user-guide/syntheticdata.rst
@@ -0,0 +1,18 @@

.. _data-curator-syntheticdata:

======================================
Synthetic Data Generation
======================================
--------------------------------------
Background
--------------------------------------
Synthetic data generation has become increasingly useful in large language model training.
It is used in pretraining, fine-tuning, and evaluation.
Synthetically generated data can be useful for adapting an LLM to low-resource languages or domains, performing knowledge distillation from other models, and more.
There are a variety of ways to construct synthetic data generation pipelines, with numerous LLM and classical filters.

NeMo Curator has a simple, easy-to-use set of tools that allow you to use prebuilt synthetic generation pipelines or build your own.
Any model inference service that uses the OpenAI API is compatible with the synthetic data generation module, allowing you to generate your data from any model.
NeMo Curator has prebuilt synthetic data generation pipelines for supervised fine-tuning (SFT) and preference data that were used to generate data for the training of `Nemotron-4 340B <https://research.nvidia.com/publication/2024-06_nemotron-4-340b>`_.
You can also easily interweave filtering and deduplication steps from the other NeMo Curator modules into your synthetic data pipeline.
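
As a brief, hedged sketch of that compatibility (``OpenAIClient`` is exported by ``nemo_curator`` in this commit; the wrapped ``openai`` client, endpoint, model name, and ``query_model`` call below are assumptions for illustration):

.. code-block:: python

    from openai import OpenAI

    from nemo_curator import OpenAIClient

    # Any OpenAI-API-compatible inference service can back the client (endpoint is illustrative)
    openai_client = OpenAI(
        base_url="https://integrate.api.nvidia.com/v1",
        api_key="<your API key>",
    )
    client = OpenAIClient(openai_client)

    # Assumed query interface: ask the hosted model for a synthetic sample
    responses = client.query_model(
        model="mistralai/mixtral-8x7b-instruct-v0.1",
        messages=[{"role": "user", "content": "Write a math word problem for a middle school student."}],
    )
    print(responses[0])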
56 changes: 56 additions & 0 deletions examples/nemo_sdk/launch_slurm.py
@@ -0,0 +1,56 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import nemo_sdk as sdk
from nemo_sdk.core.execution import SlurmExecutor

from nemo_curator.nemo_sdk import SlurmJobConfig


@sdk.factory
def nemo_curator_slurm_executor() -> SlurmExecutor:
    """
    Configure the following function with the details of your SLURM cluster
    """
    return SlurmExecutor(
        job_name_prefix="nemo-curator",
        account="my-account",
        nodes=2,
        exclusive=True,
        time="04:00:00",
        container_image="nvcr.io/nvidia/nemo:dev",
        container_mounts=["/path/on/machine:/path/in/container"],
    )


def main():
    # Path to NeMo-Curator/examples/slurm/container_entrypoint.sh on the SLURM cluster
    container_entrypoint = "/cluster/path/slurm/container_entrypoint.sh"
    # The NeMo Curator command to run
    # This command can be substituted with any NeMo Curator command
    curator_command = "text_cleaning --input-data-dir=/path/to/data --output-clean-dir=/path/to/output"
    curator_job = SlurmJobConfig(
        job_dir="/home/user/jobs",
        container_entrypoint=container_entrypoint,
        script_command=curator_command,
    )

    executor = sdk.resolve(SlurmExecutor, "nemo_curator_slurm_executor")
    with sdk.Experiment("example_nemo_curator_exp", executor=executor) as exp:
        exp.add(curator_job.to_script(), tail_logs=True)
        exp.run(detach=False)


if __name__ == "__main__":
    main()
8 changes: 7 additions & 1 deletion examples/slurm/container-entrypoint.sh
@@ -16,6 +16,12 @@

# Start the scheduler on the rank 0 node
if [[ -z "$SLURM_NODEID" ]] || [[ $SLURM_NODEID == 0 ]]; then
    # Make the directories needed
    echo "Making log directory $LOGDIR"
    mkdir -p $LOGDIR
    echo "Making profile directory $PROFILESDIR"
    mkdir -p $PROFILESDIR

    echo "Starting scheduler"
    if [[ $DEVICE == 'cpu' ]]; then
        dask scheduler \
@@ -58,7 +64,7 @@ fi
sleep 60

if [[ -z "$SLURM_NODEID" ]] || [[ $SLURM_NODEID == 0 ]]; then
    echo "Starting $SCRIPT_PATH"
    echo "Starting $SCRIPT_COMMAND"
    bash -c "$SCRIPT_COMMAND"
    touch $DONE_MARKER
fi
6 changes: 2 additions & 4 deletions examples/slurm/start-slurm.sh
@@ -28,7 +28,8 @@
export BASE_JOB_DIR=`pwd`/nemo-curator-jobs
export JOB_DIR=$BASE_JOB_DIR/$SLURM_JOB_ID

# Logging information
# Directory for Dask cluster communication and logging
# Must be paths inside the container that are accessible across nodes
export LOGDIR=$JOB_DIR/logs
export PROFILESDIR=$JOB_DIR/profiles
export SCHEDULER_FILE=$LOGDIR/scheduler.json
@@ -74,9 +75,6 @@ export DASK_DATAFRAME__QUERY_PLANNING=False
# End easy customization
# =================================================================

mkdir -p $LOGDIR
mkdir -p $PROFILESDIR

# Start the container
srun \
--container-mounts=${MOUNTS} \
9 changes: 8 additions & 1 deletion nemo_curator/__init__.py
@@ -34,7 +34,14 @@


from .modules import *
from .utils.distributed_utils import get_client
from .services import (
    AsyncLLMClient,
    AsyncOpenAIClient,
    LLMClient,
    NemoDeployClient,
    OpenAIClient,
)
from .utils.distributed_utils import get_client, get_network_interfaces

# Dask will automatically convert the list score type
# to a string without this option.
40 changes: 39 additions & 1 deletion nemo_curator/datasets/doc_dataset.py
@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

from typing import List, Union
from typing import List, Optional, Union

import dask.dataframe as dd

@@ -130,6 +130,44 @@ def to_pickle(
    ):
        raise NotImplementedError("DocumentDataset does not support to_pickle yet")

    @classmethod
    def from_pandas(
        cls,
        data,
        npartitions: Optional[int] = 1,
        chunksize: Optional[int] = None,
        sort: Optional[bool] = True,
        name: Optional[str] = None,
    ):
        """
        Creates a document dataset from a pandas data frame.

        For more information on the arguments see Dask's from_pandas documentation
        https://docs.dask.org/en/stable/generated/dask.dataframe.from_pandas.html

        Args:
            data: A pandas dataframe
        Returns:
            A document dataset with a pandas backend (on the CPU).
        """
        return cls(
            dd.from_pandas(
                data=data,
                npartitions=npartitions,
                chunksize=chunksize,
                sort=sort,
                name=name,
            )
        )

    def to_pandas(self):
        """
        Creates a pandas dataframe from a DocumentDataset

        Returns:
            A pandas dataframe (on the CPU)
        """
        return self.df.to_backend("pandas").compute()


def _read_json_or_parquet(
    input_files: Union[str, List[str]],
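
A brief usage sketch of the new from_pandas/to_pandas helpers (the nemo_curator.datasets import path and the DataFrame contents are assumptions for illustration):

import pandas as pd

from nemo_curator.datasets import DocumentDataset

pdf = pd.DataFrame({"text": ["first document", "second document"]})

# Build a Dask-backed DocumentDataset from a pandas DataFrame ...
dataset = DocumentDataset.from_pandas(pdf, npartitions=2)

# ... apply NeMo Curator modules to `dataset` here ...

# ... then convert back to a pandas DataFrame on the CPU
result = dataset.to_pandas()
print(result.head())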
4 changes: 2 additions & 2 deletions nemo_curator/modules/semantic_dedup.py
@@ -525,8 +525,8 @@ def __init__(
        cache_dir = config.cache_dir
        self.embedding_creator = EmbeddingCreator(
            embedding_model_name_or_path=config.embedding_model_name_or_path,
            max_memory=config.embedding_max_mem_gb,
            batch_size=config.embedding_batch_size,
            embedding_max_mem_gb=config.embedding_max_mem_gb,
            embedding_batch_size=config.embedding_batch_size,
            input_column=config.input_column,
            embedding_output_dir=os.path.join(cache_dir, config.embeddings_save_loc),
            logger=logger,
17 changes: 17 additions & 0 deletions nemo_curator/nemo_sdk/__init__.py
@@ -0,0 +1,17 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from .slurm import SlurmJobConfig

__all__ = ["SlurmJobConfig"]