Commit
Add Synthetic Data Generation Module (#136)
* Begin implementation on OpenAI client

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Fix relative import

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add temperature

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Modify client interface and begin ultrachat

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Change type annotation in openai client

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Make imports easier

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Reformat to match nemotron report

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add yaml conversion

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Fix index error

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add error handling for yaml parsing

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Fix error

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add additional yaml parsing check

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add more yaml error handling

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Export conversion error

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Change variable naming

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Make error catching more general

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Refactor list out of nemotron

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add prompt helper function

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add revisions and writing prompts

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Fix default prompt templates

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add closed qa

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Fix prompt

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add math and coding

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add problem generation

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Rename function

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add dialogue support

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Fix mispell

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add two turn generation

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add reward model as judge

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Refactor reward query

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add error handling for non-reward models

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add error handling to sync client

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add open qa pipeline

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Improve docs and add writing pipeline

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add closed qa pipeline

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add math pipeline

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add python pipeline

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add async nemotron generator

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Fix await with index

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add seed parameter

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add missing await

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Fix parameter names

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Fix subscript await issues

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Switch parsing method for reward model

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add initial docs

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add nemo deploy client

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add easy import

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Move conversation formatter

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add other file

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Update nemotron import

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Update model client import

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Remove model in query call

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add extra index

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Fix response indexing

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add top k

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Remove extras

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add safe import for nemo deploy

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add pandas conversions

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add partition default

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add no format

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Move no format location

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Use top_k in nemo client

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Address vibhu's review

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add logging import

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Fix import

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Fix tqdm

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add missing awaits

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Standardize names

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Address Ayush nit

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

---------

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
ryantwolf authored Jul 9, 2024
1 parent 88d8be0 commit f572314
Showing 18 changed files with 3,883 additions and 2 deletions.
3 changes: 3 additions & 0 deletions docs/user-guide/index.rst
Original file line number Diff line number Diff line change
@@ -18,6 +18,9 @@
:ref:`GPU Accelerated Exact and Fuzzy Deduplication <data-curator-gpu-deduplication>`
Both exact and fuzzy deduplication functionalities are supported in NeMo Curator and accelerated using RAPIDS cuDF.

:ref:`Synthetic Data Generation <data-curator-syntheticdata>`
Synthetic data generation tools and example pipelines are available within NeMo Curator.

:ref:`Downstream Task Decontamination <data-curator-downstream>`
After training, large language models are usually evaluated by their performance on downstream tasks consisting of unseen test data. When dealing with large datasets, there is a potential for leakage of this test data into the model’s training dataset. NeMo Curator allows you to remove sections of documents in your dataset that are present in downstream tasks.

18 changes: 18 additions & 0 deletions docs/user-guide/syntheticdata.rst
@@ -0,0 +1,18 @@

.. _data-curator-syntheticdata:

======================================
Synthetic Data Generation
======================================
--------------------------------------
Background
--------------------------------------
Synthetic data generation has become increasingly useful in large language model training.
It is used in pretraining, fine-tuning, and evaluation.
Synthetically generated data can be useful for adapting an LLM to low-resource languages or domains, or for performing knowledge distillation from other models, among other purposes.
There are a variety of ways to construct synthetic data generation pipelines, with numerous LLM and classical filters.

NeMo Curator has a simple, easy-to-use set of tools that allow you to use prebuilt synthetic generation pipelines or build your own.
Any model inference service that uses the OpenAI API is compatible with the synthetic data generation module, allowing you to generate your data from any model.
NeMo Curator has prebuilt synthetic data generation pipelines for supervised fine-tuning (SFT) and preference data that were used to generate data for the training of `Nemotron-4 340B <https://research.nvidia.com/publication/2024-06_nemotron-4-340b>`_.
You can also easily interleave filtering and deduplication steps in your synthetic data pipeline with the other modules in NeMo Curator.
7 changes: 7 additions & 0 deletions nemo_curator/__init__.py
@@ -34,6 +34,13 @@


from .modules import *
from .services import (
    AsyncLLMClient,
    AsyncOpenAIClient,
    LLMClient,
    NemoDeployClient,
    OpenAIClient,
)
from .utils.distributed_utils import get_client

# Dask will automatically convert the list score type
40 changes: 39 additions & 1 deletion nemo_curator/datasets/doc_dataset.py
@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

-from typing import List, Union
+from typing import List, Optional, Union

import dask.dataframe as dd

@@ -130,6 +130,44 @@ def to_pickle(
    ):
        raise NotImplementedError("DocumentDataset does not support to_pickle yet")

    @classmethod
    def from_pandas(
        cls,
        data,
        npartitions: Optional[int] = 1,
        chunksize: Optional[int] = None,
        sort: Optional[bool] = True,
        name: Optional[str] = None,
    ):
        """
        Creates a document dataset from a pandas data frame.
        For more information on the arguments see Dask's from_pandas documentation
        https://docs.dask.org/en/stable/generated/dask.dataframe.from_pandas.html
        Args:
            data: A pandas dataframe
        Returns:
            A document dataset with a pandas backend (on the CPU).
        """
        return cls(
            dd.from_pandas(
                data=data,
                npartitions=npartitions,
                chunksize=chunksize,
                sort=sort,
                name=name,
            )
        )

    def to_pandas(self):
        """
        Creates a pandas dataframe from a DocumentDataset
        Returns:
            A pandas dataframe (on the CPU)
        """
        return self.df.to_backend("pandas").compute()


def _read_json_or_parquet(
    input_files: Union[str, List[str]],
26 changes: 26 additions & 0 deletions nemo_curator/services/__init__.py
@@ -0,0 +1,26 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .conversation_formatter import ConversationFormatter
from .model_client import AsyncLLMClient, LLMClient
from .nemo_client import NemoDeployClient
from .openai_client import AsyncOpenAIClient, OpenAIClient

__all__ = [
    "AsyncLLMClient",
    "LLMClient",
    "AsyncOpenAIClient",
    "OpenAIClient",
    "NemoDeployClient",
    "ConversationFormatter",
]
28 changes: 28 additions & 0 deletions nemo_curator/services/conversation_formatter.py
@@ -0,0 +1,28 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from abc import ABC, abstractmethod
from typing import List


class ConversationFormatter(ABC):
    """
    Represents a way of formatting a conversation with an LLM
    such that it can respond appropriately
    """

    @abstractmethod
    def format_conversation(self, conv: List[dict]) -> str:
        raise NotImplementedError(
            "format_conversation must be implemented by subclasses"
        )
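A concrete formatter only has to turn a list of `{"role": ..., "content": ...}` turns into one prompt string. The following is a minimal sketch of what a subclass might look like. `TaggedTurnFormatter` and its `assistant_cue` parameter are hypothetical names invented for illustration, the `role: content` layout is an assumption rather than any model's real chat template, and the abstract base class is re-declared locally so the snippet runs without `nemo_curator` installed.

```python
from abc import ABC, abstractmethod
from typing import List


class ConversationFormatter(ABC):
    """Local stand-in for the interface above, so the sketch runs on its own."""

    @abstractmethod
    def format_conversation(self, conv: List[dict]) -> str:
        raise NotImplementedError


class TaggedTurnFormatter(ConversationFormatter):
    """Hypothetical formatter: one '<role>: <content>' line per turn."""

    def __init__(self, assistant_cue: str = "assistant:") -> None:
        # Trailing cue that asks the model to continue as the assistant.
        self.assistant_cue = assistant_cue

    def format_conversation(self, conv: List[dict]) -> str:
        lines = [f"{turn['role']}: {turn['content']}" for turn in conv]
        lines.append(self.assistant_cue)
        return "\n".join(lines)


conversation = [
    {"role": "user", "content": "Write a sentence"},
    {"role": "assistant", "content": "This is a sentence"},
    {"role": "user", "content": "Write another"},
]
prompt = TaggedTurnFormatter().format_conversation(conversation)
print(prompt)
```

A real deployment would presumably replace the line layout with the prompt template the served model was trained on.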
93 changes: 93 additions & 0 deletions nemo_curator/services/model_client.py
@@ -0,0 +1,93 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from abc import ABC, abstractmethod
from typing import Iterable, List, Optional, Union

from nemo_curator.services.conversation_formatter import ConversationFormatter


class LLMClient(ABC):
    """
    Interface representing a client connecting to an LLM inference server
    and making requests synchronously
    """

    @abstractmethod
    def query_model(
        self,
        *,
        messages: Iterable,
        model: str,
        conversation_formatter: Optional[ConversationFormatter] = None,
        max_tokens: Optional[int] = None,
        n: Optional[int] = 1,
        seed: Optional[int] = None,
        stop: Union[Optional[str], List[str]] = None,
        stream: bool = False,
        temperature: Optional[float] = None,
        top_k: Optional[int] = None,
        top_p: Optional[float] = None,
    ) -> List[str]:
        raise NotImplementedError("Subclass of LLMClient must implement 'query_model'")

    @abstractmethod
    def query_reward_model(
        self,
        *,
        messages: Iterable,
        model: str,
        conversation_formatter: Optional[ConversationFormatter] = None,
    ) -> dict:
        raise NotImplementedError(
            "Subclass of LLMClient must implement 'query_reward_model'"
        )


class AsyncLLMClient(ABC):
    """
    Interface representing a client connecting to an LLM inference server
    and making requests asynchronously
    """

    @abstractmethod
    async def query_model(
        self,
        *,
        messages: Iterable,
        model: str,
        conversation_formatter: Optional[ConversationFormatter] = None,
        max_tokens: Optional[int] = None,
        n: Optional[int] = 1,
        seed: Optional[int] = None,
        stop: Union[Optional[str], List[str]] = None,
        stream: bool = False,
        temperature: Optional[float] = None,
        top_k: Optional[int] = None,
        top_p: Optional[float] = None,
    ) -> List[str]:
        raise NotImplementedError(
            "Subclass of AsyncLLMClient must implement 'query_model'"
        )

    @abstractmethod
    async def query_reward_model(
        self,
        *,
        messages: Iterable,
        model: str,
        conversation_formatter: Optional[ConversationFormatter] = None,
    ) -> dict:
        raise NotImplementedError(
            "Subclass of AsyncLLMClient must implement 'query_reward_model'"
        )
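The asynchronous interface exists so that many requests can be in flight at once. The sketch below shows one way a caller might await several `query_model` calls concurrently with `asyncio.gather`. `EchoAsyncClient` and `generate_all` are hypothetical names used only for illustration; the client fakes a response locally instead of contacting a server, accepts only a subset of the keyword arguments above, and does not subclass `AsyncLLMClient` so that the snippet runs without `nemo_curator` installed.

```python
import asyncio
from typing import Iterable, List, Optional


class EchoAsyncClient:
    """Hypothetical stand-in that mirrors the keyword-only query_model call."""

    async def query_model(
        self,
        *,
        messages: Iterable,
        model: str,
        n: Optional[int] = 1,
        **unused_sampling_args,
    ) -> List[str]:
        # Yield control to the event loop, as a real network call would.
        await asyncio.sleep(0)
        last_turn = list(messages)[-1]["content"]
        return [f"[{model}] {last_turn}" for _ in range(n or 1)]


async def generate_all(client, prompts: List[str], model: str) -> List[str]:
    # Start every request first, then wait for all of them together.
    requests = [
        client.query_model(messages=[{"role": "user", "content": p}], model=model)
        for p in prompts
    ]
    results = await asyncio.gather(*requests)
    # gather() returns results in the order the requests were passed in.
    return [candidates[0] for candidates in results]


responses = asyncio.run(
    generate_all(EchoAsyncClient(), ["first prompt", "second prompt"], "demo-model")
)
print(responses)
```

Because `asyncio.gather` preserves argument order, each response can be matched back to its prompt by position.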
100 changes: 100 additions & 0 deletions nemo_curator/services/nemo_client.py
@@ -0,0 +1,100 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import warnings
from typing import Iterable, List, Optional, Union

from nemo_curator.services.conversation_formatter import ConversationFormatter
from nemo_curator.utils.import_utils import safe_import_from

from .model_client import AsyncLLMClient, LLMClient

NemoQueryLLM = safe_import_from("nemo.deploy.nlp", "NemoQueryLLM")


class NemoDeployClient(LLMClient):
    """
    A wrapper around NemoQueryLLM for querying models in synthetic data generation
    """

    def __init__(self, nemo_deploy: NemoQueryLLM) -> None:
        self.client = nemo_deploy

    def query_model(
        self,
        *,
        messages: Iterable,
        model: str,
        conversation_formatter: Optional[ConversationFormatter] = None,
        max_tokens: Optional[int] = None,
        n: Optional[int] = None,
        seed: Optional[int] = None,
        stop: Union[Optional[str], List[str]] = None,
        stream: bool = False,
        temperature: Optional[float] = None,
        top_k: Optional[int] = None,
        top_p: Optional[float] = None,
    ) -> List[str]:
        if conversation_formatter is None:
            raise ValueError(
                "NemoDeployClient's query_model requires a conversation_formatter"
            )

        prompt = conversation_formatter.format_conversation(messages)
        self.client.model_name = model

        if n is not None:
            warnings.warn("n is not supported in NemoDeployClient")
        if stream:
            warnings.warn("streaming is not supported in NemoDeployClient")

        # Normalize a single stop string to a list
        if isinstance(stop, str):
            stop = [stop]

        response = self.client.query_llm(
            prompts=[prompt],
            max_output_len=max_tokens,
            random_seed=seed,
            stop_words_list=stop,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
        )[0]

        return self._postprocess_response(response, stop)

    @staticmethod
    def _postprocess_response(
        responses: List[str], stop_words: Optional[List[str]]
    ) -> List[str]:
        processed_responses = []
        for response in responses:
            # stop_words is None when the caller did not pass any stop sequence
            for stop in stop_words or []:
                if response.endswith(stop):
                    response = response[: -len(stop)]
            processed_responses.append(response.strip())
        return processed_responses

    def query_reward_model(self, *, messages: Iterable, model: str) -> dict:
        """
        Prompts an LLM Reward model to score a conversation between a user and assistant
        Args:
            messages: The conversation to calculate a score for.
                Should be formatted like:
                    [{"role": "user", "content": "Write a sentence"}, {"role": "assistant", "content": "This is a sentence"}, ...]
            model: The name of the model that should be used to calculate the reward.
                Must be a reward model, cannot be a regular LLM.
        Returns:
            A mapping of score_name -> score
        """
        raise NotImplementedError(
            "Reward model inference is not supported in NeMo Deploy Clients"
        )
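The stop-word handling in `_postprocess_response` can be tried in isolation. The sketch below restates that logic as a standalone function; `strip_stop_words` is a hypothetical name, and the two guards it adds (treating `None` as "no stop words" and skipping empty stop strings, which would otherwise erase the whole response because `response[:-0]` is empty) are assumptions about the desired behavior, not something the commit states.

```python
from typing import List, Optional


def strip_stop_words(
    responses: List[str], stop_words: Optional[List[str]]
) -> List[str]:
    """Remove a trailing stop sequence from each response, then trim whitespace."""
    processed = []
    for response in responses:
        # Assumed guard: None means the caller passed no stop sequence.
        for stop in stop_words or []:
            # Assumed guard: an empty stop string matches everything, so skip it.
            if stop and response.endswith(stop):
                response = response[: -len(stop)]
        processed.append(response.strip())
    return processed


cleaned = strip_stop_words(
    ["Hello there<extra_id_1>", "  No stop word here  "],
    ["<extra_id_1>"],
)
print(cleaned)
```

Only a stop sequence at the very end of a response is removed; occurrences in the middle of the text are left untouched.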