
Add Synthetic Data Generation Module #136

Merged
merged 69 commits into from
Jul 9, 2024
4bb6ddc
Begin implementation on OpenAI client
ryantwolf Jun 26, 2024
850403f
Fix relative import
ryantwolf Jun 26, 2024
e6cac7a
Add temperature
ryantwolf Jun 26, 2024
131b2d6
Modify client interface and begin ultrachat
ryantwolf Jun 26, 2024
fb82737
Change type annotation in openai client
ryantwolf Jun 26, 2024
f3e6309
Make imports easier
ryantwolf Jun 26, 2024
5ad683f
Reformat to match nemotron report
ryantwolf Jun 30, 2024
0d552b4
Add yaml conversion
ryantwolf Jul 1, 2024
87ebfc4
Fix index error
ryantwolf Jul 1, 2024
bb72a68
Add error handling for yaml parsing
ryantwolf Jul 1, 2024
32c7f55
Fix error
ryantwolf Jul 1, 2024
a6d306e
Add additional yaml parsing check
ryantwolf Jul 1, 2024
ece34b5
Add more yaml error handling
ryantwolf Jul 1, 2024
28d3a08
Export conversion error
ryantwolf Jul 1, 2024
8cf295e
Change variable naming
ryantwolf Jul 1, 2024
7fcd719
Make error catching more general
ryantwolf Jul 1, 2024
76ddfda
Refactor list out of nemotron
ryantwolf Jul 1, 2024
2f7a03b
Add prompt helper function
ryantwolf Jul 1, 2024
76c4bdd
Add revisions and writing prompts
ryantwolf Jul 1, 2024
2f15d89
Fix default prompt templates
ryantwolf Jul 1, 2024
cc18dfe
Add closed qa
ryantwolf Jul 1, 2024
d4755c0
Fix prompt
ryantwolf Jul 1, 2024
366fea8
Add math and coding
ryantwolf Jul 1, 2024
f563018
Add problem generation
ryantwolf Jul 1, 2024
294a390
Rename function
ryantwolf Jul 1, 2024
728d585
Add dialogue support
ryantwolf Jul 1, 2024
4c64c3a
Fix mispell
ryantwolf Jul 1, 2024
8db6019
Add two turn generation
ryantwolf Jul 1, 2024
2d13d63
Add reward model as judge
ryantwolf Jul 2, 2024
8336452
Refactor reward query
ryantwolf Jul 2, 2024
87acce0
Add error handling for non-reward models
ryantwolf Jul 2, 2024
fd1f066
Add error handling to sync client
ryantwolf Jul 2, 2024
69c431f
Add open qa pipeline
ryantwolf Jul 2, 2024
2408972
Improve docs and add writing pipeline
ryantwolf Jul 2, 2024
c8c8039
Add closed qa pipeline
ryantwolf Jul 2, 2024
babdb40
Add math pipeline
ryantwolf Jul 2, 2024
c3a9998
Add python pipeline
ryantwolf Jul 2, 2024
48665ee
Add async nemotron generator
ryantwolf Jul 2, 2024
494c141
Fix await with index
ryantwolf Jul 2, 2024
2fb48db
Add seed parameter
ryantwolf Jul 2, 2024
39acac1
Add missing await
ryantwolf Jul 2, 2024
4c888e4
Fix parameter names
ryantwolf Jul 2, 2024
4724d68
Fix subscript await issues
ryantwolf Jul 2, 2024
de27abc
Switch parsing method for reward model
ryantwolf Jul 2, 2024
8daea94
Add initial docs
ryantwolf Jul 2, 2024
6ae83b1
Add nemo deploy client
ryantwolf Jul 5, 2024
7daefb7
Add easy import
ryantwolf Jul 5, 2024
c0509f9
Move conversation formatter
ryantwolf Jul 5, 2024
e964712
Add other file
ryantwolf Jul 5, 2024
e500814
Update nemotron import
ryantwolf Jul 5, 2024
2b4d3ff
Update model client import
ryantwolf Jul 5, 2024
7acbee9
Remove model in query call
ryantwolf Jul 5, 2024
06b7310
Add extra index
ryantwolf Jul 5, 2024
f05b13a
Fix response indexing
ryantwolf Jul 5, 2024
0efc808
Add top k
ryantwolf Jul 5, 2024
c8d1419
Remove extras
ryantwolf Jul 5, 2024
2d11a8c
Add safe import for nemo deploy
ryantwolf Jul 5, 2024
20afd89
Add pandas conversions
ryantwolf Jul 5, 2024
2987c9a
Add partition default
ryantwolf Jul 5, 2024
3f8dcc8
Add no format
ryantwolf Jul 5, 2024
0926cbd
Move no format location
ryantwolf Jul 5, 2024
e2beb5b
Use top_k in nemo client
ryantwolf Jul 5, 2024
b918c14
Address vibhu's review
ryantwolf Jul 9, 2024
b79ce6b
Add logging import
ryantwolf Jul 9, 2024
8957e12
Fix import
ryantwolf Jul 9, 2024
1400a32
Fix tqdm
ryantwolf Jul 9, 2024
0926d6e
Add missing awaits
ryantwolf Jul 9, 2024
fbe9292
Standardize names
ryantwolf Jul 9, 2024
8f66396
Address Ayush nit
ryantwolf Jul 9, 2024
3 changes: 3 additions & 0 deletions docs/user-guide/index.rst
@@ -18,6 +18,9 @@
:ref:`GPU Accelerated Exact and Fuzzy Deduplication <data-curator-gpu-deduplication>`
Both exact and fuzzy deduplication functionalities are supported in NeMo Curator and accelerated using RAPIDS cuDF.

:ref:`Synthetic Data Generation <data-curator-syntheticdata>`
Synthetic data generation tools and example pipelines are available within NeMo Curator.

:ref:`Downstream Task Decontamination <data-curator-downstream>`
After training, large language models are usually evaluated by their performance on downstream tasks consisting of unseen test data. When dealing with large datasets, there is a potential for leakage of this test data into the model’s training dataset. NeMo Curator allows you to remove sections of documents in your dataset that are present in downstream tasks.

22 changes: 22 additions & 0 deletions docs/user-guide/syntheticdata.rst
@@ -0,0 +1,22 @@

.. _data-curator-syntheticdata:

======================================
Synthetic Data Generation
======================================
--------------------------------------
Background
--------------------------------------
Synthetic data generation has become increasingly useful in large language model training.
It is used in pretraining, fine-tuning, and evaluation.
Synthetically generated data can be useful for adapting an LLM to low-resource languages or domains, or for performing knowledge distillation from other models, among other purposes.
There are a variety of ways to construct synthetic data generation pipelines, with numerous LLM and classical filters.

NeMo Curator has a simple, easy-to-use set of tools that allow you to use prebuilt synthetic generation pipelines or build your own.
Any model inference service that uses the OpenAI API is compatible with the synthetic data generation module, allowing you to generate your data from any model.
NeMo Curator has prebuilt synthetic data generation pipelines for supervised fine-tuning (SFT) and preference data that were used to generate data for the training of `Nemotron-4 340B <https://research.nvidia.com/publication/2024-06_nemotron-4-340b>`_.
You can also easily interweave filtering and deduplication steps in your synthetic data pipeline with the other modules in NeMo Curator.

-----------------------------------------
Usage
-----------------------------------------
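A minimal usage sketch for the clients this PR adds (not the PR's own example): the base URL, API key, and model name below are placeholders, and running it requires a live OpenAI-API-compatible inference endpoint, so it is illustrative only. The keyword arguments follow the `LLMClient.query_model` signature defined in `nemo_curator/services/model_client.py`.

```python
from openai import OpenAI

from nemo_curator import OpenAIClient

# Placeholder endpoint and key; any OpenAI-API-compatible service works
openai_client = OpenAI(
    base_url="https://example-inference-endpoint/v1",  # hypothetical endpoint
    api_key="<your API key>",
)
client = OpenAIClient(openai_client)

responses = client.query_model(
    messages=[
        {"role": "user", "content": "Write a sentence about data curation."}
    ],
    model="<model name>",  # placeholder model identifier
    temperature=0.7,
    top_p=0.9,
    max_tokens=256,
)
print(responses[0])
```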
7 changes: 7 additions & 0 deletions nemo_curator/__init__.py
@@ -27,6 +27,13 @@


from .modules import *
from .services import (
AsyncLLMClient,
AsyncOpenAIClient,
LLMClient,
NemoDeployClient,
OpenAIClient,
)
from .utils.distributed_utils import get_client

# Dask will automatically convert the list score type
40 changes: 39 additions & 1 deletion nemo_curator/datasets/doc_dataset.py
@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

from typing import List, Union
from typing import List, Optional, Union

import dask.dataframe as dd

@@ -130,6 +130,44 @@ def to_pickle(
):
raise NotImplementedError("DocumentDataset does not support to_pickle yet")

@classmethod
def from_pandas(
cls,
data,
npartitions: Optional[int] = 1,
chunksize: Optional[int] = None,
sort: Optional[bool] = True,
name: Optional[str] = None,
):
"""
Creates a document dataset from a pandas data frame.
For more information on the arguments see Dask's from_pandas documentation
https://docs.dask.org/en/stable/generated/dask.dataframe.from_pandas.html

Args:
data: A pandas dataframe.
npartitions: The number of partitions for the Dask dataframe.
chunksize: The number of rows per partition (mutually exclusive with npartitions).
sort: Whether to sort the resulting dataframe by its index.
name: An optional name for the resulting Dask collection.
Returns:
A document dataset with a pandas backend (on the CPU).
"""
return cls(
dd.from_pandas(
data=data,
npartitions=npartitions,
chunksize=chunksize,
sort=sort,
name=name,
)
)

def to_pandas(self):
"""
Creates a pandas dataframe from a DocumentDataset

Returns:
A pandas dataframe (on the CPU)
"""
return self.df.compute()


def _read_json_or_parquet(
input_files: Union[str, List[str]],
26 changes: 26 additions & 0 deletions nemo_curator/services/__init__.py
@@ -0,0 +1,26 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .conversation_formatter import ConversationFormatter
from .model_client import AsyncLLMClient, LLMClient
from .nemo_client import NemoDeployClient
from .openai_client import AsyncOpenAIClient, OpenAIClient

__all__ = [
"AsyncLLMClient",
"LLMClient",
"AsyncOpenAIClient",
"OpenAIClient",
"NemoDeployClient",
"ConversationFormatter",
]
28 changes: 28 additions & 0 deletions nemo_curator/services/conversation_formatter.py
@@ -0,0 +1,28 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from abc import ABC, abstractmethod
from typing import List


class ConversationFormatter(ABC):
"""
Represents a way of formatting a conversation with an LLM
such that it can respond appropriately
"""

@abstractmethod
def format_conversation(self, conv: List[dict]) -> str:
raise NotImplementedError(
"format_conversation must be implemented by subclasses"
)
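A concrete subclass makes the contract clearer. The sketch below re-declares the abstract base so it is self-contained, and `SimpleTagFormatter` is a hypothetical example, not a formatter shipped in this PR:

```python
from abc import ABC, abstractmethod
from typing import List


class ConversationFormatter(ABC):
    """Mirrors the abstract base above, re-declared for a standalone sketch."""

    @abstractmethod
    def format_conversation(self, conv: List[dict]) -> str:
        raise NotImplementedError


class SimpleTagFormatter(ConversationFormatter):
    """Hypothetical formatter: wraps each turn in role tags and
    appends an open assistant tag for the model to complete."""

    def format_conversation(self, conv: List[dict]) -> str:
        parts = [f"<{turn['role']}>{turn['content']}</{turn['role']}>" for turn in conv]
        return "\n".join(parts) + "\n<assistant>"


fmt = SimpleTagFormatter()
prompt = fmt.format_conversation([{"role": "user", "content": "Hi"}])
# prompt == "<user>Hi</user>\n<assistant>"
```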
91 changes: 91 additions & 0 deletions nemo_curator/services/model_client.py
@@ -0,0 +1,91 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from abc import ABC, abstractmethod
from typing import Iterable, List, Optional, Union

from nemo_curator.services.conversation_formatter import ConversationFormatter


class LLMClient(ABC):
"""
Interface representing a client connecting to an LLM inference server
and making requests synchronously
"""

@abstractmethod
def query_model(
self,
*,
messages: Iterable,
model: str,
conversation_formatter: Optional[ConversationFormatter] = None,
max_tokens: Optional[int] = None,
n: Optional[int] = 1,
seed: Optional[int] = None,
stop: Union[Optional[str], List[str]] = None,
temperature: Optional[float] = None,
top_k: Optional[int] = None,
top_p: Optional[float] = None,
) -> List[str]:
raise NotImplementedError("Subclass of LLMClient must implement 'query_model'")

@abstractmethod
def query_reward_model(
self,
*,
messages: Iterable,
model: str,
conversation_formatter: Optional[ConversationFormatter] = None,
) -> dict:
raise NotImplementedError(
"Subclass of LLMClient must implement 'query_reward_model'"
)


class AsyncLLMClient(ABC):
"""
Interface representing a client connecting to an LLM inference server
and making requests asynchronously
"""

@abstractmethod
async def query_model(
self,
*,
messages: Iterable,
model: str,
conversation_formatter: Optional[ConversationFormatter] = None,
max_tokens: Optional[int] = None,
n: Optional[int] = 1,
seed: Optional[int] = None,
stop: Union[Optional[str], List[str]] = None,
temperature: Optional[float] = None,
top_k: Optional[int] = None,
top_p: Optional[float] = None,
) -> List[str]:
raise NotImplementedError(
"Subclass of AsyncLLMClient must implement 'query_model'"
)

@abstractmethod
async def query_reward_model(
self,
*,
messages: Iterable,
model: str,
conversation_formatter: Optional[ConversationFormatter] = None,
) -> dict:
raise NotImplementedError(
"Subclass of AsyncLLMClient must implement 'query_reward_model'"
)
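To illustrate the keyword-only calling convention these interfaces enforce, here is a hypothetical canned-response client shaped like `LLMClient.query_model` (a standalone sketch, not part of the PR):

```python
from typing import Iterable, List, Optional


class EchoClient:
    """Hypothetical stand-in matching the LLMClient-style interface:
    all arguments are keyword-only, and a list of generations is returned."""

    def query_model(
        self,
        *,
        messages: Iterable,
        model: str,
        n: Optional[int] = 1,
        **kwargs,
    ) -> List[str]:
        # Echo the last user turn back, once per requested generation
        last = list(messages)[-1]["content"]
        return [f"[{model}] {last}"] * (n or 1)


client = EchoClient()
out = client.query_model(
    messages=[{"role": "user", "content": "ping"}], model="test-model", n=2
)
# out == ["[test-model] ping", "[test-model] ping"]
```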
97 changes: 97 additions & 0 deletions nemo_curator/services/nemo_client.py
@@ -0,0 +1,97 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import warnings
from typing import Iterable, List, Optional, Union

from nemo_curator.services.conversation_formatter import ConversationFormatter
from nemo_curator.utils.import_utils import safe_import_from

from .model_client import AsyncLLMClient, LLMClient

NemoQueryLLM = safe_import_from("nemo.deploy.nlp", "NemoQueryLLM")


class NemoDeployClient(LLMClient):
"""
A wrapper around NemoQueryLLM for querying models in synthetic data generation
"""

def __init__(self, nemo_deploy: NemoQueryLLM) -> None:
self.client = nemo_deploy

def query_model(
self,
*,
messages: Iterable,
model: str,
conversation_formatter: Optional[ConversationFormatter] = None,
max_tokens: Optional[int] = None,
n: Optional[int] = None,
seed: Optional[int] = None,
stop: Union[Optional[str], List[str]] = None,
temperature: Optional[float] = None,
top_k: Optional[int] = None,
top_p: Optional[float] = None,
) -> List[str]:
if conversation_formatter is None:
raise ValueError(
"NemoDeployClient's query_model requires a conversation_formatter"
)

prompt = conversation_formatter.format_conversation(messages)
self.client.model_name = model

if n is not None:
warnings.warn("n is not supported in NemoDeployClient")

if isinstance(stop, str):
stop = [stop]

response = self.client.query_llm(
prompts=[prompt],
max_output_len=max_tokens,
random_seed=seed,
stop_words_list=stop,
temperature=temperature,
top_p=top_p,
top_k=top_k,
)[0]

return self._postprocess_response(response, stop)

@staticmethod
def _postprocess_response(responses: List[str], stop_words: List[str]) -> List[str]:
processed_responses = []
for response in responses:
# stop_words may be None when the caller passed no stop sequences
for stop in stop_words or []:
if response.endswith(stop):
response = response[: -len(stop)]
processed_responses.append(response.strip())
return processed_responses
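The stop-word trimming above can be exercised in isolation; this standalone sketch mirrors `_postprocess_response` (a hypothetical helper, not part of the PR):

```python
from typing import List, Optional


def trim_stop_words(responses: List[str], stop_words: Optional[List[str]]) -> List[str]:
    # Strip a trailing stop word, if present, then trim surrounding whitespace
    processed = []
    for response in responses:
        for stop in stop_words or []:
            if response.endswith(stop):
                response = response[: -len(stop)]
        processed.append(response.strip())
    return processed


out = trim_stop_words(["A curated sentence.</s>", "No stop word here "], ["</s>"])
# out == ["A curated sentence.", "No stop word here"]
```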

def query_reward_model(self, *, messages: Iterable, model: str) -> dict:
"""
Prompts an LLM Reward model to score a conversation between a user and assistant
Args:
messages: The conversation to calculate a score for.
Should be formatted like:
[{"role": "user", "content": "Write a sentence"}, {"role": "assistant", "content": "This is a sentence"}, ...]
model: The name of the model that should be used to calculate the reward.
Must be a reward model, cannot be a regular LLM.
Returns:
A mapping of score_name -> score
"""
raise NotImplementedError(
"Reward model inference is not supported in NeMo Deploy Clients"
)