Add Synthetic Data Generation Module (#136)
* Begin implementation on OpenAI client
* Fix relative import
* Add temperature
* Modify client interface and begin ultrachat
* Change type annotation in openai client
* Make imports easier
* Reformat to match nemotron report
* Add yaml conversion
* Fix index error
* Add error handling for yaml parsing
* Fix error
* Add additional yaml parsing check
* Add more yaml error handling
* Export conversion error
* Change variable naming
* Make error catching more general
* Refactor list out of nemotron
* Add prompt helper function
* Add revisions and writing prompts
* Fix default prompt templates
* Add closed qa
* Fix prompt
* Add math and coding
* Add problem generation
* Rename function
* Add dialogue support
* Fix mispell
* Add two turn generation
* Add reward model as judge
* Refactor reward query
* Add error handling for non-reward models
* Add error handling to sync client
* Add open qa pipeline
* Improve docs and add writing pipeline
* Add closed qa pipeline
* Add math pipeline
* Add python pipeline
* Add async nemotron generator
* Fix await with index
* Add seed parameter
* Add missing await
* Fix parameter names
* Fix subscript await issues
* Switch parsing method for reward model
* Add initial docs
* Add nemo deploy client
* Add easy import
* Move conversation formatter
* Add other file
* Update nemotron import
* Update model client import
* Remove model in query call
* Add extra index
* Fix response indexing
* Add top k
* Remove extras
* Add safe import for nemo deploy
* Add pandas conversions
* Add partition default
* Add no format
* Move no format location
* Use top_k in nemo client
* Address vibhu's review
* Add logging import
* Fix import
* Fix tqdm
* Add missing awaits
* Standardize names
* Address Ayush nit

---------

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
Showing 18 changed files with 3,883 additions and 2 deletions.
@@ -0,0 +1,18 @@

.. _data-curator-syntheticdata:

======================================
Synthetic Data Generation
======================================
--------------------------------------
Background
--------------------------------------
Synthetic data generation has become increasingly useful in large language model training.
It is used in pretraining, fine-tuning, and evaluation.
Synthetically generated data can be useful for adapting an LLM to low-resource languages or domains, or for performing knowledge distillation from other models, among other purposes.
Synthetic data generation pipelines can be constructed in a variety of ways, combining numerous LLM-based and classical filters.

NeMo Curator has a simple, easy-to-use set of tools that allow you to use prebuilt synthetic generation pipelines or build your own.
Any model inference service that exposes an OpenAI-compatible API works with the synthetic data generation module, so you can generate data from any model served that way.
NeMo Curator has prebuilt synthetic data generation pipelines for supervised fine-tuning (SFT) and preference data that were used to generate data for the training of `Nemotron-4 340B <https://research.nvidia.com/publication/2024-06_nemotron-4-340b>`_.
You can also interleave filtering and deduplication steps in your synthetic data pipeline using the other modules in NeMo Curator.
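The last point in the documentation above, interleaving generation with filtering and deduplication, can be illustrated with a toy sketch. Nothing here uses NeMo Curator: `fake_generate` and `run_pipeline` are hypothetical names, the "generator" is a plain function standing in for LLM calls, and the filter and deduplication steps are deliberately simplistic.

```python
import hashlib
from typing import Callable, Iterable, List


def fake_generate(topics: Iterable[str]) -> List[str]:
    # Hypothetical stand-in for LLM calls: one templated question per topic.
    return [f"Question about {topic.strip().lower()}?" for topic in topics]


def run_pipeline(
    topics: Iterable[str],
    generate: Callable[[Iterable[str]], List[str]],
    min_chars: int = 18,
) -> List[str]:
    seen = set()
    kept = []
    for text in generate(topics):
        # Classical filter: drop samples that are too short.
        if len(text) < min_chars:
            continue
        # Exact deduplication on a hash of the text.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    return kept


samples = run_pipeline(["Math", "math ", "", "Art"], fake_generate)
```

In this run the empty topic yields a sample that is too short and is filtered out, and the second "math" sample is dropped as an exact duplicate. A real pipeline would presumably swap in a model client for the generator and the library's own filtering and deduplication modules for the two checks.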
@@ -0,0 +1,26 @@
# Copyright (c) 2024, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .conversation_formatter import ConversationFormatter
from .model_client import AsyncLLMClient, LLMClient
from .nemo_client import NemoDeployClient
from .openai_client import AsyncOpenAIClient, OpenAIClient

__all__ = [
    "AsyncLLMClient",
    "LLMClient",
    "AsyncOpenAIClient",
    "OpenAIClient",
    "NemoDeployClient",
    "ConversationFormatter",
]
@@ -0,0 +1,28 @@
# Copyright (c) 2024, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from abc import ABC, abstractmethod
from typing import List


class ConversationFormatter(ABC):
    """
    Represents a way of formatting a conversation with an LLM
    such that it can respond appropriately
    """

    @abstractmethod
    def format_conversation(self, conv: List[dict]) -> str:
        raise NotImplementedError(
            "format_conversation must be implemented by subclasses"
        )
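As a minimal sketch of how the abstract class above might be subclassed: `TaggedTurnFormatter` and its "Role: content" layout are hypothetical and are not a format used by any particular model. The base class is redefined locally so the snippet runs without `nemo_curator` installed.

```python
from abc import ABC, abstractmethod
from typing import List


class ConversationFormatter(ABC):
    """Local stand-in mirroring the interface added in this commit."""

    @abstractmethod
    def format_conversation(self, conv: List[dict]) -> str:
        raise NotImplementedError(
            "format_conversation must be implemented by subclasses"
        )


class TaggedTurnFormatter(ConversationFormatter):
    """Hypothetical formatter that writes one 'Role: content' line per turn."""

    def format_conversation(self, conv: List[dict]) -> str:
        lines = [f"{turn['role'].capitalize()}: {turn['content']}" for turn in conv]
        # Finish with an open assistant tag so the model continues from there.
        lines.append("Assistant:")
        return "\n".join(lines)


conversation = [
    {"role": "user", "content": "Write a sentence"},
    {"role": "assistant", "content": "This is a sentence"},
    {"role": "user", "content": "Write another"},
]
prompt = TaggedTurnFormatter().format_conversation(conversation)
```

A formatter like this is what a client that sends raw prompt strings (rather than chat messages) would need in order to turn a message list into a single prompt.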
@@ -0,0 +1,93 @@
# Copyright (c) 2024, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from abc import ABC, abstractmethod
from typing import Iterable, List, Optional, Union

from nemo_curator.services.conversation_formatter import ConversationFormatter


class LLMClient(ABC):
    """
    Interface representing a client connecting to an LLM inference server
    and making requests synchronously
    """

    @abstractmethod
    def query_model(
        self,
        *,
        messages: Iterable,
        model: str,
        conversation_formatter: Optional[ConversationFormatter] = None,
        max_tokens: Optional[int] = None,
        n: Optional[int] = 1,
        seed: Optional[int] = None,
        stop: Union[Optional[str], List[str]] = None,
        stream: bool = False,
        temperature: Optional[float] = None,
        top_k: Optional[int] = None,
        top_p: Optional[float] = None,
    ) -> List[str]:
        raise NotImplementedError("Subclass of LLMClient must implement 'query_model'")

    @abstractmethod
    def query_reward_model(
        self,
        *,
        messages: Iterable,
        model: str,
        conversation_formatter: Optional[ConversationFormatter] = None,
    ) -> dict:
        raise NotImplementedError(
            "Subclass of LLMClient must implement 'query_reward_model'"
        )


class AsyncLLMClient(ABC):
    """
    Interface representing a client connecting to an LLM inference server
    and making requests asynchronously
    """

    @abstractmethod
    async def query_model(
        self,
        *,
        messages: Iterable,
        model: str,
        conversation_formatter: Optional[ConversationFormatter] = None,
        max_tokens: Optional[int] = None,
        n: Optional[int] = 1,
        seed: Optional[int] = None,
        stop: Union[Optional[str], List[str]] = None,
        stream: bool = False,
        temperature: Optional[float] = None,
        top_k: Optional[int] = None,
        top_p: Optional[float] = None,
    ) -> List[str]:
        raise NotImplementedError(
            "Subclass of AsyncLLMClient must implement 'query_model'"
        )

    @abstractmethod
    async def query_reward_model(
        self,
        *,
        messages: Iterable,
        model: str,
        conversation_formatter: Optional[ConversationFormatter] = None,
    ) -> dict:
        raise NotImplementedError(
            "Subclass of AsyncLLMClient must implement 'query_reward_model'"
        )
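To show what implementing this interface could look like, here is a small offline sketch. `EchoClient` is hypothetical and only echoes input; the `LLMClient` base below is a trimmed local stand-in (most sampling parameters are folded into `**kwargs`) so the snippet runs without `nemo_curator`. The point it illustrates is the bare `*` in the signatures, which makes every argument keyword-only.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List, Optional


class LLMClient(ABC):
    """Trimmed local stand-in for the synchronous interface in this commit."""

    @abstractmethod
    def query_model(
        self, *, messages: Iterable, model: str, n: Optional[int] = 1, **kwargs
    ) -> List[str]:
        raise NotImplementedError

    @abstractmethod
    def query_reward_model(self, *, messages: Iterable, model: str) -> dict:
        raise NotImplementedError


class EchoClient(LLMClient):
    """Hypothetical offline client that echoes the last user message."""

    def query_model(self, *, messages, model, n=1, **kwargs):
        last_user = [m["content"] for m in messages if m["role"] == "user"][-1]
        # Return n copies, mimicking n sampled completions.
        return [f"[{model}] {last_user}" for _ in range(n or 1)]

    def query_reward_model(self, *, messages, model):
        # Toy "reward": the character length of the final message.
        return {"length": float(len(list(messages)[-1]["content"]))}


client = EchoClient()
chat = [{"role": "user", "content": "Hello"}]
responses = client.query_model(messages=chat, model="demo", n=2)
scores = client.query_reward_model(messages=chat, model="demo")
```

Calling `client.query_model(chat, "demo")` with positional arguments raises `TypeError`, which is the effect of the keyword-only marker.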
@@ -0,0 +1,100 @@
# Copyright (c) 2024, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import warnings
from typing import Iterable, List, Optional, Union

from nemo_curator.services.conversation_formatter import ConversationFormatter
from nemo_curator.utils.import_utils import safe_import_from

from .model_client import AsyncLLMClient, LLMClient

NemoQueryLLM = safe_import_from("nemo.deploy.nlp", "NemoQueryLLM")


class NemoDeployClient(LLMClient):
    """
    A wrapper around NemoQueryLLM for querying models in synthetic data generation
    """

    def __init__(self, nemo_deploy: NemoQueryLLM) -> None:
        self.client = nemo_deploy

    def query_model(
        self,
        *,
        messages: Iterable,
        model: str,
        conversation_formatter: Optional[ConversationFormatter] = None,
        max_tokens: Optional[int] = None,
        n: Optional[int] = None,
        seed: Optional[int] = None,
        stop: Union[Optional[str], List[str]] = None,
        stream: bool = False,
        temperature: Optional[float] = None,
        top_k: Optional[int] = None,
        top_p: Optional[float] = None,
    ) -> List[str]:
        if conversation_formatter is None:
            raise ValueError(
                "NemoDeployClient's query_model requires a conversation_formatter"
            )

        prompt = conversation_formatter.format_conversation(messages)
        self.client.model_name = model

        if n is not None:
            warnings.warn("n is not supported in NemoDeployClient")
        if stream:
            warnings.warn("streaming is not supported in NemoDeployClient")

        if isinstance(stop, str):
            stop = [stop]

        response = self.client.query_llm(
            prompts=[prompt],
            max_output_len=max_tokens,
            random_seed=seed,
            stop_words_list=stop,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
        )[0]

        return self._postprocess_response(response, stop)

    @staticmethod
    def _postprocess_response(responses: List[str], stop_words: Optional[List[str]]) -> List[str]:
        processed_responses = []
        for response in responses:
            for stop in stop_words or []:  # stop_words is None when no stop was passed
                if response.endswith(stop):
                    response = response[: -len(stop)]
            processed_responses.append(response.strip())
        return processed_responses

    def query_reward_model(self, *, messages: Iterable, model: str) -> dict:
        """
        Prompts an LLM Reward model to score a conversation between a user and assistant
        Args:
            messages: The conversation to calculate a score for.
                Should be formatted like:
                    [{"role": "user", "content": "Write a sentence"}, {"role": "assistant", "content": "This is a sentence"}, ...]
            model: The name of the model that should be used to calculate the reward.
                Must be a reward model, cannot be a regular LLM.
        Returns:
            A mapping of score_name -> score
        """
        raise NotImplementedError(
            "Reward model inference is not supported in NeMo Deploy Clients"
        )
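The stop-word trimming in `_postprocess_response` can be exercised on its own. The function below is a standalone sketch of that step, not the method itself: `strip_stop_words` is a hypothetical name, and it adds two guards that are assumptions on my part, one for `stop_words` being `None` and one for an empty stop string (with an empty string, `response[: -len(stop)]` becomes `response[:0]` and would erase the whole response).

```python
from typing import List, Optional


def strip_stop_words(
    responses: List[str], stop_words: Optional[List[str]]
) -> List[str]:
    processed = []
    for response in responses:
        for stop in stop_words or []:
            # Skip empty stop strings: response[:-0] would be the empty string.
            if stop and response.endswith(stop):
                response = response[: -len(stop)]
        # Trim surrounding whitespace left over after removing the stop word.
        processed.append(response.strip())
    return processed


cleaned = strip_stop_words(["Hello world <stop>", "  Hi  "], ["<stop>"])
```

Here `"<stop>"` is an arbitrary illustrative token. Note that only a trailing stop word is removed; an occurrence in the middle of a response is left untouched, which matches the `endswith` check in the diff.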