Add Synthetic Data Generation Module #136

Merged
ryantwolf merged 69 commits into main from rywolf/synth-data
Jul 9, 2024

Conversation

Collaborator

@ryantwolf ryantwolf commented Jul 2, 2024

Description

Adds a suite of tools for interacting with LLM services. These LLM services are then used to build synthetic data generation tools and example pipelines following the Nemotron 340B Technical Report. The prompt templates used in the report are supplied as defaults throughout the code.

Usage

OpenAI API

import asyncio

from nemo_curator import AsyncOpenAIClient
from nemo_curator.synthetic import (
    AsyncNemotronGenerator,
    NemotronGenerator,  # synchronous counterpart, unused in this example
)
from openai import OpenAI, AsyncOpenAI  # OpenAI is the synchronous client, unused here

async def demo():
  # Point the OpenAI client at NVIDIA's hosted endpoint.
  openai_client = AsyncOpenAI(
      base_url="https://integrate.api.nvidia.com/v1",
      api_key="",  # fill in your API key
  )
  client = AsyncOpenAIClient(openai_client)
  generator = AsyncNemotronGenerator(client)
  
  model = "nvidia/nemotron-4-340b-instruct"
  model_kwargs = {
      "top_p": 0.7,
      "max_tokens": 1024,
      "seed": 1234,
  }
  
  # Generate open Q&A prompts ("openlines"): macro topics, then subtopics,
  # then openlines, then revisions of those openlines.
  openlines = await generator.run_open_qa_pipeline(
      n_macro_topics=5,
      n_subtopics=3,
      n_openlines=3,
      n_revisions=2,
      model=model,
      base_model_kwargs=model_kwargs,
      conversion_model_kwargs=model_kwargs,
      ignore_conversion_failure=True,
  )
  
  # Build a multi-turn dialogue starting from the first openline.
  dialogue = await generator.generate_dialogue(
      openline=openlines[0],
      user_model=model,
      assistant_model=model,
      user_model_kwargs=model_kwargs,
      assistant_model_kwargs=model_kwargs,
  )

  print(dialogue)

asyncio.run(demo())
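The `ignore_conversion_failure=True` argument above suggests that responses which cannot be converted into a list are skipped rather than aborting the pipeline. The following is only a minimal sketch of that idea, not the module's implementation: `ConversionError`, `parse_bulleted_list`, and `convert_all` are hypothetical names, and a plain bulleted-line parser stands in for the YAML conversion mentioned in the commit history.

```python
class ConversionError(Exception):
    """Hypothetical error: a response could not be parsed into a list."""


def parse_bulleted_list(response):
    """Parse lines of the form '- item' into a list of strings."""
    items = [
        line.strip()[2:].strip()
        for line in response.splitlines()
        if line.strip().startswith("- ")
    ]
    if not items:
        raise ConversionError(f"No list items found in: {response!r}")
    return items


def convert_all(responses, ignore_conversion_failure=False):
    """Convert every response, optionally discarding the ones that fail."""
    converted = []
    for response in responses:
        try:
            converted.extend(parse_bulleted_list(response))
        except ConversionError:
            if not ignore_conversion_failure:
                raise
            # Assumption: a failed response is dropped and processing continues.
    return converted


responses = ["- Physics\n- Chemistry", "Sorry, I cannot help with that.", "- Biology"]
topics = convert_all(responses, ignore_conversion_failure=True)
print(topics)
```

With the flag left at `False`, the same call would raise on the second response, which is the stricter behavior to prefer while debugging prompt templates.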

NeMo Deploy

from nemo_curator import NemoDeployClient
from nemo_curator.synthetic import (
    NemotronFormatter,
    NemotronGenerator,
)
from nemo.deploy.nlp import NemoQueryLLM

def demo():
  model = "local_nemotron"
  model_kwargs = {
      "top_p": 0.7,
      "max_tokens": 1024,
      "seed": 1234,
      "conversation_formatter": NemotronFormatter(),
      "stop": ['<extra_id_1>'],
  }
  
  nemo_client = NemoQueryLLM(url="localhost:8000", model_name=model)
  client = NemoDeployClient(nemo_client)
  # Assumption: NemoDeployClient is a synchronous client, so it is paired
  # with the synchronous NemotronGenerator and called without await.
  generator = NemotronGenerator(client)
  
  openlines = generator.run_open_qa_pipeline(
      n_macro_topics=5,
      n_subtopics=3,
      n_openlines=3,
      n_revisions=2,
      model=model,
      base_model_kwargs=model_kwargs,
      conversion_model_kwargs=model_kwargs,
      ignore_conversion_failure=True,
  )

  print(openlines)

demo()

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
Collaborator

@VibhuJawa VibhuJawa left a comment

Thanks for working on this, @ryantwolf. This is very important functionality we are adding, and I'm very excited for it.

My major concern currently is that there is no way to rate limit the number of requests we are sending. Everything else is mostly nits.
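One common way to address a concern like this in asyncio code is to cap the number of requests in flight with a semaphore. The sketch below is illustrative only and is not taken from this PR: `bounded_gather` and `fake_request` are hypothetical names, and `fake_request` stands in for a real LLM call. Note that it limits concurrency, not requests per second.

```python
import asyncio


async def bounded_gather(coros, max_concurrent):
    """Await all coroutines with at most `max_concurrent` running at once.

    Results are returned in the same order as the input coroutines.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(coro):
        # Each coroutine only starts executing once a slot is free.
        async with semaphore:
            return await coro

    return await asyncio.gather(*(run_one(c) for c in coros))


# Counters used only to observe peak concurrency in this demonstration.
in_flight = 0
peak_in_flight = 0


async def fake_request(i):
    """Stand-in for an LLM request; sleeps briefly and returns a value."""
    global in_flight, peak_in_flight
    in_flight += 1
    peak_in_flight = max(peak_in_flight, in_flight)
    await asyncio.sleep(0.01)
    in_flight -= 1
    return i * 2


async def main():
    requests = [fake_request(i) for i in range(10)]
    return await bounded_gather(requests, max_concurrent=3)


results = asyncio.run(main())
print(results, "peak concurrency:", peak_in_flight)
```

A cap like this could live either in the client wrapper or in the generator; placing it in the client would apply the limit to every pipeline that shares that client.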

nemo_curator/datasets/doc_dataset.py (outdated, resolved)
nemo_curator/services/openai_client.py (resolved)
nemo_curator/services/openai_client.py (resolved)
nemo_curator/synthetic/async_nemotron.py (outdated, resolved)
nemo_curator/synthetic/async_nemotron.py (outdated, resolved)
nemo_curator/synthetic/async_nemotron.py (outdated, resolved)
nemo_curator/synthetic/async_nemotron.py (outdated, resolved)
nemo_curator/synthetic/async_nemotron.py (outdated, resolved)
nemo_curator/synthetic/prompts.py (resolved)
nemo_curator/synthetic/nemotron.py (outdated, resolved)
Collaborator

@ayushdg ayushdg left a comment

Minor nit, but at a high level this looks good to me! Thanks a lot for this effort.

docs/user-guide/syntheticdata.rst (outdated, resolved)
@ryantwolf ryantwolf requested a review from VibhuJawa July 9, 2024 01:21
Collaborator

@VibhuJawa VibhuJawa left a comment

LGTM, thanks for working on this Ryan.

@ryantwolf ryantwolf merged commit f572314 into main Jul 9, 2024
3 checks passed
@ryantwolf ryantwolf deleted the rywolf/synth-data branch July 9, 2024 02:05
sarahyurick pushed a commit to sarahyurick/NeMo-Curator that referenced this pull request Jul 23, 2024
* Begin implementation on OpenAI client
* Fix relative import
* Add temperature
* Modify client interface and begin ultrachat
* Change type annotation in openai client
* Make imports easier
* Reformat to match nemotron report
* Add yaml conversion
* Fix index error
* Add error handling for yaml parsing
* Fix error
* Add additional yaml parsing check
* Add more yaml error handling
* Export conversion error
* Change variable naming
* Make error catching more general
* Refactor list out of nemotron
* Add prompt helper function
* Add revisions and writing prompts
* Fix default prompt templates
* Add closed qa
* Fix prompt
* Add math and coding
* Add problem generation
* Rename function
* Add dialogue support
* Fix mispell
* Add two turn generation
* Add reward model as judge
* Refactor reward query
* Add error handling for non-reward models
* Add error handling to sync client
* Add open qa pipeline
* Improve docs and add writing pipeline
* Add closed qa pipeline
* Add math pipeline
* Add python pipeline
* Add async nemotron generator
* Fix await with index
* Add seed parameter
* Add missing await
* Fix parameter names
* Fix subscript await issues
* Switch parsing method for reward model
* Add initial docs
* Add nemo deploy client
* Add easy import
* Move conversation formatter
* Add other file
* Update nemotron import
* Update model client import
* Remove model in query call
* Add extra index
* Fix response indexing
* Add top k
* Remove extras
* Add safe import for nemo deploy
* Add pandas conversions
* Add partition default
* Add no format
* Move no format location
* Use top_k in nemo client
* Address vibhu's review
* Add logging import
* Fix import
* Fix tqdm
* Add missing awaits
* Standardize names
* Address Ayush nit

---------

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
3 participants