🥞 RewardLM

Reward a Language Model with pancakes 🥞

Usage

This repository gathers three main modules. They share a common interface, allowing any generative model to be trained with two main techniques: Reinforcement Learning with PPO (🥞 RLAIF) and the more classical 👨🏼‍🏫 fine-tuning with PEFT techniques. The third module, ⚖️ Toxicity Meter, measures the toxicity of the generative model's responses, whether pre-trained or after the 🥞 or 👨🏼‍🏫 process.

🥞 Reinforcement Learning from AI Feedback (RLAIF)

This module uses reinforcement learning algorithms (specifically PPO) to optimise generative models in a direction decided by the reward model. The process is similar to RLHF (Reinforcement Learning from Human Feedback), but removes the human component from the loop to automate the process.
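The reward signal here comes from a classifier rather than from human annotators. As a rough illustration only (this is not the actual code path inside RLModel, and it assumes the hate/nothate label names reported on the classifier's model card), the default hate-speech classifier can be mapped to a scalar reward roughly as follows:

from transformers import pipeline

# default reward model used in this repository: a RoBERTa hate-speech classifier
reward_pipe = pipeline(
    'text-classification',
    model = 'facebook/roberta-hate-speech-dynabench-r4-target',
    top_k = None,   # return the score of every label, not just the top one
)

def toxicity_reward(responses):
    # illustrative mapping only: reward = probability of the 'nothate' label,
    # so less hateful generations receive a higher reward during PPO
    rewards = []
    for label_scores in reward_pipe(responses):
        nothate = next(s['score'] for s in label_scores if s['label'] == 'nothate')
        rewards.append(nothate)
    return rewards

print(toxicity_reward(['Thanks for sharing your point of view, let us discuss it.']))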

To 🥞 Reward a generative LM using the DIALOCONAN dataset:

  1. Select the generative and reward models you intend to use and other hyperparameters:
import torch
from rewardlm.core.RL.RLModel import RLModel

rlmanager = RLModel(
    model_id = 'EleutherAI/pythia-70m',
    reward_model_id = 'facebook/roberta-hate-speech-dynabench-r4-target',
    optimized = True,   # use 8-bit PEFT
    # log_method = 'wandb',
    bs = 256,
    # force the use of CPU on Apple Silicon devices (mps not supported):
    accelerator_kwargs = {
        'cpu': not torch.cuda.is_available(),
    },
)
  2. Download the original dataset using the built-in preprocessing functions:
from rewardlm.data.data_utils import get_DIALOCONAN_prepro

data = get_DIALOCONAN_prepro(delete_last_assistant_response = True)
dataset = rlmanager.generate_dataset(text = data)
  3. Start the PPO learning algorithm:
history = rlmanager.train_PPO(dataset = dataset)

👨🏼‍🏫 Model fine-tuning

Each generative model can be fine-tuned on the same data used for Reinforcement Learning. In this way, it is possible to compare the results obtained from both techniques.

To fine-tune a generative model using the DIALOCONAN dataset:

  1. Select the model you intend to use and instantiate a GenerativeModel to manage it:
import torch
from rewardlm.core.GenerativeModel import GenerativeModel

model_id = 'facebook/opt-350m'
generator_manager = GenerativeModel(
    model_id,
    load_dtype = '8-bit' if torch.cuda.is_available() else 'fp32',
    # force the use of CPU on Apple Silicon devices (mps not supported):
    accelerator_kwargs = {
        'cpu': not torch.cuda.is_available(),
    },
)
  2. Download the original dataset using the built-in preprocessing functions:
from rewardlm.data.data_utils import get_DIALOCONAN_prepro
from rewardlm.data.CustomDatasets import PromptDataset_CLM

data = get_DIALOCONAN_prepro()

# custom_prompt: template string wrapped around each training example
# (defined elsewhere; see the ⚖️ ToxicityMeter section below for an example of building one)
dataset = PromptDataset_CLM(
    tokenizer = generator_manager.tokenizer,
    text = data,
    custom_prompt = custom_prompt,
)
  3. Start the fine-tuning process:
generator_manager.fine_tune(
    torch_dataset = dataset, 
    optimized = torch.cuda.is_available(),
)
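With optimized = True, fine-tuning goes through PEFT, so the learned weights are typically a small adapter on top of the frozen base model rather than a full checkpoint. Below is a minimal sketch of reloading such an adapter for inference with the 🤗 peft library, assuming a LoRA-style adapter and a hypothetical adapter_path (adjust it to wherever your run saves checkpoints); the test prompt is only an example:

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = 'facebook/opt-350m'
adapter_path = './checkpoints/opt-350m-dialoconan'   # hypothetical: point this at your saved adapter

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype = torch.float32)
# attach the fine-tuned adapter weights on top of the frozen base model
model = PeftModel.from_pretrained(base_model, adapter_path)

inputs = tokenizer('Example prompt to inspect the fine-tuned behaviour.', return_tensors = 'pt')
output = model.generate(**inputs, max_new_tokens = 64)
print(tokenizer.decode(output[0], skip_special_tokens = True))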

⚖️ ToxicityMeter

ToxicityMeter measures the toxicity of a generative LM based on the output of a classifier (a RoBERTa model for hate speech by default, if no RewardModel is used).

  1. Select a configuration (or create your own):
from rewardlm.utils import load_config
config = load_config(name = 'RedPajama-INCITE-Chat-3B-v1')
  2. Use the GenerativeModel class to get a generation manager:
import torch
from transformers import GenerationConfig
from rewardlm.core.GenerativeModel import GenerativeModel
from rewardlm.ToxicityMeter import ToxicityMeter
from rewardlm.utils import load_config

generator_manager = GenerativeModel(
    config['model_id'],
    load_from_peft = config['load_from_peft'],
    generation_config=config['generation']['generation_config'],
    # force the use of CPU on Apple Silicon devices (mps not supported):
    accelerator_kwargs = {
        'cpu': not torch.cuda.is_available(),
    },
)
  3. Customize the prompt from the original dataset and generate the toxicity_df dataset:
from rewardlm.data.data_utils import get_real_toxicity_prompts

toxicity_meter = ToxicityMeter(generator_manager)
batchsize = 12
custom_prompt = (config['generation']['custom_prompt']['user_name'] + 
                 ' "{prompt}".\n' + 
                 config['generation']['custom_prompt']['bot_name'] + ' '
                )

df = get_real_toxicity_prompts()
toxicity_df = toxicity_meter.measure_toxicity(
    text_prompt = df if not config['data']['subset'] else df[:config['data']['subset_size']],
    custom_prompt=custom_prompt, 
    batch_size=batchsize,
    print_response=True,
)
  4. Save the obtained results:
load_dtype = '8-bit' if torch.cuda.is_available() else 'fp32'   # precision label used in the file name
fld = './result analysis/tmp'
toxicity_df.to_csv(
    fld + f'/measured_tox_instruct_{config["model_id"].split("/")[-1]}_{load_dtype}.csv'
)
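To take a quick look at the saved results afterwards, the CSV can be read back with pandas; the file name below is only an example of what the save step above produces for a given model and precision:

import pandas as pd

# example path matching the save step above (model name and dtype suffix depend on your run)
path = './result analysis/tmp/measured_tox_instruct_RedPajama-INCITE-Chat-3B-v1_8-bit.csv'

toxicity_df = pd.read_csv(path, index_col = 0)
print(toxicity_df.head())       # inspect the first rows
print(toxicity_df.describe())   # summary statistics of any numeric (score) columns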

Tested models and datasets

Generative language models:

  • LaMini-LM: A small-sized collection of efficient language models distilled from ChatGPT and trained on a large-scale dataset of 2.58M instructions. GitHub, Paper

  • RedPajama-*: Source

  • BloomZ: A family of models capable of following human instructions in dozens of languages zero-shot. GitHub, Paper

  • Pythia: Predominantly abandoned in favour of instructed models. A suite of models combining interpretability analysis and scaling laws to understand how knowledge develops and evolves during training in autoregressive transformers. GitHub, Paper

  • Falcon-*-instruct: Causal decoder-only models built by TII and trained on 1,500B tokens of RefinedWeb enhanced with curated corpora. Source, Source for the instructed 7B model.

Datasets:

  • Real Toxicity Prompts: Mainly used for the ⚖️ ToxicityMeter module. A dataset of 100K naturally occurring, sentence-level prompts derived from a large corpus of English web text, paired with toxicity scores from a widely-used toxicity classifier. GitHub, Paper

  • DIALOCONAN: Mainly used for the 👨🏼‍🏫 fine-tuning and 🥞 RLAIF modules. A dataset of counter-narratives to fight online hate speech. GitHub, Paper

Reward models:

  • roberta-hate-speech-dynabench-r4-target: Model trained on ∼40,000 entries, generated and labelled by trained annotators over four rounds of dynamic data creation. Paper

Development

How to set up on Google Colab:

  1. Import the main notebook in Colab
  2. Include the following cell at the beginning:
!git clone https://__TOKEN_GIT__:@github.com/DanielSc4/RewardLM.git
%cd RewardLM/
!pip install -r requirements.txt
from huggingface_hub import login
login(token = '__TOKEN_HF__')
  3. [Optional, only if the repo is private] Replace __TOKEN_GIT__ with your git token (more info here)
  4. Replace __TOKEN_HF__ with your 🤗 HuggingFace personal token
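As an optional sanity check after running the setup cell, you can verify in a new cell that the package is importable from the cloned repository and that the GPU runtime is visible:

import torch
import rewardlm   # resolvable because the working directory is the cloned RewardLM/ repo

print('CUDA available:', torch.cuda.is_available())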

How to set up the developer environment

Dependency install:

  1. Install poetry, a Python package manager
  2. It is recommended to run the following command to let poetry create the virtual environment for the project directly inside the root folder, allowing IDEs to detect dependencies and executables
poetry config virtualenvs.in-project true
  3. Inside the root folder, run poetry install to get all the dependencies. See Poetry docs for a thorough explanation of how poetry works

Activating virtual env:

To run a project file, you will need to use the interpreter installed by Poetry in the virtual environment, usually located in rewardlm/.venv/bin/. To do that, you can use the poetry run command, followed by the name of the script that you want to run (Poetry run doc).

You can also run the following command to ensure that the terminal uses the correct Python version (the one installed in the virtual env) together with its whole set of dependencies:

source .venv/bin/activate

Backlog:

  • Catch & handle the ValueError: Responses are too short. Make sure they are at least 4 tokens long. error by skipping the batch that triggers it.
  • Add support for checkpointing and tracking more info.
  • Add support for dynamic batch size based on Memory Utilities from 🤗 HuggingFace.
  • [fix] Fix short-response behaviour (fewer than 4 tokens) [fix based on generation_config; TODO: how does generation change with bigger models?]
  • Add support for model sharing (and backup) on the 🤗 HuggingFace Hub!
  • Add the possibility of using a reward manager as a reward model, to have more control over the reward system.
  • Compatibility of ⚖️ ToxicityMeter with other datasets (possibly instructional).
  • Extend ⚖️ ToxicityMeter compatibility with 🤗 Accelerate.
  • Extend the possibility of managing parameters and configurations to 🥞 RLAIF.
  • Use Inseq for analysis and interpretability of generative models in ⚖️ ToxicityMeter.