
DataICL


Data Curation Alone Can Stabilize In-context Learning (ACL 2023)
Ting-Yun Chang and Robin Jia

🎞️ https://www.youtube.com/watch?v=ZVz5pI06FRE

📜 https://arxiv.org/pdf/2212.10378.pdf

Abstract: In-context learning (ICL) enables large language models (LLMs) to perform new tasks by prompting them with a sequence of training examples. However, it is known that ICL is very sensitive to the choice of training examples: randomly sampling examples from a training set leads to high variance in performance. In this paper, we show that carefully curating a subset of training data greatly stabilizes ICL performance without any other changes to the ICL algorithm (e.g., prompt retrieval or calibration). We introduce two methods to choose training subsets -- both score training examples individually, then select the highest-scoring ones. CondAcc scores a training example by its average dev-set ICL accuracy when combined with random training examples, while Datamodels learns linear regressors that estimate how the presence of each training example influences LLM outputs. Across five tasks and two LLMs, sampling from stable subsets selected by CondAcc and Datamodels improves average accuracy over sampling from the entire training set by 7.7% and 6.3%, respectively. Surprisingly, the stable subset examples are not especially diverse in content or low in perplexity, in contrast with other work suggesting that diversity and perplexity are important when prompting LLMs.

This repository is based on MetaICL

Content

  1. Quick Start
  2. CondAcc
  3. Datamodels
  4. Baselines
  5. Evaluation
  6. Data
  7. Construct $\mathcal{D}_{\text{ICL}}$
  8. How to use the released $\mathcal{D}_{\text{ICL}}$?
  9. Stable Subset Examples

Quick Start

  • $ pip install -r requirements.txt
  • $ bash demo/download_dicl.sh will download the released prompt-output pairs
    • see the Construct $\mathcal{D}_{\text{ICL}}$ section below for more details

CondAcc
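
CondAcc scores each training example by its average dev-set ICL accuracy when it appears in a prompt alongside random training examples (see the abstract above). Below is a minimal sketch of that scoring rule, assuming the prompt compositions and dev accuracies from $\mathcal{D}_{\text{ICL}}$ have already been loaded as arrays; the names train_ids, dev_accs, and condacc_scores are illustrative, not the repository's actual interface.

import numpy as np

def condacc_scores(train_ids: np.ndarray, dev_accs: np.ndarray) -> np.ndarray:
    # train_ids: [n_prompts, K] training-example IDs of each sampled prompt
    # dev_accs:  [n_prompts] dev-set ICL accuracy of each prompt
    n_train = train_ids.max() + 1
    total = np.zeros(n_train)
    count = np.zeros(n_train)
    for ids, acc in zip(train_ids, dev_accs):
        total[ids] += acc   # the K example IDs in a prompt are distinct
        count[ids] += 1
    return total / np.maximum(count, 1)

# Keep the highest-scoring examples as the stable subset (20 per task in the paper).
# stable_subset = np.argsort(-condacc_scores(train_ids, dev_accs))[:20]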

Datamodels

  • To train datamodels, run:
$ bash scripts/train_datamodels.sh
  • To test datamodels (Appendix A3), run:
$ bash scripts/test_datamodels.sh
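
A datamodel here is a linear regressor that estimates how the presence of each training example in a prompt influences the LLM's output (see the abstract above). The sketch below only illustrates the idea with scikit-learn; the actual training code invoked by scripts/train_datamodels.sh may differ, e.g. in the regression target or the use of regularization.

import numpy as np
from sklearn.linear_model import LinearRegression

def fit_datamodel(train_ids, targets, n_train):
    # train_ids: [n_prompts, K] training-example IDs of each sampled prompt
    # targets:   [n_prompts] quantity to predict for each prompt, derived
    #            from the LLM's outputs (e.g., accuracy on the dev set)
    X = np.zeros((len(train_ids), n_train))
    for p, ids in enumerate(train_ids):
        X[p, ids] = 1.0          # binary presence indicators
    reg = LinearRegression().fit(X, targets)
    return reg                   # reg.coef_[i] ~ influence of training example i

# Selection: keep the examples with the largest learned weights.
# stable_subset = np.argsort(-fit_datamodel(train_ids, targets, n_train).coef_)[:20]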

Baselines

  • The Oneshot baseline:
    • First, run 1-shot ICL with $ bash scripts/run_oneshot.sh {gpu_i}
    • The subset selection process is implemented in baseline_oneshot.py
  • The TopPrompts baseline is implemented in baseline_top_prompts.py
  • The Calib baseline is implemented in calib_evaluate.py
    • $ bash scripts/run_calibration.sh {gpu_i}
  • To reproduce the results in the paper, see Evaluation

Evaluation

The following scripts will run the two proposed methods and all baselines.

Labeled Setup

$ bash scripts/run_label.sh {gpu_i}

Unlabeled Setup

$ bash scripts/run_unlabel.sh {gpu_i}

OOD Setup

$ bash scripts/run_ood.sh {gpu_i}

Data

  • We include the data in data/. The files are organized as follows:
data
├── glue-sst2
│   ├── *train.jsonl
│   ├── *dev.jsonl
│   ├── *test.jsonl
│   └── unlabeled
│       ├── *train.jsonl
│       └── is_groundtruth.npy
├── boolq/
├── subj/
├── scicite/
├── glue-mnli/
└── ag_news/
  • Each task folder contains the gold-labeled train/dev/test splits (*train.jsonl, *dev.jsonl, *test.jsonl) and the unlabeled training set unlabeled/*train.jsonl; a minimal loading sketch follows this list.
  • Note that the labeled and unlabeled setups use the same dev sets for method development and are evaluated on the same test sets.
  • To reproduce the data creation for the unlabeled setup, see create_unlabeled.py
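
A minimal sketch for reading one of the .jsonl splits; each line is a JSON object in the MetaICL-style format shown in the demo output below (fields such as task, input, output, and options). The path is a placeholder, since the actual filenames carry task-specific prefixes.

import json

def load_jsonl(path):
    # e.g. a *train.jsonl file under data/glue-sst2/
    with open(path) as f:
        return [json.loads(line) for line in f]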

Construct $\mathcal{D}_{\text{ICL}}$

  • We release the set of prompt-output pairs $\mathcal{D}_{\text{ICL}}$ in Dicl. The files are organized as follows:
Dicl
├── gpt-j-6b
│   ├── label_glue-sst2
│   │   ├── *train_ids.npy
│   │   ├── *permute_ids.npy
│   │   ├── *sampled.pkl
│   │   └── merged*.pt
│   ├── unlabel_glue-sst2/
│   ├── label_boolq/
│   ├── unlabel_boolq/
│   ├── label_subj/
│   ├── unlabel_subj/
│   ├── label_scicite/
│   ├── unlabel_scicite/
│   ├── label_ag_news/
│   └── unlabel_ag_news/
├── opt-13b/
├── opt-6.7b/
└── gpt-neo-2.7B/
  • Each task has three folders, e.g., label_glue-sst2 and unlabel_glue-sst2 for the labeled/unlabeled setups (Sec 2), and test_glue-sst2 for evaluating how well Datamodels approximates the target LLM on held-out prompts (Appendix A3).

  • Each task folder contains 4 files (a loading sketch follows at the end of this section):

    • *sampled.pkl: the list of sampled prompts, where each prompt consists of a list of $K$ training examples.
    • *train_ids.npy: the training example IDs in each prompt, where each row in the array consists of $K$ example IDs.
    • *permute_ids.npy: the permutation IDs of each prompt.
    • merged*.pt: the LLM's output logits (before softmax) for the given prompts.
  • To reproduce the construction of $\mathcal{D}_{\text{ICL}}$:

    • $ bash scripts/build_dicl.sh {gpu_i} {segment_i} {permute_i}
      • gpu_i: the available gpu id, default 0.
      • segment_i: [0-4]. We divide the list of sampled prompts into 5 segments to run them in parallel.
      • permute_i: [0-1]. Given the same prompt, we run 2 different permutations.
      • add the argument --is_unlabel to build $\mathcal{D}_{\text{ICL}}$ for the unlabeled setup.
    • Note: the construction process may take hundreds of GPU hours per task
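
A minimal sketch for loading the four files of one task folder; the argument names are placeholders, since the actual filenames carry task-specific prefixes (hence the wildcards in the tree above).

import pickle
import numpy as np
import torch

def load_dicl_folder(sampled_pkl, train_ids_npy, permute_ids_npy, merged_pt):
    with open(sampled_pkl, "rb") as f:
        prompts = pickle.load(f)            # list of prompts, each a list of K training examples
    train_ids = np.load(train_ids_npy)      # [n_prompts, K] training-example IDs
    permute_ids = np.load(permute_ids_npy)  # permutation ID of each prompt
    logits = torch.load(merged_pt)          # the LLM's pre-softmax output logits
    return prompts, train_ids, permute_ids, logits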

How to use the released $\mathcal{D}_{\text{ICL}}$?

  • We include example code in demo/
  • $ bash demo/download_dicl.sh will automatically download and unzip Dicl.zip
  • $ python -m demo.demo_dicl --model gpt-j-6b --task glue-sst2 (add --is_unlabel for the unlabeled setup)
    • this returns a list of datapoints, where each datapoint is a dict that looks like the example below; a short usage sketch follows it:
    • {'train_examples': [{   'input': 'Review: whole mess \nSentiment:',
                              'options': ['negative', 'positive'],
                              'output': 'negative',
                              'task': 'glue-sst2'},
                          {   'input': 'Review: but it also comes with the '
                                       'laziness and arrogance of a thing that '
                                       "already knows it 's won . \n"
                                       'Sentiment:',
                              'options': ['negative', 'positive'],
                              'output': 'negative',
                              'task': 'glue-sst2'},
                          {   'input': 'Review: intelligent and moving . \n'
                                       'Sentiment:',
                              'options': ['negative', 'positive'],
                              'output': 'positive',
                              'task': 'glue-sst2'},
                          {   'input': 'Review: does point the way for '
                                       'adventurous indian filmmakers toward a '
                                       'crossover into nonethnic markets . \n'
                                       'Sentiment:',
                              'options': ['negative', 'positive'],
                              'output': 'positive',
                              'task': 'glue-sst2'}],
        'train_ids': np.array([101, 286, 666, 623]),
        'dev accuracy': 0.85,
        'logits': a torch.FloatTensor of shape [n_dev, n_labels]}
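
For example, the pre-softmax logits can be turned into per-dev-example predictions and checked against the reported dev accuracy. A minimal sketch, assuming the gold dev label indices are available as a LongTensor dev_labels (not part of the demo's return value):

import torch

def prompt_accuracy(datapoint, dev_labels):
    # datapoint: one dict returned by demo.demo_dicl, as shown above
    preds = datapoint["logits"].argmax(dim=-1)           # [n_dev] predicted label indices
    return (preds == dev_labels).float().mean().item()   # should correspond to datapoint["dev accuracy"]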
      

Stable Subset Examples

  • We include the identified stable subset examples in out_select (a sampling sketch follows this list)
    • label_stable_subsets/{model}-{task}-{CondAcc/Datamodels}.jsonl: each file contains a stable subset (20 examples) identified by CondAcc or Datamodels in the labeled setup
    • unlabel_stable_subsets/{model}-{task}-CondAcc.jsonl: each file contains a stable subset (20 examples) identified by CondAcc in the unlabeled setup
    • good_example_ids/*.npy: each file contains the corresponding 20 example IDs
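
A minimal sketch of sampling a K-shot prompt from one of the released stable subsets; K=4 matches the demo output above, and the path is a placeholder following the pattern in the list.

import json
import random

def sample_prompt(stable_subset_path, k=4, seed=0):
    # stable_subset_path: e.g. a file under out_select/label_stable_subsets/
    with open(stable_subset_path) as f:
        subset = [json.loads(line) for line in f]   # 20 training examples
    return random.Random(seed).sample(subset, k)    # one K-shot prompt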
