The code available as part of this sub-directory can be used to train a general purpose multi-task model or the multi-format based models introduced in SciRepEval.
Post the quick setup step in ReadMe, you can choose to train the following base models: (Parenthesis denote how they are referred in the paper)
- General multi task model (MTL CLS) - [CLS] token embedding is considered document representation
- Multi-task training w. Control Codes (MTL CTRL) - Control codes prepended to input and their embedding is considered document representation
- BERT PALs (PALs) - Task specific modules
- Adapters and Fusion - Task specific adapters
cd ${ROOT}/training
Define the tasks and associated metadata as a list in a json config file. Refer to sample_data/tasks_config.json for SciRepEval training config**.
** The example config below assumes the training data has been downloaded locally and uses the "data_files"
property. For using HuggingFace datasets please follow the notes following the example config and use the "dataset"
property.
Example config:
[
{
"name": "fos",
"type": "classification",
"multi_label": true,
"data_files":
{
"train": "<scirepeval_data_dir>/train/fos/train.jsonl",
"dev": "<scirepeval_data_dir>/train/fos/val.jsonl"
},
"labels": "sample_data/fos_labels.txt",
"labels_field": "labels_text",
"ctrl_token": "[CLF]",
"sample_size":
{
"train": 600000,
"dev": 40000
}
}
]
Note
"type"
can be one of["classification", "regression", "ir", "triplet"]
."classification"
is suitable for tasks with categorical (discrete) labels,;"regression"
for tasks with continuous labels;"ir"
for retrieval tasks formatted as{"query": X, "candidates": [{}]}
and"triplet"
for contrastive learning tasks formatted as{"query": q, "pos": p, "neg": n}
.- For multi label classification, add
"multi_label": true
as in the above example. - By default the pre-processing code expects "title" and "abstract" in every example. To process specific fields, provide additional property as
"input_fields": ["title", "abstract", "venue", "year"]
. - For models apart from MTL CLS, provide the
"ctrl_token"
associated with each task, for MTL CTRL it works as the special control code and for PALs and Adapters it acts as the task id to determine the module to be used in the forward pass. - Some "ir" tasks like ad-hoc search [SRCH] might require different control codes forthe query and candidates which can be provided as
"ctrl_token": {"query": "[QRY]", "candidates": "[PRX]"}
. For PALs and Adapters, this task id is internally resolved to feed the queries and candidates to their relevant modules. "sample_size"
is not required if all the samples are to be processed for the splits.- If loading data from Huggingface datsets, instead of
"data_files"
, you can provide parameters forload_dataset
method as -"dataset": {"path": <hf dataset name>, "name": <optional config name for dataset with multiple configs>}
. if "type"=="regresion": <provide the "labels_field"> elif "type" =="classification": <provide the "labels" and "labels_field">
- Losses associated with each task type:
Type | Loss |
---|---|
Classification | Cross Entropy |
Multi-label Classification | Binary Cross Entropy |
Regression | Mean Squared Error |
IR/Triplet | Triplet or Contrastive Loss |
To run the training script with default params, based upon the type of models you want to train run one of the following commands: MTL CLS
python pl_training.py --gpu 2 --tasks-config sample_data/tasks_config.json <base model name/chkpoint path> <expt name>
MTL CTRL
python pl_training.py --gpu 2 --ctrl-tokens --tasks-config sample_data/tasks_config.json <base model name/chkpoint path> <expt name>
PALs
Requires pals config file for additional model configuration. Files present under bert_pals_config
directory.
python pl_training.py --gpu 2 --pals-config pals.config.json --tasks-config sample_data/tasks_config.json <base model name/chkpoint path> <expt name>
Adapters
python pl_training.py --gpu 2 --adapter-type single --tasks-config sample_data/tasks_config.json <base model name/chkpoint path> <expt name>
Fusion
python pl_training.py --gpu 2 --adapter-type fusion --tasks-config sample_data/tasks_config.json <base model name/chkpoint path> <expt name>
model HuggingFace model to be used
version experiment version
optional arguments:
-h, --help show this help message and exit
--tasks-config TASKS_CONFG path to the task config file
--tokenizer TOKENIZER HuggingFace tokenizer to be used (same as model name if not supplied)
--output OUTPUT dir to save checkpoints and finetuned model
--pals-config PALS_CONFIG name of config file for PALS architecture
--adapter-type ADAPTER_TYPE type of adapter architecture (single/fusion)
--batch-size BATCH_SIZE batch size
--lr LR initial learning rate
--peak-lr PEAK_LR initial learning rate
--warmup WARMUP number of warmup steps
--epochs EPOCHS number of epochs
--grad-accum GRAD_ACCUM grad accumulation steps
--ctrl-tokens use control codes for tasks
--gpu GPU number of gpus
--max-len MAX_LEN max sequence length
--val-check-interval VAL_CHECK_INTERVAL validation loop interval
--checkpoint CHECKPOINT resume from checkpoint path
TensorBoard logs and checkpoints are written to <output>/full_run/<version>
directory, by default ./lightning_logs/full_run/<expt name>
.