JABER (Junior Arabic BERt) is a 12-layer Arabic pretrained Language Model.
We only provide fine-tuning code for sentence classification tasks,
which allows you to reproduce the test set submission that ranked first
on the ALUE leaderboard on 01/09/2021.
We recommend creating a conda environment:
conda create -n jaber_alue python=3.6.5
conda activate jaber_alue
- Run the commands below to set up the environment:
conda install pytorch==1.5.1 torchvision==0.6.1 cudatoolkit=10.1 -c pytorch
pip install -r envs/requirements.txt
- Download the pretrained model from here and place it under `JABER-PyTorch/pretrained_models/`.
- Follow the instructions on the official ALUE website to download the ALUE datasets, then place them under `JABER-PyTorch/raw_datasets/`.
- You may need to contact the authors of ALUE to obtain the correct train/dev/test split of the `MDD` task.
- You need to provide your own dev set for the `MQ2Q` task. Please follow these instructions:
  a. Download the English `QQP` dataset from the GLUE website.
  b. Randomly select 2k positive and 2k negative samples (4k overall) from the `dev` set.
  c. Use an automatic translation service to translate the sentences into Arabic.
  d. Create a file named `JABER-PyTorch/raw_datasets/mq2q.dev.tsv` where each line contains one sample in the following format: `lbl\tquestion_1\tquestion_2\n` (see the toy file `JABER-PyTorch/raw_datasets/toy.mq2q.dev.tsv`).
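Steps b to d above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' script: it assumes the GLUE QQP dev file is a TSV with `question1`, `question2` and `is_duplicate` columns, and that the QQP label (0 or 1) is carried over unchanged as `lbl`. The function `translate_to_arabic` is a hypothetical placeholder for whichever translation service you choose.

```python
import csv
import random


def read_qqp(path):
    """Read a GLUE-style QQP TSV file into a list of dicts (one per row)."""
    with open(path, encoding="utf-8", newline="") as f:
        return list(csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE))


def sample_balanced(rows, n_per_class=2000, seed=0):
    """Randomly pick n_per_class positive and n_per_class negative pairs."""
    rng = random.Random(seed)
    # Assumption: "is_duplicate" holds the strings "1" (positive) and "0" (negative).
    positives = [r for r in rows if r["is_duplicate"] == "1"]
    negatives = [r for r in rows if r["is_duplicate"] == "0"]
    picked = rng.sample(positives, n_per_class) + rng.sample(negatives, n_per_class)
    rng.shuffle(picked)
    return picked


def translate_to_arabic(text):
    """Hypothetical placeholder: call your translation service here."""
    raise NotImplementedError("plug in a translation service")


def _clean(text):
    """Collapse tabs and newlines so every sample stays on one TSV line."""
    return " ".join(text.split())


def write_mq2q_dev(rows, out_path, translate=translate_to_arabic):
    """Write one sample per line as lbl<TAB>question_1<TAB>question_2."""
    with open(out_path, "w", encoding="utf-8", newline="") as f:
        for r in rows:
            q1 = _clean(translate(r["question1"]))
            q2 = _clean(translate(r["question2"]))
            f.write(f"{r['is_duplicate']}\t{q1}\t{q2}\n")


# Example usage (the input path is an assumption about where you saved QQP):
# rows = sample_balanced(read_qqp("QQP/dev.tsv"))
# write_mq2q_dev(rows, "JABER-PyTorch/raw_datasets/mq2q.dev.tsv")
```

Compare the output against `JABER-PyTorch/raw_datasets/toy.mq2q.dev.tsv` before using it, since the toy file is the reference for the expected label values.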
- For some necessary preprocessing, we refer you to the AraBERT code base: https://github.com/aub-mind/arabert. You can follow the steps below:
  - Download `preprocess.py`.
  - Place the file under `/JABER-PyTorch`.
  - Comment out the `ArabertPreprocessor` class in `generate_data.py`.
  - Add `from preprocess import ArabertPreprocessor` to `generate_data.py`.
- Note that our code will still run if you skip the step above (it will print a warning), but it will not produce the expected input data.
- Run this command to process the ALUE datasets:
cd JABER-PyTorch
python generate_data.py --mode train
- Please check that the directory and file names match those in the `process_alue()` method in `generate_data.py`.
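Before running `generate_data.py`, a quick sanity check of the layout can save a failed run. The sketch below only checks the paths named in this README; it does not know the dataset file names expected by `process_alue()`, so extend the `EXPECTED` list (a name introduced here for illustration) after reading that method.

```python
from pathlib import Path

# Paths mentioned in this README, relative to the JABER-PyTorch directory.
# Assumption: add the per-task file names from process_alue() yourself.
EXPECTED = [
    "pretrained_models",
    "raw_datasets",
    "raw_datasets/mq2q.dev.tsv",
]


def missing_paths(root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


# Example usage:
# for rel in missing_paths("JABER-PyTorch"):
#     print("missing:", rel)
```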
- The following command will fine-tune JABER 5 times for a given ALUE task (FID in this demo):
#export CUDA_VISIBLE_DEVICES=0 # the ID of GPU to run the experiments on
export TASK=fid # mq2q | oold | ohsd | svreg | sec | fid | xnli | mdd
bash run_alue.sh $TASK
- The above code will automatically run `run_alue.py` five times for each task using different random seeds (`--seed -1`).
- To reproduce our test submission, you need to fine-tune JABER on all tasks (40 experiments in total).
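The 40 experiments are 8 tasks times 5 seeded runs per task. A minimal sketch for launching all tasks in sequence is shown below; it assumes it is run from the `JABER-PyTorch` directory and simply wraps the `bash run_alue.sh $TASK` command shown above. The helper names `build_commands` and `run_all` are introduced here for illustration only.

```python
import subprocess

# Task identifiers accepted by run_alue.sh, as listed in this README.
TASKS = ["mq2q", "oold", "ohsd", "svreg", "sec", "fid", "xnli", "mdd"]
RUNS_PER_TASK = 5  # run_alue.sh repeats each task with 5 random seeds


def build_commands(tasks=TASKS):
    """Build one `bash run_alue.sh <task>` command per task."""
    return [["bash", "run_alue.sh", task] for task in tasks]


def run_all(tasks=TASKS, cwd="."):
    """Run every task one after another and stop at the first failure."""
    for cmd in build_commands(tasks):
        subprocess.run(cmd, check=True, cwd=cwd)


# Example usage, from inside JABER-PyTorch:
# run_all()
```

Running the tasks sequentially keeps a single GPU busy; if you have several GPUs, you may prefer to launch the tasks separately with different `CUDA_VISIBLE_DEVICES` values.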
- This will generate 40 `./alue_predictions/jaber.{TASK}.{max_dev_score}.pkl` files, each containing the test set predictions of the checkpoint that performs best on the respective dev set.
- Here are the hyper-parameters we used to generate the test files for the ALUE leaderboard:
| hp | MQ2Q | OOLD | OHSD | SVREG | SEC | FID | XNLI | MDD |
|---|---|---|---|---|---|---|---|---|
| batch_size | 64 | 128 | 32 | 8 | 16 | 32 | 16 | 32 |
| lr | 2e-5 | 2e-5 | 7e-6 | 2e-5 | 2e-5 | 2e-5 | 2e-5 | 2e-5 |
| hidden_dropout_prob | 0.3 | 0.2 | 0.3 | 0.1 | 0.1 | 0.1 | 0.1 | 0.2 |
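If you want to reuse these values in your own scripts, the table can be mirrored as a plain Python mapping. This is only an illustrative transcription of the table; the names `HPARAMS` and `get_hparams` are not part of the repository.

```python
# Per-task hyper-parameters, transcribed from the table above.
HPARAMS = {
    "mq2q":  {"batch_size": 64,  "lr": 2e-5, "hidden_dropout_prob": 0.3},
    "oold":  {"batch_size": 128, "lr": 2e-5, "hidden_dropout_prob": 0.2},
    "ohsd":  {"batch_size": 32,  "lr": 7e-6, "hidden_dropout_prob": 0.3},
    "svreg": {"batch_size": 8,   "lr": 2e-5, "hidden_dropout_prob": 0.1},
    "sec":   {"batch_size": 16,  "lr": 2e-5, "hidden_dropout_prob": 0.1},
    "fid":   {"batch_size": 32,  "lr": 2e-5, "hidden_dropout_prob": 0.1},
    "xnli":  {"batch_size": 16,  "lr": 2e-5, "hidden_dropout_prob": 0.1},
    "mdd":   {"batch_size": 32,  "lr": 2e-5, "hidden_dropout_prob": 0.2},
}


def get_hparams(task):
    """Look up the hyper-parameters for a task name (case-insensitive)."""
    return HPARAMS[task.lower()]
```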
- These configurations are already entered in `run_alue.sh`, so you do not need to set them manually.
- Finally, run the following command to generate the `.tsv` test set submission files for the ALUE leaderboard:
cd JABER-PyTorch
python generate_data.py --mode test
- For each task, it simply selects the test set predictions of the model that performs best on the respective dev set. You will find 8 `.tsv` files in `JABER-PyTorch/alue_test_submission` that you can submit directly to the ALUE leaderboard.
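The per-task selection can be illustrated with a short sketch. It is not the code in `generate_data.py`; it assumes that the `{max_dev_score}` part of each `jaber.{TASK}.{max_dev_score}.pkl` file name parses as a number and that a higher score is better. The function names are introduced here for illustration.

```python
from pathlib import Path


def parse_prediction_name(path):
    """Split 'jaber.{TASK}.{max_dev_score}.pkl' into (task, score), or None."""
    name = Path(path).name
    if not (name.startswith("jaber.") and name.endswith(".pkl")):
        return None
    middle = name[len("jaber."):-len(".pkl")]
    # Task names contain no dots, so the first dot separates task and score.
    task, _, score = middle.partition(".")
    try:
        return task, float(score)
    except ValueError:
        return None


def best_per_task(paths):
    """Keep, for each task, the prediction file with the highest dev score."""
    best = {}
    for p in paths:
        parsed = parse_prediction_name(p)
        if parsed is None:
            continue  # skip files that do not follow the naming pattern
        task, score = parsed
        if task not in best or score > best[task][0]:
            best[task] = (score, str(p))
    return {task: path for task, (score, path) in best.items()}


# Example usage:
# best = best_per_task(Path("alue_predictions").glob("jaber.*.pkl"))
```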
- Main page: https://www.noahlab.com.hk/
- Github: https://github.com/huawei-noah
This project is licensed under the Apache 2.0 license.
Please cite the following paper when using our code and model:
@misc{ghaddar2021jaber,
title={JABER: Junior Arabic BERt},
author={Abbas Ghaddar and Yimeng Wu and Ahmad Rashid and Khalil Bibi and Mehdi Rezagholizadeh and Chao Xing and Yasheng Wang and Duan Xinyu and Zhefeng Wang and Baoxing Huai and Xin Jiang and Qun Liu and Philippe Langlais},
year={2021},
eprint={2112.04329},
archivePrefix={arXiv},
primaryClass={cs.CL}
}