NEZHA_PyTorch
alue_predictions
alue_test_submission
pretrained_models
raw_datasets
LICENSE
README.md
THIRD PARTY OPEN SOURCE SOFTWARE NOTICE.txt
compute_metrics.py
generate_data.py
processors.py
requirements.txt
run_alue.py
run_alue.sh
tokenizationBBPE.py

README.md

JABER PyTorch version

JABER (Junior Arabic BERt) is a 12-layer Arabic pretrained language model. We only provide fine-tuning code for sentence classification tasks, which allows you to reproduce the test set submission that ranked first on the ALUE leaderboard as of 01/09/2021.

Requirements

We recommend creating a conda environment:

conda create -n jaber_alue python=3.6.5
conda activate jaber_alue

Run the commands below to install the dependencies:

conda install pytorch==1.5.1 torchvision==0.6.1 cudatoolkit=10.1 -c pytorch
pip install -r requirements.txt

Download Dependencies

Download the pretrained model from here and place it under JABER-PyTorch/pretrained_models/.
Follow the instructions on the official ALUE website to download the datasets, and place them under JABER-PyTorch/raw_datasets/.
You may need to contact the ALUE authors to obtain the correct train/dev/test split of the MDD task.
You need to provide your own dev set for the MQ2Q task. To build it, follow these steps:

a. Download the English QQP dataset from the GLUE website.

b. Randomly select 2k positive and 2k negative samples (4k overall) from the dev set.

c. Use an automatic translation service to translate the sentences into Arabic.

d. Create a file named JABER-PyTorch/raw_datasets/mq2q.dev.tsv where each line contains one sample in the following format: lbl\tquestion_1\tquestion_2\n (see the toy file JABER-PyTorch/raw_datasets/toy.mq2q.dev.tsv).
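
Steps b and d can be sketched in Python as follows. This is a minimal sketch, not the authors' script: it assumes the QQP dev rows are already loaded as (label, question_1, question_2) string tuples with label "1" for duplicates and "0" otherwise. The function names sample_balanced and write_mq2q_dev and the fixed seed are hypothetical. Translation (step c) is not shown and would be applied to the questions before writing.

```python
import random


def sample_balanced(rows, per_class=2000, seed=0):
    """Randomly pick `per_class` positive and `per_class` negative rows.

    `rows` holds (label, question_1, question_2) tuples with label "1"
    or "0". This layout is an assumption, not the official QQP format.
    """
    rows = list(rows)
    rng = random.Random(seed)
    positives = [r for r in rows if r[0] == "1"]
    negatives = [r for r in rows if r[0] == "0"]
    sample = rng.sample(positives, per_class) + rng.sample(negatives, per_class)
    rng.shuffle(sample)
    return sample


def write_mq2q_dev(rows, path):
    """Write one sample per line as lbl<TAB>question_1<TAB>question_2."""
    with open(path, "w", encoding="utf-8") as f:
        for label, q1, q2 in rows:
            # Collapse whitespace: a tab or newline inside a question
            # would break the three-column layout.
            q1, q2 = (" ".join(q.split()) for q in (q1, q2))
            f.write(f"{label}\t{q1}\t{q2}\n")
```

With the default per_class=2000, the sample has 4k rows; after translating them, pass the result to write_mq2q_dev with the path JABER-PyTorch/raw_datasets/mq2q.dev.tsv.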

Process Data

For the necessary preprocessing, we refer you to the AraBERT code base: https://github.com/aub-mind/arabert. Follow these steps:
1. Download preprocess.py.
2. Place the file under /JABER-PyTorch.
3. Comment out the ArabertPreprocessor class in generate_data.py.
4. Add from preprocess import ArabertPreprocessor to generate_data.py.
Note that our code will still run if you skip the steps above (it will print a warning), but it will not produce the expected input data.
Run this command to process the ALUE datasets:

cd JABER-PyTorch
python generate_data.py --mode train

Please check that the directory and file names match those in the process_alue() method in generate_data.py.

ALUE FineTuning

The following command fine-tunes JABER 5 times on a given ALUE task (FID in this demo):

#export CUDA_VISIBLE_DEVICES=0 # the ID of GPU to run the experiments on
export TASK=fid # mq2q | oold | ohsd | svreg | sec  | fid  | xnli | mdd
bash run_alue.sh $TASK

The above command automatically runs run_alue.py five times for the given task, each time with a different random seed (--seed -1).
To reproduce our test submission, you need to fine-tune JABER on all 8 tasks (40 experiments in total).
This generates 40 ./alue_predictions/jaber.{TASK}.{max_dev_score}.pkl files, each containing the test set predictions of the checkpoint that performed best on the respective dev set.
Here are the hyper-parameters we used to generate the test files for the ALUE leaderboard:

hp	MQ2Q	OOLD	OHSD	SVREG	SEC	FID	XNLI	MDD
batch_size	64	128	32	8	16	32	16	32
lr	2e-5	2e-5	7e-6	2e-5	2e-5	2e-5	2e-5	2e-5
hidden_dropout_prob	0.3	0.2	0.3	0.1	0.1	0.1	0.1	0.2

These configurations are already set in run_alue.sh.
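
If you launch runs from your own script instead of run_alue.sh, the table can be kept as a simple lookup. The sketch below only copies the values listed above; the names HPARAMS and get_hparams are hypothetical and are not part of run_alue.sh or run_alue.py.

```python
# Hyper-parameters per ALUE task, copied from the table above.
HPARAMS = {
    "mq2q":  {"batch_size": 64,  "lr": 2e-5, "hidden_dropout_prob": 0.3},
    "oold":  {"batch_size": 128, "lr": 2e-5, "hidden_dropout_prob": 0.2},
    "ohsd":  {"batch_size": 32,  "lr": 7e-6, "hidden_dropout_prob": 0.3},
    "svreg": {"batch_size": 8,   "lr": 2e-5, "hidden_dropout_prob": 0.1},
    "sec":   {"batch_size": 16,  "lr": 2e-5, "hidden_dropout_prob": 0.1},
    "fid":   {"batch_size": 32,  "lr": 2e-5, "hidden_dropout_prob": 0.1},
    "xnli":  {"batch_size": 16,  "lr": 2e-5, "hidden_dropout_prob": 0.1},
    "mdd":   {"batch_size": 32,  "lr": 2e-5, "hidden_dropout_prob": 0.2},
}


def get_hparams(task):
    """Return the hyper-parameters for a task name such as "fid"."""
    key = task.lower()
    if key not in HPARAMS:
        raise ValueError(f"Unknown ALUE task: {task!r}")
    return HPARAMS[key]
```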
Finally, run the following command to generate the .tsv test set submission files for the ALUE leaderboard:

cd JABER-PyTorch
python generate_data.py --mode test

It simply selects, for each task, the test set predictions of the model that performed best on the respective dev set. You will find 8 .tsv files in JABER-PyTorch/alue_test_submission that you can submit directly to the ALUE leaderboard.
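
The selection step can be illustrated with a short sketch. It is not the code in generate_data.py: it assumes only the jaber.{TASK}.{max_dev_score}.pkl naming scheme described earlier, that the score part parses as a plain number, and that a higher dev score is better for every task. The function name best_prediction_files is hypothetical.

```python
import re
from pathlib import Path

# Matches jaber.{TASK}.{max_dev_score}.pkl; the score may itself contain a dot.
PRED_NAME = re.compile(r"^jaber\.(?P<task>[^.]+)\.(?P<score>.+)\.pkl$")


def best_prediction_files(pred_dir):
    """Map each task to the prediction file with the highest dev score."""
    best = {}
    for path in Path(pred_dir).glob("jaber.*.pkl"):
        match = PRED_NAME.match(path.name)
        if match is None:
            continue
        try:
            score = float(match.group("score"))
        except ValueError:
            continue  # score part is not a plain number; skip this file
        task = match.group("task")
        if task not in best or score > best[task][0]:
            best[task] = (score, path)
    return {task: path for task, (score, path) in best.items()}
```

Calling best_prediction_files("./alue_predictions") after all 40 runs should return one file per task under these assumptions.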

Join the Huawei Noah's Ark community

Main page: https://www.noahlab.com.hk/
Github: https://github.com/huawei-noah

License

This project is licensed under the Apache 2.0 license.

Citation

Please cite the following paper when using our code and model:

@misc{ghaddar2021jaber,
      title={JABER: Junior Arabic BERt}, 
      author={Abbas Ghaddar and Yimeng Wu and Ahmad Rashid and Khalil Bibi and Mehdi Rezagholizadeh and Chao Xing and Yasheng Wang and Duan Xinyu and Zhefeng Wang and Baoxing Huai and Xin Jiang and Qun Liu and Philippe Langlais},
      year={2021},
      eprint={2112.04329},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
