JABER PyTorch version

Apache License

JABER (Junior Arabic BERt) is a 12-layer Arabic pretrained language model. We only provide fine-tuning code for sentence classification tasks, which allows you to reproduce the test set submission that ranked first on the ALUE leaderboard on 01/09/2021.


We recommend creating a conda environment:

conda create -n jaber_alue python=3.6.5
conda activate jaber_alue
  • Run the commands below to install the environment:
conda install pytorch==1.5.1 torchvision==0.6.1 cudatoolkit=10.1 -c pytorch
pip install -r envs/requirements.txt

Download Dependencies

  1. Download the pretrained model from here and then place it under JABER-PyTorch/pretrained_models/.

  2. Follow the instructions to download the ALUE datasets from their official website, and then place them under JABER-PyTorch/raw_datasets/.

  3. You may need to contact the authors of ALUE to obtain the correct train/dev/test split of the MDD task.

  4. You need to provide your own dev set for the MQ2Q task. To build it, follow these steps:

    a. Download the English QQP dataset from the GLUE website.

    b. Randomly select 2k positive and 2k negative samples (4k overall) from the dev set.

    c. Use an automatic translation service to translate the sentences into Arabic.

    d. Create a file named JABER-PyTorch/raw_datasets/ where each line contains one sample, formatted as follows: lbl\tquestion_1\tquestion_2\n (see the toy file JABER-PyTorch/raw_datasets/)
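
Steps b and d above can be sketched in Python as follows. This is a minimal illustration, not part of the repository: the function names are hypothetical, the QQP column names (is_duplicate, question1, question2) are an assumption about the GLUE dev file, and the translation step (c) is left to whichever service you use.

```python
import csv
import random


def read_qqp(path):
    """Read a QQP .tsv file into (label, question_1, question_2) tuples.

    Assumption: the file has a header row with the columns
    is_duplicate, question1 and question2.
    """
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [(r["is_duplicate"], r["question1"], r["question2"]) for r in reader]


def sample_balanced(rows, per_class=2000, seed=0):
    """Randomly pick per_class positive and per_class negative rows."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[0] == "1"]
    negatives = [r for r in rows if r[0] == "0"]
    picked = rng.sample(positives, per_class) + rng.sample(negatives, per_class)
    rng.shuffle(picked)
    return picked


def format_line(label, question_1, question_2):
    """Build one output line: label, tab, question 1, tab, question 2, newline."""
    # Collapse tabs/newlines inside a question so they cannot break the format.
    q1 = " ".join(question_1.split())
    q2 = " ".join(question_2.split())
    return f"{label}\t{q1}\t{q2}\n"


def write_dev_file(rows, path):
    """Write the (already translated) rows, one sample per line."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        for label, q1, q2 in rows:
            f.write(format_line(label, q1, q2))
```

A typical flow would be read_qqp, then sample_balanced, then translating both questions of each picked row, then write_dev_file with the path expected by the data processing script.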

Process Data

  • Some necessary preprocessing relies on the AraBERT code base. To set it up, follow the steps below:

    1. Download the
    2. Add the file under /JABER-PyTorch
    3. Comment the ArabertPreprocessor class in
    4. Add from preprocess import ArabertPreprocessor in
  • Note that our code will still run if you skip the steps above (it will print a warning), but it will not produce the expected input data.

  • Run this command to process ALUE datasets:

cd JABER-PyTorch
python --mode train 
  • Please check that the directory and file names match those in the process_alue() method in

ALUE Fine-Tuning

  • The following command will fine-tune JABER five times for a given ALUE task (FID in this demo):
#export CUDA_VISIBLE_DEVICES=0 # the ID of GPU to run the experiments on
export TASK=fid # mq2q | oold | ohsd | svreg | sec  | fid  | xnli | mdd
bash $TASK
  • The command above automatically runs five times for each task, using a different random seed each time (--seed -1).

  • To reproduce our test submission you need to fine-tune JABER on all tasks (8 tasks × 5 runs = 40 experiments in total).

  • This generates 40 ./alue_predictions/jaber.{TASK}.{max_dev_score}.pkl files, each of which contains the test set predictions of the checkpoint that performed best on the respective dev set.

  • Here are the hyper-parameters we used to generate the test files for the ALUE leaderboard:

batch_size           64    128   32    8     16    32    16    32
lr                   2e-5  2e-5  7e-6  2e-5  2e-5  2e-5  2e-5  2e-5
hidden_dropout_prob  0.3   0.2   0.3   0.1   0.1   0.1   0.1   0.2
  • However, we already entered these configurations in

  • Finally, run the following command to generate the .tsv test set submission files for the ALUE leaderboard:

cd JABER-PyTorch
python --mode test 
  • For each task, it simply selects the test set predictions of the model that performed best on the respective dev set. You will find 8 .tsv files in JABER-PyTorch/alue_test_submission that you can submit directly to the ALUE leaderboard.
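
The selection rule described above can be illustrated with a short Python sketch. It is not the repository's script: the function names are hypothetical, and it assumes that the {max_dev_score} part of each file name parses as a number and that a higher dev score is better for every task.

```python
def parse_prediction_name(name):
    """Split 'jaber.{TASK}.{max_dev_score}.pkl' into (task, score)."""
    prefix, suffix = "jaber.", ".pkl"
    if not (name.startswith(prefix) and name.endswith(suffix)):
        raise ValueError(f"unexpected file name: {name}")
    # The task name has no dot, so the first dot separates task from score;
    # the score itself may contain a decimal point.
    task, _, score = name[len(prefix):-len(suffix)].partition(".")
    return task, float(score)


def best_per_task(names):
    """Map each task to the file name with the highest dev score."""
    best = {}
    for name in names:
        task, score = parse_prediction_name(name)
        # Assumption: higher dev score is better for every task.
        if task not in best or score > best[task][0]:
            best[task] = (score, name)
    return {task: name for task, (_, name) in best.items()}
```

Given the 40 file names in ./alue_predictions, best_per_task would return 8 entries, one per task, matching the 8 .tsv files mentioned above.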

This project is licensed under the Apache 2.0 license.


Please cite the following paper when using our code and model:

      title={JABER: Junior Arabic BERt}, 
