# AutoTinyBERT: Automatic Hyper-parameter Optimization for Efficient Pre-trained Language Models (ACL 2021)
Pre-trained language models (PLMs) have achieved great success in natural language processing. Most PLMs follow the default architecture hyper-parameter settings of BERT (e.g., the hidden dimension is a quarter of the intermediate dimension in the feed-forward sub-networks). In this paper, we adopt one-shot Neural Architecture Search (NAS) to automatically search architecture hyper-parameters for efficient pre-trained language models (at least 6x faster than BERT-base). Our framework is illustrated as follows:
For more details about the techniques of AutoTinyBERT, please refer to our paper:
We release the Model Zoo of AutoTinyBERT here. Speedup is measured against BERT-base (L12 D768).
Version | Speedup (CPU) | SQuADv1 (dev) | GLUE (dev) | Link |
---|---|---|---|---|
S1 | 7.2x | 83.3 | 78.3 | S1[b4db] |
S2 | 15.7x | 78.1 | 76.4 | S2[pq9i] |
S3 | 20.2x | 75.8 | 75.3 | S3[a52b] |
S4 | 27.2x | 71.9 | 73.0 | S4[msen] |
KD-S1 | 4.6x | 87.6 | 81.2 | KD-S1[lv15] |
KD-S2 | 9.0x | 84.6 | 77.5 | KD-S2[agob] |
KD-S3 | 10.7x | 83.3 | 76.2 | KD-S3[9pi2] |
KD-S4 | 17.0x | 78.7 | 73.5 | KD-S4[l9lc] |
Our released code can directly load the pre-trained models; the models can also be used in Hugging Face Transformers with the following small modifications:
class BertSelfAttention(nn.Module):
    def __init__(self, config):
        ### Before modifications:
        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
        ### After modifications:
        try:
            qkv_size = config.qkv_size
        except AttributeError:
            # vanilla BERT configs have no qkv_size; fall back to hidden_size
            qkv_size = config.hidden_size
        self.attention_head_size = int(qkv_size / config.num_attention_heads)

class BertSelfOutput(nn.Module):
    def __init__(self, config):
        ### Before modifications:
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        ### After modifications:
        try:
            qkv_size = config.qkv_size
        except AttributeError:
            qkv_size = config.hidden_size
        self.dense = nn.Linear(qkv_size, config.hidden_size)
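The try/except above is equivalent to a getattr with a default. The following standalone sketch (toy config objects and illustrative values, not the actual Transformers classes) shows how qkv_size is resolved for an AutoTinyBERT-style config and how a vanilla BERT config falls back to hidden_size:

```python
from types import SimpleNamespace

# Toy stand-ins for config objects; the values below are illustrative only.
autotiny_cfg = SimpleNamespace(hidden_size=564, num_attention_heads=8, qkv_size=512)
vanilla_cfg = SimpleNamespace(hidden_size=768, num_attention_heads=12)

def resolve_qkv_size(config):
    # Same fallback as the try/except above, written with getattr.
    return getattr(config, "qkv_size", config.hidden_size)

print(resolve_qkv_size(autotiny_cfg) // autotiny_cfg.num_attention_heads)  # head size 64
print(resolve_qkv_size(vanilla_cfg) // vanilla_cfg.num_attention_heads)    # head size 64 (fallback to hidden_size)
```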
We first generate the training data with generate_data.py for pre-training or knowledge distillation.
python generate_data.py --train_corpus ${wiki_book_corpus} --bert_model ${bert_base} --output_dir ${train_data_dir} \
--do_lower_case --reduce_memory
${wiki_book_corpus} is the raw corpus: each line is a sentence, and documents are separated by a blank line.
${bert_base} is the directory of BERT-base; only its vocabulary file is used here.
${train_data_dir} is the output directory for the generated data.
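For example (made-up sentences), the raw corpus file is expected to look like this, with one sentence per line and a blank line between documents:

```
This is the first sentence of document one.
This is the second sentence of document one.

Document two starts after the blank line.
It also has one sentence per line.
```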
Then we use pre_training.py to train a SuperPLM with the mlm-loss or the kd-loss.
### For the mlm-loss setting:
python -m torch.distributed.launch \
--nproc_per_node=$1 \
--nnodes=$2 \
--node_rank=$3 \
--master_addr=$4 \
--master_port=$5 \
pre_training.py \
--pregenerated_data ${train_data_dir} \
--cache_dir ${cache_dir} \
--epochs ${epochs} \
--gradient_accumulation_steps ${gradient_accumulation_steps} \
--train_batch_size ${train_batch_size} \
--learning_rate ${learning_rate} \
--max_seq_length ${max_seq_length} \
--student_model ${student_model} \
--masked_lm_prob 0.15 \
--do_lower_case --fp16 --scratch --mlm_loss
### For the kd-loss setting:
python -m torch.distributed.launch \
--nproc_per_node=$1 \
--nnodes=$2 \
--node_rank=$3 \
--master_addr=$4 \
--master_port=$5 \
pre_training.py \
--pregenerated_data ${train_data_dir} \
--cache_dir ${cache_dir} \
--epochs ${epochs} \
--gradient_accumulation_steps ${gradient_accumulation_steps} \
--train_batch_size ${train_batch_size} \
--learning_rate ${learning_rate} \
--max_seq_length ${max_seq_length} \
--student_model ${student_model} \
--teacher_model ${teacher_model} \
--masked_lm_prob 0 \
--do_lower_case --fp16 --scratch
${train_data_dir} is the directory of the dataset generated by generate_data.py.
${student_model} refers to the directory of the SuperPLM.
${teacher_model} is the directory of the teacher model; we use ELECTRA-base in our paper.
${cache_dir} is the output directory.
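As a side note, the effective global batch size is the product of the per-GPU batch size, the gradient accumulation steps, and the number of processes. The sketch below assumes --train_batch_size is the per-GPU micro-batch size; pre_training.py may apply a different convention (e.g., dividing by gradient_accumulation_steps internally), so please check the script:

```python
# Illustrative numbers only; the real values come from the launch arguments above.
nproc_per_node = 8              # $1: GPUs per node
nnodes = 1                      # $2: number of nodes
train_batch_size = 16           # assumed per-GPU micro-batch size
gradient_accumulation_steps = 4

world_size = nproc_per_node * nnodes
effective_batch_size = train_batch_size * gradient_accumulation_steps * world_size
print(effective_batch_size)     # 512
```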
We first build the latency predictor Lat(*) with inference_time_evaluation.py and latency_predictor.py. The first script generates the latency dataset, and the second trains the Lat(*) predictor on that dataset. Through these scripts, we obtain the model file time.pt of Lat(*).
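Conceptually, Lat(*) is a small regressor that maps architecture hyper-parameters to measured CPU latency. The sketch below only illustrates this idea; the feature layout, layer sizes, and class name are assumptions and do not reflect the actual latency_predictor.py implementation:

```python
import torch
import torch.nn as nn

class LatencyPredictor(nn.Module):
    """Toy MLP that maps architecture hyper-parameters to a latency estimate."""

    def __init__(self, feature_dim=5, hidden_dim=400):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, features):
        # features: [layer_num, hidden_size, qkv_size, intermediate_size, head_num]
        return self.net(features)

predictor = LatencyPredictor()
arch = torch.tensor([[5.0, 564.0, 512.0, 1054.0, 8.0]])  # illustrative architecture
print(predictor(arch).item())  # untrained here, so the value is meaningless
```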
Then, we do the search as follows:
[1] Obtain candidates
python searcher.py --ckpt_path latency/mlm_model/time.pt \
--latency_constraint 7 --method Candidate --model MLM \
--candidate_file cands/mlm_7x
The candidates will be saved in ${candidate_file}, and you can set a specific ${latency_constraint}.
[2] Random Search
python searcher.py --ckpt_path latency/mlm_model/time.pt \
--candidate_file cands/mlm_7x --latency_constraint 7 \
--method Random --model MLM --output_file cands/1st_generation.cands
[3] Fast Search
python searcher.py --ckpt_path latency/mlm_model/time.pt \
--candidate_file cands/mlm_7x --latency_constraint 7 \
--method Fast --model MLM --output_file cands/1st_generation.fast.cands
[4] Evaluation of candidates
python superbert_run_en_classifier.py --data_dir "dataset/glue/MNLI dataset/SQuAD" \
--model model/SuperBERT_MLM/ --task_name "mnli squad" --output_dir output/ \
--do_lower_case --arches_file cands/1st_generation.fast.cands
--model refers to the directory of the pre-trained SuperBERT model.
[5] Evolved Search
python searcher.py --ckpt_path latency/mlm_model/time.pt --candidate_file cands/mlm_7x \
--latency_constraint 7 --method Evolved --model MLM --output_file cands/1st_generation.evo.cands \
--arch_perfs_file output/subbert.results
--arch_perfs_file refers to the sub-model evaluation results generated in step [4].
For the evolutionary search, we first perform [2] to generate the first generation of architectures, then evaluate them with [4], and run the evolutionary algorithm [5] on the evaluation results to generate the next generation. We iteratively perform [4] and [5] until the maximum number of iterations is reached.
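The overall loop can be summarized by the following sketch, where evaluate_candidates and evolve are hypothetical placeholders for steps [4] and [5] (in practice those steps are run via the commands above):

```python
def iterative_search(first_generation, max_iterations, evaluate_candidates, evolve):
    """Alternate evaluation [4] and evolved search [5] until the iteration budget is used."""
    generation = first_generation      # produced by the Random search, step [2]
    history = {}                       # architecture -> dev performance
    for _ in range(max_iterations):
        history.update(evaluate_candidates(generation))  # step [4]
        generation = evolve(history)                     # step [5]
    return max(history, key=history.get)                 # best architecture found
```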
After the search, we obtain the optimal architecture. Then we extract the corresponding sub-model with submodel_extractor.py and perform further training with pre_training.py.
## Sub-model extraction
python submodel_extractor.py --model model/SuperBERT_MLM/ \
--arch "{'sample_layer_num': 5, 'sample_num_attention_heads': [8, 8, 8, 8, 8], 'sample_qkv_sizes': [512, 512, 512, 512, 512], 'sample_hidden_size': 564, 'sample_intermediate_sizes': [1054, 1054, 1054, 1054, 1054]}" \
--output extracted_model/
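Since the --arch value is a Python-dict-style string (as the quoting in the command above suggests), it can also be built programmatically, e.g.:

```python
# Build the --arch string for submodel_extractor.py; keys follow the command above.
arch = {
    "sample_layer_num": 5,
    "sample_num_attention_heads": [8] * 5,
    "sample_qkv_sizes": [512] * 5,
    "sample_hidden_size": 564,
    "sample_intermediate_sizes": [1054] * 5,
}
print(str(arch))  # pass this string as the --arch argument
```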
## Further train
### For the mlm-loss setting:
python -m torch.distributed.launch \
--nproc_per_node=$1 \
--nnodes=$2 \
--node_rank=$3 \
--master_addr=$4 \
--master_port=$5 \
pre_training.py \
--pregenerated_data ${train_data_dir} \
--cache_dir ${cache_dir} \
--epochs ${epochs} \
--gradient_accumulation_steps ${gradient_accumulation_steps} \
--train_batch_size ${train_batch_size} \
--learning_rate ${learning_rate} \
--max_seq_length ${max_seq_length} \
--student_model ${student_model} \
--masked_lm_prob 0.15 \
--do_lower_case --fp16 --mlm_loss --further_train
${student_model} here refers to the extracted sub-model.
The kd-loss setting uses a similar command, differing only in the kd-loss-related arguments.
- Latency is evaluated on an Intel(R) Xeon(R) CPU E7-4850 v2 @ 2.30GHz.
- Apex is required for fp16 training.
- NVIDIA GPUs and NCCL are required for distributed training.
Our code is developed based on HAT and Transformers.