AutoTinyBERT

AutoTinyBERT: Automatic Hyper-parameter Optimization for Efficient Pre-trained Language Models (ACL 2021)

Overview

Pre-trained language models (PLMs) have achieved great success in natural language processing. Most PLMs follow BERT's default architecture hyper-parameter settings (e.g., the hidden dimension is a quarter of the intermediate dimension in the feed-forward sub-networks). In this paper, we adopt one-shot Neural Architecture Search (NAS) to automatically search the architecture hyper-parameters of efficient pre-trained language models (at least 6x faster than BERT-base). An overview figure of the framework is available in the repository.
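As a rough illustration of the kind of search space involved, the snippet below samples architecture hyper-parameters in the same dict format used later by submodel_extractor.py. The candidate value lists are placeholders for illustration, not the search space used in the paper or code.

import random

# Placeholder candidate values; the real search space is defined in the paper and code.
LAYER_NUMS = [4, 5, 6, 7]
HIDDEN_SIZES = [384, 448, 512, 564]
HEAD_NUMS = [6, 8, 12]
HEAD_DIMS = [32, 64]
INTERMEDIATE_SIZES = [768, 1024, 1054, 1536]

def sample_arch():
    """Sample one candidate architecture as a hyper-parameter dict."""
    layer_num = random.choice(LAYER_NUMS)
    head_num = random.choice(HEAD_NUMS)
    qkv_size = head_num * random.choice(HEAD_DIMS)  # keep qkv_size divisible by head_num
    return {
        "sample_layer_num": layer_num,
        "sample_num_attention_heads": [head_num] * layer_num,
        "sample_qkv_sizes": [qkv_size] * layer_num,
        "sample_hidden_size": random.choice(HIDDEN_SIZES),
        "sample_intermediate_sizes": [random.choice(INTERMEDIATE_SIZES)] * layer_num,
    }

print(sample_arch())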

For more details about the techniques of AutoTinyBERT, please refer to our ACL 2021 paper.

Model Zoo

We release the Model Zoo of AutoTinyBERT here. Speedup is measured relative to BERT-base (L12, D768).

Version   Speedup (CPU)   SQuADv1 (dev)   GLUE (dev)   Link
S1        7.2x            83.3            78.3         S1[b4db]
S2        15.7x           78.1            76.4         S2[pq9i]
S3        20.2x           75.8            75.3         S3[a52b]
S4        27.2x           71.9            73.0         S4[msen]
KD-S1     4.6x            87.6            81.2         KD-S1[lv15]
KD-S2     9.0x            84.6            77.5         KD-S2[agob]
KD-S3     10.7x           83.3            76.2         KD-S3[9pi2]
KD-S4     17.0x           78.7            73.5         KD-S4[l9lc]

Use in Transformers

Our released code can directly load the pre-trained models. The models can also be used in Hugging Face Transformers with the small modifications shown below:

import torch.nn as nn

class BertSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        ### Before modifications:
        # self.attention_head_size = int(config.hidden_size / config.num_attention_heads)

        ### After modifications:
        # AutoTinyBERT decouples the Q/K/V projection size (qkv_size) from the
        # hidden size; fall back to hidden_size when the config has no qkv_size.
        qkv_size = getattr(config, "qkv_size", config.hidden_size)
        self.attention_head_size = int(qkv_size / config.num_attention_heads)

class BertSelfOutput(nn.Module):
    def __init__(self, config):
        super().__init__()
        ### Before modifications:
        # self.dense = nn.Linear(config.hidden_size, config.hidden_size)

        ### After modifications:
        qkv_size = getattr(config, "qkv_size", config.hidden_size)
        self.dense = nn.Linear(qkv_size, config.hidden_size)
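After these modifications, a released checkpoint can be loaded like any other BERT model. The following is a minimal sketch assuming the checkpoint has been downloaded and unpacked into a local directory (hypothetically named AutoTinyBERT-S1 here) containing config.json, the vocabulary, and the weights; an extra qkv_size field in config.json, if present, is picked up by the patched code.

from transformers import BertConfig, BertModel, BertTokenizer

ckpt_dir = "AutoTinyBERT-S1"  # hypothetical local path to an unpacked checkpoint

config = BertConfig.from_pretrained(ckpt_dir)        # may carry an extra qkv_size field
tokenizer = BertTokenizer.from_pretrained(ckpt_dir)
model = BertModel.from_pretrained(ckpt_dir, config=config)

inputs = tokenizer("AutoTinyBERT is an efficient PLM.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)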

Train

Generate Data

We first generate the training data with generate_data.py for pre-training or knowledge distillation.

python generate_data.py --train_corpus ${wiki_book_corpus} --bert_model ${bert_base} --output_dir ${train_data_dir} \
      --do_lower_case --reduce_memory

${wiki_book_corpus} is the raw text corpus: each line is a sentence, and documents are separated by a blank line (see the illustrative snippet below).
${bert_base} is the directory of BERT-base; only its vocabulary file is used here.
${train_data_dir} is the output directory for the generated data.
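A minimal sketch of the expected corpus layout, writing a toy corpus to a hypothetical file name:

# Illustrative only: one sentence per line, a blank line between documents.
corpus = [
    ["The first document starts here.", "It contains two sentences."],
    ["The second document is a single sentence."],
]
with open("toy_corpus.txt", "w", encoding="utf-8") as f:
    for doc in corpus:
        f.write("\n".join(doc) + "\n\n")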

Train SuperPLM

Then we use pre_training.py to train a SuperPLM with the mlm-loss or kd-loss.

### For the mlm-loss setting:
python -m torch.distributed.launch \
    --nproc_per_node=$1 \
    --nnodes=$2 \
    --node_rank=$3 \
    --master_addr=$4 \
    --master_port=$5 \
    pre_training.py \
    --pregenerated_data ${train_data_dir} \
    --cache_dir ${cache_dir} \
    --epochs ${epochs} \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --train_batch_size ${train_batch_size} \
    --learning_rate ${learning_rate} \
    --max_seq_length ${max_seq_length} \
    --student_model ${student_model} \
    --masked_lm_prob 0.15 \
    --do_lower_case --fp16 --scratch --mlm_loss


### For the kd-loss setting:
python -m torch.distributed.launch \
    --nproc_per_node=$1 \
    --nnodes=$2 \
    --node_rank=$3 \
    --master_addr=$4 \
    --master_port=$5 \
    pre_training.py \
    --pregenerated_data ${train_data_dir} \
    --cache_dir ${cache_dir} \
    --epochs ${epochs} \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --train_batch_size ${train_batch_size} \
    --learning_rate ${learning_rate} \
    --max_seq_length ${max_seq_length} \
    --student_model ${student_model} \
    --teacher_model ${teacher_model} \
    --masked_lm_prob 0 \
    --do_lower_case --fp16 --scratch 

${train_data_dir} is the directory of the dataset generated by generate_data.py.
${student_model} is the directory of the SuperPLM.
${teacher_model} is the directory of the teacher model; we use ELECTRA-base in our paper.
${cache_dir} is the output directory.
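In the kd-loss setting the student imitates the teacher instead of predicting masked tokens (note --masked_lm_prob 0). The exact objective is implemented in pre_training.py; the sketch below shows one common layer-wise formulation (hidden-state MSE with a projection) purely for illustration, and is not necessarily the loss used in this repository.

import torch.nn.functional as F

def layerwise_kd_loss(student_hiddens, teacher_hiddens, proj):
    """Illustrative layer-wise distillation loss (hidden-state MSE only).

    student_hiddens / teacher_hiddens: per-layer tensors of shape
    [batch, seq_len, dim], with student layers already mapped to a subset
    of teacher layers; proj is an nn.Linear that lifts the student hidden
    size to the teacher hidden size.
    """
    loss = 0.0
    for s_h, t_h in zip(student_hiddens, teacher_hiddens):
        loss = loss + F.mse_loss(proj(s_h), t_h)
    return loss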

Random|Fast|Evolved Search

We first build the latency predictor Lat(*) with inference_time_evaluation.py and latency_predictor.py. The first script generates the (architecture, latency) dataset, and the second trains the Lat(*) predictor on that dataset, producing the model file time.pt (a rough sketch of such a predictor is given below). Then we perform the search as follows:
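The actual predictor lives in latency_predictor.py; the sketch below only illustrates the general idea of a small regressor that maps architecture hyper-parameters to measured latency. The 5-dimensional feature encoding and the network shape are assumptions, not the repository's implementation.

import torch.nn as nn

class LatencyPredictorSketch(nn.Module):
    """Illustrative latency regressor: architecture hyper-parameters -> latency.

    Assumed 5-d feature vector: (layer num, hidden size, qkv size,
    intermediate size, head num), normalized beforehand.
    """
    def __init__(self, feature_dim=5, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, arch_features):
        # arch_features: [batch, feature_dim] -> predicted latency [batch]
        return self.net(arch_features).squeeze(-1)

# Fit on (architecture, measured latency) pairs produced by
# inference_time_evaluation.py, then save the weights, e.g. to time.pt.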

[1] Obtain candidates 
python searcher.py --ckpt_path latency/mlm_model/time.pt \
    --latency_constraint 7 --method Candidate --model MLM \
    --candidate_file cands/mlm_7x

The candidates will be saved in ${candidate_file}, and you can set the desired ${latency_constraint}.

[2] Random Search
python searcher.py --ckpt_path latency/mlm_model/time.pt \
    --candidate_file cands/mlm_7x --latency_constraint 7 \
     --method Random --model MLM --output_file cands/1st_generation.cands

[3] Fast Search
python searcher.py --ckpt_path latency/mlm_model/time.pt \
    --candidate_file cands/mlm_7x --latency_constraint 7 \
    --method Fast --model MLM --output_file cands/1st_generation.fast.cands

[4] Evaluation of candidates
python superbert_run_en_classifier.py --data_dir "dataset/glue/MNLI dataset/SQuAD" \
 --model model/SuperBERT_MLM/ --task_name "mnli squad" --output_dir output/ \
 --do_lower_case --arches_file cands/1st_generation.fast.cands 
 
${model} is the directory of the pre-trained SuperBERT model.

[5] Evolved Search
 python searcher.py --ckpt_path latency/mlm_model/time.pt  --candidate_file cands/mlm_7x \
 --latency_constraint 7 --method Evolved --model MLM --output_file cands/1st_generation.evo.cands \
 --arch_perfs_file output/subbert.results
 
${arch_perfs_file} is the file of sub-model results produced by step [4].

For the evolutionary search, we first run [2] to generate the first generation of architectures, evaluate it with [4], and then run the evolutionary step [5] on the evaluation results to produce the next generation. Steps [4] and [5] are repeated until the maximum number of iterations is reached; a sketch of this loop is shown below.
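The following sketch simply chains the documented commands with subprocess. The per-generation file names and the fixed iteration budget are illustrative, and whether output/subbert.results accumulates results across generations depends on the scripts.

import subprocess

MAX_ITER = 3  # illustrative search budget

def run(cmd):
    subprocess.run(cmd, shell=True, check=True)

# Generation 0: random search ([2])
run("python searcher.py --ckpt_path latency/mlm_model/time.pt "
    "--candidate_file cands/mlm_7x --latency_constraint 7 "
    "--method Random --model MLM --output_file cands/gen_0.cands")

for it in range(MAX_ITER):
    # Evaluate the current generation on the proxy tasks ([4])
    run("python superbert_run_en_classifier.py --data_dir 'dataset/glue/MNLI dataset/SQuAD' "
        "--model model/SuperBERT_MLM/ --task_name 'mnli squad' --output_dir output/ "
        f"--do_lower_case --arches_file cands/gen_{it}.cands")
    # Evolve the next generation from the evaluation results ([5])
    run("python searcher.py --ckpt_path latency/mlm_model/time.pt --candidate_file cands/mlm_7x "
        "--latency_constraint 7 --method Evolved --model MLM "
        f"--output_file cands/gen_{it + 1}.cands --arch_perfs_file output/subbert.results")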

Further Train

After the search, we obtain the optimal architecture. We then extract the corresponding sub-model with submodel_extractor.py and perform further training with pre_training.py.

## Sub-model extraction
python submodel_extractor.py --model model/SuperBERT_MLM/ \
--arch "{'sample_layer_num': 5, 'sample_num_attention_heads': [8, 8, 8, 8, 8], 'sample_qkv_sizes': [512, 512, 512, 512, 512], 'sample_hidden_size': 564, 'sample_intermediate_sizes': [1054, 1054, 1054, 1054, 1054]}" \
--output extracted_model/
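The --arch string is a Python dict of the searched hyper-parameters. As a rough illustration of how such a dict relates to a Transformers-style config (given the qkv_size patch above), one could map it as follows; the actual mapping performed by submodel_extractor.py may differ, and uniform per-layer values are assumed here.

import ast
from transformers import BertConfig

arch = ast.literal_eval(
    "{'sample_layer_num': 5, 'sample_num_attention_heads': [8, 8, 8, 8, 8], "
    "'sample_qkv_sizes': [512, 512, 512, 512, 512], 'sample_hidden_size': 564, "
    "'sample_intermediate_sizes': [1054, 1054, 1054, 1054, 1054]}"
)

config = BertConfig(
    num_hidden_layers=arch["sample_layer_num"],
    num_attention_heads=arch["sample_num_attention_heads"][0],
    hidden_size=arch["sample_hidden_size"],
    intermediate_size=arch["sample_intermediate_sizes"][0],
)
config.qkv_size = arch["sample_qkv_sizes"][0]  # consumed by the patched attention/output code
print(config)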

## Further train
### For the mlm-loss setting:
python -m torch.distributed.launch \
    --nproc_per_node=$1 \
    --nnodes=$2 \
    --node_rank=$3 \
    --master_addr=$4 \
    --master_port=$5 \
    pre_training.py \
    --pregenerated_data ${train_data_dir} \
    --cache_dir ${cache_dir} \
    --epochs ${epochs} \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --train_batch_size ${train_batch_size} \
    --learning_rate ${learning_rate} \
    --max_seq_length ${max_seq_length} \
    --student_model ${student_model} \
    --masked_lm_prob 0.15 \
    --do_lower_case --fp16 --mlm_loss --further_train

${student_model} is the directory of the extracted sub-model.

The kd-loss setting uses a similar command, differing only in the kd-loss-related parameters (see the kd-loss command in the SuperPLM training section above).

Requirements

  • Latency is evaluated on Intel(R) Xeon(R) CPU E7-4850 v2 @ 2.30GHz
  • Apex for fp16 training
  • NVIDIA GPUs and NCCL

Acknowledgements

Our code is developed based on HAT and Transformers.