# AutoTinyBERT: Automatic Hyper-parameter Optimization for Efficient Pre-trained Language Models (ACL 2021)

## Overview

Pre-trained language models (PLMs) have achieved great success in natural language processing.
Most PLMs follow BERT's default setting of architecture hyper-parameters
(e.g., the hidden dimension is a quarter of the intermediate dimension in the feed-forward sub-networks).
In this paper, we adopt one-shot Neural Architecture Search (NAS) to
automatically search the architecture hyper-parameters for efficient pre-trained language models (at least 6x faster than BERT-base).
Our framework is illustrated as follows:

<img src="AutoTinyBERT_overview.PNG" width="1000" height="610"/>

For more details about the techniques of AutoTinyBERT, please refer to our paper.

## Model Zoo

We release the Model Zoo of AutoTinyBERT here. Speedup is measured against BERT-base (L12, D768).

| Version | Speedup (CPU) | SQuADv1 (dev) | GLUE (dev) | Link |
|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|
| S1 | 7.2x | 83.3 | 78.3 | [S1](https://pan.baidu.com/s/16ugFWK5D9HhYPSpvptg1yg) [b4db] |
| S2 | 15.7x | 78.1 | 76.4 | [S2](https://pan.baidu.com/s/151iOhPPAQjFM4eSu_9PcpA) [pq9i] |
| S3 | 20.2x | 75.8 | 75.3 | [S3](https://pan.baidu.com/s/1PEDbD08-AZvuoAusyNxVIA) [a52b] |
| S4 | 27.2x | 71.9 | 73.0 | [S4](https://pan.baidu.com/s/1ykqNFHLK93TJBosJX876sQ) [msen] |
| KD-S1 | 4.6x | 87.6 | 81.2 | [KD-S1](https://pan.baidu.com/s/1uj8EuED2HeH6heMKAxHv_A) [lv15] |
| KD-S2 | 9.0x | 84.6 | 77.5 | [KD-S2](https://pan.baidu.com/s/18ytClliS4IEe7t60ZD7Dew) [agob] |
| KD-S3 | 10.7x | 83.3 | 76.2 | [KD-S3](https://pan.baidu.com/s/1pGpqZ_XDMqR69HY-YS-8GQ) [9pi2] |
| KD-S4 | 17.0x | 78.7 | 73.5 | [KD-S4](https://pan.baidu.com/s/1ceJ6CvaNrXXlrIt4lF6QSg) [l9lc] |

## Use in Transformers
Our released code can directly load the pre-trained models, and the models can also be used
in [Huggingface Transformers](https://github.com/huggingface/transformers) with the small modifications shown below:

```python
import torch.nn as nn

class BertSelfAttention(nn.Module):
    def __init__(self, config):
        ### Before modifications:
        # self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
        ### After modifications: fall back to hidden_size when the config
        ### defines no separate qkv_size (as in standard BERT configs).
        qkv_size = getattr(config, "qkv_size", config.hidden_size)
        self.attention_head_size = int(qkv_size / config.num_attention_heads)

class BertSelfOutput(nn.Module):
    def __init__(self, config):
        ### Before modifications:
        # self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        ### After modifications: project the (possibly narrower) attention
        ### output of width qkv_size back to hidden_size.
        qkv_size = getattr(config, "qkv_size", config.hidden_size)
        self.dense = nn.Linear(qkv_size, config.hidden_size)
```
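
As a quick sanity check, the fallback behaves as follows on a stock `BertConfig` (a minimal sketch; the `qkv_size` value is illustrative):

```python
from transformers import BertConfig

config = BertConfig()  # standard BERT config, no qkv_size field
print(getattr(config, "qkv_size", config.hidden_size))  # -> 768 (falls back)

config.qkv_size = 512  # AutoTinyBERT-style config with a separate QKV width
print(getattr(config, "qkv_size", config.hidden_size))  # -> 512
```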

## Train
### Generate Data
We first generate the training data with `generate_data.py` for pre-training or knowledge distillation.

```
python generate_data.py --train_corpus ${wiki_book_corpus} --bert_model ${bert_base} --output_dir ${train_data_dir} \
                        --do_lower_case --reduce_memory
```

`${wiki_book_corpus}` is the raw corpus: each line is a sentence, and documents are separated by a blank line.
`${bert_base}` is the directory of BERT-base; here we only use its vocabulary file.
`${train_data_dir}` is the directory for the generated data.
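
For reference, a minimal sketch of the expected corpus layout (contents illustrative):

```
The first sentence of document one.
The second sentence of document one.

The first sentence of document two.
Another sentence of document two.
```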

### Train SuperPLM
Then we use `pre_training.py` to train a SuperPLM with the mlm-loss or the kd-loss.

```
### For the mlm-loss setting:
python -m torch.distributed.launch \
       --nproc_per_node=$1 \
       --nnodes=$2 \
       --node_rank=$3 \
       --master_addr=$4 \
       --master_port=$5 \
       pre_training.py \
       --pregenerated_data ${train_data_dir} \
       --cache_dir ${cache_dir} \
       --epochs ${epochs} \
       --gradient_accumulation_steps ${gradient_accumulation_steps} \
       --train_batch_size ${train_batch_size} \
       --learning_rate ${learning_rate} \
       --max_seq_length ${max_seq_length} \
       --student_model ${student_model} \
       --masked_lm_prob 0.15 \
       --do_lower_case --fp16 --scratch --mlm_loss

### For the kd-loss setting:
python -m torch.distributed.launch \
       --nproc_per_node=$1 \
       --nnodes=$2 \
       --node_rank=$3 \
       --master_addr=$4 \
       --master_port=$5 \
       pre_training.py \
       --pregenerated_data ${train_data_dir} \
       --cache_dir ${cache_dir} \
       --epochs ${epochs} \
       --gradient_accumulation_steps ${gradient_accumulation_steps} \
       --train_batch_size ${train_batch_size} \
       --learning_rate ${learning_rate} \
       --max_seq_length ${max_seq_length} \
       --student_model ${student_model} \
       --teacher_model ${teacher_model} \
       --masked_lm_prob 0 \
       --do_lower_case --fp16 --scratch
```

`${train_data_dir}` is the directory of the dataset generated by `generate_data.py`.
`${student_model}` refers to the directory of the SuperPLM.
`${teacher_model}` is the directory of the teacher model; we use ELECTRA-base in our paper.
`${cache_dir}` is the output directory.
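
For example, a single-node run on 8 GPUs would fill the launcher's positional arguments as below (all paths and hyper-parameter values are illustrative, not the paper's settings):

```
python -m torch.distributed.launch \
       --nproc_per_node=8 --nnodes=1 --node_rank=0 \
       --master_addr=127.0.0.1 --master_port=29500 \
       pre_training.py \
       --pregenerated_data data/train_data/ --cache_dir output/superplm/ \
       --epochs 10 --gradient_accumulation_steps 1 --train_batch_size 256 \
       --learning_rate 1e-4 --max_seq_length 128 \
       --student_model models/superplm_init/ \
       --masked_lm_prob 0.15 --do_lower_case --fp16 --scratch --mlm_loss
```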

### Random|Fast|Evolved Search
We first build the latency predictor Lat(\*) with `inference_time_evaluation.py` and `latency_predictor.py`. The first
script generates the training dataset, and the second builds the Lat(\*) predictor on that dataset.
These scripts produce the model file `time.pt` of Lat(\*); a minimal sketch of such a predictor follows.
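
The repository's exact predictor architecture may differ; this sketch assumes a small MLP regressor over four architecture features:

```python
import torch
import torch.nn as nn

class LatencyPredictor(nn.Module):
    """Maps architecture hyper-parameters to predicted CPU latency."""
    def __init__(self, feature_dim=4, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x):
        # x: [batch, 4] = (layer_num, hidden_size, qkv_size, intermediate_size)
        return self.net(x).squeeze(-1)

# Fit on (architecture, measured latency) pairs such as those produced by
# inference_time_evaluation.py; the values below are illustrative.
predictor = LatencyPredictor()
optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-3)
features = torch.tensor([[5.0, 564.0, 512.0, 1054.0]])
latency = torch.tensor([10.0])  # e.g., milliseconds
loss = nn.functional.mse_loss(predictor(features), latency)
loss.backward()
optimizer.step()
```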

Then, we perform the search as follows:

```
[1] Obtain candidates
python searcher.py --ckpt_path latency/mlm_model/time.pt \
                   --latency_constraint 7 --method Candidate --model MLM \
                   --candidate_file cands/mlm_7x

[2] Random Search
python searcher.py --ckpt_path latency/mlm_model/time.pt \
                   --candidate_file cands/mlm_7x --latency_constraint 7 \
                   --method Random --model MLM --output_file cands/1st_generation.cands

[3] Fast Search
python searcher.py --ckpt_path latency/mlm_model/time.pt \
                   --candidate_file cands/mlm_7x --latency_constraint 7 \
                   --method Fast --model MLM --output_file cands/1st_generation.fast.cands

[4] Evaluation of candidates
python superbert_run_en_classifier.py --data_dir "dataset/glue/MNLI dataset/SQuAD" \
                   --model model/SuperBERT_MLM/ --task_name "mnli squad" --output_dir output/ \
                   --do_lower_case --arches_file cands/1st_generation.fast.cands

[5] Evolved Search
python searcher.py --ckpt_path latency/mlm_model/time.pt --candidate_file cands/mlm_7x \
                   --latency_constraint 7 --method Evolved --model MLM \
                   --output_file cands/1st_generation.evo.cands \
                   --arch_perfs_file output/subbert.results
```

The candidates from [1] are saved in `${candidate_file}`, and you can set a specific `${latency_constraint}`.
`${model}` in [4] is the directory of the pre-trained SuperBERT model.
`${arch_perfs_file}` in [5] holds the results of the sub-models evaluated in [4].

For the evolutionary search, we first perform [2] to generate the first generation of architectures, then evaluate it with [4],
and run the evolutionary algorithm [5] on the evaluation results to generate the next generation. We iteratively
perform steps [4] and [5] until the maximum number of iterations is reached, as sketched below.
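
A minimal sketch of how this loop could be scripted (the iteration count and generation file names are assumptions, not part of the repository):

```python
import subprocess

MAX_ITER = 5  # assumed stopping point
generation = "cands/1st_generation.cands"  # produced by the Random search in [2]

for it in range(MAX_ITER):
    # [4] Evaluate the current generation of candidate architectures.
    subprocess.run(
        ["python", "superbert_run_en_classifier.py",
         "--data_dir", "dataset/glue/MNLI dataset/SQuAD",
         "--model", "model/SuperBERT_MLM/", "--task_name", "mnli squad",
         "--output_dir", "output/", "--do_lower_case",
         "--arches_file", generation],
        check=True,
    )
    # [5] Evolve the next generation from the measured performances.
    next_generation = f"cands/generation_{it + 2}.evo.cands"
    subprocess.run(
        ["python", "searcher.py", "--ckpt_path", "latency/mlm_model/time.pt",
         "--candidate_file", "cands/mlm_7x", "--latency_constraint", "7",
         "--method", "Evolved", "--model", "MLM",
         "--output_file", next_generation,
         "--arch_perfs_file", "output/subbert.results"],
        check=True,
    )
    generation = next_generation
```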

### Further Train
After the search, we obtain the optimal architecture. Then we extract the corresponding sub-model
with `submodel_extractor.py` and perform further training with `pre_training.py`.

```
## Sub-model extraction
python submodel_extractor.py --model model/SuperBERT_MLM/ \
       --arch "{'sample_layer_num': 5, 'sample_num_attention_heads': [8, 8, 8, 8, 8], 'sample_qkv_sizes': [512, 512, 512, 512, 512], 'sample_hidden_size': 564, 'sample_intermediate_sizes': [1054, 1054, 1054, 1054, 1054]}" \
       --output extracted_model/

## Further train
### For the mlm-loss setting:
python -m torch.distributed.launch \
       --nproc_per_node=$1 \
       --nnodes=$2 \
       --node_rank=$3 \
       --master_addr=$4 \
       --master_port=$5 \
       pre_training.py \
       --pregenerated_data ${train_data_dir} \
       --cache_dir ${cache_dir} \
       --epochs ${epochs} \
       --gradient_accumulation_steps ${gradient_accumulation_steps} \
       --train_batch_size ${train_batch_size} \
       --learning_rate ${learning_rate} \
       --max_seq_length ${max_seq_length} \
       --student_model ${student_model} \
       --masked_lm_prob 0.15 \
       --do_lower_case --fp16 --mlm_loss --further_train
```

`${student_model}` here is the extracted sub-model.
The kd-loss setting uses a similar command, except for the `kd_loss` parameter.
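
The `--arch` string is a Python-literal dict; a quick way to inspect it (illustrative, not part of the repository's tooling):

```python
import ast

arch = ast.literal_eval(
    "{'sample_layer_num': 5, 'sample_num_attention_heads': [8, 8, 8, 8, 8], "
    "'sample_qkv_sizes': [512, 512, 512, 512, 512], 'sample_hidden_size': 564, "
    "'sample_intermediate_sizes': [1054, 1054, 1054, 1054, 1054]}"
)
print(arch["sample_layer_num"], arch["sample_hidden_size"])  # 5 564
print(arch["sample_qkv_sizes"][0])                           # per-layer QKV width: 512
```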

## Requirements
* Latency is evaluated on an Intel(R) Xeon(R) CPU E7-4850 v2 @ 2.30GHz
* Apex for fp16 training
* NVIDIA GPUs and [NCCL](https://github.com/NVIDIA/nccl)

## Acknowledgements
Our code is developed based on [HAT](https://github.com/pytorch/fairseq) and
[Transformers](https://github.com/huggingface/transformers).