Merge pull request huawei-noah#134 from zwjyyc/master
add AutoTinyBERT
jxfeb committed Jul 27, 2021
2 parents f69e7ff + cc8c2c7 commit 2d6bfa7
Showing 18 changed files with 8,574 additions and 1 deletion.
Binary file added AutoTinyBERT/AutoTinyBERT_overview.PNG
215 changes: 215 additions & 0 deletions AutoTinyBERT/README.md
# AutoTinyBERT: Automatic Hyper-parameter Optimization for Efficient Pre-trained Language Models (ACL 2021)

## Overview

Pre-trained language models (PLMs) have achieved great success in natural language processing.
Most PLMs follow BERT's default setting of architecture hyper-parameters
(e.g., the hidden dimension is a quarter of the intermediate dimension in the feed-forward sub-networks).
In this paper, we adopt one-shot Neural Architecture Search (NAS) to
automatically search the architecture hyper-parameters for efficient pre-trained language models (at least 6x faster than BERT-base).
Our framework is illustrated below:

<img src="AutoTinyBERT_overview.PNG" width="1000" height="610"/>

For more details about the techniques of AutoTinyBERT, please refer to our paper.



## Model Zoo

We release the Model Zoo of AutoTinyBERT here. Speedup is measured on CPU relative to BERT-base (L12, D768).

| Version | Speedup (CPU) | SQuADv1 (dev) | GLUE (dev) | Link |
|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|
| S1 | 7.2x | 83.3 | 78.3 | [S1](https://pan.baidu.com/s/16ugFWK5D9HhYPSpvptg1yg)[b4db] |
| S2 | 15.7x| 78.1 | 76.4 | [S2](https://pan.baidu.com/s/151iOhPPAQjFM4eSu_9PcpA)[pq9i] |
| S3 | 20.2x| 75.8 | 75.3 | [S3](https://pan.baidu.com/s/1PEDbD08-AZvuoAusyNxVIA)[a52b] |
| S4 | 27.2x| 71.9 | 73.0 | [S4](https://pan.baidu.com/s/1ykqNFHLK93TJBosJX876sQ)[msen] |
| KD-S1 | 4.6x | 87.6 | 81.2 | [KD-S1](https://pan.baidu.com/s/1uj8EuED2HeH6heMKAxHv_A)[lv15] |
| KD-S2 | 9.0x | 84.6 | 77.5 | [KD-S2](https://pan.baidu.com/s/18ytClliS4IEe7t60ZD7Dew)[agob] |
| KD-S3 | 10.7x| 83.3 | 76.2 | [KD-S3](https://pan.baidu.com/s/1pGpqZ_XDMqR69HY-YS-8GQ)[9pi2] |
| KD-S4 | 17.0x| 78.7 | 73.5 | [KD-S4](https://pan.baidu.com/s/1ceJ6CvaNrXXlrIt4lF6QSg)[l9lc] |



## Use in Transformers
Our released code can load the pre-trained models directly. The models can also be used
in [Huggingface Transformers](https://github.com/huggingface/transformers) with small modifications, as shown below:

```
import torch.nn as nn

class BertSelfAttention(nn.Module):
    def __init__(self, config):
        ### Before the modification:
        # self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
        ### After the modification (fall back to hidden_size when the config has no qkv_size):
        try:
            qkv_size = config.qkv_size
        except AttributeError:
            qkv_size = config.hidden_size
        self.attention_head_size = int(qkv_size / config.num_attention_heads)

class BertSelfOutput(nn.Module):
    def __init__(self, config):
        ### Before the modification:
        # self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        ### After the modification (the output projection maps qkv_size back to hidden_size):
        try:
            qkv_size = config.qkv_size
        except AttributeError:
            qkv_size = config.hidden_size
        self.dense = nn.Linear(qkv_size, config.hidden_size)
```
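
With that change in place, a released checkpoint can be loaded through the usual Transformers API. A minimal sketch, assuming a hypothetical local checkpoint directory and that the released config carries a `qkv_size` field:

```
from transformers import BertConfig, BertModel, BertTokenizer

model_dir = "./AutoTinyBERT-KD-S1"  # hypothetical local path to a downloaded checkpoint

config = BertConfig.from_pretrained(model_dir)
# With the modification above, a config without qkv_size simply falls back to hidden_size.
print("qkv_size:", getattr(config, "qkv_size", config.hidden_size))

tokenizer = BertTokenizer.from_pretrained(model_dir)
model = BertModel.from_pretrained(model_dir, config=config)

inputs = tokenizer("AutoTinyBERT searches architecture hyper-parameters.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, hidden_size) in recent Transformers versions
```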



## Train
### Generate Data
We first generate the training data with `generate_data.py` for pre-training or knowledge distillation.

```
python generate_data.py --train_corpus ${wiki_book_corpus} --bert_model ${bert_base} --output_dir ${train_data_dir} \
                        --do_lower_case --reduce_memory
```

* `${wiki_book_corpus}` is the raw corpus: each line is a sentence, and documents are separated by a blank line.
* `${bert_base}` is the directory of BERT-base; only its vocabulary file is used here.
* `${train_data_dir}` (passed to `--output_dir`) is the directory where the generated data is written.
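
For reference, a few illustrative lines of a corpus in the expected layout (the sentences are made up):

```
The quick brown fox jumps over the lazy dog.
This sentence belongs to the same document.

A blank line above marks the start of a new document.
Each document again lists one sentence per line.
```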

### Train SuperPLM
Then we use `pre_training.py` to train a SuperPLM with either the MLM loss or the KD (knowledge distillation) loss.
```
### For the mlm-loss setting:
python -m torch.distributed.launch \
       --nproc_per_node=$1 \
       --nnodes=$2 \
       --node_rank=$3 \
       --master_addr=$4 \
       --master_port=$5 \
       pre_training.py \
       --pregenerated_data ${train_data_dir} \
       --cache_dir ${cache_dir} \
       --epochs ${epochs} \
       --gradient_accumulation_steps ${gradient_accumulation_steps} \
       --train_batch_size ${train_batch_size} \
       --learning_rate ${learning_rate} \
       --max_seq_length ${max_seq_length} \
       --student_model ${student_model} \
       --masked_lm_prob 0.15 \
       --do_lower_case --fp16 --scratch --mlm_loss

### For the kd-loss setting:
python -m torch.distributed.launch \
       --nproc_per_node=$1 \
       --nnodes=$2 \
       --node_rank=$3 \
       --master_addr=$4 \
       --master_port=$5 \
       pre_training.py \
       --pregenerated_data ${train_data_dir} \
       --cache_dir ${cache_dir} \
       --epochs ${epochs} \
       --gradient_accumulation_steps ${gradient_accumulation_steps} \
       --train_batch_size ${train_batch_size} \
       --learning_rate ${learning_rate} \
       --max_seq_length ${max_seq_length} \
       --student_model ${student_model} \
       --teacher_model ${teacher_model} \
       --masked_lm_prob 0 \
       --do_lower_case --fp16 --scratch
```

* `${train_data_dir}` is the directory of the dataset generated by `generate_data.py`.
* `${student_model}` is the directory of the SuperPLM.
* `${teacher_model}` is the directory of the teacher model; we use ELECTRA-base in our paper.
* `${cache_dir}` is the output directory.
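
The two settings differ only in the training objective. The exact losses are implemented in `pre_training.py`; the snippet below is only a simplified, illustrative sketch of the two kinds of objective (the layer-matching and weighting details used in the repository are not shown, and the function names are made up):

```
import torch
import torch.nn.functional as F

def mlm_loss(student_logits, masked_lm_labels, ignore_index=-1):
    # Standard masked-LM cross-entropy; un-masked positions carry the ignore index.
    vocab_size = student_logits.size(-1)
    return F.cross_entropy(student_logits.view(-1, vocab_size),
                           masked_lm_labels.view(-1), ignore_index=ignore_index)

def kd_loss(student_hidden, teacher_hidden):
    # One common form of distillation: match the student's hidden states to the teacher's
    # (a linear projection is needed when the two widths differ).
    return F.mse_loss(student_hidden, teacher_hidden)
```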


### Random|Fast|Evolved Search
We first build the latency predictor Lat(\*) with `inference_time_evaluation.py` and `latency_predictor.py`. The first
script generates the latency dataset, and the second trains the Lat(\*) predictor on that dataset.
These scripts produce the model file `time.pt` for Lat(\*); a small illustrative sketch of such a predictor appears after the command listing below. Then, we perform the search as follows:

```
## [1] Obtain candidates
python searcher.py --ckpt_path latency/mlm_model/time.pt \
       --latency_constraint 7 --method Candidate --model MLM \
       --candidate_file cands/mlm_7x
## The candidates are saved in ${candidate_file}; set ${latency_constraint} as needed.

## [2] Random Search
python searcher.py --ckpt_path latency/mlm_model/time.pt \
       --candidate_file cands/mlm_7x --latency_constraint 7 \
       --method Random --model MLM --output_file cands/1st_generation.cands

## [3] Fast Search
python searcher.py --ckpt_path latency/mlm_model/time.pt \
       --candidate_file cands/mlm_7x --latency_constraint 7 \
       --method Fast --model MLM --output_file cands/1st_generation.fast.cands

## [4] Evaluation of candidates
python superbert_run_en_classifier.py --data_dir "dataset/glue/MNLI dataset/SQuAD" \
       --model model/SuperBERT_MLM/ --task_name "mnli squad" --output_dir output/ \
       --do_lower_case --arches_file cands/1st_generation.fast.cands
## ${model} is the directory of the pre-trained SuperBERT model.

## [5] Evolved Search
python searcher.py --ckpt_path latency/mlm_model/time.pt --candidate_file cands/mlm_7x \
       --latency_constraint 7 --method Evolved --model MLM --output_file cands/1st_generation.evo.cands \
       --arch_perfs_file output/subbert.results
## ${arch_perfs_file} is the file of sub-model results generated by step [4].
```
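
For reference, Lat(\*) built above is a small regression model over the architecture hyper-parameters. A minimal sketch of what such a predictor might look like (the feature encoding and network size are illustrative assumptions; the actual definition lives in `latency_predictor.py`):

```
import torch
import torch.nn as nn

class LatencyPredictor(nn.Module):
    """Maps an architecture encoding to a predicted CPU latency (illustrative)."""
    def __init__(self, feature_dim=4, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, arch_features):
        return self.net(arch_features).squeeze(-1)

# Example encoding: [layer_num, hidden_size, qkv_size, intermediate_size]
arch = torch.tensor([[5.0, 564.0, 512.0, 1054.0]])
predictor = LatencyPredictor()
print(predictor(arch))  # an (untrained) latency estimate
```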

For the evolutionary search, we first run [2] to generate the first generation of architectures and evaluate it with [4];
we then run the evolutionary algorithm [5] on the evaluation results to produce the next generation. We iterate over
[4] and [5] until the maximum number of iterations is reached.
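
Putting the steps together, the search is an iterative loop. The toy sketch below only illustrates the control flow: `evaluate` and `mutate` are stand-ins for steps [4] and [5], and the search space shown is far smaller than the real one:

```
import random

def evaluate(arch):
    # Toy stand-in for step [4]; in practice superbert_run_en_classifier.py scores each sub-model on MNLI/SQuAD.
    return -abs(arch["sample_hidden_size"] - 564) - 10 * abs(arch["sample_layer_num"] - 5)

def mutate(arch):
    # Toy stand-in for the evolutionary operator used in step [5].
    child = dict(arch)
    child["sample_hidden_size"] = max(128, child["sample_hidden_size"] + random.choice([-32, 0, 32]))
    return child

# Step [2]: a random first generation drawn from a (toy) search space.
generation = [{"sample_layer_num": random.randint(4, 6),
               "sample_hidden_size": random.choice([448, 512, 564, 640])}
              for _ in range(8)]

# Iterate steps [4] and [5] until the iteration budget is exhausted.
for _ in range(3):
    ranked = sorted(generation, key=evaluate, reverse=True)
    parents = ranked[:4]                                   # keep the best architectures
    generation = parents + [mutate(random.choice(parents)) for _ in range(4)]

print(sorted(generation, key=evaluate, reverse=True)[0])   # best architecture found by the toy loop
```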

### Further Train
After the search, we obtain the optimal architecture. We then extract the corresponding sub-model
with `submodel_extractor.py` and further train it with `pre_training.py`.

```
## Sub-model extraction
python submodel_extractor.py --model model/SuperBERT_MLM/ \
       --arch "{'sample_layer_num': 5, 'sample_num_attention_heads': [8, 8, 8, 8, 8], 'sample_qkv_sizes': [512, 512, 512, 512, 512], 'sample_hidden_size': 564, 'sample_intermediate_sizes': [1054, 1054, 1054, 1054, 1054]}" \
       --output extracted_model/

## Further train
### For the mlm-loss setting:
python -m torch.distributed.launch \
       --nproc_per_node=$1 \
       --nnodes=$2 \
       --node_rank=$3 \
       --master_addr=$4 \
       --master_port=$5 \
       pre_training.py \
       --pregenerated_data ${train_data_dir} \
       --cache_dir ${cache_dir} \
       --epochs ${epochs} \
       --gradient_accumulation_steps ${gradient_accumulation_steps} \
       --train_batch_size ${train_batch_size} \
       --learning_rate ${learning_rate} \
       --max_seq_length ${max_seq_length} \
       --student_model ${student_model} \
       --masked_lm_prob 0.15 \
       --do_lower_case --fp16 --mlm_loss --further_train
```

* `${student_model}` here is the extracted sub-model.
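
The `--arch` dictionary describes the searched hyper-parameters. A hedged sketch of how such a dictionary could be turned into a Transformers-style config for the extracted sub-model (the exact mapping used by `submodel_extractor.py` may differ; here all layers share the same qkv and intermediate sizes):

```
from transformers import BertConfig

arch = {'sample_layer_num': 5,
        'sample_num_attention_heads': [8, 8, 8, 8, 8],
        'sample_qkv_sizes': [512, 512, 512, 512, 512],
        'sample_hidden_size': 564,
        'sample_intermediate_sizes': [1054, 1054, 1054, 1054, 1054]}

# Illustrative mapping from the searched hyper-parameters to a config.
config = BertConfig(
    num_hidden_layers=arch['sample_layer_num'],
    num_attention_heads=arch['sample_num_attention_heads'][0],
    hidden_size=arch['sample_hidden_size'],
    intermediate_size=arch['sample_intermediate_sizes'][0],
)
config.qkv_size = arch['sample_qkv_sizes'][0]  # the extra field used in "Use in Transformers"
print(config)
```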
The kd-loss setting uses a similar command, differing only in the kd-loss-related arguments (see the SuperPLM training commands above).

## Requirements
* Latency is evaluated on an Intel(R) Xeon(R) CPU E7-4850 v2 @ 2.30GHz
* Apex for fp16 training
* NVIDIA GPUs and [NCCL](https://github.com/NVIDIA/nccl)


## Acknowledgements
Our code builds on [HAT](https://github.com/pytorch/fairseq) and
[Transformers](https://github.com/huggingface/transformers).