# AutoTinyBERT: Automatic Hyper-parameter Optimization for Efficient Pre-trained Language Models (ACL 2021)
Pre-trained language models (PLMs) have achieved great success in natural language processing. Most PLMs follow the default architecture hyper-parameter settings of BERT (e.g., the hidden dimension is a quarter of the intermediate dimension in the feed-forward sub-networks). In this paper, we adopt one-shot Neural Architecture Search (NAS) to automatically search architecture hyper-parameters for efficient pre-trained language models (at least 6x faster than BERT-base). Our framework is illustrated as follows:
For more details about the techniques of AutoTinyBERT, please refer to our paper:
We release the Model Zoo of AutoTinyBERT here. Speedup is measured against BERT-base (L12 D768).
Version | Speedup (CPU) | SQuADv1 (dev) | GLUE (dev) | Link |
---|---|---|---|---|
S1 | 7.2x | 83.3 | 78.3 | S1[b4db] |
S2 | 15.7x | 78.1 | 76.4 | S2[pq9i] |
S3 | 20.2x | 75.8 | 75.3 | S3[a52b] |
S4 | 27.2x | 71.9 | 73.0 | S4[msen] |
KD-S1 | 4.6x | 87.6 | 81.2 | KD-S1[lv15] |
KD-S2 | 9.0x | 84.6 | 77.5 | KD-S2[agob] |
KD-S3 | 10.7x | 83.3 | 76.2 | KD-S3[9pi2] |
KD-S4 | 17.0x | 78.7 | 73.5 | KD-S4[l9lc] |
Our released code can directly load the pre-trained models; the models can also be used in Hugging Face Transformers with the following small modifications:
class BertSelfAttention(nn.Module):
    def __init__(self, config):
        ### Before modifications:
        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
        ### After modifications:
        try:
            qkv_size = config.qkv_size
        except AttributeError:
            # vanilla BERT configs have no qkv_size; fall back to hidden_size
            qkv_size = config.hidden_size
        self.attention_head_size = int(qkv_size / config.num_attention_heads)

class BertSelfOutput(nn.Module):
    def __init__(self, config):
        ### Before modifications:
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        ### After modifications:
        try:
            qkv_size = config.qkv_size
        except AttributeError:
            qkv_size = config.hidden_size
        self.dense = nn.Linear(qkv_size, config.hidden_size)
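The try/except above is equivalent to a getattr with a default. The following standalone sketch (toy config objects and illustrative values, not the actual Transformers classes) shows how qkv_size is resolved for an AutoTinyBERT-style config and how a vanilla BERT config falls back to hidden_size:

```python
from types import SimpleNamespace

# Toy stand-ins for config objects; the values below are illustrative only.
autotiny_cfg = SimpleNamespace(hidden_size=564, num_attention_heads=8, qkv_size=512)
vanilla_cfg = SimpleNamespace(hidden_size=768, num_attention_heads=12)

def resolve_qkv_size(config):
    # Same fallback as the try/except above, written with getattr.
    return getattr(config, "qkv_size", config.hidden_size)

print(resolve_qkv_size(autotiny_cfg) // autotiny_cfg.num_attention_heads)  # head size 64
print(resolve_qkv_size(vanilla_cfg) // vanilla_cfg.num_attention_heads)    # head size 64 (fallback to hidden_size)
```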
We first generate the training data with generate_data.py for pre-training or knowledge distillation.
python generate_data.py --train_corpus ${wiki_book_corpus} --bert_model ${bert_base} --output_dir ${train_data_dir} \
--do_lower_case --reduce_memory
${wiki_book_corpus} is the raw corpus: each line is a sentence, and documents are separated by a blank line.
${bert_base} is the directory of BERT-base; only its vocabulary file is used here.
${train_data_dir} is the output directory for the generated data.
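For example (made-up sentences), the raw corpus file is expected to look like this, with one sentence per line and a blank line between documents:

```
This is the first sentence of document one.
This is the second sentence of document one.

Document two starts after the blank line.
It also has one sentence per line.
```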
Then we use pre_training.py to train a SuperPLM with the mlm-loss or the kd-loss.
### For the mlm-loss setting:
python -m torch.distributed.launch \
--nproc_per_node=$1 \
--nnodes=$2 \
--node_rank=$3 \
--master_addr=$4 \
--master_port=$5 \
pre_training.py \
--pregenerated_data ${train_data_dir} \
--cache_dir ${cache_dir} \
--epochs ${epochs} \
--gradient_accumulation_steps ${gradient_accumulation_steps} \
--train_batch_size ${train_batch_size} \
--learning_rate ${learning_rate} \
--max_seq_length ${max_seq_length} \
--student_model ${student_model} \
--masked_lm_prob 0.15 \
--do_lower_case --fp16 --scratch --mlm_loss
### For the kd-loss setting:
python -m torch.distributed.launch \
--nproc_per_node=$1 \
--nnodes=$2 \
--node_rank=$3 \
--master_addr=$4 \
--master_port=$5 \
pre_training.py \
--pregenerated_data ${train_data_dir} \
--cache_dir ${cache_dir} \
--epochs ${epochs} \
--gradient_accumulation_steps ${gradient_accumulation_steps} \
--train_batch_size ${train_batch_size} \
--learning_rate ${learning_rate} \
--max_seq_length ${max_seq_length} \
--student_model ${student_model} \
--teacher_model ${teacher_model} \
--masked_lm_prob 0 \
--do_lower_case --fp16 --scratch
${train_data_dir} is the directory of the dataset generated by generate_data.py.
${student_model} refers to the directory of the SuperPLM.
${teacher_model} is the directory of the teacher model; we use ELECTRA-base in our paper.
${cache_dir} is the output directory.
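As a side note, the effective global batch size is the product of the per-GPU batch size, the gradient accumulation steps, and the number of processes. The sketch below assumes --train_batch_size is the per-GPU micro-batch size; pre_training.py may apply a different convention (e.g., dividing by gradient_accumulation_steps internally), so please check the script:

```python
# Illustrative numbers only; the real values come from the launch arguments above.
nproc_per_node = 8              # $1: GPUs per node
nnodes = 1                      # $2: number of nodes
train_batch_size = 16           # assumed per-GPU micro-batch size
gradient_accumulation_steps = 4

world_size = nproc_per_node * nnodes
effective_batch_size = train_batch_size * gradient_accumulation_steps * world_size
print(effective_batch_size)     # 512
```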
We first build the latency predictor Lat(*) with inference_time_evaluation.py and latency_predictor.py. The first script generates the latency dataset, and the second trains the Lat(*) predictor on that dataset. Through these scripts, we obtain the model file time.pt of Lat(*).
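Conceptually, Lat(*) is a small regressor that maps architecture hyper-parameters to measured CPU latency. The sketch below only illustrates this idea; the feature layout, layer sizes, and class name are assumptions and do not reflect the actual latency_predictor.py implementation:

```python
import torch
import torch.nn as nn

class LatencyPredictor(nn.Module):
    """Toy MLP that maps architecture hyper-parameters to a latency estimate."""

    def __init__(self, feature_dim=5, hidden_dim=400):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, features):
        # features: [layer_num, hidden_size, qkv_size, intermediate_size, head_num]
        return self.net(features)

predictor = LatencyPredictor()
arch = torch.tensor([[5.0, 564.0, 512.0, 1054.0, 8.0]])  # illustrative architecture
print(predictor(arch).item())  # untrained here, so the value is meaningless
```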
Then, we do the search as follows:
[1] Obtain candidates
python searcher.py --ckpt_path latency/mlm_model/time.pt \
--latency_constraint 7 --method Candidate --model MLM \
--candidate_file cands/mlm_7x
The candidates will be saved in ${candidate_file}, and you can set a specific ${latency_constraint}.
[2] Random Search
python searcher.py --ckpt_path latency/mlm_model/time.pt \
--candidate_file cands/mlm_7x --latency_constraint 7 \
--method Random --model MLM --output_file cands/1st_generation.cands
[3] Fast Search
python searcher.py --ckpt_path latency/mlm_model/time.pt \
--candidate_file cands/mlm_7x --latency_constraint 7 \
--method Fast --model MLM --output_file cands/1st_generation.fast.cands
[4] Evaluation of candidates
python superbert_run_en_classifier.py --data_dir "dataset/glue/MNLI dataset/SQuAD" \
--model model/SuperBERT_MLM/ --task_name "mnli squad" --output_dir output/ \
--do_lower_case --arches_file cands/1st_generation.fast.cands
--model refers to the directory of the pre-trained SuperBERT model.
[5] Evolved Search
python searcher.py --ckpt_path latency/mlm_model/time.pt --candidate_file cands/mlm_7x \
--latency_constraint 7 --method Evolved --model MLM --output_file cands/1st_generation.evo.cands \
--arch_perfs_file output/subbert.results
--arch_perfs_file refers to the sub-model evaluation results generated in step [4].
For the evolutionary search, we first perform [2] to generate the first generation of architectures, then evaluate them with [4], and run the evolutionary algorithm [5] on the evaluation results to generate the next generation. We iteratively perform [4] and [5] until the maximum number of iterations is reached.
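The overall loop can be summarized by the following sketch, where evaluate_candidates and evolve are hypothetical placeholders for steps [4] and [5] (in practice those steps are run via the commands above):

```python
def iterative_search(first_generation, max_iterations, evaluate_candidates, evolve):
    """Alternate evaluation [4] and evolved search [5] until the iteration budget is used."""
    generation = first_generation      # produced by the Random search, step [2]
    history = {}                       # architecture -> dev performance
    for _ in range(max_iterations):
        history.update(evaluate_candidates(generation))  # step [4]
        generation = evolve(history)                     # step [5]
    return max(history, key=history.get)                 # best architecture found
```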
After the search, we obtain the optimal architecture. Then we extract the corresponding sub-model with submodel_extractor.py and perform further training with pre_training.py.
## Sub-model extraction
python submodel_extractor.py --model model/SuperBERT_MLM/ \
--arch "{'sample_layer_num': 5, 'sample_num_attention_heads': [8, 8, 8, 8, 8], 'sample_qkv_sizes': [512, 512, 512, 512, 512], 'sample_hidden_size': 564, 'sample_intermediate_sizes': [1054, 1054, 1054, 1054, 1054]}" \
--output extracted_model/
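Since the --arch value is a Python-dict-style string (as the quoting in the command above suggests), it can also be built programmatically, e.g.:

```python
# Build the --arch string for submodel_extractor.py; keys follow the command above.
arch = {
    "sample_layer_num": 5,
    "sample_num_attention_heads": [8] * 5,
    "sample_qkv_sizes": [512] * 5,
    "sample_hidden_size": 564,
    "sample_intermediate_sizes": [1054] * 5,
}
print(str(arch))  # pass this string as the --arch argument
```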
## Further train
### For the mlm-loss setting:
python -m torch.distributed.launch \
--nproc_per_node=$1 \
--nnodes=$2 \
--node_rank=$3 \
--master_addr=$4 \
--master_port=$5 \
pre_training.py \
--pregenerated_data ${train_data_dir} \
--cache_dir ${cache_dir} \
--epochs ${epochs} \
--gradient_accumulation_steps ${gradient_accumulation_steps} \
--train_batch_size ${train_batch_size} \
--learning_rate ${learning_rate} \
--max_seq_length ${max_seq_length} \
--student_model ${student_model} \
--masked_lm_prob 0.15 \
--do_lower_case --fp16 --mlm_loss --further_train
${student_model} here refers to the extracted sub-model.
The kd-loss setting uses a similar command, differing only in the kd-loss-related arguments.
- Latency is evaluated on an Intel(R) Xeon(R) CPU E7-4850 v2 @ 2.30GHz.
- Apex is required for fp16 training.
- NVIDIA GPUs and NCCL are required for distributed training.
Our code is developed based on HAT and Transformers.