[BugFix] Fix model zoo relative import (PaddlePaddle#2130)
* fix as relative import

* add soft link

* refine

* copy data tools to model

* fix gpt-3

* refine gpt readme.

* delete unused dataset utils code

* delete some code.

* softlink

* move data_tools to ernie.

* fix
ZHUI committed May 13, 2022
1 parent 6de32e5 commit 361d91c
Showing 28 changed files with 436 additions and 25 deletions.
44 changes: 43 additions & 1 deletion examples/language_model/gpt-3/README.md
@@ -22,11 +22,53 @@ GPT-[3](https://arxiv.org/pdf/2005.14165.pdf) is built on [Transformer](https://arxiv.
Note: PaddlePaddle version 2.2rc or later is required, or use the latest develop version; see the Paddle [website](https://www.paddlepaddle.org.cn) for installation instructions.
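A simplified version check can sketch the requirement above. This helper is ours, not part of Paddle's API, and real version strings (e.g. develop builds) can be more complex:

```python
# Hedged sketch: check that a PaddlePaddle version string satisfies the
# ">= 2.2rc" requirement by comparing only the numeric major.minor part.
def meets_minimum(version: str, minimum=(2, 2)) -> bool:
    numeric = version.split("rc")[0]                      # "2.2rc0" -> "2.2"
    parts = tuple(int(p) for p in numeric.split(".")[:2])  # -> (2, 2)
    return parts >= minimum

print(meets_minimum("2.2rc0"), meets_minimum("2.1.3"))  # True False
```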


### Data Acquisition and Preparation

[OpenWebTextCorpus](https://skylion007.github.io/OpenWebTextCorpus/) is an open-source English web-text dataset sourced from Reddit; after deduplication, cleaning, and extraction it contains over 8 million documents.
This example uses the [OpenWebText2 data](https://openwebtext2.readthedocs.io/en/latest/index.html#download-plug-and-play-version) already cleaned by EleutherAI.

Download and extract the archive with the following commands:

```shell
wget https://mystic.the-eye.eu/public/AI/pile_preliminary_components/openwebtext2.jsonl.zst.tar
tar -xvf openwebtext2.jsonl.zst.tar -C /path/to/openwebtext
```

Then build the dataset with the `create_pretraining_data.py` script under [data_tools](./data_tools):
```shell
python -u create_pretraining_data.py \
--model_name gpt2-en \
--tokenizer_name GPTTokenizer \
--data_format JSON \
--input_path /path/to/openwebtext/ \
--append_eos \
--output_prefix gpt_openwebtext \
--workers 40 \
--log_interval 10000
```
Processing takes about an hour and produces the dataset files we need: `gpt_openwebtext_ids.npy` and `gpt_openwebtext_idx.npz`.
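To get a feel for the ids/idx pair, here is a hedged sketch: the exact internal layout PaddleNLP writes is an assumption on our part, so we fabricate a tiny stand-in pair illustrating the general idea of a flat token stream plus a document-boundary index:

```python
import numpy as np

# Build a toy stand-in for the ids/idx file pair (layout is assumed, not
# taken from PaddleNLP): a flat stream of token ids plus document ends.
np.save("toy_ids.npy", np.array([5, 8, 13, 2, 7, 2], dtype=np.int32))
np.savez("toy_idx.npz", doc_ends=np.array([4, 6], dtype=np.int64))

ids = np.load("toy_ids.npy", mmap_mode="r")  # memory-map large token streams
idx = np.load("toy_idx.npz")
first_doc = ids[: idx["doc_ends"][0]]        # tokens of the first document
print(first_doc.tolist())                    # [5, 8, 13, 2]
```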

To make it easy to run and test this model, the project provides a preprocessed 300M-token training sample:
```shell
wget https://bj.bcebos.com/paddlenlp/models/transformers/gpt/data/gpt_en_dataset_300m_ids.npy
wget https://bj.bcebos.com/paddlenlp/models/transformers/gpt/data/gpt_en_dataset_300m_idx.npz
```

Put all the preprocessed files into a single folder for use during training:

```shell
mkdir data
mv gpt_en_dataset_300m_ids.npy ./data
mv gpt_en_dataset_300m_idx.npz ./data
```


```shell
cd static # or cd dygraph
# download the sample data
mkdir data && cd data
wget https://bj.bcebos.com/paddlenlp/models/transformers/gpt/train.data.json_ids.npz
wget https://bj.bcebos.com/paddlenlp/models/transformers/gpt/data/gpt_en_dataset_300m_ids.npy
wget https://bj.bcebos.com/paddlenlp/models/transformers/gpt/data/gpt_en_dataset_300m_idx.npz
cd ..
# run the pretraining script
sh run.sh
```
1 change: 1 addition & 0 deletions examples/language_model/gpt-3/data_tools
1 change: 0 additions & 1 deletion examples/language_model/gpt-3/dygraph/lr.py

This file was deleted.

49 changes: 49 additions & 0 deletions examples/language_model/gpt-3/dygraph/lr.py
@@ -0,0 +1,49 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import math
from paddle.optimizer.lr import LRScheduler


class CosineAnnealingWithWarmupDecay(LRScheduler):
def __init__(self,
max_lr,
min_lr,
warmup_step,
decay_step,
last_epoch=0,
verbose=False):

self.decay_step = decay_step
self.warmup_step = warmup_step
self.max_lr = max_lr
self.min_lr = min_lr
super(CosineAnnealingWithWarmupDecay, self).__init__(max_lr, last_epoch,
verbose)

def get_lr(self):
if self.warmup_step > 0 and self.last_epoch <= self.warmup_step:
return float(self.max_lr) * (self.last_epoch) / self.warmup_step

if self.last_epoch > self.decay_step:
return self.min_lr

num_step_ = self.last_epoch - self.warmup_step
decay_step_ = self.decay_step - self.warmup_step
decay_ratio = float(num_step_) / float(decay_step_)
coeff = 0.5 * (math.cos(math.pi * decay_ratio) + 1.0)
return self.min_lr + coeff * (self.max_lr - self.min_lr)
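For reference, the schedule implemented by `get_lr` above can be mirrored step by step in pure Python, with no Paddle dependency:

```python
import math

def cosine_warmup_lr(step, max_lr, min_lr, warmup_step, decay_step):
    """Pure-Python sketch of CosineAnnealingWithWarmupDecay's get_lr."""
    # Linear warmup from 0 to max_lr over the first warmup_step steps.
    if warmup_step > 0 and step <= warmup_step:
        return float(max_lr) * step / warmup_step
    # After decay_step, hold at the floor learning rate.
    if step > decay_step:
        return min_lr
    # Cosine decay from max_lr down to min_lr between warmup and decay end.
    decay_ratio = float(step - warmup_step) / float(decay_step - warmup_step)
    coeff = 0.5 * (math.cos(math.pi * decay_ratio) + 1.0)
    return min_lr + coeff * (max_lr - min_lr)

print(cosine_warmup_lr(100, 1e-3, 1e-5, 100, 1000))  # 0.001 (warmup peak)
```

At step 0 the rate is 0, it peaks at `max_lr` when warmup ends, then follows the cosine curve down to `min_lr`.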
2 changes: 1 addition & 1 deletion examples/language_model/gpt-3/dygraph/run_pretrain.py
@@ -29,7 +29,7 @@

# to import data_tools
filepath = os.path.abspath(os.path.dirname(__file__))
-sys.path.insert(0, os.path.join(filepath, "../../"))
+sys.path.insert(0, os.path.join(filepath, "../"))

from dataset import create_pretrained_dataset
from args import parse_args
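The path change above is the core of the import fix. A small sketch with a hypothetical checkout path shows why the parent directory, not the grandparent, must go on `sys.path` once `data_tools` sits directly under `gpt-3/` (as the soft link added in this commit places it):

```python
import os

# Hypothetical checkout path: run_pretrain.py lives in gpt-3/dygraph/,
# and data_tools/ now sits one level up, directly under gpt-3/.
filepath = "/repo/examples/language_model/gpt-3/dygraph"
old_root = os.path.normpath(os.path.join(filepath, "../../"))  # before the fix
new_root = os.path.normpath(os.path.join(filepath, "../"))     # after the fix
print(old_root)  # /repo/examples/language_model
print(new_root)  # /repo/examples/language_model/gpt-3
```

Only `new_root` contains the `data_tools` link, so `from data_tools import ...` resolves with the corrected insert.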
1 change: 0 additions & 1 deletion examples/language_model/gpt-3/static/args.py

This file was deleted.

