[Pre-Training] ERNIE-CW pre-training tasks docs. (PaddlePaddle#3111)
* add ernie-large config

* update

* update clue finetune.

* unused delete.

* update

* support no nsp for ernie.

* fix evaluation

* fix amp o2 save_dtype bugs.

* extend ernie.

* fix ernie pretrain with ## vocab.

* extend vocab

* support custom tokenizer.

* add some comments.

* fix bugs.

* add comments.

* fix bug.

* fix run_pretrain_static logging.

* fix all gather.

* fix a100

* fix

* fix bugs

* fix save

* tmp commit for pre-process.

* Update README.md

* Update README.md

* add amp o1 support

* ernie cw readme.

* fix

* throw error when dataset is invalid.

* update document.

* refine readme.

* fix

* refactor

* refactor2

* Add pre-training introduction.

* update  image width.

* refine doc

* fit table width.

* fix c++ style

* fix table

* refine docs

* refine model_zoo/ernie-1.0/README.md

* refine readme.

* fix link

* fix bug

* fix documents.

* add weight.

* fix config
ZHUI committed Sep 9, 2022
1 parent 6b59ba2 commit 94dd90a
Showing 28 changed files with 2,230 additions and 172 deletions.
134 changes: 134 additions & 0 deletions .copyright.hook
@@ -0,0 +1,134 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from __future__ import absolute_import
from __future__ import print_function
from __future__ import unicode_literals

import argparse
import io
import re
import sys
import os
import datetime

COPYRIGHT = '''Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.'''

def _generate_copyright(comment_mark):
    # Split the template into lines and refresh the year in the first line.
    copyright = COPYRIGHT.split(os.linesep)
    header = copyright[0].rstrip()

    p = re.search(r'(\d{4})', header).group(0)
    now = datetime.datetime.now()

    header = header.replace(p, str(now.year))

    # Prefix every line with the comment mark of the target language.
    ans = [comment_mark + " " + header + os.linesep]
    for idx, line in enumerate(copyright[1:]):
        ans.append(comment_mark + " " + line.rstrip() + os.linesep)

    return ans

def _get_comment_mark(path):
    lang_type = re.compile(r"\.(py|sh)$")
    if lang_type.search(path) is not None:
        return "#"

    lang_type = re.compile(r"\.(h|c|hpp|cc|cpp|cu|go|cuh|proto)$")
    if lang_type.search(path) is not None:
        return "//"

    return None


RE_ENCODE = re.compile(r"^[ \t\v]*#.*?coding[:=]", re.IGNORECASE)
RE_COPYRIGHT = re.compile(r".*Copyright \(c\) \d{4}", re.IGNORECASE)
RE_SHEBANG = re.compile(r"^[ \t\v]*#[ \t]?\!")

def _check_copyright(path):
    # Only the first four lines are inspected; shorter files are read in full.
    head = []
    with io.open(path, encoding="utf-8") as f:
        for line in f:
            head.append(line)
            if len(head) == 4:
                break

    for line in head:
        if RE_COPYRIGHT.search(line) is not None:
            return True

    return False

def generate_copyright(path, comment_mark):
    with io.open(path, encoding="utf-8") as input_file:
        original_contents = input_file.readlines()
    head = original_contents[0:4]

    # The header goes after the last shebang or encoding line in the head.
    insert_line_no = 0
    for i, line in enumerate(head):
        if RE_ENCODE.search(line) or RE_SHEBANG.search(line):
            insert_line_no = i + 1

    copyright = _generate_copyright(comment_mark)
    if insert_line_no == 0:
        new_contents = copyright
        if len(original_contents) > 0 and len(original_contents[0].strip()) != 0:
            new_contents.append(os.linesep)
        new_contents.extend(original_contents)
    else:
        new_contents = original_contents[0:insert_line_no]
        new_contents.append(os.linesep)
        new_contents.extend(copyright)
        if len(original_contents) > insert_line_no and len(original_contents[insert_line_no].strip()) != 0:
            new_contents.append(os.linesep)
        new_contents.extend(original_contents[insert_line_no:])
    new_contents = "".join(new_contents)

    # Write back with the same encoding that was used for reading.
    with io.open(path, 'w', encoding="utf-8") as output_file:
        output_file.write(new_contents)



def main(argv=None):
    parser = argparse.ArgumentParser(
        description='Checker for copyright declaration.')
    parser.add_argument('filenames', nargs='*', help='Filenames to check')
    args = parser.parse_args(argv)

    retv = 0
    for path in args.filenames:
        comment_mark = _get_comment_mark(path)
        if comment_mark is None:
            print("warning:Unsupported file", path, file=sys.stderr)
            continue

        if _check_copyright(path):
            continue

        generate_copyright(path, comment_mark)

    return retv


if __name__ == '__main__':
    sys.exit(main())
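The placement rule in `generate_copyright` can be tried out without touching the file system. The sketch below is a simplified, in-memory restatement of that rule, not the hook itself: `insert_header` and the sample lines are hypothetical names introduced for illustration, and only the two regular expressions are copied from the hook.

```python
import re

# Patterns copied from .copyright.hook.
RE_ENCODE = re.compile(r"^[ \t\v]*#.*?coding[:=]", re.IGNORECASE)
RE_SHEBANG = re.compile(r"^[ \t\v]*#[ \t]?\!")


def insert_header(lines, header):
    """Hypothetical helper: return a new list with `header` placed after
    any shebang or encoding line found among the first four lines."""
    insert_at = 0
    for i, line in enumerate(lines[:4]):
        if RE_ENCODE.search(line) or RE_SHEBANG.search(line):
            insert_at = i + 1

    if insert_at == 0:
        out = list(header)
        # Separate the header from non-blank code with one empty line.
        if lines and lines[0].strip():
            out.append("\n")
        return out + lines

    out = lines[:insert_at] + ["\n"] + list(header)
    if len(lines) > insert_at and lines[insert_at].strip():
        out.append("\n")
    return out + lines[insert_at:]


# Illustrative input: a script that starts with a shebang line.
source = ["#!/usr/bin/env python\n", "import os\n"]
header = ["# Copyright (c) 2022 Example Authors.\n"]
result = insert_header(source, header)
print("".join(result), end="")
```

With a shebang present, the header lands below it and is padded by blank lines on both sides; without one, the header becomes the first line of the file.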
7 changes: 7 additions & 0 deletions .pre-commit-config.yaml
@@ -26,3 +26,10 @@ repos:
      files: \.md$
    - id: remove-tabs
      files: \.md$
- repo: local
  hooks:
    - id: copyright_checker
      name: copyright_checker
      entry: python .copyright.hook
      language: system
      files: \.(c|cc|cxx|cpp|cu|h|hpp|hxx|proto|xpu|kps|py|sh)$
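The `files` pattern of the new hook and the extension lists inside `.copyright.hook` are not identical, which is easy to check. The sketch below assumes that pre-commit applies `files` as a Python regular expression searched against each path; the candidate paths are hypothetical, and `HOOK_PATTERN` is simply the union of the two groups in `_get_comment_mark`.

```python
import re

# Copied from the `files:` entry of the copyright_checker hook.
FILES_PATTERN = re.compile(r"\.(c|cc|cxx|cpp|cu|h|hpp|hxx|proto|xpu|kps|py|sh)$")
# Union of the two extension groups handled by _get_comment_mark.
HOOK_PATTERN = re.compile(r"\.(py|sh|h|c|hpp|cc|cpp|cu|go|cuh|proto)$")

# Hypothetical paths, used only to exercise the pattern.
candidates = [
    "model_zoo/ernie-1.0/run_pretrain.py",
    "docs/FAQ.md",
    "csrc/kernel.cu",
    "tools/run.go",
]
selected = [p for p in candidates if FILES_PATTERN.search(p)]

# Extensions that pass the `files` filter but get no comment mark.
extensions = ["c", "cc", "cxx", "cpp", "cu", "h", "hpp", "hxx",
              "proto", "xpu", "kps", "py", "sh"]
unsupported = [e for e in extensions if not HOOK_PATTERN.search("x." + e)]

print(selected)
print(unsupported)
```

Under that assumption, files ending in `.cxx`, `.hxx`, `.xpu`, or `.kps` reach the script but only trigger its "Unsupported file" warning, while `.go` and `.cuh` files are never passed to it.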
2 changes: 1 addition & 1 deletion docs/FAQ.md
@@ -182,7 +182,7 @@ emb.set_state_dict(load_layer_state_dict) # load model parameters

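**A:** Pre-trained models usually ship with a matching tokenizer and vocabulary. Most Chinese pre-trained models, such as ERNIE-3.0, take character-level input: the tokenizer converts each sentence into characters, so the model never receives word-level input. To introduce an additional vocabulary, you need to modify the pre-trained model's tokenizer and vocabulary; see this [blog](https://kexue.fm/archives/7758/comment-page-1#Tokenizer) for reference. Note that the embedding matrix must also be extended with embeddings for the newly added words.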
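As a rough illustration of extending the vocabulary and the embedding matrix together, the sketch below uses plain Python lists in place of a real model. `extend_vocab` is a hypothetical helper, and initializing a new word with the mean of its character embeddings is an assumption made here for illustration, not a method prescribed by PaddleNLP.

```python
def extend_vocab(vocab, embeddings, new_words):
    """Hypothetical helper: append new words to `vocab` (token -> id) and
    add one embedding row per new word."""
    vocab = dict(vocab)
    embeddings = [row[:] for row in embeddings]
    dim = len(embeddings[0])
    for word in new_words:
        if word in vocab:
            continue
        # Assumed initialization: average the rows of the known characters.
        rows = [embeddings[vocab[ch]] for ch in word if ch in vocab]
        if rows:
            new_row = [sum(col) / len(rows) for col in zip(*rows)]
        else:
            new_row = [0.0] * dim
        vocab[word] = len(embeddings)  # ids stay aligned with row indices
        embeddings.append(new_row)
    return vocab, embeddings


# Toy character-level vocabulary with 2-dimensional embeddings.
vocab = {"[UNK]": 0, "北": 1, "京": 2}
embeddings = [[0.0, 0.0], [1.0, 3.0], [3.0, 5.0]]
new_vocab, new_embeddings = extend_vocab(vocab, embeddings, ["北京"])
print(new_vocab["北京"], new_embeddings[3])  # → 3 [2.0, 4.0]
```

The invariant to preserve is that every token id indexes a valid row: the vocabulary and the matrix must grow by the same number of entries.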

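There is another way to use this dictionary information: mask every word in the data that appears in the dictionary as a whole unit and run a second round of masked language model pre-training, so that the resulting model carries representations of the additional dictionary. See the [PaddleNLP pre-training data pipeline](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-1.0/data_tools).
There is another way to use this dictionary information: mask every word in the data that appears in the dictionary as a whole unit and run a second round of masked language model pre-training, so that the resulting model carries representations of the additional dictionary. See the [PaddleNLP pre-training data pipeline](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-1.0/).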
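The whole-word masking idea can be sketched as follows. This is a minimal illustration, not the PaddleNLP data pipeline: `segment`, `whole_word_mask`, the greedy longest-match strategy, the `[MASK]` token string, and the default masking probability are all assumptions chosen for the example.

```python
import random


def segment(chars, lexicon, max_word_len=4):
    """Greedy longest-match segmentation; returns (start, end) spans."""
    spans = []
    i = 0
    while i < len(chars):
        end = i + 1  # fall back to a single character
        for n in range(min(max_word_len, len(chars) - i), 1, -1):
            if "".join(chars[i:i + n]) in lexicon:
                end = i + n
                break
        spans.append((i, end))
        i = end
    return spans


def whole_word_mask(chars, lexicon, mask_prob=0.15, seed=0):
    """Mask each span as a unit, so a lexicon word is never half-masked."""
    rng = random.Random(seed)
    masked = list(chars)
    for start, end in segment(chars, lexicon):
        if rng.random() < mask_prob:
            masked[start:end] = ["[MASK]"] * (end - start)
    return masked


chars = list("我爱北京天安门")
lexicon = {"北京", "天安门"}
print(segment(chars, lexicon))  # → [(0, 1), (1, 2), (2, 4), (4, 7)]
print(whole_word_mask(chars, lexicon, mask_prob=0.5))
```

Because the masking decision is taken once per span, the characters of a dictionary word are always masked or kept together, which is what lets the model learn the word as a unit.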


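There are also pre-trained models with word-level or mixed character-and-word granularity, and introducing an additional vocabulary is easier with them. We will continue to enrich the pre-trained models available in PaddleNLP.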
12 changes: 12 additions & 0 deletions docs/model_zoo/transformers/ERNIE/contents.rst
@@ -16,6 +16,14 @@ ERNIE Model Summary
| | | 12-heads, 108M parameters. |
| | | Trained on Chinese text. |
+----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+
|``ernie-1.0-base-zh-cw`` | Chinese | 12-layer, 768-hidden, |
| | | 12-heads, 118M parameters. |
| | | Trained on Chinese text. |
+----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+
|``ernie-1.0-large-zh-cw`` | Chinese | 24-layer, 1024-hidden, |
| | | 16-heads, 272M parameters. |
| | | Trained on Chinese text. |
+----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+
|``ernie-tiny`` | Chinese | 3-layer, 1024-hidden, |
| | | 16-heads, _M parameters. |
| | | Trained on Chinese text. |
@@ -32,6 +40,10 @@ ERNIE Model Summary
| | | 16-heads, 336M parameters. |
| | | Trained on lower-cased English text. |
+----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+
|``ernie-3.0-xbase-zh`` | Chinese | 20-layer, 1024-hidden, |
| | | 16-heads, 296M parameters. |
| | | Trained on Chinese text. |
+----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+
|``ernie-3.0-base-zh`` | Chinese | 12-layer, 768-hidden, |
| | | 12-heads, 118M parameters. |
| | | Trained on Chinese text. |
94 changes: 66 additions & 28 deletions examples/benchmark/clue/README.md
@@ -67,14 +67,51 @@
<td style="text-align:center;">
<span style="font-size:18px;">C<sup>3</sup></span>
</td>
</tr> <tr>
<td rowspan=3 align=center> 24L1024H </td>
<td style="text-align:center">
<span style="font-size:18px">ERNIE 1.0-Large-zh-CW</span>
</td>
<td style="text-align:center">
<span style="font-size:18px"><b>79.03</b></span>
</td>
<td style="text-align:center">
<span style="font-size:18px">75.97</span>
</td>
<td style="text-align:center">
<span style="font-size:18px">59.65</span>
</td>
<td style="text-align:center">
<span style="font-size:18px"><b>62.91</b></span>
</td>
<td style="text-align:center">
<span style="font-size:18px"><b>85.09</b></span>
</td>
<td style="text-align:center">
<span style="font-size:18px"><b>81.73</b></span>
</td>
<td style="text-align:center">
<span style="font-size:18px"><b>93.09</b></span>
</td>
<td style="text-align:center">
<span style="font-size:18px"><b>84.53</b></span>
</td>
<td style="text-align:center">
<span style="font-size:18px"><b>74.22/91.88</b></span>
</td>
<td style="text-align:center">
<span style="font-size:18px"><b>88.57</b></span>
</td>
<td style="text-align:center">
<span style="font-size:18px"><b>84.54</b></span>
</td>
</tr>
<tr>
<td rowspan=2 align=center> 24L1024H </td>
<td style="text-align:center">
<span style="font-size:18px">ERNIE 2.0-Large-zh</span>
</td>
<td style="text-align:center">
<span style="font-size:18px"><b>77.03</b></span>
<span style="font-size:18px">77.03</span>
</td>
<td style="text-align:center">
<span style="font-size:18px"><b>76.41</b></span>
@@ -89,16 +126,16 @@
<span style="font-size:18px">83.82</span>
</td>
<td style="text-align:center">
<span style="font-size:18px"><b>79.69</b></span>
<span style="font-size:18px">79.69</span>
</td>
<td style="text-align:center">
<span style="font-size:18px">89.14</span>
</td>
<td style="text-align:center">
<span style="font-size:18px"><b>84.10</b></span>
<span style="font-size:18px">84.10</span>
</td>
<td style="text-align:center">
<span style="font-size:18px"><b>71.48/90.35</b></span>
<span style="font-size:18px">71.48/90.35</span>
</td>
<td style="text-align:center">
<span style="font-size:18px">85.52</span>
@@ -124,13 +161,13 @@
<span style="font-size:18px">62.02</span>
</td>
<td style="text-align:center">
<span style="font-size:18px"><b>83.88</b></span>
<span style="font-size:18px">83.88</span>
</td>
<td style="text-align:center">
<span style="font-size:18px">78.81</span>
</td>
<td style="text-align:center">
<span style="font-size:18px"><b>90.79</b></span>
<span style="font-size:18px">90.79</span>
</td>
<td style="text-align:center">
<span style="font-size:18px">83.67</span>
@@ -139,7 +176,7 @@
<span style="font-size:18px">70.58/89.82</span>
</td>
<td style="text-align:center">
<span style="font-size:18px"><b>85.72</b></span>
<span style="font-size:18px">85.72</span>
</td>
<td style="text-align:center">
<span style="font-size:18px">75.26</span>
@@ -151,37 +188,37 @@
<span style="font-size:18px">ERNIE 3.0-Xbase-zh</span>
</td>
<td style="text-align:center">
<span style="font-size:18px"><b>78.71</b></span>
<span style="font-size:18px"><b>78.39</b></span>
</td>
<td style="text-align:center">
<span style="font-size:18px"><b>76.85</b></span>
<span style="font-size:18px"><b>76.16</b></span>
</td>
<td style="text-align:center">
<span style="font-size:18px"><b>59.89</b></span>
<span style="font-size:18px"><b>59.55</b></span>
</td>
<td style="text-align:center">
<span style="font-size:18px"><b>62.41</b></span>
<span style="font-size:18px"><b>61.87</b></span>
</td>
<td style="text-align:center">
<span style="font-size:18px"><b>84.76</b></span>
<span style="font-size:18px"><b>84.40</b></span>
</td>
<td style="text-align:center">
<span style="font-size:18px"><b>82.51</b></span>
<span style="font-size:18px"><b>81.73</b></span>
</td>
<td style="text-align:center">
<span style="font-size:18px"><b>89.80</b></span>
<span style="font-size:18px"><b>88.82</b></span>
</td>
<td style="text-align:center">
<span style="font-size:18px"><b>84.47</b></span>
<span style="font-size:18px"><b>83.60</b></span>
</td>
<td style="text-align:center">
<span style="font-size:18px"><b>75.49/92.67</b></span>
<span style="font-size:18px"><b>75.99/93.00</b></span>
</td>
<td style="text-align:center">
<span style="font-size:18px"><b>86.36</b></span>
<span style="font-size:18px"><b>86.78</b></span>
</td>
<td style="text-align:center">
<span style="font-size:18px"><b>84.59</b></span>
<span style="font-size:18px"><b>84.98</b></span>
</td>
</tr>
<tr>
@@ -270,31 +307,31 @@
<span style="font-size:18px">ERNIE 2.0-Base-zh</span>
</td>
<td style="text-align:center">
<span style="font-size:18px">74.95</span>
<span style="font-size:18px">74.32</span>
</td>
<td style="text-align:center">
<span style="font-size:18px">76.25</span>
<span style="font-size:18px">75.65</span>
</td>
<td style="text-align:center">
<span style="font-size:18px">58.53</span>
<span style="font-size:18px">58.25</span>
</td>
<td style="text-align:center">
<span style="font-size:18px">61.72</span>
<span style="font-size:18px">61.64</span>
</td>
<td style="text-align:center">
<span style="font-size:18px">83.07</span>
<span style="font-size:18px">82.62</span>
</td>
<td style="text-align:center">
<span style="font-size:18px">78.81</span>
<span style="font-size:18px">78.71</span>
</td>
<td style="text-align:center">
<span style="font-size:18px">84.21</span>
<span style="font-size:18px">81.91</span>
</td>
<td style="text-align:center">
<span style="font-size:18px">82.77</span>
<span style="font-size:18px">82.33</span>
</td>
<td style="text-align:center">
<span style="font-size:18px">68.22/88.71</span>
<span style="font-size:18px">66.08/87.46</span>
</td>
<td style="text-align:center">
<span style="font-size:18px">82.78</span>
@@ -1154,6 +1191,7 @@ AFQMC (semantic similarity), TNEWS (text classification), IFLYTEK (long text classification

| Model | AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | CLUEWSC2020 | CSL | CMRC2018 | CHID | C<sup>3</sup> |
| -------------------------------- | ------- | ------- | ------- | -------- | -------- | ----------- | ------- | -------- | ------- | ------------- |
| ERNIE 1.0-Large-zh-cw | 2e-5,64 | 3e-5,32 | 5e-5,16 | 2e-5,16 | 2e-5,32 | 1e-5,32 | 1e-5,16 | 2e-5,24 | 1e-5,24 | 2e-5,32 |
| ERNIE 3.0-Xbase-zh | 2e-5,16 | 3e-5,32 | 3e-5,32 | 3e-5,64 | 3e-5,64 | 2e-5,32 | 1e-5,16 | 3e-5,24 | 2e-5,24 | 3e-5,24 |
| ERNIE 2.0-Large-zh | 1e-5,32 | 3e-5,64 | 3e-5,32 | 2e-5,32 | 1e-5,16 | 3e-5,32 | 1e-5,64 | 2e-5,24 | 2e-5,24 | 3e-5,32 |
| HFL/RoBERTa-wwm-ext-large | 1e-5,32 | 3e-5,32 | 2e-5,32 | 1e-5,16 | 1e-5,16 | 2e-5,16 | 2e-5,16 | 3e-5,32 | 1e-5,24 | 2e-5,24 |