Update README.md

lemon234071 committed Dec 23, 2020 (commit 5269176, parent 8687784). 1 changed file: README.md, 16 additions and 14 deletions.
The raw dialogue data in the LCCC-base dataset comes from Weibo conversations, while LCCC-large merges those Weibo conversations with other open-source dialogue datasets.

| Pre-trained model | Parameters | Pre-training data | Description |
|---------------------| ------ |--------------------------|-------------------------------------------------- |
| GPT<sub>Novel</sub> | 95.5M | Chinese novels | A Chinese pre-trained GPT model built on Chinese novel data (the novel corpus contains 1.3B characters in total) |
| [CDial-GPT<sub>LCCC-base</sub>](https://huggingface.co/lemon234071/CDial-GPT_LCCC-base) | 95.5M | LCCC-base | A Chinese pre-trained GPT model obtained by further training GPT<sub>Novel</sub> on LCCC-base |
| [CDial-GPT2<sub>LCCC-base</sub>](https://huggingface.co/lemon234071/CDial-GPT2_LCCC-base) | 95.5M | LCCC-base | A Chinese pre-trained GPT2 model obtained by further training GPT<sub>Novel</sub> on LCCC-base |
| [CDial-GPT<sub>LCCC-large</sub>](https://huggingface.co/lemon234071/CDial-GPT_LCCC-large) | 95.5M | LCCC-large | A Chinese pre-trained GPT model obtained by further training GPT<sub>Novel</sub> on LCCC-large |
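
These checkpoints load directly with the transformers library. Below is a minimal sketch that samples a reply with the generic `generate` API. It is only an illustration: the tokenizer class, query string, and sampling hyper-parameters are assumptions (check the model card), and the repository's interaction script builds inputs with the proper speaker and turn tokens, which this sketch omits.

```python
from transformers import OpenAIGPTLMHeadModel, BertTokenizer

# Load the LCCC-large checkpoint from the Hugging Face hub
# (BertTokenizer is assumed here; verify against the model card)
tokenizer = BertTokenizer.from_pretrained("lemon234071/CDial-GPT_LCCC-large")
model = OpenAIGPTLMHeadModel.from_pretrained("lemon234071/CDial-GPT_LCCC-large")

# Encode a toy single-turn query; the tokenizer wraps it in [CLS] ... [SEP]
input_ids = tokenizer.encode("今天天气怎么样?", return_tensors="pt")

# Sample a continuation with nucleus sampling (placeholder hyper-parameters)
output = model.generate(input_ids, max_length=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```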

### Installation
Install directly from the source code:
Step 1: Prepare the pre-trained model and the dataset for fine-tuning (e.g., the [STC dataset](https://arxiv.org/abs/1503.02364) or the toy data "data/toy_data.json" in this repository):

wget https://cloud.tsinghua.edu.cn/f/372be4a9994b4124810e/?dl=1 -O STC-corpus.zip  # Download the STC dataset and unzip it into the "data_path" dir (if fine-tuning on STC)
git lfs install
git clone https://huggingface.co/lemon234071/CDial-GPT_LCCC-large  # Download the model yourself, or load it via OpenAIGPTLMHeadModel.from_pretrained("lemon234071/CDial-GPT_LCCC-large")

Step 2: Train the model

python train.py --pretrained --model_checkpoint lemon234071/CDial-GPT_LCCC-large --data_path data/STC.json --scheduler linear  # Train on a single GPU

or

python -m torch.distributed.launch --nproc_per_node=8 train.py --pretrained --model_checkpoint lemon234071/CDial-GPT_LCCC-large --data_path data/STC.json --scheduler linear  # Distributed training on 8 GPUs

Our training script also provides a ``train_path`` argument, which lets the script read a plain-text file in slices. If your system's memory is limited, consider using this argument to load the training data.
If you use ``train_path``, leave ``data_path`` empty, as sketched below.
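
As a sketch of the two data-loading modes (the plain-text file name ``data/STC.txt`` is a placeholder; see the repository's data-loading code for the exact format expected by ``train_path``):

```bash
# JSON mode: pass the structured dataset via data_path
python train.py --pretrained --model_checkpoint lemon234071/CDial-GPT_LCCC-large --data_path data/STC.json --scheduler linear

# Plain-text mode: read the corpus in slices via train_path, leaving data_path empty
python train.py --pretrained --model_checkpoint lemon234071/CDial-GPT_LCCC-large --train_path data/STC.txt --data_path "" --scheduler linear
```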
Similar to [TransferTransfo](https://arxiv.org/abs/1901.08149), we concatenate all dialogue histories into one context sentence and use this sentence to predict the response.

| Models | Parameter Size | Pre-training Dataset | Description |
|---------------------| ------ |--------------------------|-------------------------------------------------- |
| GPT<sub>Novel</sub> | 95.5M | Chinese Novel | A GPT model pre-trained on a Chinese novel dataset (1.3B words; note that we do not provide the details of this model) |
| [CDial-GPT<sub>LCCC-base</sub>](https://huggingface.co/lemon234071/CDial-GPT_LCCC-base) | 95.5M | [LCCC-base](#datasets) | A GPT model post-trained on the LCCC-base dataset from GPT<sub>Novel</sub> |
| [CDial-GPT2<sub>LCCC-base</sub>](https://huggingface.co/lemon234071/CDial-GPT2_LCCC-base) | 95.5M | [LCCC-base](#datasets) | A GPT2 model post-trained on the LCCC-base dataset from GPT<sub>Novel</sub> |
| [CDial-GPT<sub>LCCC-large</sub>](https://huggingface.co/lemon234071/CDial-GPT_LCCC-large) | 95.5M | [LCCC-large](#datasets) | A GPT model post-trained on the LCCC-large dataset from GPT<sub>Novel</sub> |
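
As a quick sanity check of the parameter counts in the table, one can count the weights of a loaded checkpoint; a minimal sketch (assumes the transformers library and torch are installed):

```python
from transformers import OpenAIGPTLMHeadModel

model = OpenAIGPTLMHeadModel.from_pretrained("lemon234071/CDial-GPT_LCCC-base")

# Sum the sizes of all weight tensors; should be roughly the 95.5M in the table
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")
```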

### Installation
Install from the source code:
Step 1: Prepare the data for fine-tuning (e.g., the [STC dataset](https://arxiv.org/abs/1503.02364) or "data/toy_data.json" in our repository) and the pre-trained model:

wget https://cloud.tsinghua.edu.cn/f/372be4a9994b4124810e/?dl=1 -O STC-corpus.zip  # Download the STC dataset and unzip it into the "data_path" dir (if fine-tuning on STC)
git lfs install
git clone https://huggingface.co/lemon234071/CDial-GPT_LCCC-large  # or load it via OpenAIGPTLMHeadModel.from_pretrained("lemon234071/CDial-GPT_LCCC-large")
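
If you cloned the checkpoint with git lfs as above, `from_pretrained` also accepts the path of the local clone instead of the hub id; a minimal sketch (the tokenizer class is assumed, as noted earlier):

```python
from transformers import OpenAIGPTLMHeadModel, BertTokenizer

# Point at the directory created by git clone; the hub id works the same way
model = OpenAIGPTLMHeadModel.from_pretrained("./CDial-GPT_LCCC-large")
tokenizer = BertTokenizer.from_pretrained("./CDial-GPT_LCCC-large")
```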

Step 2: Train the model

python train.py --pretrained --model_checkpoint lemon234071/CDial-GPT_LCCC-large --data_path data/STC.json --scheduler linear  # Single GPU training

or

python -m torch.distributed.launch --nproc_per_node=8 train.py --pretrained --model_checkpoint lemon234071/CDial-GPT_LCCC-large --data_path data/STC.json --scheduler linear  # Training on 8 GPUs

Note: We also provide a ``train_path`` argument in the training script to read a dataset in plain text, which will be sliced and processed in a distributed fashion.
Consider using this argument if the dataset is too large for your system's memory (and remember to leave the ``data_path`` argument empty when using ``train_path``).
