update readme
zglxjtu committed May 21, 2024
1 parent 19494fa commit 3b811d6
Showing 1 changed file with 3 additions and 2 deletions.
5 changes: 3 additions & 2 deletions README.md
@@ -82,11 +82,12 @@ python eval.py

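| Function                                | Purpose                                               | Deduplication result | Time taken |
|-----------------------------------------|-------------------------------------------------------|----------------------|------------|
-| process_baike()                         | Short-text filtering + data storage count conversion  | 5634898 to 3605212   | 552.046 s  |
+| process_baike()                         | Short-text filtering + data storage format conversion | 5634898 to 3605212   | 552.046 s  |
| remove_dataset_duplicate_rows()         | Minhash deduplication                                 | 3605212 to 2736033   | 4 h        |
| remove_dataset_duplicate_rows_simhash() | Simhash deduplication                                 | 3605212 to 3548779   | 23 min     |

-Minhash is recommended!
+- Storing the data in parquet format is recommended; it greatly reduces storage usage.
+- Minhash deduplication is recommended; it removes more duplicates than Simhash but takes much longer!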
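
To make the Minhash recommendation concrete, here is a minimal sketch of Minhash-based deduplication using only the standard library. It is not the repository's `remove_dataset_duplicate_rows()`; the names `shingles`, `minhash_signature`, `estimated_jaccard`, and `dedup`, the 3-character shingles, the 64 hash functions, and the 0.8 threshold are all assumptions chosen for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of character k-grams of `text` (hypothetical helper)."""
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def minhash_signature(text, num_perm=64, k=3):
    """Build a signature: the minimum hash of the shingle set under each seed."""
    grams = shingles(text, k)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def dedup(texts, threshold=0.8, num_perm=64):
    """Keep a text only if it is not a near-duplicate of an already kept text."""
    kept, kept_sigs = [], []
    for text in texts:
        sig = minhash_signature(text, num_perm)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(text)
            kept_sigs.append(sig)
    return kept


docs = ["the quick brown fox", "the quick brown fox", "completely different text"]
unique_docs = dedup(docs)
```

This sketch compares every new signature against all kept signatures, which is quadratic in the number of documents. Pipelines at the scale shown in the table typically add locality-sensitive hashing on top of the signatures to avoid the pairwise scan.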

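2. **Tokenizing the data**: Preprocessing follows the common GPT practice of tokenizing the corpus in advance. After each sample is tokenized, an end-of-sequence token `<eos>` is appended to separate it from the next sample. All training tokens are then concatenated into a single array (np.uint16) and stored on disk as a .bin binary file. If the corpus is too large to fit in memory, it can be accessed through mmap instead to avoid running out of memory.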
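
The packing step described above can be sketched as follows. This is an illustration under stated assumptions, not the repository's script: `toy_tokenize`, `EOS_ID`, `BYTE_OFFSET`, `pack_corpus`, and `open_corpus` are hypothetical names, and a byte-level stand-in replaces the real tokenizer.

```python
import os
import tempfile

import numpy as np

EOS_ID = 2       # hypothetical <eos> id; a real tokenizer defines its own
BYTE_OFFSET = 3  # hypothetical: ids 0-2 are treated as reserved special tokens


def toy_tokenize(text):
    """Stand-in tokenizer: one id per UTF-8 byte, shifted past the reserved ids."""
    return [b + BYTE_OFFSET for b in text.encode("utf-8")]


def pack_corpus(samples, path):
    """Tokenize each sample, append EOS_ID, and write all ids to `path` as uint16."""
    ids = []
    for sample in samples:
        ids.extend(toy_tokenize(sample))
        ids.append(EOS_ID)  # marks the boundary between samples
    arr = np.array(ids, dtype=np.uint16)  # only valid while every id is below 65536
    arr.tofile(path)
    return arr.size


def open_corpus(path):
    """Memory-map the .bin file so tokens are read from disk on demand."""
    return np.memmap(path, dtype=np.uint16, mode="r")


with tempfile.TemporaryDirectory() as tmp:
    bin_path = os.path.join(tmp, "corpus.bin")
    n_tokens = pack_corpus(["hi", "ok!"], bin_path)
    tokens = open_corpus(bin_path)
    first_sample = tokens[:3].tolist()
    del tokens  # release the mapping before the temporary directory is removed
```

Storing ids as `np.uint16` halves the file size relative to 32-bit integers, but it only works when the vocabulary has fewer than 65536 entries. Reading through `np.memmap` lets a training loop slice windows out of the file without loading the whole corpus into memory.
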
```bash