Commit

Merge branch 'PaddlePaddle:master' into ACL2022-DuLeMon

ZubinGou committed Mar 29, 2022
2 parents b798736 + 2673b63 commit 6c31859
Showing 70 changed files with 79,605 additions and 28 deletions.
4 changes: 2 additions & 2 deletions KG/DuKEVU_Baseline/LICENSE
@@ -1,4 +1,4 @@
-Copyright (c) 2021 Baidu, Inc. All Rights Reserved
+Copyright (c) 2022 Baidu, Inc. All Rights Reserved

Apache License
Version 2.0, January 2004
@@ -188,7 +188,7 @@ Copyright (c) 2021 Baidu, Inc. All Rights Reserved
same "printed page" as the copyright notice for easier
identification within third-party archives.

-Copyright (c) 2021 Baidu, Inc. All Rights Reserved.
+Copyright (c) 2022 Baidu, Inc. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
14 changes: 7 additions & 7 deletions KG/DuKEVU_Baseline/README.md
@@ -1,19 +1,19 @@
-This project provides the baseline models for the [CCKS 2021 "Knowledge-Enhanced Video Semantic Understanding" shared task](https://www.biendata.xyz/competition/ccks_2021_videounderstanding). It has two parts: 1) the video category tagging model [paddle-video-classify-tag](./paddle-video-classify-tag); 2) the video semantic tagging model [paddle-video-semantic-tag](./paddle-video-semantic-tag). The category tagging model classifies a video over a closed two-level label taxonomy based on its content, yielding category tags that describe the video; the semantic tagging model extracts entity semantic tags from the video's textual information (participants may extend it, e.g., by reasoning over the provided knowledge base, fusing multimodal information to improve tagging, or generating tags). The tags produced by the two models correspond, respectively, to the category tags and semantic tags provided in the evaluation dataset.
+This project provides the baseline models for the "Knowledge-Enhanced Video Semantic Understanding" shared task. It has two parts: 1) the video category tagging model [paddle-video-classify-tag](./paddle-video-classify-tag); 2) the video semantic tagging model [paddle-video-semantic-tag](./paddle-video-semantic-tag). The category tagging model classifies a video over a closed two-level label taxonomy based on its content, yielding category tags that describe the video; the semantic tagging model extracts entity semantic tags from the video's textual information (participants may extend it, e.g., by reasoning over the provided knowledge base, fusing multimodal information to improve tagging, or generating tags). The tags produced by the two models correspond, respectively, to the category tags and semantic tags provided in the evaluation dataset.

## Dataset preprocessing

-First download the training and test data, then organize the directory structure as follows (note that `CCKS_dataset` in the root directory and in each subdirectory refer to the same folder):
+First download the training and test data, then organize the directory structure as follows (note that `dataset` in the root directory and in each subdirectory refer to the same folder):

```
DuKEVU_baseline
-|-- CCKS_dataset
-    |-- ccks2021
+|-- dataset
+    |-- dataset
         |-- train.json
         |-- test_a.json
|-- paddle-video-classify-tag
-    |-- CCKS_dataset -> ../CCKS_dataset
+    |-- dataset -> ../dataset
|-- paddle-video-semantic-tag
-    |-- CCKS_dataset -> ../CCKS_dataset
+    |-- dataset -> ../dataset
```
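As a convenience, here is a minimal Python sketch (a hypothetical helper, not part of the baseline) that creates the two `dataset -> ../dataset` symlinks shown above, assuming the unpacked data already sits in the repository root:

```python
# Hypothetical helper, not part of the baseline; run from DuKEVU_baseline/.
# Assumes the unpacked data already lives at ./dataset/dataset/.
import os

for subdir in ["paddle-video-classify-tag", "paddle-video-semantic-tag"]:
    link = os.path.join(subdir, "dataset")
    if not os.path.islink(link):
        # Relative link, matching the "dataset -> ../dataset" entries above.
        os.symlink(os.path.join("..", "dataset"), link)
```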

## Environment setup

@@ -27,7 +27,7 @@
```bash
conda create -n paddle2.0 python=3.8
conda activate paddle2.0
conda install paddlepaddle-gpu==2.0.2 cudatoolkit=10.0 -c paddle
pip install opencv-python -i https://mirror.baidu.com/pypi/simple
-pip install --upgrade paddlenlp -i https://mirror.baidu.com/pypi/simple
+pip install paddlenlp==2.0.1 -i https://mirror.baidu.com/pypi/simple
pip install tqdm wget
```
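To verify the install before running the baselines, PaddlePaddle ships a built-in self-check (a quick sanity test, not part of the original instructions):

```python
import paddle

print(paddle.__version__)  # expect 2.0.2 with the environment above
paddle.utils.run_check()   # runs a small program and reports whether paddle works
```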

File renamed without changes.
4 changes: 2 additions & 2 deletions KG/DuKEVU_Baseline/generate_submission.py
@@ -1,4 +1,4 @@
-# Copyright (c) 2021 Baidu.com, Inc. All Rights Reserved.
+# Copyright (c) 2022 Baidu.com, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -20,7 +20,7 @@

parser = argparse.ArgumentParser()
parser.add_argument(
"--test_path", type=str, default="CCKS_dataset/ccks2021/test_a.json")
"--test_path", type=str, default="dataset/dataset/test_a.json")
parser.add_argument(
"--category_level1_result",
type=str,
1 change: 0 additions & 1 deletion KG/DuKEVU_Baseline/paddle-video-classify-tag/CCKS_dataset

This file was deleted.

8 changes: 4 additions & 4 deletions KG/DuKEVU_Baseline/paddle-video-classify-tag/README.md
@@ -9,8 +9,8 @@

```
DuKEVU_baseline
-|-- CCKS_dataset
-    |-- ccks2021
+|-- dataset
+    |-- dataset
         |-- train.json
         |-- test_a.json
         |-- tsn_features_train
```
@@ -25,13 +25,13 @@ DuKEVU_baseline

```bash
export CUDA_VISIBLE_DEVICES=0
-python tsn_extractor.py --model_name=TSN --config=./configs/tsn-single.yaml --weights=./weights/tsn.pdparams --filelist=./data/TsnExtractor.list --save_dir=./CCKS_dataset/tsn_features
+python tsn_extractor.py --model_name=TSN --config=./configs/tsn-single.yaml --weights=./weights/tsn.pdparams --filelist=./data/TsnExtractor.list --save_dir=./dataset/tsn_features
```
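Each video then gets one feature file under `--save_dir`. A hedged sketch for inspecting an extracted feature, assuming (an assumption, not stated in this diff) that each feature is saved as a NumPy `.npy` array of shape `[num_segments, feature_dim]`:

```python
import os
import numpy as np

feature_dir = "dataset/tsn_features"       # --save_dir from the command above
name = sorted(os.listdir(feature_dir))[0]  # pick the first extracted video
feat = np.load(os.path.join(feature_dir, name))
print(name, feat.shape, feat.dtype)
```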

Prepare the label set for the video semantic understanding dataset, along with the train/validation/test sample lists, as follows.

```bash
-python prepare_ccks_videotag.py
+python prepare_videotag.py
```

Since the dataset has two levels of labels, we run classification experiments separately under the level-1 and level-2 label settings.
1 change: 1 addition & 0 deletions KG/DuKEVU_Baseline/paddle-video-classify-tag/dataset
@@ -1,4 +1,4 @@
-# Copyright (c) 2021 Baidu.com, Inc. All Rights Reserved.
+# Copyright (c) 2022 Baidu.com, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -23,17 +23,17 @@

parser = argparse.ArgumentParser()
parser.add_argument(
"--trainval_path", type=str, default="CCKS_dataset/ccks2021/train.json")
"--trainval_path", type=str, default="dataset/dataset/train.json")
parser.add_argument(
"--test_path", type=str, default="CCKS_dataset/ccks2021/test_a.json")
"--test_path", type=str, default="dataset/dataset/test_a.json")
parser.add_argument(
"--trainval_tsn_feature_dir",
type=str,
default="CCKS_dataset/tsn_features_train")
default="dataset/tsn_features_train")
parser.add_argument(
"--test_tsn_feature_dir",
type=str,
default="CCKS_dataset/tsn_features_test_a")
default="dataset/tsn_features_test_a")


def create_splits_indice(n_samples, SPLITS):
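The body of `create_splits_indice` is collapsed in this diff. Purely as an illustration of what such a helper typically does (the actual implementation may differ), assuming `SPLITS` is a list of `(name, fraction)` pairs:

```python
import random

def create_splits_indice(n_samples, SPLITS):
    # Illustrative sketch only: shuffle the sample indices, then cut them
    # into consecutive chunks sized by the given fractions,
    # e.g. SPLITS = [("train", 0.9), ("val", 0.1)].
    indices = list(range(n_samples))
    random.shuffle(indices)
    splits, start = {}, 0
    for name, fraction in SPLITS:
        end = start + int(round(fraction * n_samples))
        splits[name] = indices[start:end]
        start = end
    return splits
```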
1 change: 0 additions & 1 deletion KG/DuKEVU_Baseline/paddle-video-semantic-tag/CCKS_dataset

This file was deleted.

7 changes: 3 additions & 4 deletions KG/DuKEVU_Baseline/paddle-video-semantic-tag/README.md
@@ -7,7 +7,7 @@
Note: during data processing, we removed semantic tags that do not appear in the title.

```bash
-python prepare_ccks_semantic_tag.py
+python prepare_semantic_tag.py
```

This produces the following output files:
@@ -22,7 +22,6 @@
```
paddle-video-semantic-tag
```

## Training and validation

This model uses the `bert-wwm-ext-chinese` model from the PaddleNLP model library; more models are listed in the [PaddleNLP Transformer API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/transformers.md).

@@ -35,7 +34,7 @@
```bash
python train_semantic_tag.py \
--num_train_epochs 3 \
--logging_steps 1 \
--save_steps 500 \
- --output_dir ./data/checkpoints/ccks_semantic_tag/ \
+ --output_dir ./data/checkpoints/semantic_tag/ \
--device gpu
```

@@ -60,7 +59,7 @@
```bash
python predict_semantic_tag.py \
--max_seq_length 128 \
--batch_size 32 \
--device gpu \
- --init_checkpoint_path data/checkpoints/ccks_semantic_tag/model_2500.pdparams
+ --init_checkpoint_path data/checkpoints/semantic_tag/model_2500.pdparams
```

The generated named entity recognition results are stored in `./predict_results/ents_results.json`.
1 change: 1 addition & 0 deletions KG/DuKEVU_Baseline/paddle-video-semantic-tag/dataset
@@ -21,9 +21,9 @@

parser = argparse.ArgumentParser()
parser.add_argument(
"--trainval_path", type=str, default="CCKS_dataset/ccks2021/train.json")
"--trainval_path", type=str, default="dataset/dataset/train.json")
parser.add_argument(
"--test_path", type=str, default="CCKS_dataset/ccks2021/test_a.json")
"--test_path", type=str, default="dataset/dataset/test_a.json")

TAG_NAMES = ["B-ENT", "I-ENT", "O"]
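`TAG_NAMES` defines a single-type BIO tagging scheme. As a small self-contained sketch (not the baseline's own code), this is how such a tag sequence decodes back into entity spans:

```python
def bio_decode(tokens, tags):
    """Collect (start, end, text) spans for maximal B-ENT/I-ENT runs."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B-ENT":
            if start is not None:  # close the previous entity
                spans.append((start, i, "".join(tokens[start:i])))
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i, "".join(tokens[start:i])))
            start = None
    if start is not None:  # entity runs to the end of the sequence
        spans.append((start, len(tokens), "".join(tokens[start:])))
    return spans

# Titles are typically tokenized per character for Chinese BERT models.
print(bio_decode(list("我爱北京"), ["O", "O", "B-ENT", "I-ENT"]))
# -> [(2, 4, '北京')]
```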

122 changes: 122 additions & 0 deletions NLP/Text2SQL-DA-HIER/README.md
@@ -0,0 +1,122 @@
Text2SQL DA HIER
===
Code for the EMNLP 2021 paper "[Data Augmentation with Hierarchical SQL-to-Question Generation for Cross-domain Text-to-SQL Parsing](https://aclanthology.org/2021.emnlp-main.707/)". Our framework is shown below:

![framework](framework.png)

---
## Environment

python == 3.6

Install the nltk package:

    pip3 install nltk

For other requirements, refer to [OpenNMT](https://github.com/OpenNMT/OpenNMT-py) and the respective parsers.

<!-- Packages
sh installs.sh -->

## Stages A & B: Generate SQL and sub-SQLs

    python3 -c "import nltk;nltk.download('punkt')"

    cd clause2subquestion

### Download db content
Download `db_content.json` from https://aistudio.baidu.com/aistudio/datasetdetail/130021 and put it under `clause2subquestion/auto_gen/`.

### Generate sub-SQLs
Generate the SQL and sub-SQLs; the generated file `clause_aug.json` can be found under `spider/`:

    cd auto_gen
    sh gen.sh

## Stage C: Question Generation
### Model training

We trained our model with OpenNMT: https://github.com/OpenNMT/OpenNMT-py

We ran our experiments with version v1.1.1.

### Generate data for training and prediction

    # get the sub-SQL / sub-question pairs
    cd clause2subquestion
    python data.py  # training data

    # get the source data for augmentation
    cd spider
    python aug2src.py  # augmentation data


### Prediction and question composition

For details, refer to https://github.com/OpenNMT/OpenNMT-py.
After getting the sub-question for each sub-SQL, we need to compose the full question.

    # compose questions from aug_src.txt, aug_tgt.txt and clause_aug_sample.json,
    # then output the generated question-SQL dataset aug_output.json
    python tgt2question.py

Our generated data for Spider can be found at https://aistudio.baidu.com/aistudio/datasetdetail/123584.
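The exact schema of `aug_output.json` is not documented here; a hedged peek that assumes only that the file is valid JSON:

```python
import json

with open("aug_output.json", encoding="utf-8") as f:
    data = json.load(f)

print(type(data).__name__, len(data))
# Inspect one record to learn the actual field names before relying on them.
sample = data[0] if isinstance(data, list) else data
print(json.dumps(sample, ensure_ascii=False, indent=2)[:500])
```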


## Stage D: Parser training

We ran our Spider experiments with two parsers: [IRNet](https://github.com/microsoft/IRNet) and [RAT-SQL](https://github.com/Microsoft/rat-sql).


<!--
### RATSQL
#### Download the BERT model
    cd parsers/RAT-SQL
    sh bert_download.sh
#### Data preprocessing
    mkdir data && cd data
#### You can download part of our preprocessed data from https://aistudio.baidu.com/aistudio/datasetdetail/120901/0 and put it in the directory data/spider/.
#### Please download and unzip the Spider dataset from https://yale-lily.github.io/spider.
```
data
└── spider
    ├── nl2code,output_from=true,fs=2,emb=bert,cvlink
    ├── database
    │   └── ...
    ├── dev.json
    ├── dev_gold.sql
    ├── tables.json
    ├── train_gold.sql
    ├── train_others.json
    └── train_spider.json
```
#### For details of our baseline model, refer to https://github.com/Microsoft/rat-sql. Download the code directory ratsql/, put it in parsers/RAT-SQL, and add the key code described in parsers/RAT-SQL/readme.md.
#### Train with augmented data
    export CUDA_VISIBLE_DEVICES=0  # device index
    nohup python -u run.py train experiments/experiments/spider-label-smooth-bert-large-run.jsonnet > aug.log 2>&1 &
Alternatively, you can choose a training strategy between aug and naive by changing the key named train_mode.
#### Evaluation
    export CUDA_VISIBLE_DEVICES=0  # device index
    nohup python run.py eval experiments/experiments/spider-label-smooth-bert-large-run.jsonnet > eval_step10-40.log 2>&1 &
We save the model every 1k steps, so it is necessary to evaluate each checkpoint by setting eval_steps in the configuration file experiments/experiments/spider-label-smooth-bert-large-run.jsonnet.
The output file eval_step10-40.log records each checkpoint's performance at the given steps; our evaluation results are in the folder eval_logs. -->
#### Tips

<!-- Because of the instability of RAT, a "double descent" phenomenon can occur.
In that case, change the random seed (att) in parsers/RAT-SQL/experiments/spider-label-smooth-bert-large-run.jsonnet, modify the directory name in logdir, and update the "model_checkpoint" symlink.
Then the pretrained model can be loaded and retrained. -->
Issue details can be found in the issues of the GitHub repo: https://github.com/Microsoft/rat-sql.