Commit

Merge branch 'PaddlePaddle:master' into ACL2022-DuLeMon

ZubinGou committed Mar 29, 2022
2 parents b798736 + 2673b63 commit 6c31859
Showing 70 changed files with 79,605 additions and 28 deletions.
4 changes: 2 additions & 2 deletions KG/DuKEVU_Baseline/LICENSE
@@ -1,4 +1,4 @@
-Copyright (c) 2021 Baidu, Inc. All Rights Reserved
+Copyright (c) 2022 Baidu, Inc. All Rights Reserved

Apache License
Version 2.0, January 2004
@@ -188,7 +188,7 @@ Copyright (c) 2021 Baidu, Inc. All Rights Reserved
same "printed page" as the copyright notice for easier
identification within third-party archives.

-Copyright (c) 2021 Baidu, Inc. All Rights Reserved.
+Copyright (c) 2022 Baidu, Inc. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
14 changes: 7 additions & 7 deletions KG/DuKEVU_Baseline/README.md
@@ -1,19 +1,19 @@
-This project provides the baseline models for the [CCKS 2021 "Knowledge-Enhanced Video Semantic Understanding" shared task](https://www.biendata.xyz/competition/ccks_2021_videounderstanding). It has two parts: 1) the video category tagging model [paddle-video-classify-tag](./paddle-video-classify-tag); 2) the video semantic tagging model [paddle-video-semantic-tag](./paddle-video-semantic-tag). The category tagging model classifies a video over a closed two-level label taxonomy based on its content, yielding category tags that describe the video; the semantic tagging model extracts entity semantic tags from the video's textual information (participants may extend it, e.g., by reasoning over the provided knowledge base, fusing multimodal information to improve tagging, or generating tags). The tags produced by the two models correspond, respectively, to the category tags and semantic tags provided in the evaluation dataset.
+This project provides the baseline models for the "Knowledge-Enhanced Video Semantic Understanding" shared task. It has two parts: 1) the video category tagging model [paddle-video-classify-tag](./paddle-video-classify-tag); 2) the video semantic tagging model [paddle-video-semantic-tag](./paddle-video-semantic-tag). The category tagging model classifies a video over a closed two-level label taxonomy based on its content, yielding category tags that describe the video; the semantic tagging model extracts entity semantic tags from the video's textual information (participants may extend it, e.g., by reasoning over the provided knowledge base, fusing multimodal information to improve tagging, or generating tags). The tags produced by the two models correspond, respectively, to the category tags and semantic tags provided in the evaluation dataset.

## Dataset preprocessing

-First download the training and test data, then organize the directory structure as follows (note that `CCKS_dataset` in the root directory and in each subdirectory refer to the same folder):
+First download the training and test data, then organize the directory structure as follows (note that `dataset` in the root directory and in each subdirectory refer to the same folder):

```
DuKEVU_baseline
-|-- CCKS_dataset
-    |-- ccks2021
+|-- dataset
+    |-- dataset
         |-- train.json
         |-- test_a.json
|-- paddle-video-classify-tag
-    |-- CCKS_dataset -> ../CCKS_dataset
+    |-- dataset -> ../dataset
|-- paddle-video-semantic-tag
-    |-- CCKS_dataset -> ../CCKS_dataset
+    |-- dataset -> ../dataset
```
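As a convenience, here is a minimal Python sketch (a hypothetical helper, not part of the baseline) that creates the two `dataset -> ../dataset` symlinks shown above, assuming the unpacked data already sits in the repository root:

```python
# Hypothetical helper, not part of the baseline; run from DuKEVU_baseline/.
# Assumes the unpacked data already lives at ./dataset/dataset/.
import os

for subdir in ["paddle-video-classify-tag", "paddle-video-semantic-tag"]:
    link = os.path.join(subdir, "dataset")
    if not os.path.islink(link):
        # Relative link, matching the "dataset -> ../dataset" entries above.
        os.symlink(os.path.join("..", "dataset"), link)
```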

## Environment setup

@@ -27,7 +27,7 @@
```bash
conda create -n paddle2.0 python=3.8
conda activate paddle2.0
conda install paddlepaddle-gpu==2.0.2 cudatoolkit=10.0 -c paddle
pip install opencv-python -i https://mirror.baidu.com/pypi/simple
-pip install --upgrade paddlenlp -i https://mirror.baidu.com/pypi/simple
+pip install paddlenlp==2.0.1 -i https://mirror.baidu.com/pypi/simple
pip install tqdm wget
```
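To verify the install before running the baselines, PaddlePaddle ships a built-in self-check (a quick sanity test, not part of the original instructions):

```python
import paddle

print(paddle.__version__)  # expect 2.0.2 with the environment above
paddle.utils.run_check()   # runs a small program and reports whether paddle works
```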

File renamed without changes.
4 changes: 2 additions & 2 deletions KG/DuKEVU_Baseline/generate_submission.py
@@ -1,4 +1,4 @@
-# Copyright (c) 2021 Baidu.com, Inc. All Rights Reserved.
+# Copyright (c) 2022 Baidu.com, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -20,7 +20,7 @@

parser = argparse.ArgumentParser()
parser.add_argument(
"--test_path", type=str, default="CCKS_dataset/ccks2021/test_a.json")
"--test_path", type=str, default="dataset/dataset/test_a.json")
parser.add_argument(
"--category_level1_result",
type=str,
1 change: 0 additions & 1 deletion KG/DuKEVU_Baseline/paddle-video-classify-tag/CCKS_dataset

This file was deleted.

8 changes: 4 additions & 4 deletions KG/DuKEVU_Baseline/paddle-video-classify-tag/README.md
@@ -9,8 +9,8 @@

```
DuKEVU_baseline
-|-- CCKS_dataset
-    |-- ccks2021
+|-- dataset
+    |-- dataset
         |-- train.json
         |-- test_a.json
         |-- tsn_features_train
```
@@ -25,13 +25,13 @@ DuKEVU_baseline

```bash
export CUDA_VISIBLE_DEVICES=0
-python tsn_extractor.py --model_name=TSN --config=./configs/tsn-single.yaml --weights=./weights/tsn.pdparams --filelist=./data/TsnExtractor.list --save_dir=./CCKS_dataset/tsn_features
+python tsn_extractor.py --model_name=TSN --config=./configs/tsn-single.yaml --weights=./weights/tsn.pdparams --filelist=./data/TsnExtractor.list --save_dir=./dataset/tsn_features
```
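Each video then gets one feature file under `--save_dir`. A hedged sketch for inspecting an extracted feature, assuming (an assumption, not stated in this diff) that each feature is saved as a NumPy `.npy` array of shape `[num_segments, feature_dim]`:

```python
import os
import numpy as np

feature_dir = "dataset/tsn_features"       # --save_dir from the command above
name = sorted(os.listdir(feature_dir))[0]  # pick the first extracted video
feat = np.load(os.path.join(feature_dir, name))
print(name, feat.shape, feat.dtype)
```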

Prepare the label set for the video semantic understanding dataset, along with the train/validation/test sample lists, as follows.

```bash
-python prepare_ccks_videotag.py
+python prepare_videotag.py
```

Since the dataset has two levels of labels, we run classification experiments separately under the level-1 and level-2 label settings.
1 change: 1 addition & 0 deletions KG/DuKEVU_Baseline/paddle-video-classify-tag/dataset
@@ -1,4 +1,4 @@
-# Copyright (c) 2021 Baidu.com, Inc. All Rights Reserved.
+# Copyright (c) 2022 Baidu.com, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -23,17 +23,17 @@

parser = argparse.ArgumentParser()
parser.add_argument(
"--trainval_path", type=str, default="CCKS_dataset/ccks2021/train.json")
"--trainval_path", type=str, default="dataset/dataset/train.json")
parser.add_argument(
"--test_path", type=str, default="CCKS_dataset/ccks2021/test_a.json")
"--test_path", type=str, default="dataset/dataset/test_a.json")
parser.add_argument(
"--trainval_tsn_feature_dir",
type=str,
default="CCKS_dataset/tsn_features_train")
default="dataset/tsn_features_train")
parser.add_argument(
"--test_tsn_feature_dir",
type=str,
default="CCKS_dataset/tsn_features_test_a")
default="dataset/tsn_features_test_a")


def create_splits_indice(n_samples, SPLITS):
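The body of `create_splits_indice` is collapsed in this diff. Purely as an illustration of what such a helper typically does (the actual implementation may differ), assuming `SPLITS` is a list of `(name, fraction)` pairs:

```python
import random

def create_splits_indice(n_samples, SPLITS):
    # Illustrative sketch only: shuffle the sample indices, then cut them
    # into consecutive chunks sized by the given fractions,
    # e.g. SPLITS = [("train", 0.9), ("val", 0.1)].
    indices = list(range(n_samples))
    random.shuffle(indices)
    splits, start = {}, 0
    for name, fraction in SPLITS:
        end = start + int(round(fraction * n_samples))
        splits[name] = indices[start:end]
        start = end
    return splits
```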
1 change: 0 additions & 1 deletion KG/DuKEVU_Baseline/paddle-video-semantic-tag/CCKS_dataset

This file was deleted.

7 changes: 3 additions & 4 deletions KG/DuKEVU_Baseline/paddle-video-semantic-tag/README.md
@@ -7,7 +7,7 @@
Note: during data processing, we removed semantic tags that do not appear in the title.

```bash
-python prepare_ccks_semantic_tag.py
+python prepare_semantic_tag.py
```

This produces the following output files:
@@ -22,7 +22,6 @@
```
paddle-video-semantic-tag
```

## Training and validation

This model uses the `bert-wwm-ext-chinese` model from the PaddleNLP model library; more models are listed in the [PaddleNLP Transformer API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/transformers.md).

@@ -35,7 +34,7 @@
```bash
python train_semantic_tag.py \
--num_train_epochs 3 \
--logging_steps 1 \
--save_steps 500 \
- --output_dir ./data/checkpoints/ccks_semantic_tag/ \
+ --output_dir ./data/checkpoints/semantic_tag/ \
--device gpu
```

@@ -60,7 +59,7 @@
```bash
python predict_semantic_tag.py \
--max_seq_length 128 \
--batch_size 32 \
--device gpu \
- --init_checkpoint_path data/checkpoints/ccks_semantic_tag/model_2500.pdparams
+ --init_checkpoint_path data/checkpoints/semantic_tag/model_2500.pdparams
```

The generated named entity recognition results are stored in `./predict_results/ents_results.json`.
1 change: 1 addition & 0 deletions KG/DuKEVU_Baseline/paddle-video-semantic-tag/dataset
@@ -21,9 +21,9 @@

parser = argparse.ArgumentParser()
parser.add_argument(
"--trainval_path", type=str, default="CCKS_dataset/ccks2021/train.json")
"--trainval_path", type=str, default="dataset/dataset/train.json")
parser.add_argument(
"--test_path", type=str, default="CCKS_dataset/ccks2021/test_a.json")
"--test_path", type=str, default="dataset/dataset/test_a.json")

TAG_NAMES = ["B-ENT", "I-ENT", "O"]
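`TAG_NAMES` defines a single-type BIO tagging scheme. As a small self-contained sketch (not the baseline's own code), this is how such a tag sequence decodes back into entity spans:

```python
def bio_decode(tokens, tags):
    """Collect (start, end, text) spans for maximal B-ENT/I-ENT runs."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B-ENT":
            if start is not None:  # close the previous entity
                spans.append((start, i, "".join(tokens[start:i])))
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i, "".join(tokens[start:i])))
            start = None
    if start is not None:  # entity runs to the end of the sequence
        spans.append((start, len(tokens), "".join(tokens[start:])))
    return spans

# Titles are typically tokenized per character for Chinese BERT models.
print(bio_decode(list("我爱北京"), ["O", "O", "B-ENT", "I-ENT"]))
# -> [(2, 4, '北京')]
```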

122 changes: 122 additions & 0 deletions NLP/Text2SQL-DA-HIER/README.md
@@ -0,0 +1,122 @@
Text2SQL DA HIER
===
Code for the EMNLP 2021 paper "[Data Augmentation with Hierarchical SQL-to-Question Generation for Cross-domain Text-to-SQL Parsing](https://aclanthology.org/2021.emnlp-main.707/)". Our framework is shown below:

![framework](framework.png)

---
## Environment

python == 3.6

Install the nltk package:

    pip3 install nltk

For other requirements, refer to [OpenNMT](https://github.com/OpenNMT/OpenNMT-py) and the respective parsers.

<!-- Packages
sh installs.sh -->

## Stages A & B: Generate SQL and sub-SQLs

    python3 -c "import nltk;nltk.download('punkt')"

    cd clause2subquestion

### Download db content
Download `db_content.json` from https://aistudio.baidu.com/aistudio/datasetdetail/130021 and put it under `clause2subquestion/auto_gen/`.

### Generate sub-SQLs
Generate the SQL and sub-SQLs; the generated file `clause_aug.json` can be found under `spider/`:

    cd auto_gen
    sh gen.sh

## Stage C: Question Generation
### Model training

We trained our model with OpenNMT: https://github.com/OpenNMT/OpenNMT-py

We ran our experiments with version v1.1.1.

### Generate data for training and prediction

    # get the sub-SQL / sub-question pairs
    cd clause2subquestion
    python data.py  # training data

    # get the source data for augmentation
    cd spider
    python aug2src.py  # augmentation data


### Prediction and question composition

For details, refer to https://github.com/OpenNMT/OpenNMT-py.
After getting the sub-question for each sub-SQL, we need to compose the full question.

    # compose questions from aug_src.txt, aug_tgt.txt and clause_aug_sample.json,
    # then output the generated question-SQL dataset aug_output.json
    python tgt2question.py

Our generated data for Spider can be found at https://aistudio.baidu.com/aistudio/datasetdetail/123584.
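The exact schema of `aug_output.json` is not documented here; a hedged peek that assumes only that the file is valid JSON:

```python
import json

with open("aug_output.json", encoding="utf-8") as f:
    data = json.load(f)

print(type(data).__name__, len(data))
# Inspect one record to learn the actual field names before relying on them.
sample = data[0] if isinstance(data, list) else data
print(json.dumps(sample, ensure_ascii=False, indent=2)[:500])
```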


## Stage D: Parser training

We ran our Spider experiments with two parsers: [IRNet](https://github.com/microsoft/IRNet) and [RAT-SQL](https://github.com/Microsoft/rat-sql).


<!--
### RATSQL
#### Download the BERT model
    cd parsers/RAT-SQL
    sh bert_download.sh
#### Data preprocessing
    mkdir data && cd data
#### You can download part of our preprocessed data from https://aistudio.baidu.com/aistudio/datasetdetail/120901/0 and put it in the directory data/spider/.
#### Please download and unzip the Spider dataset from https://yale-lily.github.io/spider.
```
data
└── spider
    ├── nl2code,output_from=true,fs=2,emb=bert,cvlink
    ├── database
    │   └── ...
    ├── dev.json
    ├── dev_gold.sql
    ├── tables.json
    ├── train_gold.sql
    ├── train_others.json
    └── train_spider.json
```
#### For details of our baseline model, refer to https://github.com/Microsoft/rat-sql. Download the code directory ratsql/, put it in parsers/RAT-SQL, and add the key code described in parsers/RAT-SQL/readme.md.
#### Train with augmented data
    export CUDA_VISIBLE_DEVICES=0  # device index
    nohup python -u run.py train experiments/experiments/spider-label-smooth-bert-large-run.jsonnet > aug.log 2>&1 &
Alternatively, you can choose a training strategy between aug and naive by changing the key named train_mode.
#### Evaluation
    export CUDA_VISIBLE_DEVICES=0  # device index
    nohup python run.py eval experiments/experiments/spider-label-smooth-bert-large-run.jsonnet > eval_step10-40.log 2>&1 &
We save the model every 1k steps, so it is necessary to evaluate each checkpoint by setting eval_steps in the configuration file experiments/experiments/spider-label-smooth-bert-large-run.jsonnet.
The output file eval_step10-40.log records each checkpoint's performance at the given steps; our evaluation results are in the folder eval_logs. -->
#### Tips

<!-- Because of the instability of RAT, a "double descent" phenomenon can occur.
In that case, change the random seed (att) in parsers/RAT-SQL/experiments/spider-label-smooth-bert-large-run.jsonnet, modify the directory name in logdir, and update the "model_checkpoint" symlink.
Then the pretrained model can be loaded and retrained. -->
Issue details can be found in the issues of the GitHub repo: https://github.com/Microsoft/rat-sql.